Steve’s intuition about the relative contributions of nature versus nurture is pretty much right on the money. In a survey of virtually every twin study conducted over the last 50 years – covering 17,804 different traits and 14.5 million twin pairs – investigators found an average heritability* of 49%. Every single trait showed at least some degree of heritability and no trait showed complete heritability. In other words, all human traits are influenced by both genes (nature) and the environment (nurture).
https://www.nature.com/articles/ng.3285
For the twin study and heritability aficionados out there, the results are made accessible via a web app: http://match.ctglab.nl/#/home
*Technical note: Heritability is the percent of phenotypic variability for a trait explained by genetic variability, ie a measure of the relative influence of genes (nature) on a given trait in a given population. In the strict sense, heritability doesn’t tell us anything about group differences, as it is a within-population parameter, although it may be considered to serve as a prior on expectations that group differences in a trait are also influenced by genes.
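To make the technical note concrete, here is a minimal sketch of heritability as a ratio of variances. It assumes the simplest possible model, where phenotypic variance is just genetic variance plus environmental variance. The function name and the variance values are hypothetical, picked only so the ratio lands on the 49% average quoted above; none of it comes from the paper.

```python
def heritability(genetic_variance: float, environmental_variance: float) -> float:
    """Fraction of phenotypic variance explained by genetic variance.

    Assumes phenotypic variance is simply the sum of a genetic and an
    environmental component (no interaction or covariance terms).
    """
    phenotypic_variance = genetic_variance + environmental_variance
    if phenotypic_variance <= 0:
        raise ValueError("phenotypic variance must be positive")
    return genetic_variance / phenotypic_variance


# Made-up variance components for one trait in one population.
h2 = heritability(genetic_variance=4.9, environmental_variance=5.1)
print(f"heritability = {h2:.0%}")
```

Note that both inputs describe a single population, which is why the result says nothing by itself about differences between groups.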
As usual, a fascinating take on a complex topic by Steve.
However, the greatest genetic divide in the human race is not between sub-Saharans and everyone else. The greatest genetic divide in the human race is between sub-Saharan Khoe-San groups and everyone else, including other sub-Saharan populations. The Khoe-San lineage diverged from other African groups ~250,000 years ago (although these African lineages have admixed over the last ~1,000 years owing to the Bantu expansion). As such, there is far more human genetic diversity below the Sahara than there is in the rest of the world combined.* West Africans, East Africans and Europeans are more closely related to each other than any of them is to the Khoe-San, having diverged ~150,000 and ~75,000 years ago, respectively (see image).
*What the implications of this are for phenotypic divergence is a matter for interesting speculation.
Schlebusch et al. 2017 Science 358: 652-655 https://science.sciencemag.org/content/358/6363/652
I wonder what exactly they found in PCs 21-30 offering discriminating power. I wish they had said something about the composition of that 1.3% discordance (or did they? your point indicates some specifics). In the Thompson thread I mentioned Figure S7 from https://www.biorxiv.org/content/biorxiv/suppl/2017/07/20/166298.DC1/166298-1.pdf
Comparing the assignment using 30 PCs versus 20 PCs revealed a discordance rate of 1.3%. This level of consistency is not surprising, because as higher PCs tend to describe finer-level populations structure, they are less informative for the four major HARE groups.
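For anyone who wants to see what a discordance rate like that amounts to, here is a minimal sketch: the fraction of individuals whose group call changes between two runs. The function name, the group labels and the eight toy calls are all hypothetical; they are not the paper's data or code.

```python
def discordance_rate(labels_a, labels_b):
    """Fraction of individuals given different groups by two assignments."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same individuals")
    if not labels_a:
        raise ValueError("need at least one individual")
    mismatches = sum(a != b for a, b in zip(labels_a, labels_b))
    return mismatches / len(labels_a)


# Hypothetical group calls for the same eight people from two runs.
calls_20_pcs = ["EUR", "EUR", "AFR", "HIS", "ASN", "HIS", "AFR", "EUR"]
calls_30_pcs = ["EUR", "EUR", "AFR", "HIS", "ASN", "EUR", "AFR", "EUR"]

rate = discordance_rate(calls_20_pcs, calls_30_pcs)
print(f"discordance = {rate:.1%}")
```

A rate on its own does not say which groups the discordant individuals sit in, which is the breakdown I was hoping the authors would report.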
I tried to respond to your comment yesterday, but my response went to the general thread. You can see my response as comment #94
To add an embedded image just include the link. I'm not sure what the exact rules for surrounding text are, but I usually include the link on a line alone with a blank line above (and below if there is text following).
[I can’t figure out how to add images to comments like you do]
With respect to the question of whether race is a social construct … as a famous politician once said “it depends on what the meaning of is is”. Lol (sorry). The issue is a complicated one (obviously). Race can be considered as a social construct given that there is a socially agreed-upon set of criteria that we use to assign individuals to racial groups (or that individuals use to self-identify). People define these criteria, so they are socially constructed sensu stricto. These criteria can change from one group to another, as you allude to, and they can change among different countries. So the criteria are not set in stone, and they include cultural and other non-genetic attributes. But historically, these criteria have focused primarily on visible traits that are markers of ancestry, and for this reason they co-vary with genetic diversity. In this sense, the assertion that race is a social construct cannot be taken to mean that it is unrelated to genetics, as is too often the case. It is clearly highly correlated with genetics. This has important implications for biomedical research and for clinical practice.
Ethnicity is a different matter, depending on your definition, and I tend to think of it in terms of the OMB definition as a marker of shared culture (Hispanic origin specifically). The history of the adoption of the use of the term ‘Hispanic’ in this regard is enlightening. There is a book called “Making Hispanics” by G. Cristina Mora that explains how the creation of this pan-ethnic label was an intentional decision by bureaucrats and activists in the 70s to create a new unified group to consolidate power and influence. It was quite controversial at the time, and many distinct Hispanic populations pushed back against this, but the label was ultimately adopted, and one could argue, changed the course of US history. You can see a similar dynamic playing out these days, where groups that are currently classified as white want to create their own officially recognized groups. The previous administration set the wheels in motion on this (see https://obamawhitehouse.archives.gov/omb/fedreg_race-ethnicity), but it has not been adopted by the current administration.
It is indeed quite interesting that they get additional resolution from those last 10 PCs, given how fast the fraction of variation explained typically falls off with each new PC. I read a recent paper where they classified structure of white British in the UK Biobank using the first 100 PCs – seems like overkill but who knows when you are getting down to very fine levels.
FYI – you can find free copies of many scientific articles on sci-hub (https://whereisscihub.now.sh/). The Fang et al. paper is available there (but no Supplement). I had a look at the Supplement and there is no figure like the one you found for the UK Biobank. They do show a Table with the correspondence between self-identified race/ethnicity and ancestry for a subset of the data, which doesn’t add much other than showing the dispersal of self-identified Native Americans among ancestry groups. [I can’t figure out how to add images to comments like you do]. The rest of the Supplement mainly concerns exploring parameter space for various models and GWAS settings (important but not super illuminating).
With respect to Obama, I agree that his African ancestry is certainly not typical of African Americans’ given that his father was from Kenya. I had a look on Google and found that his father was from the Luo tribe, which is part of the Nilo-Saharan language family. In this comprehensive paper on African genetic diversity, you can see that the Nilo-Saharan populations are clearly distinct from the West African populations that were linked to the trans-Atlantic slave trade https://elifesciences.org/articles/15266. Good point! Nevertheless, the African ancestry component is so distinct compared to the European and Native American components, that it probably wouldn’t matter for the purposes of classifying African descendants as distinct from other groups in the US using machine learning (PC1 would drive the vast majority of the signal there).
2. They include a multi-ethnic height GWAS.
While earlier studies focused on populations of predominantly European descent, recent efforts have aimed to substantially expand racial and ethnic diversity. The Million Veterans Program (MVP) represents a multi-ethnic cohort, which has enrolled more than 750,000 veteran volunteers, completed genotyping in more than 350,000 participants to date, and includes a wealth of phenotypes and health outcomes.
This is the first large sample multi-ethnic GWAS I have seen (does anyone know of others?). Figure 5 has a Venn diagram of Genome-wide Significant Height Loci in Each HARE Group. It is notable how few loci appear only among non-white groups. Though that is probably partly an artifact of smaller sample sizes for those groups.
Using HARE as the stratifying variable, we investigate the effectiveness of detecting ethnicity-specific trait loci through simulation as well as analysis of height as a model trait in the MVP.
There are scores of multi-ethnic GWAS available by now, albeit not at such large scale. The NIH’s dbGaP database has an application that lets you visualize the relationship between the self-identified race/ethnicity of cohort members and their genetic ancestry – it is called GRAF-pop https://www.g3journal.org/content/9/8/2447.long
A search of the NHGRI-EBI GWAS Catalog https://www.ebi.ac.uk/gwas/home using the terms trans-ethnic or multi-ethnic turns up a number of multi-ethnic GWAS, and the PAGE study is a great example of a recent study of this kind https://www.ncbi.nlm.nih.gov/pubmed/31217584
It looks like much of the mapping failure is concentrated in the Hispanic group (see panel D).
(D) Individual ancestry of TCGA patients inferred by STRUCTURE. Each color represents one of the ancestry reference groups. Each patient is represented by a column partitioned into different colors corresponding to the genetic ancestry composition. Patients are ordered following a hierarchical clustering by Ward's methods on distance matrix calculated as cosine dissimilarity of genetic composition. SIRE and genetic ancestry categorization as estimated by EIGENSTRAT for each patient are shown in the same order at the bottom.
(E) Three-dimensional visualization of reference populations with three patients (TCGA-06-0167, TCGA-PE-A5DD, and TCGA-VS-A9V2) used as examples for genetic ancestry (AA, EAA, and NA, respectively).
A few comments regarding your response to my initial comment (and one other interesting detail):
1. In the 2018 Yuan et al. paper where they analyzed cancer patients, they only measured the correspondence between self-identified race and genetic ancestry. They left ethnicity, ie the Hispanic group, out of their analysis. This makes perfect sense when you consider that Hispanic is a pan-ethnic label that encompasses groups with very different ancestries, eg Puerto Ricans versus Mexicans. Consistent with this, and as you very correctly point out, you can see that the Hispanic group is more broadly distributed along the STRUCTURE plot in Figure 1D.
2. But the 2019 Fang et al. paper actually included a Hispanic group in their analysis, and they were still able to very accurately map them to an ancestry cluster. How can this be? Well, it turns out that the power and resolution of these new analyses rest on the marriage of principal components analysis (PCA) with machine learning clustering. PCA is used to characterize genetic ancestry at a much finer level of resolution than what you can see in the STRUCTURE plot, where there are only four ancestry components shown. In other words, the ancestry of each person is modeled as a combination of four ancestry fractions. Fang et al. used the first 30 principal components with their machine learning algorithm, so each person is modeled as a combination of 30 ancestry components. This is far more detail than the eye can make out (eg you can only visualize 3 principal components in Figure 1A of the Yuan et al. paper), but the machine learning algorithms can of course handle it. When you analyze 30 ancestry components, you can make out more subtle differences such as the difference between British versus Spanish European ancestry and/or the difference between Native American ancestry from Mesoamerica (seen in Mexicans) versus the Amazon (seen for Puerto Ricans). This level of resolution allows for distinction of Hispanics from self-identified whites even when they have very similar continental ancestry fractions (European and Native American).
3. One final caveat regarding Steve’s point about Obama is that the ancestry clusters identified by these methods are, for the most part, not coherent or ‘pure’ (for lack of a better term) ancestry groups. The specific ancestry groups can actually be quite spread among the ancestry components, as would be the case for African Americans (including Obama) and Hispanics, but they are nevertheless captured as clearly distinct by machine learning on multiple principal components.
4. Finally, there is an interesting nugget buried in the details of both the Yuan et al. and Fang et al. papers. There is in fact one particular US population group in these two cohorts for which there is little or no correspondence between self-identified race/ethnicity and genetic ancestry – Native Americans. Yuan et al. show that self-identified Native Americans have an average of 22% Native American ancestry, substantially less than what is typically seen for Hispanics. Fang et al. were not able to accurately map Native Americans to a distinct ancestry group. For these cohorts, most individuals that identify as Native American have far fewer Native American than European ancestors.
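The point in item 2, that extra principal components can separate groups that look alike on the first few, can be illustrated with a toy sketch. To be clear about the assumptions: this is not the algorithm Fang et al. used. It swaps in a plain nearest-centroid classifier as a stand-in for "machine learning on PCs", and the PC coordinates, the group labels "A" and "B", and the function names are all invented for illustration.

```python
import math
from collections import defaultdict


def centroids(points, labels):
    """Mean PC coordinates for each labelled group."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: [sum(axis) / len(members) for axis in zip(*members)]
        for label, members in grouped.items()
    }


def classify(point, group_centroids, n_pcs):
    """Assign a point to the nearest centroid using only the first n_pcs axes."""
    return min(
        group_centroids,
        key=lambda label: math.dist(point[:n_pcs], group_centroids[label][:n_pcs]),
    )


# Invented coordinates: groups A and B overlap on PC1 and PC2
# and only separate on PC3.
training_points = [
    [0.50, 0.10, 0.30],
    [0.54, 0.12, 0.28],
    [0.48, 0.10, -0.30],
    [0.52, 0.12, -0.32],
]
training_labels = ["A", "A", "B", "B"]
group_centroids = centroids(training_points, training_labels)

new_person = [0.52, 0.11, -0.25]
call_with_2_pcs = classify(new_person, group_centroids, n_pcs=2)
call_with_3_pcs = classify(new_person, group_centroids, n_pcs=3)
print(call_with_2_pcs, call_with_3_pcs)
```

With only two axes the new person sits on top of group A's centroid, while the third axis pulls them clearly toward group B. The same logic, scaled up to 30 axes and a proper classifier, is the idea behind resolving groups with similar continental ancestry fractions.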
Race and ethnicity are overwhelmingly correlated with genetic ancestry in the United States. Recent, large-scale studies of ~11,000 cancer patients and ~202,000 military veterans found that individuals’ self-identified race and ethnicity showed 95.6% (cancer) and 99.5% (veterans) correspondence to genetic ancestry clusters.
Yuan et al. (2018) Cancer Cell. 34: 549–560 https://www.sciencedirect.com/science/article/pii/S1535610818303799
Fang et al. (2019) Am J Hum Genet. 105:763-772
https://www.cell.com/ajhg/fulltext/S0002-9297(19)30338-6
I typically agree with your sentiment regarding the clichéd and dogmatic use of the term ‘diversity’, and I also found the article from Birney et al. far less than convincing overall. However, the authors do make a valid and important point regarding the extent of genetic diversity in Africa, and it actually holds up quite well whether you are considering Africans on the African continent or African descendants in the Americas. So I want to take a stab at answering your question as to why scientists are excited about African genetic diversity. In short, when it comes to genetic diversity, African and African descendant genomes are a phenomenally rich, and largely untapped, source for discovery of disease genes and/or druggable targets. In this case, the overuse of the term ‘diversity’ unfortunately obscures a really interesting scientific point. Because modern humans (Homo sapiens sapiens) emerged in Africa between 250-300,000 years ago and only left within the last 50-100,000 years, we evolved for close to 2/3 of our existence as a distinct (sub) species within Africa. There has simply been much more time to accumulate genetic divergence within Africa than out of Africa; also, migration out of Africa and around the globe entailed numerous serial bottlenecks that sequentially reduced variation in non-African populations (this is most pronounced for Indigenous populations in the Americas). So a natural genetic classification of humans would indeed split most deeply within Africa as the authors indicate. I think you could make a good argument about the direct relevance of this excess of African genetic diversity for overall human phenotypic diversity, particularly given the bottleneck events that I just mentioned, physical isolation of non-African populations, and local adaptation, all of which served to accelerate phenotypic divergence. 
And of course, owing to the historical dynamics of the trans-Atlantic slave trade, you are absolutely correct that the most deeply diverging human lineages in Africa, the Bushmen and Pygmies, are barely represented in the genomes of African Americans. We do see a very small ancestry contribution from these populations, but it is probably a result of admixture with migrating, or marauding, Bantus (the Yamnaya of Africa!) as they expanded within Africa prior to the slave trade. So yes, African descendant populations in the Americas are mainly composed of ancestry from coastal West Africa (Senegambia and the Gold Coast Nigeria/Ghana region) and the more Southwest Bantu region, with relative proportions varying from country to country based on the colonial powers that dominated the slave trade in each (e.g. the African descendants in the US are far more Nigerian and less Bantu whereas Brazil is substantially more Bantu). But even so, the African descendant genomes in the Americas show more diversity, with more rare variants, than non-African genomes. The extent of genetic diversity in Africa and among African descendants in the Americas has two direct implications for human diversity, particularly as it relates to disease-gene mapping and the search for druggable targets: (1) there is more overall diversity within and among African populations, and (2) African genomes have very distinct patterns of linkage disequilibrium (LD). The latter point is a technical detail but very important for characterizing the genetic architecture of traits with genome-wide association studies. The long amount of time that humans spent in Africa yielded higher diversity but also allowed for more recombination between chromosomes and thus shorter LD blocks (shorter genomic intervals containing variants that have been inherited together). 
This means that genome-wide association studies that include African genomes are far more likely to uncover causal variants than studies conducted in non-African populations with longer LD blocks, ie larger genomic intervals that contain the causal variant(s).
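For readers who have not worked with LD, here is a minimal sketch of how it can be quantified between two variants. It uses r², a standard LD measure that I did not name above, computed from phased haplotypes. The function name and the two toy haplotype sets are hypothetical and are not real population data.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes.

    Each haplotype is a pair (a, b) of 0/1 alleles at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # deviation from independent assortment
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Toy case 1: the two alleles always travel together (one long LD block).
tightly_linked = [(1, 1)] * 4 + [(0, 0)] * 4

# Toy case 2: recombination has shuffled the alleles into all four combinations.
well_shuffled = [(1, 1)] * 2 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 2

r2_linked = ld_r_squared(tightly_linked)
r2_shuffled = ld_r_squared(well_shuffled)
print(r2_linked, r2_shuffled)
```

In the first case a GWAS hit at one variant cannot be told apart from the other, so the causal variant stays hidden inside the block. In the second case the two variants give independent signals, which is the situation that shorter LD blocks make more common.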
Since you like examples, here is just one cool example regarding the connection between African genetic diversity and genetic drug discovery: https://www.nytimes.com/2013/07/10/health/rare-mutation-prompts-race-for-cholesterol-drug.html
P.S. We now refer to Pygmies as ‘Rainforest Hunter-Gatherers’ … it’s more polite 😉