The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog
1000 Genomes

I’m currently trying to figure out how best to integrate 1000 Genomes data, along with Estonian Biocentre, and HGDP. First, I converted the VCF files from the 1000 Genomes into pedigree format. I’ll put that up on GitHub in the next few days. Then I filtered the results for SNPs which are found in the HGDP. Finally, I intersected the results with the Estonian Biocentre data sets. I was left with ~250,000 markers after quality control (e.g., removing markers which are missing in more than 0.1% of the 5000+ samples).
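The marker-intersection and missingness steps described above can be sketched roughly as follows. This is a toy Python illustration, not the actual pipeline (which is not shown in the post). The function names, the dict-of-lists layout, the SNP IDs, and the use of `None` for a missing call are all assumptions made for the example.

```python
def shared_markers(*marker_collections):
    """Return the SNP IDs present in every data set (hypothetical helper)."""
    common = set(marker_collections[0])
    for markers in marker_collections[1:]:
        common &= set(markers)
    return common


def missing_rate(calls):
    """Fraction of samples with no genotype call at one marker."""
    return sum(1 for call in calls if call is None) / len(calls)


def qc_filter(genotypes, max_missing=0.001):
    """Keep markers whose missing rate is at most max_missing (0.1% here)."""
    return {
        snp: calls
        for snp, calls in genotypes.items()
        if missing_rate(calls) <= max_missing
    }


# Toy data: genotype calls (0/1/2 copies of an allele) for four samples.
thousand_genomes = {
    "rs1": [0, 1, 2, 1],
    "rs2": [0, 0, 1, 1],
    "rs3": [2, None, 2, 2],  # one missing call out of four
}
hgdp_ids = {"rs2", "rs3", "rs4"}          # made-up marker lists
estonian_ids = {"rs1", "rs2", "rs3"}

common = shared_markers(thousand_genomes, hgdp_ids, estonian_ids)
merged = {snp: thousand_genomes[snp] for snp in common}
passed = qc_filter(merged)
print(sorted(common), sorted(passed))
```

In this toy case rs1 drops out because it is absent from one marker list, and rs3 drops out at the missingness step.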

In particular I’m curious about the population structure of the South Asian data. When you sample Chinese or Japanese or the English you need to be geographically diverse, but you don’t have to worry about social stratification too much.* Not so with South Asia. You have to be careful who and where you sample, because the variation doesn’t cleanly follow geography. In the Diaspora for example wealthier and higher status groups tend to be represented. In the mid-2000s Noah Rosenberg’s lab published Low Levels of Genetic Divergence across Geographically and Linguistically Diverse Populations from India. In the paper itself the authors cautioned their samples were from the United States, so one should be careful about accepting the idea that they might represent the geographic variation in South Asia well. In hindsight it seems likely that their selection bias was too great for them to overcome to make robust conclusions, even with over 700 microsatellites.

Above is a PCA plot I generated for South Asians. I’m not quite sure of the coding of some of the Estonian Biocentre populations, so don’t take that as gospel. I was more curious about the distribution of the 1000 Genomes samples, since they are likely to be widely used in the near future.
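For readers who haven't run one, here is a minimal sketch of what the first axis of a PCA plot like this represents. It is pure Python on a made-up genotype matrix, using power iteration to find the leading component. The post does not say which software produced the plot, so the function names and the toy data are assumptions for illustration only.

```python
def center_columns(genotypes):
    """Subtract each marker's mean allele count across samples."""
    n_samples = len(genotypes)
    n_markers = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n_samples for j in range(n_markers)]
    return [[row[j] - means[j] for j in range(n_markers)] for row in genotypes]


def first_pc_scores(genotypes, iterations=200):
    """Unit-scaled coordinates of each sample on the first principal component."""
    x = center_columns(genotypes)
    n = len(x)
    # Sample-by-sample similarity (Gram) matrix of the centered data.
    gram = [[sum(a * b for a, b in zip(x[i], x[k])) for k in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # fixed, arbitrary starting vector
    for _ in range(iterations):
        w = [sum(gram[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0:
            return [0.0] * n
        v = [c / norm for c in w]
    return v


# Rows are samples, columns are markers, values are allele counts (0/1/2).
toy_genotypes = [
    [2, 2, 0, 0, 1],  # samples 1-3: one hypothetical cluster
    [2, 1, 0, 0, 1],
    [2, 2, 0, 1, 1],
    [0, 0, 2, 2, 1],  # samples 4-6: a second hypothetical cluster
    [0, 1, 2, 2, 1],
    [0, 0, 2, 1, 1],
]
scores = first_pc_scores(toy_genotypes)
print([round(s, 3) for s in scores])
```

The first three samples land on one side of zero and the last three on the other, which is the kind of separation that shows up as distinct clusters on a plot. The overall sign of a principal component is arbitrary.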

First, let’s focus on the Bengalis from Bangladesh:


I was frankly surprised at how genetically homogeneous this group is. The two overlapping black dots are my parents. It seems clear that my family comes from a region of Bangladesh which likely has more East Asian ancestry than is the norm. This makes geographic sense, since my family’s roots are in the eastern part of eastern Bengal. Though it is hard to see on this plot, a small group of Bengali individuals, specifically six, reside in a tight cluster amidst samples from Tamil Nadu. The fact that they aren’t randomly scattered indicates to me that there’s some genuine structure here. I suspect that there is evidence here of a group which has been assimilated, but retained its separate caste-community identity.

But overall there is a major contrast between the Bengali samples from Bangladesh, and the previous Gujarati samples, now also in the 1000 Genomes (you can see the Patel cluster on the other side of the Bengalis, as it bulges out). The non-Patel Gujaratis were genetically varied, some very similar to individuals from Pakistan. In contrast there isn’t that sort of cline among the Bengali samples (it doesn’t look like they sampled any Bengali Brahmins in this data set, at least those of full heritage). The Punjabi samples were collected from Lahore, and they range from many individuals who are little different from Pathans to some whose genetic background resembles those from middle castes in Southern India. I don’t know what’s going on here, but there has been some back migration of laborers into the Punjab historically. I believe this is the origin of some low caste groups who are now Christian. Both the Telugu and Tamil samples have a few Brahmins in them. This is clear in the following plot:


The two Brahmins are from Tamil Nadu. You notice that several of the 1000 Genomes Tamil and Telugu samples are rather close to them. South Indian Brahmins tend to be genetically very similar, so almost certainly that’s what these individuals are if they are placed here on the PCA. Though the Tamil samples are relatively tightly clustered, the Telugu break out into several groups. One of the major 1000 Genomes groups overlaps perfectly with Velamas, a middle caste from Andhra Pradesh. The individuals who are Telugu speakers between the Velamas and Brahmins may be of mixed heritage. I don’t know.

Ultimately I’d like to do some TreeMix and pairwise comparisons between these populations. But to do that I’m going to have to clean them up a bit so that they make sense as…populations.

* The outcaste group in Japan only crystallized during the Tokugawa period. Not long enough to be genetically that distinct from the broader Japanese population.

• Category: Science • Tags: 1000 Genomes, Genetics 

I don’t have time for this, but I’m sure some readers do. 1000 Genomes has put a tutorial up. Breakdown:

1. Description of the 1000 Genomes Data, Gabor Marth pdf|pptx

2. How to access the Data, Paul Flicek pdf|pptx

3. Lessons in variant calling and genotyping, Hyun Min Kang pdf|pptx

4. Structural Variants, Ryan Mills pdf|pptx

5. Imputation in GWAS studies, Bryan Howie pdf|pptx


All Slides in PDF

(Republished from Discover/GNXP by permission of author or representative)
• Category: Science • Tags: 1000 Genomes, Genomics, Human Genomics 
The Pith: You are expected to have 30 new mutations which differentiate you from your parents. But there is wiggle room around this number, and you may have more or fewer. This number may vary across siblings, and may explain differences across siblings. Additionally, previously used estimates of mutation rates may have been too high by a factor of 2. This may push the “last common ancestor” of many human and human-related lineages back by a factor of 2 in terms of time.

There’s a new letter in Nature Genetics on de novo mutations in humans which is sending the headline writers in the press into a natural frenzy trying to “hook” the results into the X-Men franchise. I implicitly assume most people understand that they all have new genetic mutations specific and identifiable to them. The important issue in relation to “mutants” as commonly understood is that they have salient identifiable phenotypes, not that they have subtle genetic variants which are invisible to us. Another implicit aspect is that phenotypes are an accurate signal or representation of high underlying mutational load. In other words, if you can see that someone is weird in their traits, presumably they are rather strange in their underlying genetics. This is the logic behind models which assume that mutational load has correlates with intelligence or beauty, and these naturally tie back into evolutionary rationales for human aesthetic preferences (e.g., “good genes” models of sexual selection).

Variation in genome-wide mutation rates within and between human families:

J.B.S. Haldane proposed in 1947 that the male germline may be more mutagenic than the female germline…Diverse studies have supported Haldane’s contention of a higher average mutation rate in the male germline in a variety of mammals, including humans…Here we present, to our knowledge, the first direct comparative analysis of male and female germline mutation rates from the complete genome sequences of two parent-offspring trios. Through extensive validation, we identified 49 and 35 germline de novo mutations (DNMs) in two trio offspring, as well as 1,586 non-germline DNMs arising either somatically or in the cell lines from which the DNA was derived. Most strikingly, in one family, we observed that 92% of germline DNMs were from the paternal germline, whereas, in contrast, in the other family, 64% of DNMs were from the maternal germline. These observations suggest considerable variation in mutation rates within and between families.

From what I gather there’s a straightforward reason why the male germline, the genetic information which is transmitted by sperm to a male’s offspring, is more mutagenic: sperm are produced throughout your whole life, and over time replication errors creep in. This is in contrast to a female’s eggs, where the full complement are present at birth. The fact that mutations creep in through sperm is just a boundary condition of how mutations creep in to the germline in the first place: errors in the DNA repair process. This is good on rare occasions (in that mutations may actually be fitness enhancing), more often this is bad (in that mutations are fitness detracting), and oftentimes it is neutral. Remember that in terms of function and fitness a large class of mutations don’t have any effect. Consider the fact that in the general population 1 out of 25 people of European descent carries a mutation which can cause cystic fibrosis if it manifests in a homozygote genotype. But the vast majority of cystic fibrosis mutations are present in people who are heterozygote, and have a conventional functional gene which “masks” the deleterious allele.* And there are many mutations which are silent even in homozygote form (e.g., if there is a change in a base at a synonymous position).

As noted in the letter above until recently estimating mutation rates was a matter of inference. On the broadest canvas one simply looked at differences between two related lineages which had been long separated (e.g., chimpanzee vs. human), and so accumulated many differential mutations, and assayed the differences. It may have been a fine-grained inference in the case of individuals who manifested a disease which exhibited a dominant expression pattern, so that one de novo mutation in the offspring could change the phenotype. For most humans this is thankfully not a major issue, and mutations remain cryptic for most of our lives. But no longer. With cheaper sequencing at some point in the near future most of us will have accurate and precise copies of our genomes available to us, and we will be able to see exactly where we have unique mutations which differentiate us from our parents and our siblings.

In this paper the authors took two “trios,” parent-child triplets, and compared their patterns of genetic variation at the scale of the full genome to a very high level of accuracy. Accuracy obviously matters a great deal when you might be looking for de novo mutations which are going to be counted on the scale of hundreds when base pairs are counted in billions. In the future when we have billions and billions of genomes on file and omnipotent computational tools I suspect there will be all sorts of ways to ascertain “typicality” of regions of your genome, but in this paper the authors naturally compared the parents to the children. If a mutation is de novo it should be underivable from the genetic patterns of the parent. But, sequencing technologies are not perfect, so there’s going to be a high risk for false positives when you are looking for the de novo mutations “in the haystack” (e.g., an error in the read of the offspring can be picked up as a mutation).
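The basic logic of “underivable from the parents” can be sketched in a few lines. This is only a toy Mendelian-consistency check on a single site, with genotypes written as pairs of alleles. The function names are hypothetical, and it says nothing about the validation work the authors actually did to weed out sequencing errors.

```python
def possible_child_genotypes(mother, father):
    """All genotypes a child could inherit: one allele from each parent."""
    return {tuple(sorted((m, f))) for m in mother for f in father}


def is_de_novo_candidate(child, mother, father):
    """True if the child's genotype cannot be built from the parents' alleles."""
    return tuple(sorted(child)) not in possible_child_genotypes(mother, father)


# Both parents are A/A, yet the child carries a G: a candidate de novo mutation
# (or, just as plausibly at this stage, an error in one of the three reads).
print(is_de_novo_candidate(("A", "G"), ("A", "A"), ("A", "A")))

# Here the G is simply inherited from a heterozygous mother.
print(is_de_novo_candidate(("A", "G"), ("A", "G"), ("A", "A")))
```

A site flagged this way is only a candidate, which is why the false positive problem described above matters so much.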

So they started with ~3,000 candidate de novo mutations (DNMs) for each family trio after comparing the genomes of the trios, but narrowed it down further experimentally as they filtered out the false positives. You can read the gory details in the supplements, but it seems that they focused on the identified candidates to see if they were: germline DNMs, non-germline DNMs, variant inherited from the parents, or a false positive call. So it turns out that half of the preliminary DNMs were somatic and about 1% turned out to be germline. Remember that the difference is that the germline mutations are going to be passed on to one’s offspring, while the somatic mutations only have impact on one’s physiological fitness over one’s life history. For the purposes of evolution germline mutations are much more important, though over your lifetime somatic mutations are going to be very important as you age.

After the methodological heavy-lifting the results themselves are interesting, albeit of somewhat limited generalizability because you are focusing on two trios only. Before we examine the results here’s a figure which illustrates the study design:

From what I can gather there are two primary findings in this paper:

1) Variance in the sex-mediated nature of DNMs across trios. Only one of the trios was close to expectation, with the male germline responsible for the vast majority of DNMs.

2) A more precise estimate of human mutational rates which might have implications for “molecular clock” estimates used in evolutionary phylogenetics.

Here are the findings in a figure which shows the 95% confidence intervals around estimated mutation rates:

CEU refers to the sample of white Utah Mormons commonly used in medical genetics, while YRI refers to Yoruba from Nigeria. Remember, these are two families only. That severely limits the power of the insights which you can draw, but already you see that while the CEU trio shows the expected imbalance between male and female contribution to DNMs, the YRI trio does not. But, both of the trios do suggest a lower mutation rate than found in previous studies which inferred the value from species divergence. Here is the portion which is relevant for human evolution: “These apparently discordant estimates can be largely reconciled if the age of the human-chimpanzee divergence is pushed back to 7 million years, as suggested by some interpretations of recent fossil finds.” I wouldn’t put my money on this quite yet, going by just this one study, but I’ve been hearing that this paper doesn’t come to this number in a scientific vacuum. Other researchers are converging upon a similar recalibration of mutational rates which might push back the time until the last common ancestor of many divergent hominoid and hominin lineages (including modern humans).
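Why does a lower mutation rate push dates back? Under a simple molecular clock, the time since two lineages split is the sequence divergence between them divided by twice the per-year mutation rate (twice, because both lineages accumulate changes). The sketch below only illustrates that proportionality. The divergence and generation-time values are placeholder assumptions, not figures from the paper, so only the ratio between the two results means anything.

```python
def divergence_time_years(per_site_divergence, rate_per_generation, generation_years):
    """Simple molecular-clock estimate: T = d / (2 * mu_per_year)."""
    rate_per_year = rate_per_generation / generation_years
    return per_site_divergence / (2 * rate_per_year)


# Placeholder inputs, chosen only to show the scaling (assumptions).
divergence = 0.012       # hypothetical fraction of sites that differ
generation_years = 25    # hypothetical generation time

lower_rate = 1e-8             # roughly the per-generation rate discussed here
higher_rate = 2 * lower_rate  # an estimate "too high by a factor of 2"

t_higher = divergence_time_years(divergence, higher_rate, generation_years)
t_lower = divergence_time_years(divergence, lower_rate, generation_years)
print(t_lower / t_higher)  # halving the rate doubles the inferred time
```

The same observed divergence simply takes twice as long to accumulate if mutations arrive half as fast.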

Moving the lens back to the present and of more personal genomic relevance:

Mutation is a random process and, as a result, considerable variation in the numbers of mutations is to be expected between contemporaneous gametes within an individual. If modeled as a Poisson process, the 95% confidence intervals on a mean of ~30 DNMs per gamete (as expected from a mutation rate of ~1 × 10⁻⁸) ranges from 20 to 41, which is a twofold difference. Truncating selection might act to remove the most mutated gametes and thus reduce this variation among gametes that successfully reproduce, however, any additional heterogeneity in stem-cell ancestry or environment (for example, variation in the number of cell divisions leading to contemporaneous gametes) would likely increase inter-gamete variation in the number of mutations.
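The 20-to-41 range quoted above is easy to reproduce. A minimal sketch, assuming the interval is the pair of 2.5% and 97.5% quantiles of a Poisson distribution with mean 30 (the function names are mine, not the authors’):

```python
import math


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson variable, summed term by term."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total


def poisson_quantile(p, mean):
    """Smallest k such that P(X <= k) >= p."""
    k = 0
    while poisson_cdf(k, mean) < p:
        k += 1
    return k


def central_interval(mean, coverage=0.95):
    """Equal-tailed interval covering the requested share of outcomes."""
    tail = (1 - coverage) / 2
    return poisson_quantile(tail, mean), poisson_quantile(1 - tail, mean)


print(central_interval(30))  # (20, 41)
```

So even with a perfectly constant underlying rate, one gamete can plausibly carry twice as many new mutations as another.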

Using the much smaller marker set obtained from 23andMe I found that two of my siblings are nearly 3 standard deviations apart in identity-by-descent when it comes to the distribution of full-siblings. In the near future we might be able to ascertain the realized, not just theoretical, extent of mutational load across a family. As noted by the authors much of this might be a function of paternal age. Rupert Murdoch has children who are younger than many of his grandchildren, so there are many, many, “natural experiments” out there, as males are having offspring over 40 years apart.

On a societal level we may be able to estimate the exact public health costs of the rising mean age of fathers. On a personal level we may also be able to note the correlations within families between high levels of DNMs and traits of interest such as intelligence and beauty. Compared to more fine-grained tools of ancestry inference I presume this is going to be dynamite. But it isn’t as if we didn’t know siblings varied before.

Citation: Donald F Conrad, Jonathan E M Keebler, Mark A DePristo, Sarah J Lindsay, Yujun Zhang, Ferran Casals, Youssef Idaghdour, Chris L Hartl, Carlos Torroja, Kiran V Garimella, Martine Zilversmit, Reed Cartwright, Guy A Rouleau, Mark Daly, Eric A Stone, Matthew E Hurles, & Philip Awadalla (2011). Variation in genome-wide mutation rates within and between human families Nature Genetics : 10.1038/ng.862

* In a random mating population the proportions are defined by the Hardy-Weinberg Equilibrium, p² + 2pq + q² = 1, so where q = 0.04, q² = 0.0016 and 2pq = 0.0768. Heterozygote genotypes of CF outnumber homozygote ones by roughly 50 to 1.
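The footnote’s arithmetic, spelled out as a small sketch. It takes the footnote’s q = 0.04 as given; the function name and genotype labels are just for illustration.

```python
def hardy_weinberg(q):
    """Expected genotype frequencies for a two-allele locus under random mating."""
    p = 1 - q
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


freqs = hardy_weinberg(0.04)
carriers_per_affected = freqs["Aa"] / freqs["aa"]

print(round(freqs["aa"], 4))             # 0.0016
print(round(freqs["Aa"], 4))             # 0.0768
print(round(carriers_per_affected, 1))   # 48.0
```

The ratio works out to 2p/q = 48, which is where “roughly 50 to 1” comes from. Note that if “1 out of 25” is read as the carrier frequency (2pq) rather than the allele frequency, q would be closer to 0.02 and the ratio closer to 100 to 1; either way heterozygotes vastly outnumber homozygotes.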

Bloggy addendum: The first author of this letter is Don Conrad who is a contributor to Genomes Unzipped.

(Republished from Discover/GNXP by permission of author or representative)
I was semi-offline for much of last week, so I only randomly heard from someone about the “Science paper” on which Molly Przeworski is an author. Having finally had a chance to read it front to back, it seems rather a complement to other papers, addressed to both man and beast. The major “value add” seems to be the extra juice they squeezed out of the data because they looked at the full genomes, instead of just genotypes. As I occasionally note the chips are marvels of technology, but the markers which they are geared to detect are tuned to the polymorphisms of Europeans.

Classic Selective Sweeps Were Rare in Recent Human Evolution:

Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of “classic selective sweeps” (in which a beneficial mutation arises and rapidly fixes in the population). Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent-sweep model, we found that diversity levels decrease near exons and conserved noncoding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched in alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of human adaptation over the past ~250,000 years.

Figure 2 shows the top-line result. There are certain mutations which are “non-synonymous,” in that they change the amino acid encoded by the codon. Others are “synonymous,” insofar as changing the base pair has no direct functional impact. Since natural selection “sees” function the expectation is that it would impact the two types of substitutions differently. More specifically, synonymous bases should be relatively “neutral” in terms of their rate of change vis-a-vis non-synonymous bases, which may be affected by both positive and negative selective forces.
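To make the synonymous/non-synonymous distinction concrete, here is a small sketch that classifies a single-base change using the standard genetic code. The compact table layout and the function name are my own choices for the example and have nothing to do with the paper’s methods.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(codon, position, new_base):
    """Label a single-base change in a codon (position is 0, 1 or 2)."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


# GAA and GAG both encode glutamate, so a third-position change is silent.
print(classify_substitution("GAA", 2, "G"))
# GAG -> GTG swaps glutamate for valine (the sickle-cell change in beta-globin).
print(classify_substitution("GAG", 1, "T"))
```

Most third-position changes come out synonymous, which is why those sites serve as the rough “neutral” yardstick in comparisons like this one.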

A “classic sweep” is a very easy dynamic to imagine. Single mutations arise which are very favored and so are driven to “fixation,” ~100%, within the population rather rapidly by positive directional selection. Since the mutation is embedded in the broader genome natural selection will also “catch” other variants associated with the mutant of interest, in direct proportion to parameters such as distance and rates of recombination. Selective sweeps then produce regions of relative homogenization as a whole block of the ancestral background genome around the favored mutant is dragged upward in frequency. The interesting point in this paper is that the authors show that there’s relatively little difference in the pattern between functionally significant and non-significant regions of the genome. As the classic sweep models are predicated upon strong positive selection operating upon a favored variant, something seems off.

What does this mean? Selective sweeps are a tractable dynamic. If they’re not so ubiquitous then human evolutionary genetics becomes a rather more complex game, with different varieties of natural selection operative. By analogy, perhaps this is similar to the unfortunate reality that the “common disease-common variant” hypothesis seems to be only marginally fruitful.

Now, it does turn out that some traits do seem to have been driven by conventional sweeps. Pigmentation, infectious disease resistance, and lactase persistence. No surprise that these are traits whose genetic architectures have also been relatively well elucidated. Finally, I find this passage intriguing:

…Thus, to dissect the genetic basis of human adaptations and assess what fraction of the genome was affected by positive selection, we need new tests to detect other modes of selection, such as comparisons between closely related populations that have adapted to drastically different environments….

I have a candidate dyad in mind: Papuans and Australian Aborigines. They separated as distinctive populations within the past 10 to 20,000 years, and have diverged greatly in their mode of existence with the spread of horticulture in the highlands of New Guinea.

Citation: Hernandez RD, Kelley JL, Elyashiv E, Melton SC, Auton A, McVean G, 1000 Genomes Project, Sella G, & Przeworski M (2011). Classic selective sweeps were rare in recent human evolution. Science (New York, N.Y.), 331 (6019), 920-4 PMID: 21330547

(Republished from Discover/GNXP by permission of author or representative)

Tishkoff et al.

Reading Peter Bellwood’s First Farmers: The Origins of Agricultural Societies, I’m struck by how much of a difference five years has made. When Bellwood was writing, the ‘orthodoxy’ of the nature of the expansion of farming into Europe leaned toward cultural diffusion. Today the paradigm is in flux, as a new generation of genomic studies using ancient DNA, wider sets of markers, and a broader sampling of populations, makes untenable solid old truths. I’m reading Bellwood’s work in part because from what I have read elsewhere it seems as if his model is less and less ridiculous in light of the new information bubbling out of human genomics. The swell of data in this field is such that it’s hard to keep up. You never know what you’re going to wake up to in the morning. The assertions of archaeologists and pre-historians such as Bellwood have clear implications and offer up specific predictions about the shape of the tree of human phylogenetics. Now the results are getting robust enough that the models can be tested, and alternatives refuted or accepted. But sometimes you need to take stock. Many of my posts make the assumption that you have a lot of the background information in hand, but I know that’s not always possible. With that, I’d like to bring your attention to a paper in Human Molecular Genetics, Fine-scale population structure and the era of next-generation sequencing:

Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world’s population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.

E(x) = 4th cousins

The paper reviews all the different ways in which human populations are related, the evolutionary forces which they’re shaped by, and the nested layers of population structure which we’re now just starting to explore. A few months ago I blogged that geneticists have found that they could differentiate population clusters on the scale of nearby villages in Europe! This is incredible; as recently as 15 years ago scientists would have struggled for a dozen markers to differentiate populations separated by continents. In the paper on Sami genomics which I covered earlier in the week I didn’t even bother to mention that the Sami apparently exhibit internal population structure. Not too surprising given the fragmented nature of low density marginal ecologies, but nevertheless a reality check on the fact that we tend to perceive such people as a homogeneous whole. The Bushmen of South Africa are among the most diverse people in the whole world. SNP-chips fine-tuned with European genetic variation are likely missing many variants peculiar to Bushmen, so our estimates are surely lower bounds.

In the paper above they suggest that a next step in the exploration of human genomic variation will be on a finer scale more broadly. Baden-Baden vs. Baden-Württemberg. Patterns of similarities across chromosomal segments which indicate near-familial relations because of clear identical-by-descent DNA. For example, they note that the average Ashkenazi Jew has a genetic distance from another random Ashkenazi Jew on the order of 4th cousins. Most human variation is found within populations, but there are still thousands of markers which exhibit a great deal of inter-population variance, and serve as a distinctive record of the evolutionary history of a given group. They also suggest that generally the focus has been on broad-scale population differences which have a deep time depth, on the order of tens of thousands of years. In contrast, the origin of the Ashkenazi Jews likely goes back no earlier than 1,000 years. Over the past 1,000 years they’ve coalesced into a culturally and genetically coherent people.
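To put “on the order of 4th cousins” in perspective, here is the textbook expectation for how much of the autosomal genome full cousins of a given degree share identical-by-descent in an outbred pedigree. This is a sketch of that standard expectation only; how the paper itself measured relatedness isn’t described here, and in an endogamous group the sharing comes from many distant common ancestors rather than one recent couple.

```python
from fractions import Fraction


def expected_sharing_full_cousins(degree):
    """Expected IBD fraction for full nth cousins: (1/2) ** (2n + 1).

    degree 0 corresponds to full siblings, 1 to first cousins, and so on.
    """
    return Fraction(1, 2) ** (2 * degree + 1)


for degree in range(1, 5):
    share = expected_sharing_full_cousins(degree)
    print(degree, share, f"{float(share):.3%}")
```

Fourth cousins come out at 1/512, or about 0.2% of the genome, which is small for any one pair but striking as an average between random members of a population.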

But why stop with small endogamous groups? 23andMe has ancestry paintings which show you your “Asian,” “European” and “African” ancestry along the chromosomes, using the three ancestral reference populations. But in the figure above from the paper you see a Cape Coloured whose ancestry has been broken down between the two African populations who are dominant in the ancestral makeup of that ethnic group. Dodecad tells me I’m about ~15% East Asian, while 23andMe tells me I’m ~45% Asian. The difference is that 23andMe doesn’t break out the indigenous South Asian from exogenous East Asian. At some point I assume I’ll be able to get an ancestry painting which shows the two ancestral categories separately.

A major reason that they seem to think the focus will be on fine-scale ancestry is that it will smoke out recessive diseases found in cryptically endogamous groups. From this I conclude that they lean to the side of those who have asserted that “Jewish diseases” are well known and prominent in part because of the focus given to that group. South Asians would be a good target of such new focus; according to some researchers there are clear patterns of genetic endogamy in this set of populations (for which we have anthropological cause). They also note that these recent rare variants are naturally going to break down along population lines, because they haven’t had time to spread.

Speaking of lacking time to spread, the evolutionary parameter of natural selection may also exhibit a lot of between population difference. Strong “sweeps” due to positive selection tend to be partitioned across continental races. This is partly a function of time. By the time a sweep goes from one end of a continent to another, it may run into another sweep which renders its selective effect irrelevant. It has famously been found that light pigmentation in western and eastern Eurasia has different genetic architectures. In other words, the phenotype is arrived at by different genotypic means in the two groups. Additionally, even within broader groups selection because of local adaptation can differentiate demes. For example, the Tibetans and the Han, or differences in lactase persistence in Spain.

Finally, they note that there will be continued advances in understanding of admixed ancestry, as well as a special focus on exonic regions (coding). The latter should be especially interesting, as it might give us more insight into functional differences and similarities. The tentative possibilities of the 1000 Genomes, HapMap, and HGDP, are also outlined. Here is their conclusion:

New sequencing technology enables genetic studies with larger and larger sample sizes, increasing our power to detect associations between genetic variants and medically relevant traits, especially in the case of rare variants. Understanding patterns of admixture and population structure is an important part of maximizing this detection power and reducing confounding factors in genome sequencing-to-phenotype association studies (50-54). As we have seen in the analysis of dense array-genotype data, the same sequencing technology that enables sequencing/phenotype mapping will also enable us to improve our knowledge of population structure at a fine scale. This improved knowledge should be of assistance not only in identifying structure in association studies, but also in the description of human history and genetic adaptation, and in the development of personalized medicine tools.

Oh, and the bar plot is nice & crisp. I’ve reedited it a bit, but check out (left to right, K = 9 to K = 5):

Citation: Brenna M. Henn, Simon Gravel, Andres Moreno-Estrada, Suehelay Acevedo-Acevedo, & Carlos D. Bustamante (2010). Fine-scale population structure and the era of next-generation sequencing Hum. Mol. Genet. : 10.1093/hmg/ddq403

(Republished from Discover/GNXP by permission of author or representative)
• Category: Science • Tags: 1000 Genomes, Genetics, Genomics, Population Genetics 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"