The Unz Review: An Alternative Media Selection
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Razib Khan
Gene Expression Blog


Dienekes Pontikos has just released DIY Dodecad, a DIY admixture analysis program. You can download the files yourself. It runs on both Linux and Windows. Since I already have tools in Linux I decided to try out the Windows version, and it seems to work fine. It is somewhat limited in that you start out with the parameters which Dienekes has set for you, but if you don’t want to write your own scripts and get familiar with all the scientific programs out there, I think this is a very good option. Additionally, it seems to run rather fast, so you won’t spend days experimenting with different parameters.

Dienekes has already run me, but I put my parents’ genotype files through the system. Here are the results:

Population Razib Mother Father
East_European 6.9 6.5 4.3
West_European 1.7 3.1 5.5
Mediterranean 6.3 5.6 5.9
Neo_African 0 0 0
West_Asian 0 2 3.9
South_Asian 65.9 59.6 60.4
Northeast_Asian 2.9 3.8 3.6
Southeast_Asian 15.8 16.6 15.5
East_African 0 0 0.2
Southwest_Asian 0.5 2.5 0.7
Northwest_African 0 0 0
Palaeo_African 0 0.3 0

The main thing to notice is that my mother has more total East Asian ancestry than my father, and that she has a Southwest Asian component of a few percent. These have been consistent findings across the dozens of ADMIXTURE runs I’ve done with various parameter settings and reference population mixes, so it’s nice that DIY Dodecad replicates them. The population sets seem a bit Eurocentric to me, though, so I would recommend it most for those with West Eurasian ancestry.
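As a quick check of that reading of the table, here is a minimal Python sketch that adds up the Northeast_Asian and Southeast_Asian rows for each person. The numbers are copied from the table above; treating the average of the two parents as a rough expectation for the child is a simplifying assumption, and the individual estimates carry noise.

```python
# DIY Dodecad percentages copied from the table above, in the order
# (Razib, Mother, Father).
PEOPLE = ("Razib", "Mother", "Father")
RESULTS = {
    "Northeast_Asian": (2.9, 3.8, 3.6),
    "Southeast_Asian": (15.8, 16.6, 15.5),
    "Southwest_Asian": (0.5, 2.5, 0.7),
}
EAST_ASIAN = ("Northeast_Asian", "Southeast_Asian")


def total(components, person):
    """Sum the listed components for one person, rounded to one decimal."""
    idx = PEOPLE.index(person)
    return round(sum(RESULTS[c][idx] for c in components), 1)


east_asian = {person: total(EAST_ASIAN, person) for person in PEOPLE}

# Assumption: a child's proportion is expected to sit near the parental average.
midparent = (east_asian["Mother"] + east_asian["Father"]) / 2

for person in PEOPLE:
    print(person, east_asian[person])
print("Mid-parent expectation:", round(midparent, 2))
```

The mother's total comes out above the father's, and the child's total lands a little below the parental average, which is within what one would expect from estimation noise.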


Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets and open source software, so anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending; look at what’s coming out (and has come out) of the 1000 Genomes project!

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

!Kung              Buryats            Hausa         Mada           Punjabi Arain      Totonac
Adygei             Cambodian          Hazara        Makrani        Pygmy              Tu
African Americans  Chinese            Hema          Malayan        Romanians          Tujia
Algeria            Chinese Americans  Hezhen        Mandenka       Russian            Tunisia
Altaians           Chukchis           Hungarians    Maya           Sahara Occ         Turks
Alur               Chuvashs           Iban          Mbuti          Sakilli            Tuscans
Ap Brahmin         Cochin Jews        Igbo          Melanesian     Samaritians        Tuvinians
Ap Madiga          Colombian          Iranian Jews  Mexicans       Samoan             Urkarah
Ap Mala            Cypriots           Iranians      Miao           San                Utahn Whites
Armenians          Dai                Iraq Jews     Mongola        San Nb             Uygur
Armenians B        Daur               Irula         Mongolians     Sandawe            Uzbekistan Jews
Ashkenazy Jews     Dogon              Italian       Moroccans      Sardinian          Uzbeks
Azerbaijan Jews    Dolgans            Japanese      Morocco Jews   Saudis             Vietnamese
Balochi            Druze              Jordanians    Morocco N      Selkups            Greenlanders
Bambaran           Greenlanders       Kaba          Morocco S      Sephardic Jews     Xhosa
Bamoun             Egypt              Kalash        Mozabite       She                Xibo
Bantukenya         Egyptans           Karitiana     N European     Sindhi             Yakut
South Africa       Ethiopian Jews     Kets          Naxi           Singapore Chinese  Yemen Jews
Basque             Ethiopians         Khmer         Nepalese       Singapore Indians  Yemenese
Bedouin            Evenkis            Kongo         Nganassans     Singapore Malay    Yi
Beijing Chinese    Fang               Koryaks       Nguni          Slovenian          Yoruba
Belorussian        French             Kurd          North Kannadi  Sotho/Tswana       Yukaghirs
Biaka              Fulani             Kyrgyzstani   Orcadian       Spaniards
Bnei Menashe       Georgia Jews       Lahu          Oroqen         Stalskoe
Bolivian           Georgians          Lebanese      Palestinian    Surui
Brahui             Gujaratis          Lezgins       Paniya         Syrians
Brong              Gujaratis B        Libya         Papuan         Thai
Bulala             Hadza              Lithuanians   Pathan         Tamil Brahmin
Burusho            Han                Luhya         Pedi           Tamil Dalit
Buryat             Han Nchina         Maasai        Pima           Tongan

The data set has ~4,000 individuals, and ~30,000 markers. The binary file is ~25 MB. The download has four files. The .bed, .bim, and .fam, are pedigree formatted. The .csv is a “master list” of the information on each individual (population, region, etc., tied to a specific identification number). This is important because once you have some output files…you need to figure out what it means, and visualize it, and that’s only informative if you have a master list with more than just family and individual information.

Here is the link to the file to download with all the above populations. I’ve pulled it down and run it, so I know it’s not malware.

So what now? The post will be divided into three portions.

1) Running this data in ADMIXTURE

2) Visualizing it in R

3) Manipulating this data in Plink

#1 is not contingent on #2 and #3, so I’ll do that first. You don’t need to read #2 and #3. In fact some of you might be really good at manipulating spreadsheet formatted data, so you may not need #2 at all. But in the R section I’ll also have an easier spreadsheet output for you, so even if you don’t care for R’s visualization, you’ll at least have an easier-to-manage set of .csvs. #3 matters if you want to constrain your data set, and also add your own 23andMe file to the end of it.

#1 Running the data in ADMIXTURE

First, you need Linux or MacOS. If you are on Windows, the Wubi application allows you to have a dual boot. It runs Ubuntu Linux next to Windows, and you can uninstall it as if it were a Windows application.

I am doing this on Ubuntu Linux, for your information. Assuming you have the right operating system, now you need ADMIXTURE. You can put the folder anywhere.

You need to use the terminal to go to the folder where you have ADMIXTURE. The image to the left shows me doing so. You need to click the terminal application, and enter the “cd” command to get to the appropriate folder. My ADMIXTURE program is on the Desktop, within the “GA” folder, and the “admix2” subfolder. So I typed what you see. The “cd” command moves you around the folders, up and down. Google it if it confuses you, though even without knowing what it does you should be fine if you just extract ADMIXTURE to the Desktop, and type “cd Desktop”. This will clutter up your desktop in the future…but if you need to get some stuff done ASAP without knowing how to navigate in Linux, that should work.

So now you have ADMIXTURE, and the files which ADMIXTURE is going to analyze. What do you do? You need to make sure that ADMIXTURE and your files are in the same folder/location. So if ADMIXTURE is on the Desktop, just extract the files to the Desktop. Now you need to run a command. You see a screenshot of me running ADMIXTURE. You may need to omit the ./ (i.e., “admixture” vs. “./admixture”). You see the file name. The option -j2 is due to the fact that I have two cores. If you don’t know what that means, just omit it. It speeds up the run though. The last number is the K. So this is for K = 4.
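Since the screenshot does not survive in text form, here is a minimal Python sketch that assembles the command line described above. The argument order (thread flag, file name, then K as the last item) follows the description in the text; the function name `admixture_command` is made up for illustration, and the optional `--cv` switch is ADMIXTURE's cross-validation flag. Nothing is executed unless you uncomment the last line.

```python
def admixture_command(bed_file, k, threads=None, cross_validate=False,
                      executable="./admixture"):
    """Build the argument list for a single ADMIXTURE run at one value of K."""
    args = [executable]
    if cross_validate:
        args.append("--cv")            # cross-validation; makes the run slower
    if threads:
        args.append(f"-j{threads}")    # one thread per core, e.g. -j2
    args += [bed_file, str(k)]         # the last number is K
    return args


cmd = admixture_command("euraocean.bed", 4, threads=2)
print(" ".join(cmd))

# To actually launch the run (ADMIXTURE and the .bed/.bim/.fam files must be
# in the current folder):
# import subprocess; subprocess.run(cmd, check=True)
```

If your shell finds the program without the leading ./, pass `executable="admixture"` instead.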

Now the program will run. How long depends on the size of the file, and the number of K’s. I often run the program overnight for larger K’s. If you want to get fancy and do stuff like cross-validation, it will take even longer. Be warned. The screenshot to the left is typical of what you’ll run into as ADMIXTURE does its thing. No worries, the algorithm is running. If you watch long enough you’ll get a sense of what values on the screen point to a high likelihood that it’s almost done, and you can start anticipating the output files from which you can make inferences.

Completion! To the right is what you’ll see when ADMIXTURE is done. As noted, there are output files. This is what is really interesting & useful, but even on this screen there’s goodness. The primitive matrix shows you Fst distances between putative ancestral populations. Fst is measuring the proportion of variance within the data set which can be attributed to between population variance. The smaller the value, the less the magnitude of differences between two populations. On this screen you see four populations, since I set K = 4. The Fst is generated from ancestral allele frequencies, which are within the output files. Remember, these are distances between abstract populations, not real ones.
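To make the Fst idea concrete, here is a minimal Python sketch that computes a simple Wright-style Fst between two populations from their allele frequencies. This is a textbook heterozygosity-based formula with equal weight on both populations, not necessarily the exact estimator ADMIXTURE prints, and the frequencies are made-up values for illustration only.

```python
def fst(freqs_a, freqs_b):
    """Fst = (H_T - H_S) / H_T, summed over SNPs, for two populations.

    H_S is the mean expected heterozygosity within the two populations and
    H_T is the expected heterozygosity of the pooled population.
    """
    h_total = 0.0
    h_within = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        p_mean = (pa + pb) / 2
        h_total += 2 * p_mean * (1 - p_mean)
        h_within += (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2
    return (h_total - h_within) / h_total if h_total > 0 else 0.0


# Hypothetical allele frequencies at three SNPs.
pop1 = [0.10, 0.50, 0.80]
pop2 = [0.12, 0.55, 0.75]   # close to pop1
pop3 = [0.90, 0.10, 0.20]   # very different from pop1

print("pop1 vs pop2:", round(fst(pop1, pop2), 4))
print("pop1 vs pop3:", round(fst(pop1, pop3), 4))
```

Populations with similar frequencies give a value near zero, while populations fixed for different alleles give a value of one, which is the sense in which smaller numbers in the matrix mean smaller differences.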

The original files were euraocean.bed, euraocean.bim, and euraocean.fam. So the output files are like so:

euraocean.4.Q
euraocean.4.P


The 4 represents the K. The first file has a list of the proportions for putative ancestral populations for each individual in the data set, the individuals being on separate lines. The second file has all the allele frequencies for the ancestral populations, generated by the parameter K.

What do you do with this? euraocean.4.Q is related to euraocean.fam, which has family and individual IDs line by line. I don’t know how to use spreadsheets in anything but a primitive way, so I assume there are ways to merge the files and get each line to have ancestry proportions as well as more detailed IDs. Generating mean values for populations also seems essential.

But I use R to do this dirty work.

#2 Visualizing the output with R

If you don’t have R, you need to install it. If you don’t know how to start, control-f sudo. That should yank it down for you. Once R is installed, make sure to be in the folder where you have ADMIXTURE. Then type “R” (no quotes when you type a command!). Now you are in R, what do you do? Here are the specifics of what you need to do:

1) Take the Q file, pump it into a data frame

2) Take the master list, pump it into a data frame

3) Take the .fam file, pump it into a data frame

4) Mix & match

5) Calculate mean proportions, output populations, etc.

6) Visualize!
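Steps 1 through 5 can be sketched in a few lines of Python. This is a minimal illustration, not the R script discussed below: it relies on the .Q file and the .fam file listing individuals in the same order, and the master-list column names (`individual_id`, `population`, `region`, `family_id`) are assumptions made up for the example. The inputs are tiny invented stand-ins for ref.2.Q, ref.fam, and the master list.

```python
import csv
import io
from collections import defaultdict


def read_q(handle):
    """One line per individual, K whitespace-separated proportions."""
    return [[float(x) for x in line.split()] for line in handle if line.strip()]


def read_fam(handle):
    """The first two .fam columns are family ID and individual ID."""
    return [tuple(line.split()[:2]) for line in handle if line.strip()]


def mean_by_population(q_rows, fam_ids, master_rows):
    """Average the ancestry proportions over the members of each population."""
    population_of = {(r["family_id"], r["individual_id"]): r["population"]
                     for r in master_rows}
    sums = {}
    counts = defaultdict(int)
    for ids, proportions in zip(fam_ids, q_rows):
        pop = population_of.get(ids)
        if pop is None:                  # not in the master list: skip
            continue
        if pop not in sums:
            sums[pop] = [0.0] * len(proportions)
        sums[pop] = [s + p for s, p in zip(sums[pop], proportions)]
        counts[pop] += 1
    return {pop: [s / counts[pop] for s in total]
            for pop, total in sums.items()}


# Invented example data (K = 2, three individuals, two populations).
q_text = "0.9 0.1\n0.7 0.3\n0.2 0.8\n"
fam_text = "F1 I1 0 0 0 -9\nF2 I2 0 0 0 -9\nF3 I3 0 0 0 -9\n"
master_text = ("individual_id,population,region,family_id\n"
               "I1,popA,west,F1\nI2,popA,west,F2\nI3,popB,east,F3\n")

means = mean_by_population(
    read_q(io.StringIO(q_text)),
    read_fam(io.StringIO(fam_text)),
    list(csv.DictReader(io.StringIO(master_text))),
)
for pop, values in sorted(means.items()):
    print(pop, [round(v, 3) for v in values])
```

With real files you would replace the `io.StringIO(...)` objects with `open("ref.2.Q")`, `open("ref.fam")`, and the master list opened the same way.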

If you needed to know how to install R, you probably don’t know how to do this. When I first started playing around with ADMIXTURE output files I wrote a quick & dirty script. I barely remember what I am doing with this script now, as I don’t care about the details. But it is now at your service. Still, first you need to do one thing: use a master list which is formatted slightly differently from the one that you downloaded. Here is the revised master list.

Put it in the same folder as ADMIXTURE. Then start R, again, by typing “R.” Run the command you see above. This creates an “HGDPMaster” data frame. That’s necessary for the script I’m giving you to run.

The script is here. If it doesn’t download, copy & paste, and create a file “Rstuff.R”, in the same folder as ADMIXTURE. There are a few variables which you have to manipulate. Here is the relevant section:

# change these
### outputfiles

#### sets the number of populations to through
#lowest K
#highest K

You need to change the file name to the one you have output. If you didn’t do any manipulation, it should be ref.2.Q for K = 2, so the name is “ref.” You also need to put in the number of K’s. I often run many simultaneously, which I have output files for in the morning. So I often start with 2 and end with 12. If you just want to output one, for example, 2, change Start_K to 2, and End_K to 2. These are the only variables you need to change. But there is a lot more you could do. R “comments” with #, so there is a section which I commented out where you can limit the output to particular populations to make the bar plot less busy. You’ll see what I mean if you look at the script: just remove all the #’s, and re-edit to your taste. Please note that casing matters, so make sure to keep it lower case when possible (if you looked at the master list, you understand). The script does have a string to upper case function, but that’s only for the output. There’s also a small section where you can re-edit the names to your taste.
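The naming convention can be sketched as follows. This is a small Python illustration of how a prefix plus a K range maps onto the .Q files, with a sanity check that each row has K values summing to about 1. The function names are made up for the example; `Start_K` and `End_K` in the script play the role of `start_k` and `end_k` here.

```python
def q_file_names(prefix, start_k, end_k):
    """Names of the .Q files expected for every K from start_k to end_k."""
    return [f"{prefix}.{k}.Q" for k in range(start_k, end_k + 1)]


def row_matches_k(line, k, tolerance=1e-3):
    """True if a .Q line holds exactly K proportions that sum to roughly 1."""
    values = [float(x) for x in line.split()]
    return len(values) == k and abs(sum(values) - 1.0) <= tolerance


print(q_file_names("ref", 2, 4))
print(row_matches_k("0.25 0.75", 2))
```

A row that fails the check usually means the file belongs to a different K than the one you think you are reading.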

To run the script, do like so:


It should output bar plots, as well as generating some spreadsheet files. There’s a lot more you can do…but if you can do a lot more, you wouldn’t be reading this post. Let’s move to the next issue. So now you wonder: is there any way I can change the data file, or add myself to it? Read on….

#3 Using Plink to manipulate the data file

Now you need Plink. I usually put it within the same larger folder as a subfolder parallel with ADMIXTURE. You run the Plink command like so: “./plink” or, “plink.” Depends on the environment (remember, the quotes are only for the post!). There are many things you can do with Plink. I will show you how to do two things.

#1 remove individuals from the data set

#2 add yourself (or someone whose 23andMe file you have) to the data set

#1 is important because the plots get busy with too much variance. Additionally, Africans, and genetic isolates which have gone through population bottlenecks, tend to overwhelm ADMIXTURE. You probably want to remove them. To do this you need to use the remove option, which works on individuals rather than on whole populations.

Here’s one option with the file you’ve got:

./plink --bfile ref --remove removelist.txt --make-bed --out refRemoved

What’s going on above? You’re using a binary pedigree file, so you have the --bfile option on. You do the deed with --remove, and then you create a second binary pedigree file, refRemoved. So you’ll have refRemoved.bed, refRemoved.bim, and refRemoved.fam. Obviously removelist.txt has what you want to remove. Each line has a family ID and individual ID, separated by a space, for each person you want to remove. The easiest way is probably to open up the master list. For the one I gave you above the last column is the family ID and the first is the individual ID. Cut & paste the first column after the last, delete the other columns, and save. I usually get rid of quotations and tabs, change it to a .txt file, and there you have it.
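If you would rather not do the cut & paste by hand, the same list can be built with a short Python sketch. It assumes, as described above, that the first column of the master list is the individual ID and the last is the family ID; the position of the population column (second column here) and the example rows are assumptions for illustration.

```python
import csv
import io


def build_remove_list(master_rows, populations_to_drop):
    """Return 'familyID individualID' lines for everyone in the given populations."""
    drop = {name.lower() for name in populations_to_drop}
    lines = []
    for row in master_rows:
        if not row:
            continue
        individual_id, population, family_id = row[0], row[1], row[-1]
        if population.lower() in drop:
            lines.append(f"{family_id} {individual_id}")
    return lines


# Invented master-list rows: individual ID, population, region, family ID.
master_text = ("ID1,yoruba,africa,FAM1\n"
               "ID2,french,europe,FAM2\n"
               "ID3,mbuti,africa,FAM3\n")
rows = list(csv.reader(io.StringIO(master_text)))
remove_lines = build_remove_list(rows, ["Yoruba", "Mbuti"])
print("\n".join(remove_lines))

# To write the file that --remove expects:
# with open("removelist.txt", "w") as out:
#     out.write("\n".join(remove_lines) + "\n")
```

The comparison is case-insensitive, so it does not matter whether the master list stores population names in lower case.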

But what about your 23andMe file? You need to convert it to pedigree. I have created a quick & dirty perl script to do so. You can find it here. Download or cut & paste it. You need to remove the comments at the top of the 23andMe file. That is, you need to remove everything before the first SNP. Assuming that’s done, do this at the command line within the folder where you put the script (you get to that folder with “cd” recall):

perl "YourFileName" "001" "001"

The script fires, gets the file name from the first parameter, and outputs two files: YourFileName.ped and a companion file (Plink’s --file option, used below, expects a .map alongside the .ped). What about the two other parameters? They generate your family ID and individual ID. They’d be FAM001 and ID001 in this case. You need to enter these into the master list! Otherwise you won’t come out on the bar plots. Also enter your ethnicity, etc. Or, just your name if you want to be your own slice of the bar plot.
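For readers who want to see what such a conversion involves, here is a minimal Python sketch of the same idea. It is not the Perl script mentioned above. It assumes the usual tab-separated 23andMe layout (rsid, chromosome, position, genotype), writes missing calls as `0 0`, and doubles single-letter genotypes; unlike the workflow described above, it skips `#` comment lines itself. The rsids in the demo are invented.

```python
def convert_23andme(lines, family_id, individual_id):
    """Turn 23andMe raw-data lines into one .ped line and a list of .map lines."""
    map_lines = []
    alleles = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                  # header comments
        rsid, chrom, pos, genotype = line.split("\t")[:4]
        if len(genotype) == 2 and "-" not in genotype:
            a1, a2 = genotype[0], genotype[1]
        elif len(genotype) == 1 and genotype != "-":
            a1 = a2 = genotype                        # single-copy call
        else:
            a1 = a2 = "0"                             # no-call
        # .map columns: chromosome, SNP id, genetic distance, position
        map_lines.append(f"{chrom}\t{rsid}\t0\t{pos}")
        alleles += [a1, a2]
    # .ped columns: family, individual, father, mother, sex, phenotype, alleles
    ped_line = " ".join([family_id, individual_id, "0", "0", "0", "-9"] + alleles)
    return ped_line, map_lines


sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0001\t1\t1000\tAG",
    "rs0002\t1\t2000\t--",
    "rs0003\tX\t3000\tC",
]
ped_line, map_lines = convert_23andme(sample, "FAM001", "ID001")
print(ped_line)
print("\n".join(map_lines))
```

Writing `ped_line` to YourFileName.ped and the joined `map_lines` to YourFileName.map would give the pair of files the next step works from.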

Note that you have .ped, not .bed, files. These are big. Now you need to convert the text to binary pedigree. Move the YourFileName files to the plink folder. Make binary:

./plink --file YourFileName --make-bed --out YourFileName

Now you have YourFileName.bed YourFileName.bim YourFileName.fam. It is best to limit your SNPs to the same as those in the reference data set. So get those from the reference:

./plink --bfile ref --write-snplist --out SNPs

You should have a file, SNPs.snplist. Use it to filter your 23andMe file.

./plink --bfile YourFileName --extract SNPs.snplist --make-bed --out YourFileNameFiltered
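Before merging, it can be worth checking how many reference SNPs actually survive in your filtered file. Here is a minimal Python sketch, assuming the standard .bim layout in which the SNP identifier is the second column; the function names and the demo lines are invented for illustration.

```python
def snp_ids(bim_lines):
    """Collect SNP identifiers, the second column of a .bim file."""
    return {line.split()[1] for line in bim_lines if line.strip()}


def overlap_report(reference_bim, personal_bim):
    """Count shared SNPs and list reference SNPs absent from the personal file."""
    ref_ids = snp_ids(reference_bim)
    own_ids = snp_ids(personal_bim)
    return {
        "reference": len(ref_ids),
        "personal": len(own_ids),
        "shared": len(ref_ids & own_ids),
        "missing_from_personal": sorted(ref_ids - own_ids),
    }


# Invented .bim lines: chromosome, SNP id, genetic distance, position, alleles.
reference = ["1 rs0001 0 1000 A G", "1 rs0002 0 2000 C T", "2 rs0003 0 500 G T"]
personal = ["1 rs0001 0 1000 A G", "2 rs0003 0 500 G T"]
report = overlap_report(reference, personal)
print(report)

# With real files:
# with open("ref.bim") as r, open("YourFileNameFiltered.bim") as p:
#     print(overlap_report(r, p))
```

A large number of missing SNPs means your own genotype will be sparse in the merged set.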

Now you want to merge:

./plink --bfile ref --bmerge YourFileNameFiltered.bed YourFileNameFiltered.bim YourFileNameFiltered.fam --make-bed --out ref

You are now appended to the reference data set! If you open up the ref.fam file your family ID and individual ID should be at the end of the list.

If you’ve slogged through this far, I thought it would be nice to end with something which shows what this is all about. Below I’ve filtered most African and New World populations out of the reference data set, and run it from K = 2 to K = 12. It took about 10 hours to complete. I’ve also limited the populations to display using the script above so that it isn’t too cluttered. Here are the spreadsheets generated from the runs (they will be in the folder where you run the R script, and have the form “K =2” and such for names).



Since I know plenty of friends are getting, or just got, their V3 results, I thought I’d pass this on: Open-ended submission opportunity for 23andMe data (#2):

Who is eligible

Everyone who is of European, Asian, or North African ancestry and all four of his/her grandparents are from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Also, Zack has more than 30 individuals in HAP. The “cow belt” is still way underrepresented. The only Bengalis in the data set are my parents.


Dienekes did another run of his data with K = 64. He posted a huge plot with the two largest dimensions of variation. He also posted an accompanying spreadsheet with the coordinates of where the Dodecad samples were. So I found my own position pretty quickly. Before going to that, I thought I’d repost a comparison between myself, the HapMap Gujaratis, the North Kannadi sample, and the HGDP Uygurs. This is at K = 10 in ADMIXTURE from Dodecad.

OK, with that in mind, here’s the full MDS with the two largest components of genetic variation. I’ve added large labels. Also, click the image for a larger file so you can read the small labels.

One thing that jumps out at me is the tight clustering of very populous groups such as Europeans. The East Asians and Yoruba samples aren’t as representative of their macro-region, so that makes some sense. But the Dodecad Ancestry Project has a lot of West Eurasian groups, so the affinity there is still striking. I am basically a touch off the “North Kannadi” cluster, a little toward the Uygurs. In the clustering which is the main focus of Dienekes’ post I also fall into a North Kannadi cluster. Interestingly, in Zack’s preliminary run with the South Asian data set I’m 71% with Nepalis, and 29% with part of the Singapore Indians (most of whom I assume are Tamil). Note the close position of the Uygurs to the North Kannadi, despite the fact that geographically the Uygur are much closer to Pakistani populations. It just goes to show you what happens when you throw a whole lot of genetic variation into the pot, and then focus on the two largest components of variance. The axis between Europe and East Asia is spanned by South Asians. But some South Asian groups, such as the North Kannadi sample, have an ancestry component somewhat more like East Asians than West Eurasians, so they get placed closer to East Asians on the two dimensional plots. This is what Dienekes terms the “South Eurasian” element, which has been submerged almost everywhere by a West and East Eurasian element.

Here’s a close up of the South Asian region of the plot. You can see how close the Uygurs are to the North Kannadi sample, and how close I am to the North Kannadi. But two of the North Kannadi samples are out of the cluster in the MDS. I assume they’re the individuals with a lot of the purple ancestral component, what Dienekes termed “West Asian.” The individual between the Gujarati and North Kannadi clusters is probably the one with the slight orange “East Asian” component. And that gives you insight into what’s going on with me. If you removed the orange component from my ancestry I’d probably be in the Gujarati cluster. I’m “pulled” to the North Kannadi cluster in direct proportion to my East Asian ancestral component. The MDS plot isn’t “wrong”; it is visualizing the data correctly within the constraints imposed by our own abilities to process information intuitively. But without the ADMIXTURE plot you’d probably make the wrong inference about my population assignment. With that information the likely hypothesis would be that I’m from a liminal population which has interactions with East Asian groups (e.g., Nepali, Assamese, or Bengali).

Note: Removing the Africans from the sample, or visualizing different combinations of dimensions, would also certainly clear up the confusion in this case. But again, these sorts of steps require a human understanding of what the techniques are presenting to you.

• Category: Science • Tags: Admixture, Dodecad, Genetics, Genomics 

I have spoken of my somewhat atypical, for a South Asian, genetic results before. Recently Dienekes performed some cluster analysis which confirmed the initial findings, while adding a little detail:

I am DOD075. The Southeast Asian component is modal in Malays, while the East Asian component is modal in the North Chinese. Vietnamese and Cambodians are mixed, with the former biased toward East Asian, and the latter Southeast Asian. My own proportions are more balanced, but there might be some noise in there. That being said, from what I have read of Southeast Asia it is highly likely that Burmese ethnicities will be between the Cambodians and Vietnamese in proportions. The Burmans were more shaped by the indigenous Mon-Khmer people than the Vietnamese were, though like the Vietnamese they seem to hail from southern China. My family is traditionally from eastern Bengal, and has been at various points the subjects of the kingdom of Tripura.

Here’s the Dodecad Indians, HapMap Gujaratis, and Behar et al. North Kannadans. The orange is Asian. Can you tell which one I am?

I’m pretty sure I’m second from the left. Not only am I atypically Asian, but I often show trace levels of other ancestral components in Dienekes’ ADMIXTURE results (I suspect this is due to a rather cosmopolitan great-grandfather who was from Delhi). In any case, so far I’ve had pretty general pointers of what’s going on here. Unlike most people who get this stuff done, I found something interesting, though not too surprising (more on this later). But I ran into something which makes the case for specific Burmese origin even stronger.

I got involved in BGA thanks to the urging of my friend Paul. Recently David of BGA sent me matches for various HGDP and other populations, as well as his own project samples, for extended haplotype blocks on the first 3 chromosomes. These are long stretches of correlated markers which haven’t been broken apart by recombination. They may be indications of recent common ancestry between two individuals who share regions of genomic affinity. I decided to look at my matches. Also, there are several other South Asian individuals in the project. I don’t know who they are, but it’s clear they’re South Asian. I was curious to compare myself to them in terms of my matches.

First, I removed all the project samples. So basically I limited it to populations whose names I could perceive easily. Then I limited it to blocks of at least 100 bases. Below are the numbers of hits by population, in descending order. I matched some Pathans more than others, but I threw them all into a big pool. You can see some of the Genomes Unzipped guys, and Ilana Fisher & Kate Morley too.

Group Razib # hits
Pathan 72
Burusho 58
Hazara 47
Han 26
Naxi 19
North Russian 18
She 18
Yizu 16
Buryat 15
Miaozu 14
Chuvash 13
Adygei 12
Altai 12
Uygur 12
Tuva 11
Tujia 9
Mongol 8
Xibo 8
Daur 7
Yakut 7
Athabask 5
Evenk 5
Oroqen 5
Hezhen 4
Ilana Fisher 4
Maya 3
Chukchi 2
Ket 2
Nganassan 2
Vincent Plagnol 2
Daniel MacArthur 1
Joe Pickrell 1
Kate Morley 1
Komi 1
Luke Jostins 1

The raw number of hits isn’t really that informative. But obviously I’m going to be more open about my data than other people’s. So here’s what I did: I took the # of hits that I had for these populations, and calculated the ratio with the mean for all other South Asians. For example, the mean South Asian # of hits for Pathans without me was 75. I was 72. So my ratio was 0.96. Here’s a table of the higher hit # groups for me, and my ratio with the South Asian mean:

Group Me vs. Mean
Pathan 0.96
Burusho 0.69
Hazara 1.77
Han 2.04
Naxi 10.86
North Russian 0.77
She 5.4
Yizu 4.27
Buryat 3
Miaozu 2.8
Chuvash 0.85
Adygei 0.52
Altai 1.71
Uygur 1.09
Tuva 2.2
Tujia 2.4
Mongol 2.13
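The ratio in the table is simple division, sketched here in Python with the two cases for which the text gives both numbers: the Pathan mean of 75 stated above, and the Naxi mean of 1.75 quoted in the next paragraph. The function name is made up for the example.

```python
def ratio_to_mean(my_hits, mean_hits_others):
    """My number of hits divided by the mean for the other South Asians."""
    return round(my_hits / mean_hits_others, 2)


print("Pathan:", ratio_to_mean(72, 75))
print("Naxi:", ratio_to_mean(19, 1.75))
```

A ratio near 1 means I look like the other South Asians for that population, while a ratio far above 1 flags an excess of shared haplotype blocks.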

I bolded what I noticed to be way out of the norm. The mean number of hits for South Asians with the Naxi, aside from me, was 1.75. I had 19. The She and Yizu were also atypical. These are three HGDP populations. The Naxi and Yizu reside predominantly in Yunnan. The She are based further east in southern China. The connection with my East Asian ancestry seems pretty straightforward then. Here’s a section on the origins of the Bamars in Wikipedia, the dominant group in Burma: “They migrated from the present day Yunnan in China into the Ayeyarwady river valley in Upper Burma about 1200–1500 years ago. Over the last millennium, they have largely replaced/absorbed the Mon and the earlier Pyu, ethnic groups that originally dominated the Ayeyarwady valley.” I believe that ~1/6th of my ancestry had something to do with the massive Völkerwanderung which brought the various ethnic groups dominant in Burma, Thailand, Laos and Vietnam to their current locations. Only Cambodia managed to maintain a native elite culture, though the modern Khmer polity was being crushed between Thailand and Vietnam when the French colonial authorities froze the current borders roughly in place.

~1/6th is not trivial, but it isn’t quite one grandparent either. So how did I come by this? It could just be a natural component of the eastern Bengali genetic landscape. I wasn’t too surprised by the results because so many of my extended family members do resemble people from Southeast Asia. I myself have a few characteristics which are not typical for South Asians (e.g., very little body hair). But I have reason to suspect that there might be some recent admixture. All my grandparents were Muslim, but I know the original Hindu caste origins of two of them (one of them was from a family who converted when she was an infant). They were unlikely to have had recent admixture. My maternal grandmother was only half-Bengali in terms of recent ancestry. But her father was from Delhi, from that city’s Muslim elite. That leaves my paternal grandfather, who died just before I was born. No photos of him were ever taken, but I was always told that his physical appearance was not typical for a Bengali. He was tall and relatively light in complexion. His title as a Khan came down through his paternal lineage, and was a legacy of the late Mughal (really, de facto post-Mughal by then from what I can gather) period. My paternal Y chromosomal lineage is R1a1a*, not an eastern haplogroup at all. I had always assumed that, like my great-grandfather on my mother’s side, my paternal grandfather was from a mixed-heritage Muslim upper-class family. I now strongly suspect that his background was more exotic than my family has let on, or at least more than they suspected.

It turns out that his family may have come to Bengal from Assam, to the northeast. Assam has an even stronger Tibeto-Burman presence than eastern Bengal. My paternal grandfather may have been from a family which mixed with the Tibeto-Burman tribal people in Assam, some of whom converted to Islam and assimilated to a Bengali identity. I find this rather interesting, and am curious as to the omission. My own personal experience discussing the eastern element of my ancestry is that many South Asians are just confused by the whole idea. Persian or Scythian ancestors they can grok. Burmese, not so much.

I will know more soon. I have had my parents typed, and will be able to ascertain if the eastern element is more from my father as I suspect, or whether it is from both parents. If the latter is the case, then there need be no exotic story. Rather, eastern Bengal is simply on the clinal continuum of allele frequencies which differentiate South Asia from Southeast Asia.

• Category: Science • Tags: BGA, Dodecad, Genetics, Genomics, Personal Genomics 

In the open thread someone asked: “Any recent stuff on the genetics of Ethiopians.” That prompted me to look around, because I’m curious too. Poking around Wikipedia I couldn’t find anything recent. A lot of the studies are older uniparental lineage based works (NRY and mtDNA). Ethiopia is interesting because unlike almost all other Sub-Saharan African nations it has a long written history. Culturally and linguistically it has both Sub-Saharan African, and non-Sub-Saharan African, affinities. The languages of highland Ethiopia are clearly Semitic. Those of lowland Ethiopia are Cushitic, a branch of the broader Afro-Asiatic language family concentrated around the Horn of Africa (Somali is a Cushitic language, though most Ethiopian nationals who speak a Cushitic dialect are of the Oromo group).

From a human evolutionary genetic perspective, Ethiopia also has specific interest. It is likely that the main recent pulse of humans Out of Africa traversed this region. Additionally, there is some evidence of deep time connections between the groups ancestral to Ethiopians and the Khoisan of southern Africa. It may be that Ethiopians and Khoisan are reservoirs of ancient genetic variation in Sub-Saharan Africa which has been overlain by Bantu in most other regions outside of West Africa. Finally, Ethiopians are known to have high altitude adaptations. This could be due to long term residence in the region, or, assimilation of favorable alleles from the long term residents by later populations.

Fortunately we can get a sense of the genetic affinities of Ethiopians thanks to a paper published last spring, The genome-wide structure of the Jewish people. The focus was clearly on Jews, but they surveyed Amhara & Tigray (Semitic speaking highlanders), Ethiopian Jews (similar ethnically to the Amhara & Tigray, but religiously non-Christian), and Oromo. In the PCA the Oromo and Semitic speaking populations are pretty obviously distinct clusters.

This just means that when you take worldwide genetic variation, and pull out the biggest independent dimensions, and then visualize individuals on the two largest dimensions in terms of how they explain variance, the Oromo and other Ethiopians don’t really intersect. Interestingly the Amhara and Tigray are almost indistinguishable, but the Ethiopian Jews are in their own cluster. There are, for the record, 7 Oromo, 7 Amhara, 5 Tigray, and 13 Ethiopian Jews in the sample.

Now let’s look at the genetic variation in ADMIXTURE. Remember this assigns the genomes of individuals in proportions to K ancestral units. As an example, if you had African Americans, Yoruba, and White Americans, in a total pool, and did K = 2, you might have a tendency where Yoruba and White Americans are in two totally different ancestral populations of K, while African Americans are 80% in one ancestry and 20% in another. The interpretation of this is straightforward, but when it comes to populations whose backgrounds we don’t know as well, one should be careful. The selection of a particular value for K is going to be really important, and we shouldn’t confuse the method with the reality which the method is trying to plumb.
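The K = 2 example can be made concrete with a minimal Python sketch of the mixture idea behind this kind of analysis: an individual's expected allele frequency at a SNP is the ancestry-weighted average of the ancestral populations' frequencies. The ancestral frequencies below (0.9 and 0.1) are made-up values for a single hypothetical SNP; only the 80%/20% split comes from the example above.

```python
def expected_allele_frequency(proportions, ancestral_freqs):
    """Weighted average of ancestral allele frequencies for one individual."""
    if abs(sum(proportions) - 1.0) > 1e-6:
        raise ValueError("ancestry proportions should sum to 1")
    return sum(q * f for q, f in zip(proportions, ancestral_freqs))


# Hypothetical allele frequencies at one SNP in the two ancestral units.
ancestral = (0.9, 0.1)

individuals = {
    "Yoruba": (1.0, 0.0),
    "White American": (0.0, 1.0),
    "African American": (0.8, 0.2),
}
for name, q in individuals.items():
    freq = expected_allele_frequency(q, ancestral)
    # Two allele copies per person, so the expected count is twice the frequency.
    print(name, round(freq, 2), "expected copies:", round(2 * freq, 2))
```

The method works in the other direction: it searches for the proportions and ancestral frequencies that best explain the observed genotypes across all SNPs at once, which is why the choice of K shapes what comes out.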

First, K = 8 from Behar et al. I’ve reedited to highlight populations which might inform the variation of Ethiopians.

Now let’s look at a series of K’s. Note the changes.

Luckily for us, we don’t need to stop here. Dienekes included Behar’s Ethiopians (non-Jews) for Dodecad. Additionally, he included the Masai population from the HapMap. This turns out to be important because he found that Ethiopian Sub-Saharan ancestry is similar to that of the Masai, not the other African groups.

Dienekes also provided individual outputs. I’ve stitched together Ethiopians with Egyptians and Saudis. The color coding is the same as above.

You should be able to tell where the three groups start and stop pretty easily. I’m 99% sure that the six individuals with more East African and less Southwest Asian ancestry are all Oromo. Ethiopians, in particular highland Ethiopians, seem to me likely an ancient stabilized hybrid population between a population from Arabia, and a local Sub-Saharan population. This population seems unlikely to have been related to the peoples of West-Central Africa, who are associated with the Bantus across eastern and southern Africa. The Bantu agricultural toolkit runs into ecological constraints in various regions, and it is in those regions that non-Bantu populations have persisted. Ethiopia, with its unique climate and topography, naturally remains non-Bantu (as well as the Horn of Africa as a whole). The possible connections between Khoisan and Ethiopia may be a function of the fact that these areas harbor genetic variants which have disappeared in the intervening regions because of the Bantu expansion. I have a hard time accepting that the Bantu expansion was particularly eliminationist, but I am starting to suspect that outside of Ethiopia population densities were very, very, low.

The antiquity of this ancient hybridization event to me is attested by the fact that Ethiopians lack any of the other Middle Eastern components besides the one modal in Saudi Arabia. There is a great deal of intra-population variance in the Saudi data set. Why? Part of this must be the slave trade, as well as pilgrims who remained in places like Mecca. But, I think part of the untold story here is that there may have been a larger genetic impact on Arabia after the rise of Islam from the Levant than vice versa! Probably the gene flow precedes Islam, as Arabia was hooked into worldwide trade and population movements, which Ethiopia was relatively insulated from. The Saudi data set has several people who are “pure” Southwest Asian, but also several who have a great deal of West Asian + South European. These seem likely to be people who have some background in the Fertile Crescent.


Over at his blog Dienekes Pontikos has taken some public data sets and his own Dodecad samples and generated a massive MDS plot of West Eurasian populations. The MDS is fine as it goes. It illustrates clearly that when you visualize an individual on a plot defined by the two largest dimensions of variation in the total data set, clusters naturally emerge. Some of them are totally expected. For example, the cluster of Ashkenazi Jews. But some of the relationships need to be interpreted with care. The similar position of Sicilians with Ashkenazi Jews does not mean that these two populations are identical. Rather, their ancestral components exhibit similarities in such a manner that in a representation constrained to a few dimensions they shake out similarly. You can view the full thumbnail by clicking it, but I thought that for purposes of intuitive comprehension it would be useful to “cut out” the outlines of the distributions, and label them by geography. I added Ashkenazi Jews because I thought readers would be interested, but omitted the other Jewish groups.

• Category: Science • Tags: Dodecad, Genetics, Genomics, Human Genetics 

To the left you see a zoom in of a PCA which Dienekes produced for a post, Structure in West Asian Indo-European groups. The focus of the post is the peculiar genetic relationship of Kurds, an Iranian-speaking people, with Iranians proper, as well as Armenians (Indo-European) and Turks (not Indo-European). As you can see in some ways the Kurds seem to be the outgroup population, and the correspondence between linguistic and genetic affinity is difficult to interpret. For those of you interested in historical population genetics this shouldn’t be that surprising. West Asia is characterized by endogamy, language shift, and a great deal of sub- and supra-national communal identity (in fact, national identity is often perceived to be weak here). A paper from the mid-2000s already suggested that western and eastern Iran were genetically very distinctive, perhaps due to the simple fact of geography: central Iran is extremely arid and relatively unpopulated in relation to the peripheries.

But this post isn’t about Kurds. Rather, observe the very close relationship between Turks and Armenians on the PCA. The _D denotes Dodecad samples, those which Dienekes himself has collected. This affinity could easily be predicted by the basic parameters of physical geography. Armenians and Anatolian Turks were neighbors for nearly 1,000 years. Below is a map which shows the expanse of the ancient kingdom of Armenia:

Historic Armenia was centered around lake Van in what is today eastern Turkey. The modern Republic of Armenia is very much a rump, and an artifact of the historic expansion of the Russian Empire in the Caucasus at the expense of the Ottomans and Persians. Were it not for the Armenian genocide there may today have been more Armenians resident in Turkey than in the modern nation-state of Armenia,* just as there are more Azeri Turks in Iran than in Azerbaijan. Many areas once occupied by Armenians are now occupied by Kurds and Turks. But a bigger question is the ethnogenesis of the Anatolian Turkish population over the past 1,000 years.

Dienekes has already shed light on this topic earlier, adding the Greek and Cypriot populations to the mix as well as Turks and Armenians. The disjunction between Kurds and the Armenian-Turk clade suggests to us that Turks did not emerge out of the milieu of Iranian tribes in the uplands of southeast Anatolia and western Persia. Like the Armenians the Kurds are an antique population, claiming descent from the Medes, and referred to as Isaurians during the Roman and Byzantine period.

Below is a reformatted K = 15 run of ADMIXTURE with Eurasian population. I’ve removed the labels for the ancestral components, but included in populations which have a high fraction of a given ancestral component. The geographical labels are for obscure populations. I’ve underlined the four populations of interest:

First, let’s get out of the way the fact that Turkish samples have non-trivial, though minor, northeast Asian ancestry. The Yakut themselves are a Turkic group situated to the north of Mongolia. The more southerly and central Asian affinities are ones which the nomadic ancestors of the Anatolian Turks may have picked up in their sojourns over the centuries between their original homeland in east-central Siberia and Mongolia and West Asia. The rest of their ancestry is rather typical of northern West Asian groups. In particular, Armenians! Here is the ancestral breakdown for the four groups I want to focus on using Dienekes’ labels:

Population            Greek   Cypriots   Turks   Armenians
West Asian            37.6    54.1       47.2    56.3
Central-South Asian   5.3     8.6        18.2    18.4
North European        25.1    5.6        12      12.3
South European        27.4    20.8       9.4     8.4
Arabian               3.4     8          4.3     3.4
Altaic                0.3     0          2.6     0.1
East Asian            0.3     0.2        2.2     0
Central Siberian      0.1     0.2        1.4     0.2
Chukchi               0       0          1.1     0.2
South Indian          0       0.1        0.8     0.3
Nganasan              0.1     0          0.4     0.2
Koryak                0.1     0          0.2     0.1
East African          0       0.4        0.1     0
West African          0       0          0.1     0
Northwest African     0.3     1.9        0.1     0

And now the correlations between the populations by ancestral components:

            Greek   Cypriots   Turks   Armenians
Greek       *       0.863      0.823   0.813
Cypriots    *       *          0.941   0.946
Turks       *       *          *       0.997
Armenians   *       *          *       *
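For readers who want to check the arithmetic, the sketch below recomputes these correlations in Python. It assumes the figures are plain Pearson correlations taken across the 15 ancestral components, treating each population's column of percentages in the table above as one vector. That is my reading of "correlations between the populations by ancestral components", and the helper name `pearson` is just for illustration.

```python
import math
from itertools import combinations

# Percentages copied from the K = 15 table, in row order
# (West Asian first, Northwest African last).
components = {
    "Greek":     [37.6, 5.3, 25.1, 27.4, 3.4, 0.3, 0.3, 0.1, 0, 0, 0.1, 0.1, 0, 0, 0.3],
    "Cypriots":  [54.1, 8.6, 5.6, 20.8, 8, 0, 0.2, 0.2, 0, 0.1, 0, 0, 0.4, 0, 1.9],
    "Turks":     [47.2, 18.2, 12, 9.4, 4.3, 2.6, 2.2, 1.4, 1.1, 0.8, 0.4, 0.2, 0.1, 0.1, 0.1],
    "Armenians": [56.3, 18.4, 12.3, 8.4, 3.4, 0.1, 0, 0.2, 0.2, 0.3, 0.2, 0.1, 0, 0, 0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

correlations = {
    (a, b): pearson(components[a], components[b])
    for a, b in combinations(components, 2)
}

for (a, b), r in correlations.items():
    print(f"{a:>9} vs {b:<9} r = {r:.3f}")
```

One caveat: the many components sitting near zero in all four populations pull every pair toward a high correlation, so the absolute values matter less than the ordering of the pairs.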

Let’s remove the East Eurasian and African components, and recalculate the proportions by taking what remains as the denominator:

Population            Greek   Cypriots   Turks   Armenians
West Asian            38.1    55.7       51.8    57.0
Central-South Asian   5.4     8.9        20.0    18.6
North European        25.4    5.8        13.2    12.4
South European        27.7    21.4       10.3    8.5
Arabian               3.4     8.2        4.7     3.4

And the recomputed correlations:

            Greek   Cypriots   Turks   Armenians
Greek       *       0.747      0.640   0.647
Cypriots    *       *          0.901   0.908
Turks       *       *          *       0.999
Armenians   *       *          *       *

With all the ~0 ancestral components which were common across these four populations removed, the correlations have gone down. The exception is the Armenian-Turk pair, because I’ve removed the ancestries which differentiate them.
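The renormalization step can be sketched the same way. The Python below keeps only the five West Eurasian components from the original K = 15 figures, rescales each population so the five sum to 100, and recomputes the correlations. As before, it assumes the correlations are plain Pearson coefficients across components, and the names `retained`, `renormalize` and `pearson` are illustrative.

```python
import math

# Original percentages for the five retained components, in the order:
# West Asian, Central-South Asian, North European, South European, Arabian.
retained = {
    "Greek":     [37.6, 5.3, 25.1, 27.4, 3.4],
    "Cypriots":  [54.1, 8.6, 5.6, 20.8, 8.0],
    "Turks":     [47.2, 18.2, 12.0, 9.4, 4.3],
    "Armenians": [56.3, 18.4, 12.3, 8.4, 3.4],
}

def renormalize(values):
    """Rescale so the retained components sum to 100%."""
    total = sum(values)
    return [100.0 * v / total for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

renormalized = {pop: renormalize(vals) for pop, vals in retained.items()}

for pop, vals in renormalized.items():
    print(pop, [round(v, 1) for v in vals])

r_turk_armenian = pearson(renormalized["Turks"], renormalized["Armenians"])
r_greek_turk = pearson(renormalized["Greek"], renormalized["Turks"])
print(f"Turks vs Armenians: {r_turk_armenian:.3f}")
print(f"Greek vs Turks:     {r_greek_turk:.3f}")
```

Note that Pearson correlation is unchanged by rescaling a column, so the drop in the other correlations comes entirely from discarding the shared near-zero components, not from the renormalization itself.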

So what’s a plausible interpretation? A straightforward one would be that the Muslim Turk population of Anatolia has a strong bias toward having been assimilated Armenians, rather than Greeks. The cultural plasticity of Armenians in late antiquity and the early medieval period was clear: individuals of ethnic Armenian origin rose to the pinnacles of the status hierarchy of the Orthodox Christian Greek Byzantine Empire. The Macedonian dynasty of the Byzantines, under which the civilization reached its mature peak, was descended from Armenians who had resettled in Macedonia. Just as plausible to me is that eastern Anatolia as a whole exhibited little genetic difference between Greeks and Armenians, and the former were wholly assimilated or migrated, while the Armenians remained. One way to test this thesis would be to type the descendants of Greeks who left eastern Anatolia during the population exchange between Greece and Turkey in the 1920s. But the difference between Greeks and Cypriots also points us to another possibility: perhaps the Greeks of Greece proper (as opposed to Anatolia) were much more strongly impacted by the arrival of Slavs? One need not necessarily rely solely on the Sclaveni migrations either. Water tends to be a major dampener to conventional isolation-by-distance gene flow, so the Greek mainland may always have been subject to more influence from the lands to the north.

Whatever the details of ethnogenesis may be, it will be interesting to see how things shake out as we increase sample sizes and get better population coverage. These results may be due to regional selection bias. One might expect that the descendants of Rumelian Turks would be more “European” than Anatolian Turks. But these data do seem to suggest at face value that Armenians are the population with which Anatolian Turks have the most genetic affinity.

* My main hesitation would be that Armenians are a very mobile population, and their numbers within a modern Turkey may have declined simply through emigration, just as those of Christian Arabs have over the 20th century.


School girls in Hunza, Pakistan

A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.


Repeated runs and higher K’s make it clear that the French Basque lack a “West Asian” aspect which other French, and Iberians as well, have. Some of this is clear in the paper I referenced above as well…the key is you have to look at the supplements at K = 6. Because the Basque are the only native non-Indo-European speakers in Western Europe, their origin and relationship to nearby populations has always been of interest (they also have the highest Rh- frequency of world populations). Granted, the French Basque are very similar genetically to the French as a whole. But, it is obviously highly informative that they lack an ancestral component in totality which seems to exist at low but consistent levels across Western European populations. The only other European population at K = 15 who lack the West Asian component in totality are Finns (the Lithuanians come very close).

This is all preamble to a discussion of a post Dienekes put up today, A solution to the problem of Indo-Aryan origins. Remember that Dienekes has been “playing” with ADMIXTURE for only a few months. To claim to have found a ‘solution’ to a problem as intellectually and politically intractable and explosive as this is rather bold. The crux of the matter is that at certain confluences of K’s and population sets Dienekes has discovered a distinctive signature of ancestry which seems to be modal on the north slope of the Caucasus, and spans India and Europe. He terms this “Dagestani,” due to the fact that among a population sample from this province in Russia this ancestral component is overwhelmingly dominant. The patterns of Dagestani admixture in Europe and India are curious and suggestive.

1 – In Europe the frequencies are low, but irregularly distributed (excepting around the North Caucasus). Scandinavians and British have appreciable fractions, Finns and Southern Europeans do not. Here’s Dienekes:

Interpreting this pattern is not easy, but it does seem that this component seems to have a V-like distribution, achieving its maximum in Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison seems to have a more normal /\-like distribution around its center in Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

2 – South Indian Brahmins have appreciable fractions, but non-Brahmins in the same region do not. In contrast, those who come from Indo-Aryan speaking backgrounds do seem to have Dagestani ancestral components, irrespective of other aspects of ancestry. For example Pakistanis don’t have that much more Dagestani than South Indian Brahmins or Gujaratis. Also compare the relatively narrow window of Dagestani ancestry variance among Dodecad South Asians (I’m DOD075). DOD088 is from what I recall a Reddy from Andhra Pradesh, a non-Brahmin but non-low caste. It is interesting that they have a high proportion of “Pakistan,” but no Dagestani. I have ~10% Dagestani, but no Pakistani.

Below is K = 10 for a selection of populations. Dienekes has now included two non-Indo-European speaking Pakistani populations: the Brahui (Dravidian) and Burusho (linguistic isolate in the mountains of Pakistan):

Some general patterns are evident. The light blue is indicative of generic “Indian” ancestry. It is not found in appreciable proportions outside of subcontinental populations (or those of recent subcontinental origin). The same with the red, and light orange. For your reference the dark orange is a “Northern European” component, modal in Lithuania. The light and dark green are both East Asian components. The dark blue is a “West Asian” component modal in Georgia, and prominent across Europe, declining as a function of distance from the eastern shore of the Black Sea (this is surely the West Asian which distinguishes the French from the French Basque). I believe that the light purple dominant in the Brahui and the light red dominant in the Burusho probably form as a compound the aforementioned Pakistani component. The dark purple is the Dagestani.

First, a word on the Brahui. These are a group of tribes who reside in northern Balochistan in Pakistan. A small number are even to be found in Afghanistan. Historically they have had close relations with the Baloch, an Iranian speaking cluster of tribes who totally envelop the Brahui. The Brahui do speak a Dravidian language, of a family dominant in South India and found in isolated regions of Central and Eastern India. There are two broad models for the existence of a Dravidian language in Pakistan. The first is that the Brahui are remnants of more widely spoken Dravidian languages which date back to the Indus Valley civilization. The second is that the Brahui arrived during the medieval period from another region of South Asia where Dravidian languages were more common. Assuming either model, it has long been presumed that their envelopment by the Baloch has had a strong impact on the Brahui genetically; the two groups are very close. This is evident in Dienekes’ results as well. But observe that the Baloch are the group which seems more cosmopolitan in ancestry than the Brahui. If the Brahui were Dravidians from deep in India it seems that they would have a greater residual component of India-specific ancestry (light blue and orange). This is not so. In fact the Baloch have more of the Indian ancestral component than the Brahui. The Brahui component is found across Pakistan, and into India, albeit at lower proportions. Naturally, the Baloch have the second highest fraction. I believe these results should shift us toward the position that the Brahui are indigenous in relation to the Baloch, and that the Baloch ethnic identity emerged through the shift of a Brahui substrate, as evidenced by the greater cosmopolitanism of the Baloch. Additionally, Dienekes observes that the Brahui have a lower proportion of the Dagestani component than most other Pakistani groups, and several Indo-Aryan groups in India proper.

The Burusho are even more interesting than the Brahui. Unlike the Brahui the Burusho are very isolated in the mountainous fastness of Baltistan in northern Pakistan. Additionally, their language, Burushaski, is a linguistic isolate. Others of the class are Basque and Sumerian. In general it is assumed that linguistic isolates were once part of broader families of languages which have gone extinct. Burushaski probably persists in large part because of the geography which its speakers inhabit. Mountainous areas often preserve ethnic and linguistic diversity because the terrain allows for the persistence of local variety. I believe it is plausible that the Burusho have been far more isolated than the Brahui. This seems to show up in the ADMIXTURE plot: the Burusho have a greater proportion of their modal ancestral component than the Brahui. Additionally, the Burusho have an even smaller component of Dagestani than the Brahui.

Below is a chart Dienekes constructed ordered by proportion of Dagestani for his South Asian populations. Next to it I’ve placed a chart from a PCA which has some of the same population samples. Compare & contrast:


The PCA is looking at between-population variation in totality. So naturally the Dagestani component isn’t going to be predictive of that. Rather, it speaks to the possibility which Dienekes is mooting: that the Dagestani component spread in the Indian subcontinent with the Indo-Aryans specifically, overlying the local resident substrate. In South India this meant that Brahmins brought this, mixing with the indigenous Dravidian population. In Pakistan the Indo-Aryans, and Iranians, were overlain on a substrate which were the ancestors of the Burusho and Brahui. The dominant signal of genetic relationship has to do with the substrate, not the Indo-Aryans. So that’s what’s going to show up on the PCA. In other PCA plots the model where South Indian Brahmins are a linear combination of a Pakistani-like population and a Dravidian population becomes clearer. But when you look at ancestry using something like ADMIXTURE you have the potential to tease apart different components, and so uncover relationships which may have been obscured when looking at aggregate variation.

Dienekes’ model seems to posit three steps in rapid succession ~4,000 years ago. A background variable which must be mentioned is that one must account for the Mitanni, a dominant Syrian power circa 1500 BC where a non-Indo-European language was the lingua franca, and yet a definite Indo-Aryan element existed within the elite. Indo-Aryan specifically because the Indo-European element within the Mitanni was not Iranian, but specifically Indo-Aryan. An easy explanation for this is that the Indo-Aryan component of the Indo-Iranian branch of the Indo-European languages crystallized outside South Asia, and independently reached Syria and India. In Syria it went extinct, while in India it obviously did not. By Dienekes’ model the Mitanni would be rather closer to the urheimat of the Indo-Aryans.

An aspect of his model which I do not understand is why it has to be Indo-Aryan, instead of Indo-Iranian. The South Asian population in which the Dagestani component is modal, the Pathans, are Iranian, not Indo-Aryan. Additionally, this model seems to not speak in detail to the existence of the Dagestani element among Europeans. Here is a sorting of European populations (with Iranians included) by the Dagestani component:

Population Dagestan
Urkarah 93
Lezgins 47.9
Stalskoe 38.7
Adygei 16.4
Orcadian (Orkney) 12.6
Georgians 12.4
White_Utahns 11.2
Iranian 10.9
Scandinavian_D 10.2
Armenian_D 9.9
German_D 9.1
Turks 8.8
Armenians 8.4
French 7.9
Hungarians 7.5
Russian_D 6.3
Spanish_D 4.6
North_Italian 4.5
Spaniards 4.4
Romanian 4.1
Finnish_D 4.1
Russian 4
Greek_D 3.8
Portuguese_D 3.6
Tuscan 3.5
Tuscans 3.4
Lithuanians 2.9
S_Italian_Sicilian_D 2.8
Belorussian 2.5
Cypriots 2
Sardinian 1.5
French_Basque 0.7

There is here a strange pattern of rapid drop off from the Caucasus, and a bounce back very far away, on the margins of Germanic Northwestern Europe. This to me indicates some sort of leapfrog dynamic. A well known illustration of this would be the Ugric languages. The existence of Hungarian on what was Roman Pannonia is a function of the mobility and power of Magyar horsemen, and their cultural domination over the Romance and Slavic speaking peasantry (their genetic impact seems to have been slight). No one believes that Germanic languages are closely related to Indo-Aryan (rather, if there is structure in Indo-European beyond Indo-Iranian, Celtic, etc., it would place the Indo-Iranian languages with Slavic). So what’s going on? I think perhaps the Dagestani component is in part a reflection of the common Indo-European origin in that region. For whatever reason that signal is diminished in much of the rest of Europe. Perhaps Southern Europe was much more densely populated when the Indo-Europeans arrived. Additionally, it seems highly likely that in places like Sardinia, much of Spain, and Cyprus, Indo-European speech came through cultural diffusion (elite emulation) and not population movement. Or perhaps we’re seeing the vague shadows of population admixtures on the Pontic steppe, where distinct Germanic and Indo-Iranian confederations admixed with a common North Caucasian substrate.

Going back to India, let’s revisit the model of a two-way admixture between “Ancestral North Indians,” who were genetically similar to Europeans and West Asians, and “Ancestral South Indians,” who were closer to, but not very close to, East Eurasians. The ANI & ASI. The ASI were probably one of the ancient populations along the fringe of southern Eurasia, all of whom have been submerged by demographic movements from other parts of Eurasia over the past 10,000 years, excepting a few groups such as the Andaman Islanders and some Southeast Asian tribes. The model was admittedly a simplification. But taking that model as a given, and accepting that the Dagestani element is indeed Indo-Aryan, we can infer that the ANI were not Indo-European. It is notable that the South Indian Brahmins have elevated fractions of both the Brahui and Burusho modal components. This is probably indicative of admixture of the Indo-Aryan element in the Indus Valley, prior to their expansion to other parts of India. I assume one of the languages spoken was Dravidian, though if ancient Mesopotamia was linguistically polyglot at the dawn of history I would not be surprised if the much more geographically expansive Indus Valley civilization was as well.

Aishwarya Rai

The irony is that today when someone refers to a “Dravidian” physical type, they’re not talking about someone who looks like a Pakistani. They’re talking about someone who looks South Indian, where most Dravidian languages are spoken. But combining the inference from Dienekes’ model and the previous two-way admixture model, you reach the conclusion that lighter skin and more West Asian features among South Asians may be more due to Dravidian-speaking ancestors in the Indus Valley, not Indo-Aryans! It goes to show the wisdom of differentiating linguistic classes from biological ones when discussing historical population genetics. Unfortunately it is a wisdom which most of us interested in these topics do not show, alas.

As I like to say, interesting times….

Note: If you leave a comment, please don’t be smarter-than-thou in your tone. I have stopped publishing those sorts of comments because the reality is that most of them have not been that smart or informed. At least by my estimation. If you actually are smarter than the average bear, and impress me with your erudition and analytical clarity, I’ll probably let your comment through no matter your attitude. But I wouldn’t bet on it if I were you, so show some class and humility. Most of us are muddling through.

Image Credit: Georges Biard, iStockPhoto


Nature profiles Dodecad, the Pickrell Affair, and the emergence of amateur genomicists in a new piece. Interestingly David of BGA is going to try and get something through peer review. In particular, the relationship of Assyrians and Jews.

So we have Genomes Unzipped, Dodecad, and BGA. What next? Who next? I hope Dienekes doesn’t mind if I divulge the fact that the computational resources needed to utilize ADMIXTURE as he has are within the theoretical capability of everyone reading this post. Rather, the key is getting familiar with PLINK and writing some code to merge data sets. After you do that, to really add value you’d probably want to get raw data from more than what you can find in the HGDP, HapMap and other public resources.
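To give a flavor of the "writing some code to merge data sets" step, here is a small Python sketch of one piece of it: finding the SNPs two data sets share. It assumes both are in PLINK's binary fileset format, where the .bim file lists one variant per line as chromosome, variant ID, genetic distance, base-pair position, and two alleles. The toy records and rs IDs below are made up, and this is not necessarily how Dienekes or David do it. A real merge also has to deal with strand flips and differing genome builds, which this ignores.

```python
import io

def read_bim(handle):
    """Map variant ID -> (chromosome, position, allele set) from .bim-style lines."""
    snps = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        chrom, snp_id, _cm, pos, a1, a2 = fields
        snps[snp_id] = (chrom, int(pos), frozenset((a1, a2)))
    return snps

def shared_snps(bim_a, bim_b):
    """IDs present in both data sets with matching chromosome, position and alleles."""
    return sorted(s for s in bim_a if s in bim_b and bim_a[s] == bim_b[s])

# Hypothetical toy data standing in for two real .bim files.
dataset_a = io.StringIO(
    "1 rs1 0 1000 A G\n"
    "1 rs2 0 2000 C T\n"
    "2 rs3 0 3000 G A\n"
)
dataset_b = io.StringIO(
    "1 rs1 0 1000 G A\n"   # same alleles, listed in the other order
    "1 rs2 0 2000 A G\n"   # allele mismatch, so it is excluded
    "2 rs3 0 3000 A G\n"
    "2 rs4 0 4000 C T\n"   # absent from data set A
)

common = shared_snps(read_bim(dataset_a), read_bim(dataset_b))
print(common)
```

The resulting ID list could be written to a file and handed to PLINK's --extract option to subset each data set before merging them with --bmerge.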

But here I make an open offer: if you start a blog or a project which replicates the methods of Dodecad and BGA I’ll link to you and promote you. When Dienekes began Dodecad I actually started to play around with the data sets in ADMIXTURE, but I’ve personally held off until seeing what he and David find. What their pitfalls and successes might be. Here’s to 2011 being more interesting than we can imagine!

Update: Already had a friend with a computational background contact me about doing something on South Asian genomics. So again: if you get a site/blog set up, and start pumping out plots, I will promote you. In particular, if you need 23andMe raw data files of geographical region X it might be useful to try and get the word out via blogs and what not.


I decided to take the Dodecad ADMIXTURE results at K = 10, and redo some of the bar plots, as well as some scatter plots relating the different ancestral components by population. Don’t try to pick out fine-grained details, see what jumps out in a gestalt fashion. I removed most of the non-European populations to focus on Western Europeans, with a few outgroups for reference.

Here’s a table of the correlations (I bolded the ones I thought were interesting):

             W Asian   NW African   S European   NE Asian   SW Asian   E Asian   N European   W African   E African   S Asian
W Asian      *         -0.01        -0.18        0.04       0.81       0.59      -0.64        0.39        0.2         0.04
NW African   *         *            0.19         -0.16      0.23       -0.09     -0.19        0.26        0.67        -0.11
S European   *         *            *            -0.38      -0.03      -0.27     -0.42        -0.11       -0.02       -0.36
NE Asian     *         *            *            *          -0.06      0.5       0.26         -0.04       -0.1        -0.07
SW Asian     *         *            *            *          *          0.21      -0.62        0.74        0.59        -0.13
E Asian      *         *            *            *          *          *         -0.27        0.08        0           0.14
N European   *         *            *            *          *          *         *            -0.34       -0.28       -0.31
W African    *         *            *            *          *          *         *            *           0.86        -0.04
E African    *         *            *            *          *          *         *            *           *           -0.07


After linking to Marnie Dunsmore’s blog on the Neolithic expansion, and reading Peter Bellwood’s First Farmers, I’ve been thinking a bit on how we might integrate some models of the rise and spread of agriculture with the new genomic findings. Bellwood’s thesis basically seems to be that the contemporary world pattern of expansive macro-language families (e.g., Indo-European, Sino-Tibetan, Afro-Asiatic, etc.) is a shadow of the rapid demographic expansions in prehistory of farmers. In particular, hoe-farmers rapidly pushing into virgin lands. First Farmers was published in 2005, and so it had access mostly to mtDNA and Y chromosomal studies. Today we have a richer data set, from hundreds of thousands of markers per person, to mtDNA and Y chromosomal results from ancient DNA. I would argue that the new findings tend to reinforce the plausibility of Bellwood’s thesis somewhat.

The primary datum I want to enter into the record in this post, which was news to me, is this: the island of Cyprus seems to have been first settled (at least in anything but trivial numbers) by Neolithic populations from mainland Southwest Asia.* In fact, the first farmers in Cyprus perfectly replicated the physical culture of the nearby mainland in toto. This implies that the genetic heritage of modern Cypriots is probably attributable in the whole to expansions of farmers from Southwest Asia. With this in mind let’s look at Dienekes’ Dodecad results at K = 10 for Eurasian populations (I’ve reedited a bit):


Modern Cypriots exhibit genetic signatures which shake out into three putative ancestral groups. West Asian, which is modal in the Caucasus region. South European, modal in Sardinia. And Southwest Asian, which is modal in the Arabian peninsula. Cypriots basically look like Syrians, but with less Southwest Asian, more balance between West Asian and South European, and far less of the minor components of ancestry.

Just because an island was settled by one group of farmers, it does not mean that subsequent invasions or migrations could not have an impact. The indigenous tribes of Taiwan seem to be the original agriculturalists of that island, and after their settlement there were thousands of years of gradual and continuous cultural change in situ. But within the last 300 years settlers from Fujian on the Chinese mainland have demographically overwhelmed the native Taiwanese peoples.

During the Bronze Age it seems Cyprus was part of the Near East political and cultural system. The notional kings of Cyprus had close diplomatic relations with the pharaohs of Egypt. But between the end of the Bronze Age and the Classical Age Cyprus became part of the Greek cultural zone. Despite centuries of Latin and Ottoman rule, it has remained so, albeit with a prominent Turkish minority.

One thing notable about Cyprus, and which distinguishes it from mainland Greece, is the near total absence of a Northern European ancestral component. Therefore we can make the banal inference that Northern Europeans were not initially associated with the demographic expansions of farmers from the Middle East. Rather, I want to focus on the West Asian and Southern European ancestral components. One model for the re-population of Europe after the last Ice Age is that hunter-gatherers expanded from the peninsular “refugia” of Iberia and Italy, later being overlain by expansions of farmers from the Middle East, and perhaps Indo-Europeans from the Pontic steppe. I have a sneaking suspicion though that what we’re seeing among Mediterranean populations are several waves of expansion out of the Near East. I now would offer the tentative hypothesis that the South European ancestral element at K = 10 is a signature of the first wave of farmers which issued out of the Near East. The West Asians were a subsequent wave. I assume that the two groups must correlate to some sort of cultural or technological shift, though I have no hypothesis as to that.

From the above assertions, it is clear that I believe modern Sardinians are descendants of that first wave of farmers, unaffected by later demographic perturbations. I believe that the Basques, then, are a people who emerged from an amalgamation of the same wave of seafaring agriculturalists with the indigenous populations preceding them (the indigenes were likely the descendants of a broad group of northern Eurasians who expanded after the end of the last Ice Age from the aforementioned refugia). These farmers leap-frogged across fertile regions of the Mediterranean and pushed up the valleys of southern France, and out of the Straits of Gibraltar. Interestingly, the Basques lack the West Asian minority element evident in Dienekes’ Spaniards, Portuguese, as well as the HGDP French (even up to K = 15 they don’t shake out as anything but a two way admixture, while the Sardinians show a minor West Asian component). Also, the West Asian and Southern European elements are proportionally several times better represented among Scandinavians than Finns. The Southern European element is not found among the Uyghur, though the Northern European and West Asian ones are. I infer from all these patterns that the Southern European element derived from pre-Indo-European farmers who pushed west from the Near East. It is the second largest component across much of Northwestern Europe, and the largest across much of Southern Europe, including Greece.

A second issue which First Farmers clarified is the difference between the spread of agriculture from the Near East to Europe and to South Asia. It seems that the spread of agriculture across South Asia was more gradual, or at least had a longer pause, than in Europe. A clear West Asian transplanted culture arrived in what is today Pakistan ~9,000 years ago. But it does not seem that the Neolithic arrived in the far south of India until ~4,000 years ago. I think that a period of “incubation” in the northwest part of the subcontinent explains the putative hybridization between “Ancient North Indians” and “Ancient South Indians” described in Reconstructing Indian population history. The high proportion of “Ancestral North Indian,” on the order of ~40%, as well as Y chromosomal markers such as R1a1a, among South Indian tribal populations, is a function of the fact that these groups are themselves secondary amalgamations between shifting cultivators expanding from the Northwest and local resident hunter-gatherer groups, which were related to the ASI whom the original West Asian agriculturalists encountered and assimilated in ancient Pakistan (Pathans are ~25% ASI). I believe that the Dravidian languages arrived from the Northwest to the south of India only within the last 4-5,000 years with the farmers (some of whom may have reverted to facultative hunter-gathering, as is common among tribals). This relatively late arrival of Dravidian speaking groups explains, to my mind, why Sri Lanka has an Indo-European presence; the island was probably only lightly settled by farming Dravidian speakers, if at all, allowing Indo-European speakers from Gujarat and Sindh to leap-frog and quickly replace the native Veddas, who were hunter-gatherers.

Note: Here is K = 15.

* Wikipedia says there were hunter-gatherers, but even here the numbers were likely very small.


Dienekes Pontikos keeps chugging along, and has cranked out a new bar plot from the ADMIXTURE program with 15 putative ancestral components. He has “69 populations, and 1,189 individuals in total.” Most of these were assembled from public data, but some of them are particular to the Dodecad Ancestry Project. He contends:

In comparison to the K=10 analysis, the increased resolution allows us to:

– South Asians belonged primarily to the South Asian and West Asian components; this South Asian component spilt over to Iran and Central Asia. Now, a new Central-South Asian component, corresponding to the Ancestral North Indian of a recent study is inferred, and a corresponding South Indian component.

– HGDP Bedouins and Behar et al. (2010) Saudis take up their own component which I labeled Arabian. This appears to be a subset of the Southwest Asian component of the K=10 analysis

– There are several components in Siberian and Central Asian populations, already discovered in my regional analysis. These are Central Siberian, Nganasan, Koryak, Chukchi, and Altaic which replace the K=10 Northeast Asian component

Not only has he generated a bar plot, but there is a PCA showing the relationship between the 15 ancestral groups, as well as a hierarchical tree. Since he refers to the ANI and ASI of Reich et al., I thought I would note that the South Indian element from Dienekes’ K = 15 is still found in appreciable proportions in the Turkic groups which earlier exhibited the South Asian component. And, on the PCA and phylogenetic tree it still clusters with West Eurasians more than East Eurasians, which is not the case with ASI (or the various Indian mtDNA lineages which coalesce back to a more recent common ancestor with East Eurasians).

The bar plot is below. Of interest are the most “pure” European groups, the Sardinians and Lithuanians. Also, compare Scandinavians and Finns.


• Category: History, Science • Tags: Admixture, Dodecad, Genetics, Historical Genetics 

The figure to the left is a composite merged from two different papers. One analyzes the patterns of genetic variation within African Americans, and the other the patterns within the East Turkic ethnic group, the Uyghurs. The bar plots show the share of ancestry assigned to each of two parent populations: one resembling Europeans and one resembling Africans for the African Americans, and one resembling Europeans and one resembling East Asians for the Uyghurs. Looking at total aggregate ancestral quanta we infer that African Americans are on the order of 15-25% European in ancestry, and 75-85% African. Uyghurs seem to be a composite in even measure of a European-like group, and an East Asian-like group. This makes total sense phenotypically; most African Americans look more African, while Uyghurs seem to exhibit a phenotype on average which spans the middle-range between West and East Eurasians.

But we’re clearly missing something when we focus purely on a population level statistic. Each “slice” of the bar plot actually represents an individual. Note the contrast between African Americans and Uyghurs. There is relatively little inter-individual variation among Uyghurs, while there is a great deal of such variation among African Americans. Why? Population geneticists have looked at linkage disequilibrium in both African Americans and Uyghurs, and inferred that the former went through an admixture phase much more recently than the latter. Though you don’t really have to be a population geneticist to have known that about African Americans. The ethnogenesis of the group African Americans as a cultural entity occurred in the period between 1650 and 1850. Genetically they are a compound of African, European, and to some extent Native American, ancestry. For the Uyghurs we have thinner textual evidence, but the visual and genetic data point to a “western” Indo-European speaking population in the Tarim basin before the arrival of the Turks sometime in the second half of the first millennium A.D. The assumption is that after the initial admixture event and the absorption of the pre-Turkic substrate there was no population substructure. Over time the two components distributed themselves evenly across the population over a period of 1,000-1,500 years.
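The way an old admixture event evens out across individuals can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are mine, not from the papers: a single admixture pulse, random mating, no later gene flow, and each child's ancestry fraction taken as the simple midpoint of its two parents' fractions (so the extra variance from chromosomes segregating in chunks is ignored). The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_ancestry_variance(n_individuals=2000, source_fraction=0.5,
                               generations=10, seed=0):
    """Return the variance of individual ancestry fractions, one value per generation.

    Toy model (an assumption, not a fitted demographic model): generation 0 is a
    mix of unadmixed founders from two source populations, and every later
    individual has the average ancestry of two randomly chosen parents.
    """
    rng = random.Random(seed)
    n_source = int(n_individuals * source_fraction)
    # Founders carry either 100% or 0% of the first source population's ancestry.
    population = [1.0] * n_source + [0.0] * (n_individuals - n_source)
    variances = [statistics.pvariance(population)]
    for _ in range(generations):
        # Random mating: each child is the midpoint of two random parents.
        population = [
            (rng.choice(population) + rng.choice(population)) / 2.0
            for _ in range(n_individuals)
        ]
        variances.append(statistics.pvariance(population))
    return variances


if __name__ == "__main__":
    for generation, variance in enumerate(simulate_ancestry_variance()):
        print(generation, round(variance, 5))
```

In this model the spread between individuals shrinks by roughly half every generation, so after a few dozen generations nearly everyone sits close to the population average, which is the Uyghur-like pattern. Recent or ongoing admixture and population substructure, as in the African American case, keep the spread wide.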

From this we can infer that patterns of individual variation within populations, as well as between closely related populations, can tell us a great deal. Today the Dodecad Ancestry Project posted a file with the population ancestries broken down by individuals. Looking at this sort of fine-grained data, patterns can jump out based on what you already know. Below is a slide show I created which highlights some patterns of interest.


The first slide is confirmation of what we already know, or should expect. The Burusho are a linguistic isolate in the mountains of northern Pakistan. Their lack of inter-individual variation within the population is suggestive of long term isolation, as is common in mountainous regions. The very fact that they speak a linguistic isolate should lead us to expect this, as the flow of culture and genes often correlate. The Sindhi are the dominant Indo-Aryan speaking ethnic group of the lower Indus watershed. Because of their geographic position they have been conquered many times, being under Persian, Arab, and Turkic rule. Genetically they’re very similar to the Burusho, but observe that there are two individuals with substantial West African ancestry. The presence of black Africans in the armies of the Muslims who conquered the subcontinent is well known, and is the origin of the Indian Siddi community. Some of the Sindhis also have appreciable ancestral components which are probably derived from Muslims from West Asia, the “Southern European” and “Southwest Asian” ancestral elements which the Burusho lack.

Next you see a comparison between Assyrians and Armenians. These two groups seem very similar, and both have deep textually attested roots in the Middle East. The Armenians date to the Persian Empire, at least, while the Assyrians are clearly the descendants of the indigenous Semitic population of Mesopotamia before the arrival of the Arabs. In the Muslim period many of them retreated to mountainous areas of northern Iraq, before emigrating to the cities of modern Iraq with the relaxation of their status as marginalized dhimmis. Today the Assyrian community is scattered across the world. The portion which adheres to the Church of the East was nearly totally extirpated from Iraq early in the 20th century, while that which is in union with the Roman Catholic Church, the Chaldeans, is currently leaving Iraq en masse.

But the Armenians are a far different case in terms of their interactions with the rest of the world. They have been present as “middlemen minorities” as far east as Southeast Asia, and north into the Russian Empire, and south into the Muslim world. The most parsimonious explanation for the individuals with Northern European ancestry is that like Kim Kardashian they are products of mixed-marriages, but I wonder if the centuries of the Armenian Diaspora have resulted in a change in the gene frequencies in the Armenian homeland in part because of back-migration. With larger data sets this will be testable, as well as the hypothesis that Diaspora communities are admixed while the Armenians in Armenia proper are not.

The third slide compares Scandinavians, Finns, and Lithuanians. Scandinavia refers to the Germanic speaking lands of Norden. Lithuania has historically been just outside the arc of Nordic influence (in contrast to Estonia and Latvia), so it can serve as a Northern European control. I believe some of the Finnish samples in Dodecad are related, so one shouldn’t make too much of them. But, contrast the relatively constant level of Southern European in the Scandinavian samples, and its variance in the Finnish ones. Inversely, the Finns show relative constancy of the “Northeast Asian” proportion, while the Scandinavians vary, with some lacking it. This is likely evidence of recent population exchange, and cultural switching. Finland was under Swedish rule for most of the past 1,000 years, and there still remains a large ethnic Swedish population in Finland, and an ethnic Finnish population in Sweden. Some families in Finland likely switched from Finnish to Swedish and back to Finnish within the last 500 years. The Southern European and West Asian elements more prominent in the Scandinavians tend to increase as one goes south in Europe, with the former modal in Sardinia (in fact, Sardinians are nearly fixed for the Southern European component), and the latter more prominent among southeast European groups. Geography may then explain why the Lithuanians have similar amounts of the West Asian, but less of the Southern European.
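The "constant in one group, variable in the other" comparison above can be made concrete by computing a mean and standard deviation per population for each component. The sketch below assumes the individual-level file has been read into (population, fractions) pairs; that layout, the function name, and the numbers are all hypothetical and invented purely for illustration. They are not Dodecad results.

```python
import statistics
from collections import defaultdict


def summarize_component(rows, component):
    """Return {population: (mean, standard deviation)} for one ancestral component.

    rows is an iterable of (population, {component_name: percent}) pairs,
    one pair per individual. This layout is an assumption for illustration.
    """
    by_population = defaultdict(list)
    for population, fractions in rows:
        # Treat a missing component as 0% for that individual.
        by_population[population].append(fractions.get(component, 0.0))
    return {
        population: (statistics.mean(values), statistics.pstdev(values))
        for population, values in by_population.items()
    }


# Invented numbers, NOT real data: PopA is steady for one component and
# variable for the other; PopB shows the opposite pattern.
rows = [
    ("PopA", {"Southern_European": 20.0, "Northeast_Asian": 0.0}),
    ("PopA", {"Southern_European": 21.0, "Northeast_Asian": 3.0}),
    ("PopA", {"Southern_European": 19.0, "Northeast_Asian": 6.0}),
    ("PopB", {"Southern_European": 5.0, "Northeast_Asian": 6.0}),
    ("PopB", {"Southern_European": 15.0, "Northeast_Asian": 7.0}),
    ("PopB", {"Southern_European": 10.0, "Northeast_Asian": 5.0}),
]

if __name__ == "__main__":
    for component in ("Southern_European", "Northeast_Asian"):
        print(component, summarize_component(rows, component))
```

A low standard deviation relative to the mean suggests a component that has been in the population long enough to spread evenly. A high one points to recent exchange, or to a handful of related or atypical samples, which is worth remembering with sample sizes this small.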

Finally we compare Turks, Greeks, and Cypriots. The historical ethnography strongly implies that the major component of Anatolian Turkish ancestry is Greek and Armenian. A broad similarity to the Greeks here is rather clear (with an elevated West Asian component probably from the Armenian ancestry). But notice the differences. There is a consistent East and Northeast Asian component of ancestry among Turks which is lacking in the Greeks. Since the origin of the Turks is in what would today be termed Greater Mongolia, this makes sense. What surprised me though is the presence of a South Asian component among the Turks. This is where looking at individual-level results pays off; I’d assumed that, as with the Romanians, the South Asian element was due to a few assimilated Roma. That seems unlikely now; it’s too evenly distributed. So what then? I think here looking at the Uyghur plot illuminates this for us. I don’t know what to make of the South Asian component which you can find in the Uyghur, and even, to a trace but again consistent extent, among the Chuvash, who inhabit the South Urals. Some readers have long claimed that some of the West Eurasian Uyghur ancestry was somehow connected to South Asia, and to be honest I’ve kind of seen that in other HGDP bar plots, but ignored it as of secondary importance. The Turkic group to the north and east of the Uyghurs, the Yakut, totally lack it. From what little we know it seems that the Turks pushed west to Europe, the Middle East, and South Asia, via what is today Xinjiang and Kazakhstan. The existence of this South Asian element in the Turks of Anatolia may be because of their sojourn in this region. There were Iranian speaking Indo-Europeans in Xinjiang, and certainly in Central Asia. Additionally we know historically that northwest India was connected to Xinjiang culturally, as some Indians arrived in China after a period of residence in Xinjiang.
But instead of an “Out of South Asia” event I think what we may be looking at is part of the old “Ancient North Indian” genetic variation which pushed into South Asia from the north, and was eventually overlain in Central Asia with other components. I had assumed that the South Asian component among the Finns was noise or Kale (Finnish Roma) admixture, but perhaps it reflects this same ancient variation.

Then there is Cyprus. Today the island is ethnically divided between Greek Cypriots and Turkish Cypriots. But in the Bronze Age Cyprus seems to have had a civilization with a close connection with the Near East, in particular Egypt. Sometime between the Bronze Age and the Classical Era it became an outpost of Greece. But notice the near total absence of Northern European among the Cypriots. Like Sardinia, but unlike Sicily, Cyprus is relatively far from the Eurasian mainland. So how did Cyprus become Greek? If the Greeks always had a noticeable Northern European component, or at least had one during the Bronze Age, that would indicate that the Cypriots are a case of cultural diffusion and emulation of a small Greek elite which arrived during the migrations of the Sea Peoples. Or, the Northern European element could be due to admixture with the Slavic peoples who arrived in Greece after the collapse of the East Roman frontier in the 6th century. Or it could be a combination of both. In any case, the Cypriots look most like the Syrians genetically, though the Syrians seem to have a lot more trace exogenous components.

There’s a lot more one could say. I invite readers to download the RAR file with the bar plots. I will leave you with one last comparison, without comment:


Image Credit: Tocharian Buddhist monk of European appearance, and Kim Kardashian, by Luke Ford


Dienekes is now allowing people to “out” themselves in terms of their ancestry on a comment thread over at the Dodecad Ancestry Project. One of the major purposes of the project has been to survey variation in under-sampled groups which could give us insights into human genetic history. Yesterday I pointed to an analysis of Europeans from the British Isles to Russia. Basically Northern Europeans. There wasn’t anything too revolutionary about the nature of the results; rather, it confirmed some patterns we’d seen. Additionally it obviously didn’t resolve issues of timing, though it clarified hypotheses on the margin.

The main benefit of the ADMIXTURE bar plots is that they give you a gestalt sense of relationships in a quantitative fashion. This is especially important for groups in the Eurasian Heartland, who are in some ways at the center of both genetic and cultural exchange. In the comments above some information was divulged as to the provenance of two clusters of samples, Finns and Assyrians. The Assyrians here presumably represent the remnants of Mesopotamia’s Christian majority at the time of the Arab conquests in the 7th century. Prior to the Arab conquests Mesopotamia had been under the rule of the Sassanid Persian dynasty for nearly four centuries, but by the early 7th century the Syriac speaking majority by and large adhered to a range of Christian sects (the balance seem to have been heterodox non-Christian Gnostics and Jews), with the ancient Church of the East dominant. Because of the social constraints which Christians were placed under within the Muslim Middle East prior to the modern era these communities may be particularly informative as to the demographic impact of the Arab conquests, and the cosmopolitan and international nature of the Muslim polities and how they reshaped the genetics of the Middle East. A good approximation is that the Christian minorities are the dominant parent population of the Muslim majority, but that because of their tendency to withdraw into more isolated regions and their enforced economic marginality they would not have intermixed so much with the influx of slaves, northern (Turk and Slav), Indian, and African, which characterized much of Mesopotamia over the past 1,400 years.

Below the fold is a slide show. I’ve reedited just a touch (removed a few populations, put the labels in larger fonts, etc.). First the total population set. Then I’ve dropped the Finns and Assyrians, respectively, into the global population set (obscuring some which are less relevant).


First things first: the different ancestral components are popping out of ADMIXTURE and are suggestive inferences based on the data input. They do not necessarily represent real concrete ancestral populations! As I keep pointing out, the purple South Asian element is probably a compound of at least two very genetically distinct ancient groups in about equal measures, one with strong West Eurasian/European affinities, and another a long resident indigenous South Asian group with distant, but definite, affinities to East Eurasians (it may be that the latter South Asian element gave rise to the various branches of East Eurasians and Amerindians further back in prehistory).

The “Northeast Asian” element in the ancestry of the four Finns is not that surprising (though I believe some of these are related). In 23andMe Finns often seem to show trace “Asian” ancestry, on the order of ~1%. Uniparental markers, especially Y chromosomal lineages, have long indicated ancient affinities between the Finnic peoples of Europe and various groups in Siberia. The major question has been whether the migration has been from the west to the east, or the east to the west. And yet perhaps this is the wrong way of looking at it; perhaps both these groups derive from an expansion south of the margins of the glaciers in the wake of the last Ice Age? The Finns clearly physically resemble their fellow Nordics more than the Yakuts. But perhaps this is not unexpected when you have mobile low density populations on the margins of more numerous conventional agriculturalists? I believe that the Mercator projection has also caused problems in assessing the plausibility of connections between circumpolar peoples.

Next let’s move to the Assyrians. As with other such surveys the lack of African ancestry in relation to similar Muslim populations is striking. The Syrian set is probably the best point of comparison. Note the small slices from other populations in the Syrians. I would normally ignore that, but their absence in the Assyrians may be informative. This may be a function of close relatedness of the Assyrians, but I’d give it even odds that a low fraction of exogenous post-Islamic ancestry which is associated with travel within the Muslim lands explains some of the difference between the majority and minority populations (above and beyond the clear African element).

Finally, I’m going to make up stories on the fly to generate some discussion (I think the stories correspond to reality more than expectation, but I have very weak confidence in them myself).

– The “Southern European” element which is maximal in Sardinia indicates the very first wave of agriculturalists. The Sardinians may not be purely descended from agriculturalists, but like the composite “South Asian” quantum this represents possibly the very first hybridization due to a rapid demographic pulse driven by agriculture which synthesized with the hunter-gatherers of Western Europe. Like the “Ancient South Indians” I doubt that the Ice Age Europeans of Western Europe are present in “pure” form anymore. The influence of this component can be found far to the east among the Assyrians, but it almost disappears in Afghanistan and Pakistan. I think this may have something to do with R1b1b2.

– The “Northern European” element which is maximal in the Lithuanians is found among the Pashtuns and non-Arab Middle Easterners, but not Arab speakers. It drops off in India very quickly. I don’t think the Lithuanians are the “purest” Indo-Europeans, and I don’t think that this element was necessarily exclusive to Indo-Europeans. But there has to have been some leap-frogging going on, because on average Semitic Middle Eastern groups are more like Europeans than Gujaratis are in total genetic distance, but Gujaratis seem to have a higher fraction of this quantum. And suggestively Punjabis seem to carry the Central Eurasian lactase persistence alleles (I know this from the literature and genome sharing on 23andMe). Again, because these ancestral quanta don’t represent real populations, but are proportions popping out of ADMIXTURE, we shouldn’t take the orange fraction as the “Aryan” ancestry in South Asia. But that seems the most plausible explanation for why it is found at far higher frequency in Indo-European speaking northwest India than it is in the Semitic speaking Fertile Crescent.

– The light-blue “West Asian” fraction gets around. It’s found at the same proportions in Tuscans as in Uyghurs, and you can find it pretty far south and east in South Asia (I have a fair amount of it, as does a Reddy from South India, and the Kannada speakers in the global Dodecad set have some of it too, though less). I assume that it is present at high frequency among the Uyghurs because of the Indo-European speakers. But it clearly doesn’t have an Indo-European origin as such. It has a high frequency among the Cypriots, but not the Sardinians, with modal proportions among various Caucasian groups. I assume it has something to do with agriculture, but it seems to have less of an influence in Western Europe than further to the east. The Finns and Lithuanians have a little bit, about the same as South Indians. So again, something which probably hitch-hiked with population movements in the center of Eurasia, but I assume pre-dating the Indo-Europeans.

• Category: History, Science • Tags: Dodecad, Genetic History, Genetics, Genomics 

A few days ago Dienekes opened up the Dodecad project to a wider range of Eurasians. I decided to send my 23andMe sample to Dienekes ASAP, and the results came back today. I’m DOD075. Dienekes also just put up an explanation of the 10 ancestral components he’s generating from ADMIXTURE (along with tree-like representations of their distances). Below I’ve placed myself in the more local context of populations to which I’m close:


Here are all the populations.

Karnataka is a state in northwest South India, and can be taken as somewhat representative of the Dravidian populations. The purple component is a distinctively South Asian element. Using the terminology of Reich et al. it would be the ancient stabilized hybrid population which came out of the admixture of Ancient North Indians (ANI) and Ancient South Indians (ASI). On the margins I assume there’s just noise popping out; e.g., the “East Asian” sliver among the Kannada speakers from South India. On the other hand, the Burusho have shown evidence of East Asian admixture in other studies I’ve seen. They have a bit of the derived East Asian EDAR variant for example.

As for me, no surprise that I have a lot of “East Asian” for a South Asian. Since Dienekes is more interested in Western Eurasia he didn’t go to the point of dividing the East Asians into a northern and southern branch. I’m pretty sure I’d be in the southern branch, along with the Miaozu sample (more well known as Hmong to Americans). The bigger question is how atypical for an east South Asian I am. There is a certain basal load of East Asian ancestry among northeast South Asian Indo-Aryan speakers. Another question is whether my East Asian component can be attributed to the Mundari substrate absorbed by Indo-Aryans in northeast India, or by a more recent admixture of Tibeto-Burmans. Some of both surely, but knowing my family’s long residence on the eastern margins of the Indo-Aryan speaking domains of South Asia, cheek-by-jowl with Tibeto-Burmans, I believe I am likely to have some recent Burmese ancestry. Specifically through my paternal grandfather.

Finally, though it is just as likely to be nothing, I have a bit more “Southern European” than the other South Asians. I assume this is from my great-grandfather who was from Delhi, and part of the polyglot Muslim religious intellectual class of that city. His physical type, which my maternal grandmother inherited, was clearly West Asian. He probably had non-trivial Persian or Central Asian ancestry.


Dienekes Pontikos, Introducing the Dodecad ancestry project:

1) Project goals

The Dodecad ancestry project has two goals:

– To provide detailed ancestry analysis to individuals who have tested with 23andMe; other testing companies may be included in the future.

– To build samples of individuals for regions of the world (e.g. Greeks, Finns, Albanians, Southern Italians, etc.) currently under-represented in publicly available datasets.

I neither endorse nor am I affiliated with any genetic testing company. I have chosen to base the project on 23andMe results, because (i) I perceive that quite a few people have used the service, (ii) the Illumina genotyping platform it uses has substantial overlap with the publicly available datasets on which my analysis depends.

Basically some of you need to send him your 23andMe raw data files. The potential sample space of this group is going to be in the tens of thousands from what 23andMe representatives have stated about how many of the Complete Edition kits they’ve sold. Naturally due to labor and computational constraints he only wants people from particular populations. I think that’s fine. I’m a little taken aback by how demanding and critical Dienekes’ readers have been about the choice of populations he analyzes. You can install ADMIXTURE yourself, get data sets, and manipulate them in PLINK, etc. I hope many people will participate in this project. I would have given my sample, but I’m not of an appropriate population, and even if he wanted South Asians I’m pretty sure I’m not very representative of South Asians (I have very few runs-of-homozygosity and seem to have recent admixture from other world population groups).
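For readers who do want to try this themselves, here is a rough sketch of the usual order of operations: thin the markers for linkage disequilibrium in PLINK, then run ADMIXTURE on the pruned set. It only builds and prints the command lines rather than running anything. The file prefix `mydata`, the `_pruned` suffix, the pruning thresholds (50, 10, 0.1) and the helper name are my own illustrative assumptions, and getting a 23andMe raw file into PLINK binary format is a separate step not shown here.

```python
def admixture_pipeline_commands(bfile, k, pruned_suffix="_pruned"):
    """Build (not run) the commands for an LD-pruning step followed by ADMIXTURE.

    bfile is the prefix of an existing PLINK binary fileset (.bed/.bim/.fam).
    The prefix, suffix and thresholds are illustrative assumptions.
    """
    pruned = bfile + pruned_suffix
    return [
        # 1. List markers in approximate linkage equilibrium
        #    (window of 50 SNPs, step of 10, r^2 threshold of 0.1).
        ["plink", "--bfile", bfile, "--indep-pairwise", "50", "10", "0.1",
         "--out", bfile],
        # 2. Keep only those markers and write a new binary fileset.
        ["plink", "--bfile", bfile, "--extract", bfile + ".prune.in",
         "--make-bed", "--out", pruned],
        # 3. Run ADMIXTURE with K ancestral components on the pruned data.
        ["admixture", pruned + ".bed", str(k)],
    ]


if __name__ == "__main__":
    for command in admixture_pipeline_commands("mydata", 10):
        print(" ".join(command))
```

Each printed line could be pasted into a shell, or passed to `subprocess.run`, once both programs are installed. ADMIXTURE writes the per-individual ancestry fractions to a `.Q` file, one row per sample, and those rows are what the bar plots are drawn from.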

• Category: Science • Tags: 23andMe, Dodecad, Genetics, Genomics 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at"