The Unz Review - Mobile
A Collection of Interesting, Important, and Controversial Perspectives Largely Excluded from the American Mainstream Media
Harappa Ancestry Project


My friend Zack Ajmal has been running the Harappa Ancestry Project for several years now. This is a non-institutional complement to the genomic research which occurs in the academy. His motivation was in large part to fill in the gaps of population coverage within South Asia which one sees in the academic literature. Much of this is due to politics, as the government of India has traditionally been reluctant to allow sample collection (ergo, the HGDP data uses Pakistanis as their South Asian reference, while the HapMap collected DNA from Indian Americans in Houston). Of course this sort of project is not without its own blind spots. Zack must rely on public data sets to get a better picture of groups like tribal populations and Dalits, because they are so underrepresented in the Diaspora from which he draws many of the project participants.

Once Zack has the genotype, one of the primary things he does is add it to his broader data set (which includes many public samples) and analyze it with the Admixture model-based clustering package. What Admixture does is take a specified number of populations (e.g. K = 12) and assign each individual a proportion of ancestry from each of them. So, for example, individual A might be assigned 40% population 1 and 60% population 2 for K = 2. Individual B might be 45% population 1 and 55% population 2. These are not necessarily ‘real’ populations. Rather, the populations and their proportions are there to allow you to discern patterns of relationships across individuals.
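To make the K = 2 example concrete, here is a minimal Python sketch. It only illustrates the shape of the output (one row of fractions per individual, summing to 1); it is not how Admixture itself computes anything, and the helper names are made up for this illustration.

```python
# Toy Admixture-style output for K = 2: one row per individual,
# one column per inferred population. Values are the two examples from the text.
proportions = {
    "A": [0.40, 0.60],
    "B": [0.45, 0.55],
}

def is_valid_row(row, tolerance=1e-6):
    """A row is valid if every fraction lies in [0, 1] and the fractions sum to 1."""
    return all(0.0 <= x <= 1.0 for x in row) and abs(sum(row) - 1.0) <= tolerance

def distance(row_a, row_b):
    """Sum of absolute differences between two individuals' proportions."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))

for name, row in proportions.items():
    print(name, row, "valid" if is_valid_row(row) else "invalid")
print("A vs B distance:", round(distance(proportions["A"], proportions["B"]), 3))
```

The distance function is just one simple way to compare individuals; the point is that the patterns across rows matter more than any single number.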

Since Zack has put his results online, I thought it would be useful to review what patterns have emerged over the past two years, as his sample sizes for some regions are now moderately significant. Though he has K=16 populations, not all of them will concern us, because South Asians do not tend to exhibit many of the components. I will focus on seven: S Indian, Baloch, Caucasian, NE Euro, SE Asian, Siberian and NE Asian. These are not real populations, but the labels tell you in which region each component is modal. So, for example, the “S Indian” component peaks in southern India. The “Baloch” component peaks among the Baloch people of southeastern Iran and southwest Pakistan. The “NE Euro” component peaks among the eastern Baltic peoples. The last three are Asian components, running the latitude from south to north to center. They only concern the first population of interest, Bengalis. I will combine these last three together as “Asian.”

Below is a table, mostly individuals from Zack’s results (though there are some aggregate results from public data sets). Comments below.

Ethnicity | S Indian | Baloch | Caucasian | NE Euro | Asian (Bengalis only)
Bengali 53% 28% 2% 5% 8%
Bengali Baidya 45% 30% 3% 5% 12%
Bengali Baidya 45% 27% 3% 6% 12%
Bengali Brahmin 45% 35% 2% 11% 4%
Bengali Brahmin 44% 35% 5% 11% 4%
Bengali Brahmin 43% 35% 4% 10% 4%
Bengali Brahmin 42% 32% 4% 8% 6%
Bengali Brahmin 41% 33% 7% 8% 5%
Bengali Brahmin 40% 33% 4% 10% 4%
Bengali Brahmin 40% 30% 6% 10% 7%
Bengali Muslim 50% 25% 1% 5% 15%
Bengali Muslim 49% 28% 3% 4% 15%
Bengali Muslim 45% 27% 4% 4% 17%
Bengali Muslim 45% 26% 2% 2% 16%
Bengali Muslim 45% 24% 1% 3% 19%
Bengali Muslim 43% 25% 3% 2% 18%
Bengali Muslim 48% 27% 0% 5% 15%
Tamil Brahmin 48% 37% 6% 5%
Tamil Brahmin 48% 37% 3% 5%
Tamil Brahmin 48% 35% 5% 6%
Tamil Brahmin 47% 38% 6% 4%
Tamil Brahmin 47% 40% 3% 5%
Tamil Brahmin 46% 40% 3% 6%
Tamil Brahmin Iyengar 50% 35% 2% 8%
Tamil Brahmin Iyengar 47% 38% 6% 4%
Tamil Brahmin Iyengar 47% 35% 6% 6%
Tamil Brahmin Iyer 48% 38% 4% 5%
Tamil Brahmin Iyer 48% 38% 2% 5%
Tamil Brahmin Iyer 47% 37% 2% 5%
Tamil Brahmin Iyer 47% 37% 6% 8%
Tamil Brahmin Iyer 43% 35% 6% 5%
Tamil Muslim 58% 28% 3% 2%
Tamil Nadar 62% 30% 0% 0%
Tamil Nadar 59% 32% 3% 0%
Tamil Nadar 55% 30% 3% 0%
Tamil Vellalar 50% 35% 6% 1%
Tamil Vellalar 51% 32% 5% 0%
Tamil Vellalar (Sri Lankan) 60% 32% 5% 0%
Tamil Vellalar (Sri Lankan) 60% 33% 0% 0%
Tamil Vellalar (Sri Lankan) 56% 36% 0% 0%
Tamil Vishwakarma 70% 23% 0% 0%
Tamil Vishwakarma 66% 25% 4% 0%
Andhra Pradesh 60% 34% 2% 0%
Andhra Pradesh 54% 36% 2% 3%
Andhra Pradesh (Hyderabad) 56% 29% 5% 0%
Andhra Pradesh (Hyderabad) 47% 35% 8% 4%
Andhra Pradesh Gouda 61% 30% 2% 1%
Andhra Pradesh Kamma 51% 33% 7% 0%
Andhra Pradesh Kapu 62% 30% 2% 1%
Andhra Pradesh Naidu 51% 32% 4% 2%
Andhra Pradesh Reddy 57% 37% 1% 0%
Andhra Pradesh Reddy 54% 38% 3% 0%
Andhra Pradesh Reddy 51% 35% 4% 0%
Andhra Pradesh Reddy 50% 36% 2% 1%
Andhra Pradesh Telugu Brahmin 45% 33% 6% 4%
AP Brahmin (Xing, N = 25) 49% 36% 3% 6%
AP Naidu (Reich, N = 4) 61% 31% 1% 1%
Kannada Devanga 60% 31% 3% 1%
Karnataka Catholic Christian 56% 37% 3% 0%
Karnataka Lingayat 55% 34% 4% 0%
Karnataka 54% 36% 2% 0%
Karnataka Brahmin 51% 35% 3% 5%
Karnataka Iyengar 49% 36% 5% 5%
Karnataka Iyengar 48% 39% 3% 5%
Karnataka Iyengar 48% 37% 3% 7%
Karnataka Brahmin 47% 38% 4% 6%
Karnataka Konkani Brahmin 47% 37% 2% 6%
Karnataka Konkani Brahmin 46% 33% 6% 7%
Karnataka Konkani Brahmin 44% 34% 6% 5%
Kerala 47% 33% 7% 2%
Kerala Brahmin 43% 39% 4% 6%
Kerala Christian 53% 35% 4% 0%
Kerala Christian 50% 35% 8% 1%
Kerala Christian 45% 33% 7% 3%
Kerala Muslim Rawther 53% 35% 2% 1%
Kerala Muslim Rawther 51% 28% 4% 3%
Kerala Nair 48% 40% 4% 0%
Kerala Nair 47% 38% 5% 5%
Kerala Syrian Christian 50% 37% 6% 0%
Kerala Syrian Christian 50% 35% 9% 1%
Kerala Syrian Christian 46% 33% 5% 4%
Kerala Syrian Christian 44% 33% 6% 4%
Pathan (HGDP, N = 23) 23% 42% 16% 11%
Kalash (HGDP, N = 23) 22% 43% 18% 11%
Burusho (HGDP, N = 25) 23% 41% 12% 10%
Brahui (HGDP, N = 25) 12% 58% 12% 2%
Sindhi (HGDP, N = 24) 29% 46% 10% 6%
Kashmiri Pandit (Reich, N = 5) 32% 39% 12% 9%
Punjabi 43% 36% 5% 9%
Punjabi 39% 39% 9% 7%
Punjabi 34% 43% 7% 7%
Punjabi 34% 40% 12% 8%
Punjabi 33% 44% 5% 10%
Punjabi 31% 41% 14% 8%
Punjabi 29% 36% 11% 11%
Punjabi Arain (Xing, N = 25) 31% 44% 10% 7%
Punjabi Brahmin 35% 40% 8% 11%
Punjabi Brahmin 33% 41% 13% 10%
Punjabi Chamar 40% 33% 9% 6%
Punjabi Jatt 28% 39% 11% 10%
Punjabi Jatt 30% 44% 6% 14%
Punjabi Jatt 28% 42% 8% 13%
Punjabi Jatt 28% 46% 7% 13%
Punjabi Jatt 28% 40% 10% 15%
Punjabi Jatt 27% 44% 10% 13%
Punjabi Jatt 27% 35% 16% 11%
Punjabi Jatt Muslim 30% 39% 13% 8%
Punjabi Khatri 30% 42% 12% 12%
Punjabi Lahori Muslim 31% 44% 11% 8%
Punjabi Pahari Rajput 34% 43% 11% 7%
Punjabi Pakistan 28% 36% 16% 7%
Punjabi Ramgarhia 35% 43% 5% 9%
Haryana Jat 25% 33% 12% 17%
Haryana Jat 25% 33% 12% 17%
Haryana Jatt 28% 38% 5% 20%
Haryana Jatt 26% 39% 10% 17%
Rajasthan Marwari Jain 47% 34% 5% 6%
Rajasthani Agarwal 51% 37% 6% 1%
Rajasthani Brahmin 32% 38% 9% 15%
Rajasthani Marwari 48% 34% 6% 2%
Rajasthani Rajput 45% 38% 5% 9%
UP 40% 28% 10% 8%
UP Brahmin 41% 37% 7% 11%
UP Brahmin 40% 37% 7% 11%
UP Brahmin 37% 38% 2% 14%
UP Kayastha 47% 38% 5% 3%
UP Muslim 33% 33% 10% 9%
UP Muslim 28% 35% 12% 11%
UP Muslim Pathan 48% 36% 7% 4%
UP Muslim Syed 33% 31% 13% 7%
UP Syed 36% 37% 7% 8%
UP/Haryana Agarwal 52% 35% 6% 2%
UP/Haryana Jatt 28% 42% 7% 18%
UP/Madhya Pradesh 51% 27% 1% 7%
UP/Punjabi 40% 33% 7% 10%
UP/Punjabi Khatri 27% 43% 10% 11%
Bihari Baniya 47% 31% 5% 5%
Bihari Brahmin 39% 38% 5% 11%
Bihari Kayastha 53% 33% 1% 7%
Bihari Muslim 48% 28% 5% 8%
Bihari Muslim 42% 34% 9% 6%
Bihari Muslim 41% 36% 7% 8%
Bihari Muslim 42% 32% 7% 9%
Bihari Syed 42% 35% 4% 9%
Gujarati (HapMap, N = 63, Patel) 54% 42% 0% 1%
Gujarati (HapMap, N = 34, Non-Patel) 44% 39% 5% 7%

A recent paper suggested that there was a single pulse of admixture between South and East Asians in the environs of what is today Bangladesh which occurred ~500 A.D. The traditional accounts of the arrival of Brahmins in Bengal suggest a period around and after 1000 A.D. (Bengal was one of the last redoubts of institutional Buddhism in northern India, so presumably would have had less need for the services of Brahmins). The results are easy to align with these two facts. All the Bengali non-Brahmins (Baidya are a non-Brahmin high caste in West Bengal) have substantial East Asian ancestry. The Bengali Brahmins have far less of this. Additionally, their “NE Euro” component is about double that of non-Brahmins. There is still room for the Bengali Brahmins being a synthetic community with some admixture (their East Asian fraction is still notably higher than elsewhere in South Asia), but the traditional narrative seems to explain the broad outline of these results.
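The “about double” claim can be checked against the table with simple averages. The Python sketch below copies the NE Euro and Asian percentages for the labeled Bengali rows; leaving out the one unlabeled “Bengali” row is my own assumption, since its caste is not given.

```python
from statistics import mean

# Percentages copied from the table above (NE Euro, Asian), one pair per individual.
bengali_brahmin = [(11, 4), (11, 4), (10, 4), (8, 6), (8, 5), (10, 4), (10, 7)]
# Baidya and Muslim rows; the unlabeled "Bengali" row is left out (assumption).
bengali_non_brahmin = [(5, 12), (6, 12), (5, 15), (4, 15), (4, 17),
                       (2, 16), (3, 19), (2, 18), (5, 15)]

def column_mean(rows, index):
    """Average of one column across a list of (NE Euro, Asian) pairs."""
    return mean(row[index] for row in rows)

brahmin_ne_euro = column_mean(bengali_brahmin, 0)
other_ne_euro = column_mean(bengali_non_brahmin, 0)
print("NE Euro, Brahmin vs. non-Brahmin:", round(brahmin_ne_euro, 1), round(other_ne_euro, 1))
print("Ratio:", round(brahmin_ne_euro / other_ne_euro, 2))
print("Asian, Brahmin vs. non-Brahmin:",
      round(column_mean(bengali_brahmin, 1), 1),
      round(column_mean(bengali_non_brahmin, 1), 1))
```

With so few individuals these means are rough, but they point the same way as the eyeballed comparison in the text.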

When you look at South Indians from the four Dravidian states there are four facts which strike me as of note:

- There is a distinct difference between Brahmins and non-Brahmins (most of the non-Brahmins Zack has in the Harappa data set are upper caste, though the public data sets have Dalits and tribal populations)

- There is very little difference between South Indian Brahmins by region and sect (e.g., Iyengar vs. Iyer are Tamil Brahmins divided by theological differences).

- South Indian Brahmins are genetically distinct from North Indian Brahmins. They seem to have about one half the proportion of the “NE Euro” component as North Indian Brahmins (e.g., compare to Bengali Brahmins).

- South Indian non-Brahmin upper castes have very little of the “NE Euro” component, which is found at low, but consistent fractions among non-Brahmins in the Gangetic plain (and at much higher fractions as one moves toward the Punjab)

I do not know about the nature of the origin of the Pancha-Dravida group of Brahmins, but they look to be endogamous, from the same source, and probably had some admixture with the local substrate early on. This would explain their uniformity and lower fraction of “NE Euro” in relation to North Indian Brahmins. The results above also suggest that the Syrian Christians derive from converts from the Nair community, or related communities. This should not be surprising.

Finally let’s move to North India, and the zone stretching between Punjab in the Northwest and Bihar in the East. Though in much of this region Brahmins have higher “NE Euro” fractions, this relationship seems to break down as you go northwest. The Jatt community in particular seems to have the highest in the subcontinent. There are inchoate theories for the origins of the Jatts in Central Asia. I had dismissed them, but am thinking now they need a second look. The reasoning is simple. The Jatts of the eastern Punjab have a higher fraction of “NE Euro” than populations to their northwest (Pathans, Kalash, etc.), and Brahmin groups (e.g., Pandits) in their area who are theoretically higher in caste status. This violation of these two trends implies something not easily explained by straightforward social and geographic processes. The connection between ancestry and caste status also seems to break down somewhat in the Northwest, as there is a wide variation in ancestral components.

Someone with more knowledge of South Asian ethnography should weigh in. But until then I invite readers of South Asian heritage to submit their results to Zack.

(Republished from Discover/GNXP by permission of author or representative)
 

Two Steps Forward, Two Steps Back:

I got my daughter a netbook, so now my computer is doing Harappa Project work 24×7.

Also, Simranjit was nice enough to offer me the use of a server. For privacy reasons, I am not going to upload any of the participants’ data there but it is much faster than my machine and hence very useful for running Admixture on the reference data (especially with crossvalidation).

As for steps back, I downloaded the current 1000genomes data (1,212 samples, 2.4 million SNPs). It’s in vcf format. Using vcftools to convert it to ped format will take about 3 weeks. Yes you heard that right. BTW, the good stuff from a South Asian point of view will come later this year with 100 Assamese Ahom, 100 Kayastha from Calcutta, 100 Reddys from Hyderabad, 100 Maratha from Bombay and 100 Lahori Punjabis.

Also, I spent most of Sunday evening and night in the ER and got a diagnosis of ureterolithiasis for my efforts. All I can say is: Three cheers for Percocet!!

First, wish Zack well. Second, he has over 70 individuals in the Harappa Ancestry Project data base (in addition to the public data sets). If you’re South Asian, Iranian, Burmese, or Tibetan, here are the details of participation.

(Republished from Discover/GNXP by permission of author or representative)
 

When Zack first mooted the idea of the Harappa Ancestry Project I had no idea what was coming down the pipe. I wonder if his daughter and wife are curious as to what’s happened to their computer! Since collecting the first wave of participants he’s been a result generating machine. Today he produced a fascinating three dimensional PCA (modifying Doug McDonald’s Javascript) using his “Reference 1” data set. He rescaled the dimensions appropriately so that they reflect how much of the genetic variance they explain. The largest principal component of variance is naturally Africa vs. non-Africa, the second is west to east in Eurasia, and the third is a north to south Eurasian axis.

I decided to be a thief and take Zack’s Javascript and resize it a bit to fit the width of my blog, blow up the font size, as well as change the background color and aspects of positioning. All to suit my perverse taste. You see the classic “L” shaped distribution familiar from the two-dimensional plots, but observe the “pucker” in the third dimension of South Asian, and to a lesser extent Southeast Asian, populations.

The topology of the first three independent dimensions of genetic variance of world populations kind of reminds me of a B-2 bomber:

(Republished from Discover/GNXP by permission of author or representative)
 

Over the past few months I was hoping more people would start doing what Zack Ajmal, Dienekes, and David have been doing. There are public data sets, and open source software, so that anyone with a nerdy inclination can explore their own questions out of curiosity. That way you can see the power and the limitations of genomics on your own desktop. I wonder if one of the biggest reasons that more people haven’t started doing this is formatting. It can be a pain to convert matrix formatted files into pedigree format, for example. But the data gusher isn’t ending; look at what’s coming out (and has come out) in the 1000 Genomes project!

I’ve been thinking I need to write up a post which is a “soft landing” for people so that we can reduce the “activation energy” for this sort of thing…once you get hooked, you only go deeper. Luckily an anonymous tipster has sent me the link to a URL with a huge data set which has been merged, already pedigree formatted. Here are the populations:

!Kung Buryats Hausa Mada Punjabi Arain Totonac
Adygei Cambodian Hazara Makrani Pygmy Tu
African Americans Chinese Hema Malayan Romanians Tujia
Algeria Chinese Americans Hezhen Mandenka Russian Tunisia
Altaians Chukchis Hungarians Maya Sahara Occ Turks
Alur Chuvashs Iban Mbuti Sakilli Tuscans
Ap Brahmin Cochin Jews Igbo Melanesian Samaritians Tuvinians
Ap Madiga Colombian Iranian Jews Mexicans Samoan Urkarah
Ap Mala Cypriots Iranians Miao San Utahn Whites
Armenians Dai Iraq Jews Mongola San Nb Uygur
Armenians B Daur Irula Mongolians Sandawe Uzbekistan Jews
Ashkenazy Jews Dogon Italian Moroccans Sardinian Uzbeks
Azerbaijan Jews Dolgans Japanese Morocco Jews Saudis Vietnamese
Balochi Druze Jordanians Morocco N Selkups Greenlanders
Bambaran Greenlanders Kaba Morocco S Sephardic Jews Xhosa
Bamoun Egypt Kalash Mozabite She Xibo
Bantukenya Egyptans Karitiana N European Sindhi Yakut
South Africa Ethiopian Jews Kets Naxi Singapore Chinese Yemen Jews
Basque Ethiopians Khmer Nepalese Singapore Indians Yemenese
Bedouin Evenkis Kongo Nganassans Singapore Malay Yi
Beijing Chinese Fang Koryaks Nguni Slovenian Yoruba
Belorussian French Kurd North Kannadi Sotho/Tswana Yukaghirs
Biaka Fulani Kyrgyzstani Orcadian Spaniards
Bnei Menashe Georgia Jews Lahu Oroqen Stalskoe
Bolivian Georgians Lebanese Palestinian Surui
Brahui Gujaratis Lezgins Paniya Syrians
Brong Gujaratis B Libya Papuan Thai
Bulala Hadza Lithuanians Pathan Tamil Brahmin
Burusho Han Luhya Pedi Tamil Dalit
Buryat Han Nchina Maasai Pima Tongan

The data set has ~4,000 individuals, and ~30,000 markers. The binary file is ~25 MB. The download has four files. The .bed, .bim, and .fam, are pedigree formatted. The .csv is a “master list” of the information on each individual (population, region, etc., tied to a specific identification number). This is important because once you have some output files…you need to figure out what it means, and visualize it, and that’s only informative if you have a master list with more than just family and individual information.
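For orientation, here is a minimal Python sketch of reading a PLINK .fam file, which holds one individual per line with six whitespace-separated fields (family ID, individual ID, father, mother, sex, phenotype). The two sample lines and their IDs are made up for illustration; the IDs in the real download will differ.

```python
from collections import namedtuple

FamRow = namedtuple("FamRow", "family_id individual_id father mother sex phenotype")

def parse_fam(lines):
    """Parse PLINK .fam lines into FamRow tuples, skipping blank lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        rows.append(FamRow(*fields))
    return rows

# Hypothetical sample content; in practice pass an open file object instead.
sample = ["FAM001 ID001 0 0 0 -9", "FAM002 ID002 0 0 0 -9"]
for row in parse_fam(sample):
    print(row.family_id, row.individual_id)
```

The .fam file alone carries no population labels, which is why the master list matters.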

Here is the link to the file to download with all the above populations. I’ve pulled it down and run it, so I know it’s not malware.

So what now? The post will be divided into three portions.

1) Running this data in ADMIXTURE

2) Visualizing it in R

3) Manipulating this data in Plink

#1 is not contingent on #2 and #3, so I’ll do that first. You don’t need to read #2 and #3. In fact some of you might be really good at manipulating spreadsheet formatted data, so it might not be necessary to go to #2. But in the R section I’ll also have an easier spreadsheet output for you, so even if you don’t care for R’s visualization, you’ll at least have an easier-to-manage set of .csvs. #3 matters if you want to constrain your data set, and also add your own 23andMe file to the end of it.

#1 Running the data in ADMIXTURE

First, you need Linux or MacOS. If you are on Windows, the Wubi application allows you to have a dual boot. It runs Ubuntu Linux next to Windows, and you can uninstall it as if it is a Windows application.

I am doing this on Ubuntu Linux, for your information. Assuming you have the right operating system, now you need ADMIXTURE. You can put the folder anywhere.

You need to use the terminal to go to the folder where you have ADMIXTURE. The image to the left shows me doing so. You need to click the terminal application, and enter the “cd” command to get to the appropriate folder. My ADMIXTURE program is on the Desktop, within the “GA” folder, and the “admix2” subfolder. So I typed what you see. The “cd” command moves you around the folders, up and down. Google it if it confuses you, though without knowing what it does it should be fine if you just extract ADMIXTURE to the Desktop, and you type “cd Desktop”. This will clutter up your desktop in the future…but if you need to get some stuff done ASAP without knowing how to navigate in Linux, that should work.

So now you have ADMIXTURE, and the files which ADMIXTURE is going to analyze. What do you do? You need to make sure that ADMIXTURE and your files are in the same folder/location. So if ADMIXTURE is on the Desktop, just extract the files to the Desktop. Now you need to run a command. You see a screenshot of me running ADMIXTURE. You may need to omit the ./ (i.e., “admixture” vs. “./admixture”). You see the file name. The option -j2 is due to the fact that I have two cores. If you don’t know what that means, just omit it. It speeds up the run though. The last number is the K. So this is for K = 4.
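Since the screenshot may not be visible, here is a hedged Python sketch that assembles the kind of command described above: options first, then the .bed file, then K. The -j flag for threads and --cv for cross-validation are ADMIXTURE options as I understand them; the function name and the euraocean.bed file name are just illustrative, so check them against your own setup.

```python
import shlex

def admixture_command(bed_file, k, threads=None, cross_validation=False,
                      binary="./admixture"):
    """Build the argument list for one ADMIXTURE run (options, input file, K)."""
    command = [binary]
    if cross_validation:
        command.append("--cv")            # slower, as the text warns
    if threads:
        command.append(f"-j{threads}")    # omit if you are unsure about cores
    command += [bed_file, str(k)]
    return command

# Print the command rather than running it; paste it into the terminal yourself,
# or hand the list to subprocess.run() from the folder that holds ADMIXTURE.
print(shlex.join(admixture_command("euraocean.bed", 4, threads=2)))
```

Looping this over several values of K gives the overnight batch runs mentioned below.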

Now the program will run. How long depends on the size of the file, and the number of K’s. I often run the program overnight for larger K’s. If you want to get fancy and do stuff like cross-validation, it will take even longer. Be warned. The screenshot to the left is typical of what you’ll run into as ADMIXTURE does its thing. No worries, the algorithm is running. If you watch long enough you’ll get a sense of what values on the screen point to a high likelihood that it’s almost done, and you can start anticipating the output files from which you can make inferences.

Completion! To the right is what you’ll see when ADMIXTURE is done. As noted, there are output files. This is what is really interesting & useful, but even on this screen there’s goodness. The primitive matrix shows you Fst distances between putative ancestral populations. Fst is measuring the proportion of variance within the data set which can be attributed to between population variance. The smaller the value, the less the magnitude of differences between two populations. On this screen you see four populations, since I set K = 4. The Fst is generated from ancestral allele frequencies, which are within the output files. Remember, these are distances between abstract populations, not real ones.
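To give a feel for what an Fst value summarizes, here is a minimal Python sketch of a textbook Wright-style Fst between two populations, computed from allele frequencies at a handful of loci. This is an illustration under simple assumptions (two equally weighted populations, biallelic loci); ADMIXTURE's own estimator may well differ in its details.

```python
def fst(freqs_pop1, freqs_pop2):
    """Fst = (H_T - H_S) / H_T, summed over loci before taking the ratio.

    H_T is the expected heterozygosity of the pooled population,
    H_S the average expected heterozygosity within each population.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p_bar = (p1 + p2) / 2
        h_total = 2 * p_bar * (1 - p_bar)
        h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator else 0.0

# Made-up allele frequencies at three loci for two abstract populations.
print(fst([0.40, 0.10, 0.75], [0.60, 0.30, 0.70]))
```

Identical frequencies give 0, and populations fixed for different alleles give 1, which matches the intuition that smaller values mean smaller differences.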

The original files were euraocean.bed, euraocean.bim, and euraocean.fam. So the output files are like so:

euraocean.4.Q
euraocean.4.P

The 4 represents the K. The first file has a list of the proportions for putative ancestral populations for each individual in the data set, the individuals being on separate lines. The second file has all the allele frequencies for the ancestral populations, generated by the parameter K.

What do you do with this? euraocean.4.Q is related to euraocean.fam, which has family and individual IDs line by line. I don’t know how to use spreadsheets in anything but a primitive way, so I assume there are ways to merge the files and get each line to have ancestry proportions as well as more detailed IDs. Generating mean values for populations also seems essential.
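If spreadsheets are not your thing either, the pairing can be sketched in a few lines of Python. The only assumption is the one stated above: line N of the .Q file belongs to the individual on line N of the .fam file. The sample lines and IDs are invented for illustration.

```python
def pair_q_with_fam(q_lines, fam_lines):
    """Return [((family_id, individual_id), [fractions...]), ...] in file order."""
    q_rows = [[float(x) for x in line.split()] for line in q_lines if line.strip()]
    ids = [tuple(line.split()[:2]) for line in fam_lines if line.strip()]
    if len(q_rows) != len(ids):
        raise ValueError(f"{len(q_rows)} rows in .Q but {len(ids)} rows in .fam")
    return list(zip(ids, q_rows))

# Hypothetical content standing in for euraocean.4.Q and euraocean.fam.
q_sample = ["0.10 0.20 0.30 0.40", "0.25 0.25 0.25 0.25"]
fam_sample = ["FAM001 ID001 0 0 0 -9", "FAM002 ID002 0 0 0 -9"]
for (family_id, individual_id), fractions in pair_q_with_fam(q_sample, fam_sample):
    print(family_id, individual_id, fractions)
```

The length check is there because a mismatch usually means the .Q file came from a different run than the .fam file.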

But I use R to do this dirty work.

#2 Visualizing the output with R

If you don’t have R, you need to install it. If you don’t know how to start, control-f sudo. That should yank it down for you. Once R is installed, make sure to be in the folder where you have ADMIXTURE. Then type “R” (no quotes when you type a command!). Now you are in R, what do you do? Here are the specifics of what you need to do:

1) Take the Q file, pump it into a data frame

2) Take the master list, pump it into a data frame

3) Take the .fam file, pump it into a data frame

4) Mix & match

5) Calculate mean proportions, output populations, etc.

6) Visualize!
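Steps 1 through 5 can be sketched outside R as well. The Python below is a rough stand-in, not the script offered in this post: it joins the Q rows to the .fam IDs, looks up each individual in a master list, and averages by population. The master list column names ("id", "population") are hypothetical, so adjust them to whatever your .csv actually uses.

```python
import csv
import io

def population_means(q_lines, fam_lines, master_csv_text,
                     id_column="id", pop_column="population"):
    """Mean ancestry fractions per population, keyed by population label."""
    reader = csv.DictReader(io.StringIO(master_csv_text))
    population_of = {row[id_column]: row[pop_column] for row in reader}
    sums, counts = {}, {}
    for q_line, fam_line in zip(q_lines, fam_lines):
        fractions = [float(x) for x in q_line.split()]
        individual_id = fam_line.split()[1]          # second .fam column
        pop = population_of.get(individual_id, "unknown")
        running = sums.setdefault(pop, [0.0] * len(fractions))
        sums[pop] = [s + f for s, f in zip(running, fractions)]
        counts[pop] = counts.get(pop, 0) + 1
    return {pop: [s / counts[pop] for s in total] for pop, total in sums.items()}

# Made-up inputs for K = 2.
q_demo = ["0.40 0.60", "0.50 0.50", "0.10 0.90"]
fam_demo = ["FAM001 ID001 0 0 0 -9", "FAM002 ID002 0 0 0 -9", "FAM003 ID003 0 0 0 -9"]
master_demo = "id,population\nID001,bengali\nID002,bengali\nID003,punjabi\n"
means = population_means(q_demo, fam_demo, master_demo)
for pop, values in means.items():
    print(pop, [round(v, 3) for v in values])
```

Step 6, the bar plot, is left to R or a spreadsheet.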

If you needed to know how to install R, you probably don’t know how to do this. When I first started playing around with ADMIXTURE output files I wrote a quick & dirty script. I barely remember what I am doing with this script now, as I don’t care about the details. But it is now at your service. Still, first you need to do one thing: use a master list which is formatted slightly differently from the one that you downloaded. Here is the revised master list.

Put it in the same folder as ADMIXTURE. Then start R, again, by typing “R.” Run the command you see above. This creates an “HGDPMaster” data frame. That’s necessary for the script I’m giving you to run.

The script is here. If it doesn’t download, copy & paste, and create a file “Rstuff.R”, in the same folder as ADMIXTURE. There are a few variables which you have to manipulate. Here is the relevant section:

###############
# change these
###############
### output files
fileName<-"euraocean"
fileType<-"Q"

#### sets the range of K values (numbers of populations) to loop through
# lowest K
Start_K<-12
# highest K
End_K<-12

You need to change the file name to the one you have output. If you didn’t do any manipulation, it should be ref.2.Q for K = 2, so the name is “ref.” You also need to put in the number of K’s. I often run many simultaneously, which I have output files for in the morning. So I often start with 2 and end with 12. If you just want to output one, for example, 2, change Start_K to 2, and End_K to 2. These are the only variables you need to change. But there is a lot more you could do. R “comments” with #, so there is a section which I commented out where you can limit the output to particular populations to make the bar plot less busy. You’ll see what I mean if you look at the script, just remove all the #’s, and reedit as to your taste. Please note that casing matters, so make sure to keep it lower case when possible (if you looked at the master list, you understand). The script does have a string to upper case function, but that’s only for the output. There’s also a small section where you can reedit the names to your taste.
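Before sourcing the script it can help to confirm that the output files for your chosen K range actually exist. This small Python sketch mirrors the fileName, fileType, Start_K and End_K variables; the function names are my own, and the name.K.Q pattern is taken from the output file names shown earlier.

```python
import os

def expected_outputs(file_name, start_k, end_k, file_type="Q"):
    """File names the script will look for, one per K from start_k to end_k."""
    return [f"{file_name}.{k}.{file_type}" for k in range(start_k, end_k + 1)]

def missing_outputs(file_name, start_k, end_k, folder="."):
    """Subset of the expected files that are not present in the folder."""
    return [name for name in expected_outputs(file_name, start_k, end_k)
            if not os.path.exists(os.path.join(folder, name))]

print(expected_outputs("euraocean", 2, 4))
print("Missing:", missing_outputs("euraocean", 2, 4))
```

An empty "Missing" list means every run in the range finished and left its .Q file behind.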

To run the script, do like so:

source("Rstuff.R")

It should output bar plots, as well as generating some spreadsheet files. There’s a lot more you can do…but if you can do a lot more, you wouldn’t be reading this post. Let’s move to the next issue. So now you wonder: is there any way I can change the data file, or add myself to it? Read on….

#3 Using Plink to manipulate the data file

Now you need Plink. I usually put it within the same larger folder as a subfolder parallel with ADMIXTURE. You run the Plink command like so: “./plink” or, “plink.” Depends on the environment (remember, the quotes are only for the post!). There are many things you can do with Plink. I will show you how to do two things.

#1 remove individuals from the data set

#2 add yourself (or someone whose 23andMe file you have) to the data set

#1 is important because the plots get busy with too much variance. Additionally, Africans, and genetic isolates which have gone through population bottlenecks, tend to overwhelm ADMIXTURE. You probably want to remove them. To do this you need to use the remove option. You need to remove individuals.

Here’s one option with the file you’ve got:

./plink --bfile ref --remove removelist.txt --make-bed --out refRemoved

What’s going on above? You’re using a binary pedigree file, so you have the --bfile option on. You do the deed with --remove, and then you create a second binary pedigree file, refRemoved. So you’ll have refRemoved.bed, refRemoved.bim, and refRemoved.fam. Obviously removelist.txt has what you want to remove. Each line has a family ID and individual ID, separated by a space, of those who you want to remove. The easiest way is probably to open up the master list. For the one I gave you above the last column is the family ID and the first is the individual ID. Cut & paste the first column after the last, delete the other columns, and save. I usually get rid of quotations and tabs, change it to a .txt file, and there you have it.
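The cut & paste routine can also be sketched in Python. The layout assumed here follows the description above (individual ID in the first column, family ID in the last); which column holds the population label is a guess, so it is passed in as a parameter, and the sample rows are made up.

```python
import csv
import io

def build_remove_list(master_csv_text, populations_to_drop, pop_column):
    """Return 'familyID individualID' lines for everyone in the listed populations."""
    drop = {name.lower() for name in populations_to_drop}
    lines = []
    for row in csv.reader(io.StringIO(master_csv_text)):
        if not row:
            continue
        if row[pop_column].strip().lower() in drop:
            lines.append(f"{row[-1].strip()} {row[0].strip()}")
    return lines

# Hypothetical master list: individual ID, population, family ID.
master = "ID001,yoruba,FAM001\nID002,bengali,FAM002\nID003,mbuti,FAM003\n"
remove_lines = build_remove_list(master, {"Yoruba", "Mbuti"}, pop_column=1)
print("\n".join(remove_lines))
# To use it with Plink, write the lines to removelist.txt:
# with open("removelist.txt", "w") as handle:
#     handle.write("\n".join(remove_lines) + "\n")
```

The csv module strips the quotation marks for you, which takes care of one of the manual cleanup steps.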

But what about your 23andMe file? You need to convert it to pedigree. I have created a quick & dirty perl script to do so. You can find it here. Download or cut & paste it. You need to remove the comments at the top of the 23andMe file. That is, you need to remove everything before the first SNP. Assuming that’s done, do this at the command line within the folder where you put the script (you get to that folder with “cd” recall):

perl convert.pl "YourFileName" "001" "001"

The script fires, gets the file name from the first parameter, and outputs two files, YourFileName.ped and YourFileName.map. What about the two other parameters? They’re generating your family ID and individual ID. They’d be FAM001 and ID001 in this case. You need to enter these into the master list! Otherwise you won’t come out on the bar plots. Also enter your ethnicity, etc. Or, just your name if you want to be your own slice of the bar plot.
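For readers curious what such a conversion involves, here is a rough Python sketch. It is not the perl script linked above and may differ from it in details; it assumes the usual 23andMe raw layout (rsid, chromosome, position, genotype, with comment lines starting with #) and the FAM/ID prefixes described in the text.

```python
def convert_23andme(lines, family_suffix, individual_suffix):
    """Return (ped_line, map_lines) built from 23andMe raw-data lines."""
    map_lines, alleles = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                              # header comments are skipped
        rsid, chromosome, position, genotype = line.split()[:4]
        if len(genotype) == 1:                    # haploid call, e.g. male X
            genotype = genotype * 2
        first, second = (("0" if a == "-" else a) for a in genotype[:2])
        map_lines.append(f"{chromosome} {rsid} 0 {position}")
        alleles += [first, second]                # "0 0" marks a no-call
    # Parents and sex unknown (0), phenotype missing (-9).
    header = [f"FAM{family_suffix}", f"ID{individual_suffix}", "0", "0", "0", "-9"]
    return " ".join(header + alleles), map_lines

# Made-up raw data for illustration.
raw = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002\t1\t2000\t--",
    "rs0000003\tX\t3000\tC",
]
ped_line, map_lines = convert_23andme(raw, "001", "001")
print(ped_line)
print("\n".join(map_lines))
```

Writing ped_line to YourFileName.ped and map_lines to YourFileName.map would give Plink the pair of text files it expects.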

Note that you have .ped, not .bed, files. These are big. Now you need to convert the text to binary pedigree. Move the YourFileName files to the plink folder. Make binary:

./plink --file YourFileName --make-bed --out YourFileName

Now you have YourFileName.bed YourFileName.bim YourFileName.fam. It is best to limit your SNPs to the same as those in the reference data set. So get those from the reference:

./plink --bfile ref --write-snplist --out SNPs

You should have a file, SNPs.snplist. Use them to filter your 23andMe file.

./plink --bfile YourFileName --extract SNPs.snplist --make-bed --out YourFileNameFiltered
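It can be reassuring to see how many of your SNPs survive the filter. This Python sketch compares a list of SNP IDs (one per line, as in SNPs.snplist) against the second column of a .bim file, which holds the SNP ID. The sample content is invented.

```python
def snp_overlap(snplist_lines, bim_lines):
    """Return (shared, in_your_file, in_reference) SNP counts."""
    reference = {line.strip() for line in snplist_lines if line.strip()}
    yours = {line.split()[1] for line in bim_lines if line.strip()}
    return len(reference & yours), len(yours), len(reference)

# Hypothetical content standing in for SNPs.snplist and YourFileName.bim.
snplist_demo = ["rs0000001", "rs0000002", "rs0000009"]
bim_demo = [
    "1 rs0000001 0 1000 A G",
    "1 rs0000002 0 2000 C T",
    "1 rs0000003 0 3000 G A",
]
shared, yours, reference = snp_overlap(snplist_demo, bim_demo)
print(f"{shared} of your {yours} SNPs are among the {reference} reference SNPs")
```

If the shared count is far below the reference count, the merged data set will carry a lot of missing genotypes for you.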

Now you want to merge:

./plink --bfile ref --bmerge YourFileNameFiltered.bed YourFileNameFiltered.bim YourFileNameFiltered.fam --make-bed --out ref

You are now appended to the reference data set! If you open up the ref.fam file your family ID and individual ID should be at the end of the list.

If you’ve slogged through this far, I thought it would be nice to end with something which shows what this is all about. Below I’ve filtered the reference data set of most African and New World populations, and run it from K = 2 to K = 12. It took about 10 hours to complete. I’ve also limited the populations to display using the script above so that it isn’t too cluttered. Here are the spreadsheets generated from the runs (they will be in the folder where you run the R script, and have the form “K = 2” and such for names).

[Image gallery: ADMIXTURE bar plots for K = 2 through K = 12]

(Republished from Discover/GNXP by permission of author or representative)
 

Zack Ajmal now has over 50 participants in the Harappa Ancestry Project. This does not include the Pakistani populations in the HGDP, the HapMap Gujaratis, or the Indians from the SGVP. Nevertheless, all these samples still barely cover the vast heart of South Asia, the Indo-Gangetic plain. Here is the provenance of the submitted samples Zack has so far:

  • Punjab: 7
  • Iran: 7
  • Tamil: 6
  • Bengal: 5
  • Andhra Pradesh: 2
  • Bihar: 2
  • Karnataka: 2
  • Caribbean Indian: 2
  • Kashmir: 2
  • Uttar Pradesh: 2
  • Sri Lankan: 2
  • Kerala: 2
  • Iraqi Arab: 2
  • Anglo-Indian: 1
  • Roma: 1
  • Goa: 1
  • Rajasthan: 1
  • Baloch: 1
  • Unknown: 1
  • Egyptian/Iraqi Jew: 1
  • Maharashtra: 1

Again, note the underrepresentation of two of India’s most populous states, Uttar Pradesh, ~200 million, and Bihar, ~100 million. Nevertheless, there are already some interesting yields from the project. Below I’ve reedited Zack’s static images (though go to his website for something more dynamic) with the labels of individuals. I’ve highlighted myself and my parents with the red pointers.

To the left is a set of plots and tables which I’ve spliced together from Zack’s various posts. What you need to know is that this is at K = 12, and I’ve used the labels that Zack gave the various putative “ancestral populations” which emerged out of his ADMIXTURE runs. I’ve also displayed the participants in the Harappa Ancestry Project so far, with their ethnic labels. Finally, smack in the middle you see the Fst values, standardized by the smallest between population difference. So the values in the boxes represent the genetic distances for the inferred ancestral populations in the row and column (I also rounded, since I didn’t want to give the impression of excessive precision). This last point is important: these are not between population distance measures across real populations. Rather, they’re distance measures across the inferred allele frequencies of the populations which emerge out of the parameters you constrain ADMIXTURE to, as well as the genetic variation which you throw into the pot for the algorithm in the first place.
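The standardization described above amounts to dividing every pairwise Fst by the smallest one and rounding. A minimal Python sketch, using a made-up 3 × 3 matrix rather than Zack's actual values:

```python
def standardize_fst(matrix):
    """Divide each off-diagonal Fst by the smallest off-diagonal value, then round."""
    size = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(size) for j in range(size) if i != j]
    smallest = min(off_diagonal)
    if smallest <= 0:
        raise ValueError("smallest between-population Fst must be positive")
    return [[0 if i == j else round(matrix[i][j] / smallest) for j in range(size)]
            for i in range(size)]

# Hypothetical Fst values between three inferred populations.
fst_demo = [
    [0.00, 0.05, 0.20],
    [0.05, 0.00, 0.15],
    [0.20, 0.15, 0.00],
]
for row in standardize_fst(fst_demo):
    print(row)
```

After this step the closest pair reads as 1 and everything else is a multiple of that distance, which is easier to scan than raw decimals.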

In the broadest sense the first thing that jumps out at you is the high distance value between “Papuans” and everyone else. This is interesting. In fact, the genetic distance between Papuans and other ancestral populations is greater than the genetic distance between the putative African populations and other non-Africans, except Papuans. This goes to the point that you need to be very careful in making definitive inferences from these sorts of programs. Interestingly, the population to which the Papuans exhibit the least genetic distance are the “South Asians.” What does that mean? I think this has a straightforward explanation. I believe that the South Asian cluster is a hybridized compound, as suggested by Reconstructing Indian History, and that the populations of Oceania represent a relatively “pure” eastern expansion of long resident southern Asian groups which have generally been submerged by admixture with other groups intrusive to the region. This also explains the fact that Cambodians share some of this Papuan component with various South Asian populations. Finally, I wouldn’t make too much of this, but in some ADMIXTURE runs which I’ve done the genuine Papuan population in the HGDP data set breaks into two ancestral components, of which the southern Asian groups from Pakistan to Cambodia share only one. Remember that Oceania was settled initially by Melanesians and Australians ~40-50,000 years ago, and it looks like the people of Melanesia and indigenous Australians date to this initial period. So connections between southern Asians and Papuans are likely very old, and the two groups have been distinctive for a long time.

As for the South Asian individuals surveyed so far, there’s nothing that surprising. The South Asian element tends to increase as one goes south and east. This is what you’d expect. And the Pakistan/Caucasian component, which spans much of western and central Asia, is what connects the Iranian samples to the South Asian ones. The Iranians have very little of the South Asian component. This makes sense if the South Asian element is simply an outcome of an admixed population, and one of the ancestral groups from which this component derives, “Ancestral South Indians,” was generally not present to the west of Pakistan. The eastern Asian components are enriched among Bengalis, as you’d expect, but they’re found in different proportions among many individuals who hail from the northern fringe of South Asia more generally. It seems clear that the further west you go, the more likely the “eastern” element is going to be Turk, while the further east (and to some extent south) you go, the more likely it is to be more southerly in provenance. Most of the other patterns are as you would expect. Finally, I’d like to point out that I suspect that Zack is the first one to post the ancestral fractions of someone from the Nadar caste using SNP-chip markers.

Here are all the details about participation.

(Republished from Discover/GNXP by permission of author or representative)
 

Zack has started to improve on static R plots with Google powered charts. Check it out. Alas, I can’t inject script tags into the body of my posts, so that’s not feasible for me. Notice on Zack’s plot that I’m more East Asian than either of my parents. The tendency first cropped up with 23andMe’s ancestry painting, and I have seen it in my own ADMIXTURE runs, so I don’t dismiss it as a V2 vs. V3 chip artifact anymore. Though I’ve ordered an upgrade myself, so we’ll see for sure. Also, though both my parents are about equally East Asian, they exhibit a different balance of East Asian subcomponents. I’ve seen this in my own ADMIXTURE runs, and I’m going to check for more fine-grained matches with the HGDP East Asian populations soon to ascertain whether their eastern ancestral mix is different. Good times.

(Republished from Discover/GNXP by permission of author or representative)
 
• Category: Science • Tags: Genetics, Genomics, Harappa Ancestry Project 

Just some pointers. Dr. Daniel MacArthur has put up a guest post where I outline my own experience with personal genomics. Cool times that we live in. Also, Zack Ajmal has started posting higher K’s of HAP participants. He’s now in the second batch. My parents will be in the third. Lots of Tamils and Punjabis. The Khans are the only Bengalis so far. One individual to represent all of Uttar Pradesh. Here’s a list of participants so far.

Finally, I know 3-D visualization is bad form, but I went for it anyway. Below is a cube which shows the positions of Gujaratis, Chinese, Mexican Americans, and Utah whites and Tuscans from the HapMap, along with a few extra samples from friends and family. Can you tell where my parents are?


(Republished from GNXP.com by permission of author or representative)
 
• Category: Science • Tags: Genetics, Genomics, Harappa Ancestry Project 

Since I know plenty of friends are getting, or just got, their V3 results, I thought I’d pass this on: Open-ended submission opportunity for 23andMe data (#2).

Who is eligible

Everyone who is of European, Asian, or North African ancestry and all four of his/her grandparents are from the same European, Asian, or North African ethnic group or the same European, Asian, or North African country.

Also, Zack has more than 30 individuals in HAP. The “cow belt” is still way underrepresented. The only Bengalis in the data set are my parents.

(Republished from Discover/GNXP by permission of author or representative)
 

Zack has finally started posting results from HAP. To the left you see the results generated at K = 5 from his merged data set with the first 10 HAP members. I am HRP002. Zack is HRP001. Paul G., who is an ethnic Assyrian, is HRP010. Some others have already “outed” themselves, so I could proceed via process of elimination for the other bars. There isn’t anything very surprising here. Zack is 1/4 Egyptian, so he has a rather diverse ancestry. Jatts, who are from Northwest India, are known to have more affinity with populations to the west than those of us from the east or south of the subcontinent. With just that knowledge you can make some educated guesses as to what the “ancestral components” inferred from ADMIXTURE might correspond with in a concrete sense. After submitting to Dodecad and the BGA Project I pretty much know what to expect in relation to me. I’m a rather generic South Asian, except, I have an obvious input of “eastern” ancestry.

This is what Dienekes also found. Aggregating various ancestral components together to be analogous to what Zack produced at K = 5, you get the bar plot below from his runs:

I assume that all ancestry analyses will find that I have a substantial minority of East Eurasian ancestry. I have a similar amount of ancestry which is obviously connected to West Eurasia. And the rest of my ancestry is going to fall into the catchall which is “South Asian,” which Reich et al. in Reconstructing Indian History argued was in fact a compound between a West Eurasian-like population (“Ancestral North Indian,” ANI) and a South Eurasian population (“Ancestral South Indian,” ASI) which was more closely related to East Eurasians than West Eurasians, though distantly so at that (modern West Eurasians are interchangeable with ANI, but ASI do not exist in unadmixed form).
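The aggregation mentioned above (collapsing fine-grained components from a higher-K run into a handful of broad categories) is just a sum over groups. A rough sketch in Python follows; the component names, the fractions, and the grouping are all hypothetical placeholders, not Dienekes’ or Zack’s actual labels or my real results.

```python
def aggregate_components(fractions, grouping):
    """Sum fine-grained ancestry fractions into broader groups.

    Components missing from `grouping` keep their own label.
    """
    totals = {}
    for component, value in fractions.items():
        group = grouping.get(component, component)
        totals[group] = totals.get(group, 0.0) + value
    return totals


# Hypothetical individual from a higher-K run (fractions sum to 1).
fractions = {
    "south_asian": 0.55,
    "west_a": 0.15,
    "west_b": 0.10,
    "east_a": 0.12,
    "east_b": 0.08,
}
# Hypothetical mapping of fine components onto broad categories.
grouping = {
    "west_a": "West Eurasian",
    "west_b": "West Eurasian",
    "east_a": "East Eurasian",
    "east_b": "East Eurasian",
}
totals = aggregate_components(fractions, grouping)
for group, value in totals.items():
    print(f"{group}: {value:.2f}")
```

Summing is only sensible when the components being merged really are close to one another; which ones to merge is a judgment call made by looking at the Fst values between them.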

Finally, here’s an analysis of chromosome 1 and its affinities to various reference populations. I’ve labelled myself. No surprises:

I am HRP002 in HAP. DOD075 in Dodecad. IN8 in BGA. I am willing to submit to any of these new grassroots ancestry projects if they want me. But I doubt I’ll find anything too surprising now. They converge upon the same rough proportions (as they should).

I’m at the stage where I want to look more deeply into the details of how long ago the “eastern” admixture occurred. It seems to come down from both parents. If it was very recent there should be some linkage disequilibrium detectable because recombination should not have broken down the allelic associations distinctive to each ethnic group yet (this is noticeable in African Americans). But I am not so sure it is recent anymore, as I’d thought. I suspect that Tibeto-Burman and Munda elements were absorbed by Bengali peasants in the course of demographic expansion in what became Bangladesh between 1000 and 1500 A.D., and that ancestry is well distributed across the population now.
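To give a sense of the timescales, here is a back-of-the-envelope sketch of how admixture linkage disequilibrium fades. It uses the textbook approximation that, after a single pulse of admixture followed by random mating, LD between two loci with recombination fraction r shrinks by a factor of (1 - r) each generation. The single-pulse model, the 1 cM spacing, and the generation counts are my assumptions for illustration, not estimates from any data.

```python
def admixture_ld_remaining(recomb_fraction, generations):
    """Fraction of the initial admixture LD left after some generations.

    Assumes one admixture pulse followed by random mating, so that
    D_t = D_0 * (1 - r) ** t.
    """
    if not 0.0 <= recomb_fraction <= 0.5:
        raise ValueError("recombination fraction must be between 0 and 0.5")
    return (1.0 - recomb_fraction) ** generations


# Two markers roughly 1 cM apart (r of about 0.01, an assumed value).
r = 0.01
# 5 generations: very recent admixture. 20 to 40 generations: roughly the
# 1000 to 1500 A.D. window, assuming 25 to 30 years per generation.
for t in (5, 20, 40):
    print(t, round(admixture_ld_remaining(r, t), 3))
```

The longer ago the mixing, the shorter the distance over which the associations survive, which is why recent admixture shows up so clearly and old admixture does not.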

But even though I won’t find anything out for myself, the reason HAP and projects like it are useful is that we need better coverage of the world’s variation. There are big coarse questions which we’ve tapped out, but there are still lots of gaps to fill. I’m willing to do my part in that (or, more precisely, at this point I’ve drafted my parents into the role, since they aren’t related and so represent two independent data points for Bengal).

Addendum: I know for many people of European ancestry this sort of thing doesn’t tell them anything new. Not so for me. I always suspected East Asian admixture due to the phenotype of my extended family (and to some extent, me. I did not need to shave regularly until my 20s), but I was always curious as to its extent. Additionally, for the reasons of phenotype I had assumed my mother had very little of such ancestry while my father had a great deal. It turns out that in fact my mother may marginally be more “eastern” than my father.

(Republished from GNXP.com by permission of author or representative)
 

Zack has started exploring the K’s of his merged data set for HAP. A commenter suggests that:

As you have begun interpreting the reference results, let me make a friendly warning: you have to keep in mind that most of the reference populations of ethnic groups are extremely limited in sample size (with only between 2 and 25 individuals) and from very obscure sources, and you should keep away from drawing conclusions about millions of people based on such limited number of individuals.

This seems a rather reasonable caution. But I don’t think such a vague piece of advice really adds any value. These sorts of caveats are contingent upon:

- The scope of the question being asked (i.e., how fine a grain is the variation you are attempting to measure going to be)

- The sample size

- The representativeness

- The thickness of the marker set (10 autosomal markers vs. 500,000 SNPs)


This isn’t a qualitative issue, easily divided into “right” and “wrong.” Sometimes an N = 1 is very insightful. That’s why the whole genome of one Bushman was very useful. In fact, the whole genome of any random Sub-Saharan African, and the whole genome of any random non-African (this means ancestry from before 1500 in those regions), is going to reflect clearly the differences between these two broad population sets in terms of genomic variation. Subsequent addition of individuals to generate a larger sample would be very informative of course, and allow us to answer many more questions. But the point is that even small sample sizes can answer properly framed queries.

Another issue is representativeness. The HGDP data set was biased at the outset toward more isolated and distinctive groups. There was a belief that many of these groups were going to disappear within a generation, and their genetic uniqueness should be recorded (this seems to have been correct). So apparently the clusters generated from HGDP are “cleaner” in their separation than those from the POPRES sample, which is derived from a more cosmopolitan urban set of populations. We also have the HapMap sample, and some of the ones Zack has merged into HGDP and HapMap (there are likely other public data sets, Zack was looking for those with South Asians).

After 10 years of results generated from these data sets I think we have some idea of the errors and biases introduced because of skewed representativeness and small sample size (HapMap has a thicker marker set, but HGDP has better population coverage). In other words, we should have some intuition of where to be careful, and where not to be. For example, small tribal groups are likely to exhibit genetic distinctiveness (as are cultural isolates, like the Roma) due to low long-term effective population size. On the other hand, if you have a set of distinct tribal groups, one presumes that the common patterns would reflect broad macro-regional genetic variation. In Zack’s combined data set he has a South Indian tribe and a Pakistani one (I mean Kalash; I understand Pathans and Baloch are tribal people, but they’re expansive and heterogeneous). Any common element between these two groups in relation to Iranians is presumably not a coincidence. Random genetic drift usually results in different allele frequencies between populations, so genetic commonalities between different isolates probably reflect common ancestry.
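The link between small effective population size and distinctiveness can be made concrete with the standard idealized (Wright-Fisher) expectation for drift, F_t = 1 - (1 - 1/(2Ne))^t. The sketch below is only that idealized formula; the population sizes and the number of generations are invented for illustration.

```python
def expected_drift_f(effective_size, generations):
    """Expected drift-induced F after some generations of isolation.

    Idealized Wright-Fisher population: F_t = 1 - (1 - 1/(2*Ne)) ** t.
    """
    if effective_size <= 0:
        raise ValueError("effective population size must be positive")
    return 1.0 - (1.0 - 1.0 / (2.0 * effective_size)) ** generations


# Hypothetical comparison: a small isolate versus a large regional
# population, both followed for 100 generations.
generations = 100
for ne in (500, 5000, 50000):
    print(ne, round(expected_drift_f(ne, generations), 4))
```

Drift piles up much faster in the small isolate, but it pushes each isolate in its own random direction, which is why shared signal across separate isolates points to common ancestry.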

The main point I’m trying to make is that we’re beyond the point of generic cautions. Rather, there are specific pitfalls which we need to be cognizant of. So if you know specific ethnographic details, that is useful. If there are statistical tricks and tips, that is also useful (larger sample sizes exhibit diminishing returns in statistical power). Also, one needs to keep in mind ascertainment bias: the current generation of SNP chips is tuned to European polymorphisms, so they might miss the loci where other populations are polymorphic but Europeans are not.
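On the diminishing returns point, a quick sketch: under simple binomial sampling the standard error of an estimated allele frequency is sqrt(p(1 - p) / 2N) for N diploid individuals, so quadrupling the sample only halves the error. The frequency p = 0.5 and the sample sizes below are chosen purely for illustration (2 and 25 echo the range the commenter cites).

```python
import math


def allele_freq_se(p, n_individuals):
    """Standard error of an allele frequency estimated from N diploids.

    Binomial sampling of 2N chromosomes: SE = sqrt(p * (1 - p) / (2 * N)).
    """
    if n_individuals <= 0:
        raise ValueError("need at least one individual")
    return math.sqrt(p * (1.0 - p) / (2 * n_individuals))


# Illustrative sample sizes; p = 0.5 is the worst case for the error.
for n in (2, 25, 100, 400):
    print(n, round(allele_freq_se(0.5, n), 4))
```

This is per marker. With hundreds of thousands of SNPs the noise at individual loci averages out for coarse questions, which is part of why a handful of individuals can still be informative.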

By analogy, unsecured credit can be problematic. Yes, I think we knew that. The key is to identify those with the means and ability to use credit responsibly. The tools and data are now available to the masses. A big “BE CAREFUL” sticker is not helpful. What is helpful are concrete and specific pointers.

For what it’s worth, I found Zack’s bar plot hard to read, so here is one I generated with larger labels (K = 6):

Yesterday Zack gave me a personal vector: 66, 1, 4, 10, 14, 0, 4, 0, 0, 3. If you’ve been reading my posts I think you know how to interpret that….

(Republished from Discover/GNXP by permission of author or representative)
 

Zack is going to post the first batch of results from HAP tomorrow. It looks like he’s going to be using mostly the merged HGDP, HapMap, SVGP, and Behar data set, supplemented by a second set which also merges the Xing et al. sample (the intersection of Xing et al. with the other results is a much smaller number of SNPs, but it includes better coverage of various South Asian groups). He’ll initially be posting ADMIXTURE estimates as you’ve seen on Dodecad. I’m especially interested in the Anglo-Indian and Roma individuals who have sent Zack their samples. I don’t know of any genomic investigation of the former community, while the published research on Roma genetics doesn’t include SNP-chip results (usually they’re mtDNA, Y, or only a few autosomal markers). I’d be curious about possible evidence of homozygosity or linkage disequilibrium in the Roma individual due to the population bottlenecks which other studies have detected (I assume that’ll be in the future). The Roma are to a good approximation an admixture of Indian, West Asian, and European (often Balkan) groups, but their history of endogamy, and of small founding groups experiencing rapid demographic expansion, is also critical to remember.

Here is the regional breakdown so far:

Punjab: 7
Tamil: 4
Iran: 3
Bengal: 2
Andhra Pradesh: 2
Bihar: 1
Anglo-Indian: 1
Roma: 1
Karnataka: 1
Kashmir: 1

He swapped my parents in for me. My father will be HRP0022, and my mother HRP0023. I am HRP0002 (there will be some results where I am included, though I won’t be in the “founder” runs since I am just a combination of my parents). Remember, 23andMe now comes to about $260 for the first year ($199 up front, plus $5 per month for 1 year). It now looks like the V3 analyses will probably have a 4-5 week turnaround. Though the Indian Diaspora probably can afford these costs, one worry I have is an underrepresentation of Dalits and tribal people in these communities (though some of the public samples include these, the tribals especially exhibit evidence of peculiarities due to genetic drift, so extrapolating from a few tribes might be very problematic in terms of representativeness). Gujarat is covered by the HapMap, and Pakistan is well covered by the HGDP. SVGP and HAP seem to have Tamils covered well too. What is really missing are the vast swaths of the center-north, which aren’t in the public data sets or HAP.

Here’s how to get involved. And the Facebook page. Finally, I know some people have been checking in somewhat obsessively, but just subscribe to the RSS. You never know when Zack’s daughter will monopolize the box on which ADMIXTURE is running, postponing the reporting of results!

(Republished from Discover/GNXP by permission of author or representative)
 
• Category: Science • Tags: Harappa Ancestry Project, Personal Genomics 

Zack has been posting his data sources, as well as how he filtered and formatted them, all this week. I assume that the first wave of results will be online soon. As of yesterday, this is what he had (I know he got some more today):

- Punjab 7
- Bengal 1
- Bihar 1
- Tamil 5
- Karnataka 1
- Anglo-Indian 1
- Roma 1
- Iran 3

Whole swaths of north-central India are missing. I am hopeful that more people will join in after the first wave of results is put out there. But, from what I have discussed with Zack, it looks plausible that the very first wave will have a richer set of results because of the necessity of preliminary steps. So there’s some benefit in getting in early. It’s really ridiculous to have literally 1 sample representing the 300 million people of Uttar Pradesh and Bihar. That’s about 25% of South Asians represented by one person. I’ve gotten a commitment from one friend who was born in U.P. to give his data up once it comes in, but there have to be others out there. (The Bengali N should go up to 2 when I swap my parents in for me.)

The public data sources have Gujaratis, Tamils, Pakistanis (Punjabis, Pathans, Sindhis), and some South Indian groups (Tamil and Telugu). This leaves a blank spot on the North Indian plain.

Here’s the brief for the project again.

(Republished from Discover/GNXP by permission of author or representative)
 

Last week I announced the Harappa Ancestry Project. It now has its own dedicated website, http://www.harappadna.org. Additionally, it has its own Facebook page. For Zack to get his own URL he needs about 10 more “likes,” so please like it! (if you are so disposed) Finally, from what I’ve heard the first wave of the 23andMe holiday sale results are coming online this week. Actually, one of the relatives who I purchased the kit for is in processing currently, so I know that we should have a bunch of new people in the system very, very, soon.

Speaking of people, last I heard Zack had gotten about a dozen responses. That’s enough to start an initial round of runs, but obviously he needs more people. More importantly, the goal here is to get better population coverage. One of the things we know intuitively and also from the most current research is the existence of a lot of within-region population variation in South Asia which is structured by community. In other words, a sample of 30 people, where you have 3 from each of 10 different communities exhibiting geographical and caste diversity, is going to be far more useful right now than 300 Jatts from Indian Haryana. Getting 300 Jatts from Haryana would be interesting in that it would give you a window into intra-communal variance, but there are diminishing returns on the inferences you could make about South Asians as a whole.

If you know someone who has done the 23andMe testing and has preponderant ancestry from South Asia, Iran, Burma, or Tibet, please forward the URL for the Harappa Ancestry Project. If you are a 23andMe member, and involved in the forums, it might be useful to post a comment thread on this project, as the people you share genes with would see it.

(Republished from Discover/GNXP by permission of author or representative)
 

A few weeks ago I hinted at a South Asian equivalent to Dodecad & Eurogenes BGA. It is now public and in the data collection phase. You can read the whole thing here:

http://www.zackvision.com/weblog/2011/01/harappa-ancestry-project

This is the feed:

http://www.zackvision.com/feed/

If your ancestry is from these nations:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

Read on! If not, “for entertainment purposes only”….


I have been griping in public and in private about the “reference” populations used for South Asian genomics for years. Because of the Permit Raj the HGDP had to use Pakistani populations. Additionally, because of the HGDP’s mandate to focus on smaller groups which might harbor genetic uniqueness you have some very obscure tribes, but only one sample set from an Indo-Aryan speaking population. And even there, it was a minority, not the Punjabi speaking majority of Pakistan.

Some of this has changed in recent years. Papers such as Reconstructing Indian History and Genetic diversity in India and the inference of Eurasian population expansion have added more populations to the mix. The current phase of the HapMap has Gujaratis from Houston. But there is always a problem when you take a small population set to be representative of a broader group. There are ~1.3 billion South Asians. Using Gujaratis from Houston, who are likely to be of a narrow range of castes, is still problematic. Because of the long history of endogamy and likelihood of fine-grained caste and geographical structure good population coverage is of the essence for South Asians. Taking the Beijing HapMap sample as representative of Han Chinese is not optimal, but this sort of thing would be far less optimal in South Asia.

So when Dienekes began the Dodecad Ancestry Project I was very curious. I had had ADMIXTURE for a while, but it prompted me to start playing around with it myself. My plan was to wait to see how Dienekes fared. In particular, what didn’t pan out in terms of fruitful use of labor. Mine is finite, like everyone’s. My medium term plan was to start up a South Asian equivalent to Dodecad at some point in the first half of 2011.

Then Zack approached me. I’ve known Zack from the internet since 2003 through the blogs. His primary interest in blogging was Pakistani culture and liberal politics (he’s Pakistani American and a liberal). But he also has a doctorate in electrical engineering, so he has some technical skillz. It turns out that because of Zack’s own peculiar genetic background (he’s 1/4 Egyptian) he kept asking me questions. Eventually it became clear that he was interested in starting something similar to Dodecad…and I told him my own future plans, and encouraged him to take up the torch immediately. I knew Zack had the technical chops, and also could probably devote more time and energy at the time than I could.

I immediately gave him my 23andMe sample. Since I had Dienekes already run my genome we kind of knew what to expect. And it looks like Zack has the software running well. He included a Nepali sample, and it turns out that in an MDS clustering I fell 71% into the dominant Nepali cluster. This is kind of what I expected.

In any case, the details:

Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample.

If you are unsure if you are eligible to participate, please send me an email (harappa@zackvision.com) to inquire about it before sending off your raw data.

What to send?
Please send your All DNA raw data text file (zipped is better) downloaded from 23andme to harappa@zackvision.com along with ancestral background information about you and all four of your grandparents. Background information would include where they were born, mother tongue, caste/community to which they belonged, etc. Please provide as much ancestry information as possible and try to be specific. Do especially include information about any ancestry from outside South Asia.

Data Privacy
The raw genetic data and ancestry information that you send me will not be shared with anyone.

Your data will be used only for ancestry analysis. No analysis of physical or health/medical traits will be performed.

The individual ancestry analysis published on this blog will be done using an ID of the form HRPnnnn known to only you and me.

What do you get?
All results of ancestry analysis (individual and group) will be posted on this blog under the Harappa Ancestry Project category. This will include admixture analysis as well as clustering into population groups etc.

I suggest you read about Dienekes’ analysis on South Asians for an idea about what to expect.

You can access all blog posts related to this project from the Harappa Ancestry Project link on the navigation menu on every page of my website. You can also subscribe to the project feed.

If you’re South Asian, Iranian, Burmese, or Tibetan, and have had 23andMe genotyping done, you know what to do. If you know someone from these groups who has had that done, please forward this on.

(Republished from Discover/GNXP by permission of author or representative)
 
• Category: Science • Tags: Genetics, Genomics, Harappa Ancestry Project 
Razib Khan
About Razib Khan

"I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. If you want to know more, see the links at http://www.razib.com"