The Unz Review: An Alternative Media Selection
ADMIXTURE, African Ancestry Project, and Confirmation Bias
I’ve been running the African Ancestry Project on the side, via Facebook, for a while now. But it’s getting unwieldy, so I finally set up the website. The main reason I started it is that there have been complaints for a while about problems with the 23andMe “ancestry painting” and the like for some African groups. For example, a Nubian might come out as 70% “European.” One might argue that this is due to Arab admixture, but a look at the PCA plot shows that this is clearly not so. What’s going on? Probably some combination of a problem with the reference populations (only Yoruba for Africa), ascertainment bias in the chip (the SNPs are tuned to European variation), and the fact that high African genetic diversity can trip up these methods. I don’t know. But the problem has been persistent, and since most of the other genome blogging projects exclude Africans because they’re so genetically diverse, I decided to take it on.

Three groups of people have submitted:

– People of the African Diaspora in the New World

– People from Africa, disproportionately Northeast Africans (Horn of African + Nubia, etc.)

– People of some suspected or known minor component of African ancestry

I’m at ~70 participants now. As one reference population set I’ve been using a subset of Henn et al. as well as some populations from Behar et al. I call this my “thin” set, since it has only ~40,000 SNPs; a “thick” set has on the order of 300-400 thousand markers, but fewer populations. I’ve been putting the AAP members through ADMIXTURE in batches of 10, but I also sometimes run them all together for apples-to-apples comparisons. Yesterday I ran AF001 to AF070 from K = 2 to K = 14, unsupervised, with the thin reference. If you want to see all the results, go here. Doing all this myself over and over has given me some intuition as to the pitfalls in this sort of analysis, especially in the area of confirmation bias.
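A “thin” marker set is just a random subsample of SNP columns (tools like PLINK offer this via `--thin`/`--thin-count`). The idea fits in a few lines; here is a minimal Python sketch over a toy genotype matrix (all the data below is made up for illustration, not from the AAP):

```python
import random

def thin_markers(genotypes, n_keep, seed=0):
    """Randomly subsample SNP columns to build a 'thin' marker set.

    genotypes: list of per-individual genotype lists (0/1/2 allele counts).
    Returns the thinned matrix and the (sorted) chosen column indices.
    """
    n_markers = len(genotypes[0])
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(n_markers), n_keep))
    thinned = [[row[i] for i in keep] for row in genotypes]
    return thinned, keep

# Toy example: 3 individuals x 10 markers, thinned down to 4 markers.
rng = random.Random(99)
geno = [[rng.randint(0, 2) for _ in range(10)] for _ in range(3)]
thin, kept = thin_markers(geno, 4)
```

The trade-off discussed above is exactly this: fewer columns means faster ADMIXTURE runs, at the cost of noisier per-individual estimates.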

This is how it happens. Let’s say you have a lot of individuals from dozens of populations and hundreds of thousands of markers. Obviously you can modulate the parameters a bit: the number of individuals, the weighting of the various populations, and how thick your marker set is going to be. There’s a practical reason to make your marker set thinner: the algorithm runs much faster. But as you reduce the number of markers the outcomes become much noisier. That’s evident when you look at individual results, and not the population-pooled ones. Varying the population set also matters a lot. A sample with 75 Yoruba and 25 Druze vs. one with 50 Yoruba and 50 Druze can produce different results over the same range of Ks. Finally, reducing the number of individuals obviously causes problems with representativeness. Here the results become “noisy” at the population level, as a regional bias can distort your perception of a given population.
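The marker-count effect can be made concrete with a toy simulation (synthetic numbers, not real genotypes): treat an individual’s ancestry fraction q as the probability that each marker’s allele derives from population 1, estimate q as the observed fraction, and watch the spread of the estimate shrink as markers are added — roughly as 1/√M:

```python
import random
import statistics

def estimate_q(q_true, n_markers, rng):
    """Crude ancestry-fraction estimate: at each marker the observed
    allele derives from population 1 with probability q_true."""
    hits = sum(rng.random() < q_true for _ in range(n_markers))
    return hits / n_markers

rng = random.Random(42)
q = 0.7  # hypothetical true ancestry fraction
sds = {}
for m in (400, 40_000):
    estimates = [estimate_q(q, m, rng) for _ in range(100)]
    sds[m] = statistics.stdev(estimates)
    print(f"{m:>6} markers: sd of estimate ~ {sds[m]:.4f}")
```

With 100× fewer markers the estimate is about 10× noisier, which is why individual-level results from a thin set bounce around even when population-pooled averages look stable.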

How does this work with confirmation bias? If you are proactively searching for patterns which align with a particular model or expectation, you can often simply modulate the parameters until you obtain “reasonable” results. The exact same issue crops up with multiple regression. And this need not be conscious. In the course of regular science, workers often ignore aberrant results and seek out positive ones. What we’re talking about is a general human bias. Researchers have been known to run experiments and keep tweaking them until the p-value reaches statistical significance. First, this treats the p-value as a “magical” number. That’s really not how it should be viewed, but that’s how it plays out in the course of attempting to get published. Second, the p-value itself is going to vary from run to run, which is why running an experiment over and over can get you the “right” result. The same general problems can crop up with ADMIXTURE. If you have a dedicated computer you can keep running the algorithm with a range of parameters until you get a “reasonable” result. You may also see bizarre results, and dismiss them out of hand as the program acting wonky. I’ve done it myself. But who knows, perhaps some of the “bizarre” results are stumbling upon a novel insight?
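The “rerun until significant” dynamic is easy to simulate. Under a true null hypothesis the p-value of each attempt is uniform on [0, 1], so allowing up to five tweaked reruns pushes the false-positive rate from the nominal 5% toward 1 − 0.95⁵ ≈ 23%. A self-contained sketch (synthetic p-values, not a real experiment):

```python
import random

def run_until_significant(rng, max_tries=5, alpha=0.05):
    """Simulate 'tweak and rerun': under the null, each attempt's
    p-value is Uniform(0, 1); stop at the first p < alpha."""
    for _ in range(max_tries):
        if rng.random() < alpha:
            return True
    return False

rng = random.Random(1)
n = 20_000
false_pos = sum(run_until_significant(rng) for _ in range(n)) / n
# Theory predicts 1 - 0.95**5 ≈ 0.226, not the nominal 0.05.
print(f"false-positive rate with up to 5 tries: {false_pos:.3f}")
```

Swapping “p-value” for “K and reference-panel choices” gives the ADMIXTURE version of the same trap: with enough reruns, some configuration will confirm your prior.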

I’m not making a postmodernist, pure-constructionist argument. These algorithms often give predictable and regular results. And some sought-after results are harder to attain than others (i.e., you have to keep fishing in the pool longer until you finally get a “bite”). But be very careful of relying on one chart or graph as the clincher in any argument. This includes the stuff I’m presenting. Attempts at replication are important, but there’s only so much time. That’s why I’m encouraging readers to play with these programs themselves.

Speaking of confirmation of a model, I thought I’d do a little experiment. Below are the ~70 participants in the AAP at K = 10. I’m not showing you the reference populations. I will tell you that:

1) The largest number of participants are of New World African Diaspora descent

2) Second in number are those of mostly non-African descent who have some reputed or known minority African ancestry

3) A bit more than half a dozen individuals are of Northeast African ancestry, in full or part

4) One individual has recent Japanese ancestry and another has recent Maya ancestry

5) A minority of the New World Africans have origins in the Caribbean.

6) There are only a few individuals of West African national origin in the data set, but they are there

First, look at this image and make your guesses (leave them in the comments, please don’t spoil it for others by identifying who is what by ID after you confirm):

All the plots with reference results are here. Explicit self-identification is here.

(Republished from Discover/GNXP by permission of author or representative)
 
Comments (5)
  1. I’d reckon that the colors mean the following:

    1. Orange = West African = Yoruba
    2. Light Green = European = Tuscan
    3. Purple = East Africa
    4. Red = Pygmy
    5. Light Blue = North African

  2. Well, let’s see. A game. The New Worlders would be likely to show significant Euro signals, as would the NE Africans, and those with some reputed African ancestors. This seems likely to be the predominant signal since the large majority fall into these groups. So:
    Orange is Euro.
    Interesting is the Japanese/Carib/Mayan, whom I suspect all show the same signal. The Carib would be very minor, since they mostly just died in the plagues, but there should be some minor bit. So, The ‘Asian’ component would be quite strong in two and a very minor component in any others. I peg that as the dark blue, but that may be because my older eyes don’t clearly distinguish the two colors of blue.
    West Africa would be the light green.
    East Africa is purple, and shows up in the New Worlders.
    The minor bits I can’t guess.
    Distilled:
    Orange= Euro
    Dark Blue= Asian
    West African= Light Green
    East Africa= Purple

  3. Cool stuff Razib, your new Nubian sample is interesting, possibly the first time I’ve seen autosomal samples from Nubia (North Sudan). I wonder if Southern Egyptians will be similar to this sample.

  4. Oops, a typo or something there. Actually a brain-short. I flipped East and West in my mind somehow. West would be what shows up in New Worlders.

  5. Extremely pertinent discussion as I am struggling with Atkinson’s linguistic founder effect model of language expansion (Atkinson, Q.D (2011). Phonemic Diversity Supports a Serial Founder Effect Model of Language Expansion from Africa. Science 332, 346. DOI: 10.1126/science.1199295.)
    Q1: Does Bayesian Information Criterion mitigate confirmation bias?
    Q2: Are there quantitative methods which would clarify a confirmation bias?
