Kobe Bryant is an exceptional professional basketball player. His father was a “journeyman”. Similarly, Barry Bonds and Ken Griffey Jr. both surpassed their fathers as baseball players. Both of Archie Manning’s sons are superior quarterbacks in relation to their father. This is not entirely surprising. Though there is a correlation between parent and offspring in their traits, that correlation is imperfect.

Note though that I put journeyman in quotes above because **any success at the professional level in major league athletics indicates an extremely high level of talent and focus.** Kobe Bryant’s father was among the top 500 best basketball players of his age. His son is among the top 10. This is a large realized difference in professional athletics, but across the whole distribution of people playing basketball at any given time it is not so great of a difference.

What is more curious is how this relates to the reality of regression toward the mean. This is a very general statistical concept, but for our purposes we’re curious about its application in quantitative genetics. From what I can tell, people often misunderstand the idea, and treat it as if there is an orthogenetic-like tendency of generations to regress back toward some idealized value.

Going back to the basketball example: **Michael Jordan, the greatest basketball player in the history of the professional game, has two sons who are modest talents at best.** The probability that either will make it to a professional league seems low, a reality acknowledged by one of them. In fact, from what I recall both received special attention and consideration because they were Michael Jordan’s sons. It is still noteworthy of course that both had the talent to make it onto a roster of a Division I NCAA team. This is not typical for any young man walking off the street. But the range in realized talent here is notable. Similarly, Joe Montana’s son has been bouncing around college football teams to find a roster spot. Again, it suggests a very high level of talent to be able to plausibly join a roster of a Division I football team. **But for every Kobe Bryant there are many, many, Nate Montanas.** There have been enough generations of professional athletes in the United States to illustrate regression toward the mean.

So how does it work? A few years ago a friend told me that the best way to think about it was a bivariate distribution, where the two random variables are additive genetic variation and environmental variation. Clearer? For many, probably not. To make it concrete, let’s go back to the old standby:

**the quantitative genetics of height.**

For height in developed societies we know that ~80% of the variation of the trait in the population can be explained by variation of genes in the population. That is, the heritability of the trait is 0.80. This means that the correspondence between parents and offspring on this trait is rather high. Having tall or short parents is a decent predictor of having tall or short offspring. But the heritability is imperfect. **There is a random “environmental” component of variation.** I put environmental in quotations because that really just means it’s a random noise effect which we can’t capture in the additive or dominance components (this sort of thing may be why homosexual orientation in individuals is mostly biologically rooted, even if its population-wide heritability is modest). It could be biological, such as developmental stochasticity, or gene-gene interactions. The point is that this is the component which adds an element of randomness to our ability to predict the outcomes of offspring from parents. It is the darkening of the mirror of our perceptions.

Going back to height, the plot to the left shows an idealized normal distribution of height for males. I set the mean as 70 inches, or 5 feet 10 inches. The standard deviation is 2.5, ~~which means that if you randomly sampled any two males from the dataset the most likely value of the difference would be 2.5 inches~~ which is just the average deviation from the mean (it’s a measure of dispersion). Obviously the height of a male is dependent upon the height of a father, but the mother matters as well (perhaps more due to maternal effects!). Here we have to note that there’s clearly a sex difference in height. How do you handle this problem? Actually, that’s easy. Just convert the heights of the parents to sex-controlled standard deviation units. For example, if you are 5 feet and 7.5 inches as a male you are 1 standard deviation unit below the mean. If you are a female at the same height you are 1.4 standard deviation units above the mean (assuming female mean height of 5 feet and 4 inches, and standard deviation of 2.5 inches). If height was nearly ~100% heritable you’d just average the two parental values in standard deviation units to get the expectation of the offspring in standard deviation units. In this case, the offspring should be 0.2 standard deviation units above the mean.
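The conversion to sex-controlled standard deviation units can be sketched in a few lines of Python. This is only an illustration: the means and the 2.5 inch standard deviation are the idealized values used above, and the names `to_sd_units` and `midparent` are made up for the example.

```python
# Idealized values from the text: male mean 70 in, female mean 64 in, SD 2.5 in for both.
MEAN_INCHES = {"male": 70.0, "female": 64.0}
SD_INCHES = 2.5

def to_sd_units(height_inches, sex):
    """Express a height as standard deviations from the mean of that sex."""
    return (height_inches - MEAN_INCHES[sex]) / SD_INCHES

def midparent(father_inches, mother_inches):
    """Average the two parents after converting each to sex-specific SD units."""
    return (to_sd_units(father_inches, "male") + to_sd_units(mother_inches, "female")) / 2

father = to_sd_units(67.5, "male")      # 1 SD below the male mean
mother = to_sd_units(67.5, "female")    # 1.4 SD above the female mean
print(father, mother, round(midparent(67.5, 67.5), 2))
```

The same 5 feet 7.5 inches lands on opposite sides of the mean depending on sex, which is why the averaging has to happen in standard deviation units rather than in inches.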

But height is **not ~100% heritable.** There is an environmental component of variation which isn’t accounted for by the parental genotypic values (at least the ones with effects of interest to us, the additive components). If height is ~80% heritable then you’d expect the offspring to regress 1/5th of the way back to the population mean. For the example above, the expectation of the offspring would be 0.16 standard deviation units, not 0.20.
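As a rough sketch of that arithmetic, the expected offspring value is just the midparent deviation shrunk by the heritability. The function names here are hypothetical, and the 0.8 default is the heritability figure used in this post, not a universal constant.

```python
def expected_offspring(midparent_sd, heritability=0.8):
    """Expected offspring value in SD units: the midparent deviation shrunk by heritability."""
    return heritability * midparent_sd

def regression_fraction(heritability=0.8):
    """Share of the midparent deviation expected to be lost in the offspring."""
    return 1.0 - heritability

print(round(expected_offspring(0.2), 2))   # midparent of 0.2 SD units
print(round(regression_fraction(), 2))     # fraction of the way back to the mean
```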

Let’s make this more concrete. Imagine you sampled a large number of couples whose midparent phenotypic value is 0.20 standard deviation units above the mean in height. This means that if you convert the father and mother into standard deviation units, their average is 0.20. So one pair could be 0.20 and 0.20, and another could be 2.0 and -1.6 standard deviation units. What’s the expected distribution of male offspring height?

1) The midparent value naturally is constrained to have no variance (though as I indicate above since it’s an average the selected parents may have a wide variance)

2) The male offspring are somewhat above the average population in distribution of height

3) It remains a distribution. The expected value of the offspring is a specific value, but environmental and genetic variation remains to produce a range of outcomes (e.g., Mendelian segregation and recombination)

4) There has been some regression back to the population mean

I only displayed the males. There are obviously going to be females among the offspring generation. What would the outcome be if you mated the females with the males? Recall that the female heights would exhibit the same mean, 0.16 units above the original population mean. **This is where many people get confused** (frankly, those whose intelligence is somewhat closer to the mean!). They presume that a subsequent generation of mating would result in further regression back to the mean. No! Rather, the expected value of the offspring would be 0.16 units. Why?

Because through the process of selection you’ve created a new genetic population. The selection process is imperfect in ascertaining the exact causal underpinning of the trait value of a given individual. In other words, because height is imperfectly heritable some of the tall individuals you select are going to be tall for environmental reasons, and will not pass that trait to their offspring. But height is ~80% heritable, which means that the filtering process of genes by using phenotype is going to be rather good, and the genetic makeup of the subsequent population will be somewhat deviated from the original parental population. **In other words, the reference population to which individuals “regress” has now changed.** The environmental variation remains, but the additive genetic component around which the regression is anchored is now no longer the same.
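A toy simulation can make the point tangible. What follows is a minimal sketch under loud assumptions, not a model of any real breeding program: phenotype is additive genetic value plus independent noise, heritability is 0.8, parents are chosen by simple truncation (anyone more than 1 standard deviation above the mean, rather than an exact midparent value of 0.20), and within-family segregation noise is assumed to carry half the additive variance. All names are invented for the example.

```python
import random
import statistics

H2 = 0.8        # heritability from the post
N = 40_000      # individuals per generation (arbitrary choice)
SD_A = H2 ** 0.5            # SD of additive genetic values
SD_E = (1 - H2) ** 0.5      # SD of the "environmental" noise
SD_SEG = (H2 / 2) ** 0.5    # assumed SD of Mendelian segregation noise within a family

def phenotypes(genetic_values, rng):
    """Phenotype = additive genetic value + fresh environmental noise."""
    return [a + rng.gauss(0.0, SD_E) for a in genetic_values]

def next_generation(parent_values, rng):
    """Random mating: child = average of two random parents + segregation noise."""
    return [
        0.5 * (rng.choice(parent_values) + rng.choice(parent_values)) + rng.gauss(0.0, SD_SEG)
        for _ in range(N)
    ]

def simulate(seed=1):
    rng = random.Random(seed)
    a1 = [rng.gauss(0.0, SD_A) for _ in range(N)]
    p1 = phenotypes(a1, rng)
    # Select on phenotype only: keep everyone more than 1 SD above the mean.
    chosen = [(a, p) for a, p in zip(a1, p1) if p > 1.0]
    parent_mean = statistics.fmean(p for _, p in chosen)
    a2 = next_generation([a for a, _ in chosen], rng)   # children of selected parents
    a3 = next_generation(a2, rng)                       # grandchildren, no further selection
    return (
        parent_mean,
        statistics.fmean(phenotypes(a2, rng)),
        statistics.fmean(phenotypes(a3, rng)),
    )

parent_mean, child_mean, grandchild_mean = simulate()
print(f"selected parents: {parent_mean:.2f}")
print(f"children:         {child_mean:.2f}")
print(f"grandchildren:    {grandchild_mean:.2f}")
```

Under these assumptions the children should sit at roughly 80% of the selected parents’ deviation, and the grandchildren should stay at about the children’s mean, because the second round of mating involves no selection and the environmental noise is redrawn around an unchanged genetic mean.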

This is why I state that regression toward the mean is not magical in a biological sense. There is no population with fixed traits to which selected individuals naturally regress or revert. Rather, populations are useful abstractions in making sense of the statistical correlations we see around us. The process of selection is informed by population-wide trends, so we need to bracket a set of individuals as a population. But what we really care about are the genetic variables which underpin the variation across the population. And those variables can change rather easily through selection. Obviously regression toward the mean would exhibit the magical reversion-toward-ideal-type property that some imagine if the variables were static and unchanging. **But if this was the matter of things, then evolution by natural selection would never occur!**

Therefore, in quantitative genetics regression toward the mean is a useful dynamic, a heuristic which allows us to make general predictions. But we shouldn’t forget that it’s really driven by biological processes. Many of the confusions which I see people engage in when talking about the dynamic seem to be rooted in the fact that individuals forget the biology, and adhere to the principle as if it is an unthinking mantra.

And that is why there is a flip side: even though the offspring of exceptional individuals are likely to regress back toward the mean, **they are also much more likely to be even more exceptional than the parents than any random individual off the street!** Let’s go back to height to make it concrete. Kobe Bryant is 6 feet 6 inches tall. His father is 6 feet 9 inches. I don’t know his mother’s height, but her brother was a basketball player whose height is 6 feet 2 inches. Let’s use him as a proxy for her (they’re siblings, so not totally inappropriate), and convert everyone to standard deviation units.

Kobe’s father: 4.4 units above mean

Kobe: 3.2 units above mean

Kobe’s mother: 1.6 units above the mean

Using the values above the expected value for the offspring of Kobe’s father & mother is a child 2.4 units above the mean. Kobe is somewhat above the expected value (assuming that Kobe’s mother is a taller than average woman, which seems likely from photographs). But here’s the important point: **his odds of being this height are much higher with the parents he has than with any random parents.** Using a perfect normal distribution (this is somewhat distorted by “fat-tailing”) the odds of an individual being Kobe’s height are around 1 in 1,500. But with his parents the odds that he’d be his height are closer to 1 out of 5. In other words, Kobe’s parentage increased the odds of his being 6 feet 6 inches by a factor of 300! The odds were still against him, but the die was loaded in his direction in a relative sense. By analogy, in the near future we’ll see many more children of professional athletes become professional athletes both due to nature and nurture. But, we’ll continue to see that most of the children of professional athletes will not have the requisite talent to become professional athletes.
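The rough odds above can be reproduced with Python’s standard library. Treat this as a back-of-the-envelope sketch: it reads “being Kobe’s height” as “at least that tall”, uses a perfect normal distribution, and assumes that sons are spread around their expected value with a standard deviation of 1 unit, which is a simplification chosen because it appears to match the figures quoted here.

```python
from statistics import NormalDist

HERITABILITY = 0.8
standard_normal = NormalDist()  # heights in SD units, mean 0 and SD 1

kobe, father, mother = 3.2, 4.4, 1.6            # SD units from the text
expected_child = HERITABILITY * (father + mother) / 2

# Chance that a random man is at least as tall as Kobe.
p_random = 1 - standard_normal.cdf(kobe)

# Chance for a son of these parents, assuming (as a simplification) that sons
# are spread around their expected value with an SD of 1 unit.
p_given_parents = 1 - standard_normal.cdf(kobe - expected_child)

print(f"expected child: {expected_child:.1f} SD units")
print(f"random man: about 1 in {1 / p_random:.0f}")
print(f"son of these parents: about 1 in {1 / p_given_parents:.1f}")
print(f"ratio: about {p_given_parents / p_random:.0f}x")
```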

*Image Credit: Wikipedia*

You say: The standard deviation is 2.5, which means that if you randomly sampled any two males from the dataset the most likely value of the difference would be 2.5 inches (it’s a measure of dispersion).

That’s incorrect. The standard deviation of a distribution is indeed a measure of dispersion, but it is not the case that the most likely value of the difference between two draws from the distribution will be the standard deviation. Imagine two dice. Roll one and then the other. Take the absolute value of the difference between the number of spots on the two dice. There are 36 possibilities. Six times out of 36 you’ll get 0, ten times you will get 1, eight times 2, etc. So 1 is the likeliest value of the difference between two dice. The SD of a die roll is about 1.71.
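The dice example is small enough to enumerate directly. A minimal sketch, using only the standard library:

```python
from collections import Counter
from itertools import product
from statistics import pstdev

faces = list(range(1, 7))

# Tally |a - b| over all 36 ordered outcomes of two dice.
diff_counts = Counter(abs(a - b) for a, b in product(faces, repeat=2))
for diff in sorted(diff_counts):
    print(diff, diff_counts[diff])

print("most common difference:", diff_counts.most_common(1)[0][0])
print("SD of one die:", round(pstdev(faces), 2))
```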

As another example, the absolute value of the difference between draws from two normal distributions would give a half-normal distribution with likeliest value 0.

Right, the mean absolute difference in height between two random men in that example is more like 2.8 than 2.5, but it’s just a linear transformation of the standard deviation. I forget the exact formula.
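For two independent draws from the same normal distribution, the expected absolute difference is 2σ/√π. The sketch below evaluates that closed form for the post’s standard deviation of 2.5 inches and checks it against a quick simulation; the function names, the draw count, and the seed are arbitrary choices for the example.

```python
import math
import random

def mean_abs_difference(sd):
    """Expected |X - Y| for two independent draws from the same normal distribution."""
    return 2 * sd / math.sqrt(math.pi)

def simulated_mean_abs_difference(sd, draws=200_000, seed=0):
    """Monte Carlo check of the closed form (mean height of 70 is the post's value)."""
    rng = random.Random(seed)
    total = sum(abs(rng.gauss(70, sd) - rng.gauss(70, sd)) for _ in range(draws))
    return total / draws

print(round(mean_abs_difference(2.5), 2))
print(round(simulated_mean_abs_difference(2.5), 2))
```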

Question: does the population of children have the same variance as the population from which their parents were chosen?

Joe Montana was not much of an athlete and did not have much of an arm. What he did have was terrific concentration and a fearlessness that was reckless in the extreme, leading to both a lot of completed short passes and entire missed seasons due to injury.

Barry Bonds was a drug enhanced freak and can not be considered as a legitimate data point.

Additionally, Joe Montana’s success was due more to character than innate athletic ability and as such was most likely due more to environment than genes. And if you look at Bobby Bonds outstanding career and records he can in no way be described as “mediocre.”

Professional basketball is also a poor choice of sports to describe “excellence.” The regular season is nothing but a bunch of exhibition games where no effective defense is really played, and the various rules such as those against traveling are not enforced. It’s more like professional wrestling. The game is dominated by huge centers and “gunners” who are allowed by lax officiating to make a lot of baskets in order to “wow” the fans.

#1, thanks, i goofed trying to make it understandable, and then exhibited my own confusion!

*does the population of children have the same variance as the population from which their parents were chosen?*

last i checked. but i’ll double-check.

*Barry Bonds was a drug enhanced freak and can not be considered as a legitimate data point.*

i disagree with you about montana, but this is just plain stupid. before he became a major juicer bonds was a VERY GOOD baseball player. arguably the best of the 1990s.

http://www.baseball-reference.com/players/g/griffke02.shtml

http://www.baseball-reference.com/players/b/bondsba01.shtml

Kobe, don be rapin’:

http://www.youtube.com/watch?feature=player_detailpage&v=Z4RH2DBb8T4#t=202s

Thank you for doing this post. “But if this was the matter of things, then evolution by natural selection would never occur!” That was exactly the paradox I was trying to get my head around. It’s the “environmental” component that regresses to the mean.

I see three factors here. One is innate hereditary potential. One is connections. And one is education.

Someone with connections will be able to make absolutely the most of their talent. This is least important in highly competitive areas where there are objective standards, but even there it’s a factor. For example, in college football and the NFL you hear fairly often of top players who were walk-ons with no scholarship or undrafted. I strongly suspect that many of these players played for low-prestige coaches on less-known teams in small-time leagues.

Education: things like sports, music, and math are very specialized and require more than talent and general knowledge. A 15 year old batter and outfielder whose father was a top batter and outfielder will know all kinds of tricks of the trade that the average, equally talented kid doesn’t know. This is conditional; some fathers pass down only genes, and some teach the family trade.

Picasso’s dad was a professional artist. Mozart’s dad was a professional musician who specialized in pedagogy. Bach’s father and two uncles were professional musicians, and so were three of his sons. Mozart’s son was a mediocre professional musician; maybe Mozart’s father was a better teacher than Mozart was.

What I’ve read is that the strength of regression to the mean in such a situation (predicting child trait values) is tempered by how much information you have about the child’s ancestors in totality, so that for the F1 generation, knowing only their parents, we’d use an 80% regression, and for the F2s, since we have data for grandparents as well as parents we could use a higher value. So less regression, yes, but is it going a little too far to say that the F2s exhibit *zero* regression?

I’m unversed in statistics, but I’d like to understand this.

#9, don’t have falconer handy right now. but i can see how F2 would regress to grandparents for traits on a few genes which exhibit dominance effects via mendelian segregation. or perhaps inbred lineages. but not understanding how F2 would be influenced by grandparents if it is selection on quant. trait with lots of genes of small effect. height is in the latter case.

#9, also, a citation to where you read/heard about this would be of interest.

also, #9, it makes sense to me why you would want to look at an individual’s family laterally. e.g., cousins, second cousins. that would give a better sense of the environmental variance affecting a population. just not sure why going back many generations would matter.

Fascinating post. I knew Kobe’s father was a pro basketball player, but I hadn’t realized that his mother had a first order relative who was also a pro. So Kobe had pro basketball players on BOTH sides of his family. That certainly puts him in an ultra-rare category in terms of parentage.

John Emerson:

I’d say in regards to music and sports at least, it depends on the particulars. Classical music and baseball? Yeah. Blues and Track and Field, not so much.

Also, being the child of someone who is very gifted in what they do *sucks*, especially if they’re a crap teacher. My Dad is a musician for whom it comes naturally. He’s not particularly trained, but can pick up near anything that’s not too technically complex by ear. I remember cutting the strings off an ukulele he gave me when I was seven out of frustration from never playing well enough to match him. For that reason I never again took up music until fairly recently, and found out as an adult, that while I’m not gifted, I’m pretty good at it, and when I’m not comparing myself to my father, I really, really enjoy playing music and composing songs. So, I have to say I really really feel what Jordan’s kids have to go through. There’s nothing like having some talent and the desire to do something, but to be forever eclipsed by your parent.

I am not a scientist or statistician, just browsing….but #14 is interesting. I was wondering about the effect of the human element. For example, the sons and daughters of strong self-made people, often but not always fathers, seem to be especially hamstrung from achieving anything in any sphere…by the over achievement of their parent, which seems to create a toxic child-rearing environment.

The standard deviation is not the average deviation (which is always 0, by definition), or even the average absolute deviation, but the square root of the average squared deviation. The standard deviation and the average absolute deviation typically differ by around 30%. The standard deviation has deeper statistical justification, but can be thought of as another way of dealing with the fact that all the deviations add up to 0 (because some are positive, some negative), so their average is 0 (taking the absolute value is the other way to deal with this). Razib knows all this, just trying to help clarify some of the stats.

#9. What you say sounds like Galton’s law of ancestral heredity, which holds that each ancestral generation contributes a certain proportion to the offspring. But this law is wrong. For an additive trait like height, the grandparents do not provide further information. For a trait with dominance, grandparents can be informative, but dominance decreases the heritability of a trait, not increases it. See Provine’s Origin of Theoretical Population Genetics for more details.

The “Distribution of Male Height” graph is wrong. The red curve should be skinnier (lower standard deviation). Otherwise the offspring generation will have a much higher standard deviation than the parent generation (part of offspring standard deviation is due to parents, part due to randomness; thus the part due to randomness can’t be equal to the entire standard deviation of parents).

The formula is that the standard deviation of the red curve should be equal to sqrt(1-0.8)*(standard deviation of black curve).

Also, a related issue: you say “They presume that a subsequent generation of mating would result in further regression back to the mean. No! Rather, the expected value of the offspring would be 0.16 units.” If this is true, then either 1) that subsequent generation has a much higher standard deviation than its parents or 2) correlation is 100% between that subsequent generation and its parents. Neither seems likely. Further on this issue, what about the parents with a phenotype midpoint of 0.2 made their children’s expected value 0.16, while the children of those with a phenotype midpoint of 0.16 also had an expected value of 0.16? How in practice can you distinguish between tall parents whose children will have an expected value equal to theirs vs. lower?

*The “Distribution of Male Height” graph is wrong. The red curve should be skinnier (lower standard deviation). Otherwise the offspring generation will have a much higher standard deviation than the parent generation*

the empirical distribution of offspring is 2/3 to 3/4 of the population from which parents were selected (this is in a follow up post). the fact that the offspring have a relatively high deviation despite selection on parents is due to genetic segregation and recombination.

*Further on this issue, what about the parents with a phenotype midpoint of 0.2 made their children’s expected value 0.16, while the children of those with a phenotype midpoint of 0.16 also had an expected value of 0.16?*

you selected the parents to have a different mean from the distribution. you don’t select the children. your question is really hard to understand. you either don’t know this stuff well, or you know it so well that it’s beyond me frankly.

Let’s say there are 3 generations in a large population, with heights measured as random variables x1, x2, and x3. x2 are descendants of x1, and x3 are descendants of x2. Assume x1, x2, and x3 are all normal and have mean 0 (this just means there is not a systematic drift towards taller or shorter people over time), and the correlation between parents and their offspring is 0.8.

Then, in the first part of your post, you explain that E[x2 | x1 = 0.2] = 0.16 (read “the expected value of x2, given that x1 is 0.2 sd units, is 0.16 sd units”). This is correct.

Later (“Rather, the expected value of the offspring would be 0.16 units.”) you seem to argue that E[x3 | x1 = 0.2] = 0.16. But this is just wrong. In fact, E[x3 | x1 = 0.2] = 0.128 (0.8^2 * 0.2).

Or is your representation somehow different from a multivariate normal? You seem to suggest it isn’t.

Here is Python code to show that E[x3 | x1 = 0.2] = 0.128:

    import random

    corr = 0.8
    x1chose = 0.2  # this represents the selection of gen 1 with height 0.2

    # unselected generation 1, for reference
    x1 = [random.gauss(0, 1) for a in range(100000)]
    print('x1 avg.: ', sum(x1) / len(x1))

    # generation 2: children of parents fixed at x1chose
    x2 = [corr * x1chose + (1 - corr**2)**.5 * random.gauss(0, 1) for a in range(100000)]
    print('x2 avg., given x1 = 0.2: ', sum(x2) / len(x2))

    # generation 3: children of generation 2
    x3 = [corr * a + (1 - corr**2)**.5 * random.gauss(0, 1) for a in x2]
    print('x3 avg., given x1 = 0.2: ', sum(x3) / len(x3))

#20, i think technically the issue is that the correlation between parents and offspring is not 0.8. 0.8 is the slope of the regression line of the offspring on y vs. parents on x (that’s one way to ascertain heritability in a population). the correlation is considerably lower (it’s on the order of ~0.5 for siblings).

let me state it concretely, because i think you didn’t notice when i switched the type of populations i’m talking about as well.

1) first, you have a normally distributed population

2) second, you select parents with midparent values of exactly 0.2 deviations from the median of the first population (subset of 1)

3) you have x number of offspring cohorts. each offspring cohort will be 0.16 deviations from the original population in 1)

4) then i posited that the offspring themselves in 3) would mate randomly, which is different from what i posit in 2). the new population distribution, analogous to 1), would have a mean of 0.16.

Razib,

First, the sentence “~80% of the variation of the trait in the population can be explained by variation of genes in the population” suggests (to me, admittedly not a geneticist) an 80% correlation, not an 80% slope, but more importantly note that correlation = the slope assuming that 2 generations (variables) have same standard deviation, which is likely. Slope = correlation * ratio of standard deviations. But fine, we can set this issue aside.

Also, I did notice the difference (on my 2nd reading, between my 1st and 2nd comments). That’s why, in my 2nd comment, I first calculate E[x2 | x1 = 0.2], and next calculate E[x3 | x1 = 0.2]. These 2 are not analogous – the 1st one is expected value of a person’s height given that both his parents have a height of 0.2. The 2nd one is the expected value of a person’s height given that all of his grandparents have a height of 0.2. That seems to be consistent with your step 4 above.

The bottom line is that if you are arguing that the expected height of a person whose 4 grandparents have height 0.2 sd above mean (and that’s all you know about that person’s genetics) is the same as the expected height of a person whose 2 parents have a height of 0.2 sd above the mean, then I disagree with you (assuming we are using a multivariate normal distribution which is what I’ve implemented in python). The former should be lower. My python code shows this (it should actually be modified slightly to account for the fact that x3 is based on average of 2 draws of x2, rather than a single draw, but this doesn’t change the qualitative conclusion).

If it is possible for you to implement a simulation of your representation in a coding language, or to indicate what in my representation is wrong, that might clear it up.

I’ll try to convince you using one other thought: suppose that you did do what I thought (incorrectly) you were doing at first, i.e. that the members of the 3) population in your comment didn’t mate randomly, but that only those with height 0.16 mated. Is it intuitive to you that in that case, the avg. height in population 4) would be lower than in the case you actually describe?

*The bottom line is that if you are arguing that the expected height of a person whose 4 grandparents have height 0.2 sd above mean (and that’s all you know about that person’s genetics) is the same as the expected height of a person whose 2 parents have a height of 0.2 sd above the mean, then I disagree with you (assuming we are using a multivariate normal distribution which is what I’ve implemented in python).*

your model is too simple. in the first generation you’re sampling in a biased fashion from the additive genetic variance (one normalish distribution). you are not sampling in a biased fashion from the residual/environment (the other normalish distribution which is confounded with the former in the total population). that is why regression back toward the mean is proportional to the amount of environmental variance in the parental population, as it counteracts the efficacy of selection on genes. in the second generation i’m not positing any selection on genes at all. both of the distributions in e3 are going to be centered around 0 in reference to the parental population, whereas in the e2 the underlying genetic component is deviated from 0 (via selection) and the environmental component remains about 0. the regression is the compound of the two.

*I’ll try to convince you using one other thought*

dude, animal breeders use this method all the time. it’s outlined as i described. if i can’t describe it for you clearly, that’s fine, but please don’t presume this is an abstruse theoretical issue! no, subsequent generations don’t regress like your model. that’s not something i made up, that’s something validated in breeding programs for decades. you’re not a geneticist, so you don’t know that empirically. that’s fine.

and to be clear, the reference “mean” is itself a combination of genetic and environmental variables. my point in the original post is that it isn’t some fixed parameter around which populations and pedigrees cycle magically, but a variable of instrumental utility which is itself sensitive to many other parameters.

one can’t presume always that the mean in the grandparental generation is going to be the same mean in the grandchild generation! it is possible that grandparents are deviated from their population mean, while grandchildren of the same quantitative value are at their population mean.

ok, i got it, thanks. If you have tall parents your expected height is lower to the extent that your parents are tall due to environmental factors. My model was too simple because I used a constant correlation whereas really the correlation should fall as the parents’ height deviates (on either side) from the mean of their cohort, as this deviation is evidence of environmental influence.

The environmental factor is never directly observable, but luckily an imperfect proxy is the grandparents — if your parents are very tall but your grandparents aren’t, likely a portion of your parents’ height is due to environmental factors, so you shouldn’t expect to be as tall as someone with tall parents and grandparents. That may be what #9 is getting at.

#25, yep. sorry about my lack of clarity…but then that’s why there are textbooks on this subject.

p.s. i used correlation in the post in a somewhat colloquial sense because i assumed readers would be more familiar with it. i did ponder whether to explicitly talk about the heritability as the slope of the regression line….