Reading citations is easier than most people think

It's really common to see claims that some meme is backed by “studies” or “science”. But when I look at the actual studies, it usually turns out that the data are opposed to the claim. Here are the last few instances of this that I've run across.

Dunning-Kruger

A pop-sci version of Dunning-Kruger, the most common one I see cited, is that the less someone knows about a subject, the more they think they know. Another pop-sci version is that people who know little about something overestimate their expertise because their lack of knowledge fools them into thinking that they know more than they do. The actual claim Dunning and Kruger make is much weaker than the first pop-sci claim and, IMO, the evidence is too weak to support the second. The original paper isn't much longer than most of the incorrect pop-sci treatments of it, and we can get a pretty good idea of the claims by looking at the four figures included in the paper. In the graphs below, “perceived ability” is a subjective self rating, and “actual ability” is the result of a test.

[Four graphs from the paper, plotting perceived ability and actual ability for each of the four studies]

In two of the four cases, there's an obvious positive correlation between perceived skill and actual skill, which is the opposite of the first pop-sci conception of Dunning-Kruger that we discussed. As for the second, we can see that people at the top end also don't rate themselves correctly, so the explanation that people misjudge their own skill because they don't know much about the subject (an easy interpretation to have of the study, given its title, Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments) is insufficient, because it doesn't explain why people at the top of the charts have what appears to be, at least under the conditions of the study, a symmetrically incorrect guess about their skill level. One could argue that there's a completely different effect at the top end that just happens to produce the same, roughly linear, slope in perceived ability that people who are "unskilled and unaware of it" have. But if there's any plausible simpler explanation, a two-effect explanation seems overly complicated without additional evidence (which, if any exists, is not provided in the paper)1.

A plausible explanation of why perceived skill is compressed, especially at the low end, is that few people want to rate themselves as below average or as the absolute best, shrinking the scale but keeping a roughly linear fit. The crossing point of the scales is above the median, indicating that people, on average, overestimate themselves, but that's not surprising given the population tested (more on this later). In the other two cases, the correlation is very close to zero. It could be that the effect is different for different tasks, or it could just be that the sample size is small and the differences between the tasks are noise. It could also be that the effect comes from the specific population sampled (students at Cornell, who are probably actually above average in many respects).

If you look up Dunning-Kruger on Wikipedia, it claims that a replication of Dunning-Kruger on East Asians shows the opposite result (perceived skill is lower than actual skill, and the greater the skill, the greater the difference), and that the effect is possibly just an artifact of American culture. But the citation is actually a link to an editorial which mentions a meta-analysis on East Asian confidence, so that might be another example of a false citation. Or maybe it's just a link to the wrong source. In any case, the effect certainly isn't that the more people know, the less they think they know.
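To see how scale compression alone can produce figures shaped like the ones above, here's a minimal simulation sketch. The model and every constant in it are invented for illustration; nothing is fitted to the paper's data:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical model: self-ratings are compressed toward the middle
    # of the scale (slope well below 1) but still positively correlated
    # with actual skill. All constants here are invented for illustration.
    actual = rng.uniform(0, 100, 10_000)  # true skill percentile
    perceived = 55 + 0.3 * (actual - 50) + rng.normal(0, 10, actual.size)

    # Print mean perceived vs. actual ability by quartile of actual
    # skill, mimicking the layout of the paper's figures.
    for q in range(4):
        mask = (actual >= 25 * q) & (actual < 25 * (q + 1))
        print(f"quartile {q + 1}: actual ~{actual[mask].mean():4.1f}, "
              f"perceived ~{perceived[mask].mean():4.1f}")

The bottom quartile overestimates itself, the top quartile underestimates itself, and both lines slope upward together, which is roughly the shape of the paper's figures, even though the model contains no separate "unskilled and unaware" mechanism.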

Income & Happiness

It's become common knowledge that money doesn't make people happy. As of this writing, a Google search for happiness income returns a knowledge card saying that making more than $75k/year has no impact on happiness. Other top search results claim the happiness ceiling occurs at $10k/year, $30k/year, $40k/year, and $75k/year.

[Google knowledge card claiming that $75k should be enough for anyone]

Not only is that wrong, the wrongness is robust across every country studied, too.

[Graph: people with more income are happier]

That happiness is correlated with income doesn't come from cherry-picking one study. That result holds across five iterations of the World Values Survey (1981-1984, 1989-1993, 1994-1999, 2000-2004, and 2005-2009), three iterations of the Pew Global Attitudes Survey (2002, 2007, 2010), five iterations of the International Social Survey Program (1991, 1998, 2001, 2007, 2008), and a large-scale Gallup survey.

The graph above has income on a log scale; if you pick a country and graph the results on a linear scale, you get something like this:

[Graph: best-fit log curve of happiness vs. income on a linear scale]

As with any graph of a log function plotted on a linear axis, it looks like the curve is about to level off, which results in interpretations like the following:

[Distorted log graph]

That's an actual graph from an article that claims that income doesn't make people happy. These vaguely log-like graphs that level off are really common. If you want to see more of these, try an image search for “happiness income”. My favorite is the one where people who make enough money literally hit the top of the scale. Apparently, there's a dollar value which not only makes you happy, it makes you as happy as it is possible for humans to be.
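If you want to convince yourself of how easy it is to read a ceiling into a log curve, here's a minimal sketch that plots the same made-up log-linear relationship on a linear axis and on a log axis (the numbers are invented for illustration, not taken from any of the studies above):

    import numpy as np
    import matplotlib.pyplot as plt

    # A made-up log-linear relationship: satisfaction rises by a constant
    # amount for every doubling of income, with no ceiling anywhere.
    income = np.linspace(1_000, 200_000, 500)
    satisfaction = np.log10(income)

    fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

    # On a linear income axis, the curve looks like it's about to flatten...
    ax_linear.plot(income, satisfaction)
    ax_linear.set_xlabel("income (linear scale)")
    ax_linear.set_ylabel("satisfaction")

    # ...but on a log income axis, the same data is a straight line.
    ax_log.plot(income, satisfaction)
    ax_log.set_xscale("log")
    ax_log.set_xlabel("income (log scale)")

    plt.tight_layout()
    plt.show()

The left panel has the shape that gets read as "money stops mattering past $X"; the right panel is the same curve, a straight line with no leveling off in sight.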

As with Dunning-Kruger, you can look at the graphs in the papers to see what's going on. It's a little easier to see why people would pass along the wrong story here, since it's easy to misinterpret the data when it's plotted against a linear scale, but it's still pretty easy to see what's going on by taking a peek at the actual studies.

Hedonic Adaptation & Happiness

The idea that people bounce back from setbacks (as well as positive events) and return to a fixed level of happiness entered the popular consciousness after Daniel Gilbert wrote about it in a popular book.

But even without looking at the literature on adaptation to adverse events, the previous section on wealth should cast some doubt on this. If people rebound from both bad events and good, how is it that making more money causes people to be happier?

Turns out, the idea that people adapt to negative events and return to their previous set-point is a myth. Although the exact effects vary depending on the bad event, disability2, divorce3, loss of a partner4, and unemployment5 all have long-term negative effects on happiness. Unemployment is the one event that can be undone relatively easily, but the effects persist even after people become reemployed. I'm only citing four studies here, but a meta-analysis of the literature shows that the results are robust across existing studies.

The same thing applies to positive events. While it's “common knowledge” that winning the lottery doesn't make people happier, it turns out that isn't true, either.

In both cases, early cross-sectional results indicated that it's plausible that extreme events, like winning the lottery or becoming disabled, don't have long term effects on happiness. But the longitudinal studies that follow individuals and measure the happiness of the same person over time as events happen show the opposite result -- events do, in fact, affect happiness. For the most part, these aren't new results (some of the initial results predate Daniel Gilbert's book), but the older results based on less rigorous studies continue to propagate faster than the corrections.

Chess position memorization

I frequently see citations claiming that, while experts can memorize chess positions better than non-experts, the advantage completely goes away when positions are randomized. When people refer to a specific citation, it's generally Chase and Simon's 1973 paper Perception in Chess, a "classic" that has been cited a whopping 7449 times in the literature and says:

De Groot did, however, find an intriguing difference between masters and weaker players in his short-term memory experiments. Masters showed a remarkable ability to reconstruct a chess position almost perfectly after viewing it for only 5 sec. There was a sharp dropoff in this ability for players below the master level. This result could not be attributed to the masters’ generally superior memory ability, for when chess positions were constructed by placing the same numbers of pieces randomly on the board, the masters could then do no better in reconstructing them than weaker players. Hence, the masters appear to be constrained by the same severe short-term memory limits as everyone else (Miller, 1956), and their superior performance with "meaningful" positions must lie in their ability to perceive structure in such positions and encode them in chunks. Specifically, if a chess master can remember the location of 20 or more pieces on the board, but has space for only about five chunks in short-term memory, then each chunk must be composed of four or five pieces, organized in a single relational structure.

The paper then runs an experiment which "proves" that master-level players actually do worse than beginners when memorizing random mid-game positions even though they do much better memorizing real mid-game positions (and, in end-game positions, they do about the same as beginners when positions are randomized). Unfortunately, the paper used an absurdly small sample size of one chess player at each skill level.

A quick search indicates that this result does not reproduce with larger sample sizes, e.g., Gobet and Simon, in "Recall of rapidly presented random chess positions is a function of skill", say

A widely cited result asserts that experts’ superiority over novices in recalling meaningful material from their domain of expertise vanishes when they are confronted with random material. A review of recent chess experiments in which random positions served as control material (presentation time between 3 and 10 sec) shows, however, that strong players generally maintain some superiority over weak players even with random positions, although the relative difference between skill levels is much smaller than with game positions. The implications of this finding for expertise in chess are discussed and the question of the recall of random material in other domains is raised.

They find this scales with skill level and, e.g., for "real" positions, 2350+ ELO players memorized ~2.2x the number of correct pieces that 1600-2000 ELO players did, but the difference was ~1.6x for random positions (these ratios are from eyeballing a graph and may be a bit off). 1.6x is smaller than 2.2x, but it's certainly not the claimed 1.0.

I've also seen this result cited to claim that the same effect shows up in other fields, but in a quick search, attempts to apply this result to other fields either show something similar (a smaller but still observable difference on randomized material) or don't reproduce. E.g., McKeithen did this for programmers and found that, when memorizing a "normal" program, experts were ~2.5x better than beginners on the first trial and ~3x better by the 6th trial, whereas on a "scrambled" program, experts were ~3x better on the first trial but only ~1.5x better by the 6th trial. Despite this result contradicting Chase and Simon, I've seen people cite it to support the same claim as Chase and Simon, presumably people who didn't read what McKeithen actually wrote.

Type Systems

Unfortunately, false claims about studies and evidence aren't limited to pop-sci memes; they're everywhere in both software and hardware development. For example, see this comment from a Scala/FP "thought leader":

[Tweet claiming that any doubt that type systems are helpful is equivalent to being an anti-vaxxer]

I see something like this at least once a week. I'm picking this example not because it's particularly egregious, but because it's typical. If you follow a few of the big time FP proponents on Twitter, you'll regularly see claims that there's very strong empirical evidence and extensive studies backing up the effectiveness of type systems.

However, a review of the empirical evidence shows that the evidence is mostly incomplete, and that it's equivocal where it's not incomplete. Of all the false memes, I find this one to be the hardest to understand. In the other cases, I can see a plausible mechanism by which results could be misinterpreted. “Relationship is weaker than expected” can turn into “relationship is opposite of expected”, log can look a lot like an asymptotic function, and preliminary results using inferior methods can spread faster than better conducted follow-up studies. But I'm not sure what the connection between the evidence and the beliefs is in this case.

Is this preventable?

I can see why false memes might spread quickly, even when they directly contradict reliable sources. Reading papers sounds like a lot of work. It sometimes is. But it's often not. Reading a pure math paper is usually a lot of work. Reading an empirical paper to determine if the methodology is sound can be a lot of work. For example, biostatistics and econometrics papers tend to apply completely different methods, and it's a lot of work to get familiar enough with the set of methods used in any particular field to understand precisely when they're applicable and what holes they have. But reading empirical papers just to see what claims they make is usually pretty easy.

If you read the abstract and conclusion, and then skim the paper for interesting bits (graphs, tables, telling flaws in the methodology, etc.), that's enough to see if popular claims about the paper are true in most cases. In my ideal world, you could get that out of just reading the abstract, but it's not uncommon for papers to make claims in the abstract that are much stronger than the claims made in the body of the paper, so you need to at least skim the paper.

Maybe I'm being naive here, but I think a major reason behind false memes is that checking sources sounds much harder and more intimidating than it actually is. A striking example of this is when Quartz published its article on how there isn't a gender gap in tech salaries, which cited multiple sources that showed the exact opposite. Twitter was abuzz with people proclaiming that the gender gap had disappeared. When I published a post which did nothing but quote the actual cited studies, many of the same people then proclaimed that their original proclamation was mistaken. It's great that they were willing to tweet a correction6, but as far as I can tell no one actually went and read the source data, even though the graphs and tables make it immediately obvious that the author of the original Quartz article was pushing an agenda, not even with cherry-picked citations, but with citations that showed the opposite of their thesis.

Unfortunately, it's in the best interests of non-altruistic people who do read studies to make it seem like reading studies is difficult. For example, when I talked to the founder of a widely used pay-walled site that reviews evidence on supplements and nutrition, he claimed that it was ridiculous to think that "normal people" could interpret studies correctly and that experts are needed to read and summarize studies for the masses. But he's just a serial entrepreneur who realized that you can make a lot of money by reading studies and summarizing the results! A more general example is how people sometimes try to maintain an authoritative air by saying that you need certain credentials or markers of prestige to really read or interpret studies.

There are certainly fields where you need some background to properly interpret a study, but even then, the amount of knowledge that a degree contains is quite small and can be picked up by anyone. For example, excluding lab work (none of which contained critical knowledge for interpreting results), I was within a small constant factor of spending one hour of time per credit hour in school. At that conversion rate, an engineering degree from my alma mater costs a bit more than 100 hours and almost all non-engineering degrees land at less than 40 hours, with a large amount of overlap between them because a lot of degrees require the same classes (e.g., calculus). Gatekeeping the reading and interpretation of studies on whether or not someone has a credential like a degree is absurd when someone can spend a week's worth of time gaining the knowledge that a degree offers.

If you liked this post, you'll probably enjoy this post on odd discontinuities, this post on how the effect of markets on discrimination is more nuanced than it's usually made out to be, and this other post discussing some common misconceptions.

2021 update

In retrospect, I think the mystery of the "type systems" example is simple: it's a different kind of fake citation than the others. In the first three examples, a clever, contrarian, but actually wrong idea got passed around. This makes sense because people love clever, contrarian ideas and don't care very much whether they're wrong, so clever, contrarian ideas relatively frequently go viral relative to their correctness.

For the type systems example, it's just that people commonly fabricate evidence and then appeal to authority to support their position. In the post, I was confused because I couldn't see how anyone could look at the evidence and then make the claims that type system advocates do. But after reading thousands of discussions from people advocating for their pet tool/language/practice, I can see that it was naive of me to think that these advocates would even consider looking for evidence, as opposed to just pretending that evidence exists without ever having looked.

Thanks to Leah Hanson, Lindsey Kuper, Jay Weisskopf, Joe Wilder, Scott Feeney, Noah Ennis, Myk Pono, Heath Borders, Nate Clark, and Mateusz Konieczny for comments/corrections/discussion.

BTW, if you're going to send me a note to tell me that I'm obviously wrong, please make sure that I'm actually wrong. In general, I get great feedback and I've learned a lot from the feedback that I've gotten, but the feedback I've gotten on this post has been unusually poor. Many people have suggested that the studies I've referenced have been debunked by some other study I clearly haven't read, but in every case so far, I've already read the other study.


  1. Dunning and Kruger claim, without what I'd consider strong evidence, that this is because people who perform well overestimate how well other people perform. While that may be true, one could also say that the explanation for people who are "unskilled" is that they underestimate how well other people perform. "Phase 2" attempts to establish that's not the case, but I don't find the argument convincing for a number of reasons. To pick one example, at the end of the section, they say "Despite seeing the superior performances of their peers, bottom-quartile participants continued to hold the mistaken impression that they had performed just fine.", but we don't know that the participants believed that they performed fine, we just know what their perceived percentile is. It's possible to believe that you're performing poorly while also being in a high percentile (and I frequently have this belief for activities I haven't seriously practiced or studied, which seems likely to be the case for the Dunning-Kruger participants who scored poorly on those tasks).
  2. Long-term disability is associated with lasting changes in subjective well-being: evidence from two nationally representative longitudinal studies.

    Hedonic adaptation refers to the process by which individuals return to baseline levels of happiness following a change in life circumstances. Two nationally representative panel studies (Study 1: N = 39,987; Study 2: N = 27,406) were used to investigate the extent of adaptation that occurs following the onset of a long-term disability. In Study 1, 679 participants who acquired a disability were followed for an average of 7.18 years before and 7.39 years after onset of the disability. In Study 2, 272 participants were followed for an average of 3.48 years before and 5.31 years after onset. Disability was associated with moderate to large drops in happiness (effect sizes ranged from 0.40 to 1.27 standard deviations), followed by little adaptation over time.

  3. Time does not heal all wounds

    Cross-sectional studies show that divorced people report lower levels of life satisfaction than do married people. However, such studies cannot determine whether satisfaction actually changes following divorce. In the current study, data from an 18-year panel study of more than 30,000 Germans were used to examine reaction and adaptation to divorce. Results show that satisfaction drops as one approaches divorce and then gradually rebounds over time. However, the return to baseline is not complete. In addition, prospective analyses show that people who will divorce are less happy than those who stay married, even before either group gets married. Thus, the association between divorce and life satisfaction is due to both preexisting differences and lasting changes following the event.

  4. Reexamining adaptation and the set point model of happiness: Reactions to changes in marital status.

    According to adaptation theory, individuals react to events but quickly adapt back to baseline levels of subjective well-being. To test this idea, the authors used data from a 15-year longitudinal study of over 24,000 individuals to examine the effects of marital transitions on life satisfaction. On average, individuals reacted to events and then adapted back toward baseline levels. However, there were substantial individual differences in this tendency. Individuals who initially reacted strongly were still far from baseline years later, and many people exhibited trajectories that were in the opposite direction to that predicted by adaptation theory. Thus, marital transitions can be associated with long-lasting changes in satisfaction, but these changes can be overlooked when only average trends are examined.

  5. Unemployment Alters the Set-Point for Life Satisfaction

    According to set-point theories of subjective well-being, people react to events but then return to baseline levels of happiness and satisfaction over time. We tested this idea by examining reaction and adaptation to unemployment in a 15-year longitudinal study of more than 24,000 individuals living in Germany. In accordance with set-point theories, individuals reacted strongly to unemployment and then shifted back toward their baseline levels of life satisfaction. However, on average, individuals did not completely return to their former levels of satisfaction, even after they became reemployed. Furthermore, contrary to expectations from adaptation theories, people who had experienced unemployment in the past did not react any less negatively to a new bout of unemployment than did people who had not been previously unemployed. These results suggest that although life satisfaction is moderately stable over time, life events can have a strong influence on long-term levels of subjective well-being.

  6. One interesting thing to look at is how the opinions of people who are cagey about revealing their true opinions show up in which links they share. For example, Scott Alexander and Tyler Cowen both linked to the bogus gender gap article as something interesting to read and tend to link to things that have the same view.

    If you naively read their writing, it appears as if they're impartially looking at evidence about how the world works, which they then share with people. But when you observe that they regularly share evidence that supports one narrative, regardless of quality, and don't share evidence that supports the opposite narrative, it would appear that they have a strong opinion on the issue that they reveal via what they link to.
