Bruno Latour is sometimes derided, sometimes praised, for having made science itself the object of study, and for pointing out the inextricably human politics that go into what gets science's imprimatur. Two recent articles have me thinking about some aspects of these politics.
Depending on who you ask, the publication of a peer-reviewed parapsychological study is either a scandal or a refreshing example of free inquiry. The Journal of Personality and Social Psychology, an academic journal with a good reputation (or it had one), has printed a paper by Daryl Bem of Cornell University, a name and an institution with some respectability. The study reports two different experiments that, Bem claims, give reason to think that events in the future could affect the human mind. One experiment showed results 3.1% better than chance when participants were asked to predict on which of two screens a picture with erotic content would appear. (The control group's non-erotic pictures produced results that stayed within the margin of chance alone.) The other experiment asked volunteers to look at a series of words, then gave them a surprise quiz asking them to type in the words they recalled. After this, the computer randomly selected 24 words from the series and asked the subjects to type them again. The words subjects re-typed (after the recall test) tended to be the ones they had been better at recalling.
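(A note for the statistically inclined: 3.1% above chance sounds negligible, but whether it counts as "significant" is purely a matter of trial count. Here is a minimal sketch of the relevant one-sided binomial test, in Python; the 1,000-trial figure is my own illustrative assumption, not Bem's actual design.)

```python
# One-sided exact binomial test against 50% chance, from first principles.
from math import comb

def binomial_tail(n_trials, n_hits, p=0.5):
    """P(X >= n_hits) for X ~ Binomial(n_trials, p)."""
    return sum(comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
               for k in range(n_hits, n_trials + 1))

n = 1000                 # hypothetical total erotic-picture trials (assumption)
hits = round(n * 0.531)  # the reported 53.1% hit rate
print(f"{hits}/{n} hits, one-sided p = {binomial_tail(n, hits):.4f}")
# At ~1000 trials, 531 hits gives p of about 0.03; at 100 trials, the same
# 53.1% rate would be nowhere near significant. The effect rides on sample size.
```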
Now, eerily, even before I read about the critical reaction to Bem's paper, I somehow just knew, you know?, that the Committee for Skeptical Inquiry would have some thoughts on this. I also knew that the CSI would refer to the experiments of J.B. Rhine in the 1930s. It's eerie.
But perhaps my thoughts on Rhine were triggered by a recent New Yorker article, by Jonah Lehrer, on scientific inexactitude. This article is about the "decline effect," the tendency of a number of well-established experimental results across scientific disciplines to trail off with repeated investigation. That is: very well-designed experiments which seem to show robust correlations tend, on repetition, to yield less and less impressive conclusions. Rather than becoming more and more secure,
all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology.

So scientists attempting to replicate results are coming up short; so what, you might say--this happens all the time in science. Failure to replicate is probably the norm, which keeps one-off flukes or unintentionally engineered results from getting widely accepted:
The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity. Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws.

But this phenomenon is different: what is failing to replicate here is already well-established research, research that had already passed the hurdles of scientific respectability, including peer review and, well, replication.
Among the many complaints that Daryl Bem's results have occasioned is the objection that there must be some problem with the design of the experiment. One comment on The Last Psychiatrist's post on this subject puts it succinctly:
There is a common problem that peer review is specifically designed to avoid. There are often results that seem strange or unexplainable to newcomers to a field that are actually well-known problems of experimental design (i.e. you're not testing what you think you're testing). This is where the experts come in; they have seen these errors before and can point them out before they propagate.

The problem is that Bem's results are not those of a wet-behind-the-ears grad student. One can still say that he (or the Journal of Personality and Social Psychology) should have asked different experts (The Last Psychiatrist thinks it should've been physicists; a lot of commenters have suggested statisticians). See, too, NPR's Robert Krulwich's musings on this.
The scientific process in a nutshell: you notice a phenomenon that you want to account for. You frame a hypothesis. You construct an artificial circumstance in which the only variable is the mechanism of your hypothesis. If your phenomenon is unchanged when your mechanism changes, and you have rigorously screened out all other possible changes, your hypothesis is disproven. If, on the other hand, your phenomenon changes as you alter your chosen mechanism and nothing else, you may consider your hypothesis validated.
This little synopsis will be modified and stretched and clipped and spun by different philosophers of science, but in essence this is the scientific method, a wonder of parsimony, elegance, and indifference.
Of course there is a snag: the little word "only". How possible is it to alter only one circumstance? This is at least part of what the commenter meant by "you're not testing what you think you're testing." And now it seems that all sorts of random effects might squeeze into an experiment, be it never so hermetically sealed. This is at least one possible reading of the experiment, mentioned in the New Yorker article, which reproduced as minutely as possible the circumstances of a test of the effects of cocaine on mice. Same cocaine. Same dose. Same breed and age of mice. Same time in captivity, same dealer. Same cages. Same bedding material. Same etc., etc., etc. The only difference was location: in Portland and Albany the coked-up mice moved six or seven hundred centimeters more than usual; in Edmonton, Alberta, they moved over five thousand centimeters more. And in different tests, it was different labs' mice whose stats landed in outlier territory. In other words, it might just be noise, but noise you can't screen out.
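What would such unscreenable noise look like in the numbers? Here is a toy simulation, with every figure invented, of identical protocols run in labs that each carry a hidden offset the protocol cannot touch:

```python
# "Noise you can't screen out": same protocol, but each lab contributes its
# own hidden offset (a random effect). All numbers invented for illustration.
import random

random.seed(1)

TRUE_DRUG_EFFECT = 700   # hypothetical extra cm moved on cocaine (assumption)
LAB_SPREAD = 1500        # hidden between-lab variation, same units (assumption)
MOUSE_SPREAD = 300       # ordinary mouse-to-mouse noise within a lab

def lab_mean(n_mice=20):
    lab_offset = random.gauss(0, LAB_SPREAD)  # the part no protocol screens out
    mice = [TRUE_DRUG_EFFECT + lab_offset + random.gauss(0, MOUSE_SPREAD)
            for _ in range(n_mice)]
    return sum(mice) / n_mice

for lab in ("lab A", "lab B", "lab C"):
    print(f"{lab}: mean extra movement = {lab_mean():7.0f} cm")
# Three faithful "replications" can land thousands of cm apart, because the
# lab offset never averages out within a single lab -- only across many labs.
```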
Or then again, maybe reality just wants to play tricks. Maybe it adjusts to your findings, in a kind of reversal of Rupert Sheldrake's morphic fields, so that rather than spreading, a breakthrough insight gets canceled out. Or maybe, per Quentin Meillassoux's hyperchaos, on which the laws of nature could change at any moment, the laws of nature are in fact changing at every moment. Or maybe what you can't screen out is fundamentally relevant, not noise at all, but either something you can't correct for or something you'd never think to correct for. Maybe, as Heraclitus said, "Nature loves to hide."
The "decline effect" has been getting attention from Jonathan Schooler, who was frustrated by the difficulty he was having replicating his own results, the results that had made him famous in the world of cognitive psychology in 1990. These concerned what he called "verbal overshadowing": the notion that having described a face in words actually makes it harder, rather than easier, to recognize visually. Schooler's initial results were striking: subjects who had watched a video of a bank robbery and then written a description of the robber later identified the robber from photos with an accuracy of about 38%, as opposed to 64% accuracy among those who had not written a description. This is a significant result, and (assuming the experiment was well-designed in the first place) it ought to be replicable. But Schooler himself found his results dwindling; the effect would still be there, but less starkly. It dwindled by 30%, then by another 30%.
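(Spelling out the arithmetic of that dwindling, taking the figures above at face value:)

```python
# Illustrative arithmetic only: the initial verbal-overshadowing effect as a
# 26-point accuracy gap, eroded by two successive 30% declines.
no_description = 0.64    # accuracy without a written description
with_description = 0.38  # accuracy after writing one
effect = no_description - with_description
print(f"initial effect: {effect:.2f}")   # 0.26, a 26-point gap

for i in (1, 2):
    effect *= 0.70                       # a 30% decline
    print(f"after decline {i}: {effect:.3f}")
# 0.26 -> 0.182 -> 0.127: the effect is still there, but at half strength.
```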
A profoundly troubled Schooler looked into the work of a predecessor: the aforementioned J.B. Rhine, whose investigations into E.S.P. in the 1930s found one test subject who was astoundingly good at guessing (or "seeing," depending on what you believe) the faces of Zener cards. At least, he was good for a while: whereas most of Rhine's subjects guessed rightly at about the 20% chance rate (there are five symbols), Rhine's star subject, Adam Linzmayer, would sometimes guess at a shocking near-50% rate. In fact, Linzmayer initially guessed two different nine-card runs at 100%, and for a very long while his record remained in the upper 30s. Critics like to pooh-pooh Rhine's results with the claim that his experiments were sloppy (and some were), but what is really interesting is that Linzmayer's high results did exactly what so many other results, results no one has dreamed of calling fraudulent, have done: they declined over time. Eventually Rhine postulated that Linzmayer was bored or distracted; in any case, something was interfering.
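(To see how far above chance those early numbers sit, a quick computation; the 100-guess run length in the last step is my own assumption, for illustration:)

```python
# Odds of Linzmayer's early performances under pure chance (p = 1/5 per guess).
from math import comb

p = 1 / 5  # chance of guessing one Zener symbol (five symbols in the deck)
print(f"one perfect 9-card run:  {p**9:.2e}")       # ~5.1e-07
print(f"two perfect 9-card runs: {(p**9)**2:.2e}")  # ~2.6e-13

# Sustained guessing "in the upper 30s" is also wildly unlikely by chance,
# e.g. 35+ hits in a hypothetical 100-guess run:
n, k = 100, 35
tail = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))
print(f"P(>= {k}/{n} by chance) = {tail:.1e}")      # a few in ten thousand
```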
In 2004, Schooler designed an experiment "testing" for precognition, but his real quarry was the decline effect. His experiment was structurally very like Bem's. Schooler asked test subjects to identify visual images flashed momentarily before them. The images were shown so quickly that they usually did not register consciously, so subjects could not often give a description, but sometimes they could. Half of the images were then randomly chosen to be shown again. The question Schooler asked was: would the images that chanced to be seen twice be more likely to have been consciously seen the first time around? Could later exposure have retroactively "influenced" the initial successes?
The difference between Schooler's and Bem's experiments is not in the design but in the aim. Schooler "knows that precognition lacks a scientific explanation. But he wasn’t testing extrasensory powers; he was testing the decline effect."
“At first, the data looked amazing, just as we’d expected,” Schooler says. “I couldn’t believe the amount of precognition we were finding. But then, as we kept on running subjects, the effect size”—a standard statistical measure—“kept on getting smaller and smaller.” The scientists eventually tested more than two thousand undergraduates. “In the end, our results looked just like Rhine’s,” Schooler said. “We found this strong paranormal effect, but it disappeared on us.”

Bem, according to the New York Times, has received hundreds of requests for the materials needed to replicate his study. Since the materials include a good stack of erotic pictures, we must exercise some charity in surmising the researchers' motives. Now here is my prediction: Bem's results will decline just as Schooler's did, and this will tend to validate the critics' dismissal of his initial study; they will not ask themselves about the initial findings, just as none of them asked about Schooler's. And we will still not know why the results flatline.
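(The boring statistical story a critic might tell about that trajectory: if a line of research only gets pursued after an early batch of data happens to look amazing, the running effect size will regress toward the true, possibly null, effect as subjects accumulate. A toy simulation of that account, with every parameter invented:)

```python
# Selection plus regression to the mean: a sketch of one conventional account
# of the decline pattern. All parameters are invented for illustration.
import random

random.seed(7)

TRUE_EFFECT = 0.0   # assume no real precognition
NOISE = 1.0         # trial-to-trial variability
BATCH = 50          # subjects per batch

def batch_mean():
    return sum(random.gauss(TRUE_EFFECT, NOISE) for _ in range(BATCH)) / BATCH

# Research programs launch on striking data: resample until a batch dazzles.
first = batch_mean()
while first < 0.25:
    first = batch_mean()

total, batches = first, 1
for _ in range(9):  # then keep running subjects, as Schooler did
    total += batch_mean()
    batches += 1
    print(f"running effect after {batches * BATCH:3d} subjects: "
          f"{total / batches:+.3f}")
# The "effect" declines toward zero with no fraud and no change of method:
# the dazzling first batch was selected noise.
```

Whether that tidy account actually covers Schooler's and Bem's curves is, of course, exactly the open question.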
If replication is what separates the rigor of science from the squishiness of pseudoscience, where do we put all these rigorously validated findings that can no longer be proved? Which results should we believe?