This is a talk given for Consilium Scientific today, May 18, at the invitation of Leeza Osipenko. Consilium are doing more than anyone to raise questions about the quality of the evidence we have in medicine – in particular around controlled trials. They have had some fabulous contributions in recent months – all of which can be accessed on their website. This lecture came with a Question and Answer session that had its cut-and-thrust moments.
As usual with these lectures, Bill James and I also recorded a version of If God Doesn’t Play Dice, Should Doctors? that I think works very well.
The talk appeals to ideas that can best be found in the work of Sander Greenland – see Here – which has the memorable phrase First Do No Harm to Knowledge, adapted here to Do Not Bring a Poison out of a Good.
Einstein famously said that God does not play dice with the universe. In France, in 1654, gambling with dice gave rise to probability theory, which led to what we now call medical statistics. Seventy-five years ago doctors recruited medical statistics, in the form of Randomized Controlled Trials (RCTs), to give them confidence when they Roll the Dice on the Drugs they give us.
How confident should we be?
Slide 2: Fifty years after the first RCT, Don Schell, a tough oilman from Wyoming, was put on paroxetine for a minor sleep problem. Forty-eight hours later he shot his wife, his daughter and granddaughter, and then himself. His surviving son-in-law brought a lawsuit against GlaxoSmithKline (GSK) – Tobin v SmithKline.
In the Tobin case, Ian Hudson, Chief Safety Officer of GSK, was asked: Can SSRIs cause suicide? He said GSK practice EBM, which means they base their views on randomized controlled trials (RCTs) – which use probability to find the truth.
A jury of 12 people, with no background in healthcare, dismissed Hudson’s EBM in favor of Evident Based Medicine. Their diagnosis was that it was obvious paroxetine had caused this and that GSK were guilty of negligence.
Hudson’s view, however, remains ensconced at the top of Britain’s drugs regulator, of which he was later the Chief Executive Officer – as well as at the top of the FDA, EMA, and other regulators.
Slide 3: Hudson’s views originate 70 years earlier in the work of a strange man – Ronnie Fisher. Here you see Fisher smoking a pipe. He later dismissed the link between smoking and lung cancer. Evidence was not Fisher’s strong point.
Fisher was not a doctor and never ran an RCT. Controlled trials and randomization were there before him, but his book The Design of Experiments turbocharged them.
Fisher was trying to characterize expert knowledge. Experts know the right answer – like parachutes work. If we set up two groups, one with parachutes and the other not, we would expect those wearing parachutes to live and those not to die.
Chance was really the only thing that could get in the way of the expert being right – perhaps a strong wind lands a person in a snow-covered tree. Chance could be assigned a statistical significance value. If 1 in 20 of those without parachutes lived, we wouldn’t say the expert didn’t know what he was talking about.
There might be other trivial things – someone with webbed feet might behave differently when falling, and randomization can control for any trivial unknown unknowns like this. Somehow Fisher’s book transformed randomization into something semi-mystical – that would help us overcome ignorance – but randomization can’t control for ignorance.
Slide 4: Fisher’s expert is a Robin Hood who 19 times out of 20 can split a prior arrow lodged in the Bull. Expertise is precise, accurate and Real World.
Slide 5: The RCTs done to license drugs, especially antidepressants, look like this rather than like Robin Hood. A mismatch on this scale indicates we are not dealing with expertise.
Slide 6: Tony Hill ran the first medical RCT in 1947 giving streptomycin for tuberculosis. Hill later showed smoking caused lung cancer. He had no time for Fisher. He knew doctors were not experts. His trial was not a demonstration of expertise. He used randomization as a method of fair allocation – not to manage mystical confounders.
Hill’s RCT found out less about streptomycin than a prior non-randomized trial at the Mayo Clinic, which showed it can cause deafness and that tolerance develops rapidly.
Slide 7: In a 1965 lecture, Hill took stock of RCTs. He noted it was interesting that the people most heavily promoting RCTs were pharmaceutical companies.
He didn’t think trials had to be randomized. He thought double-blinds could get in the way of doctors evaluating a drug. He believed in Evident Based rather than Evidence Based Medicine.
Hill said we needed RCTs in 1950 to work out if anything worked. By 1960, we had lots of drugs that worked, none discovered by RCTs, and the need was to find out which drug worked best. This is not something RCTs can do – there is no such thing as a best drug.
He also said that RCTs produce average effects, which are not much good for telling a doctor what to do for the patient in front of them.
Here in this quote he is saying RCTs can help evaluate one thing a drug does which means they are not a good way to evaluate a drug overall. All RCTs generate ignorance but we can bring good out of this harm if we remember that. Hill never saw RCTs replacing clinical judgement.
Slide 8: This 1960 RCT run by Louis Lasagna makes Hill’s point. Thalidomide has therapeutic efficacy as a sleeping pill but this trial missed the SSRI-like sexual dysfunction, suicidality, agitation, nausea and peripheral neuropathy it causes.
Two years later, Lasagna was responsible for incorporating RCTs into the 1962 FDA Act – in order to minimize the chance of another thalidomide. By doing this, he was, more than anyone else, the man who got us using RCTs. The mechanism he put in place to stop thalidomide happening again was one it sailed through.
Other regulations aim at safety – whether for planes, cars, food or investment – but the 1962 regulations uniquely stressed efficacy and in so doing badly compromised safety.
Slide 9: The 1950s gave us better antihypertensives, hypoglycemics, antibiotics and psychotropic drugs than we have ever had – all without RCT input.
Imipramine, the first antidepressant, is much stronger than SSRIs. It can treat melancholia – SSRIs can’t. Melancholia comes with an 80-fold increased risk of suicide.
In an RCT of imipramine versus placebo in melancholia, we would expect the red dots showing suicide attempts to be fewer on imipramine – even though it can cause suicide – because it treats this high-risk condition. This RCT would look like evidence that imipramine cannot cause suicide.
Imipramine was launched in 1958. At a meeting in 1959, experts noted that while it was a wonderful treatment it made some people suicidal. Stop the drug and the suicidality clears. Re-introduce it and suicidality comes back. This was Evident Based Medicine.
Slide 10: In the mild depression trials that brought the SSRIs to market – we see an increase of suicidal events compared to placebo in people at little or no risk of suicide.
Slide 11: Used as a comparator in these trials, imipramine too now appears to cause suicide.
The diametrically opposite RCT outcomes for imipramine stem from the fact these are Treatment Trials not Drug Trials. If the condition and treatment produce superficially similar effects, RCTs can confound us. This is true for most medical conditions and their treatments.
If you want to see what a drug does – you should do a Drug Trial.
Slide 12: Here is what a Drug Trial looks like. In healthy volunteer studies in the 1980s, companies found SSRIs made volunteers suicidal, dependent and sexually dysfunctional. These Drug Trials enabled companies to engineer Treatment Trials to hide these problems.
Slide 13: There are more dead bodies on SSRIs than on placebo in trials, yet the RCTs show the drugs work. This is because working is measured on a surrogate outcome. For antidepressants it’s the Hamilton Scale for Depression. Fifteen years after its creation, Max Hamilton commented that this scale standardizes clinical interviews – which can be both good and bad.
Slide 14: In trials, the Hamilton scale has suicide, appetite, sleep, anxiety and sex items on all of which the illness or the drug may produce effects. If Leeza is in an RCT and I ask her if she has been suicidal in the last week, if she says yes she tried to kill herself, I would score a 4. But if I figured this was caused by the drug I would score a Zero.
But trials eliminate judgement. Only by introducing judgement can one know what the results mean. Trials have become just the opposite of what Tony Hill intended.
Slide 15: In addition to randomization, Fisher put Statistical Significance on the map. By 1980 every leading medical statistician was saying we need to get rid of statistical significance in favor of Confidence Intervals.
This image is from the James Webb telescope. Confidence Intervals were introduced by Gauss in 1810 to solve a telescope problem. Because of measurement error, telescopes often failed to establish whether there were one or two stars in a location. As measurement errors should be distributed normally, confidence intervals could help distinguish individual stars.
Slide 16: Confidence intervals rushed into therapeutics in the mid-1980s. Leading medical statisticians argued they were more appropriate than significance testing. They are more appropriate for measurement error but is this what we have in Treatment Trials?
Slide 17: Confidence intervals allow us to estimate the size of an effect and the precision with which it is known. The estimate of the likelihood of the Red Drug killing you here is more precise than for the Yellow Drug. The best estimate of the lethality of the Yellow Drug, however, is greater. The standard view is that if we increase the size of the Yellow Drug Trial, we will have greater precision and know better what the risks are. This is wrong, as you will see.
If you are forced to take one of these drugs, as things stand now, Ian Hudson and the FDA will say the only dangerous drug here is the Red one. This is because more than 95% of the data – more than 19 out of 20 data points – lie to the right of the line through 1.0. This is exactly what medical statisticians say is wrong.
I would take the Red drug, because these confidence intervals are not managing measurement error and we don’t know what they mean when they are not representing measurement error.
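The arithmetic behind these pictures can be sketched in a few lines. This is a hypothetical illustration with invented event counts, not data from any actual trial: an odds ratio with a Woolf-style 95% confidence interval, showing how a Red Drug with a narrow interval wholly above 1.0 gets flagged as dangerous, while a Yellow Drug with a higher point estimate but a wide interval crossing 1.0 gets waved through.

```python
import math

def odds_ratio_ci(drug_events, drug_n, placebo_events, placebo_n, z=1.96):
    """Odds ratio with a Woolf (log-scale) 95% confidence interval."""
    a, b = drug_events, drug_n - drug_events
    c, d = placebo_events, placebo_n - placebo_events
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical "Red Drug": many events, narrow interval wholly above 1.0
print(odds_ratio_ci(40, 2000, 8, 1000))  # CI excludes 1.0 -> "dangerous"

# Hypothetical "Yellow Drug": higher point estimate, wide interval
print(odds_ratio_ci(6, 200, 2, 200))     # CI crosses 1.0 -> "no problem"
```

The Yellow Drug's best estimate of harm is the larger of the two, yet on the Hudson view only the Red Drug is dangerous – which is the point being made above.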
Slide 18: In 1991, facing claims Prozac caused suicide, Lilly analysed their RCTs and spun the Confidence Intervals here as evidence Prozac does not cause suicide. This is Ian Hudson thinking – there is no problem as nothing is statistically significant.
Sander Greenland and leading medical statisticians say you need to view these as compatibility intervals rather than confidence intervals. All these curves show a compatibility with Prozac causing suicide and the consistent excess of suicidal events in all groups points strongly to a problem.
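Greenland also suggests translating a p-value into an S-value – the bits of information the data carry against "no effect", computed as -log2(p). A minimal sketch (the p-values here are invented for illustration) shows why a non-significant excess is not evidence of safety:

```python
import math

def s_value(p):
    """Greenland's surprisal: bits of information against the null."""
    return -math.log2(p)

# An excess of suicidal events with p = 0.2 still carries ~2.3 bits
# against "no effect" -- about as surprising as two heads in a row.
# That is compatibility with harm, not evidence of safety.
print(round(s_value(0.2), 2))   # 2.32

# Even the conventional p = 0.05 threshold is only ~4.3 bits --
# roughly four heads in a row.
print(round(s_value(0.05), 2))  # 4.32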
The bigger point is that for 100 years statisticians have been telling us we cannot assume that statistical data bears any relationship to the real world – we have to establish it.
Slide 19: Here is a representation of suicidal events from the trials bringing Prozac, Seroxat and Zoloft to market around 1990. Note the events under screening. There is a 2 week washout period before a trial starts where people are taken off prior drugs before being randomized. This phase of a trial is dangerous – people are in withdrawal and may become suicidal.
Slide 20: When submitting the data to FDA, the companies moved events as you see here – arguing people in the run-in phase were on nothing, which is equivalent to being on placebo. There were other maneuvers at the end of the trials, as you see here.
Even with these maneuvers, there was an excess of suicidal events on SSRIs but the 95% confidence interval was no longer to the right of 1.0. Why do this? Because regulators and companies need a Stop-Go mechanism and statistical significance provides this. But doctors don’t need an external Stop-Go mechanism to replace their clinical judgement, so why do we go along with this?
Slide 21: Nobody noticed these maneuvers in 1990, but 10 years later in a crisis about children becoming suicidal on SSRIs, questions were asked. GSK and Pfizer responded:
‘GSK did not intentionally submit any erroneous or misleading information to FDA. The suicide data submitted to FDA explicitly identified when events occurred during the placebo run-in period. FDA had all this information right from the beginning.’
“Pfizer’s 1990 report to FDA plainly shows … that 3 placebo attempts as having occurred during single blind placebo phases… FDA has neither criticized these data or the report as inappropriate, nor required additional analyses”.
These maneuvers breach FDA regulations, as FDA staff noted. Senior FDA honchos ignored this and even put their names to articles that embraced these illegitimate figures to argue that placebo-controlled RCTs were not unethical, as those on placebo were not at any greater risk than those on treatment.
FDA and companies liaised closely over the suicide crisis in 1990. Criminally? Perhaps. I prefer the idea of strategic ignorance.
There is a crisis in knowledge production here. This is not something you can expect FDA to take a lead on – they are bureaucrats. Doctors should be the people creating medical knowledge but they went missing in action around 1990, leaving companies able to create the appearances of knowledge.
Slide 22: Following the suicide-in-children crisis, FDA wanted the data from adult trials and wanted companies not to make the same maneuvers they had made before. GSK submitted these data, which as you see point to a problem for paroxetine.
The sacred mantra of RCTs is that randomization controls for all possible confounders in all possible universes. The ability of randomization to introduce confounders into clinical trials is about to come to GSK’s rescue.
Slide 23: GSK also did 2 trials in Intermittent Brief Depressive Disorder (IBDD) – patients who have regular suicide attempts. Paroxetine didn’t do well – one trial was stopped because it was doing so poorly. Why do these trials?
Slide 24: When you add these figures together, suddenly paroxetine protects against suicide. First you need to know that IBDD patients could be admitted to MDD trials – we have no way to distinguish them. Some patients become IBDD by virtue of a poor response to an SSRI. What happens if IBDD patients are in an MDD trial is the same as what happens if you add the groups of trials together, as we did here.
This scenario happens every time a medical condition is heterogeneous – as diabetes, dementia, Parkinson’s disease, breast cancer, back pain, hypertension and most conditions are. In these cases, randomization will hide effects good and bad – and enable us to use one problem a drug causes to hide another problem it causes.
Slide 25: Graphically the Red Drug here is the MDD curve alone – more than 95% of the data are to the right of the 1.0 line. The traditional wisdom is that adding some more events to the Red Drug trials should give us a more precise version of the same estimate.
Adding fewer than 3% more events in this case, we have shifted the curve to the opposite side of the 1.0 line. It’s a more precise confidence interval, but this precision speaks to our ignorance rather than to better knowledge. Medical statistics books don’t hint at this possibility.
We can add 40 suicidal events to the paroxetine IBDD arm before GSK would have to admit Paroxetine causes a problem – on the basis that the results are now statistically significant.
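The pooling maneuver can be reproduced with invented numbers – a Simpson's-paradox-style sketch, not GSK's actual figures. A small high-risk stratum with many placebo-arm events is enough to drag a pooled odds ratio from clearly above 1.0 to below it:

```python
import math

def odds_ratio_ci(drug_events, drug_n, placebo_events, placebo_n, z=1.96):
    """Odds ratio with a Woolf (log-scale) 95% confidence interval."""
    a, b = drug_events, drug_n - drug_events
    c, d = placebo_events, placebo_n - placebo_events
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return (or_, math.exp(math.log(or_) - z * se),
            math.exp(math.log(or_) + z * se))

# Hypothetical MDD trials alone: an excess on drug, CI wholly above 1.0
print(odds_ratio_ci(40, 2000, 8, 1000))

# Add a hypothetical high-risk IBDD stratum with many placebo-arm events
# (5/100 on drug vs 30/100 on placebo) and pool the raw counts:
print(odds_ratio_ci(40 + 5, 2000 + 100, 8 + 30, 1000 + 100))
# The pooled point estimate now sits below 1.0 -- apparently "protective"
```

A small number of added patients, contributing disproportionately many placebo-arm events, flips the apparent direction of the effect – more precise, and more wrong.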
Confidence intervals do not help us work out what is going on here. Nor do they help with heterogeneous drug responses. If we clone a David who is sedated by a Red Drug and an Ian Hudson who is stimulated by it, the best estimate of the Red Drug’s effect will lie on the 1.0 line, apparently showing this drug has no effect on sleep. A method to distinguish between one and two stars should not produce the answer that there are no stars. Algorithmic judgements cannot substitute for a human judgement.
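The cloned-David-and-Hudson point reduces to simple averaging – a toy sketch with an invented sleep score:

```python
from statistics import mean

# Invented sleep-score changes: half the population is sedated (-2 points),
# half stimulated (+2 points) by the same drug.
effects = [-2, +2] * 50

print(mean(effects))                   # 0 -- "the drug has no effect on sleep"
print(mean(abs(e) for e in effects))   # 2 -- yet every individual is affected
```

The average effect is exactly zero even though no individual in the sample is unaffected – which is what the no-stars answer looks like in numbers.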
Slide 26: Here is another image from James Webb. Confidence intervals were a step on the way to revealing the individuality of stars. From genetics to astronomy, science reveals individuality except in medicine where statistical approaches as applied operate against individuality.
We are legitimately bound to be as objective as we can. Because of our fetish with numbers, we think that using Chance to control Bias is going to make us objective, and so we allow mindless algorithms to replace clinical judgement. Clinical medicine, like law, and the first 300 years of science, however, used Bias to Control Chance.
Slide 27: In the early 1980s, the idea that RCTs were the scientific and sophisticated way to demonstrate adverse effects was creeping in – as you see here. Lasagna, the man responsible for us doing RCTs, took issue with this and said it is only true if by sophisticated you mean adulterated – to sophisticate a wine means to adulterate it.
Evident Based Medicine is how to establish adverse events and a great example turned up a few years later.
Slide 28: In 1990 Martin Teicher and 2 colleagues claimed fluoxetine had made 6 people suicidal. Following the traditional clinical approaches for determining causality – exposure to the drug, dechallenge, rechallenge, listening to the patient – this article nailed beyond doubt that fluoxetine made some people suicidal.
Roughly twenty other groups reported similar findings over the next year including me. This was Evident Based Medicine showing Prozac could cause suicide.
Slide 29: Lilly responded with this article in the BMJ, claiming an analysis of their RCTs showed no evidence Prozac made people suicidal. The cases reported, they said, were sad but anecdotal – and the plural of anecdote is not data. Depression was the problem, not fluoxetine. Clinical trials are the science of cause and effect. The challenge to all of us was whether we were going to believe the science or the anecdotes.
This was a knowledge creation moment that likely had input from all companies and perhaps FDA. This article created Evidence Based Medicine and just as with RCTs 30 years earlier, the people exhorting doctors to practice EBM today are Pharma companies.
In fact, the original phrase is the plural of anecdotes is data – otherwise Google wouldn’t work.
The idea the disease is responsible for suicide attempts and suicides in healthy volunteers is hard to believe but companies can wheel out experts to say just that.
My key point is Evident Based Medicine is the science – the Lilly data is an artefact. My challenge to you is which are you going to believe the Science or the Artefact?
Lilly claimed Prozac didn’t cause suicide even though the excess of suicidal events on it in this paper was compatible with the fact it does.
You’ve seen that all the companies cooked the books. When uncooked, this excess was statistically significant.
But that’s not the real problem. An incompatibility between Evident Based Medicine and Evidence Based Medicine can help us move science forward. If Prozac was as effective as imipramine, grappling with the incompatibility outlined earlier would have increased our expertise.
Lilly, however, were not in the business of embracing discrepancies. Their argument was a religious one – a dogmatic one. They demanded we ignore the Evident.
The Evidence Based Medicine movement has refused to call out the ghostwriting of the company trial literature and the lack of access to company trial data. More to the point, they have not taken issue with this egregious breach of scientific methodology.
Therapeutics involves bringing good out of the use of a poison. Volunteering for clinical trials was a risky good we undertook to benefit our family, friends and countrymen. Companies have been bringing a poison out of this good.
Slide 30: The usual histories of science start with the foundation of The Royal Society in 1660, which famously said Science would deal with matters that could be Settled by Data. Participants could be Christian, Hindu, Jew, Muslim, or Atheist, but they were called on to leave these badges at the door and come to a consensus about the best way to explain the experimental outcome in front of them.
The histories of science emphasize the word Data. Settled is a more important word. Statistics played no part in this science. The experiments were events that didn’t need statistical descriptions. Science does not replace judgment calls with a statistical artefact – this only began 33 years ago.
Slide 31: This history overlooks an event in 1618, when Walter Raleigh was executed – for being too close to the French and Spanish. Raleigh was convicted on the basis of things said about him by people who did not come into court to be cross-examined.
The legal system recognized an injustice and introduced Rules of Evidence. Hearsay could not be used as evidence. Jurors – a group of 12 people, Christians, Hindus, Muslims, Atheists and Jews – can only base a verdict on material put in front of them that can be examined and cross-examined. The process of forcing 12 people with very different biases to come to a Verdict about what is in front of them is the essence of science.
Verdicts and diagnoses are provisional, which might appear to contrast with the objectivity of science, but scientific views are similarly provisional. Scientists attempt to overturn verdicts with new data.
Let’s say I gave Leeza fluoxetine 33 years ago and she became suicidal. I could examine and cross-examine her, run labs and scans, raise the dose, stop the drug, add an antidote, have a case conference with all of you able to ask questions to see if we could explain this in any other way. She is the data, the apparatus in which the experiment is taking place.
If Leeza and I and you conclude fluoxetine made her suicidal and report this to MHRA or FDA, the first thing FDA will do is to remove her name. No-one can now examine or cross-examine her and come to a scientific view about whether there is a link or not. Her injury has been made Hearsay – misinformation.
If you are later injured in the same way and see tens of thousands of reports of suicidality on SSRIs on FDA’s adverse event reporting system, you cannot bring this into court because no-one can be brought into court. It’s Hearsay not Evidence.
Company RCTs are equally hearsay and should not be let into Court as evidence. Accessing the data means accessing people – like Leeza or me – and we cannot do that with subjects in company trials, who often don’t exist. Company articles are ghostwritten and the authors, who have seen none of the patients, cannot speak to what happened either.
In contrast, if Leeza and I report her case in a Medical Journal as a Case Report, with our names on it, this is evidence and we can both be brought into Court.
Slide 32: In 1997, you have Lasagna here echoing Tony Hill 30 years earlier saying:
In contrast to my role in the 1950s which was trying to convince people to do controlled trials, now I find myself telling people that it’s not the only way to truth.
Evidence Based Medicine has become synonymous with RCTs even though such trials invariably fail to tell the physician what he or she wants to know which is, which drug is best for Mr Jones or Ms Smith – not what happens to a non-existent average person.
Slide 33: RCTs have created a scenario where drugs have benefits and no problems. This has led to polypharmacy, first noted as an issue around 2000.
As of 2016, over 40% of over-45s in the US were on 3 or more drugs every day of the year, and over 40% of over-65s were on 5 or more drugs every day of the week. US life expectancy has been falling dramatically – and this was all pre-Covid.
Reducing medication burdens can increase life expectancy, reduce hospitalizations, and improve quality of life.
Slide 34: But reducing a medication burden is not easy – as this image from the movie The Hurt Locker illustrates. Many of these drugs explode on attempting to withdraw them. Deprescribing is the primary medical task of our age. No RCT will ever help with this. The best evidence will lie in clinical experience of tackling similar situations. Being able to talk to clinical colleagues will help but the key scientific partner is the patient – who brings clues from missing doses of some of these drugs, and a sense of what the drugs are doing that can only be accessed through them. The patient is the apparatus in which the experiment is taking place and each patient and their response to drugs is unique.
Slide 35: We began with Einstein. Ein Stein means one stone – one shape. Up till now, we have had no mathematics that might counter the averaging we get from misapplied medical statistics. Now we have.
Mathematics is more about shapes than numbers, and earlier this year a New Shape was discovered – the first aperiodic or truly individual shape, meaning it cannot be incorporated into other shapes or averages. This may offer us a template for a new, robustly individual mathematics – and perhaps a better template for clinical practice than rolling dice has been. Or should I say than the rigged gambling we have had up to now.