The Fault Lies in Our Stars

December 13, 2020

The Fault lies in our Stars not in Ourselves:
Randomized Controlled Trials & Clinical Knowledge

In the Beginning

In 1947, a trial of streptomycin introduced RCTs to medicine. From then, through to their incorporation into the 1962 amendments to the Food, Drug, and Cosmetic Act, occasioned by the thalidomide tragedy, there were questions about the epistemological link between RCTs and clinical reality. Since 1962, there have been disputes about the best statistical approach to take to RCT data – whether confidence intervals are preferable to significance testing, for instance. There have also been efforts to account for heterogeneity of treatment effects (HTE) within the wider Evidence Based Medicine (EBM) movement, which touch on the issues raised here, but this questioning assumes RCTs connect with clinical reality and the only task is one of smoothing some statistical edges.

Repeated characterizations of RCTs as offering gold-standard evidence likely leave many clinicians thinking these trials have a solid epistemological foundation, even as clinicians recognize difficulties in translating from population or average effects to individual patients. In legal settings, RCTs are pitched as generating evidence that is generalizable and knowledge that lies within confidence limits in contrast to the views of clinicians and case reports.

Pre 1962: Hill, Fisher and Randomization

A Medical Research Council trial of streptomycin in 1947 demonstrated the feasibility of randomization as a control of the subtle biases involved in evaluating a medicine. Tony Hill, the MRC trial lead, got the idea of randomization from a horticultural thought experiment about fertilizers outlined two decades previously in which Ronald Fisher proposed that randomization could control for unknown confounders. Hill thought that randomization might control for the difficult to detect ways in which clinicians steer patients likely to respond well into an active treatment arm. Hill’s randomization was a method for fair allocation, not a means of controlling for the unknowns linked to doctors not knowing what they were doing (Healy 2020).

Hill’s trial missed the tolerance that develops to streptomycin and the deafness and other problems it causes – information evident in a prior trial of streptomycin that controlled for confounders in the then standard way and depended on clinical judgement (Healy 2020).

RCTs brought statistical significance in their wake because Fisher argued that the only things that can stop expert judgement being correct every time are unknown confounders and chance. Significance testing could control for chance and randomization for unknown confounders. Fisher’s model had an anchor in the real world – an expert whose judgements were invariably correct – such as offering a view that wearing a parachute if you jump from a plane at 5,000 feet will save your life. For Fisher, experiments were a way of demonstrating that we knew what we were doing rather than a leap into the unknown. They should get the same result every time.

The more doctors know what they are doing, the more they approach Fisher’s expert, but no one runs RCTs in situations where we are likely to get the predicted result every time.

In the case of breast cancer, on the basis of advances in physiology, it was hoped that giving Herceptin to Her 2+ receptor breast cancers might produce better responses than cisplatin, a more indiscriminate toxin, which nevertheless extends longevity compared to placebo. Trials confirm this but also reveal that even using Herceptin in Her 2+ breast cancers, we do not get the same result every time – there is a lot we don’t know.

In contrast, in trials comparing stents to other cardiac procedures, doing what seems physiologically obvious does not produce the expected results. The issue is not whether stents work but whether we know what we are doing, which we mostly don’t. While recent stent trials demonstrate the power of RCTs to stall a therapeutic bandwagon, the view that clearing blocked arteries might not produce a good outcome had been accepted clinical wisdom in vascular leg surgery and for stents in some quarters prior to any RCT.

Pre 1962: Neyman and Confidence Intervals

Jerzy Neyman and Egon Pearson took issue with Fisher’s real-world anchor – a semi-infallible expert. They borrowed from Carl Friedrich Gauss’ use of confidence intervals to manage the error in astronomical measurements of stars. Gauss’ ideas were picked up by Pierre-Simon Laplace and their combined input (1809-1827) to the central limit theorem, least-squares optimization and efficient linear algebra provided celebrated benefits for the physical sciences, engineering, astronomy, and geodesy.

Applied to imprecise measuring instruments and invariant entities like stars, confidence intervals have an anchor in the real world, helping us to decide if our varying measures reflect the presence of one or two stars. Taking successive measurements of a pulse in an individual is similar to determining the precise location of a star – the tighter the confidence interval bounding our measurements, the more apparent it is that we can do things reliably.

Confidence intervals could be used in a manner consistent with their use in astronomy to distinguish between a repeated set of pulse measurements before and during (but not after) administration of a drug – to one individual. The current use of confidence intervals in RCTs seems predicated on the idea that a cohort of patients in standard parallel group trials can be regarded as a single object like a galaxy. But pulses can increase in response to a drug in one individual and decrease in a second in response to the same drug. This is not measurement error.

In cases like this, claiming the true effect of the drug likely lies near some mean of the effects in a group of individuals, potentially giving us a best estimate of no effect, is wrong. A mechanism to decide whether there are one or two stars present should not turn up the answer there are none. If the gap between Average Treatment Effects (ATE) and Heterogeneous Treatment Effects (HTE), despite trial designs to mitigate the problem, is too great, there is some recognition that the notion of ATE falls apart (Kravitz et al 2004). In the case of stars, we knew enough about what we were doing to make reasonable inferences from varying measurements. We need to know as much to make comparable inferences when giving medicines – and we rarely do.
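As a minimal sketch of this point, the Python below uses invented pulse changes (not trial data) for two kinds of responder: one whose pulse rises on a drug and one whose pulse falls. The function name mean_ci, the numbers, and the normal-approximation interval are all assumptions made for illustration. Pooled, the two groups give an average change of zero with an interval straddling zero, even though every individual responded.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(values, z=1.96):
    """Return (mean, lower, upper) using a rough normal approximation."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


# Hypothetical changes in pulse (beats per minute) on the drug.
# These figures are invented for illustration only.
responders_up = [10, 12, 9, 11, 10, 8, 12, 10]            # pulse rises
responders_down = [-10, -12, -9, -11, -10, -8, -12, -10]  # pulse falls
pooled = responders_up + responders_down

for label, data in [("up", responders_up),
                    ("down", responders_down),
                    ("pooled", pooled)]:
    m, lo, hi = mean_ci(data)
    print(f"{label:>6}: mean {m:+.2f}, approx. 95% CI ({lo:+.2f}, {hi:+.2f})")
```

Each subgroup interval excludes zero while the pooled interval is centred on it. The averaging, not the measuring instrument, produces the apparent absence of an effect.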

1962: Lasagna, Hill & Primary Endpoints

The 1938 U.S. Food, Drug, and Cosmetic Act required pharmaceutical companies to establish the safety of their products. The birth defects thalidomide caused produced a political crisis in which something needed to be seen to be done. Louis Lasagna, through Estes Kefauver, proposed that companies should also be required to demonstrate treatment effectiveness – an ineffective treatment cannot be safe. The 1962 Amendments to the 1938 Act paired the word Effectiveness with Safety throughout, with two placebo controlled RCTs later proposed by Lasagna as the means of demonstrating effectiveness.

These provisions were put in place before it was realized that demonstrating effectiveness rather than a treatment effect was not a realistic gateway to the market. In 1962, it was also assumed that RCTs offered generalizable knowledge and a positive result would invariably be replicated but this has not been borne out.

Before 1962, RCTs were not seen as offering gold standard knowledge about what drugs do. As Tony Hill put it in a 1965 lecture, RCTs have a place in the study of therapeutic efficacy, but they are only one way of doing so and any belief to the contrary is mad (Hill 1965). Hill’s lecture ties RCTs to the investigation of one effect and places the information they yield within the framework of clinical judgement.

Fisher’s significance testing, and Gauss’ confidence intervals, require a focus on one effect. In medical RCTs, a focus on a primary endpoint is key to ensuring that only chance or measurement error will get in the way of the correct result. Ipso facto, this means RCTs are not a good way to evaluate a medicine.

A horticultural expert focused on whether a fertilizer improved corn yield would likely have no more accurate a view of its effects on worms in the ground or insects in the air than a non-horticulturalist – in respect of whose views significance testing by Fisher’s definition would not be appropriate. Similarly, Gauss’ confidence intervals applied to measurements of the location of a star are of little use when it comes to pinpointing the trajectories of satellites crossing the path of the observations.

It is often assumed that the primary endpoint in an RCT is the commonest effect of a drug. Treatment heterogeneity leading to wider confidence intervals than are ideal can be accommodated against this background as can missing other effects assumed to be rare or not appearing within the duration of the trial. But the RCT measuring process is often not trained on the commonest thing a drug does. The commonest effect of an SSRI is genital anesthesia, which appears almost universally and within 30 minutes of taking a first pill. It should not be possible to miss it, but this effect has been missed in all RCTs of these drugs for nervous conditions because of an RCT required focus on a primary endpoint.

The measuring attention given to a primary endpoint essentially creates an act of hypnosis in which common treatment effects can be missed entirely or given an incidental status. Casting RCTs as offering gold standard evidence about a drug, rather than one effect of the drug, creates an ignorance about the ignorance they generate.

RCT evidence should never trump an evident safety effect that appears after treatment. If a person becomes suicidal after taking an antidepressant, the issue of what is happening in that case is a matter of assessing the effects of their condition, circumstances, prior exposure to similar drugs, dose changes on the medication and whether there are other evident effects of treatment consistent with a link between suicidality and treatment. Unless RCTs have been designed specifically to look at the effects of treatment on a possible emergence of an effect like suicidality (and there have been none), RCT evidence is irrelevant and it is pernicious to pitch irrelevant RCTs as science that should count for more than clinical “anecdotes” containing Challenge-Dechallenge-Rechallenge (CDR), dose response, and other evidence.

The transformation of RCTs from a hurdle industry had to surmount to make gold into gold standard knowledge has made RCTs a gold standard way to hide a drug’s 99 other effects.

Post 1962: Confounding & Causality

Discussions of the results of epidemiological studies apparently linking drugs to treatment effects often caution that confounding by indication undercuts any easy assumption of a link. RCTs, which are essentially epidemiological studies, rarely come with this rider. Many clinicians likely think that randomization takes care of confounding by all unknown unknowns, including by indication, with many saying RCTs demonstrate cause and effect where other epidemiological studies produce correlations.

Consider scenarios involving the antidepressants imipramine and paroxetine. Imipramine was discovered in 1957 and launched in 1958 without any RCT input. Among other actions, it is a serotonin reuptake inhibitor. In later RCTs, it (and other older antidepressants all discovered and marketed without RCTs) “beat” SSRI antidepressants in trials involving patients with melancholia (severe depression). Melancholic patients are 80 times more likely to commit suicide than mildly depressed patients.

By 1959, clinicians praising imipramine’s benefits also noted it could cause agitation and suicidality in some patients that cleared when the drug was stopped and reappeared when restarted. This Challenge-Dechallenge-Rechallenge (CDR) evidence, especially as it was replicated by several clinicians with different patients, offers close to Fisherian expert like certainty that imipramine causes suicide in certain individuals.

Despite being able to cause suicide, in an RCT of melancholic patients, imipramine seems likely to protect against suicide on average by reducing the risk from melancholia to a greater extent than placebo. In contrast, in the RCTs that brought SSRIs to the market, these drugs doubled the rate of suicidal acts. This was because, weaker than imipramine, SSRIs had to be tested in people with mild depression at little risk of suicide. The low placebo suicidal act rate revealed the risk from the SSRI – as it does for imipramine when put into trials of mild depression. RCTs can, in other words, mislead as regards cause and effect – potentially getting results all the way along a spectrum from “causes”, to possible risk, likely protective and “cannot cause”.
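The arithmetic behind this reversal can be sketched with a toy model. Everything in it is an assumption made for illustration: the function name, the additive form of the model, and all three numbers. The only figure taken from the text is the 80-fold ratio between the baseline risks in severe and mild depression. The same drug, with the same drug-induced risk, looks protective in one population and harmful in the other.

```python
def relative_risk(baseline_risk, illness_risk_reduction, drug_induced_risk):
    """Risk on drug divided by risk on placebo under a toy additive model.

    baseline_risk: risk of a suicidal act from the condition alone (placebo arm)
    illness_risk_reduction: fraction of that risk removed by treating the condition
    drug_induced_risk: extra risk caused by the drug itself
    """
    risk_on_drug = baseline_risk * (1 - illness_risk_reduction) + drug_induced_risk
    return risk_on_drug / baseline_risk


# Hypothetical values, chosen only so that severe is 80 times mild.
SEVERE_BASELINE = 0.08
MILD_BASELINE = 0.001
REDUCTION = 0.5        # assumed: the drug halves the illness-related risk
DRUG_INDUCED = 0.005   # assumed: the same drug-caused risk in both groups

rr_severe = relative_risk(SEVERE_BASELINE, REDUCTION, DRUG_INDUCED)
rr_mild = relative_risk(MILD_BASELINE, REDUCTION, DRUG_INDUCED)

print(f"Severe depression: relative risk {rr_severe:.2f}")
print(f"Mild depression:   relative risk {rr_mild:.2f}")
```

Under these assumptions the relative risk falls below 1 in the severe group and rises well above 1 in the mild group, although the drug's causal contribution never changes.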

In any trial where both condition and treatment cause superficially similar problems, as when antidepressants and depression cause suicidality or bisphosphonates and osteoporosis both lead to fractures, a dependence on RCT data rather than clinical judgement risks misleading. This is likely the case for a majority of RCTs in clinical conditions, which are Treatment Trials rather than Drug Trials.

Drug Trials are done on healthy volunteers, and ordinarily do not have a primary endpoint. In these, treatment effects stand out more clearly. SSRI Drug Trials in the 1980s demonstrated sexual effects were common, often debilitating, and might endure after treatment stopped, that agitation up to suicidality was common and that dependence commonly occurred after exposures of two weeks. The correct choice of primary endpoint in subsequent Treatment Trials could eliminate these effects. The non-confidential Drug Trial data remain unpublished.

Paroxetine was later put into Treatment Trials of patients with Major Depressive Disorder (MDD) and patients with Intermittent Brief Depressive Disorders (IBDD). IBDD patients (borderline personality disorder) are repeated self-harmers. The depressive features IBDD patients have mean that they can readily meet criteria for MDD.

In April 2006, GlaxoSmithKline (GSK) released RCT data showing a worrying increase in suicidal events in MDD patients on paroxetine (Table). The data from IBDD RCTs in the GSK release were better. We can add 16 suicidal events to the paroxetine IBDD column and still get an apparently protective rather than problematic result for paroxetine when MDD and IBDD data are added together.

Table: Suicidal Events in MDD & IBDD Trials

                          | Paroxetine | Placebo | Relative Risk
MDD Trials Acts/Patients  | 11/2943    | 0/1671  | Inf (1.3, inf)
IBDD Trials Acts/Patients | 32/147     | 35/151  | 0.9
Combined Acts/Patients    | 43/3090    | 35/1822 | 0.7

This effect has been noted as a hazard of meta-analyses but it must apply to some extent in every trial that recruits patients who have superficially similar but in fact heterogeneous conditions such as depression, pain, breast cancer, Parkinson’s disease, diabetes or almost any medical disorder. Every time there is a mixture of more than one patient group in a trial, randomization will ensure some patients hide some treatment effects – good and bad. Trials of standard treatments for back pain, for instance, mask the beneficial treatment effects of an antibiotic on back pains linked to infections (up to 10% of back pains).
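The pooling arithmetic in the table can be checked directly. The sketch below takes the counts from the table and pools them crudely, adding numerators and denominators as the Combined row does. The function and variable names are assumptions for illustration, and returning infinity when the placebo arm has no events is a convention chosen here to match the table's "Inf" entry.

```python
def risk_ratio(events_a, n_a, events_b, n_b):
    """Crude relative risk of arm A versus arm B."""
    if events_b == 0:
        return float("inf")  # no events in comparator arm, as in the MDD row
    return (events_a / n_a) / (events_b / n_b)


# (suicidal acts, patients) from the table
MDD = {"paroxetine": (11, 2943), "placebo": (0, 1671)}
IBDD = {"paroxetine": (32, 147), "placebo": (35, 151)}


def pooled_risk_ratio(extra_ibdd_events=0):
    """Add the MDD and IBDD counts, optionally with extra paroxetine IBDD events."""
    drug_events = MDD["paroxetine"][0] + IBDD["paroxetine"][0] + extra_ibdd_events
    drug_n = MDD["paroxetine"][1] + IBDD["paroxetine"][1]
    placebo_events = MDD["placebo"][0] + IBDD["placebo"][0]
    placebo_n = MDD["placebo"][1] + IBDD["placebo"][1]
    return risk_ratio(drug_events, drug_n, placebo_events, placebo_n)


rr_mdd = risk_ratio(*MDD["paroxetine"], *MDD["placebo"])
rr_ibdd = risk_ratio(*IBDD["paroxetine"], *IBDD["placebo"])
rr_pooled = pooled_risk_ratio()
rr_pooled_plus_16 = pooled_risk_ratio(extra_ibdd_events=16)

print(f"MDD:              {rr_mdd}")
print(f"IBDD:             {rr_ibdd:.2f}")
print(f"Combined:         {rr_pooled:.2f}")
print(f"Combined plus 16: {rr_pooled_plus_16:.3f}")
```

With 16 extra events in the paroxetine IBDD column the pooled ratio still sits just under 1. A seventeenth event tips it over, which is why 16 is the figure quoted above.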

This is Heterogeneity of Treatment Conditions (HTC) rather than HTE. In epidemiological studies confounding by indication is commonly taken to mean that we should not for example interpret results apparently associating a treatment like an antidepressant with suicidality given the possibility that depression can cause suicidality, but in fact this effect likely more commonly hides the adverse effects of treatment. It is even possible to design Treatment Trials to hide adverse effects – as above.

The assumption is that in Treatment Trials placebos simply control for natural variation. But placebos can have potent treatment effects, making them another treatment like an antibiotic in a back pain RCT. We do not know enough about placebo responses to know the extent to which, in the context of randomization, they might confound the data.

Every medicine that gets on the market, by definition, beats placebo (often inconsistently). As a result, it has become unethical to use placebos in clinical practice, when for those for whom it works a placebo may be preferable to therapeutic poisoning.

A quantitative approach to data generated by algorithm rather than an approach based on judgement also increases the risk that minor events in a placebo arm will be offset against significant events in an “active” treatment arm creating an opportunity to claim that nothing specific has happened, when it has.

Finally, the suicidality, sexual dysfunction, agitation, and insomnia antidepressants cause in clinical trials are commonly folded into a primary endpoint, the Hamilton Depression Rating Scale (HDRS), which includes questions on suicidality, sexuality, sleep and agitation. These changes render confidence intervals around scores on these items meaningless, compromise the use of the scale more generally, and risk hiding a benefit.

Post 1980: From Therapeutic Poisoning to Sacraments

In 1947, treatment with medicines was viewed as therapeutic poisoning. As of 1951, FDA made most new medicines prescription-only on the basis that they are unavoidably risky. But from the mid-1990s, regulators have licensed drugs on the basis of a favorable risk-benefit profile. This implies a balance in which benefits and risks can be weighed, but there is no balance. One statistically significant effect is taken to count for more than all other effects, even serious effects that occur more frequently and can include death, but which by design are not significant, transforming poisons into sacraments (hyper-real agents from which only good can come).

In 1959, clinicians could readily distinguish between treatment emergent suicidality and suicidality caused by melancholia. In 1961, Frank Ayd, the discoverer of amitriptyline a year before, could distinguish the sexual dysfunction it causes from the sexual dysfunction melancholia causes. Through to 1991, clinical knowledge of the range of effects drugs can cause derived primarily from clinical experience, embodied in case reports and published in clinical journals. A steady rise of mechanical evaluations, however, allied to a sequestration of trial data, has relegated clinical evaluations that drug X causes effect Y, even when buttressed by evidence of CDR, to the status of anecdotes. From 1991, leading journals stopped taking anecdotes about “side” effects that almost by definition must be rare compared to the treatment effect.

As a result, where in the 1960s the harms of treatments took at most a few years to establish after a drug came on the market, by 1990 it could take decades for significant harms such as with impulse control disorders on dopamine agonists, persistent sexual dysfunction on isotretinoin, antibiotics, finasteride and other drugs, mental disorders on fluoroquinolones or leukotriene antagonists, or dependence on psychotropic drugs, to be accepted.

This growing delay underpins a perception that pharmacovigilance is in crisis. Proposed solutions mention the need for systems to detect rare treatment effects not found in RCTs. There is a turn to a mining of electronic medical records or other observational approaches. New signal detection methods and investigative approaches are always welcome, but these are not the answer to the problems we face, which lie not in a failure to detect rare effects but in a systematic failure to acknowledge common effects.

Through to 1991, clinical knowledge also derived from Drug Trials on healthy volunteers and this is almost self-evidently a better approach than relying on signal detection methods.

The ability of RCTs to focus on one effect suits Regulatory Trials but this focus does not suit an evaluation of treatments, the intention of which is to poison or mutilate in the hope of producing an overall benefit. Studies run on a primary endpoint chosen for commercial reasons cannot be expected to produce the kind of information that might inform therapeutic poisoning. Nor can we know a priori if data-handling methods developed for fertilizers and stars can encompass the complexity of therapeutic poisoning.

The question of whether the suicidality the patient in front of me is experiencing comes from their illness or their treatment is not a matter of deciding if there are 1 or 2 stars. In this case, we already know there are two stars and a lot about them, and one patient may have both kinds. Instruments (checklists) specifically designed with the characteristics of each star in mind may facilitate the distinction between the two, but in practice it’s a case of pattern recognition and a judgement call as to whether increasing or reducing the dose of treatment is more appropriate. The high stakes may make the option of falling back on an operational approach appealing – but it is not good science or good medicine.

If registered on adverse event forms, treatment emergent suicidality or sexual dysfunction should almost de facto be causally linked to treatment. Without clinical context, and the opportunity to dechallenge and rechallenge, faced with a requirement to tick boxes as to the likelihood of a link, the ethos of RCTs, which replaces clinical judgement with decisions based on analytic processes rather than an interrogation of people, steers investigators toward designating the effect as possibly unrelated.

Facing claims in 1983 that spontaneous reporting of adverse events was unsophisticated and not scientifically rigorous, and the only proper method of establishing effects was through trials, Lasagna, once a leading advocate for RCTs, responded that “this was only the case in the dictionary sense of sophisticated meaning “adulterated” and spontaneous reporting was in fact more worldly-wise, knowing, subtle and intellectually appealing than [trials]” (Lasagna 1983).

Implications: Objectivity

A few years later, Lasagna offered the view that:

“In contrast to my role in the 1950s which was trying to convince people to do controlled trials, now I find myself telling people that it’s not the only way to truth… Evidence Based Medicine has become synonymous with RCTs even though such trials invariably fail to tell the physician what he or she wants to know which is which drug is best for Mr Jones or Ms Smith – not what happens to a non-existent average person”.

Concerns about what are often termed the population effects of RCTs, or Average Treatment Effects (ATE), and the mismatch between these and the responses of individual patients have been framed in terms of HTE and recognized in EBM as needing an incorporation of RCT evidence into the judgement of clinicians and the values and preferences of patients. Designating RCTs as offering gold standard evidence, however, effectively side-lines the judgements of clinicians and patients.

The view that RCTs give population or average treatment effects assumes a valid population with individual outliers. In the case of antidepressants, however, there is no knowing how any individual will respond. Fisher expected us to get the same result in every individual case, and within limits confidence intervals offer the same guarantee. Neither Fisher nor Gauss would recognize a problem in translating from a population to an individual level. Diagnostic imprecision, and individual heterogeneity, mean we do not have Gaussian populations. To adapt Shakespeare, the fault lies in our stars not in ourselves.

In recent years, there has been sophisticated consideration of the statistical techniques employed in epidemiological studies, including RCTs (Greenland et al 2017), and of the merits of RCTs applied to complex situations in the social sciences (Deaton and Cartwright 2019). Both considerations have stressed the role of judgement in deciding what populations and experimental design are appropriate, and how results should be interpreted. Both view RCTs, and related designs using statistics, as assay systems yielding results specific to the system, rather than experiments that generate the ‘knowledge from nowhere’ that means we don’t have to worry whether the laws of gravity will apply to the next patient.

These positions are compatible with the argument here, which is that rather than assay systems that might in the right circumstances offer applicable information, RCTs have become algorithmic or operational procedures. DSM criteria in mental health, and the metrics for blood pressure, peak flow rates and bone densities are similarly operational. The creators of the DSM criteria claimed that of course just ostensibly meeting criteria for an illness doesn’t mean the person has the illness, clinical judgements are needed to establish what is really going on, just as they are in the case of blood pressure, peak flow or bone density readings. In practice, however, operational exercises like RCTs, DSM criteria, and many medical metrics nudge us toward a suspension of judgement and put a third party, like the pharmaceutical industry, in a strong position to contest any introduction of judgement by a doctor or patient on the basis that the figures are supposedly more objective than any clinician or patient judgement can be.

Even facing strong epidemiological evidence that a drug causes birth defects or strokes, many clinicians will dismiss these as observational data and be unwilling to adjust practice until an RCT has demonstrated the effect. Industry openly play on clinical difficulties in identifying RCTs as producing observational data.

Science traditionally generates data and challenges us to interpret them. New techniques (like a new drug) can throw up new observations (data) that challenge prior judgements. The application of statistical techniques to data yields outputs, not observations. While these techniques and their outputs can be useful, the mission of science has not been to replace judgement by technical outputs.

Individual judgement of course is suspect. This argument does not advocate replacing collective evaluation by a reliance on individuals or doctors; the argument is for collective evaluation rather than its replacement by algorithmic processes. Collective evaluation has a clear footing in the real world, as the Mayo Clinic streptomycin trial demonstrated. The idea that clinical RCTs as happen now have as clear a footing is assumed, not established.

Arguments favoring RCTs point to a small series of treatments, such as internal mammary ligation, that RCTs demonstrated did not work, with the implication that clinical judgement can get things wrong. The internal mammary ligation trial only happened because the dominant clinical judgement was that this treatment didn’t work – an article in the Reader’s Digest notwithstanding. And randomization didn’t work in this 17-patient trial.

These arguments fail to note that most of the current treatment classes we have were introduced in the 1950s without RCTs, that the treatments introduced then, from anti-hypertensives and hypoglycemics to psychotropic drugs, are more effective than treatments introduced since, and that RCTs facilitate the introduction of treatments with lesser effects.

Our most important failure is our complicity in a sequestration of trial data, fooled perhaps in some instances into thinking that analytic outputs are data. Data means the people entering into a study, who lie behind any table of figures or the outputs of any analytic process applied to those figures. At present, with the exception of a very few RCTs, case reports with names attached are the only form of controlled clinical investigation that offer the possibility of interrogating the data and an opportunity to ground any conclusions in the real world.

Drug interventions (therapeutic poisoning) invariably harm; the hope is that some good can also be brought from their use. Evaluations of a medicine by RCT harm (generate ignorance), but if used judiciously some good can be brought out of the ignorance they necessarily generate. It is less likely that good will be brought out of ignorance, if we rely solely on a data handling formula. Analytic methods can describe data but whether good comes from their use requires the kind of judgement calls that statistical approaches ordinarily make a virtue of side-lining. A recent study looking at 29 ways to analyze a dataset, generated from referees giving red cards to dark and light-skinned soccer players, demonstrated that different techniques can lead to a wide variation in results with none able to guarantee what is happening in the real world (Silberzahn et al. 2018).

Clinical practice is essentially a judicial rather than an algorithmic exercise. The view offered here is that our best evidence as to what happens or is likely to happen on treatment lies in the ability to examine and cross-examine the persons (interrogate the data) given that treatment. What holds true at the individual level must be true at the population level also. The evaluation of a treatment cannot be algorithmic.

An endorsement of clinical judgement does not suit health service managers or the pharmaceutical industry, for whom the supposed generalizability of RCT knowledge and confidence intervals that can be offered for such knowledge are legally appealing.

Implications: The Place for Randomized Controlled Trials

Randomization, placebo controls, confidence intervals and primary endpoints all have a place in the evaluation of treatments. Confidence intervals are clearly appropriate in instances where measurement error is likely to play a part. Randomization is an extra control on clinical bias. There is a place for it, unhooked from primary endpoints and statistical significance, as happens in large pragmatic trials – but here the word pragmatic concedes our limited understanding of what we are doing.

An increasing use of RCTs in social science, economic and political settings makes it clear that complexity is not a necessary bar to their use in trials with an appropriate focus on a primary endpoint. In medicine, the multidimensional nature of therapeutic poisoning adds an extra layer of complexity and makes a focus on a primary endpoint problematic, other than when a claimed benefit is contested.

RCTs may be better suited to evaluate time-limited surgical interventions as opposed to chronic therapeutic poisoning, as well as in studies to evaluate programs, and treatment studies that have an endpoint like all-cause mortality, but even here we risk being misled by findings of no change in mortality into missing a switch from cardiac events to cancers when many patients might prefer to die by heart attack (Mangin et al 2007).

RCTs also have a merit as a gateway to the market; randomization means that trials require fewer patients and can be run quickly. A positive result in commercial trials may indicate a compound has an “effect”. Trials aimed at establishing effectiveness, in contrast, require hard outcomes and time. This is not a realistic gateway to the market. Demonstration of an effect, as with SSRIs for depression, means it is not correct to say this drug does nothing and on this basis entry to the market could be permitted, although strictly speaking this is inconsistent with current statutes.

After 1962, RCTs became the standard through which industry would make gold. As they proliferated, the mantra that they provide gold standard medical evidence took hold. The ignorance of ignorance in claims that the only valid information on medicines comes from RCTs compounds a series of other factors that make RCTs a gold standard way to hide adverse events and encourage over-use of treatments.

The launch of a drug licensed on the basis of a treatment effect should be the point when more comprehensive clinical evaluations start, aimed at generating consensus as to the place of the drug in practice. As a general tool to evaluate the effects of a drug, Regulatory Trials should take second place both to the observations of a group of experienced clinicians, unconstrained by checklists and by an investigation tailored to one effect, and to the values of patients, who increasingly need to reduce their medication burden to achieve optimal benefits.

In addition, seasoned clinicians, allied to increasingly health-literate patients, are better placed than RCTs to determine cause in the case of the 99 other effects every drug has, especially effects such as sexual or suicidal effects of antidepressants, which need to be distinguished from superficially similar condition effects.

The fact that pharmaceutical companies run “RCTs” for regulatory and marketing purposes may have generated a belief that any problems with RCTs stem from a link to commerce.

The difficulty in recognizing adverse effects has, for instance, been compounded by company sequestration of trial data and by ghostwriting of the clinical literature that hypes the benefits and hides the harms of treatments. It has been compounded further by a regulatory reluctance to place warnings on drugs, for fear of deterring patients from treatment benefits.

Clinical practice is also compromised by licensing indications and by guidelines. There are no drugs licensed to treat adverse effects. When a person becomes suicidal on an SSRI, there is no treatment licensed to treat this toxicity. Clinicians wanting to help feel compelled to diagnose depression rather than toxicity but a depression diagnosis inevitably leads to a further treatment with an antidepressant rather than something more appropriate like a benzodiazepine, a beta-blocker, or red wine.

The incorporation of RCTs into the regulatory apparatus has introduced surrogate markers, which mean that in real life treatments may not show effectiveness consistent with RCT demonstrations of a treatment effect. Trials showing antidepressants work, for instance, have more deaths and more suicidal and homicidal events in their treatment arms than in their placebo arms.

Commercial trials have given rise to the idea of an abstract Risk-Benefit ratio which, along with treatment effect sizes and the Number Needed to Treat (NNT) or to Harm (NNH), is based on the outputs of analytic processes rather than on clinical reality.
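To make concrete how far removed these figures are from any individual patient, here is a minimal sketch of the arithmetic conventionally used to derive an NNT or NNH from two trial arms. The function name and every number in it are hypothetical, chosen only for illustration, and do not come from any trial discussed here.

```python
from fractions import Fraction
from math import ceil


def number_needed(events_a, n_a, events_b, n_b):
    """Return ceil(1 / (rate_a - rate_b)), or None if rate_a <= rate_b.

    For an NNT, arm A is the drug arm and the events are responders.
    For an NNH, arm A is the drug arm and the events are adverse events.
    Exact fractions are used so the rounding is not distorted by floating point.
    """
    diff = Fraction(events_a, n_a) - Fraction(events_b, n_b)
    if diff <= 0:
        return None  # no excess of events in arm A, so the figure is undefined
    return ceil(1 / diff)


# Hypothetical figures, for illustration only.
nnt = number_needed(40, 100, 30, 100)  # responders: 40/100 on drug, 30/100 on placebo
nnh = number_needed(5, 100, 3, 100)    # adverse events: 5/100 on drug, 3/100 on placebo
print(nnt, nnh)
```

Both numbers are simply the reciprocal of a difference between two arm-level averages. Nothing in the calculation says which patients benefit or are harmed, or whether the benefit and the harm fall on the same people.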

Possible answers to these problems lie with medical journals, which should insist on the publication of data from Drug and Treatment Trials. Our hierarchies of evidence should come clean on whether they regard a ghostwritten article without access to trial data as better than or inferior to a Case Report that embodies dose responsiveness and CDR elements. And those deploying an analytic process should clarify how the resulting outputs might translate into the real world, rather than assuming they do.

It is not unreasonable to want to discard the industry bathwater but save the RCT baby. But doing so requires an explicit recognition that industry activities avail of an epistemological gap between the conduct of RCT assays and a consideration of the implication of their results rather than constitute the gap.

Coda

Evaluating treatment effects properly is hugely important. When drugs work, they can, like parachutes, save lives. Given the importance of the task, the notion of a hierarchy of evidence topped by mechanisms that do the deciding for us has a potent allure.

Relegating judgement to the bottom of the evidence hierarchy in medicine brings out our discomfort with judgement. Succumbing to an operational solution, however, is at least as dangerous as depending on judgement.

RCTs have led many to view drug treatments as comparable in effectiveness to parachutes. As a result, by the age of 50, close to 50% of us are now on three or more drugs and by the age of 65 on 5 or more drugs. For the past five years, our life expectancies have been falling and admissions to hospital for treatment-induced morbidity rising, an outcome that contrasts with the added safety of having parachutes and other gadgets in planes (Healy 2020). Adding parachutes and gadgets that are effective (rather than just have an effect) enhances aviation safety, although recent Boeing crashes point to the perils of too great a reliance on automatic decision tools. Combining five pluripotent drug gadgets almost certainly brings risks of interactions that airplane gadgets don’t bring, and current data indicate that reducing medication burden from 10 or more drugs to 5 or fewer reduces hospitalization, increases life expectancy and improves quality of life (Garfinkel and Mangin 2010). But if RCTs of medicines essentially produce evidence that it is not correct to say a drug has no possible benefit, rather than evidence that the drug is effective, our methods of evaluation, rather than just the chemicals we prescribe, may be contributing to increasing levels of mortality and morbidity.

Recent data on life expectancies and treatment-linked morbidities call for an evaluation of the role of RCTs in the evaluation of drug treatments (Healy 2020). So do data indicating antidepressants are now the second most commonly used drugs by young women, in the face of 30 out of 30 trials negative on the primary outcome. Advocates of RCTs, with no links to industry, claim to be able to meta-analyze these trials and extract positive effects from data taken from ghostwritten publications, without access to the trial data.

References:

Deaton A, Cartwright N. Understanding and misunderstanding randomised controlled trials. Social Science and Medicine 2018, 210, 2-21.

Garfinkel D, Mangin D. Feasibility study of a systematic approach for discontinuation of multiple medications in older adults. Arch Intern Med 2010, 170 1648-54.

Greenland S, Senn SJ, Rothman K, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol, 2016, 31:337–350.

Healy D, The Shipwreck of the Singular; Healthcare’s Castaways. Samizdat Press, Toronto (Forthcoming 2020).

Hill A.B. Reflections on the Controlled Trial. Annals Rheum Disease 1966; 25, 107-113.

Kravitz RL, Duan N, Braslow J. Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. Milbank Quarterly 2004; 82; 661-687.

Lasagna L. Discovering adverse drug reactions. JAMA 1983; 249: 2224-5

Mangin D, Sweeney K, Heath I. Preventive health care in elderly people needs rethinking. BMJ 2007, 335 285-287.

Silberzahn R, Uhlmann EL, Martin DP, et al. Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results. Advances in Methods and Practices in Psychological Science. 2018; 1:337-56.

Explanatory Note

An earlier version of this article was sent out to over 20 senior figures with expertise on areas to do with RCTs and Evidence, chosen mostly by Mark Wilson. Their responses are reproduced in Fawlty Stars.

Publication of the target piece and responses happens in some journals but throwing the argument open to anyone who wishes to contribute is a unique venture in peer review.

The argument lies at the heart of Chapter 6 of Shipwreck of the Singular which will be published in January 2021. The extra details in Shipwreck may help with getting to grips with what is going on.

Comments

annie says

December 14, 2020 at 1:48 pm

This links in my opinion into issues Jean-François has raised that I have not addressed here but hope to in my book Shipwreck of the Singular when it comes out. Neo-liberalism is essentially one-dimensional. It substitutes operational procedures for judgement. Having said this, it may now be clear to some that my essay (the revised one more obviously) is as much about neo-liberalism as it is about clinical trials.

David Healy September 28th

Prof. Peter Gøtzsche

Fawlty Stars

Casting RCTs as offering gold standard evidence about a drug creates an ignorance about the ignorance they generate. *very good point*

In the case of airplanes, adding parachutes and other interventions that are effective (rather than just have an effect) enhances safety, although recent Boeing plane crashes point to the perils of too great a reliance on automatic decision tools. *Please drop this. Airlines are very safe, much safer than any other means of transport*

In March 2019, the Boeing 737 MAX passenger airliner was grounded worldwide after 346 people died in two crashes, Lion Air Flight 610 on October 29, 2018 and Ethiopian Airlines Flight 302 on March 10, 2019. Ethiopian Airlines immediately grounded its remaining MAX fleet.

Cause: Airworthiness revoked after recurring flight control failure

Date: Lion Air accident: October 29, 2018, Ethiopian Airlines accident: March 10, 2019,
First grounding: March 10, 2019 by Ethiopian Airlines,
First nationwide grounding: March 11, 2019 by the Civil Aviation Administration of China (CAAC),
Effectively a worldwide grounding: March 13, 2019 by the Federal Aviation Administration (FAA),
Cleared for return to service: November 18, 2020 by the FAA

Deaths: 346 (189 on Lion Air Flight 610, 157 on Ethiopian Airlines Flight 302)

Duration: between accidents: 4 months and 10 days, of grounding by the FAA: 1 year, 8 months and 5 days (619 days)

https://www.theverge.com/2019/3/22/18275736/boeing-737-max-plane-crashes-grounded-problems-info-details-explained-reasons

Boeing’s safety analysis of the system assumed that “the pilots would recognize what was happening as a runaway and cut off the switches,” said the engineer. “The assumptions in here are incorrect. The human factors were not properly evaluated.”

Flawed analysis, failed oversight: How Boeing, FAA certified the suspect 737 MAX flight control system

https://www.seattletimes.com/business/boeing-aerospace/failed-certification-faa-missed-safety-issues-in-the-737-max-system-implicated-in-the-lion-air-crash/

Peter Lemme, a former Boeing flight controls engineer who is now an avionics and satellite-communications consultant, said that because MCAS reset each time it was used, “it effectively has unlimited authority.”

Facing legal actions brought by the families of those killed, Boeing will have to explain why those fixes were not part of the original system design. And the FAA will have to defend its certification of the system as safe.

The Boeing 737 MAX plane is, in my opinion, the perfect analogy

Fresh from the Cockpit –
“This is the Captain speaking”

Patrick D Hahn
@PatrickDHahn

SmithKline’s own studies showed nearly seven times as many suicide attempts per patient on #Paxil compared to #placebo.

Healthcare’s Castaways and GSK Paroxetine, take note…

susanne says

December 14, 2020 at 6:49 pm

There is so much to think about here. For now, I found this article helpful. RCTs seem to hark back to the Victorian obsession with classification, initially of plants, which has contaminated the thinking about people. The classification was fairly harmless when we could all recognise someone as, say, introvert or extravert, but with greater and greater use of technology the perception of a person has become more and more dehumanising. More and more, classification has become an insult to those on the receiving end of a ‘consultation’. Some aspects of RCTs can seem ridiculous, using little more than guesses in scientific garb, e.g. confidence levels.
But it seems that RCTs do have a use along with other methods for research. I would be doubtful, though, about relying on the morality of groups of clinicians getting together to investigate ‘cases’ without some sort of oversight. As soon as groups develop, the same old corrupt behaviour is likely. Case reports can be skewed. How could a ‘patient’ have an independent input to verify or contribute equally to what is being documented?

American Psychological Association, Monitor on Psychology

More than one way to measure
Randomized clinical trials have their place, but critics argue that researchers would get better results if they also embraced other methodologies.

By Rebecca A. Clay

September 2010, Vol 41, No. 8

Print version: page 52

Ben A. Williams, PhD, came by his distrust of randomized controlled trials (RCTs) the hard way: He developed a kind of brain cancer with no proven treatment.

There had been randomized trials of various approaches, but they were all failures, says Williams, an emeritus psychology professor at the University of California at San Diego. And although several drugs had helped a small percentage of patients in Phase II trials, he says, it can be hard to get hold of therapies not yet vetted by Phase III trials.

“Medicine was basically saying if it isn’t done this way, it doesn’t count,” says Williams, describing the difficulties his physicians had in gaining access to therapies that probably wouldn’t help him, but might. “The problem is the one-size-fits-all mentality.”

Like Williams, many other psychologists — as well as medical researchers — question the assumption by the National Institutes of Health, the Food and Drug Administration and others that RCTs should be the gold standard for clinical research. While the methodology — which involves randomly assigning participants to either a treatment or control group — does have its strengths, they say, it also has serious limitations that are often overlooked or ignored.

Because trial participants typically don’t represent the population as a whole, for example, results from RCTs may not apply more generally. And even if they did, it’s impossible to tell from an RCT which subset of participants actually benefited from the intervention being studied.

These critics don’t want to reject RCTs altogether. Rather, they want to supplement their findings with evidence from other methodologies, such as epidemiological studies, single-case experiments, the use of historical controls or just plain clinical experience.

Strengths and weaknesses

No one denies that RCTs have their strengths.

“Randomized trials do two things that are very rare among other designs,” says William R. Shadish, PhD, a professor of psychological science at the University of California at Merced. “They yield an estimate of the effect that is unbiased and consistent.” Although Shadish is reluctant to describe RCTs as the gold standard because the phrase connotes perfection, he does describe himself as a “huge fan” of the methodology.

“If you can do a randomized trial,” he says, “by all means do it.”
But that’s not always possible. By their very nature, he says, some questions don’t permit random assignment of participants. Doing so might be unethical, for example.

Even when RCTs are feasible, they may not provide the answers researchers are looking for.

“All RCTs do is show that what you’re dealing with is not snake oil,” says Williams. “They don’t tell you the critical information you need, which is which patients are going to benefit from the treatment.”

To account for heterogeneity among participants, he explains, RCTs must be quite large to achieve statistical significance. What researchers end up with, he says, is the “central tendencies” of a very large number of people — a measure that’s “not going to be representative of much of anybody if you look at them as individuals.”

Move beyond the context of an RCT itself, and the applicability of the results to individual patients becomes even more problematic.

For one thing, participants in RCTs tend to be a “pretty rarefied population” that isn’t representative of the real-world population an intervention would eventually target, says Steven J. Breckler, PhD, executive director of APA’s Science Directorate.

“Think about the people who show up for drug trials — patients who have probably tried everything else and are desperate for some kind of treatment,” he says, adding that they are further winnowed down as researchers eliminate would-be participants with co-morbid conditions and the like. “Are the results of that trial going to generalize to you and me? Or do we come from a population of people who would never have enrolled in a trial to begin with?”

Experiments, says Breckler, typically involve a trade-off between internal validity — the ability to trace causal inferences to the intervention — and external validity — the generalizability of the results.

“What people seem to fail to recognize is that the perfect RCT is designed strictly with internal validity in mind,” he says.

RCTs may be especially ill-suited to psychological interventions versus medical ones, adds Breckler. In contrast to medications that have a straightforward biochemical effect that’s unlikely to vary across individuals, he says, psychological interventions tend to interact with such factors as gender, age and educational level.

Supplementing RCTs

No one suggests that researchers give up RCTs. Instead, they urge the supplementation of RCTs with other forms of evidence.

“Evidence-based practice should rely on a very broad, diverse base of evidence,” says Breckler. “RCTs would be one source, but there are lots of other sources.” These sources could include Phase II trial data, epidemiological data, qualitative data and reports from the field from clinicians using an intervention, say Breckler and others.

Williams champions the use of historical controls as a supplemental source of information.

In this methodology, researchers examine the results of earlier, nonrandomized trials to establish a crude baseline. They then compare the results of subsequent nonrandomized trials to that benchmark.

The approach works, says Williams, adding that the process allows many interventions to be tested in quick succession. Faced with the failures of RCTs for glioblastoma treatment, for example, researchers turned to the historical record and found that only 15 percent of those with the cancer had no disease progression six months after treatment began.

“They found that if you add this thing to the standard treatment, you can push that number up to 25 percent and add two things and push it up to 35 percent,” he says. “It’s a crude comparison, no doubt, but it turns out to be an effective way of doing the research.”

The FDA agreed, approving a drug for treatment of glioblastoma not on the basis of an RCT but on multiple Phase II trials whose results were better than the historical norm.

Single-case experiments are another important source of evidence, says Alan E. Kazdin, PhD, a past president of APA and professor of psychology and child psychiatry at Yale. In contrast to RCTs, which involve many subjects and few observations, single-case designs involve many observations but often few subjects. Instead of simply doing a pre- and postassessment, the researcher assesses behavior — of an individual, a classroom, even an entire school — over time.

Say a patient has a tic, says Kazdin. In a single-case design, the researcher would observe the patient and establish the number of tics per hour. The researcher would then conduct an intervention and watch what happens over time.

“If you just do an assessment before some treatment and an assessment after treatment and compare the group that got it to the group that did not, you lose the richness of the change on a day-to-day, week-to-week, month-to-month basis,” says Kazdin, emphasizing that single-case designs are not mere case studies.

For Kazdin, overreliance on RCTs means missing out on all sorts of valuable information. Think of the nation’s telescope program, he says. The Hubble telescope looks at visible light. Another telescope looks at X-rays. Another handles gamma rays.

“The method that you use to study something can influence the results you get,” says Kazdin. “Because of that, you always want to use as many different methods as you can.”

Rebecca A. Clay is a writer in Washington, D.C.

Further reading
Kazdin, A.E. (2010). Single-Case Research Designs: Methods for Clinical and Applied Settings, 2nd edition. New York: Oxford University Press.
Shadish, W.R., Clark, M.H., & Steiner, P.M. (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103, 484, 1334–1356.
Shadish, W.R., Cook, T.D., & Campbell, D.T. (2001). Experimental and Quasi-Experimental Designs for Generalized Causal Inference, 2nd ed. Florence, KY: Wadsworth.
Williams, B.A. (2010). Perils of evidence-based medicine. Perspectives on Biology and Medicine, 53, 1, 106–120.

Peter Groot says

December 15, 2020 at 12:40 pm

Thanks David, for this article.

RCTs can be a valuable and important research instrument, but they are not necessarily the only and not always the best way to conduct research, especially not in mental health research.

The correct interpretation of RCTs requires very careful consideration, as does determining whether other forms of research may be more appropriate and yield more.

In part 1, “the right size”, of a presentation I gave, entitled “Tapering medication (tapering strips) as a necessary tool for a meaningful conversation in the doctors office”, I have tried to make this clear using an RCT involving clogs for medical personnel as an analogy.

The presentation can be viewed here: https://bit.ly/384fcoQ.

Peter Groot,
User Research Centre NL
University Medical Center Utrecht

chris says

December 17, 2020 at 11:36 am

UK psychiatrists are ripping people who have been on neuroleptics for more than a year off them in two weeks. This happened to me: 400mg quetiapine, ordered by phone to come off in two weeks, and they contacted my GP to order him to supply no more tablets after two weeks. They don’t give a damn about a correct slow taper.

- mary H says
  
  December 19, 2020 at 11:28 am
  
  Good grief Chris, this is so frightening. I am surprised that you have lived to tell the tale. My son was on 400mg of Quetiapine daily – took him over three years of sheer hell to reduce it to 125mg daily, which is where he remains at present.
  
  - chris says
    
    December 25, 2020 at 6:47 am
    
    Sorry, my mistake, it was 100mg. I wish I had recorded the phone call. Wasn’t going to say no to coming off that vile drug. Lucky that I had a large quantity of 25mg tablets to taper off over three months.
    
    The general public have no idea where all this virus business is going but they are worried by design. The technocracy madness wants to replace money with energy and social scientists (psychologists/psychiatrists) replacing left right politics to control us all, and what do you know – SAGE are social scientists.
    
    “Cameron’s presidential address to the American Psychiatric Association in 1953 suggests his involvement in the Cold War and his concerns about communism. Although he also used the opportunity to express his concerns about McCarthyism, Cameron held to a now familiar position — our best hope for a new world order and without hysteria, one without the totalitarianism of either the right or left, lies in science. With behavioural scientists as leaders, order would emerge from chaos. Were these attitudes a factor in his determination to change behaviour? It seems likely.”
    
    “our best hope for a new world order and without hysteria, one without the totalitarianism of either the right or left, lies in science. With behavioural scientists as leaders, order would emerge from chaos.”
    
    Donald Ewen Cameron torturer
    
    https://www.serendipity.li/cia/c99.html
    
    Could this be a start to fight back:
    
    https://drive.google.com/file/d/17X4GmMXn_m-vDwqEy9vMhbNqzodAEW3b/view
    
    All the best to your son, I’ve had more than a taste of what he – and yourself – have been through.
    
annie says

December 30, 2020 at 6:05 am

In his letter to the MHRA, Mr Osborne said: ‘It was quite apparent from the evidence that she had a psychotic reaction as a result of taking the drug [doxycycline] and yet there is nothing on the drug information leaflet that either highlights or mentions this possibility. The information sent out with the drug should be reviewed to prevent future deaths.’

The MHRA is now probing the drug’s safety.

Why common antibiotics may trigger mental breakdowns: Coroner ruled that a malaria drug could be to blame for this student’s death… and there’s worrying evidence hers is not an isolated case

By PAT HAGAN FOR THE DAILY MAIL

PUBLISHED: 01:29, 29 December 2020 | UPDATED: 02:39, 29 December 2020

https://www.dailymail.co.uk/health/article-9093529/Why-common-antibiotics-trigger-mental-breakdowns.html

In the meantime, could many more patients be suffering, not realising it may be due to their antibiotics? Disturbing research suggests this may be the case.

Scientists at Augusta University in Georgia, U.S., carried out one of the largest studies into the psychiatric side-effects of antibiotics such as doxycycline.

They trawled through eight years’ worth of data from the U.S. Food and Drug Administration’s Adverse Event Reporting System — a catalogue of potentially harmful drug reactions reported by doctors and patients.

Professor David Healy, a psychiatrist who was consulted in the Alana Cutland case, first raised concerns about potential harmful effects of doxycycline in 2013, when he was a professor of psychiatry at Bangor University.

Now based at McMaster University in Ontario, Canada, Professor Healy says: ‘I know four or five people personally who have been on doxycycline and felt very anxious as a result. In all cases, the symptoms disappeared as soon as they stopped taking it.

‘In Alana Cutland’s case, it was an extreme effect. Most doctors think doxycycline is benign, but it may simply be the wrong drug for some people. The drug should carry a carefully worded warning to let people know the risks and that they should stop taking it immediately if they experience a mental health problem. It could save lives.’

Citizen Science …

The medical breakthroughs helped by ordinary people. This week: Clinical trials
