How do we learn about improvement?

Danika Barry

Healthcare Improvement Fellow, USAID ASSIST Project/URC

M. Rashad Massoud

Director, USAID ASSIST Project/URC

Commentary on the Quality & Performance Institute's Technical Meeting held on December 17, 2014. A full transcript is available here.

For our December Quality & Performance Institute Technical Meeting, we invited Dr. Frank Davidoff and other thought leaders in the field of improvement science to comment on the issues raised in Davidoff’s recent article, “Improvement interventions are social treatments, not pills.”

Davidoff comments on the limitations of the “gold standard” randomized controlled trial study design, meticulously followed by Goldman et al. in an intervention to reduce emergency department visits and readmissions among elderly patients in an ethnically and linguistically diverse setting in Northern California. Davidoff writes, “The study method is a thing of beauty; its beauty, unfortunately, is also its curse.” As Davidoff commented in our session, “the beautifully designed, protocol-driven study tells us that the intervention didn’t work, but it didn’t tell us what improvement can possibly do to change things.” The researchers’ rigid adherence to the study protocol, while required for so-called “P value-based statistical inference,” also prevented them from taking actions to increase the success of the intervention. This is the realm of improvement science, which promotes continuous learning and adaptation of the intervention to address social contexts and new knowledge gained during implementation. As the moderator, Dr. Rashad Massoud, pointed out, this is the type of work we do with the USAID ASSIST Project. However, using multiple, iterative interventions makes it hard to demonstrate convincingly that an improvement is truly due to the intervention.

Davidoff identified two possible solutions. The first is to conduct mixed-methods analyses, which combine protocol-driven studies with qualitative research that can help us understand why an intervention works and, if it does not, how it might be improved. The second is to switch to different kinds of statistical analyses, namely those that recognize that improvement and healthcare delivery are time-dependent processes. Time-sensitive statistical analyses like statistical process control can help detect meaningful changes in outcomes as a result of an intervention. However, both of these solutions are limited in their ability to support causal inference, and neither is particularly suited to controlling for confounders or bias in the way that protocol-driven study designs can.
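To make the idea of statistical process control concrete, here is a minimal sketch of one common form, the individuals (XmR) control chart. It is an illustration only, not a method discussed by the panel: the monthly values and the function names `xmr_limits` and `signals` are hypothetical. Limits are set from baseline data as the mean plus or minus 2.66 times the average moving range, and later points outside those limits are flagged as possible signals of real change.

```python
from statistics import mean


def xmr_limits(baseline):
    """Return (center, lower, upper) control limits for an individuals chart."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline points")
    center = mean(baseline)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    avg_mr = mean(moving_ranges)
    # 2.66 = 3 / 1.128, the standard constant for moving ranges of size 2.
    return center, center - 2.66 * avg_mr, center + 2.66 * avg_mr


def signals(values, lower, upper):
    """Indices of observations falling outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical monthly indicator values (percent of patients meeting a standard).
baseline = [52, 48, 50, 53, 47, 51, 49, 50]
after = [55, 58, 62, 66, 70]

center, lcl, ucl = xmr_limits(baseline)
flagged = signals(after, lcl, ucl)
print(f"center={center:.1f}, limits=({lcl:.2f}, {ucl:.2f}), flagged={flagged}")
```

A point outside the limits suggests that something changed, not what changed; attributing the shift to the intervention still depends on the design considerations discussed here.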

Dr. Tom Bossert noted these limitations, and emphasized the need to integrate elements like control groups or randomization to answer fundamental questions about causality. Difference-in-Differences analyses between a control and an intervention group, as well as innovative randomization, like that used by Dr. Gary King, who randomized matched pairs of districts in an evaluation of a major health reform in Mexico, can help strengthen causal inference. Bossert also stressed the importance of real-time data so that implementers can make improvements. While time-series run charts are one way of using these data, a common pitfall is collecting too little baseline data, which can artificially inflate the perceived intervention effect because the initial outcome is recorded as near zero.
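The Difference-in-Differences idea can be sketched in a few lines. This is a minimal illustration with hypothetical site-level values, not an analysis from the meeting, and the function name `diff_in_diff` is our own. The estimate is the change in the intervention group minus the change in the control group, and it is only credible under the assumption that the two groups would have followed parallel trends without the intervention.

```python
from statistics import mean


def diff_in_diff(ctrl_before, ctrl_after, intv_before, intv_after):
    """Change in the intervention group minus change in the control group."""
    ctrl_change = mean(ctrl_after) - mean(ctrl_before)
    intv_change = mean(intv_after) - mean(intv_before)
    return intv_change - ctrl_change


# Hypothetical indicator values (percent) for three sites per group.
ctrl_before = [40, 42, 41]
ctrl_after = [45, 46, 47]
intv_before = [39, 41, 40]
intv_after = [58, 60, 62]

effect = diff_in_diff(ctrl_before, ctrl_after, intv_before, intv_after)
print(f"Estimated effect: {effect:.1f} percentage points")
```

In this made-up example the control sites improve by 5 points on their own, so only 15 of the 20-point gain in the intervention sites is attributed to the intervention. A simple before-and-after comparison in the intervention sites alone would overstate the effect.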

Dr. John Øvretveit emphasized the need to match methods to the study aims, and to the time and money available, saying, “the controlled trial is the best of designs and the worst of designs, depending on who it is for, and which decisions it is meant to inform.” He further commented, “the more we move towards complex social interventions, the more we have got to reinvent them.” We must question, as Cindy Brach did in her paper, “Will it Work Here?” Especially for context-sensitive interventions like those covered in the Goldman article, Øvretveit commented that we are implementing principles, concepts or ideas that must be adapted to the local context. Similarly, as argued in an article shared by Davidoff after the meeting, complex interventions require that intervention integrity be defined by the intervention’s function rather than its particular form, in order to address variation in local contexts while preserving the element of standardization required by randomized controlled trials or other rigorous designs. Process or pathway analysis or PDSA cycles, which Davidoff commented are essentially “social tools,” can be useful to inform the adaptation process. Øvretveit further indicated there is a lot to be learned from hybrid studies in other fields, including the logframe model, action evaluations or case study approaches used in program evaluation, implementation science, and welfare service evaluations in public health.

Dr. Edward Broughton pointed out a similarity between controlled trials and improvement interventions: in RCTs we do not control for every aspect of someone’s life, just one particular element. Similarly, in interventions for improvement, we do not control every part of the intervention, just the principles that need to be applied. In a similar way, the “seething dichotomy” between P value-based statistics and run chart analyses belies the fact that elements of both can actually be combined. This point was expanded upon in Broughton and Bossert’s response to a question from Esther Karamagi, Senior Quality Improvement Advisor in the USAID ASSIST Project office in Uganda. Karamagi asked about the role of controls in day-to-day decisions on how one change leads to improvement. Broughton acknowledged the Project’s constraints in collecting control data, as we cannot collect data from facilities that we are not working in. Therefore, we must settle on a somewhat lower or more practical level of evidence to go forward with an intervention. However, Bossert added that there may be ways to strengthen the evidence by identifying controls among other project sites, as long as the patients have similar characteristics.

In conclusion, the panelists commented that hybrid approaches can be a powerful way to triangulate evidence for causal inference. Additionally, the panelists emphasized the need to advocate to donors to provide the resources and incentives to perform this sort of work in the field contexts in which we operate.


I really liked this idea of the PDSA as a social tool. I’d love to hear Frank explain what he means by that more.

A former Health Foundation Quality Improvement Scholar, Julie Reed, and her group in London published a recent paper that examined how improvers actually used PDSA cycles, and discovered a great deal of variation, and many gaps, in the application of this tool. The reference is: Taylor MJ, et al. Systematic review of the application of the plan-do-study-act method to improve quality in healthcare. BMJ Qual Saf 2014;23:290-8.

To follow up on those observations, Reed and her colleagues have now interviewed over 75 front-line clinicians and improvement researchers (in the UK and US) to find out how they actually use PDSA cycles, and presented their preliminary findings at the IHI-sponsored improvement research meeting in Orlando earlier this month. They found that these cycles served to focus collaboration among members of care teams at multiple organizational levels, for example: introduction of a discharge checklist, at the level of individual and multiple hospital wards or units; formulation of policies and procedures on completion of paperwork to be required on every patient, at the level of nursing administration; use of consultants to increase enterprise-wide priority of patient safety and efficiency, at the organization and leadership level; and prioritization of daily team huddles, at the level of support services such as radiology.

The Goldman et al. article that Frank Davidoff commented on demonstrates how the needs of the elderly will not be met with RCTs, or solely within the realm of internal medicine. These are complex social interventions that need improvement. However, our work and IHI’s are a small part of the landscape. I have two questions directed to Frank Davidoff:

1. What has been the reaction to your article from the internal medicine community?
2. How can we work with those in internal medicine to have improvement work recognized and funded?

So far the journal has gotten only two written responses to my editorial from readers. One noted that it is possible to “balance” the strengths and weaknesses of two different study methods by applying more than one study method in evaluating a specific improvement intervention – as the reader’s group is actually doing. This “double-barreled” approach was suggested years ago by Campbell and Stanley in their monograph on experimental and quasi-experimental study methods – and I agree it’s reasonable and can be feasible, although it can demand a lot more time, effort, and funding.

The other reader commented on the frustrating unwillingness of many people in medical academia to accept the legitimacy and value of improvement work. I share that reader’s frustration, but would argue that the only legitimate way for the improvement community to overcome that resistance is by making it clear that improvement can be a true scholarly discipline, not just a “craft.” This will require continuing to improve its study methods, tying them as closely as possible to the underlying principles of scientific thought, and promoting transparency, precision, and accuracy in reporting its work. (A good example of such efforts is the recent development of a publication guideline for reporting on exactly what an improvement intervention consists of: Hoffmann TC, et al. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ 2014;348:g1687.) Becoming accepted as a scholarly discipline won’t be either easy or quick, but it’ll be worth the effort.

In our programme in India we had to start quickly and at relatively large scale, so we didn’t have time to organize a good baseline or comparison group. To try to answer the question of whether improvements are due to our work, we are doing comparisons between the sites that we support. All of our sites receive regular improvement coaching visits from us and attend regular peer-to-peer learning sessions, and they all have a few common indicators that they report to us and the government on a monthly basis. But they don’t all have the same improvement goal. For example, about half are working on improving antenatal care (ANC); the rest are not. So we are comparing time series data on ANC results in the teams that have told us they are trying to improve ANC with those in the teams that are not.

The main reason for doing this is that the government and other stakeholders here see monitoring as the best method to improve care, so we are comparing monitoring (sites sending data to the government) with monitoring plus active improvement work.
What are the panel’s thoughts on this approach?

First, for Nigel, and again this may not be possible, but it is important to stress to the funder and partners the importance of baseline data to avoid some of the problems Nigel mentions. Sometimes it is possible to start and then only look at trends in data but that depends on how good the data is and the type of intervention. I would keep the two objectives separate unless there is reason to believe that they are similar enough not to influence the effects of the ASSIST activities.

Thanks for the question, Nigel. The kind of comparison that you note is an excellent example of using the circumstances that exist to design the most credible evaluation strategy possible. Assuming that the two groups – one implementing ANC improvements and the other implementing other improvements – are either well matched on important characteristics or that those characteristics are being measured so they can be controlled for in statistical analysis, then I think the evaluation design should yield quite robust results. A nice aspect of the design is that you can work on, or at least be aware of, the quality of the ANC indicator data in sites that are NOT implementing ANC-related changes even though those facilities are working on other improvement priorities. Some may criticize the fact that facilities are self-selecting to work on particular improvements, in this case ANC, and this may introduce confounders because of the non-random selection into groups. Those implementing and evaluating the intervention should be aware of such criticisms, but they shouldn’t allow implementation or evaluation to be hobbled by them.

I don’t have much to add to Edward’s elegant response, except perhaps to re-emphasize what I said in the panel discussion about the importance of replication. Consistency of improved performance across multiple sites that are working specifically to improve ANC (assuming that you have a reasonably accurate measure of performance level) will be a helpful indicator that that approach is effective; spotty and inconsistent improvement in performance in the ANC-focused sites (i.e., improvement isn’t reliably replicated across those sites) will argue against a meaningful effect of that approach (i.e., the apparent improvement may be due to chance).

The editorial by Dr Davidoff argues for a progressive “adaptation” of the intervention through cycles of changes to make it more specific to certain features of the context and thus arguably enhance its capacity to result in improvement. One problem with this may be that the “adapted intervention” becomes so specific to a given context that it loses its capacity to be generalized. Each different context may pose different needs and directions for “adaptation” of the intervention, rendering it “case specific” and not scalable as such. Please comment on this dilemma.

For Jorge’s first question, I think context is always important and as much should be incorporated in the analysis as possible. It takes considerable judgment to decide if the adaptations are related to the context or are generalizable. There is no “cookie cutter” response to this dilemma other than to make the arguments for why you think it is generalizable or specific to a context. Information on specific contexts is also important as there may be similar contexts in other situations.

During the panel discussion, John Øvretveit mentioned that some evaluations may have, as the intervention they are assessing, application of principles of improvement rather than a well-defined specific package of changes that a facility or community must implement with fidelity. I agree with this idea and think that, when we describe an improvement intervention in the context of research or rigorously designed evaluation with the thought of being able to make broader generalizations, we should state whether we are testing a “change package” implementation OR application of particular improvement principles (deliberately identifying and prioritizing problems, implementing and testing context-specific changes and applying successful ones or testing different changes if unsuccessful). This is no different from well-accepted studies of the effectiveness of different approaches to treatment of some diseases. For example, there are several studies that examine surgical versus conservative, non-surgical approaches to the treatment of specific forms of cancer that I think are analogous.

This dilemma harkens back to the Greek philosopher who struggled with the problem of knowing whether a ship is still the “same” as the original ship once it’s been repaired so many times that it no longer contains any of its original parts. More recently this has been recognized as a potential problem in “mass customization,” in which an original intervention is deliberately modified or adapted to local circumstances, which is often essential in getting it to be used – and to be effective – locally.

This is certainly a challenging problem, but not an insoluble one. (According to the researchers, the now well-known complex, multi-component intervention to reduce the infection of central lines in patients in 103 ICUs in Michigan involved the creation of “103 different checklists” by and for the individual ICUs. But, for various reasons, that checklist heterogeneity didn’t seem to interfere with the intervention, which was dramatically successful anyway.) The key, I suppose, is identifying the true “active ingredients” of the intervention, and making sure those are still present in the adapted version. Without that, improvement work is likely to result in “cargo cult” science, in which later “adapted” versions of an intervention resemble an original (effective) one only superficially, and (not surprisingly) therefore don’t work.

On the question of “attribution” of an observed improvement to a given tested intervention, I have noticed that together with the “intervention” itself there exists an agent external to the facility or organization. This external agent promotes, facilitates and sometimes helps implement such intervention. In our case, this agent is the ASSIST Project and local staff. However, it is frequently forgotten that this external agent is in fact an important part of the “intervention”, inseparable from it. Any observed effect would be then attributed both to the intervention and to the external agent that supported its implementation, but frequently only the “intervention” is considered, the external agent becomes invisible, leading to problems of replicability when the agent is not there anymore. Have you seen this in other cases? How to “control” for it?

On Jorge’s second question: the “halo effect” is a major problem in any case, but either the support of ASSIST should be seen as part of the intervention itself, or as part of the startup for any subsequent intervention, or a follow-up study should be designed to examine the process after the ASSIST project team stops its activity, to test the sustainability of the program.

I think this question again speaks to how we define the “intervention” that we are evaluating and again I think this has a well-accepted analogue in the medical science literature. In evaluating improvement interventions, what the intervention involves (what was done and who did it) should be stated very specifically. If it was an external agent, it should be understood what they did to bring about the change and how they interacted with the subject of the improvement (facility or community, etc). In some studies of surgical treatments for disease, the surgical team (sometimes with the specific post-surgical rehabilitation) are specifically trained for the surgery under investigation and the fact that patients are receiving such specialized care has to be taken into account by those reading the research to determine if the technique used is applicable more generally. Jorge brings up a very valid point and it is definitely one that should be considered but not one that is a barrier to considering evidence from this field any more than it is in other areas of health and medicine.

Such “external agents” can clearly be important factors in the success – or failure – of improvement interventions. (For example, setting national clinical standards for hospitals within closed systems such as the US Veterans Administration has had an important effect in improving care in that system.) Most recent “taxonomies” of the elements of context distinguish between “external” vs “internal” context factors that can influence such interventions. (See, for example, Paul Bate’s excellent book “Organizing for Quality,” in which he develops such a taxonomy.) I agree with Edward that this phenomenon is common, and takes a variety of forms: facilitators; synergies; interaction between variables, etc. Sometimes it’s “all or nothing,” i.e., the intervention doesn’t work at all in the absence of the “external” factor, although the “external factor” alone doesn’t work either (i.e., the external factor is “necessary but not sufficient,” as the philosophers say); sometimes the relationships among such variables are more modulated (i.e., the external factor speeds up, or increases the size of, an intervention’s impact, but isn’t absolutely essential).

There are statistical methods for dealing with such multi-variable effects, but applying them is above my pay grade. In any event, the key is not to try to “control them out,” which can distort the reality of a situation, but to learn to recognize their existence, accept them as a part of reality, and characterize their role as clearly and completely as possible. Whether to consider them as “part” of the intervention or as “external” to it is almost more a matter of semantics than reality.

Thanks for such an exciting debate! ICHI is a classification of health interventions currently being developed by WHO. Its definition and coding system do not include "complex interventions". It may be the case that an international taxonomy is needed to better differentiate interventions from clinical programmes of care, care pathways and care packages, and those from system effectiveness. This debate raises the higher-order need for a new taxonomy of scientific knowledge usable for health system research. Whilst the validity of EBM for the analysis of efficacy and safety of simple health interventions is out of question, the overall usability of the EBM paradigm for effectiveness and utility analysis of complex interventions/clinical programmes and packages of care is not so clear.

If we accept that there are three major phases of scientific knowledge (Discovery, Corroboration and Implementation), EBM could be considered the highest level of organisation of knowledge related to Discovery and Corroboration. Unfortunately the original intention of EBM was another one: to guide the implementation phase by "translating" the discovery knowledge; but Implementation incorporates inductive, abductive and means-end inferences that go beyond the experimental/deductive approach. In addition, Context and Uncertainty are part of the analysis of complex interventions, and therefore new design methods and techniques of analysis have to be incorporated to understand, interpret and predict implementation. More than trying to "refine" EBM, we need to develop a broader framework that redefines 1) "evidence" to fully incorporate inductive studies, and 2) the overall concept of "scientific knowledge" to incorporate prior expert knowledge. EBM, observational and expert knowledge could be combined to generate predictive and causality models. As an example of this broader taxonomy we have suggested a new area of scientific studies, "Framing of Scientific Knowledge" (Salvador-Carulla et al, J Eval Clin Practice, 2014). PDSA, Healthcare Improvement and Spreading are areas of implementation research which are context dependent and related to expert knowledge, and where international consensus on definitions, typology of studies and methods of analysis is urgently needed.

Dear Luis,
Thank you for your superb contribution to this discussion. I read your article with great pleasure. It is a must read for anyone interested in this issue. I wanted to comment on two things you wrote:
“Whilst the validity of EBM for the analysis of efficacy and safety of simple health interventions is out of question, the overall usability of the EBM paradigm for effectiveness and utility analysis of complex interventions/clinical programmes and packages of care is not so clear.”
“Unfortunately the original intention of EBM was another one: to guide the implementation phase by "translating" the discovery knowledge; but Implementation incorporates inductive, abductive and means-end inferences that go beyond the experimental/deductive approach.”
The relationship of framing of scientific knowledge (FSK) with other types of studies represents an important way of understanding this subject. FSKs can help us understand the complexity arising from the nature of the work being conducted under the title of improvement. Thank you again.

From somebody at the very periphery of this field, we would like to pose a question around how improvement science is/ can be relevant to evaluation and adoption of “simple” pharmaceutical interventions. Reading the transcript of the meeting and the various comments would imply that in principle this may be the case. Would it though be practically feasible and meaningful?

Nowadays even “simple” pharmaceutical interventions, traditionally evaluated following EBM approaches, are becoming increasingly complex. Importantly here we need to consider that bringing a new therapy to clinical practice continues well beyond registration and launch of a product.
New chemical or biological therapies are part of a more complete disease management approach, the evaluation of which requires that, in addition to efficacy, we also demonstrate the validity/clinical utility of the accompanying diagnostic and/or predictive test in the relevant and appropriate healthcare setting, while also achieving improvement in quality of life together with potential cost effectiveness/utility.
In most drug development situations, the need to find an appropriate dose, to study patients of greater and lesser complexity or severity of disease, to compare the drug to other therapy, to study an adequate number of patients for safety purposes, and to otherwise know what needs to be known about a drug before it is marketed will result in more than one adequate and well-controlled study upon which to base an effectiveness determination.
We agree that existing data management systems can be harnessed to enable real-time collection and review of clinical information during trials. This approach facilitates reporting of information closer to the time of events, and improves efficiency, and the ability to make earlier clinical decisions.

Are there particular principles or methods of improvement science that could be applied in this context too? There are various efforts dedicated to generating alternative methodologies that could be potentially suitable in evaluating/ developing new therapies in areas where RCTs are not suitable, for example, rare diseases. Is it all about methodologies and statistical analyses or are there broader aspects that the study of improvement science could generate with applicability outside applied health/ healthcare services delivery?

Dear Suzanne and Nadina,
Thank you very much for your thoughtful comments. I particularly like your ideas: “seemingly simple biomedical interventions become more complex as patients have to use them”. “In addition to efficacy, we need to demonstrate clinical utility” reminds me of my observations as a clinician about patients’ ability to comply with a drug, which needs to be taken every 6 hours versus every 12. Also, the notion of writing “real time data” and reporting “closer to the time of events” is a complex one. It is exciting for me to see your thinking of the use of improvement methods in new medical technologies. Thank you again.
