Caution needed with artificial intelligence in medicine, experts warn
Current hype surrounding machine learning and artificial intelligence (AI) in medicine is enormous, with seemingly daily headlines declaring that some new model will change how a disease is diagnosed or treated. Experts are more cautious, though: several speakers at the recent Human Intelligence and Artificial Intelligence in Medicine Symposium warned that there is significant potential for patient harm if AI is rushed into the clinic without sufficient testing and regulation.
Rabiya S. Tuma, PhD, Medscape Medical News, May 29, 2018
AI-based clinical decision support and predictive models get a lot of attention in journals and the media, but the data available thus far are inadequate to support clinical uptake. There are no prospective, peer-reviewed data published for any of them, Eric Topol, MD, said during a keynote address at the symposium.
The US Food and Drug Administration (FDA) has now approved at least three AI-based tools for use in the clinic. The LVO Stroke Platform, from Viz.ai, was approved in February this year to help analyze computed tomography scans for signs of stroke. The IDx-DR device, from IDx LLC, was approved in April. It uses AI software to detect diabetic retinopathy in adults with diabetes and is designed to be used by non–eye care professionals. OsteoDetect, from Imagen, was approved last week, and is used to aid in the detection and diagnosis of wrist fractures in adults.
However, even those tools do not currently have published prospective data to support their efficacy.
When Topol, a cardiologist, executive vice president and professor of molecular medicine at The Scripps Research Institute in La Jolla, California, and editor-in-chief of Medscape, reached out to the researchers behind the IDx-DR device, they told him there are data from a 900-patient prospective trial, but the results are still in press. “But that is the first prospective trial with AI, which shows you how nascent the field is right now,” Topol said emphatically.
All of the other studies published to date on AI models in medicine are retrospective, including the two used to support the marketing approval for OsteoDetect.
Although some of the retrospective studies are very good, they are insufficient to justify clinical uptake, Topol and others said during the day-long symposium.
The “Unknown Unknowns”
“If someone says they have an AI model that has super-human accuracy and that’s all on test data, don’t believe it yet. High accuracy on a test set is not enough,” said Rich Caruana, PhD, a senior researcher at Microsoft Research in Redmond, Washington.
To illustrate potential pitfalls, Caruana described AI models he has worked on that were designed to distinguish patients with pneumonia who are at high risk for death from their disease, and thus may benefit from more intensive care, from those who are at lower risk and could be safely discharged from an emergency department. An early iteration had very good accuracy on test data. It also indicated that patients with asthma are less likely to die from pneumonia than patients without asthma and thus could be safely discharged home.
Looking at the data with input from clinicians, Caruana said, they could see that asthma did appear protective in the data set that had been used to train the AI model. But it appeared protective only because of confounders: patients with asthma tend to get into care earlier because they are more aware of their lung function, and they may be escalated to inpatient or intensive care faster because of their predisposition. The model, however, does not think through any of this.
“Remember one thing: supervised learning is the ultimate example of garbage in, garbage out… If you write junky code, the answer you get out will bear no relevance to the real answer,” warned John Hennessy, PhD, past president of Stanford University and co-recipient of the 2017 Turing Award, at the symposium. “These programs can extract information from examples, but they don’t have insights.”
In other words, if implemented, the pneumonia model could have put asthma patients at risk because neural networks and machine learning models can’t think about why something is happening; they only detect a pattern in the data on which they are trained. Therefore, if a risk factor like asthma or heart disease can be diagnosed and effectively treated, that risk factor will appear protective to the model because of how clinicians are reacting to it, not because of the underlying biology.
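The confounding Caruana describes is easy to reproduce on synthetic data. The sketch below is a hypothetical illustration only, not the pneumonia model discussed at the symposium: in the simulated data, asthma directly raises the risk of death, but it also triggers earlier, more aggressive care that lowers that risk, so a model trained only on the observed outcomes learns a "protective" coefficient for asthma.

```python
# Hypothetical illustration of confounding in a supervised model
# (synthetic data; not the pneumonia model discussed at the symposium).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000

asthma = rng.binomial(1, 0.15, n)              # 15% of patients have asthma
severity = rng.normal(0, 1, n)                 # unobserved illness severity

# Clinicians escalate care for asthma patients sooner and more often.
aggressive_care = rng.binomial(1, 0.2 + 0.6 * asthma, n)

# True risk: asthma and severity raise mortality; aggressive care lowers it.
logit = -2.0 + 0.5 * asthma + 1.0 * severity - 1.5 * aggressive_care
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Train on what the model "sees": asthma status and the outcome only.
model = LogisticRegression().fit(asthma.reshape(-1, 1), death)
print(f"learned asthma coefficient: {model.coef_[0][0]:+.2f}")
# Typically negative: asthma looks protective, because the model cannot see
# that the benefit came from the care the asthma diagnosis triggered.
```

A transparent model like this at least exposes the suspicious coefficient for someone to question, which is essentially how the issue surfaced in Caruana's account.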
“What is frightening is that the models tend to make the biggest mistakes on the things that we are capable of detecting and treating. If we trust these models too much it might lead to catastrophic outcomes,” Caruana said.
Another major concern about neural networks and AI, Caruana said, is that the models are based on such a large volume of data and are so complex that no one really knows what is driving outcomes, or why one patient falls into one group rather than another according to the model.
In the case of the pneumonia model, the only reason he knew about the problem was that a colleague who had been working on a much simpler, and thus transparent, rule-based prediction model, like those now regularly used in care, saw that asthma appeared protective. Knowing that, Caruana checked his own model and saw the same thing.
“I could fix that [one problem] but what else would possibly be there that I wouldn’t know — and can’t see — and therefore can’t correct? It’s the unknown unknowns” that are really a problem with these super complex models, he said.
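To make the point about opacity concrete, the hypothetical sketch below (again synthetic data, using scikit-learn's gradient boosting as a stand-in for a black-box learner) hides the same confounded signal inside a model with no coefficients to read; it only becomes visible if someone thinks to probe the predictions directly.

```python
# Hypothetical sketch: the same confounded signal hidden in a black-box model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 50_000
asthma = rng.binomial(1, 0.15, n)
age = rng.normal(65, 10, n)
aggressive_care = rng.binomial(1, 0.2 + 0.6 * asthma, n)   # unobserved by the model
logit = -2.0 + 0.5 * asthma + 0.03 * (age - 65) - 1.5 * aggressive_care
death = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([asthma, age])
black_box = GradientBoostingClassifier().fit(X, death)

# No coefficient exposes the problem; it shows up only when predictions are probed.
risk_with_asthma = black_box.predict_proba(X[asthma == 1])[:, 1].mean()
risk_without = black_box.predict_proba(X[asthma == 0])[:, 1].mean()
print(f"mean predicted risk, asthma:    {risk_with_asthma:.3f}")
print(f"mean predicted risk, no asthma: {risk_without:.3f}")
```

In practice, people reach for tools such as permutation importance or partial-dependence plots to probe models like this, but those only answer the questions someone already thought to ask, which is Caruana's "unknown unknowns" point.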
Given those unknowns, Caruana said prospective clinical trials are necessary to ensure that AI-based decision models improve outcomes and do not harm patients.
Several other speakers, including Topol, concurred with that view. Topol emphasized that the community needs to see that the tools improve patient outcomes in a prospective way, not just that they are highly accurate on data sets. “I do demand [patient] outcomes,” Topol said. “I don’t need all randomized clinical trials, but I have a problem with the in silico retrospective studies.”
New Regulatory Approaches Required
Robert Califf, MD, former FDA commissioner, vice chancellor for health data science at Duke Health in Durham, North Carolina, and an adviser to Google’s Verily Life Sciences, likened the current situation with computing and AI to the eve of earlier phases of the industrial revolution: a moment of great promise and of great potential for harm.
“As I keep reminding engineers in the Google environment — it is different if your algorithm leads someone to buy the wrong pair of shoes compared to your algorithm leading them to recommend a health care practice or a decision that is adverse to health and leads to dead people,” he said.
Regulators will have an important role to play in protecting patients, but what that regulation can or should look like remains to be determined. “I think it’s pretty obvious this can’t be regulated in the old-fashioned way,” he said.
The old-fashioned method was designed for drugs that have a long development phase, including preclinical and prospective clinical trials. “But here we have a very complex, multifactorial technology that is iterative and has a short latency. It is obvious that if every time your AI algorithm was changed you had to stop and have the FDA do an evaluation of whether it was working, you’d have a problem,” he said.
Califf put it starkly: “This leads to a situation where the old way is not fit for the purpose. But I would argue that, despite claims to the contrary, this is an enormous nuclear weapon with a tremendous potential for negative consequences. Only some of those are now unearthed.”
Looking at the history of the FDA and how it comes to regulate new technology, there is a pattern, Califf said. The agency does its best to come up with an appropriate regulatory strategy but doesn’t always get it quite right, leading in the past to some catastrophic events. Those then trigger overly tight regulations, which eventually get loosened over time to approximately the right level of stringency and approach.
“We All Have a Role”
For now, Califf noted, clinical decision support, which he carefully distinguishes from decision control, is getting a regulatory pass. “I think that is the right thing to do,” he said. “But we’re going to learn that as AI gets better and better, it can do either tremendous good or tremendous harm, and we will have to develop methods to figure out ahead of time which is which. Because doing it after the fact won’t be enough.”
Currently, the way the agency goes about it is a bit like a pre-check line for airport security: there is a precertification process to show that a tool is technically sound. But there also needs to be postmarket evaluation, which must include clinical outcomes, Califf explained.
The 21st Century Cures Act addresses some of these issues with a “pretty reasonable” approach, he said. If an AI-based model or device helps collect information, such as a device that counts steps, then there is no need for regulation. If the technology is used to diagnose or treat disease — such as the IDx-DR and OsteoDetect — then the FDA will need to approve it. If the algorithm actually is involved in care, such as telling a defibrillator when to fire, then it will be regulated like any other device.
As for who should help guide the field and regulation, Califf reiterated that the FDA cannot and should not do it alone. The agency doesn’t have the manpower or the expertise to consider all aspects of this new technology, including ethics, computing, and clinical care, on its own.
He particularly called on universities, with their diverse interests and talent pools, to get involved and ensure that this technology improves patients’ lives. And, he said, clinicians need to be involved, as do patient advocacy groups and data scientists. “We all have a role to play; this is not something that can just be left to the FDA,” he said.
Source: Medscape Medical News
Also see:
Babylon claims its chatbot beats GPs at medical exam (BBC News)
Feeling poorly? The app will see you now (Reuters)
Scripps Translational Science Institute gets $34 million for digital, genomic health care (The San Diego Union-Tribune)
Democratizing Health Care Via Smartphone (Aspen Ideas Festival)