We were delighted to see research published in The Lancet Digital Health last week into the publicly available skin cancer datasets that many use to train AI systems to assess skin cancer.
Applying academic rigour to a long-known issue in machine learning helps to educate the industry about one of the foundations on which AI systems are built: training data.
Having spent almost a decade now working to apply AI to skin cancer, I thought I would share our perspective on the paper. The discussion was broad and I’ve taken what I see as the main conclusions from the paper to comment on.
- The lack of metadata limits the clinical utility of training data, which has implications for the generalisability of AI, and such systems should be trained with metadata-rich training data
The authors suggest that the lack of metadata associated with the skin lesion images reduces the value of the data for training. The metadata they refer to is the additional information a clinician might use to help make a diagnosis. This would be things like age, UV exposure, family history of skin cancer and a range of other risk factors.
A key challenge with this assertion is that it appears to be based on two cited papers in which the effect is not very strong.
One of the papers actually showed very little impact from including the metadata. A key metric, Area Under the Curve (AUC), for melanoma increases from 0.784 to 0.794 with the inclusion of the metadata for clinical images, and that increase is within the error margins. Further, for dermoscopic images, AUC increases only from 0.830 to 0.832. The table is below and links to the paper if it is of interest.
The second paper only deals with smartphone images, not dermoscopic ones, and reports an AUC increase from 0.929 to 0.948. When the error margins are taken into account, and considering we’re not discussing dermoscopic images, which contain more information than clinical ones, the effect again isn’t very strong.
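To put those numbers side by side, here is a minimal Python sketch that computes the absolute and relative AUC gain for each result quoted above. The dictionary labels and the `auc_gain` helper are my own names for illustration, and the sketch deliberately does not test significance, because the error margins themselves are not reproduced in this post.

```python
# AUC values as quoted in this post: (without metadata, with metadata).
# The labels are illustrative names, not taken from the papers.
reported_auc = {
    "paper 1, clinical images": (0.784, 0.794),
    "paper 1, dermoscopic images": (0.830, 0.832),
    "paper 2, smartphone images": (0.929, 0.948),
}


def auc_gain(without_meta, with_meta):
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = with_meta - without_meta
    relative_pct = 100 * absolute / without_meta
    return round(absolute, 3), round(relative_pct, 1)


gains = {name: auc_gain(*pair) for name, pair in reported_auc.items()}

for name, (absolute, relative_pct) in gains.items():
    print(f"{name}: +{absolute:.3f} AUC ({relative_pct:.1f}% relative)")
```

The largest gain is 0.019 AUC, roughly 2% in relative terms, and the smallest is 0.002. Whether any of these is meaningful depends on the confidence intervals reported in the papers themselves.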
While it feels like metadata offers an opportunity to improve AI systems, something the authors acknowledge is appealing as it closely reflects clinical practice, at this stage there isn’t sufficient evidence to suggest that is true. Certainly I’d argue that there isn’t enough evidence to drive policy in this area.
More research is required before we can be confident that adding complex and different data modalities to AI systems increases performance. There is also a risk of overfitting to this data, which leads to lower generalisability.
- Few AI systems have been validated with externally sourced datasets, meaning the performance of AI systems may be generally overestimated
This is something that we wholeheartedly agree with. No AI system should be validated with the currently available public datasets. That’s simply because these datasets aren’t representative of the general population and have been created for many different reasons, with few of them being to assess an AI system.
Prospectively gathered data is essential. At Skin Analytics we’re very proud of having run the first ever powered prospective clinical study to evaluate an AI system’s ability to identify melanoma. We have four further prospective studies underway across the UK, Australia and the US.
Even with that level of data, deploying AI into healthcare systems should be done with appropriate risk controls and post-market surveillance to ensure that the AI is safe and effective. This is why products like ours are considered to be medical devices and have to comply with regulatory laws and answer to regulatory bodies such as the MHRA.
AI is solving a large problem that is not just going to go away, and it’s wholly appropriate to prioritise safety when patients’ lives are impacted.
But we can’t forget that risk needs to be weighed against the benefits. Skin cancer rates are doubling in the UK every 10-15 years and about 30% of dermatology posts in the NHS are unfilled. Of those positions that are filled, roughly a third are filled by locums who may not have the same level of training.
Healthcare systems are not equipped to deal with the volume of patients they need to see, especially in the light of the backlog created by Covid.
At Skin Analytics, we strongly advocate the need for prospective, controlled, clinical trial data as the base level of evidence required to consider an AI deployment in healthcare. But there is a balance to be struck between risk and benefit when it comes to the evidence that is needed.
Instead of answering every possible research question, we would support a detailed evaluation of risk and whether appropriate mitigations can be designed to protect patients. That way we can ensure the conversation is also considering the benefits that can alleviate the challenges currently faced by dermatology departments.
- AI systems may be dangerous if used widely if the datasets they were trained on aren’t shared in detail, with metadata included
Putting aside the metadata point, which we discussed above, we would agree that a level of discussion around the datasets used for training is an important diligence step in assessing an AI solution.
However, a more important step is assessing the data the AI was validated on. If explainable AI is the holy grail, then detailed understanding of the datasets involved would likely be important. However, there is research that suggests that focusing on explainable AI involves significant risks and that the focus should instead be on appropriate clinical validation.
While we align with this view, we believe that the evidence requirements need to be driven by the risks and benefits associated with the use of an AI system. Explaining AI can create a false sense of understanding in what is a complex system, and for now we advocate a more statistical approach to evaluating and monitoring performance rather than auditing training data. Ultimately patients are affected by the outcomes rather than the training data.
- The geographic distribution of training data was not equal and higher Fitzpatrick skin types were poorly represented
This is indisputably true and tied to a broader social conversation about how our institutions have been created and how that impacts social equity. At Skin Analytics we strongly believe that every person should have access to high quality healthcare. Period.
However, this particular conversation is more complicated than it first appears. The authors note that the majority (79%) of publicly available skin cancer data comes from Europe, Oceania and North America exclusively.
The implication is that there is a lack of interest or effort behind capturing data from the rest of the world. It’s true that healthcare inequality has meant that digital health advances have happened faster in wealthy countries and that means there is more digital data to share for training algorithms.
But that’s not the whole story. The WHO’s Global Cancer Registry reports that 85% of worldwide melanoma cases each year occur in patients in those three regions that supplied the 79% of available training data.
That’s not to say that we don’t have to do more to ensure that we get better datasets for training. I share it only to highlight that the conversation is more complicated than it first appears.
Regarding Fitzpatrick skin types, the challenge is also complicated. As the researchers point out, only 2,436 (~2.3%) images in the training sets had Fitzpatrick skin type data and within that was a pitiful amount of data for darker skin types.
Let’s move past Fitzpatrick skin type for a moment (it was developed in the 1970s to classify white skin and there is plenty of research to show it may be inappropriate for classifying darker skin types). Instead, if we focus on race/ethnicity, in the US between 2007 and 2011, 1.02% of invasive melanomas were found in Black, American Indian, Asian or Pacific Island patients (0.53%, 0.21% and 0.28% respectively).
So while we don’t know the race or ethnicity for 97.7% of training data found in this report, you can be sure that darker skin tones make up a small number, just because the incidence of skin cancers is significantly lower in these patients.
Unfortunately, and this is the real problem, we can’t say the same for melanoma mortality. While Black, American Indian, Asian and Pacific Island patients make up only 1.02% of the invasive melanomas found in the US over that time period, they represent 2.22% of the melanoma deaths.
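The size of that gap is easier to see as a ratio. Here is a minimal Python sketch using only the US figures quoted above; the variable names are mine, and the ratio is a rough indicator of over-representation in deaths, not a formal mortality rate.

```python
# US figures for 2007 to 2011 as quoted in this post, in percent.
# Subgroup shares of invasive melanoma, in the order the post lists them.
subgroup_incidence_pct = [0.53, 0.21, 0.28]
combined_incidence_pct = 1.02  # share of invasive melanomas
combined_deaths_pct = 2.22     # share of melanoma deaths

# Sanity check: the subgroup shares should add up to the combined share.
subgroup_total_pct = round(sum(subgroup_incidence_pct), 2)

# How over-represented these patients are among deaths relative to diagnoses.
over_representation = round(combined_deaths_pct / combined_incidence_pct, 2)

print(f"Subgroup shares sum to {subgroup_total_pct}%")
print(f"Share of deaths is {over_representation}x the share of diagnoses")
```

In other words, these patients account for a bit more than twice the share of melanoma deaths that their share of diagnoses would suggest.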
I’ll say it again, at Skin Analytics we believe that every person should have access to high quality healthcare. That means that we have to work together to solve these issues of health inequality.
It is right for the researchers to shine a light on the fact that as an industry we’re not yet adequately training our AI systems. But we can’t forget that we’re not adequately training our clinicians either, which was highlighted by the impressive work by Malone Mukwende in ‘Mind the Gap’.
The lack of this data for training across medicine is an issue we must solve and to do so, we will have to work together. We’re always looking for ways to solve this issue in dermatology, so please do reach out if you have a good idea!