In a recent article from L’Atelier BNP Paribas, Alex Hadwick breaks down the quiet breakthrough of vision AI, and what the future looks like through its lenses.
Content and image credit – L’Atelier BNP Paribas: AI’s biggest leap is hiding in plain sight … and it’s not LLMs by Alex Hadwick
The ability to use Google Lens to search images, or to log in via facial recognition, has become just another part of the technology landscape. Smartphones can now imitate how our eyes and brains work: reconstituting data from more than 100 million photoreceptive cells into millions of possible colours across an amazingly broad spectrum, a process millions of years in the making.
It’s worth pausing a moment to absorb this breathtaking change; the advances almost all took place in the last decade, through a machine learning revolution. Now, artificial intelligence’s capacity to replicate our visual system enables machines to do more than ever.
From classifying cancers to serving as the eyes of self-driving trucks weighing up to 30 tonnes, AI image analysis is an achievement worth opening our eyes to.
A decade in the making
When Andreas Wendel, Chief Technology Officer for autonomous vehicle firm Kodiak Robotics, stepped into this world around 2008, deep learning “was very different,” he remembers. “It was much more structured,” requiring large conceptual frameworks and manual data processing, analysis, and modelling.
Neil Daly, CEO & founder of medical image analysis company Skin Analytics, recalls that when they started in 2012, “Deep learning wasn’t a thing, so we were doing classical machine learning” to find and categorise skin cancers, a goal that remains the company’s mission today.
That process involved a lot of hard grafting for Skin Analytics. They held repeated meetings with dermatologists and attempted “to describe what [they told] me mathematically,” says Daly. “Is it the unevenness of the border? Is it the number of colours in the lesion? And the answer is always, yes, it’s all those things.” Each time, dermatologists couldn’t quite put their fingers on the precise reason behind their diagnoses; they just knew.
“I can’t codify that,” states Daly. “In the end we had something like 200 different measurements on these lesions mathematically representing them, and then we had a support vector machine that was optimising the relative weights.”
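For a feel of what that classical pipeline looked like, here is a minimal sketch in Python with synthetic data. The feature values, labels and model settings are illustrative stand-ins, since the real system’s 200 measurements are not public.

```python
# A hedged sketch of the classical pipeline Daly describes: hand-crafted
# lesion measurements fed to a support vector machine. The data and the
# number of features are synthetic, not Skin Analytics' actual features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_lesions, n_features = 500, 200               # ~200 measurements per lesion
X = rng.normal(size=(n_lesions, n_features))   # e.g. border unevenness, colour counts...
y = rng.integers(0, 2, size=n_lesions)         # 1 = suspicious, 0 = benign (synthetic labels)

# The SVM learns the relative weight of each measurement from labelled examples,
# exactly the "optimising the relative weights" step Daly mentions.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X, y)
print(model.score(X, y))  # accuracy on the (synthetic) training data
```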
That worked in testing. But when a team member’s impromptu trial at a family Christmas falsely suggested all his relatives had melanoma, they “realised just how complex this field was,” Daly recounts. “We thought we had this really great performance. But then if you got any real world data, it didn’t actually work.”
In 2015, “convolutional neural networks really started,” recounts Wendel, opening a new set of possibilities.
From theoretical to real
The advent of multi-layered neural networks was key to consistent visual analysis.
Neural networks are learning programs inspired by research into how human vision works. A network stacks layers of computational neurons, imitating the pathways of biological neurons in the brain. Each neuron takes inputs, weighs them, and passes a mathematical output on to the next layer. Stacking more layers lets the network combine simple features into more complex ones, increasing its capacity to process richer inputs.
For images, the network assigns values to characteristics in the picture. A simple black-and-white image of a few hundred pixels showing a single shape, something relatively easy to categorise and assign a value to, might take only a few layers to process, because the characteristics are limited and the rules consistent.
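As a hedged illustration of that layering, here is a toy convolutional network in Python using PyTorch. The architecture is a generic teaching example, not any system described in this article.

```python
# A minimal sketch of the layered idea described above: each convolutional
# layer turns raw pixels into progressively richer characteristics before
# a final layer assigns a value to each candidate category.
import torch
import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # layer 1: detect edges and contrast
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # layer 2: combine edges into shapes
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # assign a score to each of 10 classes
)

image = torch.randn(1, 1, 28, 28)  # one black-and-white image, a few hundred pixels
scores = tiny_cnn(image)           # one score per candidate category
print(scores.shape)                # torch.Size([1, 10])
```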
But the real world is more complex, making advances in machines’ compute power “a really big unlock,” remarks Wendel, allowing contextual differentiation of complex elements at speed.
Consider how a self-driving truck needs to see the world and navigate it safely: “First, you need to perceive the world around you,” says Wendel. This requires a variety of sensors and sensor fusion, including visual cameras and lasers around the vehicle, categorising input in milliseconds. Then “you plan a path and say this is the action that my vehicle should take at this moment … and that basically creates a reaction to which the world will react again.”
That dynamic environment requires more layers of decision-making. “We want to be predictive of what happens around us,” not just reactive, he explains. “A truck cannot brake as hard as a passenger vehicle can, so we need to give all the other participants the benefit of the doubt, but we also need to say ‘this is a pattern I have seen—when someone does this, they’re actually about to cut us off’.”
The complete system is therefore juggling a lot of different decisions and sensory input at any given time while anticipating environmental change, which is why “machine learning plays a big role in the perception part. It’s more about image processing and images in the wider sense,” he says.
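To make that structure concrete, here is a toy sketch of the perceive-predict-plan loop in Python. The class, thresholds and numbers are hypothetical illustrations of the pattern Wendel describes, not Kodiak’s software.

```python
# A toy sketch of the perceive -> predict -> plan pattern described above.
# All names, thresholds and numbers are hypothetical, not Kodiak's system.
from dataclasses import dataclass

@dataclass
class TrackedObject:
    lateral_speed_ms: float   # drift towards our lane (m/s, positive = towards us)
    range_m: float            # distance ahead of the truck (m)

def predict_cut_off(obj: TrackedObject) -> bool:
    # "This is a pattern I have seen": steady drift towards our lane
    # at close range often precedes a cut-off.
    return obj.lateral_speed_ms > 0.5 and obj.range_m < 60.0

def plan_speed(current_speed_ms: float, objects: list[TrackedObject]) -> float:
    # A loaded truck cannot brake like a car, so shed speed early and
    # give other road users the benefit of the doubt.
    if any(predict_cut_off(o) for o in objects):
        return max(current_speed_ms - 3.0, 0.0)  # ease off well in advance
    return current_speed_ms

# One tick of the loop: perception (stubbed here) feeds prediction and planning,
# and the plan changes the world the next perception pass will see.
perceived = [TrackedObject(lateral_speed_ms=0.8, range_m=45.0)]
print(plan_speed(current_speed_ms=26.0, objects=perceived))  # 23.0
```

In a real truck each stage runs continuously, in milliseconds, over fused camera, lidar and radar data; the point here is only the shape of the loop.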
“The minute that we introduced deep learning,” and relied on the “real power of these computer vision deep networks,” there were major gains, agrees Daly. “I think the first time we used a deep learning network, we got a 10 percent improvement on anything that we’d done previously, and that was with no optimisation.”
“Now, you just throw more and more data” into modern machine learning and “it has really gotten to a very data-driven point,” continues Wendel. “You can solve many, many different problems with it.” This is especially so as “computers are extremely good at doing very repetitive tasks,” and “visual analysis can be automated very well given a lot of data.”
While volume is typically king, Daly outlines another approach: “We decided early on we could either get a huge amount of data that had labels given by dermatologists only, or we could only work with data that had histopathology outcomes, which is where you do a tissue sample.”
Their decision shrank the available pool of data “by a factor of probably 10”. “When we said we were going to just work these tiny data sets, everyone thought we were crazy,” Daly recalls. But a definitive outcome yielded greater precision and better training data, meaning they could look to match or exceed human accuracy in the long term.
Despite different approaches, both agree that moving from theoretical and second-hand data to capturing the complexity of the real world has been critical, bringing them closer to realising the visions of their respective companies.
Bringing vision AI to life
For Daly and Skin Analytics, the aim is to cut skin cancer mortality, much of it driven by limited access to diagnostics and failures to catch malignant cancers early.
He notes that while nearly 98 percent survive following early diagnosis, “the actual survival rate for skin cancer around the world is in the mid-to-high 80s,” leaving a preventable 10 percent gap largely caused by “the access challenge.”
“We know that if you can increase access to dermatologists, then the utilisation of those dermatology services goes up dramatically,” he observes. But the pipeline to provide and access those services is constrained.
“You have a very small number of specialists because it costs a lot of money, it takes a lot of time, and these people have to be experts … Then the healthcare system gatekeeps getting to those specialists.”
Vision AI offers hope. “What’s so exciting about computer vision is that you can unblock that. You can suddenly make that specialist resource infinite,” he says, a prospect that counts in a field where speed of diagnosis matters but training a specialist takes 15 years.
Skin Analytics is not alone in believing computer vision can play a critical role in the medical field while easing diagnostic burdens. A South Korean study achieved remarkable results using deep learning to diagnose Autism Spectrum Disorder (ASD) by studying optical scans of retinas. While there was only limited evidence of a link between retinal thickness and ASD, the model achieved 100 percent accuracy diagnosing autism in the test set.
Researchers caution that more work is needed, but the potential application is clear: in the UK, the number of patients waiting over 13 weeks for an autism assessment rose 545 percent between April 2019 and December 2023.
Kodiak and their self-driving vehicles represent another major growth area—industrial applications. The combination of computer “eyes,” a machine learning-driven brain, and robotic actuators is a huge field on the verge of exploding, given widespread labour constraints.
Wendel says that over the course of “driving two and a half million miles,” their trucks have established clear routes, including a daily load from Dallas to Atlanta—a roughly 1,500-mile round trip.
While this is largely on highways, and human drivers often take over for the last mile of distribution, Wendel says the sections they focus on, known as the “middle mile” in logistics, are key stretches for autonomy. He describes them as “the long, often dull, dangerous kind of stretches of the road that less and less people want to do … there’s a really big driver shortage.” They hope to deploy completely driverless vehicles in the next 12 months.
This labour constraint will drive many key technologies powered by vision AI. Hard, often poorly paid manual jobs are where the technology will be most impactful.
Take warehouse work, where workers already perform “vision picking”: an augmented reality layer, seen through smart glasses, finds and logs the items to be taken from the shelves. The next step is the emerging field of picking robots, which use computer vision both to scan a product and to understand its dimensions. Boston Dynamics, the now-famous creator of eerily human robots, originally funded by US military grants from DARPA, has pivoted into this area.
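How might a robot “understand dimensions”? With a depth camera, it often reduces to pinhole-camera geometry. A minimal sketch in Python, with made-up camera intrinsics and bounding-box values:

```python
# A hedged sketch of the dimension step: given a detector's bounding box and
# a depth reading, a pinhole-camera model recovers the item's real size.
# The focal lengths and the box below are illustrative example values.

def box_dimensions_m(box_px, depth_m, fx, fy):
    """Convert a (left, top, right, bottom) pixel box at a known depth
    into approximate real-world width and height in metres."""
    left, top, right, bottom = box_px
    width_m = (right - left) * depth_m / fx    # similar triangles: size = pixels * Z / f
    height_m = (bottom - top) * depth_m / fy
    return width_m, height_m

# e.g. an item 200 px wide, seen 0.9 m away by a camera with ~600 px focal length
print(box_dimensions_m((120, 80, 320, 260), depth_m=0.9, fx=600.0, fy=600.0))
# -> (0.3, 0.27): roughly a 30 cm x 27 cm item face
```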
In agriculture, there is a chronic lack of seasonal labour in many key food-producing regions and an emerging crisis in soil quality. Training robots to spot weeds means they can remove them through mechanical weeding, use herbicides more judiciously, or even burn weeds with lasers.
Trusting the technology
While there is plenty of promise in the advances underpinned by computer vision, getting these systems embedded requires trust from users, the legal system and the wider public.
“There is an actual need for this application,” says Wendel. “There is a job shortage … so now it’s really on us to show that we can safely take the driver out” of the picture.
But for AI “there’s this natural reticence to be ready to trust a system that might introduce more risk,” Daly notes. “The barrier of evidence to prove … that it works is very high.” Even with completed trials and deployments for their firm, “I don’t think we’re there yet.”
Yet he sees “a real change in perception that’s been led by these large language models and all the press that they’re getting in this last year. It’s really changing people’s perception about what AI can and can’t do, and the clinicians seem to be responding to that and the weight of evidence that we have.”
Daly also feels it is important to define what AI can’t do, especially in a field as “messy” and “grey” as medicine. “You might have a patient who is 93 and has a basal cell carcinoma. You could cut that basal cell carcinoma out, but at 93, is that going to improve their quality of life or reduce their quality of life? Those decisions can’t be made by a machine.”
A picture’s worth a thousand words. Can it capture a feeling?
Wariness around these technologies may create impediments to use, but the systems are not panaceas, says Daly. They possess clear limitations.
One example lies in software that promised the HR world it could simplify and improve hiring by reading candidates’ facial expressions. The idea fell apart when the AIs met actual human faces. The poster child is HireVue, which heralded its advances in 2019, only to drop the software and subsequently face a series of class action lawsuits.
While HireVue faced scrutiny in the courts, there is often no way to analyse private companies’ claims of AI efficacy, so objective measures are few and far between. In AI-assisted hiring, variable video quality, cultural differences, racial bias and neurodiversity have produced a measure of scientific consensus that such tools are ineffective. More broadly, the fact that the share of companies reporting talent shortages has skyrocketed in the last three years calls the whole field into question.
Even when the tech works as intended, results can be concerning without suitable regulations or government constraints. China has long been a centre for facial recognition technology, with companies given wide remit to develop the technology … but even Beijing conceded the need for limits, drafting tighter regulations in 2023.
Then there is AI’s ability to disrupt the visual sphere with image and video generation.
Open-source investigation outfit Bellingcat has logged far-right racists creating imagery with DALL-E and pornographers generating non-consensual imagery of real people, noting: “A recent report by independent research group Graphika said that there was a 2,408 percent increase in referral links to nonconsensual pornographic deepfake sites across Reddit and X in 2023.”
AI image and video creation is now endemic in 2024, a pivotal year for global democracy as national and supranational elections could impact up to four billion people. When the Center for Countering Digital Hate (CCDH) asked the largest AI image generators to create images related to the 2024 US presidential election, 41 percent of the prompts “generated images constituting election disinformation.”
Image is everything
Given these issues, it seems likely that the biggest positive impacts will be felt in areas where human eyeballs, brains and hands are in sharp demand and their efficacy can be clearly measured.
Wendel says that while AI image generation is “incredible technology,” the real question is, “What are people going to be willing to pay for it?” The process is energy intensive and therefore expensive: a Carnegie Mellon University study found that generating 1,000 images produces roughly as much carbon as driving 4.1 miles in an average petrol-powered car.
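The back-of-envelope arithmetic, assuming an average petrol car emits around 400 grams of CO2 per mile (our assumption, broadly in line with US regulators’ estimates, not a figure from the study), puts the cost per image small but real:

```python
# Rough arithmetic behind the comparison above. The per-mile emissions
# figure is an assumption of ours, not taken from the study itself.
G_CO2_PER_MILE = 400          # assumed average petrol-car emissions (g/mile)
MILES_PER_1000_IMAGES = 4.1   # the study's equivalence for 1,000 generated images

g_per_image = MILES_PER_1000_IMAGES * G_CO2_PER_MILE / 1000
print(f"{g_per_image:.2f} g CO2-equivalent per image")  # ~1.64 g
```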
So while image generation grabs the headlines, the most critical places where machines interact with the visual environment are likely to be those where human labour hits clear limits: settings where computer vision can filter and analyse material before it reaches a person, or industrial applications that use AI vision to categorise a specific environment.
There are risks to deploying these technologies in areas like medical imaging, and humans naturally insert cognitive bias into the mix. Repeated studies reflect differences in how clinicians analyse patients based on sex or ethnicity, resulting in poorer outcomes for women and minorities. A Northwestern University study on deep learning and skin conditions found specialists perform better working with AI overall, but primary care professionals were less accurate for darker skin colours, leading researchers to conclude, “it’s not the AI that is biased, it’s how physicians use it.”
The keys to mastering computer vision lie in keeping use cases specific and solving problems where humans lack bandwidth: focusing on objects that can be fully categorised by AI, and rigorously testing the systems in monitored scenarios. If we can build and deploy responsibly (and that’s a big if), we may be freed from some of our most severe constraints.
This article from L’Atelier BNP Paribas is part of “Overlooked: Technologies quietly changing the world,” a series of pieces that focus on advances making tangible everyday differences, but that also promise transformative change.