What is multimodal AI?

By Paddy Smith

January 25, 2021


It’s the future of deep learning, but what exactly is multimodal AI, and how is it used...

Multimodal AI isn’t new, but you’ll start hearing the phrase more outside core deep learning development groups. So what is multimodal AI, and why is it being called ‘the future of AI’?

Multimodal AI: the basics

Let’s start with modes. Think of a mode like a human sense. You might see and taste a carrot, for instance. You would identify that you were eating a carrot faster than if you had to eat it blindfolded. You could also identify the carrot if you could see but not taste it. If it were not carrot shaped (pureed, for example) you might still guess it was carrot from the colour. But if you could eat that puree as well, you could get confirmation from the flavour. That’s multimodal AI in a nutshell. It combines different kinds of input so that the learning system can infer a more accurate result than any single input would allow.

Multimodal AI: how does it work?

In standard AI, a computer is trained on a specific task. Imaging, say, or language. It’s given a sample of training data, from which it can learn to identify other similar images or words. It’s simpler to train the AI if you are only dealing with one source of information, but the results can be skewed by a lack of context or supporting information. In multimodal AI, two or more streams of information are processed together, giving the software a better shot at deducing what it’s looking at.
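To make that concrete, here is a minimal sketch of one simple way to combine two streams, often called late fusion: each mode makes its own guess, and the guesses are averaged. The labels, the scores and the function names are all made up for illustration (they are assumptions, not taken from any real system), and real multimodal models usually learn how to combine inputs rather than taking a plain average.

```python
# Hypothetical labels and scores, invented for this illustration.
LABELS = ["carrot", "sweet potato", "pumpkin"]


def late_fusion(*modal_probs):
    """Average the per-label probabilities from any number of modes."""
    n_modes = len(modal_probs)
    return [
        sum(probs[i] for probs in modal_probs) / n_modes
        for i in range(len(LABELS))
    ]


def predict(probs):
    """Return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]


# Sight alone is unsure about an orange puree and leans the wrong way.
sight = [0.40, 0.45, 0.15]
# Taste is much more confident that it is carrot.
taste = [0.80, 0.10, 0.10]

fused = late_fusion(sight, taste)

print(predict(sight))  # sight on its own
print(predict(fused))  # sight and taste combined
```

With these made-up numbers, sight on its own picks sweet potato, while the combined result picks carrot. The second stream supplies the context the first one lacked.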

Multimodal AI: what’s the benefit?

Put simply, more accurate results, and less opportunity for machine learning algorithms to accidentally train themselves badly by misinterpreting data inputs. The upshot is a 1+1=3 sort of sum: the combined inputs are more perceptive and accurate than either one alone, which makes for speedier, more valuable outcomes.

Multimodal AI: how does it help businesses?

By recognising context, multimodal AI can give more intelligent insights for business planning. If machinery is being serviced according to predictive maintenance, it’s better if the AI can take input from various sensors. It might infer that an older piece of equipment does not need servicing as often, because the data shows it works just as well as a newer bit of kit once the temperature stabilises. Or it might understand that a new team is not underperforming when it is engaged in quite heavy training, which takes up time that other teams can put into productive work.

Multimodal AI: can it prioritise one input over another?

Yes, and that is crucial to its successful use. Should it look at the carrot or taste the carrot first? Does that change depending on whether the carrot is whole or pureed? Balancing the inputs to be aggregated is the ML skill needed to make the most of multimodal AI.
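As a rough sketch of what balancing the inputs can look like, the example below weights each mode before combining them, and changes the weights depending on whether the carrot is whole or pureed. Everything here is an assumption made for illustration: the function names, the hand-picked weights and the scores. In practice the weighting is usually learned from data rather than written by hand.

```python
# Hypothetical labels, invented for this illustration.
LABELS = ["carrot", "sweet potato"]


def choose_weights(is_whole):
    """Hand-set rule (an assumption): trust sight for a whole vegetable,
    trust taste once it has been pureed and the shape is gone."""
    if is_whole:
        return {"sight": 0.7, "taste": 0.3}
    return {"sight": 0.2, "taste": 0.8}


def weighted_fusion(probs_by_mode, weights):
    """Combine per-label probabilities as a weighted average over modes."""
    total = sum(weights[mode] for mode in probs_by_mode)
    return [
        sum(weights[mode] * probs[i] for mode, probs in probs_by_mode.items())
        / total
        for i in range(len(LABELS))
    ]


# Made-up scores: sight leans towards sweet potato, taste says carrot.
scores = {"sight": [0.40, 0.60], "taste": [0.90, 0.10]}

as_puree = weighted_fusion(scores, choose_weights(is_whole=False))
as_whole = weighted_fusion(scores, choose_weights(is_whole=True))

print(as_puree)  # taste dominates
print(as_whole)  # sight dominates
```

With these numbers the answer is carrot either way, but the confidence shifts: roughly 0.80 when taste is given priority for the puree, against roughly 0.55 when sight is given priority. Picking weights that suit the situation is the balancing act described above.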
