Modality: The Multi-Dimensional Language of Computer Vision

Through the processing of data from multiple modalities, AI systems will provide more valuable interactions and higher-quality information.

Modality is defined as “a particular mode in which something exists or is experienced or expressed.” In artificial intelligence, we use the term to describe the type(s) of input and output data an AI system can interpret. In human terms, modalities correspond to the senses of touch, taste, smell, sight, and hearing. AI systems, however, can integrate with a variety of sensors and output mechanisms to interact through an additional array of data types.

Pattern recognition and machine learning performed with a variety of cameras and sensors enable systems to identify and interpret meaningful patterns within data to perform specific tasks or solve defined problems.

Understanding Modality

Each type of modality offers unique insights that enhance an AI system’s ability to understand and interact with its environment.

Types of Modalities:
  • Visual: Models such as Convolutional Neural Networks (CNNs) enable the processing of visual data for tasks like image recognition and video analysis. For instance, Google’s DeepMind leverages computer vision technologies for accurate predictions of protein structures.
  • Sound: This refers to the ability to process auditory data. Typically, AI systems use models like Recurrent Neural Networks (RNNs) to interpret sound waves. The most common applications today are for voice recognition and ambient sound detection. For example, voice assistants (e.g., Siri, Alexa) use auditory modalities to process user commands.
  • Textual: This modality covers understanding and generating human text. These systems often leverage natural language processing (NLP) and large language models (LLMs) built on Transformer-based architectures. Chatbots, translation tools, and generative AI tools like ChatGPT rely on textual modalities.
  • Tactile: This relates to touch-based sensory modalities for haptic technologies. A pertinent example today is robots that can perform delicate tasks, such as handling fragile items.

Initially, AI systems focused heavily on single modalities. Early models like the perceptron, for example, laid the groundwork for the visual modality in the 1950s. Natural language processing (NLP) was another major breakthrough. Its obvious application is human-readable text, but the related problem of reading handwriting also drove computer vision models such as LeNet. NLP still underpins the interactions between humans and most generative AI tools.

Concept of Convolutional Neural Networks (CNN) modality

The introduction of RNNs and CNNs in the late 20th century was a watershed moment for auditory and visual modalities. Another leap forward came with the unveiling of the Transformer architecture in 2017, which underpins models like GPT and BERT. These particularly enhanced the ability to understand and generate language.

Today, the focus is shifting toward multi-modal AI systems that can interact with the world in multifaceted ways.

Multi-Modal Systems in AI

Multi-modal AI is the natural evolution of systems that can interpret and interact with the world. These systems combine multimodal data, such as text, images, sound, and video, to form more sophisticated models of the environment. In turn, this allows for more nuanced interpretations of, and responses to, the outside world.

While incorporating individual modalities may help AIs excel in particular tasks, a multi-modal approach greatly expands the horizon of capabilities.

Breakthrough Models and Technologies

Meta AI is one of the entities at the forefront of multi-modal AI research. It’s in the process of developing models that can understand and generate content across different modalities. One of the team’s breakthroughs is the Omnivore model, which recognizes images, videos, and 3D data using the same parameters.

The team also developed its FLAVA project to provide a foundational model for multimodal tasks. It can perform over 35 tasks, from image and text recognition to joint text-image tasks. For example, in a single prompt, FLAVA can describe an image, explain its meaning, and answer specific questions. It also has impressive zero-shot capabilities to classify and retrieve text and image content.

Diagram of the FLAVA framework showing the processing of an input image and text through separate encoders and their convergence in a multimodal encoder, which feeds into task-specific heads for different AI applications.
An overview of FLAVA’s architecture. It shows the fusion of image and text inputs for comprehensive multi-task processing. (Source)

Data2vec, another Meta initiative, shows that the “exact same model architecture and self-supervised training procedure can be used to develop state-of-the-art models for recognition of images, speech, and text.” In simple terms, it suggests that supporting multiple modalities does not have to mean extreme development overhead.

A diagram illustrating a method for speech processing in Data2vec. It shows two parallel processes: 'Model in student-mode' receives an audio input and outputs a simplified model representation. Below, 'Model in teacher-mode' receives the same audio input and outputs a detailed model representation. During training, the goal is to predict the teacher model's representation of the original input, with a dotted line indicating that the teacher model tracks and guides the student model.
Schematic representation of a dual-mode AI training approach. A ‘student’ model learns to emulate a ‘teacher’ model’s representation of the speech input. The same architecture and training procedure can be applied to text, auditory, and visual modalities. (Source)
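To make the student and teacher relationship concrete, here is a minimal sketch in plain Python. It assumes the teacher "tracks" the student through an exponential moving average of the student's weights, and it uses a plain mean squared error as the training signal, which is a simplification. The function names, toy weights, and decay value are illustrative and do not come from Meta's code.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """Nudge each teacher weight toward the matching student weight."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def regression_loss(student_prediction, teacher_target):
    """Mean squared error between the student's output and the teacher's target."""
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_prediction, teacher_target)
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy weights: the teacher starts as a copy of the student, then lags behind it.
student = [0.5, -0.2]
teacher = list(student)
student = [0.7, -0.1]  # pretend one gradient step changed the student
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher is only a smoothed copy of the student, nothing in this loop depends on whether the inputs are audio, pixels, or text, which is the point the Data2vec work makes.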

Google has also contributed significantly to the field with models like Pix2Seq. This model takes a unique approach by solving seemingly unimodal tasks using a multi-modal architecture. For example, it treats object detection as a language modeling task by describing the objects in an image as a sequence of tokens. Another Google model, MaxViT, is a vision transformer that efficiently combines local and non-local information.

A flow diagram illustrating the Pix2Seq framework for object detection. Three different images with various objects are inputted into the Pix2Seq model, which processes the command 'detect objects'. The model outputs sequences of tokens for each image. These tokens represent the coordinates for bounding boxes and the class labels of detected objects, such as 'train', 'motorcycle', and 'person'.
Pix2Seq model workflow: Converting visual inputs into sequenced data tokens for object detection. The model translates images into annotated textual information for various object classes. (Source)
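The idea of turning a bounding box into tokens can be sketched in a few lines. The example below assumes each object becomes five tokens (four quantized corner coordinates followed by a class token) and that class tokens sit after the coordinate bins in a shared vocabulary. The helper names, the 1,000-bin default, and the example box are illustrative choices, not Google's implementation.

```python
def box_to_tokens(box, class_id, image_size, num_bins=1000):
    """Turn a pixel box (ymin, xmin, ymax, xmax) plus class id into five tokens."""
    height, width = image_size
    sizes = (height, width, height, width)
    coord_tokens = [
        min(int(coord * num_bins // size), num_bins - 1)
        for coord, size in zip(box, sizes)
    ]
    # Class tokens are offset so they share one vocabulary with the coordinate bins.
    return coord_tokens + [num_bins + class_id]


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Invert box_to_tokens: return bin-centre pixel coordinates and the class id."""
    height, width = image_size
    sizes = (height, width, height, width)
    *coord_tokens, class_token = tokens
    box = tuple(
        (token + 0.5) * size / num_bins
        for token, size in zip(coord_tokens, sizes)
    )
    return box, class_token - num_bins


# A box on a 400 x 500 (height x width) image, labelled with hypothetical class 2.
tokens = box_to_tokens((100, 50, 300, 250), class_id=2, image_size=(400, 500))
```

Once boxes are plain integer tokens, a standard sequence model can be trained to emit them one at a time, the same way a language model emits words.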

On the technology front, NVIDIA has been instrumental in pushing multi-modal AI innovation. The NVIDIA L40S GPU is a universal data center GPU designed to accelerate AI workloads across modalities, including large language model (LLM) inference and training, graphics, and video applications. It may yet prove pivotal in developing the next generation of AI for audio, speech, 2D, video, and 3D.

Powered by NVIDIA L40S GPUs, the ThinkSystem SR675 V3 represents hardware capable of sophisticated multi-modal AI workloads, such as the creation of digital twins and immersive metaverse simulations.

Real-Life Applications

The applications of multi-modal AI systems are vast, and we’re only at the beginning. For example, autonomous vehicles require a combination of visual, auditory, and textual modalities to respond to human commands and navigate. In healthcare, multi-modal diagnostics incorporate imaging, reports, and patient data to provide more precise diagnoses. Multi-modal AI assistants can understand and respond to different inputs like voice commands and visual cues.

Multimodal AI application in healthcare. (Source)

And, at the very forefront, we are seeing advanced new robotics systems using multi-modal capabilities. In a recent demo, Figure 01 demonstrated the ability to combine human language inputs with a visual interpretation. This allowed it to perform typical human tasks in a kitchen, based on verbal instructions. We are seeing similar developments with other competitors, such as Tesla’s Optimus.

Robot-human interaction facilitated by multimodal AI. (Source)

Technological Frameworks and Models Supporting Multi-Modal AI

The success of multi-modal systems necessitates the integration of various complex neural network architectures. Most use cases for multi-modal AI require an in-depth understanding of both the content and context of the data the system is fed. To complicate matters further, these systems must be able to efficiently process modalities from multiple sources simultaneously.

This raises the question of how to best integrate disparate data types while balancing the need to enhance relevance and minimize noise. Even training AI systems on multiple modalities at the same time can lead to co-learning problems, whose impact can range from simple interference to catastrophic forgetting.

However, thanks to the field’s rapid evolution, advanced frameworks and models that address these shortcomings emerge all the time. Some are designed specifically to help harmoniously synthesize the information from different data types. PyTorch’s TorchMultimodal library is one such example. It provides researchers and developers with the building blocks and end-to-end examples for state-of-the-art multi-modal models.

Notable models include BERT, which offers a deep understanding of textual content, and CNNs for image recognition. TorchMultimodal allows the combination of these powerful unimodal models into a multi-modal system.

This has also led to revolutionary breakthroughs. For example, the development of CLIP has changed the way computer vision systems learn textual and visual representations. Another example is Multimodal GPT, which extends OpenAI’s GPT architecture to handle multi-modal generation.

A diagram illustrating the workings of the CLIP model. To the left, it shows how CLIP pre-trains image and text encoders to predict which texts match which images. To the right, it illustrates how the model is turned into a zero-shot classifier. It also converts the dataset's classifiers into image captions.
CLIP pre-trains an image encoder and a text encoder to predict which images were paired with which texts in a dataset, turning CLIP into a zero-shot classifier. All of a dataset’s classes are then converted into captions such as “a photo of a dog.” CLIP then predicts the class of the caption it estimates best pairs with a given image. (Source)
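The zero-shot step described in the caption comes down to comparing one image embedding against one embedding per caption. The sketch below assumes the embeddings have already been produced by the two encoders and uses hand-written three-dimensional vectors as stand-ins. The function names, the vectors, and the 0.07 temperature are illustrative assumptions, not values taken from a trained model.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, caption_embeddings, temperature=0.07):
    """Return a probability per caption from temperature-scaled cosine similarity."""
    image = normalize(image_embedding)
    logits = {
        caption: sum(a * b for a, b in zip(image, normalize(embedding))) / temperature
        for caption, embedding in caption_embeddings.items()
    }
    peak = max(logits.values())  # subtract the max for a numerically stable softmax
    exps = {caption: math.exp(value - peak) for caption, value in logits.items()}
    total = sum(exps.values())
    return {caption: value / total for caption, value in exps.items()}


# Stand-in embeddings; a real system would get these from the trained encoders.
captions = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
probabilities = zero_shot_classify([0.9, 0.1, 0.0], captions)
prediction = max(probabilities, key=probabilities.get)
```

Adding a new class only requires writing a new caption and embedding it, which is why no retraining is needed for this kind of classification.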

Challenges to Developing Multi-Modal AI Systems

There are several challenges when it comes to integrating different data types into a single AI model:

  • Representation: This is the challenge of encoding different data types in a way that makes it possible to process them uniformly. Joint representations combine data into a common “space”, while coordinated representations keep them separated but structurally linked. It’s difficult to integrate different modalities due to variances in noise, missing data, structure, and formats.
  • Translation: Some applications may require wholly converting data from one type to another. The exact process can differ based on the modality of both data types and the application. Often, the translated data still requires additional evaluation, either by a human or using metrics like BLEU and ROUGE.
  • Alignment: In many use cases, modalities also need to be synchronized. For example, audio and visual inputs may need to be aligned according to specific timestamps or visual/auditory cues. More disparate data types may not naturally align due to inherent structural differences.
  • Fusion: Once you’ve solved representation, you still need to merge the modalities to perform complex tasks, like making decisions or predictions. This is often challenging due to their different rates of generalization and varying noise levels.
  • Co-learning: As touched on earlier, poor co-learning can negatively impact the training of both modalities. However, when done right, it can improve the ability to transfer knowledge between them for mutual benefit. It’s largely challenging for the same reasons as representation and fusion.
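Alignment and fusion are easier to picture with a small example. The sketch below assumes two already-encoded streams, each a time-sorted list of (timestamp, features) pairs, pairs every video frame with the nearest audio frame, and fuses each pair by simple concatenation. The function names, the 50 ms tolerance, and the toy features are illustrative assumptions; real systems use far richer alignment and fusion strategies.

```python
from bisect import bisect_left


def nearest_index(sorted_times, t):
    """Index of the timestamp in sorted_times that lies closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[i - 1], sorted_times[i]
    return i if (after - t) < (t - before) else i - 1


def align_and_fuse(video, audio, max_gap=0.05):
    """Pair each video frame with its nearest audio frame, then concatenate features.

    Frames with no audio partner within max_gap seconds are dropped.
    """
    if not audio:
        return []
    audio_times = [t for t, _ in audio]
    fused = []
    for t, video_features in video:
        j = nearest_index(audio_times, t)
        if abs(audio_times[j] - t) <= max_gap:
            fused.append((t, video_features + audio[j][1]))
    return fused


video = [(0.00, [1.0]), (0.04, [2.0]), (0.50, [3.0])]
audio = [(0.01, [10.0]), (0.05, [20.0])]
fused = align_and_fuse(video, audio)
```

In this toy run the frame at 0.50 seconds is discarded because no audio sample falls within the tolerance, a small illustration of how missing data in one modality complicates fusion.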

Finding solutions to these challenges is a continuous area of development. Some of the model-agnostic approaches, like those developed by Meta, offer the most promising path forward.

Furthermore, deep learning models showcase the ability to automatically learn representations from large multi-modal data sets. This has the potential to further improve accuracy and efficiency, especially where the data is highly diverse. The addition of neural networks also helps solve challenges related to the complexity and dimensionality of multi-modal data.

Impact of Modality on AI and Computer Vision

Advancements in multi-modal AI point to a future where AI and computer vision seamlessly integrate into our daily lives. As these systems mature, they will become increasingly important components of advanced AR and VR, robotics, and IoT.

Robotics applied in manufacturing applications to automate physical tasks

In robotics, AR shows promise in offering methods to simplify programming and improve control. In particular, Augmented Reality Visualization Systems improve complex decision-making by combining real-world physicality with AR’s immersive capabilities. Combining vision, eye tracking, haptics, and sound makes interaction more immersive.

For example, ABB Robotics uses this approach in its AR systems to overlay modeled solutions onto real-life environments. Among other things, it allows users to create advanced simulations in its RobotStudio software before deploying solutions. PTC Reality Lab’s Kinetic AR project is researching the use of multi-modal models for robotic motion planning and programming.

In IoT, Multimodal Interaction Systems (MIS) merge real-world contexts with immersive AR content. This opens up new avenues for user interaction. Advancements in networking and computational power allow for real-time, natural, and user-friendly interfaces.
