Multimodal learning is the challenging task of leveraging data from multiple modalities to improve the capability of a single model.

Meta-Transformer is a powerful approach showing how a single backbone architecture can handle 12 data modalities with the same set of parameters, achieving favourable performance on diverse tasks such as medical applications, stock analysis, weather forecasting, remote sensing, social-network analysis, autonomous driving, and speech recognition.

Fig. 1  Unified Multimodal Learning [1]

The input is raw data from several modalities, which is converted into a unified token space. An encoder with frozen parameters then extracts high-level semantic features from the token sequence, and these features can be used to solve different tasks such as classification, detection, and segmentation.
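The pipeline can be sketched in a few lines: per-modality tokenizers map raw inputs into a shared token space, a single frozen encoder processes all tokens regardless of modality, and a lightweight task head consumes the resulting features. This is a minimal, illustrative toy in pure Python; all function names (`tokenize_text`, `tokenize_image`, `encode`, `classify`) are hypothetical and not the authors' actual API.

```python
# Toy sketch of the Meta-Transformer idea: modality-specific tokenizers,
# one frozen shared encoder, and a small task head. Purely illustrative.
import random

DIM = 8
random.seed(0)

# Frozen encoder parameters: one fixed weight matrix shared by ALL modalities.
ENCODER_W = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]

def encode(tokens):
    """Frozen encoder: the same parameters are applied to every modality."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in ENCODER_W]
            for tok in tokens]

def tokenize_text(text):
    """Toy text tokenizer: map each word to a deterministic DIM-dim embedding."""
    return [[(ord(w[i % len(w)]) % 7 - 3) / 3.0 for i in range(DIM)]
            for w in text.split()]

def tokenize_image(pixels):
    """Toy image tokenizer: split a flat pixel list into DIM-sized patches."""
    return [pixels[i:i + DIM] for i in range(0, len(pixels), DIM)]

def classify(features, n_classes=3):
    """Lightweight task head: mean-pool features, argmax over first n_classes dims."""
    pooled = [sum(col) / len(features) for col in zip(*features)]
    return max(range(n_classes), key=lambda c: pooled[c])

# The same frozen encoder handles tokens from both modalities.
text_feats = encode(tokenize_text("meta transformer handles many modalities"))
image_feats = encode(tokenize_image([0.1 * i for i in range(16)]))
print(classify(text_feats), classify(image_feats))
```

The key design point mirrored here is that only the tokenizers and the task head differ per modality or task; the encoder's parameters stay fixed across all of them.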

Fig. 2  Meta-Transformer can be applied to various fields of application



 [1] Zhang, Yiyuan, et al. “Meta-Transformer: A Unified Framework for Multimodal Learning.” arXiv preprint arXiv:2307.10802 (2023).