Multimodal learning is the challenging task of using data from multiple modalities to improve the capability of a single model.
A powerful approach called Meta-Transformer is proposed, showing how a single backbone architecture can handle 12 data modalities with the same set of parameters, achieving favourable performance across diverse tasks such as medical analysis, stock analysis, weather forecasting, remote sensing, social network analysis, autonomous driving, and speech recognition.
Fig. 1 Unified Multimodal Learning [1]
The input is raw data from several modalities, which is first converted into a unified token space. An encoder with frozen parameters then extracts high-level semantic features from the token sequence, which are used to solve different downstream tasks such as classification, detection, and segmentation.
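To make this pipeline concrete, here is a minimal sketch of the idea in PyTorch: modality-specific tokenizers project raw inputs into a shared token space, a frozen shared Transformer encoder extracts features, and a lightweight head solves the task. All names, dimensions, and tokenizer designs below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

EMBED_DIM = 768  # shared token dimension (assumed)


class ImageTokenizer(nn.Module):
    """Patch-embed an image into a token sequence (ViT-style, assumed)."""
    def __init__(self, patch=16, in_ch=3, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)


class TimeSeriesTokenizer(nn.Module):
    """Linearly project each time step into the shared token space (assumed)."""
    def __init__(self, in_feats, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(in_feats, dim)

    def forward(self, x):                                # x: (B, T, in_feats)
        return self.proj(x)                              # (B, T, dim)


class MetaTransformerSketch(nn.Module):
    """Modality-specific tokenizers + one frozen shared encoder + a task head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.tokenizers = nn.ModuleDict({
            "image": ImageTokenizer(),
            "timeseries": TimeSeriesTokenizer(in_feats=8),
        })
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=12,
                                           batch_first=True)
        # Small depth for the sketch; the real model uses a ViT-scale encoder.
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Freeze the shared encoder: only tokenizers and heads would be trained.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Linear(EMBED_DIM, num_classes)  # e.g. classification

    def forward(self, x, modality):
        tokens = self.tokenizers[modality](x)          # unified token space
        feats = self.encoder(tokens)                   # frozen feature extractor
        return self.head(feats.mean(dim=1))            # pooled logits


# Usage: the same encoder weights serve both modalities.
model = MetaTransformerSketch()
img_logits = model(torch.randn(2, 3, 224, 224), "image")
ts_logits = model(torch.randn(2, 50, 8), "timeseries")
print(img_logits.shape, ts_logits.shape)  # torch.Size([2, 10]) for both
```

The key design choice mirrored here is that the encoder is shared and frozen, so supporting a new modality only requires a new tokenizer (and task head) rather than retraining the backbone.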
Fig. 2 Meta-Transformer can be applied to various fields of application
Bibliography
[1] Zhang, Yiyuan, et al. “Meta-Transformer: A Unified Framework for Multimodal Learning.” arXiv preprint arXiv:2307.10802 (2023).