Adapting Multimodal LLMs to Data-Scarce Applications
Start year: 2024
Summary: Trained on abundant text, image, and paired data, large-scale vision-language models such as CLIP, Stable Diffusion, and GPT-4 have become successful foundation models for real-world applications. However, existing multimodal learning mainly focuses on a restricted set of modalities, such as vision, language, and audio, for which high-quality training data are easily accessible. In many real-world applications, especially in interdisciplinary and scientific research fields, the available data are too limited to support training a large-scale multimodal model from scratch. For example, it is hard to scale up paired data for analysing the relationship between climate change (and extreme weather) and human activities, and collecting large experimental datasets for studying material characteristics is equally challenging. To extend multimodal learning to these understudied domains, we investigate whether the knowledge in pre-trained large-scale multimodal foundation models such as CLIP and GPT-4 (and the experience gained in developing them) can be adapted to more diverse and under-represented modalities in the wild, such as tabular, sensor, and cross-media data. This project studies adaptive multimodal learning, i.e., adapting and transferring the knowledge learned by a large-scale foundation model to new tasks with limited and non-ideal data.
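As a concrete illustration of this idea (a minimal sketch, not a committed design of the project), the example below shows how a small trainable adapter could map a scarce new modality, e.g., sensor data, into the embedding space of a frozen pre-trained encoder using a contrastive objective. The placeholder frozen encoder, module names, and dimensions are all assumptions made for illustration; in a real setting the placeholder would be replaced by an actual foundation-model tower such as CLIP's image or text encoder.

```python
# Hypothetical sketch (PyTorch): align a new, data-scarce modality
# (e.g., sensor readings) to the embedding space of a frozen pre-trained
# encoder via a small trainable adapter and an InfoNCE-style contrastive loss.
# The frozen encoder below is a stand-in for a foundation model such as CLIP;
# all names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512      # assumed embedding size of the frozen foundation model
SENSOR_DIM = 64      # assumed feature size of the sensor modality

class FrozenFoundationEncoder(nn.Module):
    """Placeholder for a pre-trained encoder; its weights stay frozen."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(768, EMBED_DIM)  # stand-in for the real backbone
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

class SensorAdapter(nn.Module):
    """Small trainable adapter mapping the scarce modality into the shared space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SENSOR_DIM, 256), nn.GELU(), nn.Linear(256, EMBED_DIM)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z_new, z_frozen, temperature=0.07):
    """Symmetric contrastive loss pulling paired embeddings together."""
    logits = z_new @ z_frozen.t() / temperature
    targets = torch.arange(z_new.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    frozen = FrozenFoundationEncoder().eval()
    adapter = SensorAdapter()
    opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

    # Toy paired batch: sensor features with corresponding inputs for the frozen tower.
    sensor_batch = torch.randn(8, SENSOR_DIM)
    paired_batch = torch.randn(8, 768)

    for step in range(5):  # only the lightweight adapter is updated
        loss = info_nce(adapter(sensor_batch), frozen(paired_batch))
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(f"step {step}: loss = {loss.item():.4f}")
```

Because only the lightweight adapter is trained while the foundation model stays frozen, this kind of setup keeps the paired-data and compute requirements small, which is the regime the project targets.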