Large vision-language model
Abstract
Large multimodal models are typically transformer-based foundation models that can process and generate multiple types of data (modalities), including text, images, audio, and video [1,2]. Large vision-language models (LVLMs) are a subset of large multimodal models that specifically focus on aligning and integrating visual and linguistic modalities. Traditional vision-language systems are trained to perform well-defined narrow tasks and have limited adaptability. By contrast, LVLMs generalize across diverse tasks and support flexible downstream applications without requiring task-specific retraining.
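To make the alignment idea concrete, below is a minimal sketch of one common LVLM design, in which features from a vision encoder are linearly projected into the language model's token embedding space and processed jointly with text tokens (a projection-based approach popularized by models such as LLaVA). All module names, sizes, and the toy architecture here are illustrative assumptions, not the implementation of any specific model.

```python
import torch
import torch.nn as nn

class TinyLVLM(nn.Module):
    """Toy illustration of projecting image features into a language
    model's token space; every size here is an arbitrary assumption."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        # Stand-in vision encoder: one conv splits a 32x32 RGB image
        # into 16 patch embeddings (real LVLMs typically use a ViT).
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        # Projector aligning visual features with the text embedding space.
        self.projector = nn.Linear(d_model, d_model)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Stand-in language backbone: a small transformer.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_ids):
        # (B, 3, 32, 32) -> (B, d_model, 4, 4) -> (B, 16, d_model)
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)
        vis = self.projector(vis)                # align with token space
        txt = self.token_embed(text_ids)         # (B, T, d_model)
        seq = torch.cat([vis, txt], dim=1)       # one multimodal sequence
        return self.lm_head(self.backbone(seq))  # per-position token logits

model = TinyLVLM()
image = torch.randn(2, 3, 32, 32)
text_ids = torch.randint(0, 1000, (2, 8))
logits = model(image, text_ids)
print(logits.shape)  # torch.Size([2, 24, 1000]): 16 image + 8 text positions
```

Because the image patches and text tokens share one embedding space and one sequence, the same backbone can serve many downstream vision-language tasks through the text prompt alone, which is what gives LVLMs their flexibility relative to narrow, task-specific systems.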