
2024 Best Practices for SOTA Multimodal Large Model Architecture Design

Deep Learning and NLP 2024/07/04 08:36

Author: Dreamweaver, SJTU × AIGC/LLM, Tencent · Multimodal Application Research (Internship)
Disclaimer: This article is shared for learning purposes only; copyright belongs to the original author. If there is any infringement, please send a private message and it will be removed.

Most of the recent popular MLLM architectures adopt the LLaVA-like ViT+MLP+LLM paradigm. Thanks to its streamlined design, data and training efficiency, and strong baseline performance, the LLaVA architecture has built a healthy application ecosystem. High-quality MLLMs have also emerged in China: InternVL has narrowed the gap between open-source models and GPT-4V and can process 4K high-resolution inputs, while MiniCPM-V achieves efficient on-device deployment, letting small models compete with top closed-source models. The latest Cambrian-1 encourages researchers to think outside the box and explore more possibilities for visual representation. There are many paths to AGI, and native multimodal large models may well be a necessary one.

This article focuses on the LLaVA-NeXT, InternVL, and MiniCPM-V series, plus the vision-centric Cambrian-1, with brief introductions to VILA 1.5 and CogVLM2.

LLaVA-NeXT series


In October 2023, LLaVA-1.5 was released. By adding a simple MLP layer between the visual and language modalities, it achieved high training-sample efficiency, making multimodal large models feasible in low-data business scenarios.
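To make the ViT+MLP+LLM connection concrete, here is a minimal numpy sketch of the idea (dimensions are illustrative, not the real model's weights): ViT patch features are projected by a two-layer MLP into the LLM embedding space and concatenated with the text token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 576 patches from a CLIP ViT-L/14 at 336px,
# projected into a 4096-dim LLM embedding space.
num_patches, vit_dim, llm_dim = 576, 1024, 4096

vision_feats = rng.standard_normal((num_patches, vit_dim))

# Two-layer MLP connector (Linear -> GELU -> Linear), as in LLaVA-1.5.
W1 = rng.standard_normal((vit_dim, llm_dim)) * 0.02
W2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

visual_tokens = gelu(vision_feats @ W1) @ W2        # (576, 4096)

text_tokens = rng.standard_normal((32, llm_dim))    # embedded prompt (toy)
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (608, 4096)
```

The whole connector is just this projection; everything else is a frozen or fine-tuned off-the-shelf ViT and LLM, which is what makes the recipe so data-efficient.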

[2310.03744] Improved Baselines with Visual Instruction Tuning[1]



In January 2024, LLaVA-NeXT (1.6) was released. It kept the streamlined design and data efficiency of 1.5 while adding higher-resolution input, stronger visual reasoning and OCR, and visual dialogue in a wider range of scenarios. The model is trained in two stages: stage 1 pre-training trains only the connection layer; stage 2 instruction fine-tuning trains the entire model.
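The two-stage recipe can be sketched as a simple freeze schedule (an illustrative sketch, not the official training code; module and parameter names are hypothetical):

```python
# Toy model as a dict of module name -> parameter names.
model = {
    "vision_encoder": ["vit.block0.w", "vit.block1.w"],
    "connector":      ["mlp.w1", "mlp.w2"],
    "llm":            ["llm.layer0.w", "llm.layer1.w"],
}

def trainable_params(model, stage):
    """Return which parameters receive gradients in each stage."""
    if stage == 1:
        # Stage 1 pre-training: align modalities, update only the connector.
        return list(model["connector"])
    if stage == 2:
        # Stage 2 instruction fine-tuning: update the entire model.
        return model["connector"] + model["llm"] + model["vision_encoder"]
    raise ValueError(f"unknown stage {stage}")

print(trainable_params(model, 1))  # ['mlp.w1', 'mlp.w2']
```

The point of stage 1 is that only a tiny fraction of parameters moves, so a modest amount of alignment data suffices before the full model is unfrozen.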

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge[2]

In May 2024, the team released a version of LLaVA-NeXT based on stronger LLMs, supporting LLaMA3 (8B) and Qwen-1.5 (72B/110B). Larger LLMs provide better visual world knowledge and logical reasoning, and the largest model approaches GPT-4V performance while preserving training efficiency.

LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild[3]


In April 2024, LLaVA-NeXT-Video was released, demonstrating strong zero-shot video understanding. The dynamic high-resolution image partitioning in LLaVA-NeXT transfers naturally to the video modality, representing a video as multiple frames, so a LLaVA-NeXT trained only on image-text data generalizes to video tasks. In addition, length generalization at inference time is used to handle long video inputs that exceed the LLM's maximum context length. Building on the LLaVA-NeXT-Image model, the authors released LLaVA-NeXT-Video, which is supervised fine-tuned on video data, and LLaVA-NeXT-Video-DPO, which applies DPO preference alignment with AI-feedback supervision. Deployment and inference use SGLang to support scalable, large-scale video inference. Conceivably, this will enable efficient text annotation of massive video corpora and, in turn, more powerful video generation models.

LLaVA-NeXT: A Strong Zero-shot Video Understanding Model[4]

Instruction fine-tuning ablation study

The team also shared an ablation study of factors other than data in visual instruction fine-tuning, analyzed from the perspectives of model architecture, visual representation, and training strategy.

LLaVA-NeXT: What Else Influences Visual Instruction Tuning Beyond Data? [5]


In June 2024, LLaVA-NeXT-Interleave was released. It proposes the interleaved format as a general template to unify different visual modalities: single image (multi-patch), multi-image, video (multi-frame), and 3D (multi-view). While preserving LLaVA-NeXT's single-image performance, it improves performance on the other modalities and shows preliminary transfer across them. This unified model supports a wider range of real-world applications, such as summarizing and answering questions over multi-page slide decks, generating prompts for image editing, and summarizing and comparing multiple documents.
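The unifying idea is that every modality becomes the same data structure: a sequence of text segments with image slots. A toy sketch (the tokenizer and `<image>` placeholder are illustrative assumptions, not the actual LLaVA-NeXT-Interleave template):

```python
def build_interleaved(segments):
    """segments: list of ('text', str) or ('image', payload) tuples.

    Returns a flat token list where each image contributes one
    '<image>' placeholder, later replaced by its visual tokens.
    """
    tokens = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(payload.split())   # toy whitespace tokenizer
        elif kind == "image":
            tokens.append("<image>")
        else:
            raise ValueError(f"unknown segment kind {kind!r}")
    return tokens

# A two-frame "video" plus a question uses exactly the same template
# as a two-image comparison or a two-view 3D scene:
video_sample = [
    ("image", "frame_0"),
    ("image", "frame_1"),
    ("text", "What changes between the frames?"),
]
print(build_interleaved(video_sample))
```

Because multi-image, multi-frame, and multi-view inputs all flatten into this one format, a single model can be trained on all of them without per-modality architecture changes.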

LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models[6]

The authors also conducted an ablation study on the training strategy.

InternVL series


In December 2023, Shanghai AI Lab @OpenGVLab released InternVL. The work observes a large gap in parameter scale and representation capacity between the visual encoder and the LLM during modality alignment, so it naturally proposes scaling the visual side up to 6B parameters (InternViT-6B) and then progressively aligning it with the LLM using image-text data of varying quality. The connection layer is also scaled up: similar to Q-Former, an 8B language middleware called QLLaMA is designed, initialized from Chinese-LLaMA parameters to strengthen cross-lingual understanding, with 96 learnable query tokens and cross-attention layers (1B parameters) added to further align the visual and language modalities.

[2312.14238] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks[7]

The following figure shows InternVL's three-stage progressive training strategy, in which the quality of the training data rises stage by stage. First, large-scale noisy image-text pairs are used for contrastive pre-training (similar to CLIP). Next, the QLLaMA connector is added with its parameters frozen and only the cross-attention layers are trained, using image-text matching, contrastive, and generation losses (similar to BLIP). Finally, an LLM is introduced for supervised fine-tuning, endowing the model with multimodal dialogue and Q&A abilities.
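The stage-1 objective is the standard CLIP-style contrastive loss; a minimal numpy sketch (batch size, dimension, and temperature are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8
img = rng.standard_normal((B, d))
txt = img + 0.1 * rng.standard_normal((B, d))  # paired text, slightly noisy

# L2-normalize so dot products are cosine similarities.
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logits = img @ txt.T / 0.07  # (B, B) similarity matrix, temperature 0.07

def info_nce(logits):
    # Cross-entropy with the matched pair (the diagonal) as the target.
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))

# Symmetric loss: image-to-text and text-to-image directions.
loss = 0.5 * (info_nce(logits) + info_nce(logits.T))
print(round(float(loss), 4))
```

Minimizing this pulls each image toward its own caption on the diagonal and pushes it away from the other captions in the batch, which is what aligns InternViT's features with language before any LLM is involved.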


The multi-stage nature of InternVL training gives it inherent versatility, and it can support a variety of visual-linguistic tasks by flexibly combining different modules, as shown in the figure below.


One point worth discussing is that InternVL scales up both the visual side and the connection layer in order to balance the parameter counts of the visual and language sides. A natural question follows: does the visual side really need such a heavy parameter budget? The latest LLaVA-NeXT still uses a roughly 300M-parameter ViT and a lightweight MLP connector, improving multimodal performance only by scaling the LLM. My own tentative view is that visual understanding consists of perception and reasoning: the perception part may not need so many parameters, while the reasoning part acts on high-level visual features and is endowed by fine-tuning the LLM. For a balance of performance, efficiency, and stability, the necessity of scaling up the visual side does not seem strong, though it certainly merits deeper experimental verification and discussion. The figure in this paper reminds me of Google's CoCa paper from 2022, in which the authors split the text decoder in half by layer: the shallow half handles the text unimodality, and the deep half handles image-text multimodality.

[2205.01917] CoCa: Contrastive Captioners are Image-Text Foundation Models[8]



In April 2024, InternVL-1.5 was released, with stronger overall performance and support for inference at up to 4K resolution.

[2404.16821] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites[9]


The above figure shows the overall architecture of the model: the LLaVA-like ViT+MLP+LLM paradigm, combining the enhanced InternViT-6B-448px-V1.5 with the bilingual InternLM2-Chat-20B for roughly 26B parameters in total. Compared with InternVL-1.0, the input supports dynamic high resolution, the connection layer is replaced by a lightweight MLP, and a pixel-shuffle operation reduces the number of output visual tokens to 1/4. Training is split into two phases: the pre-training phase trains InternViT and the MLP projection, after which the whole model is fine-tuned.
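The pixel-shuffle token reduction is easy to see in code: fold each 2x2 neighborhood of the ViT feature map into the channel dimension, trading spatial positions for width. A numpy sketch (the 32x32 grid corresponds to a 448px tile with 14px patches; the exact projection that follows is omitted):

```python
import numpy as np

def pixel_shuffle_down(x, r=2):
    """Downscale a (H, W, C) feature map to (H//r, W//r, C*r*r)
    by folding each r x r spatial block into the channel axis."""
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)        # group each r x r block together
    return x.reshape(H // r, W // r, C * r * r)

feat = np.zeros((32, 32, 1024))           # 32*32 = 1024 patch tokens per tile
out = pixel_shuffle_down(feat)

print(feat.shape[0] * feat.shape[1], "->", out.shape[0] * out.shape[1])
# 1024 -> 256
```

No information is discarded: the 1024 tokens become 256 tokens that are each 4x wider, and the following MLP absorbs the wider channel dimension, so the LLM sees a quarter of the sequence length per tile.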

In addition, the translation prompt used is worth learning from:

You are a translator proficient in English and {language}. Your task is to translate the following English text into {language}, focusing on a natural and fluent result that avoids "translationese." Please consider these points:
1. Keep proper nouns, brands, and geographical names in English.
2. Retain technical terms or jargon in English, but feel free to explain in {language} if necessary.
3. Use {language} idiomatic expressions for English idioms or proverbs to ensure cultural relevance.
4. Ensure quotes or direct speech sound natural in {language}, maintaining the original's tone.
5. For acronyms, provide the full form in {language} with the English acronym in parentheses.
Text for translation: {text}
{translation results}

In the ablation study section, the authors investigate whether larger LLMs require larger visual encoders, which speaks directly to the question raised above about InternVL-1.0's visual parameter count. The experiments compare LLaVA-NeXT and InternVL-1.2, both using a 34B LLM, and under conditions as fair as possible they suggest that a larger visual model improves the model's overall performance on multimodal tasks (though the original paper does not seem to give specific numbers). The team subsequently released a distilled version of the visual model, InternViT-300M-448px[12], matching the scale of LLaVA-NeXT's visual side.

MiniCPM-V series

MiniCPM-V[13] is a series of multimodal LLMs released by ModelBest (面壁智能) that supports efficient on-device deployment.

MiniCPM-V 2.0

In April 2024, MiniCPM-V 2.0 was released. With only 2.8B parameters, its overall performance surpasses larger open-source models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B, with outstanding OCR ability and bilingual Chinese-English dialogue, and some metrics approach Gemini Pro. The visual encoder is SigLIP SO400M/14-384px, the LLM is MiniCPM-2.4B, and the connection layer is the Perceiver Resampler from Flamingo (like Q-Former, it uses learnable queries to extract salient visual information, but it is not conditioned on the input text). The self-developed RLHF-V provides trustworthy behavior alignment, approaching GPT-4V in mitigating multimodal hallucinations. The self-developed LLaVA-UHD enables inputs up to 1344x1344 resolution with arbitrary aspect ratios. The self-developed VisCPM enables cross-lingual generalization of multimodal ability, giving good bilingual Chinese-English performance. In addition, the model has low memory overhead and fast speed on device, even when processing high-resolution images, and the team provides an mlc-miniCPM example of Android deployment.
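The Perceiver-Resampler idea can be sketched in a few lines of numpy: a small, fixed set of learnable queries cross-attends to all patch features, compressing them to a constant number of visual tokens regardless of input resolution. (Dimensions are illustrative; the real module also stacks layers with layer norms and MLPs.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_queries, num_patches = 64, 8, 400

queries = rng.standard_normal((num_queries, d))   # learnable latent queries
patches = rng.standard_normal((num_patches, d))   # encoder patch features

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Single cross-attention step: queries attend over patches (keys/values).
attn = softmax(queries @ patches.T / np.sqrt(d))  # (8, 400)
visual_tokens = attn @ patches                    # (8, 64)

print(visual_tokens.shape)  # always (num_queries, d), whatever num_patches is
```

The output length depends only on `num_queries`, which is why this connector keeps the LLM's visual sequence short and constant even for high-resolution inputs.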

MiniCPM-Llama3-V 2.5

In May 2024, MiniCPM-Llama3-V 2.5 was released with 8B total parameters. Its overall performance surpasses closed-source models such as GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3; OCR and instruction-following are further strengthened (e.g., full-text OCR extraction and table-to-Markdown conversion), and conversations in more than 30 languages are supported, while the model remains efficiently deployable on device. On the basis of MiniCPM-V 2.0, the LLM is replaced with Llama3-8B-Instruct, and the updated RLAIF-V further reduces the hallucination rate. The team currently supports practical features such as efficient CPU inference via llama.cpp and ollama, GGUF 16-bit quantization, and LoRA fine-tuning.



In May 2024, NVIDIA released VILA 1.5[14], adding video understanding and open-sourcing 3B/8B/13B/40B models that rank near the top of the open-source MMMU and Video-MME leaderboards. VILA is covered in detail in my previous article; briefly: VILA is pre-trained on large-scale interleaved image-text data, which gives it multi-image understanding. Through experiments, the authors found that: (1) interleaved arrangement of images and text is critical; (2) fine-tuning the LLM during interleaved image-text pre-training endows in-context learning ability; (3) mixing in text-only instruction data helps performance; (4) compressing visual tokens allows more video frames.


In May 2024, Zhipu released CogVLM2 for its GLM large-model family[15], followed by GLM-4V. CogVLM2 is based on Llama3-8B-Instruct and supports 8K context, 1344x1344 resolution, and bilingual Chinese-English dialogue. GLM-4V-9B replaces the base with the GLM-4-9B language model, adopts the same data and training strategy, removes CogVLM's original visual experts, and reduces the model size to 13B. CogVLM and CogAgent are covered in detail in my previous article.


In June 2024, the team of Yann LeCun and Saining Xie released Cambrian-1, a vision-centric study of multimodal LLMs, open-sourcing 8B/13B/34B models. Current multimodal LLMs still have significant visual shortcomings, and visual representations need strengthening to interact better with the language modality and give the model stronger perception and grounding in real scenes. One implication of this study is that work on multimodal LLMs has begun to focus on improving the quality of visual representations, rather than always scaling up LLMs.

[2406.16860] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs[16]


As shown in the figure above, this work focuses on the five core design elements of multimodal LLMs, namely: visual representation, connector design, instruction fine-tuning data, instruction fine-tuning strategy, and evaluation benchmark.

  1. Visual representation

The authors evaluated a variety of visual encoders and their combinations. The figure below shows that language-supervised CLIP models have an advantage, but self-supervised approaches can perform similarly given sufficient data and appropriate fine-tuning. Moreover, combining multiple types of vision encoders helps improve multimodal LLM performance, especially on vision-centric tasks. High-resolution encoders greatly improve chart-related and vision-centric tasks, and ConvNet-based architectures are well suited to them.


  2. Connector design

The authors propose the Spatial Vision Aggregator (SVA), a dynamic, spatially aware connector that fuses visual features from multiple visual encoders with the LLM. As shown in the figure below, the method sets up learnable latent query tokens that interact with the visual features through cross-attention (the visual features serving as keys/values). SVA's design has two key elements: (1) it introduces a spatial inductive bias by explicitly defining the sub-region of the visual feature map corresponding to each query token, helping the model retain spatial structure and localize and integrate local features more accurately; (2) it aggregates visual features at multiple layers of the LLM, letting the model repeatedly access visual information at different depths and strengthening deep reasoning over visual content. This effectively reduces the number of visual tokens needed; for example, Cambrian-1 uses only about 20% as many visual tokens as Mini-Gemini and LLaVA-NeXT.
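A rough numpy sketch of element (1), the spatial binding: each latent query owns one sub-region of the feature grid and aggregates only the features inside it. (This is a simplification of SVA: the real module also fuses several encoders and injects the result at multiple LLM layers; grid sizes here are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)
G, d, g = 8, 32, 4           # 8x8 feature grid, 4x4 grid of query tokens

feats = rng.standard_normal((G, G, d))      # visual feature map
queries = rng.standard_normal((g, g, d))    # learnable latent queries

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

out = np.zeros((g, g, d))
s = G // g                    # each query owns an s x s sub-region
for i in range(g):
    for j in range(g):
        # Cross-attention restricted to this query's own sub-region,
        # giving the connector an explicit spatial inductive bias.
        region = feats[i*s:(i+1)*s, j*s:(j+1)*s].reshape(-1, d)
        w = softmax(queries[i, j] @ region.T / np.sqrt(d))
        out[i, j] = w @ region

print(out.shape)  # 64 grid features compressed to 4x4 = 16 spatial tokens
```

Because query (i, j) can only see region (i, j), the output tokens keep a fixed correspondence to image locations, which is what preserves spatial structure while still compressing the token count.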


  3. Instruction fine-tuning data

The authors released Cambrian-10M, an instruction fine-tuning dataset mixing OCR, generic VQA, and pure-language instruction data, along with a curated higher-quality 7M version. Different types of visual instruction data impart different abilities to the model, so the data mixture is also crucial; the experiments show that balancing the proportions of OCR, general, and language data matters. The authors also found that a trained multimodal LLM may score well on benchmarks yet converse poorly, giving short replies. They therefore added a system prompt during training that encourages longer responses and chain-of-thought reasoning, improving performance on tasks such as mathematical reasoning.

  4. Instruction fine-tuning strategy

The authors follow LLaVA's two-stage training strategy: first use adapter data to fine-tune only the intermediate MLP connection layer, then unfreeze the LLM and the connector for fine-tuning. The results show that pre-training the connector in the first stage improves performance, and more adapter data strengthens it further. The authors also compare the effect of fine-tuning the visual encoder: it improves performance, and for self-supervised pre-trained visual encoders (such as DINOv2, MoCo v3, and MAE) the improvement on vision-centric tests is significant.

  5. Vision-centric benchmark CV-Bench

Most existing benchmarks cannot properly evaluate a model's visual perception and grounding ability, and their relevant sample sizes are limited. CV-Bench repurposes samples from existing vision benchmarks, containing 2,638 vision-centric VQA questions covering spatial relationships and object counting in 2D, and depth order and relative distance in 3D.

Finally, let us look forward to continued new breakthroughs from China's AGI foundation models, leading the global trend!


Source | Qingke AI
