Visual preference alignment involves training Large Vision-Language Models (LVLMs) to predict human preferences between visual inputs. This is typically achieved with labeled datasets of chosen/rejected pairs and optimization algorithms such as Direct Preference Optimization (DPO). Existing visual alignment methods, primarily designed for single-image scenarios, struggle to handle the complexity of multi-image tasks due to the scarcity of diverse training data and the high cost of annotating chosen/rejected pairs. We present Multi-Image Augmented Direct Preference Optimization (MIA-DPO), a visual preference alignment approach that effectively handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse multi-image training data by extending single-image data with unrelated images arranged in grid collages or pic-in-pic formats, significantly reducing the cost of multi-image data annotation. We observe that the attention values of LVLMs vary considerably across different images, and we use these values to identify and filter out rejected responses on which the model may have mistakenly focused. Our attention-aware selection constructs chosen/rejected pairs without relying on (i) human annotation, (ii) extra data, or (iii) external models or APIs. MIA-DPO is compatible with various architectures and outperforms existing methods on five multi-image benchmarks, achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's ability to understand single images.
Some previous studies have explored different types of single-image hallucinations, such as object hallucination, where the model describes objects that are not present in the image. Compared to single-image hallucinations, multi-image scenarios introduce more complex types of hallucination. As shown in Fig. 2, we categorize multi-image hallucinations into two types:
(1) Sequence Confusion. When presented with multiple images, the model may fail to identify which image the input prompt refers to. For instance, in the top case shown in Fig. 2, the question is directed at Image 3 (birds and sky), but the model responds based on Image 4 (a train on tracks).
(2) Element Interference. The presence of multiple images significantly increases the number of visual elements compared to a single image, leading LVLMs to confuse different elements. For example, in the bottom case of Fig. 2, the question “What color is the car in Image 2?” should be answered with “white”. However, the LVLM incorrectly interprets the color attribute of the motorcycle in Image 3 as the color of the car in Image 2, resulting in an incorrect response.
Attention as an Indicator for Detecting Hallucinations The attention mechanism reveals where the model is “looking” when making a decision. We observe that attention provides crucial clues for detecting multi-image hallucinations (Fig. 2). Ideally, attention values should concentrate on the regions of the referred image that are relevant to the question. If the attention values are scattered or not strongly focused on the correct visual element or region, it suggests the model has difficulty understanding multi-image sequences or distinguishing elements across images. Based on this observation, we design an attention-aware selection that uses attention values to select rejected samples containing hallucinations for the DPO algorithm.
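As a concrete illustration, the selection rule can be sketched as follows. The function names, the reduction of attention to a single per-token vector, and the 0.5 threshold are our own illustrative assumptions, not the paper's exact criterion:

```python
def attention_ratio(attn, image_spans, target_idx):
    """Fraction of image-directed attention falling on the referred image.

    attn: per-input-token attention weights from the generated answer tokens,
          assumed here to be already averaged over layers and heads.
    image_spans: list of (start, end) token ranges, one per input image.
    target_idx: index of the image the question actually refers to.
    """
    per_image = [sum(attn[s:e]) for s, e in image_spans]
    total = sum(per_image)
    return per_image[target_idx] / total if total > 0 else 0.0


def is_rejected_candidate(attn, image_spans, target_idx, threshold=0.5):
    # Scattered or misplaced attention on the referred image suggests
    # sequence confusion or element interference, so the response is a
    # candidate "rejected" sample. The threshold is illustrative.
    return attention_ratio(attn, image_spans, target_idx) < threshold
```

Under this sketch, a response whose attention mass on the referred image falls below the threshold would be kept as the rejected side of a DPO pair, while a well-focused response would serve as the chosen side.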
Post-Selection for Data Cleaning Although our attention-aware selection is effective for constructing DPO data, a small number of noisy samples may still be included, potentially causing detrimental effects. To filter out these noisy samples, we incorporate a post-selection step using three metrics: (1) Perplexity (PPL), (2) Length Ratio, and (3) Edit Distance.
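A minimal sketch of such a post-selection filter is shown below. All thresholds are illustrative placeholders rather than the paper's actual values, and difflib's similarity ratio stands in for a true edit distance:

```python
from difflib import SequenceMatcher


def keep_pair(chosen, rejected, rejected_ppl,
              max_ppl=50.0, length_ratio_range=(0.5, 2.0), min_edit=0.1):
    """Decide whether a chosen/rejected pair survives post-selection.

    Thresholds here are illustrative, not the paper's values.
    """
    # (1) Perplexity: drop rejected responses that are too unnatural.
    if rejected_ppl > max_ppl:
        return False
    # (2) Length ratio: drop pairs with wildly mismatched lengths.
    ratio = len(rejected) / max(len(chosen), 1)
    if not (length_ratio_range[0] <= ratio <= length_ratio_range[1]):
        return False
    # (3) Edit distance: drop pairs that are nearly identical
    # (normalized dissimilarity via difflib as a stand-in).
    dissimilarity = 1.0 - SequenceMatcher(None, chosen, rejected).ratio()
    return dissimilarity >= min_edit
```

Pairs failing any of the three checks are discarded before DPO training.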
Rather than expending effort on collecting and annotating new multi-image prompts, we efficiently convert existing single-image datasets, such as LLaVA-665k, by incorporating unrelated images. Our low-cost, scalable approach enriches data forms and allows us to comprehensively explore the various types of multi-image hallucinations that LVLMs might produce. As shown in Fig. 4, we construct multi-image prompts in three formats: (1) Sequence: Multiple images are arranged sequentially, with questions targeting specific images. The number of images varies from 2 to 5. (2) Grid Collage: Multiple images are merged into a single image, each labeled with a number. Questions focus on specific images based on language descriptions. The number of images ranges from 2 to 9. (3) Pic-in-Pic: One image is resized and overlaid onto another, and questions are asked about the combined image.
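The grid-collage format, for example, reduces to simple coordinate arithmetic. The sketch below computes the canvas size and per-cell bounding boxes (actual pasting would then use any imaging library such as Pillow); the cell size and margin are illustrative choices, not the paper's settings:

```python
def grid_layout(n, cols, cell=224, margin=4):
    """Compute the layout of an n-image grid collage.

    Returns (canvas_width, canvas_height, boxes), where boxes[i] is the
    (x, y, w, h) cell into which the i-th resized source image is pasted.
    Cells are numbered row-major so prompts can refer to "Image i".
    """
    rows = (n + cols - 1) // cols  # ceiling division
    width = cols * cell + (cols + 1) * margin
    height = rows * cell + (rows + 1) * margin
    boxes = []
    for i in range(n):
        r, c = divmod(i, cols)
        x = margin + c * (cell + margin)
        y = margin + r * (cell + margin)
        boxes.append((x, y, cell, cell))
    return width, height, boxes
```

The pic-in-pic format is even simpler: resize one image to a fraction of the base image's size and paste it at a fixed offset.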
Results on LLaVA-v1.5 As presented in Tab. 1, applying MIA-DPO to LLaVA-v1.5 achieves improvements of 1.2%/5.8%/2.3%/2.1%/3.5% on five multi-image benchmarks, which demonstrates
the effectiveness of MIA-DPO. As for the challenging MMMU benchmark that requires complex
domain-specific knowledge, MIA-DPO enables LLaVA-v1.5 to achieve a 1.2% improvement. The
experimental results on MMMU demonstrate that MIA-DPO enhances the LLaVA-v1.5’s reasoning
ability on multi-image problems. Additionally, on the BLINK dataset that includes multi-view and
spatial relationship reasoning, MIA-DPO significantly boosts the performance of LLaVA-v1.5 by
5.8%. Such an improvement highlights the effectiveness of MIA-DPO in enhancing the model’s
ability to understand and reason under multi-image scenarios.
Comparison with Preference Optimization Baselines In Tab. 1, we compare MIA-DPO with
three preference optimization baselines (LLaVA-RLHF, HA-DPO, POVID) on LLaVA-v1.5. Thanks
to our attention-based method for constructing multi-image DPO data, MIA-DPO achieves significant gains over the baselines on all five reported multi-image benchmarks.
More LVLM Architectures We also applied MIA-DPO to other LVLM architectures, such as
the recent InternLM-XC2.5 model. As shown in Tab. 1, MIA-DPO boosts performance by
1.2%/0.8%/11.1%/4.5%/4.1% across the five benchmarks, resulting in an average improvement of
4.3%. The results on LLaVA-v1.5 and InternLM-XC2.5 demonstrate that MIA-DPO is general and
effective for different LVLM architectures. Notably, despite the Supervised Fine-tuning (SFT) phase
of InternLM-XC2.5 involving multi-image data, our MIA-DPO still further boosts performance on
multi-image benchmarks.
While MIA-DPO is effective in multi-image scenarios, we also report the performance on single-image benchmarks. As shown in Tab. 2, MIA-DPO outperforms the LLaVA-v1.5 baseline and preference optimization methods, including LLaVA-RLHF and HA-DPO, in average results across seven single-image benchmarks. As for the InternLM-XC2.5 model, MIA-DPO achieves a 1.4% increase on MMStar but performs slightly below baseline on average across all single-image benchmarks. The slight degradation in InternLM-XC2.5's single-image performance suggests that while the model benefits greatly in multi-image scenarios, there may be a trade-off in optimizing for more complex, interleaved inputs. Overall, our findings highlight the robustness of MIA-DPO, which not only excels in improving multi-image performance but also preserves proficiency on single-image tasks. MIA-DPO thus serves as a strong candidate for real-world applications requiring versatile multimodal abilities across both single- and multi-image tasks.
Ablation Studies on Post-Selection In our ablation study, we examined the post-selection process for DPO data. As illustrated in Fig. 3, our post-selection process includes three components: perplexity (PPL), text length, and edit distance. We conduct ablation studies comparing results with and without post-selection. In Tab. 3, the results show that while MIA-DPO without post-selection (row 1) still led to improvements across multiple multi-image benchmarks, its performance was consistently lower than that of MIA-DPO with post-selection (row 2).
Our findings highlight that post-selection effectively removes outlier and low-quality data, further
enhancing the overall quality of the DPO pair data and boosting model performance.
Ablation Studies on Data Types In constructing multi-image DPO data for MIA-DPO, we created three types of data: Sequence, Grid Collage, and Pic-in-Pic. These three data types work together to mitigate the two types of multi-image hallucination we identified: Sequence Confusion and Element Interference. To study the impact of each data type on overall performance, we trained the LLaVA-v1.5 model separately with 20k instances of each data type and summarized the results in Tab. 3.
The experimental results indicate that using each data type individually for DPO on LLaVA-v1.5 yields similar average scores of 42.6, 42.4, and 42.7 across the five benchmarks. However, combining all three data types achieves a higher average score of 43.4, as shown in Tab. 1.
This suggests that the three data types address different hallucination types, and their combination
produces better results than using them separately.
We visualize the reasoning process of the LLaVA-v1.5 model before and after applying MIA-DPO
on multi-image cases. In Fig. 6, we show the attention map of the generated text tokens relative to
the input image tokens. The top and second rows display the attention distribution before and after
applying MIA-DPO, respectively. The attention difference (delta value) in the third row indicates
which areas receive increased attention due to applying our preference optimization process.
Using MIA-DPO, the LLaVA-v1.5 model adjusts its focus to specific image regions corresponding
to the given instruction. In both the first and second cases, we observe an increased focus on the
instruction-targeted areas of Image 1 after applying MIA-DPO. In the third case, attention gravitates
more toward Image 2, which is specified in the language instruction. The visualization results indicate that MIA-DPO effectively improves the model’s ability to correctly allocate attention to the
relevant image regions, reducing the likelihood of multi-image hallucinations.
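The per-image attention shift visualized in Fig. 6 can also be summarized numerically as the change in normalized attention mass per image. The sketch below assumes attention has already been reduced to one per-input-token vector for each model (an assumption on our part):

```python
def attention_delta_per_image(attn_before, attn_after, image_spans):
    """Change in normalized attention mass per image after MIA-DPO.

    attn_before / attn_after: attention from the generated answer tokens
    to the input sequence for the base and DPO-tuned models, assumed
    already averaged over heads and layers. Positive deltas mark images
    that receive more attention after preference optimization.
    """
    def normalized_mass(attn):
        mass = [sum(attn[s:e]) for s, e in image_spans]
        total = sum(mass)
        return [m / total for m in mass]

    before = normalized_mass(attn_before)
    after = normalized_mass(attn_after)
    return [a - b for a, b in zip(after, before)]
```

A positive delta on the instruction-targeted image, as in the cases of Fig. 6, indicates that preference optimization has shifted attention toward the referred image.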