RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

1Wuhan University, 2Shanghai Jiao Tong University, 3The Chinese University of Hong Kong,
4Shanghai AI Laboratory, 5MThreads, Inc., 6Nanyang Technological University
*Equally contributing first authors. Corresponding authors.

Abstract

CLIP (Contrastive Language–Image Pre-training) uses contrastive learning on noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders its precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to the substantial knowledge acquired from pre-training on web-scale corpora. However, the performance of MLLMs declines as the number of categories increases, primarily due to growing complexity and the constraints of a limited context window.

To synergize the strengths of both approaches and enhance few-shot/zero-shot recognition on datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We first establish a multimodal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k most similar results from the memory and uses MLLMs to rank them and make the final prediction. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates significant performance improvements on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.

🔥 Highlights

  • In-depth Observation. We conduct an in-depth analysis of the strengths and weaknesses of VLMs and MLLMs in processing fine-grained datasets.
  • Plug-and-Play. Our RAR can be seamlessly integrated into various MLLMs in a plug-and-play manner.
  • RAR in Classification. In few-shot image recognition, our approach boosts top-1 accuracy from 57.0% to 63.2% in the 4-shot setting and from 63.0% to 69.8% in the 8-shot setting.
  • RAR in Detection. In zero-shot object recognition, our approach yields an 8.4-point increase over the CLIP baseline and a 6.4-point improvement over RegionCLIP.


RAR System Overview

We propose augmenting standard MLLMs with our RAR, a retrieving-and-ranking augmented technique. RAR enables models to dynamically incorporate external knowledge into the processing and generation workflows. By augmenting MLLMs with external knowledge sources, we address challenges related to language ambiguity, synonym handling, and the limitations imposed by limited context windows when dealing with vast vocabularies. Our method uses the inherent strength of MLLMs in generalizing from existing knowledge while addressing their limitations in visual recognition. We design a multimodal retriever that extracts image or text embeddings and stores them in an external memory M. During inference on downstream recognition tasks, we retrieve the top-k categories from the memory and use MLLMs to rank the retrieved results into the final prediction, as sketched below.
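A minimal sketch of this retrieve-then-rank idea, assuming a CLIP backbone from Hugging Face `transformers`; function names such as `build_memory` and `retrieve_topk` are illustrative, not the authors' actual API.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def build_memory(class_names):
    """Encode category names into an external memory M of L2-normalized text embeddings."""
    inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                       return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)  # (num_classes, dim)

@torch.no_grad()
def retrieve_topk(image, memory, class_names, k=5):
    """Retrieve the k most similar categories for a PIL image via cosine similarity."""
    inputs = processor(images=image, return_tensors="pt")
    feat = torch.nn.functional.normalize(model.get_image_features(**inputs), dim=-1)
    sims = (feat @ memory.T).squeeze(0)        # (num_classes,)
    topk = sims.topk(k).indices.tolist()
    return [class_names[i] for i in topk]
```

The retrieved candidates are then handed to the MLLM, which ranks them and outputs the final prediction.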

Here, we showcase the datasets used to evaluate RAR, featuring an extensive array of fine-grained classification datasets alongside detection datasets with vast vocabularies.

RAR in Few-Shot Image Recognition

We employ our RAR in image classification. MLLMs, when integrated with retrieval capabilities, demonstrate impressive performance across various classification tasks, including those involving fine-grained datasets. The figure below illustrates one of our case studies.


Our results are as follows. Compared to the initial CLIP retrieval results (top row), our RAR with ranking (third row) achieves a notable increase in classification accuracy. Additionally, we observe that the LLaVA-1.5 + fine-tuning baseline (second row) underperforms on datasets with large vocabularies such as ImageNet due to the constraint of the LLM's context window.
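The ranking step can be illustrated with a hypothetical prompt template; the exact wording used in the paper may differ, but the structure (image plus top-k candidates, producing a ranked answer) is the same.

```python
def build_ranking_prompt(candidates):
    """Format the retrieved candidates into a ranking instruction for the MLLM."""
    options = "\n".join(f"({i + 1}) {name}" for i, name in enumerate(candidates))
    return (
        "Here are the candidate categories retrieved for this image:\n"
        f"{options}\n"
        "Rank these candidates from most to least likely and answer with the "
        "single best category name."
    )

print(build_ranking_prompt(["sparrow", "finch", "wren", "warbler", "bunting"]))
```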

RAR in Zero-Shot Object Recognition

We extend our multimodal retriever to zero-shot recognition on object detection datasets such as LVIS and V3Det. Compared to the classification datasets, we apply additional pre-processing steps such as cropping and resizing region proposals before extracting the image embeddings. Our improved detection pipeline is illustrated as follows:
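A sketch of this region pre-processing, assuming boxes in (x, y, w, h) format; `expand_ratio` is an illustrative parameter, not necessarily a value used in the paper.

```python
from PIL import Image

def crop_and_resize(image: Image.Image, box, size=224, expand_ratio=1.0):
    """Crop a proposal box, optionally expand it for context, and resize it to the
    CLIP input resolution before extracting the region embedding."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    w, h = w * expand_ratio, h * expand_ratio
    left, top = max(0, cx - w / 2), max(0, cy - h / 2)
    right = min(image.width, cx + w / 2)
    bottom = min(image.height, cy + h / 2)
    region = image.crop((left, top, right, bottom))
    return region.resize((size, size))

# Each resized region crop is then encoded with the CLIP image encoder and matched
# against the category memory, exactly as in the classification setting.
```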


Given pre-existing object proposals such as ground-truth box annotations, the zero-shot object recognition task measures the model's ability to align regions with textual class descriptions. We select two representative models, CLIP and RegionCLIP, and report their performance as baselines. The figure on the left displays the LVIS test results, while the figure on the right presents the V3Det test results.

The figure below illustrates two of our detection case studies.

RAR in Fine-Grained Visual Recognition

We evaluate our RAR in the fine-grained visual recognition setting defined in the prior work FineR. We use only 3 unlabelled images per category to build our memory M for retrieval. Following FineR, we compare against four representative baselines: WordNet+CLIP, BLIP-2, CaSED, and FineR. Averaged results over 5 datasets are shown below, and our RAR achieves the top performance on both the cACC (58.5%) and sACC (65.3%) metrics.
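For reference, a minimal sketch of clustering accuracy (cACC) as we understand the FineR protocol: per-image predicted labels are matched to ground-truth labels with the Hungarian algorithm so that unsupervised naming is scored fairly (sACC instead measures the semantic similarity between predicted and ground-truth class names).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(pred_ids, gt_ids):
    """cACC: best one-to-one matching between predicted and ground-truth label ids."""
    pred_ids, gt_ids = np.asarray(pred_ids), np.asarray(gt_ids)
    n = int(max(pred_ids.max(), gt_ids.max())) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for p, g in zip(pred_ids, gt_ids):
        cost[p, g] += 1                              # co-occurrence counts
    row, col = linear_sum_assignment(cost.max() - cost)  # maximize matched samples
    return cost[row, col].sum() / len(pred_ids)
```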

Interesting Observation

In the field of image classification, especially when facing the challenges of fine-grained image categorization, can MLLMs prove competent and effective? To further explore the potential of MLLMs in image classification tasks, we employed the GPT-4V model to test selected images from our fine-grained datasets.


GPT-4V identifies key characteristics such as “coupe” (a two-door car) and “long fuselage” (long body of an aircraft), which are crucial for distinguishing between similar categories.

To further explore the potential of RAR, we expanded the memory size to include all images from the training set. We then compared the performance of RAR under this setup with that of GPT-4V across multiple image classification datasets. The results are presented in the following table. They show that, regardless of whether the base model is LLaVA, Intern-IXC2, or Qwen-VL, RAR significantly outperforms GPT-4V in accuracy. Even 7B MLLMs, when integrated into the RAR pipeline, far surpass the classification capabilities of GPT-4V.


Ablation Study

In the experiments conducted for our paper, we selected the top 5 retrieved results for ranking. To test the scalability of this method, we conducted a new experiment using the top 10 retrieved results, ranking these ten categories and then assessing top-5 accuracy. This experiment uses a 4-shot setting; the results are shown below.
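A small illustrative sketch of this evaluation loop: retrieve ten candidates, let the MLLM reorder them, and count a hit if the ground-truth label appears among the first five after ranking. Here `retrieve_topk` and `mllm_rank` are placeholders for the components sketched earlier, not the authors' exact interfaces.

```python
def top5_accuracy_after_ranking(samples, retrieve_topk, mllm_rank):
    """samples: iterable of (image, gt_label) pairs."""
    hits = 0
    for image, gt_label in samples:
        candidates = retrieve_topk(image, k=10)   # initial CLIP retrieval (top 10)
        ranked = mllm_rank(image, candidates)     # MLLM reorders the candidates
        hits += gt_label in ranked[:5]            # hit if GT is within the ranked top 5
    return hits / len(samples)
```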