學習筆記｜看懂圖片意境的模型，LLaVA + Phi3和llama3的多模態方案

Eric Chang

10 min readMay 15, 2024

LLaVA（Large Language-and-Vision Assistant）是相當受到歡迎的GPT-4 對標產品，如其名是希望透過大型語言模型來實現對於視覺問題上的問答。

它的訓練方法簡單來說，就是結合預訓練的CLIP ViT-L/14視覺編碼器和Vicuna模型，並且在fine tune的階段對Visual Chat、Science QA進行相對應的調教。

LLaVA

Visual Instruction Tuning

llava-vl.github.io

結合phi模型的論文研讀

由於結合了LLM模型，於是我們就有了將Vicuna模型進行替換的想法，思考能不能讓其他更先進的開源模型（llama3或者phi3）來進行結合，查看LLaVA的效果會不會更好。2024年1月，中國的華東師範大學發表了LLaVA-𝜙: Efficient Multi-Modal Assistant with Small Language Model這篇論文，就是在探討LLaVA結合小語言模型是否會有不錯的表現。

LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model

In this paper, we introduce LLaVA-$ϕ$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the…

arxiv.org

論文發布的當時，微軟的phi-3還沒有發布，所以論文使用的LLM模型為phi-2 （2.7B）的小模型，論文說明了即使是具有2.7B參數的小語言模型，只要它們是用高質量的語料庫訓練的，就可以有效地實現文本和視覺元素的複雜對話，而且在跟其他較為大型的語言模型上，表現甚至還有超越的可能，不過還是輸給LLaVA-1.5的表現。

另外有個限制在論文中也有提及，由於Phi-2使用了codegen mono 分詞器，因為論文內的模型並未專門進行多語言指令的微調， LLaVA-Phi架構無法處理多種語言的指令，包括中文的處理。

透過Python 實作llava-phi-3-mini推論

2024年在llama3跟phi-3相繼發佈之後，也有不少開發者將LLaVA嘗試結合llama3跟phi-3，看看這個組合是否可以在視覺對話上表現得更好。這次xturner也很快就把llava-phi-3-mini的版本完成出來，我們在本地實際運行一次。

xtuner/llava-phi-3-mini · Hugging Face

We're on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

如果單看llava-phi-3-mini的表現的話，跟llama3或原本的llava版本相比，表現相當不錯，但是在MMBnech的部分得分卻比llama3的版本稍微低一點，MMBnech是multimodel裡相對比較全面的評分方式。

我們這次使用LLaVA經典的案例來進行推論，我們先進行相關的套件安裝：

!pip install - upgrade -q accelerate bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git

接著是準備相對應的Processor 跟量化物件。我們先說BitsAndBytes這quantization的方法，它採用混合精度（Mixed precision），也是CUDA的輕量化包裝，我們採用4bits精度。而AutoProcessor 是langchain用於載入模型的Processor，可以根據模型名稱自動選擇對應的Processor，並將Processor載入為可以直接使用的物件。

import torch
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration
from transformers import BitsAndBytesConfig
from PIL import Image

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model_id = "xtuner/llava-phi-3-mini-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, quantization_config=quantization_config, device_map="auto")

準備好我們的模型後，開始準備我們想要分析的資料，這裡我們使用LLaVA的兩張經典圖片，湖岸的碼頭跟兩隻小喵咪：

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

針對這兩張圖片，我下了一些問題，我有限制輸出的內容在100 tokens以內，如果想要更多的文字可以自己調整：

prompts = [
            "USER: <image>\nPlease describe the weather in this image. \nASSISTANT:",
            "USER: <image>\nPlease describe the colors in this image?\nASSISTANT:",
]

inputs = processor(prompts, images=[image1, image2], padding=True, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=100)
generated_text = processor.batch_decode(output, skip_special_tokens=True)
for text in generated_text:
  print(text.split("ASSISTANT:")[-1])
  print('=' * 12)

輸出的結果如下，第二個關於顏色的問題，我覺得回答的品質還滿不錯的，模型說圖片上有兩隻貓，一隻灰色和黑色條紋的，另一隻是棕色和黑色條紋的。這兩隻貓躺在一張粉紅色的沙發上：

The weather in this image is cloudy.
============
The image features two cats, one with a gray and black striped coat and the other with a brown and black striped coat. The cats are laying on a pink couch. I’m sorry, but I am not able to generate images. Is there anything else I can help you with?
============

換llava-llama3在LM Studio試試看

在視覺模型的相關支援上，LM Studio的功能支援度跟資源還算是比較好的，而且也會根據電腦的規格來推薦比較適合的Model。我在Macbook本地下載完llava-llama3的模型後，上傳了一張我滿喜歡的藝術作品，並且請他描述一下他所看到的圖片。

Llava-llama3的原文回覆如下，其中sense of tranquility and solitude這句話讓我比較驚訝，代表本地的模型也是可以透過解讀圖片、去感受（或者猜想）圖片想要表達的意境，這是相對比較主觀的感受。

The image portrays a serene scene of a small building perched on the edge of a cliff. The building, painted in white with a contrasting blue roof, stands alone against the backdrop of the sky. The sky is a canvas of pink and orange hues, dotted with fluffy clouds that add depth to the vast expanse. A small trail, perhaps indicative of frequent visitors or animals, leads up to the building. Despite its simplicity, the image conveys a sense of tranquility and solitude. There are no discernible texts or countable objects in the image. The relative position of the building to the sky indicates it is at a considerable height due to its location on the cliff. The trail appears to be the only sign of human activity or interaction with the scene. The image does not provide any information that allows for confident identification of the worldwide landmark referred to as ‘sa_12554’.

用GPT把英文翻成中文，我個人覺得意境還是相當不錯的，只是後面有個’sa_12554'，不確定這是什麼意思：

這張圖展現了一幅寧靜的景象，一座小建築物坐落在懸崖邊緣。建築物漆成白色，屋頂是對比鮮明的藍色，獨自矗立在天空的背景下。天空呈現粉紅和橘色調，點綴著蓬鬆的雲朵，為廣闊的天際增添了層次感。一條小徑通向建築物，可能是頻繁遊客或動物的痕跡。儘管畫面簡單，卻傳達了一種寧靜和孤獨的感覺。圖片中沒有可辨識的文字或可數的物體。建築物相對於天空的位置表明它位於懸崖上，高度相當可觀。小徑似乎是唯一表明人類活動或與景觀互動的跡象。這張圖片沒有提供任何信息，無法確定所謂的“sa_12554”世界地標。