[RTX3060] Running SPHINX-Tiny, a Multimodal Large Language Model (MLLM), in Docker

AI technology is advancing faster every day, and among Multimodal Large Language Models (MLLMs), SPHINX is particularly noteworthy. This article walks through how to run SPHINX-Tiny in a Docker environment on Windows 11 with an RTX3060.

Introduction: Overview of SPHINX-Tiny
Environment Setup: Preparing with Docker
Downloading the Model
Running the Script
Inference Results
Repository
Summary

Introduction: Overview of SPHINX-Tiny

SPHINX is an MLLM that can understand images and text together. It comes in a "SPHINX-Tiny" version that can run even on common consumer GPUs such as the RTX3060.

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for str...

Alpha-VLLM/LLaMA2-Accessory at main

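Environment Setup: Preparing with Docker

First, build the environment for running SPHINX-Tiny with Docker. The following commands start the Docker environment and open a shell inside the container. Because docker-compose up without the -d option stays attached to the container output, run the second command from a separate terminal.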

>docker-compose up 
>docker-compose exec webapp /bin/bash
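
Once inside the container, it can help to confirm that the GPU is actually visible before loading the model. The following is a minimal sketch, assuming PyTorch is installed in the container image (the inference script in this article imports torch); the function name describe_gpu is hypothetical and not part of the repository.

```python
def describe_gpu() -> str:
    """Return a one-line summary of the GPU that PyTorch can see, if any."""
    try:
        import torch  # imported lazily so the check itself never crashes
    except ImportError:
        return "PyTorch is not installed in this environment"

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"

    # Device 0 is the first (and on a single-GPU machine, the only) CUDA device.
    props = torch.cuda.get_device_properties(0)
    total_gib = props.total_memory / 1024**3
    return f"{props.name}: {total_gib:.1f} GiB of VRAM"


print(describe_gpu())
```

If this reports that no CUDA device is visible, the container was probably started without GPU access, and the Docker GPU settings are the first thing to check.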

Downloading the Model

Download the complete set of model files from the repository below.

Alpha-VLLM/LLaMA2-Accessory at main
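
As an alternative to downloading through the browser, the files can be fetched from a script. The following is a minimal sketch, assuming the link above points to the Hugging Face repository Alpha-VLLM/LLaMA2-Accessory and that the checkpoint sits under finetune/mm/SPHINX/SPHINX-Tiny inside it (an assumption inferred from the /app/finetune/mm/SPHINX/SPHINX-Tiny path used later in this article). It fetches only the SPHINX-Tiny files rather than the whole repository. The helper names checkpoint_pattern and download_checkpoint are hypothetical.

```python
from pathlib import Path

REPO_ID = "Alpha-VLLM/LLaMA2-Accessory"
# Assumption: the repository stores the checkpoint under this sub-path,
# mirroring the /app/finetune/mm/SPHINX/SPHINX-Tiny path used in this article.
MODEL_SUBDIR = "finetune/mm/SPHINX/SPHINX-Tiny"


def checkpoint_pattern(subdir: str) -> str:
    """Glob pattern that matches every file below `subdir` in the repository."""
    return subdir.strip("/") + "/*"


def download_checkpoint(local_root: str = "/app") -> Path:
    """Download only the SPHINX-Tiny files and return the local checkpoint directory."""
    # Imported here so the module loads even before huggingface_hub is installed
    # (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[checkpoint_pattern(MODEL_SUBDIR)],
        local_dir=local_root,  # files keep their repository-relative paths
    )
    return Path(local_root) / MODEL_SUBDIR
```

Calling download_checkpoint() inside the container should leave the files at /app/finetune/mm/SPHINX/SPHINX-Tiny, provided the sub-path assumption holds.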

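Running the Script

Inside the container, move to the SPHINX directory and run the single-gpu_inference.py script. The script processes an image and text with the SPHINX model in the following steps.

Loading the model:
The SPHINXModel.from_pretrained method loads the SPHINX model.
Loading the image:
Image.open reads the image file.
Setting the question and generating the answer:
A question is set, and the model.generate_response method generates the answer.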

from SPHINX import SPHINXModel
from PIL import Image
import torch

# Load the SPHINX-Tiny checkpoint downloaded in the previous step.
model = SPHINXModel.from_pretrained(pretrained_path="/app/finetune/mm/SPHINX/SPHINX-Tiny", with_visual=True)
image = Image.open("examples/1.jpg")

# Each entry is a [question, answer] pair; None marks the answer still to be generated.
qas = [["What's in the image?", None]]

response = model.generate_response(qas, image, max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0)
print(response)

# Store the first answer in the history, then ask a follow-up question about the same image.
qas[-1][-1] = response
qas.append(["Then how does it look like?", None])
response2 = model.generate_response(qas, image, max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0)
print(response2)
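
The two-turn pattern in the script above (append a question, generate, write the answer back into the history) generalizes to any number of follow-up questions. The following is a minimal sketch of that loop. The helper name ask_in_sequence is hypothetical, and it assumes only the generate_response call exactly as it is used in the script.

```python
from typing import Any, List, Optional


def ask_in_sequence(model: Any, image: Any, questions: List[str], **gen_kwargs: Any) -> List[str]:
    """Ask `questions` one after another, feeding each answer back as history."""
    qas: List[List[Optional[str]]] = []
    answers: List[str] = []
    for question in questions:
        # Open a new turn whose answer is still unknown.
        qas.append([question, None])
        answer = model.generate_response(qas, image, **gen_kwargs)
        # Record the answer so the next turn sees the full conversation.
        qas[-1][-1] = answer
        answers.append(answer)
    return answers
```

With the model and image from the script, ask_in_sequence(model, image, ["What's in the image?", "Then how does it look like?"], max_gen_len=1024, temperature=0.9, top_p=0.5, seed=0) would issue the same two calls as the original code.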

Inference Results

Test image 1

The image depicts a group of zebras and antelopes, with some of them drinking water.
The image is a black and white photo of a group of animals, primarily zebras and antelopes, standing near a body of water. The animals are drinking water, and the water appears to be shallow.

Test image 2

The image shows two women standing on a beach, facing the ocean. They are wearing tank tops and appear to be looking out at the water. There are surfboards in the foreground, and a box with the label "BOXED" is placed between them.
The image gives a serene and picturesque beach scene. The women are standing close to the water, with the ocean as their backdrop. The surfboards are placed in the foreground, suggesting they might be preparing for a surfing session. The box with the label "BOXED" is positioned between the two women, possibly indicating a product they are promoting or selling.

Test image 3

The image showcases a basket filled with various fruits. There are apples, pineapples, bananas, and a single kiwi.
The image has a vibrant and colorful display of fruits. The apples are of different shades of green, pineapples are of various shades of yellow, bananas are of different shades of yellow and green, and the kiwi is a bright green.

Repository

GitHub - Sunwood-ai-labs/LLaMA2-Accessory: An Open-source Toolkit for LLM Development

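Summary

Running SPHINX-Tiny on Windows 11 with an RTX3060 shows that a multimodal LLM can be tried out on a consumer GPU. Docker simplifies the environment setup, which makes the model accessible to a wide range of developers. Advances like this should deepen the interaction between AI and people and open up new application areas.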