Llama 3.2 Vision Finetuning Unsloth Radiography (📒 Google Colab notebook included)

Introduction
Data preparation
Training the model
Inference
Saving the model
Saving to float16 for vLLM
📒 Notebook
Model URL
Other useful links:
1. Related

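Introduction

In this notebook, we finetune a vision language model (VLM) for medical image analysis. Specifically, we use the Llama 3.2 11B model and learn how to adapt it to analyze X-rays, CT scans, and ultrasound images in support of medical professionals.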

Key features:

Efficient training with the Unsloth library
Hands-on finetuning with a medical imaging dataset
How to handle an 11B-parameter model with minimal GPU memory
Parameter-efficient tuning with LoRA

This can be run on a free Tesla T4 Google Colab instance!

If you need help, join the Discord, and please ⭐ star the project on GitHub ⭐

To install Unsloth on your own computer, follow the installation instructions on the GitHub page.

[NEW] As of November 2024, Unsloth supports vision finetuning!

In this notebook you will learn how to:

Prepare the data
Train the model
Run the model
Save the model

%%capture
!pip install unsloth
# Get the latest nightly Unsloth version
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
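After the install cell it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch, not part of the original notebook: `installed_version` is a hypothetical helper name, and it relies only on the standard library's `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this walkthrough depends on.
for name in ("unsloth", "torch", "transformers", "trl"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Recording these versions alongside your results makes it easier to reproduce a run later, since the nightly install changes over time.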

The following models are supported:

Llama 3.2 Vision 11B, 90B
Pixtral
Qwen2VL 2B, 7B, 72B
All Llava variants, including Llava NEXT

You can use 16-bit LoRA by setting load_in_4bit=False, or 4-bit QLoRA. Both are accelerated and use less memory.
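To see why the 4-bit option matters on a 16 GB Tesla T4, a rough back-of-the-envelope estimate of the memory needed for the weights alone can be sketched as below. This is an illustration under simplifying assumptions: `weight_memory_gib` is a hypothetical helper, 11e9 is the nominal parameter count rather than the exact one, and the estimate ignores quantization constants, activations, LoRA weights, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 11e9  # nominal size of "11B"; the exact count differs slightly

print(f"16-bit: {weight_memory_gib(NUM_PARAMS, 16):.1f} GiB")
print(f" 4-bit: {weight_memory_gib(NUM_PARAMS, 4):.1f} GiB")
```

Under these assumptions the 16-bit weights alone exceed a T4's memory, while the 4-bit weights leave room for activations and the LoRA adapters.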

from unsloth import FastVisionModel # FastLanguageModel for LLMs
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit", # Llama 3.2 vision support
    "unsloth/Llama-3.2-11B-Vision-bnb-4bit",
    "unsloth/Llama-3.2-90B-Vision-Instruct-bnb-4bit", # Can fit in a 80GB card!
    "unsloth/Llama-3.2-90B-Vision-bnb-4bit",

    "unsloth/Pixtral-12B-2409-bnb-4bit",              # Pixtral fits in 16GB!
    "unsloth/Pixtral-12B-Base-2409-bnb-4bit",         # Pixtral base model

    "unsloth/Qwen2-VL-2B-Instruct-bnb-4bit",          # Qwen2 VL support
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    "unsloth/Qwen2-VL-72B-Instruct-bnb-4bit",

    "unsloth/llava-v1.6-mistral-7b-hf-bnb-4bit",      # Any Llava variant works!
    "unsloth/llava-1.5-7b-hf-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True,  # False to skip finetuning the vision layers
    finetune_language_layers   = True,  # False to skip finetuning the language layers
    finetune_attention_modules = True,  # False to skip finetuning the attention layers
    finetune_mlp_modules       = True,  # False to skip finetuning the MLP layers

    r = 16,            # Larger values can improve accuracy but may overfit
    lora_alpha = 16,   # Recommended: alpha at least equal to r
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
    use_rslora = False,   # Rank-stabilized LoRA is supported
    loftq_config = None,  # LoftQ is also supported
)
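The roles of `r`, `lora_alpha`, and `use_rslora` can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 4096 x 4096 layer shape is an assumed example rather than a dimension taken from Llama 3.2 Vision. It uses the standard LoRA formulation, in which a layer gains two small matrices (A of shape r x in, B of shape out x r) and the update is scaled by alpha / r, or alpha / sqrt(r) for rank-stabilized LoRA.

```python
import math


def lora_param_count(in_features, out_features, r):
    """Trainable parameters LoRA adds to one linear layer (A: r x in, B: out x r)."""
    return r * (in_features + out_features)


def lora_scale(lora_alpha, r, use_rslora=False):
    """Factor applied to the low-rank update: alpha / r, or alpha / sqrt(r) for rsLoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


# Hypothetical 4096 x 4096 projection, chosen only for illustration.
full = 4096 * 4096
added = lora_param_count(4096, 4096, r=16)
print(f"LoRA adds {added:,} parameters ({added / full:.2%} of the frozen layer)")
print("scale with r = 16, alpha = 16:", lora_scale(16, 16))
print("scale with rsLoRA:", lora_scale(16, 16, use_rslora=True))
```

With alpha equal to r the scale is 1, which is one way to read the recommendation in the comment above: raising r without raising alpha shrinks the effective update.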

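Data preparation

We use a sampled version of the ROCO radiography dataset. The sampled dataset is available on Hugging Face as unsloth/Radiology_mini; the full dataset is available separately.

The dataset contains X-rays, CT scans, and ultrasound images showing medical conditions and diseases. Each image comes with a caption written by an expert. The goal is to finetune a VLM into a useful analysis tool for medical professionals.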

from datasets import load_dataset
dataset = load_dataset("unsloth/Radiology_mini", split = "train")

Let's take a look at the dataset and inspect the first example:

dataset

dataset[0]["image"]

dataset[0]["caption"]

All vision finetuning tasks require the dataset to be formatted as follows:

[
{ "role": "user",
  "content": [{"type": "text",  "text": instruction}, {"type": "image", "image": image} ]
},
{ "role": "assistant",
  "content": [{"type": "text",  "text": answer} ]
},
]
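Because a malformed sample usually only surfaces as an error deep inside training, it can be worth checking the structure up front. The following is a minimal sketch of such a check for the training format shown above; `validate_conversation` is a hypothetical helper, not an Unsloth function, and it covers only the roles and content types that appear in this post.

```python
ALLOWED_ROLES = {"user", "assistant"}
CONTENT_TYPES = {"text", "image"}  # each part stores its payload under a key named after its type


def validate_conversation(conversation):
    """Raise ValueError if a conversation does not follow the layout shown above."""
    for turn in conversation:
        role = turn.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        for part in turn.get("content", []):
            kind = part.get("type")
            if kind not in CONTENT_TYPES:
                raise ValueError(f"unexpected content type: {kind!r}")
            if kind not in part:
                raise ValueError(f"{kind!r} part is missing its {kind!r} key")
    return True


# Placeholder strings stand in for a real instruction, image, and caption.
example = [
    {"role": "user", "content": [
        {"type": "text", "text": "instruction"},
        {"type": "image", "image": "<PIL image goes here>"}]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "answer"}]},
]
print(validate_conversation(example))
```

Note that this check targets training samples; the inference prompts later in the post use a bare `{"type": "image"}` placeholder and pass the image separately, so they would not pass it.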

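We write a custom instruction asking the VLM to act as an expert radiographer. Instead of a single instruction, you can also add multiple turns to make it a dynamic conversation.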

instruction = "You are an expert radiographer. Describe accurately what you see in this image."

def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
          "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
          "content" : [
            {"type" : "text",  "text"  : sample["caption"]} ]
        },
    ]
    return { "messages" : conversation }
pass
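As a sketch of the multi-turn idea mentioned above, a second question and answer can be appended after the caption turn. Everything beyond the single-turn layout is an assumption here: `convert_to_multiturn` is a hypothetical helper, the dataset used in this post only provides an image and a caption, so the follow-up question and answer would have to come from additional annotations, and attaching the image to the first user turn only is a design choice of this sketch rather than a documented requirement.

```python
def convert_to_multiturn(sample, follow_up_question, follow_up_answer):
    """Build a two-turn conversation; the image is attached to the first user turn only."""
    first_instruction = (
        "You are an expert radiographer. "
        "Describe accurately what you see in this image."
    )
    conversation = [
        {"role": "user", "content": [
            {"type": "text", "text": first_instruction},
            {"type": "image", "image": sample["image"]}]},
        {"role": "assistant", "content": [
            {"type": "text", "text": sample["caption"]}]},
        # Hypothetical second turn: plain text, referring back to the same image.
        {"role": "user", "content": [
            {"type": "text", "text": follow_up_question}]},
        {"role": "assistant", "content": [
            {"type": "text", "text": follow_up_answer}]},
    ]
    return {"messages": conversation}
```

The return value has the same `{"messages": ...}` shape as the single-turn converter, so in principle it could be used in its place once suitable follow-up annotations exist.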

Let's convert the dataset into the "correct" format for finetuning:

converted_dataset = [convert_to_conversation(sample) for sample in dataset]

The first example is now structured as follows:

converted_dataset[0]

Before finetuning, the vision model may already know how to analyze images. Let's check!

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[0]["image"]
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Training the model

Here we use Hugging Face TRL's SFTTrainer! See the TRL SFT documentation for more details.
To keep things fast we run only 30 steps (max_steps = 30), but for a full run you can set num_train_epochs = 1 and remove max_steps.
TRL's DPOTrainer is also supported!

We use the new UnslothVisionDataCollator, which helps set up vision finetuning.

from unsloth import is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model) # Enable for training!

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    data_collator = UnslothVisionDataCollator(model, tokenizer), # Required!
    train_dataset = converted_dataset,
    args = SFTConfig(
        # per_device_train_batch_size = 2,
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 30,
        # num_train_epochs = 1, # Set this for a full training run
        learning_rate = 2e-4,
        fp16 = not is_bf16_supported(),
        bf16 = is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",     # For Weights and Biases

        # The following settings are required for vision finetuning:
        remove_unused_columns = False,
        dataset_text_field = "",
        dataset_kwargs = {"skip_prepare_dataset": True},
        dataset_num_proc = 4,
        max_seq_length = 2048,
    ),
)
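The batch settings above determine how much data a short run actually sees. The arithmetic can be sketched as follows; the helper names are hypothetical, a single GPU is assumed, and the 2,000-sample dataset size is a made-up figure used only to show the calculation.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def steps_per_epoch(num_samples, batch):
    """Approximate optimizer steps for one pass over the data (partial batch counted)."""
    return math.ceil(num_samples / batch)


batch = effective_batch_size(per_device_batch=4, grad_accum_steps=4)
print("effective batch size:", batch)
print("samples seen in 30 steps:", batch * 30)

# Hypothetical dataset size; in the notebook use len(converted_dataset) instead.
print("steps for one epoch of 2,000 samples:", steps_per_epoch(2000, batch))
```

Comparing the number of samples seen with the dataset size shows how far a 30-step run is from a full epoch, which helps when deciding whether to switch to num_train_epochs.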

#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

trainer_stats = trainer.train()
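With lr_scheduler_type = "linear" and warmup_steps = 5, the learning rate logged during training should ramp up and then decay. The sketch below reproduces that shape in plain Python as an aid to reading the logs; `linear_schedule_lr` is a hypothetical helper that follows the usual linear-warmup, linear-decay rule, and the values the trainer reports may be offset by one step.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=30):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


for step in (0, 1, 5, 15, 30):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

In a 30-step run a sixth of the steps are spent warming up, so the model trains at the full 2e-4 only briefly; longer runs spend proportionally more time near the peak.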

#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Inference

Let's run the model! You can change the instruction and input; leave the output blank!

We use min_p = 0.1 and temperature = 1.5. See this tweet for more on these settings.

FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[0]["image"]
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)
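To give a feel for what temperature = 1.5 and min_p = 0.1 do together, here is a toy version of the sampling distribution in plain Python. It is a sketch of the general min-p idea (keep only tokens whose probability is at least min_p times that of the most likely token), not the library's implementation; `min_p_distribution` is a hypothetical helper, the logits are made up, and applying temperature before the min-p filter is an assumption about the ordering.

```python
import math


def min_p_distribution(logits, temperature=1.5, min_p=0.1):
    """Apply temperature, then drop tokens whose probability is below min_p * max probability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]  # renormalize over the surviving tokens


# Toy logits for a five-token vocabulary, for illustration only.
print([round(p, 3) for p in min_p_distribution([5.0, 4.0, 2.0, 0.0, -3.0])])
```

The high temperature flattens the distribution and encourages varied wording, while the min-p cutoff removes the long tail of unlikely tokens that a high temperature would otherwise make reachable.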

Saving the model

To save the final model as LoRA adapters, use Hugging Face's push_to_hub for online saving or save_pretrained for local saving.

[NOTE] This saves only the LoRA adapters, not the full model. To save in 16-bit or GGUF, scroll down!

model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
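A quick way to confirm that only the adapters were written is to look inside the output directory. The sketch below assumes the directory follows the usual PEFT layout, with an `adapter_config.json` file containing `r` and `lora_alpha` keys; that layout is an assumption here, and `summarize_adapter` is a hypothetical helper.

```python
import json
from pathlib import Path


def summarize_adapter(adapter_dir):
    """Return the file names in an adapter directory and its parsed config, if present."""
    path = Path(adapter_dir)
    files = sorted(p.name for p in path.iterdir())
    config_path = path / "adapter_config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else None
    return files, config


# Only inspect the directory if the save step above has been run.
if Path("lora_model").is_dir():
    files, config = summarize_adapter("lora_model")
    print(files)
    if config is not None:
        print("rank:", config.get("r"), "alpha:", config.get("lora_alpha"))
```

If the listing shows only small adapter and tokenizer files, the base model weights were not saved, which matches the note above.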

To load the LoRA adapters you just saved for inference, change False to True:

if False:
    from unsloth import FastVisionModel
    model, tokenizer = FastVisionModel.from_pretrained(
        model_name = "lora_model", # The adapter directory saved after training
        load_in_4bit = True, # Match the setting used for training; False for 16-bit LoRA
    )
    FastVisionModel.for_inference(model) # Enable for inference!

image = dataset[0]["image"]
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt = True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens = False,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

Saving to float16 for vLLM

Saving directly to float16 is also supported. Select merged_16bit for float16. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal token at https://huggingface.co/settings/tokens.

# Choose only one way to save! (You do not need both.)

# Save locally in 16-bit
if False: model.save_pretrained_merged("unsloth_finetune", tokenizer,)

# Export and save to your Hugging Face account
if False: model.push_to_hub_merged("YOUR_USERNAME/unsloth_finetune", tokenizer, token = "PUT_HERE")
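Conceptually, a merged save folds each LoRA update back into the frozen weight so that no adapter is needed at inference time. The toy example below shows that arithmetic for a single layer, W + (alpha / r) * B @ A, using plain Python lists. It is a sketch of the standard LoRA merge formula only: the helper names and matrix values are made up, and the real merge operates on tensors and handles details, such as quantized base weights, that are omitted here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight for one linear layer."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2 x 2 layer with a rank-1 adapter; values are made up for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # r x in_features
B = [[0.5], [0.0]]  # out_features x r
print(merge_lora(W, A, B, lora_alpha=1, r=1))
```

The merged matrix has the same shape as the original weight, which is why the result can be served by tools that know nothing about LoRA.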

That's it! If you have any questions about Unsloth, there is a Discord channel! Join it if you find a bug, want to keep up with the latest LLM news, need help, or want to join the project!