Google Colab ✖ OpenAI GPT-OSS 20B: A Complete Fine-Tuning Guide

🚀 Introduction

This tutorial explains how to fine-tune OpenAI's latest model, GPT-OSS 20B, on a Google Colab L4 GPU (22GB VRAM). Combining the Unsloth library with LoRA keeps the training efficient.

I translated a notebook for fine-tuning OpenAI gpt-oss into Japanese!!!
*I'll publish an article about it later!! https://t.co/ZuPbcKi4jT pic.twitter.com/5Ji93tmiKA

— Maki@Sunwood AI Labs. (@hAru_mAki_ch) August 9, 2025

📢 Latest News

New: Unsloth now supports training OpenAI's GPT-OSS models!
Support for Text-to-Speech (TTS) models has also been added
The new Dynamic 2.0 quantization method is available

⚙️ Environment Setup

Important Notes

⚠️ This notebook has been verified on a Google Colab L4 GPU (22GB VRAM).
It does not run on a T4 GPU, so be sure to select an L4 instance.
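
Checking the GPU before installing anything can save a wasted session. The following is a minimal sketch, assuming PyTorch is already available in the Colab runtime. The helper name `has_enough_vram` and the 20 GiB threshold are illustrative choices, not part of the original notebook.

```python
def has_enough_vram(total_bytes: int, required_gib: float = 20.0) -> bool:
    """Return True if the GPU's total memory meets the (assumed) threshold."""
    return total_bytes / 1024**3 >= required_gib


def check_gpu() -> None:
    """Print the detected GPU and whether it clears the threshold."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        print("No CUDA GPU detected. Select an L4 runtime in Colab.")
        return
    props = torch.cuda.get_device_properties(0)
    verdict = "OK" if has_enough_vram(props.total_memory) else "too small for this guide"
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB ({verdict})")
```

Calling `check_gpu()` in the first cell prints the device name and a verdict, so a T4 runtime is caught before the long install step.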

1. Installation

Install the latest PyTorch, Triton, Transformers, and Unsloth:

%env CUDA_LAUNCH_BLOCKING=1

%%capture
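# Install the latest versions of the libraries
!pip install --upgrade -qqq uv
# Pin numpy to the version already installed (if any) so uv does not replace it
try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
except ImportError: install_numpy = "numpy"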
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    torchvision bitsandbytes \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

🤖 Loading the Model

Loading the GPT-OSS 20B Model

Load the model memory-efficiently using 4-bit quantization:

from unsloth import FastLanguageModel
import torch

max_seq_length = 4096
dtype = None  # None lets Unsloth auto-detect the dtype

# Pre-quantized 4-bit models (faster download and fewer OOM errors).
# Listed for reference; any of these can be passed as model_name.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit",  # bitsandbytes 4-bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b",                    # MXFP4 format
    "unsloth/gpt-oss-120b",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = dtype,  # auto-detected
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # 4-bit quantization to reduce memory
    full_finetuning = False,
)
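
To see why 4-bit loading matters on a 22GB card, here is a rough back-of-the-envelope estimate of weight memory alone. It assumes about 20 billion parameters (taken from the model name) and ignores activations, optimizer state, and any layers kept in higher precision, so treat the numbers as an order-of-magnitude sketch.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 20e9  # assumption: roughly 20B parameters, per the model name

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

Under these assumptions, only the 4-bit weights leave meaningful headroom for training on this GPU.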

Adding LoRA Adapters

For parameter-efficient fine-tuning, add LoRA adapters (only about 1% of all parameters are trained):

model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # rank (choose from 8, 16, 32, 64, 128)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,  # 0 is the optimized setting
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # uses 30% less VRAM
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
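
LoRA attaches two small matrices to each targeted linear layer, A with shape (r × d_in) and B with shape (d_out × r), and scales their product by `lora_alpha / r` (or `lora_alpha / sqrt(r)` when `use_rslora = True`). The sketch below works through that arithmetic. The 4096 × 4096 layer is a hypothetical shape chosen for illustration, not the actual GPT-OSS dimensions.

```python
import math


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Factor applied to the low-rank update: alpha/r, or alpha/sqrt(r) with rank-stabilized LoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


# Hypothetical 4096 x 4096 projection, for illustration only
d_in, d_out, r = 4096, 4096, 8
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer: {full:,} params, LoRA adds: {added:,} ({added / full:.2%})")
print(f"scaling with lora_alpha=16, r=8: {lora_scaling(16, 8)}")
```

Raising `r` grows the adapter linearly, which is why the rank is the main knob for trading capacity against memory.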

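🧠 Reasoning Effort

A distinctive feature of the GPT-OSS models is that the "reasoning effort" level can be adjusted:

Three Levels

Low: prioritizes fast responses; suited to simple tasks
Medium: balances performance and speed
High: strongest reasoning performance; suited to complex tasks (higher latency)

Usage Example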

from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Set the reasoning effort (low/medium/high)
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",  # set it here!
).to(model.device)

_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))
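
When the level comes from user input or a config file, it can help to validate it before calling `apply_chat_template`. The helper below is a hypothetical convenience wrapper (`chat_template_kwargs` is not part of Unsloth or Transformers). It only builds the keyword arguments used in the example above.

```python
VALID_EFFORTS = ("low", "medium", "high")


def chat_template_kwargs(reasoning_effort: str = "medium") -> dict:
    """Build the keyword arguments for apply_chat_template, validating the level first."""
    effort = reasoning_effort.strip().lower()
    if effort not in VALID_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {VALID_EFFORTS}, got {reasoning_effort!r}"
        )
    return {
        "add_generation_prompt": True,
        "return_tensors": "pt",
        "return_dict": True,
        "reasoning_effort": effort,
    }
```

It would be used as `tokenizer.apply_chat_template(messages, **chat_template_kwargs("high")).to(model.device)`, so a typo such as "hgih" fails fast instead of reaching the template.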

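📊 Data Preparation

Using a Multilingual Reasoning Dataset

We use the HuggingFaceH4 Multilingual-Thinking dataset from the Hugging Face Hub. It contains chain-of-thought (CoT) reasoning examples translated from English into four languages: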

from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
             for convo in convos]
    return {"text": texts}

# Load the dataset
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")

# Standardize the dataset and apply the chat template
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)

# Inspect the first sample
print(dataset[0]['text'])
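
With `batched=True`, `dataset.map` passes the function a dict of column lists and expects a dict of lists back. The sketch below mimics that contract in plain Python with a stand-in template function. `fake_chat_template` and the sample batch are purely illustrative and do not reproduce the real GPT-OSS chat format.

```python
def fake_chat_template(convo: list) -> str:
    """Stand-in for tokenizer.apply_chat_template(..., tokenize=False); NOT the real GPT-OSS format."""
    return "\n".join(f"[{turn['role']}] {turn['content']}" for turn in convo)


def format_batch(examples: dict) -> dict:
    """Same shape as formatting_prompts_func: column lists in, a new 'text' column out."""
    return {"text": [fake_chat_template(convo) for convo in examples["messages"]]}


# A hypothetical batch of two conversations, shaped like the 'messages' column
batch = {
    "messages": [
        [{"role": "user", "content": "Bonjour"}, {"role": "assistant", "content": "Salut"}],
        [{"role": "user", "content": "2 + 2?"}, {"role": "assistant", "content": "4"}],
    ]
}
out = format_batch(batch)
print(len(out["text"]))
print(out["text"][0])
```

The returned list must have one entry per input row, which is why the function builds `texts` with a list comprehension over `convos`.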

🏋️ Training the Model

Configuring SFTTrainer

Training uses SFTTrainer from Hugging Face TRL:

from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 1,  # tuned for the L4 GPU
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,  # 60 steps for the demo (for a full run, remove this and set num_train_epochs=1)
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",  # change this to use WandB or similar
    ),
)
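
The combination of `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"` gives a learning rate that ramps up linearly for 5 steps and then decays linearly to zero. The function below is a plain-Python sketch of that shape, written to match the usual linear-with-warmup formula. It is an illustration, not the trainer's internal code.

```python
def linear_lr(step: int, base_lr: float = 2e-4, warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 5, 30, 60):
    print(step, f"{linear_lr(step):.2e}")
```

With only 60 steps, the warmup covers a noticeable share of the run, so a longer run may want `warmup_steps` scaled up accordingly.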

Checking Memory Usage

# Show current memory statistics
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

Running the Training

# Start training
trainer_stats = trainer.train()

# Memory and time statistics after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
print(f"Training time: {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes")
print(f"Peak reserved memory: {used_memory} GB")
print(f"Memory used for LoRA training: {used_memory_for_lora} GB")

🔮 Running Inference

Run inference with the trained model:

messages = [
    {
        "role": "system",
        "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."
    },
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)

from transformers import TextStreamer
_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

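💡 Key Points

Memory Optimization Tips

An L4 GPU (22GB) is required: the notebook does not run on a T4
Set the batch size to 1: this accounts for the memory limit
gradient_accumulation_steps=4: raises the effective batch size to 4
Unsloth gradient checkpointing: 30% less VRAM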
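
The effective batch size mentioned in the tips above is just the product of the per-device batch size and the accumulation steps. A small sketch of that arithmetic, assuming a single GPU as in this guide (the function name is illustrative):

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
max_steps = 60
print(f"effective batch size: {batch}")
print(f"sequences seen in {max_steps} steps: {batch * max_steps}")
```

Gradient accumulation trades wall-clock time for memory: each update still sees four sequences, but only one is held on the GPU at a time.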

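Performance Statistics (measured on an L4 GPU)

Training time: about 24 minutes (60 steps)
Peak memory usage: about 21.9GB
Memory used for LoRA training: about 9.1GB
Memory utilization: about 98.7%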
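
These figures are related by simple arithmetic, mirroring the memory-statistics code in the training section. The sketch below recomputes them. The total of 22.19 GB is an assumption inferred from the quoted 98.7% utilization, not a number stated in this guide, and `memory_report` is a hypothetical helper.

```python
def memory_report(peak_gb: float, baseline_gb: float, total_gb: float) -> dict:
    """Derive training overhead and utilization from reserved-memory readings (all in GB)."""
    return {
        "training_overhead_gb": round(peak_gb - baseline_gb, 3),
        "utilization_pct": round(peak_gb / total_gb * 100, 1),
    }


peak = 21.9            # peak reserved memory quoted above
baseline = 21.9 - 9.1  # implied memory reserved before training started
total = 22.19          # assumption: total GPU memory implied by the 98.7% figure

print(memory_report(peak, baseline, total))
```

With under 2% of memory left free, even a modest increase in `max_seq_length` or batch size is likely to trigger an out-of-memory error on this GPU.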


📝 Summary

This guide explained how to fine-tune OpenAI's GPT-OSS 20B model on a Google Colab L4 GPU. Combining Unsloth with LoRA makes it possible to train a large language model even with limited resources.

If you have questions, feel free to ask on the Unsloth Discord!

📒 Notebook

Google Colab