[Google Colab included] Full-Model Fine-Tuning of Gemma 3 270M for Free

AI / Machine Learning

2025.08.21

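This guide explains how to fine-tune Gemma on a mobile game NPC dataset using Hugging Face Transformers and TRL. You will learn how to:

Set up the development environment
Prepare the fine-tuning dataset
Run full-model fine-tuning of Gemma with TRL and SFTTrainer
Test model inference and sanity-check the output quality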

Note: This guide was written to run on Google Colaboratory with an NVIDIA T4 GPU (16 GB) and Gemma 3 270M, but it can be adapted to larger GPUs and larger models.

The Japanese edition of the 📒 Google Colab notebook for full fine-tuning of Gemma 3 270M is here!
*I will share it later! https://t.co/3YIjobLxNf pic.twitter.com/Ut2H9JVJD5

— Maki@Sunwood AI Labs. (@hAru_mAki_ch) August 20, 2025

Setting up the development environment
Creating and preparing the fine-tuning dataset
Fine-tuning Gemma with TRL and SFTTrainer
Before fine-tuning
Training
Testing model inference
Summary and next steps
📒 Notebook
- Related

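Setting up the development environment

The first step is to install the Hugging Face libraries needed to fine-tune open models, including TRL and datasets, which cover supervised fine-tuning as well as RLHF and alignment techniques.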

# Install PyTorch and other libraries
%pip install torch tensorboard

# Install Hugging Face libraries
%pip install transformers datasets accelerate evaluate trl protobuf sentencepiece

# COMMENT IN: if you are running on a GPU that supports the BF16 data type and Flash Attention, such as NVIDIA L4 or NVIDIA A100
# %pip install flash-attn

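Note: If you are using a GPU based on the Ampere architecture or newer (for example NVIDIA A100, or the Ada Lovelace-based NVIDIA L4), you can use Flash Attention. Flash Attention is a method that significantly speeds up attention computation and reduces memory usage from quadratic to linear in sequence length, which can make training up to 3x faster. See FlashAttention for details.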
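Whether a given GPU qualifies can be checked from its CUDA compute capability: Ampere corresponds to 8.0, so anything at 8.0 or above meets the requirement above. The following is a minimal sketch, not part of the original notebook; the helper names `supports_flash_attention` and `describe_current_gpu` are hypothetical, and PyTorch is imported lazily so the pure check can be read on its own.

```python
def supports_flash_attention(capability):
    """Return True for compute capability 8.0 (Ampere) or newer."""
    major, _minor = capability
    return major >= 8


def describe_current_gpu():
    """Report the local GPU; needs PyTorch, which is imported lazily here."""
    import torch

    if not torch.cuda.is_available():
        return "No CUDA GPU detected"
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)
    verdict = "supported" if supports_flash_attention(capability) else "not supported"
    return f"{name}: compute capability {capability}, Flash Attention {verdict}"


# Compute capabilities: T4 is 7.5 (Turing), A100 is 8.0 (Ampere), L4 is 8.9 (Ada).
for gpu, cap in {"T4": (7, 5), "A100": (8, 0), "L4": (8, 9)}.items():
    print(gpu, supports_flash_attention(cap))
```

On the free Colab T4 this check comes back negative, which is consistent with the guide leaving the flash-attn install commented out.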

Before you start training, make sure you have accepted Gemma's terms of use. To accept the license on Hugging Face, click the "Agree and access repository" button on the model page at http://huggingface.co/google/gemma-3-270m-it.

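After accepting the license, you need a valid Hugging Face token to access the model. If you are running in Google Colab, you can store and use the token securely with Colab secrets. The token also needs write permission, because the model is pushed to the Hub during training.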

from google.colab import userdata
from huggingface_hub import login

# Log in to the Hugging Face Hub
hf_token = userdata.get('HF_TOKEN') # If you are running inside a Google Colab
login(hf_token)
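The cell above only works inside Colab, because `google.colab` is not importable elsewhere. As a rough sketch for running the same notebook on another machine, the token lookup could fall back to an environment variable. The helper name `get_hf_token` and the use of an `HF_TOKEN` environment variable are assumptions made for this illustration, not part of the original notebook.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: read the same name from the environment (may be None).
        return os.environ.get(secret_name)
    return userdata.get(secret_name)


token = get_hf_token()
print("token found" if token else "no token configured")
```

The returned value can then be passed to `login(...)` exactly as `hf_token` is in the cell above.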

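You can keep the results on Colab's local virtual machine, but saving intermediate results to Google Drive is strongly recommended. This keeps your training results safe and makes it easy to compare checkpoints and pick the best model.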

from google.colab import drive
drive.mount('/content/drive')

Select the base model to fine-tune, then set the checkpoint directory and the learning rate.

# Select the base model to fine-tune
base_model = "google/gemma-3-270m-it" # choose from ["google/gemma-3-270m-it","google/gemma-3-1b-it","google/gemma-3-4b-it","google/gemma-3-12b-it","google/gemma-3-27b-it"]
# Set checkpoint directory
checkpoint_dir = "/content/drive/MyDrive/MyGemmaNPC"
# Set learning rate
learning_rate = 5e-5

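Creating and preparing the fine-tuning dataset

The bebechien/MobileGameNPC dataset provides a small sample of conversations between a player and two alien NPCs (a Martian and a Venusian), each with a distinctive way of speaking. For example, the Martian NPC speaks with an accent that replaces 's' sounds with 'z', uses 'da' for 'the' and 'diz' for 'this', and includes occasional clicks such as *k'tak*.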

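This dataset illustrates a key principle of fine-tuning: the dataset size you need depends on the output you want.

To teach the model a stylistic variation of a language it already knows, such as the Martian accent, as few as 10 to 20 examples can be enough.
To teach it a completely new language or a mixed alien language, however, you need a significantly larger dataset.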
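One way to see why a stylistic variation needs so few examples is that the Martian accent described above boils down to a handful of surface rules on top of ordinary English. The toy function below applies those rules mechanically. It is only an illustration: `martianize` is a hypothetical helper, and it is not how the dataset was produced.

```python
import re


def martianize(text):
    """Toy rule-based version of the Martian accent (not how the dataset was built)."""
    # Whole-word swaps first, so "this" becomes "diz" rather than "thiz".
    text = re.sub(r"\bthis\b", "diz", text, flags=re.IGNORECASE)
    text = re.sub(r"\bthe\b", "da", text, flags=re.IGNORECASE)
    # Then turn the remaining 's' sounds into 'z'.
    text = text.replace("s", "z").replace("S", "Z")
    return text


print(martianize("This is the place where visitors rest."))
```

A new language has no such compact rule set relative to what the model already knows, which is why it calls for far more data.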

from datasets import load_dataset

def create_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": sample["player"]},
            {"role": "assistant", "content": sample["alien"]}
        ]
    }

# Select NPC type
npc_type = "martian" # choose from ["martian", "venusian"]

# Load dataset from the Hub
dataset = load_dataset("bebechien/MobileGameNPC", npc_type, split="train")

# Convert dataset to conversational format
dataset = dataset.map(create_conversation, remove_columns=dataset.column_names, batched=False)

# Split dataset into 80% training samples and 20% test samples
dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

# Print the first training example in conversational format
print(dataset["train"][0]["messages"])
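To make the target format concrete without downloading anything, the sketch below runs the same `create_conversation` mapping over a few made-up rows and mimics an unshuffled 80/20 split with plain list slicing. The rows are hypothetical placeholders, and the slicing only approximates `train_test_split(shuffle=False)`; for sizes that do not divide evenly the library's rounding may differ.

```python
def create_conversation(sample):
    # Same mapping as in the guide: player line -> user turn, alien line -> assistant turn.
    return {
        "messages": [
            {"role": "user", "content": sample["player"]},
            {"role": "assistant", "content": sample["alien"]},
        ]
    }


# Hypothetical rows; the real ones come from bebechien/MobileGameNPC.
rows = [
    {"player": f"Question {i}", "alien": f"Anzwer {i} *k'tak*"} for i in range(10)
]
conversations = [create_conversation(row) for row in rows]

# Without shuffling, the first 80% go to training and the last 20% to testing.
split_at = len(conversations) * 8 // 10
train, test = conversations[:split_at], conversations[split_at:]

print(conversations[0]["messages"])
print(len(train), len(test))
```

Each example ends up as a `messages` list of role/content dictionaries, which is the conversational format SFTTrainer can consume directly.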

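Fine-tuning Gemma with TRL and SFTTrainer

You are now ready to fine-tune the model. The SFTTrainer from Hugging Face TRL makes supervised fine-tuning of open LLMs straightforward. SFTTrainer is a subclass of the Trainer class from the transformers library and supports all of the same features.

The following code loads the Gemma model and tokenizer from Hugging Face.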

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="eager"
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

print(f"Device: {model.device}")
print(f"DType: {model.dtype}")

Before fine-tuning

The output below shows that the model's out-of-the-box capabilities may not be good enough for this use case.

from transformers import pipeline
from random import randint
import re

# Load the model and tokenizer into the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Load a random sample from the test dataset
rand_idx = randint(0, len(dataset["test"])-1)
test_sample = dataset["test"][rand_idx]

# Convert the test example into a prompt with the Gemma chat template
prompt = pipe.tokenizer.apply_chat_template(test_sample["messages"][:1], tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, disable_compile=True)

# Print the user query and the original answer, followed by the generated answer
print(f"Question:\n{test_sample['messages'][0]['content']}\n")
print(f"Original Answer:\n{test_sample['messages'][1]['content']}\n")
print(f"Generated Answer (base model):\n{outputs[0]['generated_text'][len(prompt):].strip()}")
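The cell above leans on two details: the chat template turns the user message into a Gemma-formatted prompt string, and the text-generation pipeline returns the prompt followed by the completion, which is why the answer is recovered by slicing at `len(prompt)`. The sketch below imitates both steps with plain strings. `format_gemma_prompt` is a hypothetical approximation of the single-turn template (the tokenizer's own template remains authoritative), and the generated text is made up.

```python
def format_gemma_prompt(user_message):
    """Approximate the single-turn Gemma prompt; the tokenizer's template is authoritative."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"  # this last line is what add_generation_prompt=True adds
    )


prompt = format_gemma_prompt("What is this place?")

# Hypothetical pipeline result: prompt text followed by the newly generated text.
generated_text = prompt + "Diz iz da market, traveler. *k'tak*"

# Slicing by len(prompt) keeps only the newly generated answer.
answer = generated_text[len(prompt):].strip()
print(answer)
```

Note that Gemma labels the assistant turn as `model`, so the prompt ends with an open `model` turn for the completion to fill.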

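The example above checks the model's primary function, generating in-game dialogue, while the next example is designed to test character consistency. It challenges the model with an off-topic prompt that falls outside the character's knowledge base, such as "Sorry, you are a game NPC."

The goal is to see whether the model stays in character instead of answering the out-of-context question. This serves as a baseline for evaluating how effectively the fine-tuning process instills the desired persona.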

outputs = pipe([{"role": "user", "content": "Sorry, you are a game NPC."}], max_new_tokens=256, disable_compile=True)
print(outputs[0]['generated_text'][1]['content'])
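Judging "staying in character" by eye works for a handful of replies. For a rough before/after comparison, a crude heuristic could count how many of the Martian accent markers mentioned in this guide appear in a reply. The marker list and the `accent_score` helper below are assumptions made for this sketch, not an established metric.

```python
import re

# Hypothetical accent markers, taken from the Martian style described in this guide.
MARKERS = (r"\bda\b", r"\bdiz\b", r"k'tak", r"\b\w*z\w*\b")


def accent_score(reply):
    """Fraction of marker patterns that appear at least once in the reply."""
    hits = sum(bool(re.search(p, reply, flags=re.IGNORECASE)) for p in MARKERS)
    return hits / len(MARKERS)


# Made-up replies: a generic assistant answer and an in-character one.
print(accent_score("I'm sorry, I am an AI assistant."))
print(accent_score("Diz iz da market. *k'tak*"))
```

A score near zero before fine-tuning and a higher score afterwards would suggest the persona took hold, though a human read of the outputs remains the real test.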

You can also use prompt engineering to steer the tone, but the results are unpredictable and do not always match the persona you want.

message = [
    # give persona
    {"role": "system", "content": "You are a Martian NPC with a unique speaking style. Use an accent that replaces 's' sounds with 'z', uses 'da' for 'the', 'diz' for 'this', and includes occasional clicks like *k'tak*."},
]

# few-shot prompt
for item in dataset['test']:
    message.append(
        {"role": "user", "content": item["messages"][0]["content"]}
    )
    message.append(
        {"role": "assistant", "content": item["messages"][1]["content"]}
    )

# actual question
message.append(
    {"role": "user", "content": "What is this place?"}
)

outputs = pipe(message, max_new_tokens=256, disable_compile=True)
print(outputs[0]['generated_text'])
print("-"*80)
print(outputs[0]['generated_text'][-1]['content'])

Training

Before you start training, define the hyperparameters you want to use in an SFTConfig instance.

from trl import SFTConfig

torch_dtype = model.dtype

args = SFTConfig(
    output_dir=checkpoint_dir,              # directory to save and repository id
    max_length=512,                         # max sequence length for model and packing of the dataset
    packing=False,                          # if True, groups multiple samples in the dataset into a single sequence
    num_train_epochs=5,                     # number of training epochs
    per_device_train_batch_size=4,          # batch size per device during training
    gradient_checkpointing=False,           # disabled; caching is incompatible with gradient checkpointing
    optim="adamw_torch_fused",              # use fused AdamW optimizer
    logging_steps=1,                        # log every step
    save_strategy="epoch",                  # save checkpoint every epoch
    eval_strategy="epoch",                  # evaluate checkpoint every epoch
    learning_rate=learning_rate,            # learning rate
    fp16=torch_dtype == torch.float16,      # use float16 precision if the model was loaded in float16
    bf16=torch_dtype == torch.bfloat16,     # use bfloat16 precision if the model was loaded in bfloat16
    lr_scheduler_type="constant",           # use constant learning rate scheduler
    push_to_hub=True,                       # push model to hub
    report_to="tensorboard",                # report metrics to tensorboard
    dataset_kwargs={
        "add_special_tokens": False, # template with special tokens
        "append_concat_token": True, # add EOS token as separator token between examples
    }
)
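With these settings it is easy to estimate how long a run is before starting it. For a single device without packing or gradient accumulation, each epoch takes one optimizer step per batch, so the total is the number of batches per epoch times the number of epochs. The sketch below does that arithmetic; the training-set size of 20 is a hypothetical figure, while the batch size and epoch count mirror the config above.

```python
import math


def count_optimizer_steps(num_samples, batch_size, epochs):
    """Optimizer steps for a single-device run without packing or gradient accumulation."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * epochs


# 20 training samples is a hypothetical size; batch size 4 and 5 epochs match the config.
print(count_optimizer_steps(num_samples=20, batch_size=4, epochs=5))
```

Because `logging_steps=1` logs every step and `save_strategy="epoch"` saves once per epoch, a run like this would produce one loss entry per step and one checkpoint per epoch.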

You now have all the building blocks needed to create the SFTTrainer and start training the model.

from trl import SFTTrainer

# Create Trainer object
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
)

Call the train() method to start training.

# Start training; the model will be automatically saved to the Hub and the output directory
trainer.train()

# Save the final model again to the Hugging Face Hub
trainer.save_model()

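To plot the training and validation losses, you typically extract these values from the TrainerState object or from the logs generated during training.

You can then visualize them with a library such as Matplotlib. The x-axis represents training steps or epochs, and the y-axis represents the corresponding loss values.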

import matplotlib.pyplot as plt

# Access the log history
log_history = trainer.state.log_history

# Extract training / validation loss
train_losses = [log["loss"] for log in log_history if "loss" in log]
epoch_train = [log["epoch"] for log in log_history if "loss" in log]
eval_losses = [log["eval_loss"] for log in log_history if "eval_loss" in log]
epoch_eval = [log["epoch"] for log in log_history if "eval_loss" in log]

# Plot the training and validation loss
plt.plot(epoch_train, train_losses, label="Training Loss")
plt.plot(epoch_eval, eval_losses, label="Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Training and Validation Loss per Epoch")
plt.legend()
plt.grid(True)
plt.show()

This visualization helps you monitor the training process and make informed decisions about hyperparameter tuning and early stopping.
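The same log entries can also be read programmatically, for example to find the epoch with the lowest validation loss as a candidate stopping point. The sketch below assumes entries shaped like those in `trainer.state.log_history` (dictionaries carrying `epoch` plus either `loss` or `eval_loss`); the numbers are invented for illustration and `best_eval_epoch` is a hypothetical helper.

```python
# Hypothetical log history, shaped like trainer.state.log_history entries.
log_history = [
    {"loss": 2.9, "epoch": 1.0},
    {"eval_loss": 2.4, "epoch": 1.0},
    {"loss": 1.1, "epoch": 2.0},
    {"eval_loss": 1.8, "epoch": 2.0},
    {"loss": 0.3, "epoch": 3.0},
    {"eval_loss": 2.1, "epoch": 3.0},
]


def best_eval_epoch(history):
    """Return (epoch, eval_loss) for the entry with the lowest validation loss."""
    evals = [(log["epoch"], log["eval_loss"]) for log in history if "eval_loss" in log]
    return min(evals, key=lambda pair: pair[1])


print(best_eval_epoch(log_history))
```

In this made-up run the validation loss bottoms out at epoch 2 and rises afterwards while the training loss keeps falling, which is the pattern that would normally argue for stopping early.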

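Training loss measures the error on the data the model was trained on, while validation loss measures the error on a separate dataset the model has not seen before. Monitoring both helps you detect overfitting, where the model performs well on the training data but poorly on unseen data.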

validation loss >> training loss: overfitting
validation loss > training loss: some overfitting
validation loss < training loss: some underfitting
validation loss << training loss: underfitting
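The four cases above can be turned into a small helper for labeling a run from its final loss values. The guide does not say how large a gap counts as ">>", so the ratio `big_gap=1.5` below is an arbitrary assumption, and `diagnose_fit` is a hypothetical name; treat it as a sketch for positive loss values only.

```python
def diagnose_fit(train_loss, val_loss, big_gap=1.5):
    """Map a loss pair onto the four cases; big_gap is an arbitrary ratio for '>>'."""
    if val_loss >= train_loss * big_gap:
        return "overfitting"
    if val_loss > train_loss:
        return "some overfitting"
    if val_loss * big_gap <= train_loss:
        return "underfitting"
    if val_loss < train_loss:
        return "some underfitting"
    return "balanced"


# Made-up loss pairs, one per case.
for pair in [(0.3, 2.1), (1.0, 1.1), (1.0, 0.9), (2.0, 1.0)]:
    print(pair, diagnose_fit(*pair))
```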

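Testing model inference

Once training is complete, you will want to evaluate and test the model. You can load different samples from the test dataset and evaluate the model on them.

For this particular use case, the best model is a matter of preference. Interestingly, what we would normally call "overfitting" can be very useful for a game NPC: it makes the model forget general information and lock onto the specific persona and traits it was trained on, so that it stays in character consistently.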

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = checkpoint_dir

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="eager"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
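The cell above loads the final model from the root of the checkpoint directory. Since the Trainer also writes one `checkpoint-<step>` subdirectory per save, an intermediate checkpoint can be compared by pointing `model_id` at one of those folders instead. The sketch below lists them in step order. `list_checkpoints` is a hypothetical helper, and the demo uses a temporary directory in place of the real checkpoint directory.

```python
import re
import tempfile
from pathlib import Path


def list_checkpoints(output_dir):
    """Return checkpoint-<step> subdirectories of output_dir, sorted by step."""
    found = []
    for path in Path(output_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    return [path for _step, path in sorted(found)]


# Demo on a temporary directory standing in for the checkpoint directory.
with tempfile.TemporaryDirectory() as tmp:
    for step in (5, 10, 25):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    names = [p.name for p in list_checkpoints(tmp)]
    print(names)
```

Sorting by the numeric step avoids the usual trap where a plain string sort would place `checkpoint-10` before `checkpoint-5`.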

Load all the questions from the test dataset and generate outputs.

from transformers import pipeline

# Load the model and tokenizer into the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

def test(test_sample):
    # Convert the test example into a prompt with the Gemma chat template
    prompt = pipe.tokenizer.apply_chat_template(test_sample["messages"][:1], tokenize=False, add_generation_prompt=True)
    outputs = pipe(prompt, max_new_tokens=256, disable_compile=True)

    # Print the user query, the original answer, and the generated answer
    print(f"Question:\n{test_sample['messages'][0]['content']}")
    print(f"Original Answer:\n{test_sample['messages'][1]['content']}")
    print(f"Generated Answer:\n{outputs[0]['generated_text'][len(prompt):].strip()}")
    print("-"*80)

# Test with an unseen dataset
for item in dataset['test']:
    test(item)

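If you try the original generic prompt again, you can see that the model still tries to answer in the style it was trained on. In this example, overfitting and catastrophic forgetting actually benefit the game NPC, because the model starts to forget general knowledge that may not apply to the character. The same holds for other kinds of full fine-tuning where the goal is to restrict the output to a specific data format.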

outputs = pipe([{"role": "user", "content": "Sorry, you are a game NPC."}], max_new_tokens=256, disable_compile=True)
print(outputs[0]['generated_text'][1]['content'])