2D spatial understanding with Gemini 2.0 (Object detection) 📒 Google Colab notebook included

AI / Machine Learning

2025.03.02

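This notebook introduces object detection and spatial understanding with the Gemini API, as shown in the AI Studio Spatial Understanding sample and the Building with Gemini 2.0: Spatial understanding video.

You will learn how to use Gemini the same way as the demo to perform object detection.

The notebook includes a variety of examples:

Object detection that simply overlays information
Searching within an image
Translating and understanding content in multiple languages
Using Gemini's thinking abilities

Note

There is no "magic prompt". Feel free to experiment with different ones. You can pick one of the samples from the dropdown, write your own prompt, or upload your own images to try.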

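Setup
Overlaying information
Searching within an image
Multilingual capabilities
Using Gemini's reasoning capabilities
Preliminary capabilities: pointing and 3D boxes
What's next
📒 Notebook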

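Setup

Install the SDK

The new Google Gen AI SDK provides programmatic access to Gemini 2.0 (and earlier models) through both the Google AI for Developers and Vertex AI APIs. With a few exceptions, code that runs on one platform also runs on the other. This means you can prototype an application with the Developer API and then migrate it to Vertex AI without rewriting your code.

For more details about this new SDK, see the documentation or the getting started notebook.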

# Install the SDK
!pip install -U -q google-genai

Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named GOOGLE_API_KEY. If you don't have an API key yet, or you're not sure how to create a Colab Secret, see the Authentication example.

# Get the API key
from google.colab import userdata
import os

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

Initialize the SDK client

With the new SDK, you only need to initialize a client with your API key.

# Initialize the client
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

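Select and configure a model

The Gemini 2.0 Flash model is the best fit for spatial understanding. You can try older models, but their results may be less consistent (in the previous generation, gemini-1.5-flash-001 gave the best results). The Object detection example contains good examples of what earlier models could do.

For details on all Gemini models, see the documentation.

# Select the model to use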
model_name = "gemini-2.0-flash" # @param ["gemini-1.5-flash-latest","gemini-2.0-flash-lite","gemini-2.0-flash","gemini-2.0-pro-exp-02-05"] {"allow-input":true}

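System instructions

With the new SDK, the system instructions (the system_instruction field of the config) and the model parameter must be passed in every generate_content call, so let's save them to avoid retyping them each time.

# Set the system instructions for bounding boxes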
bounding_box_system_instructions = """
    Return bounding boxes as a JSON array with labels. Never return masks or code fencing. Limit to 25 objects.
    If an object is present multiple times, name them according to their unique characteristic (colors, size, position, unique characteristics, etc..).
      """

# Define the safety settings
safety_settings = [
    types.SafetySetting(
        category="HARM_CATEGORY_DANGEROUS_CONTENT",
        threshold="BLOCK_ONLY_HIGH",
    ),
]

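The system instructions are mainly used to keep the prompts short, since the output format doesn't have to be repeated each time. They also tell the model how to handle similar objects, which gives it room to be creative when naming them.

The Spatial Understanding sample uses a different strategy: no system instructions, but a longer prompt. You can see the full prompt by clicking the "show raw prompt" button on the right. There is no single best solution; experiment with different strategies and find the one that suits your use case.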

Imports

Import all the required modules.

# Import the required modules
from PIL import Image

import io
import os
import requests
from io import BytesIO

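Utils

Some scripts are needed to draw the bounding boxes. They are only examples; feel free to write your own.

For example, the AI Studio Spatial Understanding sample renders the bounding boxes with HTML. Its code is available in the GitHub repository.

# Utility functions for plotting

# Install the Noto CJK fonts so that Japanese characters can be displayed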
!apt-get install fonts-noto-cjk  # For Noto Sans CJK JP

#!apt-get install fonts-source-han-sans-jp # For Source Han Sans (Japanese)

import json
import random
import io
from PIL import Image, ImageDraw, ImageFont
from PIL import ImageColor

additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]

def plot_bounding_boxes(im, bounding_boxes):
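    """
    Plots bounding boxes on an image with PIL, using normalized coordinates and a different color per box.

    Args:
        im: The PIL image to draw on (it is modified in place).
        bounding_boxes: The model output as JSON text (optionally wrapped in a Markdown code fence): a list of bounding boxes, each with the object's name and its position in normalized [y1, x1, y2, x2] format.
    """

    # Work on the image that was passed in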
    img = im
    width, height = img.size
    print(img.size)
    # Create a drawing object
    draw = ImageDraw.Draw(img)

    # Define a list of colors
    colors = [
    'red',
    'green',
    'blue',
    'yellow',
    'orange',
    'pink',
    'purple',
    'brown',
    'gray',
    'beige',
    'turquoise',
    'cyan',
    'magenta',
    'lime',
    'navy',
    'maroon',
    'teal',
    'olive',
    'coral',
    'lavender',
    'violet',
    'gold',
    'silver',
    ] + additional_colors

    # Strip the Markdown fencing
    bounding_boxes = parse_json(bounding_boxes)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", size=14)

    # Iterate over the bounding boxes
    for i, bounding_box in enumerate(json.loads(bounding_boxes)):
      # Pick a color from the list
      color = colors[i % len(colors)]

      # Convert normalized coordinates (0-1000) to absolute pixel coordinates
      abs_y1 = int(bounding_box["box_2d"][0]/1000 * height)
      abs_x1 = int(bounding_box["box_2d"][1]/1000 * width)
      abs_y2 = int(bounding_box["box_2d"][2]/1000 * height)
      abs_x2 = int(bounding_box["box_2d"][3]/1000 * width)

      if abs_x1 > abs_x2:
        abs_x1, abs_x2 = abs_x2, abs_x1

      if abs_y1 > abs_y2:
        abs_y1, abs_y2 = abs_y2, abs_y1

      # Draw the bounding box
      draw.rectangle(
          ((abs_x1, abs_y1), (abs_x2, abs_y2)), outline=color, width=4
      )

      # Draw the label text
      if "label" in bounding_box:
        draw.text((abs_x1 + 8, abs_y1 + 6), bounding_box["label"], fill=color, font=font)

    # Display the image
    img.show()

# Helper that extracts the JSON payload from the model output
def parse_json(json_output):
    # Strip the Markdown fencing
    lines = json_output.splitlines()
    for i, line in enumerate(lines):
        if line == "```json":
            json_output = "\n".join(lines[i+1:])  # Remove everything before "```json"
            json_output = json_output.split("```")[0]  # Remove everything after the closing "```"
            break  # Exit the loop once "```json" is found
    return json_output
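
To make the fence handling concrete, here is a small standalone sketch that strips a Markdown fence from a hand-written sample string and decodes the result with json.loads. The function name extract_json_payload and the sample text are illustrative assumptions, not part of the notebook or the SDK. Unlike parse_json, which only looks for a line that is exactly the opening fence, this variant searches for the fence anywhere in the text and also accepts unfenced output.

```python
import json

# Build the fence marker programmatically so this example can itself
# live inside a fenced code block.
FENCE = "`" * 3


def extract_json_payload(text: str) -> str:
    """Return the text between an opening json fence and its closing fence.

    If no opening fence is found, the stripped input is returned unchanged.
    """
    fence_open = FENCE + "json"
    start = text.find(fence_open)
    if start == -1:
        return text.strip()
    body = text[start + len(fence_open):]
    end = body.find(FENCE)
    if end != -1:
        body = body[:end]
    return body.strip()


# Hand-written sample that mimics a fenced model answer (not real model output).
sample_output = "\n".join([
    "Here are the boxes:",
    FENCE + "json",
    '[{"box_2d": [100, 200, 300, 400], "label": "red sprinkles"}]',
    FENCE,
])

boxes = json.loads(extract_json_payload(sample_output))
print(boxes[0]["label"])  # red sprinkles
```

Returning the stripped input when no fence is present means the same call works whether or not the model follows the "never return code fencing" instruction.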

Get example images

# Download the sample images
!wget https://storage.googleapis.com/generativeai-downloads/images/socks.jpg -O Socks.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/vegetables.jpg -O Vegetables.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/Japanese_Bento.png -O Japanese_bento.png -q
!wget https://storage.googleapis.com/generativeai-downloads/images/Cupcakes.jpg -O Cupcakes.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/origamis.jpg -O Origamis.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/fruits.jpg -O Fruits.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/cat.jpg -O Cat.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/pumpkins.jpg -O Pumpkins.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/breakfast.jpg -O Breakfast.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/bookshelf.jpg -O Bookshelf.jpg -q
!wget https://storage.googleapis.com/generativeai-downloads/images/spill.jpg -O Spill.jpg -q

Overlaying information

Let's start by loading an image. The cupcakes image is used as an example:

# Select an image and display a thumbnail
image = "Cupcakes.jpg" # @param ["Socks.jpg","Vegetables.jpg","Japanese_bento.png","Cupcakes.jpg","Origamis.jpg","Fruits.jpg","Cat.jpg","Pumpkins.jpg","Breakfast.jpg","Bookshelf.jpg", "Spill.jpg"] {"allow-input":true}

im = Image.open(image)
im.thumbnail([620,620], Image.Resampling.LANCZOS)
im

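Let's start with a simple prompt to find all the items in the image.

To keep the model from repeating itself, a temperature above 0 is recommended, here 0.5. Limiting the number of items (to 25 in the system instructions) is another way to keep the model from looping, and it speeds up decoding of the bounding boxes. You can experiment with these parameters to find what works best for your use case.

# Prompt for bounding box detection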
prompt = "Detect the 2d bounding boxes of the cupcakes (with 'label' as topping description)"  # @param {type:"string"}

# Load and resize the image
im = Image.open(BytesIO(open(image, "rb").read()))
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)

# Run the model to find the bounding boxes
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Check the output
print(response.text)

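As you can see, even though the exact schema is never specified, Gemini is trained to always use this format, with a label and the bounding box coordinates in a "box_2d" array.

Note that, unlike the common convention, the y coordinates come first, followed by the x coordinates. The values are normalized to a 0-1000 range, which is why the plotting utility divides them by 1000.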
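
The coordinate convention can be sketched as a small pure-Python helper. It uses the same arithmetic as plot_bounding_boxes: swap the y-first order into the (x, y) order that most drawing APIs expect, and scale from the 0-1000 range to pixels. The name to_pixel_box and the sample values are hypothetical, chosen only for illustration.

```python
def to_pixel_box(box_2d, width, height):
    """Convert a normalized [y1, x1, y2, x2] box (0-1000) to pixel (x1, y1, x2, y2)."""
    y1, x1, y2, x2 = box_2d
    abs_x1 = int(x1 / 1000 * width)
    abs_y1 = int(y1 / 1000 * height)
    abs_x2 = int(x2 / 1000 * width)
    abs_y2 = int(y2 / 1000 * height)
    # Reorder the corners so the first one is always the top-left one.
    left, right = sorted((abs_x1, abs_x2))
    top, bottom = sorted((abs_y1, abs_y2))
    return left, top, right, bottom


# A box covering y 25%-75% and x 10%-50% of a 640x480 image.
print(to_pixel_box([250, 100, 750, 500], width=640, height=480))
# (64, 120, 320, 360)
```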

# Draw the bounding boxes
plot_bounding_boxes(im, response.text)
im

Searching within an image

Let's make it more complex and search for specific objects in the image.

# Search for objects within an image
image = "Socks.jpg" # @param ["Socks.jpg","Vegetables.jpg","Japanese_bento.png","Cupcakes.jpg","Origamis.jpg","Fruits.jpg","Cat.jpg","Pumpkins.jpg","Breakfast.jpg","Bookshelf.jpg", "Spill.jpg"] {"allow-input":true}
prompt = "Show me the positions of the socks with the face"  # @param ["Detect all rainbow socks", "Find all socks and label them with emojis ", "Show me the positions of the socks with the face","Find the sock that goes with the one at the top"] {"allow-input":true}

# Load and resize the image
im = Image.open(image)
im.thumbnail([640,640], Image.Resampling.LANCZOS)

# Run the model to find the bounding boxes
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Check the output
print(response.text)

# Generate the image with the bounding boxes
plot_bounding_boxes(im, response.text)
im

Try different images and prompts. Several samples are suggested, but you can also write your own prompts.
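
The cells in this notebook repeat the same generate_content call and change only the image and the prompt. One way to cut the repetition is a thin wrapper like the sketch below. The name find_boxes is hypothetical, and the fake client is only an offline stand-in so the sketch runs without an API key; the keyword arguments model, contents and config are the ones used in the cells above.

```python
from types import SimpleNamespace


def find_boxes(client, model_name, config, prompt, image):
    """Send one prompt and one image, and return the raw text of the response."""
    response = client.models.generate_content(
        model=model_name,
        contents=[prompt, image],
        config=config,
    )
    return response.text


# Offline stand-in for the real client (an assumption for this sketch only).
class _FakeModels:
    def generate_content(self, model, contents, config):
        return SimpleNamespace(
            text='[{"box_2d": [0, 0, 500, 500], "label": "demo"}]'
        )


fake_client = SimpleNamespace(models=_FakeModels())
print(find_boxes(fake_client, "gemini-2.0-flash", None, "Detect the socks", image=None))
```

In the notebook, you would pass the real client, model_name, a types.GenerateContentConfig built as in the cells above, and the PIL image, then hand the returned text to plot_bounding_boxes.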

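Multilingual capabilities

Since Gemini understands multiple languages, you can combine spatial reasoning with multilingual capabilities.

As with this image, you can prompt the model to label each item with Japanese characters and an English translation. The model reads the text in the image itself, recognizes the pictures, and translates them.

# Multilingual detection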
image = "Japanese_bento.png" # @param ["Socks.jpg","Vegetables.jpg","Japanese_bento.png","Cupcakes.jpg","Origamis.jpg","Fruits.jpg","Cat.jpg","Pumpkins.jpg","Breakfast.jpg","Bookshelf.jpg", "Spill.jpg"] {"allow-input":true}
prompt = "Detect food, label them with Japanese characters + english translation."  # @param ["Detect food, label them with Japanese characters + english translation.", "Show me the vegan dishes","Explain what those dishes are with a 5 words description","Find the dishes with allergens and label them accordingly"] {"allow-input":true}

# Load and resize the image
im = Image.open(image)
im.thumbnail([640,640], Image.Resampling.LANCZOS)

# Run the model to find the bounding boxes
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Generate the image with the bounding boxes
plot_bounding_boxes(im, response.text)
im

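Using Gemini's reasoning capabilities

The model can also reason about the image. You can ask about the items' positions or their use or, as in this example, ask it to find the shadow of a specific item.

# Use the reasoning capabilities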
image = "Origamis.jpg" # @param ["Socks.jpg","Vegetables.jpg","Japanese_bento.png","Cupcakes.jpg","Origamis.jpg","Fruits.jpg","Cat.jpg","Pumpkins.jpg","Breakfast.jpg","Bookshelf.jpg", "Spill.jpg"] {"allow-input":true}
prompt = "Draw a square around the fox' shadow"  # @param ["Find the two origami animals.", "Where are the origamis' shadows?","Draw a square around the fox' shadow"] {"allow-input":true}

# Load and resize the image
im = Image.open(image)
im.thumbnail([640,640], Image.Resampling.LANCZOS)

# Run the model to find the bounding boxes
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Generate the image with the bounding boxes
plot_bounding_boxes(im, response.text)
im

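You can also use Gemini's knowledge to enrich the returned labels. In this example, Gemini gives advice on how to fix a small mistake.

As you can see, the image is resized to 1024px this time. This gives the model a larger view of the scene, which helps it provide advice. There is no clear rule about when to do so; experiment and find what works best.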
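
To see what the two resize targets mean in pixels, the sketch below approximates the aspect-preserving downscale that Image.thumbnail performs. The helper fit_within is hypothetical, Pillow's own rounding may differ by a pixel, and the 4000x3000 source size is an assumed example rather than the actual size of the sample image.

```python
def fit_within(width, height, max_side):
    """Approximate the size of an image after an aspect-preserving downscale.

    The image is shrunk so that both sides fit within max_side.
    Images that already fit are left unchanged (no upscaling).
    """
    scale = min(max_side / width, max_side / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# Assumed 4000x3000 photo, downscaled to the two targets used in this notebook.
print(fit_within(4000, 3000, 640))   # (640, 480)
print(fit_within(4000, 3000, 1024))  # (1024, 768)
```

For this assumed photo, the 1024px target keeps about 2.5 times as many pixels as the 640px one, at the cost of a larger request.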

# Labeling with the model's knowledge
image = "Spill.jpg" # @param ["Socks.jpg","Vegetables.jpg","Japanese_bento.png","Cupcakes.jpg","Origamis.jpg","Fruits.jpg","Cat.jpg","Pumpkins.jpg","Breakfast.jpg","Bookshelf.jpg", "Spill.jpg"] {"allow-input":true}
prompt = "Tell me how to clean my table with an explanation as label. Do not just label the items"  # @param ["Show me where my coffee was spilled.", "Tell me how to clean my table with an explanation as label. Do not just label the items","Draw a square around the fox' shadow"] {"allow-input":true}

# Load the image and resize it to 1024px
im = Image.open(image)
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)

# Run the model to find the bounding boxes
response = client.models.generate_content(
    model=model_name,
    contents=[prompt, im],
    config = types.GenerateContentConfig(
        system_instruction=bounding_box_system_instructions,
        temperature=0.5,
        safety_settings=safety_settings,
    )
)

# Generate the image with the bounding boxes
plot_bounding_boxes(im, response.text)
im

# Check the response text
response.text

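Looking back at the previous examples, especially the Japanese food one, you will find other prompt samples for experimenting with Gemini's reasoning capabilities.

Preliminary capabilities: pointing and 3D boxes

Pointing and 3D bounding boxes are experimental model capabilities. For a preview of these upcoming capabilities, check out this other notebook.

What's next

For a more complete end-to-end example, the code of the AI Studio Spatial Understanding sample is available on GitHub.

The Gemini 2.0 cookbook contains many other examples of Gemini 2.0 capabilities. The Live API and video understanding examples are particularly useful.

Related to image recognition and reasoning, the Market a Jet Backpack and Guess the shape examples are also worth a look to continue discovering the Gemini API (note: these examples still use the old SDK). And of course there is the pointing and 3D boxes example referenced earlier.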

📒 Notebook

Google Colab

cookbook/quickstarts/Spatial_understanding.ipynb at 28fc33fbc2189a30a682148165ea6049ffa93db0 · google-gemini/cookbook
