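Trying Out Real-Time Microphone Transcription with Deepgram

Deepgram is a speech recognition platform. This article explains, in a beginner-friendly way, how to use Deepgram to convert microphone audio to text in real time. Using a simple Python script, we will walk through everything from setting up Deepgram to the actual transcription.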

Demo video

Tried high-speed transcription with Deepgram!!!
This is blazing fast!!! https://t.co/jpvRJ55aW4 pic.twitter.com/OAiyM4dDrf

— Maki@Sunwood AI Labs. (@hAru_mAki_ch) March 2, 2024

Setup

First, set up an environment for Deepgram. Run the following commands to create the Python environment you need.

Create a new Python environment.


conda create -n deepgram python=3.11

Activate the environment you created.


conda activate deepgram

Install the Deepgram SDK and the other required packages.


pip install deepgram-sdk
pip install python-dotenv
pip install PyAudio

Also create a .env file containing your Deepgram API key.


DEEPGRAM_API_KEY=XXXXXXXXXXXXXXXXX

You are now ready to use Deepgram.
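
Before running the script, it can help to confirm that the key from .env actually reached the process. The following is a minimal sketch using only the standard library. The helper names mask_key and describe_api_key are hypothetical and are not part of the Deepgram SDK.

```python
import os


def mask_key(key: str, visible: int = 4) -> str:
    """Return the key with all but the last `visible` characters hidden."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


def describe_api_key(env=os.environ) -> str:
    """Report whether DEEPGRAM_API_KEY is present, without printing it in full."""
    key = env.get("DEEPGRAM_API_KEY", "").strip()
    if not key:
        return "DEEPGRAM_API_KEY is not set"
    return f"DEEPGRAM_API_KEY is set ({mask_key(key)})"


print(describe_api_key())
```

In the real script, calling describe_api_key() right after load_dotenv() would show whether the .env file was picked up.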

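Script overview

The script for real-time transcription is shown below. It takes the microphone as input and transcribes what is being said. Capturing microphone audio requires an additional component, which the PyAudio package installed above provides.

Let's look at the main parts of the script.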


from dotenv import load_dotenv
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions, Microphone

# Load environment variables (DEEPGRAM_API_KEY) from .env
load_dotenv()

def main():
    # Create the Deepgram client
    deepgram = DeepgramClient()

    # Open a live transcription connection
    dg_connection = deepgram.listen.live.v("1")

    # Callback invoked for each transcript event
    def on_message(self, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if len(sentence) == 0:
            return
        print(f"Speaker: {sentence}")

    # Register the callback (other event handlers omitted)
    dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)

    # Configure the live transcription options
    options = LiveOptions(
        model="nova-2",
        punctuate=True,
        language="ja",
        # remaining options omitted here
    )
    dg_connection.start(options)

    # Start the microphone stream
    microphone = Microphone(dg_connection.send)
    microphone.start()

    # Wait until the user stops the recording
    input("Press Enter to stop recording...\n\n")
    microphone.finish()
    dg_connection.finish()

    print("Finished")

if __name__ == "__main__":
    main()

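This script uses Deepgram's real-time API to transcribe microphone input as it arrives. The key points are configuring DeepgramClient, opening the real-time connection, and sending audio data from the microphone.

Script details

Environment setup and imports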

from dotenv import load_dotenv
import logging, verboselogs
from time import sleep
from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
    Microphone,
)

load_dotenv()

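This section imports the required libraries. dotenv loads environment variables, and logging and verboselogs are for log output. The Deepgram SDK classes are imported from deepgram. load_dotenv() is the function that loads environment variables from the .env file.

Main function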

def main():
    try:
        deepgram: DeepgramClient = DeepgramClient()

Inside main(), an instance of DeepgramClient is created. This instance manages communication with the Deepgram API.

Configuring real-time transcription

dg_connection = deepgram.listen.live.v("1")

def on_message(self, result, **kwargs):
    ...
def on_metadata(self, metadata, **kwargs):
    ...
def on_speech_started(self, speech_started, **kwargs):
    ...
def on_utterance_end(self, utterance_end, **kwargs):
    ...
def on_error(self, error, **kwargs):
    ...

dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)
...

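This part sets up the connection used for real-time transcription through the Deepgram API. dg_connection represents the connection to the real-time API, and the code registers callback functions that are invoked when various events occur. For example, on_message is called each time a transcript result arrives. With interim results enabled, that includes partial results as well as finished ones.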
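
When interim results are enabled (the full script sets interim_results=True), transcript events can arrive several times for the same utterance. The sketch below labels each line as interim or final. It assumes the event object exposes an is_final flag, and fake_result is a hypothetical stand-in so the example runs without the SDK or a microphone.

```python
from types import SimpleNamespace


def format_transcript(result):
    """Return a display line for a transcript event, or None if it is empty."""
    sentence = result.channel.alternatives[0].transcript
    if len(sentence) == 0:
        return None
    # Assumption: the event exposes an `is_final` flag; treat it as final if absent.
    label = "final" if getattr(result, "is_final", True) else "interim"
    return f"[{label}] {sentence}"


def fake_result(text, is_final):
    """Hypothetical stand-in that mimics only the fields used above."""
    alternative = SimpleNamespace(transcript=text)
    channel = SimpleNamespace(alternatives=[alternative])
    return SimpleNamespace(channel=channel, is_final=is_final)


print(format_transcript(fake_result("hello", False)))  # [interim] hello
print(format_transcript(fake_result("hello world", True)))  # [final] hello world
```

Inside on_message, the returned line could be printed only when it is not None, which keeps the empty-transcript check from the original handler.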

Setting transcription options

options: LiveOptions = LiveOptions(
    model="nova-2",
    punctuate=True,
    language="ja",
    # ... (remaining options omitted)
)
dg_connection.start(options)

LiveOptions sets the options for real-time transcription. Here it specifies the model, automatic punctuation, the language, and so on.

Starting and stopping the microphone

microphone = Microphone(dg_connection.send)
microphone.start()
input("Press Enter to stop recording...\n\n")
microphone.finish()
dg_connection.finish()

Audio input from the microphone starts, recording continues until the user presses Enter, and then recording stops. The Microphone class is used to send audio data to the Deepgram API.
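
The idea of handing a send function to the microphone can be illustrated with a small conceptual sketch. This is not the SDK's Microphone implementation. It only shows the pattern of reading audio in fixed-size chunks and passing each chunk to a callback. The function stream_chunks and the in-memory stand-ins are hypothetical.

```python
import io


def stream_chunks(source, send, chunk_size=4096):
    """Read `source` in fixed-size chunks and pass each one to `send`.

    Returns the number of chunks delivered.
    """
    count = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # end of stream
            break
        send(chunk)
        count += 1
    return count


# Stand-ins: an in-memory buffer plays the microphone, a list plays the connection.
received = []
fake_audio = io.BytesIO(b"\x00" * 10000)
sent = stream_chunks(fake_audio, received.append, chunk_size=4096)
print(sent, [len(c) for c in received])  # 3 [4096, 4096, 1808]
```

In the real script, dg_connection.send takes the place of received.append, and the microphone keeps producing chunks until finish() is called.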

Exception handling

except Exception as e:
    print(f"Could not open socket: {e}")

If an error occurs, the exception is caught and an error message is printed.
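
One thing to keep in mind is that except Exception does not catch KeyboardInterrupt, so pressing Ctrl+C while waiting would skip the finish() calls. A possible variation, sketched here with hypothetical stand-in objects rather than the real SDK classes, is to put the cleanup in a finally block.

```python
class FakeResource:
    """Hypothetical stand-in for an object with a finish() method."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def finish(self):
        self.log.append(f"{self.name} finished")


def run_session(microphone, connection, wait_for_stop):
    """Run until wait_for_stop() returns, then always release both resources."""
    try:
        wait_for_stop()
    finally:
        # Runs on normal exit and also when an error such as KeyboardInterrupt occurs.
        microphone.finish()
        connection.finish()


def interrupted():
    """Simulate the user pressing Ctrl+C while the script is waiting."""
    raise KeyboardInterrupt


log = []
try:
    run_session(FakeResource("microphone", log), FakeResource("connection", log), interrupted)
except KeyboardInterrupt:
    log.append("interrupted")

print(log)  # ['microphone finished', 'connection finished', 'interrupted']
```

Whether this is worth adding depends on how the script is used. For a short demo, the simpler try/except in the original is usually enough.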

This code provides a basic framework for transcribing microphone audio in real time with Deepgram. Building on Deepgram's speech recognition, you can develop a variety of applications, such as live transcription of conversations.

Full script

# Copyright 2023 Deepgram SDK contributors. All Rights Reserved.
# Use of this source code is governed by a MIT license that can be found in the LICENSE file.
# SPDX-License-Identifier: MIT

from dotenv import load_dotenv
import logging, verboselogs
from time import sleep

from deepgram import (
    DeepgramClient,
    DeepgramClientOptions,
    LiveTranscriptionEvents,
    LiveOptions,
    Microphone,
)

load_dotenv()

def main():
    try:
        # example of setting up a client config. logging values: WARNING, VERBOSE, DEBUG, SPAM
        # config = DeepgramClientOptions(
        #     verbose=logging.DEBUG, options={"keepalive": "true"}
        # )
        # deepgram: DeepgramClient = DeepgramClient("", config)
        # otherwise, use default config
        deepgram: DeepgramClient = DeepgramClient()

        dg_connection = deepgram.listen.live.v("1")

        def on_message(self, result, **kwargs):
            sentence = result.channel.alternatives[0].transcript
            if len(sentence) == 0:
                return
            print(f"speaker: {sentence}")

        def on_metadata(self, metadata, **kwargs):
            print(f"\n\n{metadata}\n\n")

        def on_speech_started(self, speech_started, **kwargs):
            print(f"\n\n{speech_started}\n\n")

        def on_utterance_end(self, utterance_end, **kwargs):
            print(f"\n\n{utterance_end}\n\n")

        def on_error(self, error, **kwargs):
            print(f"\n\n{error}\n\n")

        dg_connection.on(LiveTranscriptionEvents.Transcript, on_message)
        dg_connection.on(LiveTranscriptionEvents.Metadata, on_metadata)
        dg_connection.on(LiveTranscriptionEvents.SpeechStarted, on_speech_started)
        dg_connection.on(LiveTranscriptionEvents.UtteranceEnd, on_utterance_end)
        dg_connection.on(LiveTranscriptionEvents.Error, on_error)

        options: LiveOptions = LiveOptions(
            model="nova-2",
            punctuate=True,
            language="ja",
            encoding="linear16",
            channels=1,
            sample_rate=16000,
            # To get UtteranceEnd, the following must be set:
            interim_results=True,
            utterance_end_ms="1000",
            vad_events=True,
        )
        dg_connection.start(options)

        # Open a microphone stream on the default input device
        microphone = Microphone(dg_connection.send)

        # start microphone
        microphone.start()

        # wait until finished
        input("Press Enter to stop recording...\n\n")

        # Wait for the microphone to close
        microphone.finish()

        # Indicate that we've finished
        dg_connection.finish()

        print("Finished")
        # sleep(30)  # wait 30 seconds to see if there is any additional socket activity
        # print("Really done!")

    except Exception as e:
        print(f"Could not open socket: {e}")
        return

if __name__ == "__main__":
    main()
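
The full script sends audio as encoding="linear16" with channels=1 and sample_rate=16000. As a rough illustration of what that means for bandwidth, the sketch below computes the raw data rate, using the fact that linear16 is 16-bit PCM (2 bytes per sample). The helper name bytes_per_second is hypothetical.

```python
def bytes_per_second(sample_rate, channels, bytes_per_sample=2):
    """Raw PCM data rate; linear16 uses 2 bytes (16 bits) per sample."""
    return sample_rate * channels * bytes_per_sample


rate = bytes_per_second(sample_rate=16000, channels=1)
print(rate)  # 32000 bytes per second
print(rate * 60 / 1_000_000)  # 1.92 (megabytes per minute)
```

So the settings in the script stream roughly 32 KB of audio per second, or a little under 2 MB per minute.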

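Summary

This article showed how to transcribe microphone audio in real time with Deepgram. With the Deepgram SDK, this is relatively easy to implement. I hope the article is useful for beginners who want to put speech recognition to work. Extracting information from audio data is valuable in many applications, so try using this technique to take your own projects and ideas further.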