Overview of the DeepSeek-Math Repository

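DeepSeek-Math is a set of models built on DeepSeek's large language models (continued pre-training from DeepSeek-Coder-Base-v1.5 7B) and further trained to perform well on mathematical tasks.
This repository releases the following models: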

DeepSeekMath-Base: the base model
DeepSeekMath-Instruct: the instruction-following model
DeepSeekMath-RL: the model further trained with reinforcement learning

Repository Structure
Key Directories and Files
Usage
1. Running Evaluation
2. Running Inference
Evaluation Scripts
1. run_subset_parallel.py
2. eval/eval_script.py
Prompt Formats
1. cot_gsm_8_shot.py
2. minif2f_isabelle.py
Summary
References
1. Related

Repository Structure

```
DeepSeek-Math/
    cog.yaml
    README.md
    evaluation/
        evaluation_results.json
        README.md
        run_subset_parallel.py
        submit_eval_jobs.py
        summarize_results.py
        unsafe_score_minif2f_isabelle.py
        utils.py
        configs/
        datasets/
        data_processing/
        eval/
        few_shot_prompts/
        infer/
    images/
    replicate/
```

Key Directories and Files

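evaluation/: scripts and data for evaluating the models.
- run_subset_parallel.py: evaluates a model in parallel.
- submit_eval_jobs.py: submits the evaluation jobs.
- summarize_results.py: aggregates the evaluation results.
- configs/: evaluation configuration files.
- datasets/: evaluation datasets.
- data_processing/: data preprocessing scripts.
- eval/: scripts that compute the evaluation metrics.
- few_shot_prompts/: prompts used for few-shot prompting.
- infer/: inference scripts.
replicate/: scripts implementing the model's inference interface (for Replicate).
cog.yaml: the model's build configuration (for Cog).
README.md: the repository's documentation.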

Usage

Running Evaluation

Set up the required environment
```
conda env create -f environment.yml
```
Run the evaluation (the scripts live in evaluation/, so run the commands from that directory)
```
python submit_eval_jobs.py --n-gpus 8
```
- --n-gpus: the number of GPUs to use.
- To evaluate a different model or dataset, edit the corresponding settings in submit_eval_jobs.py.
Summarize the results
```
python summarize_results.py [--eval-atp]
```
- --eval-atp: also evaluates the informal-to-formal task. PISA must be set up beforehand.
The summarized results are written to evaluation_results.json.
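The layout of evaluation_results.json is not described here, so the following is only a minimal sketch that assumes the file contains nested JSON objects. It flattens them into dotted keys so that scores can be scanned or compared across runs. The helper names flatten_results and print_results are hypothetical and are not part of the repository.

```python
import json
from typing import Any, Dict


def flatten_results(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into {"a.b.c": value} (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Join nested keys with dots, e.g. {"a": {"b": 1}} -> "a.b"
            name = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_results(value, name))
    else:
        # Leaves (numbers, strings, lists) are stored as-is
        flat[prefix] = obj
    return flat


def print_results(path: str) -> None:
    """Load a results JSON file and print one "key<TAB>value" line per leaf."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for name, value in sorted(flatten_results(data).items()):
        print(f"{name}\t{value}")
```

For example, `print_results("evaluation/evaluation_results.json")` lists every leaf value of the file, whatever its exact schema turns out to be.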

Running Inference

Inference can be run with the Hugging Face transformers library.

Text generation example (base model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-math-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the weights in bfloat16 and spread them over the available devices
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
# Use the generation defaults shipped with the model and pad with the EOS token
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# The base model simply continues the given text
text = "The integral of x^2 from 0 to 2 is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

Chat example (instruct model):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/deepseek-math-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# "\\boxed{}" must be escaped: in a plain Python string "\b" is a backspace character
messages = [
    {"role": "user", "content": "what is the integral of x^2 from 0 to 2?\nPlease reason step by step, and put your final answer within \\boxed{}."}
]
# Build the chat-formatted prompt expected by the instruct model
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

The following sections explain the main scripts and the prompt formats.

Evaluation Scripts

run_subset_parallel.py

This script evaluates a model in parallel. Its main steps are:

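Loading the test data and building prompts
- markup_question adds a task-specific prompt to each question.
Running the parallel evaluation
- do_parallel_sampling runs model inference in parallel.
- The KeyWordsCriteria class defines when generation should stop.
- generate_completions performs the actual inference.
Computing the metrics
- evaluate computes the evaluation metrics over the inference results.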
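Two ideas in this list can be illustrated with a minimal, framework-free sketch. The first is splitting the test set into one shard per worker, as a parallel runner might do per GPU. The second is checking whether the generated token IDs end with a stop sequence, which is the kind of keyword check a stopping criterion such as KeyWordsCriteria exists for. The helper names and the round-robin split are assumptions for illustration and do not reproduce the repository's code. With transformers, a check like this would typically live in the `__call__` method of a `StoppingCriteria` subclass.

```python
from typing import List, Sequence


def shard(items: Sequence, n_workers: int) -> List[list]:
    """Split items round-robin into n_workers shards (hypothetical helper)."""
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    # Worker i gets items i, i + n_workers, i + 2 * n_workers, ...
    return [list(items[i::n_workers]) for i in range(n_workers)]


def ends_with_stop(generated: Sequence[int], stop_sequences: Sequence[Sequence[int]]) -> bool:
    """Return True if the generated token IDs end with any non-empty stop sequence."""
    for stop in stop_sequences:
        n = len(stop)
        if n and len(generated) >= n and list(generated[-n:]) == list(stop):
            return True
    return False
```

Each shard could then be handed to a separate process pinned to one GPU. Whether the repository splits its data round-robin or in contiguous blocks is not stated here.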

eval/eval_script.py

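This script computes the evaluation metrics from the inference results. Its main steps are:

Extracting answers
- extract_boxed_answers extracts the answer inside \boxed{}.
- extract_program_output extracts the output of a generated program.
Judging correctness
- is_correct checks whether a prediction matches the reference answer.
- math_equal checks whether two mathematical expressions are equal.
Computing per-task metrics
- Functions such as eval_math and eval_last_single_answer compute the metrics for each task.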
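Answer extraction and comparison can be sketched in a few lines. The first function pulls out the contents of every `\boxed{...}` and matches braces so that nested LaTeX such as `\frac{8}{3}` stays intact. The second compares answers exactly as rational numbers when both sides parse, and otherwise falls back to a whitespace-insensitive string match. This is a deliberately simplified illustration, much weaker than a full math_equal. The function names are hypothetical and this is not the repository's implementation.

```python
import re
from fractions import Fraction
from typing import List, Optional


def find_boxed_contents(text: str) -> List[str]:
    """Return the contents of every \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    results: List[str] = []
    start = text.find(marker)
    while start != -1:
        i = start + len(marker)
        depth, j = 1, i
        # Walk forward until the brace opened by \boxed{ is closed
        while j < len(text) and depth:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        if depth == 0:  # skip an unbalanced trailing \boxed{
            results.append(text[i : j - 1])
        start = text.find(marker, j)
    return results


def _to_number(s: str) -> Optional[Fraction]:
    """Parse '3', '2.5', '1,000', '8/3' or a simple \\frac{a}{b} into an exact Fraction."""
    s = s.strip().replace(",", "")
    s = re.sub(r"\\d?frac\{([^{}]+)\}\{([^{}]+)\}", r"\1/\2", s)
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def answers_match(pred: str, ref: str) -> bool:
    """Exact numeric comparison when both sides parse, else compare with whitespace removed."""
    a, b = _to_number(pred), _to_number(ref)
    if a is not None and b is not None:
        return a == b
    return "".join(pred.split()) == "".join(ref.split())
```

For example, the last boxed answer of a completion can be compared with the reference via `answers_match(find_boxed_contents(completion)[-1], reference)`, after checking that the list is not empty.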

Prompt Formats

The prompts used for few-shot prompting are defined in the few_shot_prompts/ directory, and their format differs from task to task.

cot_gsm_8_shot.py

A chain-of-thought prompt for the GSM8K dataset (excerpt: only one of the exemplars is shown).

```python
# Excerpt: only one of the eight exemplars is shown here.
few_shot_prompt = """
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.

Q: {question}
A: {answer}
"""

# FewShotPrompting is the base class used by the repository's prompt modules
# (its import is omitted in this excerpt).
class CoTGSMPrompt(FewShotPrompting):
    def format_prompt(self, task_input, task_output):
        # Fill the final Q/A slot with the task's question and answer
        prompt = few_shot_prompt.format(question=task_input, answer=task_output)
        return prompt.strip()
```
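The class above fills a single Q/A template with str.format. The sketch below does not depend on FewShotPrompting. It shows how several exemplars and a new question might be joined in the same "Q:/A:" layout, and how the final number can be read back from a completion that ends in "The answer is N." The function names and the regular expression are assumptions for illustration. The repository's evaluation code may parse answers differently.

```python
import re
from typing import List, Optional, Tuple


def build_cot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join (question, answer) exemplars in Q:/A: form, then append the new question."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    shots.append(f"Q: {question}\nA:")  # the model continues after "A:"
    return "\n\n".join(shots)


def parse_final_answer(completion: str) -> Optional[str]:
    """Return the number after the last 'The answer is' in the model's own answer."""
    # Drop any extra Q:/A: turns the model may have generated after answering
    completion = completion.split("\n\nQ:", 1)[0]
    matches = re.findall(r"The answer is\s*\$?(-?[\d,]*\.?\d+)", completion)
    return matches[-1].replace(",", "") if matches else None
```

Cutting the completion at the next "Q:" plays the same role as a stop keyword during generation: without it, the model tends to keep inventing new questions in the exemplar format.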

minif2f_isabelle.py

A prompt for the miniF2F dataset (informal-to-formal proving in Isabelle); excerpt.

```python
# FewShotPrompting is the base class used by the repository's prompt modules
# (its import is omitted in this excerpt).

# Fixed exemplar. Its LaTeX contains literal braces such as \frac{...}{...},
# so it is concatenated rather than passed through str.format.
few_shot_prompt = """
Informal:
(*### Problem
Find the minimum value of $\\frac{9x^2\\sin^2 x + 4}{x\\sin x}$ for $0 < x < \\pi$. Show that it is 12.

### Solution
Let $y = x \\sin x$. It suffices to show that $12 \\leq \\frac{9y^2 + 4}{y}$.
It is trivial to see that $y > 0$.
Then one can multiply both sides by $y$ and it suffices to show $12y \\leq 9y^2 + 4$.
This can be done by the sum of squares method.*)

Formal:
theorem aime_1983_p9:
  fixes x::real
  assumes "0<x" "x<pi"
  shows "12 \\<le> ((9 * (x^2 * (sin x)^2)) + 4) / (x * sin x)"
proof -
  (* Let $y = x \\sin x$. *)
  define y where "y=x * sin x"
  (* It suffices to show that $12 \\leq \\frac{9y^2 + 4}{y}$. *)
  have "12 \\<le> (9 * y^2 + 4) / y"
  proof -
    (* It is trivial to see that $y > 0$. *)
    have c0: "y > 0"
      sledgehammer
    (* Then one can multiply both sides by $y$ and it suffices to show $12y \\leq 9y^2 + 4$. *)
    have "(9 * y^2 + 4) \\<ge> 12 * y"
      sledgehammer
    then show ?thesis
      sledgehammer
  qed
  then show ?thesis
    sledgehammer
qed
"""

# Template for the problem being asked; only this part contains placeholders.
query_template = """
Informal:
{informal_statement}
{informal_proof}

Formal:
{formal_statement}
{formal_proof}
"""

class MiniF2FIsabellePrompt(FewShotPrompting):
    def format_prompt(self, task_input, task_output):
        query = query_template.format(
            informal_statement=task_input["informal_statement"],
            informal_proof=task_input["informal_proof"],
            formal_statement=task_input["formal_statement"],
            formal_proof=task_output,
        )
        # Exemplar, one blank line, then the query (same layout as a single template)
        return few_shot_prompt.strip() + "\n\n" + query.strip()
```
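The Isabelle exemplar is full of LaTeX braces such as `\frac{...}{...}`, and str.format treats every unescaped brace pair as a replacement field. Running that text through `.format()` therefore raises KeyError. The following minimal sketch reproduces the problem and two common workarounds: doubling the literal braces, or formatting only the part that holds placeholders. The strings are illustrative and are not taken from the repository.

```python
# LaTeX text with literal braces, plus a template with one real placeholder
exemplar = "Let $y = \\sqrt{x}$. Then $y \\geq 0$."
query_template = "Informal:\n{informal_statement}"


def format_naively(statement: str) -> str:
    # str.format reads {x} inside \sqrt{x} as a field name and raises KeyError('x')
    return (exemplar + "\n\n" + query_template).format(informal_statement=statement)


def format_by_escaping(statement: str) -> str:
    # Doubled braces {{ }} are emitted as literal { } by str.format
    escaped = exemplar.replace("{", "{{").replace("}", "}}")
    return (escaped + "\n\n" + query_template).format(informal_statement=statement)


def format_by_concatenation(statement: str) -> str:
    # Only the template that actually holds placeholders goes through str.format
    return exemplar + "\n\n" + query_template.format(informal_statement=statement)
```

Values passed as arguments to format are inserted verbatim and not re-parsed. Braces inside the task's own statement or proof are therefore safe in both workarounds.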

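Summary

This article covered the repository layout, how to run evaluation and inference, the main evaluation scripts, and the prompt formats.
When evaluating the models or running inference, these scripts and prompts make a good starting point.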