Automatically Generating Math Problems with Python: A Collection of Data Augmentation Techniques

2024.06.24 / 2024.04.27

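In data science it is important to have a large amount of high-quality data, but sometimes there is simply not enough of it. Data augmentation addresses this by transforming existing data to produce new data.

This article shows how to take a dataset of math problems and their answers and automatically generate new problem/answer pairs through data augmentation. We use Python with libraries such as pandas and sympy.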

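Preparation
Data augmentation for Problem 1
Data augmentation for Problem 2
Data augmentation for Problem 3
Data augmentation for Problem 4
Data augmentation for Problem 5
Data augmentation for Problem 6
Data augmentation for Problem 7
Data augmentation for Problem 8
Data augmentation for Problem 9
Data augmentation for Problem 10
Combining the augmented data
Notebook
Reference sites
Related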

Preparation

First, install the required library and load the dataset.

!pip install portion

import numpy as np
import pandas as pd
import sympy as sp
import random
import math
from scipy.optimize import minimize
import portion as P
import fractions
from tqdm.notebook import tqdm

df = pd.read_csv('../input/ai-mathematical-olympiad-prize/train.csv')

for s in df.problem:
    print(s)

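Data augmentation for Problem 1

Problem 1 is about the intersection points of a parabola and a horizontal line. New problems can be generated by changing its parameters.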

import sympy as sp
import random

def solve_q1(line=4, distance=6):
    # Define the symbolic variables
    k, l, x1, y1, x2, y2, d2 = sp.symbols('k l x1 y1 x2 y2 d2')

    # Parabola y = k*x^2 - 2*k*x + l, written as an expression that must equal 0
    parabola = y1 - k*x1**2 + 2*k*x1 - l

    # Horizontal line y = line, written the same way
    line_eq = y1 - line

    # Both intersection points (x1, y1) and (x2, y2) lie on the parabola and on the line
    eq1 = sp.Eq(parabola, 0)
    eq2 = sp.Eq(parabola.subs({x1: x2, y1: y2}), 0)
    eq3 = sp.Eq(line_eq, 0)
    eq4 = sp.Eq(line_eq.subs({x1: x2, y1: y2}), 0)
    # The axis of symmetry of the parabola is x = 1, so x1 + x2 = 2
    eq5 = sp.Eq(x1 + x2, 2)

    # Distance between the two points, and the sum of squared distances to the origin
    eq6 = sp.Eq((x2 - x1)**2 + (y2 - y1)**2, distance**2)
    eq7 = sp.Eq(x1**2 + y1**2 + x2**2 + y2**2, d2)

    # Solve the equations as a simultaneous system
    solution = sp.solve((eq1, eq2, eq3, eq4, eq5, eq6, eq7), (k, l, x1, y1, x2, y2, d2))

    # Take the value of d2 from the first solution
    res = solution[0][-1]
    return res

q1_template = "Let $k, l > 0$ be parameters. The parabola $y = kx^2 - 2kx + l$ intersects the line $y = {line}$ at two points $A$ and $B$. These points are distance {distance} apart. What is the sum of the squares of the distances from $A$ and $B$ to the origin?"

def aug_q1(q1_template=q1_template, n_rep=1000):
    # Try up to n_rep parameter choices
    for i in range(n_rep):
        # Height of the horizontal line (integer from 1 to 10)
        line = random.randint(1, 10)
        # Distance between the intersection points (integer from 1 to 10)
        distance = random.randint(1, 10)
        # Solve the problem with solve_q1
        answer = solve_q1(line, distance)
        # If the answer is an integer, reduce it modulo 1000 and stop
        if int(answer) == answer:
            answer = int(answer) % 1000
            break
    # Fill the line height and the distance into the problem template
    problem = q1_template.format(line=line, distance=distance)
    return problem, answer

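solve_q1 uses sympy to define symbolic variables and set up the equations of the parabola and the horizontal line.
It then adds equations stating that both intersection points lie on the parabola and on the line, plus equations for the distance between the points and for the sum of their squared distances to the origin.
sp.solve solves the system, and the value of d2 (the sum of the squared distances from the origin) is read from the solution.
aug_q1 loops up to 1000 times, picking the height of the line and the distance between the intersection points at random and solving with solve_q1.
As soon as the answer is an integer, its remainder modulo 1000 becomes the answer and the loop stops.
The line height and the distance are filled into the problem template, and the problem/answer pair is returned.

With this function, new problems can be generated automatically from the template by randomly varying the height of the line and the distance between the intersection points.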
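
Because the line is horizontal and the axis of the parabola is x = 1, the two intersection points should be (1 ± distance/2, line), which gives a closed form that can serve as a cross-check for solve_q1. The sketch below is only an illustration under that assumption; closed_form_q1 is a hypothetical helper name, not part of the original notebook.

```python
from fractions import Fraction

def closed_form_q1(line, distance):
    """Sum of squared distances from A and B to the origin.

    Assumes A = (1 - distance/2, line) and B = (1 + distance/2, line),
    which follows from the axis of symmetry x = 1 of y = k*x^2 - 2*k*x + l.
    """
    half = Fraction(distance, 2)
    x1, x2 = 1 - half, 1 + half
    return x1**2 + x2**2 + 2 * Fraction(line)**2

# Default parameters of solve_q1: line = 4, distance = 6
print(closed_form_q1(4, 6))  # 52
```

An odd distance gives a non-integer value, which is why aug_q1 has to reject some random draws.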

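Data augmentation for Problem 2

Problem 2 asks how to colour the three-digit integers with two colours. New problems are generated by changing the minimum value, the maximum value, and the colour names.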

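import random

def solve_q2(min_value=110, max_value=999):
    def is_yellow(x, A, B):
        # Decide whether x can join the color1 set A
        # Reject x if its double, or its sum with a number already in A, is itself in A
        if 2*x in A or any(x + item in A for item in A):
            return False
        # Accept x if its double (or its sum with a number in A) is already in B
        if 2*x in B or any(x + item in B for item in A):
            return True
        return False

    # A holds the color1 numbers, B holds the color0 numbers
    A = set()
    B = set()

    # Go through the values in descending order, from max_value down to min_value
    for value in range(max_value, min_value - 1, -1):
        # Add value to A if it passes the check, otherwise to B
        if is_yellow(value, A, B):
            A.add(value)
        else:
            B.add(value)

    # Return how many numbers received color1
    return len(A)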

q2_template = "Each of the three-digits numbers ${min_value}$ to ${max_value}$ is coloured {color0} or {color1} in such a way that the sum of any two (not necessarily different) {color1} numbers is equal to a {color0} number. What is the maximum possible number of {color1} numbers there can be?"

def aug_q2(q2_template=q2_template, n_rep=1000):
    # Pick two distinct colour names at random
    color0, color1 = random.sample(["red", "blue", "green", "yellow", "white", "black"], 2)

    # Try up to n_rep parameter choices
    for i in range(n_rep):
        # Pick min_value and max_value at random
        min_value = random.randint(100, 998)
        max_value = random.randint(min_value, 999)
        # Solve the problem with solve_q2
        answer = solve_q2(min_value, max_value)
        # If the answer is a positive integer, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 0):
            answer = int(answer) % 1000
            break

    # Fill the colours and the numeric range into the problem template
    problem = q2_template.format(color0=color0, color1=color1, min_value=min_value, max_value=max_value)
    return problem, answer

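To make the color1 set as large as possible, solve_q2 assigns colours greedily, going from max_value down to min_value.
is_yellow decides whether the current number can join the color1 set. The number is rejected if its double, or its sum with any number already in the color1 set, is itself in the color1 set; it is accepted if its double (or such a sum) is already in the color0 set.
Numbers that pass the check go into the color1 set, and all others go into the color0 set.
Finally, the size of the color1 set is returned.
aug_q2 loops up to 1000 times, picking the numeric range at random and solving with solve_q2; the two colour names are picked once at the start.
As soon as the answer is a positive integer, its remainder modulo 1000 becomes the answer and the loop stops.
The colours and the numeric range are filled into the problem template, and the problem/answer pair is returned.

With this function, new problems can be generated automatically from the template by randomly varying the colours and the numeric range.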
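
A greedy assignment like this is easy to get subtly wrong, so it can help to check the resulting set against the problem statement directly. The sketch below is an illustrative brute-force validator; is_valid_colouring is a hypothetical helper, and the range 111 to 999 with the candidate set 250 to 499 is an assumption taken from the original form of the problem, not something the notebook states.

```python
def is_valid_colouring(yellow, min_value, max_value):
    """Check that the sum of any two yellow numbers (possibly equal)
    is a number in [min_value, max_value] that is not yellow."""
    yellow = set(yellow)
    for x in yellow:
        for y in yellow:
            s = x + y
            if s < min_value or s > max_value or s in yellow:
                return False
    return True

# Candidate yellow set for the range 111..999 (assumed values)
candidate = range(250, 500)
print(is_valid_colouring(candidate, 111, 999), len(candidate))  # True 250

# Adding 249 breaks the rule, because 249 + 249 = 498 is yellow
print(is_valid_colouring(range(249, 500), 111, 999))  # False
```

The validator only confirms that a colouring is admissible; it does not prove that the count is maximal.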

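Data augmentation for Problem 3

Problem 3 asks for the number of positive integers that satisfy a certain condition. New problems are generated by changing the numbers in the problem statement.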

import random
import math

def count_numbers(n, s):
    # DP over digit strings of length n (leading zeros allowed):
    # dp[i][t] = number of length-i strings whose digits sum to exactly t.
    # Allowing leading zeros makes this the count of integers with at most n digits.
    dp = [[0 for _ in range(s + 1)] for _ in range(n + 1)]
    dp[0][0] = 1

    for i in range(1, n + 1):
        for sum_ in range(s + 1):
            for digit in range(10):
                if sum_ >= digit:
                    dp[i][sum_] += dp[i - 1][sum_ - digit]

    return dp[n][s]

def sparkle(x):
    # Sparkle operation: the factorial of the digit sum of x
    x = sum(int(i) for i in str(x))
    return math.factorial(x)

def solve_q3(max_threshold=6, total_digit=36):
    # Candidate digit sums i (1 <= i < max_threshold) whose factorial stays below max_threshold
    first_items = [i for i in range(1, max_threshold) if math.factorial(i) < max_threshold]

    # Keep the candidates whose sparkle sequence cycles without reaching max_threshold
    special_items = []
    for x in first_items:
        seen = set([x])
        next_x = x
        while next_x < max_threshold:
            next_x = sparkle(next_x)
            if next_x in seen:
                special_items.append(x)
                break
            seen.add(next_x)

    # Count the integers with at most total_digit digits whose digit sum is a special value
    answer = sum(count_numbers(total_digit, i) for i in special_items)
    return answer

q3_template = "Let the `sparkle' operation on positive integer $n$ consist of calculating the sum of the digits of $n$ and taking its factorial, e.g. the sparkle of {eaxmple_value} is ${eaxmple_value_sum}! = {eaxmple_value_sparkle}$. A robot starts with a positive integer on a blackboard, then after each second for the rest of eternity, replaces the number on the board with its sparkle. For some `special' numbers, if they're the first number, then eventually every number that appears will be less than {max_threshold}. How many such special numbers are there with at most {total_digit} digits?"

def aug_q3(q3_template=q3_template, n_rep=1000):
    # Generate the example value shown in the problem statement
    eaxmple_value = random.randint(11, 99)
    eaxmple_value_sum = sum(int(i) for i in str(eaxmple_value))
    eaxmple_value_sparkle = sparkle(eaxmple_value)

    # Try up to n_rep parameter choices
    for i in range(n_rep):
        # Pick max_threshold and total_digit at random
        max_threshold = random.randint(2, 10)
        total_digit = random.randint(10, 100)
        # Solve the problem with solve_q3
        answer = solve_q3(max_threshold, total_digit)
        # If the answer is a positive integer, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 0):
            answer = int(answer) % 1000
            break

    # Fill the example value and the problem parameters into the template
    problem = q3_template.format(eaxmple_value=eaxmple_value, eaxmple_value_sum=eaxmple_value_sum,
                                 eaxmple_value_sparkle=eaxmple_value_sparkle, max_threshold=max_threshold,
                                 total_digit=total_digit)
    return problem, answer

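count_numbers uses dynamic programming (DP) to count the integers with at most n digits whose digits sum to exactly s.
sparkle applies the sparkle operation to a positive integer x: it computes the digit sum of x and returns the factorial of that sum.
solve_q3 first lists the digit sums from 1 up to (but not including) max_threshold whose factorial is below max_threshold. It then keeps the special ones, meaning those for which repeated sparkle operations only ever produce numbers below max_threshold. Finally, it adds up how many integers with at most total_digit digits have each special digit sum.
aug_q3 picks an example value once, then loops up to 1000 times, picking the parameters max_threshold and total_digit at random and solving with solve_q3.
As soon as the answer is a positive integer, its remainder modulo 1000 becomes the answer and the loop stops.
The example value and the parameters are filled into the problem template, and the problem/answer pair is returned.

With this function, new problems can be generated automatically from the template by randomly varying the example value and the problem parameters.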
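
For the default threshold of 6 the special digit sums are 1 and 2, so the DP result can be cross-checked by direct counting. The sketch below is an illustration under that assumption; count_digit_sum_1_or_2 is a hypothetical helper name.

```python
from math import comb

def count_digit_sum_1_or_2(total_digit):
    """Count positive integers with at most total_digit digits and digit sum 1 or 2."""
    # Digit sum 1: a single digit 1 in any of the positions
    ones = total_digit
    # Digit sum 2: a single digit 2, or two digits 1 in different positions
    twos = total_digit + comb(total_digit, 2)
    return ones + twos

print(count_digit_sum_1_or_2(36))  # 702
```

This should agree with count_numbers(36, 1) + count_numbers(36, 2) from the DP above.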

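Data augmentation for Problem 4

Problem 4 asks for the minimum value of an expression over real numbers $x$ and $y$ that satisfy a constraint. New problems are generated by changing the numbers in the problem statement.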

import random
from scipy.optimize import minimize

def solve_q4(a=5, b=5, c=8, d=2, e=2, f=40):
    def objective(vars, a=a, b=b, c=c):
        # Objective function a*x^2 + b*y^2 - c*x*y
        x, y = vars
        return a * x**2 + b * y**2 - c * x * y

    def constraint(vars, d=d, e=e, f=f):
        # Equality constraint |x - d*y| + |y - e*x| - f = 0
        x, y = vars
        return abs(x - d*y) + abs(y - e*x) - f

    # Starting point for the optimizer
    initial_guess = [0, 0]

    # Register the equality constraint
    con = {'type': 'eq', 'fun': constraint}

    # Solve the optimization problem (SLSQP is a local, numerical method)
    result = minimize(objective, initial_guess, constraints=con, method='SLSQP', options={'disp': False})

    # Return the optimal objective value as a float
    return result.fun

q4_template = "What is the minimum value of ${a}x^2+{b}y^2-{c}xy$ when $x$ and $y$ range over all real numbers such that $|x-{d}y| + |y-{e}x| = {f}$?"

def aug_q4(q4_template=q4_template, n_rep=10000):
    # Try up to n_rep parameter choices
    for i in range(n_rep):
        # Pick the parameters at random
        a = random.randint(1, 10)
        b = random.randint(1, 10)
        c = random.randint(1, 10)
        d = random.randint(1, 10)
        e = random.randint(1, 10)
        f = random.randint(10, 100)
        # Solve the problem with solve_q4
        answer = solve_q4(a, b, c, d, e, f)
        # solve_q4 returns a float, so an exact integer test would almost never pass.
        # Accept the value if it is within a small tolerance of a positive integer below 10000.
        rounded = int(round(float(answer)))
        if (abs(answer - rounded) < 1e-4) and (0 < rounded < 10000):
            answer = rounded % 1000
            break

    # Fill the parameters into the problem template
    problem = q4_template.format(a=a, b=b, c=c, d=d, e=e, f=f)
    return problem, answer

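Explanation:

solve_q4 solves the optimization problem with the minimize function from scipy.
objective defines the objective function; the goal is to find the minimum of $ax^2+by^2-cxy$.
constraint defines the constraint, which requires $|x-dy| + |y-ex| = f$.
initial_guess sets the starting point of the optimization.
con registers the constraint as an equality constraint.
minimize solves the problem; the method argument selects the optimization algorithm and the options argument controls its settings. SLSQP is a local numerical method, so the result is an approximation and depends on the starting point.
The optimal value is returned.
aug_q4 loops up to 10000 times, picking the parameters at random and solving with solve_q4.
As soon as the answer is a positive integer below 10000, its remainder modulo 1000 becomes the answer and the loop stops.
The parameters are filled into the problem template, and the problem/answer pair is returned.

With this function, new problems can be generated automatically from the template by randomly varying the parameters.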
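
Since a numerical optimizer only returns an approximation, an exact cross-check can be useful. The constraint set is the boundary of a parallelogram, and the objective restricted to each edge is a quadratic in one variable, so the minimum can be found exactly with rational arithmetic. The sketch below is one possible way to do this, not the notebook's method; exact_min_q4 is a hypothetical helper, and it assumes integer parameters with d*e != 1.

```python
from fractions import Fraction

def exact_min_q4(a=5, b=5, c=8, d=2, e=2, f=40):
    """Exact minimum of a*x^2 + b*y^2 - c*x*y on |x - d*y| + |y - e*x| = f."""
    det = 1 - d * e
    if det == 0:
        raise ValueError("d*e == 1: the change of variables is not invertible")

    def objective(u, v):
        # Invert the substitution u = x - d*y, v = y - e*x
        x = (u + d * v) / det
        y = (v + e * u) / det
        return a * x * x + b * y * y - c * x * y

    best = None
    half = Fraction(f, 2)
    for su in (1, -1):
        for sv in (1, -1):
            # One edge of the diamond |u| + |v| = f:
            # u = su*t, v = sv*(f - t) with 0 <= t <= f
            def q(t):
                return objective(su * t, sv * (f - t))

            q0, q1, q2 = q(Fraction(0)), q(half), q(Fraction(f))
            candidates = [q0, q2]
            # q(t) = A*t^2 + B*t + C; recover A and B from three samples
            curv = (q2 - 2 * q1 + q0) / 2        # equals A * half^2
            slope = (4 * q1 - q2 - 3 * q0) / 2   # equals B * half
            if curv > 0:
                t_star = -slope / (2 * curv) * half  # vertex -B / (2A)
                if 0 < t_star < f:
                    candidates.append(q(t_star))
            edge_min = min(candidates)
            best = edge_min if best is None else min(best, edge_min)
    return best

print(exact_min_q4())  # 800
```

Because the result is a Fraction, the integer test int(answer) == answer is exact here, with no tolerance needed.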

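Data augmentation for Problem 5

Problem 5 is about a geometric sequence of length 5 made of two-digit positive integers. New problems are generated by rephrasing the problem statement.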

import math
import random

def is_divisible_by(base, divisor):
    """Return True if base squared is divisible by divisor."""
    if divisor == 0:  # avoid division by zero
        return False
    return base**2 % divisor == 0

def is_perfect_square(number):
    """Return True if number is a perfect square."""
    root = math.isqrt(number)  # integer square root of number
    return root * root == number

def solve_q5():
    min_value = 10
    max_value = 100
    for third_value in range(min_value, max_value):
        for first_value in range(min_value, third_value):
            if is_divisible_by(third_value, first_value):
                # In a geometric sequence, third^2 = first * fifth
                fifth_value = third_value**2 // first_value
                if fifth_value < max_value:
                    # Likewise fourth^2 = third * fifth and second^2 = first * third
                    fourth_value_square = third_value * fifth_value
                    second_value_square = third_value * first_value
                    if is_perfect_square(fourth_value_square) and is_perfect_square(second_value_square):
                        second_value = math.isqrt(second_value_square)
                        fourth_value = math.isqrt(fourth_value_square)
                        # Return the sum of the five terms
                        return sum([first_value, second_value, third_value, fourth_value, fifth_value])

q5_template = "There {w0} a unique increasing geometric sequence of five 2-digit positive integers. {w1}"

def aug_q5(q5_template=q5_template, n_rep=10000):
    # Wording variations
    w0 = random.choice(['is', 'exists'])
    w1 = random.choice(['What is their sum?', 
                        'What is the sum of these integers?', 
                        'What is the sum of these values?', 
                        'What is the sum of these numbers?',
                        'Compute their sum.', 
                        'Compute the sum of these integers.', 
                        'Compute the sum of these values.', 
                        'Compute the sum of these numbers.',
                        'Calculate their sum.', 
                        'Calculate the sum of these integers.', 
                        'Calculate the sum of these values.', 
                        'Calculate the sum of these numbers.',
                        'Answer their sum.', 
                        'Answer the sum of these integers.', 
                        'Answer the sum of these values.', 
                        'Answer the sum of these numbers.'])

    # Build the problem statement
    problem = q5_template.format(w0=w0, w1=w1)

    # Compute the answer
    answer = solve_q5()
    return problem, answer

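Explanation:

is_divisible_by checks whether the square of base is divisible by divisor.
is_perfect_square checks whether number is a perfect square.
solve_q5 uses a double loop to search for the terms of a geometric sequence that satisfies the conditions.
- First, third_value runs from 10 to 99.
- Next, first_value runs from 10 up to (but not including) third_value.
- If the square of third_value is divisible by first_value, fifth_value is computed.
- If fifth_value is below 100, the squares of the fourth and second terms are computed.
- If both of these are perfect squares, a geometric sequence satisfying the conditions has been found.
- The sum of the terms of that sequence is returned.
aug_q5 builds the problem statement by filling wording variations into the template.
It calls solve_q5 to compute the answer and returns the problem/answer pair.

With this function, new problems can be generated automatically from the template by randomly varying the wording. Note that the numbers in the problem are fixed, so the answer is the same for every variation.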
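
The problem statement claims the sequence is unique, and that claim can be checked by exhaustive search with exact rational ratios. The sketch below is an independent illustration; find_sequences is a hypothetical helper name and is not part of the original notebook.

```python
from fractions import Fraction

def find_sequences(length=5, lo=10, hi=99):
    """Return every increasing geometric sequence of `length` integers in [lo, hi]."""
    found = []
    for first in range(lo, hi + 1):
        for second in range(first + 1, hi + 1):
            # The first two terms determine the common ratio
            ratio = Fraction(second, first)
            terms = [first * ratio**i for i in range(length)]
            # Keep the sequence only if every term is an integer not exceeding hi
            if all(t.denominator == 1 and t <= hi for t in terms):
                found.append([int(t) for t in terms])
    return found

sequences = find_sequences()
print(sequences, sum(sequences[0]))  # [[16, 24, 36, 54, 81]] 211
```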

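Data augmentation for Problem 6

Problem 6 asks for the number of positive integers $m$ for which a certain equation has four distinct solutions. New problems are generated by changing the number in the problem statement.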

import sympy as sp
import random
import portion as P

def solve_q6(a=100):
    m = sp.symbols('m', real=True, nonnegative=True)  # m is a non-negative real number
    x = sp.symbols('x', real=True)  # x is a real number

    # Define the equation ||x - 1| - 2| = m / a
    equation = sp.Eq(sp.Abs(sp.Abs(x - 1) - 2), m/a)

    # Solve the equation for x
    solutions = sp.solve(equation, x)

    intervals = [(0, float('inf'))]

    for expr in solutions:
        if isinstance(expr, sp.Piecewise):
            for e, cond in expr.args:
                if cond.has(m):
                    min_m = float('-inf')
                    max_m = float('inf')
                    for inequality in cond.args:
                        if inequality.rel_op == '<=':
                            max_m = min(max_m, inequality.rhs)
                        elif inequality.rel_op == '<':
                            max_m = min(max_m, inequality.rhs - 1)
                        elif inequality.rel_op == '>=':
                            min_m = max(min_m, inequality.rhs)
                        elif inequality.rel_op == '>':
                            min_m = max(min_m, inequality.rhs + 1)
                    intervals.append((min_m, max_m))

    intervals = [P.closed(min_m, max_m) for min_m, max_m in intervals]

    intersection = P.Interval()

    for interval in intervals:
        if intersection.empty:
            intersection = interval
        else:
            intersection = intersection & interval

    answer = 0
    for m_val in range(intersection.lower, intersection.upper + 1):
        # Evaluate the solutions at the current value of m
        evaluated_solutions = [sol.subs(m, m_val) for sol in solutions]
        # Count m when the four solutions are pairwise distinct
        if len(set(evaluated_solutions)) == 4:
            answer += 1
    return answer

q6_template = "For how many positive integers $m$ does the equation \\[\\vert \\vert x-1 \\vert -2 \\vert=\\frac{{m}}{{{a}}}\\] have $4$ distinct solutions?"

def aug_q6(q6_template=q6_template, n_rep=10000):
    for i in range(n_rep):
        # Pick the parameter a at random
        a = random.randint(10, 1000)
        # Solve the problem with solve_q6
        answer = solve_q6(a)
        # If the answer is a positive integer, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 0):
            answer = int(answer) % 1000
            break
    # Fill the parameter into the problem template
    problem = q6_template.format(a=a)
    return problem, answer

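Explanation:

For a given parameter a, solve_q6 solves the equation and counts the positive integers $m$ for which it has four distinct solutions.
m and x are defined as symbols; m is a non-negative real number and x is a real number.
The equation $||x-1|-2|=\frac{m}{a}$ is defined using sympy's absolute value function.
sp.solve solves the equation for x.
When a solution carries a condition on m, the range of m is extracted from that condition.
Closed intervals are built from the ranges of m found this way.
The intersection of these closed intervals is computed.
For each integer m in the intersection, the solutions are evaluated, and the count is increased when they take four distinct values.
Finally, the number of values of m satisfying the condition is returned.
aug_q6 loops up to 10000 times, picking the parameter a at random and solving with solve_q6.
As soon as the answer is a positive integer, its remainder modulo 1000 becomes the answer and the loop stops.
The parameter is filled into the problem template, and the problem/answer pair is returned.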
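
The symbolic route depends on how sympy chooses to present its piecewise solutions, so a direct count is a useful cross-check. For a right-hand side c, the equation ||x - 1| - 2| = c reduces to |x - 1| = 2 + c or |x - 1| = 2 - c, and the distinct roots can be enumerated exactly. The sketch below is an illustration, not the notebook's method; both function names are hypothetical.

```python
from fractions import Fraction

def count_distinct_solutions(c):
    """Number of distinct real x with ||x - 1| - 2| = c, for a rational c."""
    if c < 0:
        return 0
    xs = set()
    # |x - 1| - 2 = +c or -c, so |x - 1| is 2 + c or 2 - c
    for inner in {2 + c, 2 - c}:
        if inner >= 0:
            xs.add(1 + inner)
            xs.add(1 - inner)
    return len(xs)

def count_m(a):
    """Count positive integers m for which the equation with m/a has 4 distinct solutions."""
    # For m/a > 2 only two solutions remain, so searching up to 3*a is enough
    return sum(
        1 for m in range(1, 3 * a + 1)
        if count_distinct_solutions(Fraction(m, a)) == 4
    )

print(count_m(100))  # 199
```

The count equals 2a - 1, since four distinct solutions occur exactly when 0 < m/a < 2.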

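Data augmentation for Problem 7

Problem 7 asks for the probability that the highest value among several dice rolls equals a given number. New problems are generated by changing the number of dice and the target value.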

import fractions
import random

def count_outcomes(dice_left, highest_roll, max_value):
    # No dice left: one outcome in total, and it is a desired one if the highest roll equals max_value
    if dice_left == 0:
        return (highest_roll == max_value, 1)
    # Dice left: recurse over every face of the next die
    else:
        total_outcomes = 0
        desired_outcomes = 0
        for roll in range(1, 7):
            outcomes = count_outcomes(dice_left - 1, max(highest_roll, roll), max_value)
            desired_outcomes += outcomes[0]
            total_outcomes += outcomes[1]
        return (desired_outcomes, total_outcomes)

def solve_q7(num_dice=4, max_value=5):
    outcomes = count_outcomes(num_dice, 0, max_value)

    # Reduce the probability to lowest terms and return numerator + denominator
    probability = fractions.Fraction(outcomes[0], outcomes[1])
    answer = probability.numerator + probability.denominator
    return answer

q7_template = "Suppose that we roll {num_dice} 6-sided fair dice with faces numbered 1 to~6. Let $a/b$ be the probability that the highest roll is a {max_value}, where $a$ and $b$ are relatively prime positive integers. Find $a + b$."

def aug_q7(q7_template=q7_template, n_rep=10000):
    # 'zero' is left out because rolling zero dice gives no valid problem
    num_dict = {
        'one': 1,
        'two': 2,
        'three': 3,
        'four': 4,
        'five': 5,
        'six': 6,
        'seven': 7,
        'eight': 8,
        'nine': 9
    }
    for i in range(n_rep):
        # Pick max_value at random from 1 to 6 (randint includes both endpoints)
        max_value = random.randint(1, 6)
        # Pick num_dice at random from the English number words
        num_dice = random.choice(list(num_dict.keys()))
        num_dice_num = num_dict[num_dice]
        # Solve the problem with solve_q7
        answer = solve_q7(num_dice=num_dice_num, max_value=max_value)
        # If the answer is an integer greater than 1 and below 10000, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 1) and (answer < 10000):
            answer = int(answer) % 1000
            break
    # Fill the parameters into the problem template
    problem = q7_template.format(num_dice=num_dice, max_value=max_value)
    return problem, answer

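Explanation:

count_outcomes recursively enumerates the dice rolls and counts the outcomes whose highest value equals max_value, together with the total number of outcomes.
- dice_left is the number of dice not yet rolled, and highest_roll is the highest value so far.
- When no dice are left, it reports one desired outcome if highest_roll equals max_value and none otherwise.
- When dice are left, it recurses over the faces of the next die (1 to 6) and accumulates the desired and total outcome counts.
solve_q7 calls count_outcomes and turns the counts into a probability.
- The probability is reduced to lowest terms, and the sum of its numerator and denominator is returned as the answer.
aug_q7 loops up to 10000 times, picking max_value and num_dice at random and solving with solve_q7.
- max_value is a random integer from 1 to 6.
- num_dice is picked at random from English number words and converted to an integer with the dictionary num_dict.
- As soon as the answer is an integer greater than 1 and below 10000, its remainder modulo 1000 becomes the answer and the loop stops.
The parameters are filled into the problem template, and the problem/answer pair is returned.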
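
The recursion visits all 6^n outcomes, which becomes slow for eight or nine dice. The same count has a closed form: the highest roll is at most k in k^n outcomes, so it equals k in k^n - (k-1)^n of them. The sketch below illustrates this shortcut; closed_form_q7 and its sides parameter are hypothetical names, not part of the original notebook.

```python
from fractions import Fraction

def closed_form_q7(num_dice=4, max_value=5, sides=6):
    """Return a + b, where a/b is P(highest roll == max_value) in lowest terms."""
    # Outcomes with every roll <= max_value, minus those with every roll <= max_value - 1
    favourable = max_value**num_dice - (max_value - 1)**num_dice
    probability = Fraction(favourable, sides**num_dice)
    return probability.numerator + probability.denominator

print(closed_form_q7(4, 5))  # 185
```

For four dice and a highest roll of 5 the probability is 369/1296 = 41/144, which gives 41 + 144 = 185.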

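Data augmentation for Problem 8

Problem 8 asks for the area of the convex polygon enclosed by the points that satisfy a certain equation. New problems are generated by changing the numbers in the equation.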

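from fractions import Fraction
import random

def cross(o, p, q):
    # Cross product of the vectors o->p and o->q
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

def convex_hull(points):
    # Andrew's monotone chain; returns the hull vertices in counter-clockwise order
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def solve_q8(a=10, b=10, c=8, d=8):
    # The product is zero only where one factor is zero, so the solution set is
    # the points with |x + y| = a and |x - y| = b, plus the points with |x| = c and |y| = d
    half = Fraction(1, 2)
    points = []
    for s1 in (1, -1):
        for s2 in (1, -1):
            # x + y = s1*a and x - y = s2*b
            points.append(((s1*a + s2*b) * half, (s1*a - s2*b) * half))
            # x = s1*c and y = s2*d
            points.append((Fraction(s1*c), Fraction(s2*d)))

    # The enclosed polygon is taken to be the convex hull of these points
    hull = convex_hull(points)

    # Shoelace formula for the area of the hull
    n = len(hull)
    twice_area = sum(hull[i][0] * hull[(i + 1) % n][1] - hull[(i + 1) % n][0] * hull[i][1]
                     for i in range(n))
    return abs(twice_area) / 2

q8_template = "The points $\\left(x, y\\right)$ satisfying $((\\vert x + y \\vert - {a})^2 + ( \\vert x - y \\vert - {b})^2)((\\vert x \\vert - {c})^2 + ( \\vert y \\vert - {d})^2) = 0$ enclose a convex polygon. What is the area of this convex polygon?"

def aug_q8(q8_template=q8_template, n_rep=10000):
    for i in range(n_rep):
        # Pick the parameters at random
        a = random.randint(1, 10)
        b = random.randint(1, 10)
        c = random.randint(1, 10)
        d = random.randint(1, 10)
        # Solve the problem with solve_q8
        answer = solve_q8(a, b, c, d)
        # If the answer is an integer greater than 1 and below 10000, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 1) and (answer < 10000):
            answer = int(answer) % 1000
            break
    # Fill the parameters into the problem template
    problem = q8_template.format(a=a, b=b, c=c, d=d)
    return problem, answer

Explanation:

cross and convex_hull are helpers that compute the convex hull of a set of points with the monotone chain method.
solve_q8 computes the area of the convex polygon for the given parameters.
- A product of two sums of squares is zero only when one of the factors is zero, so the solution set consists of at most eight points: four with $|x+y|=a$ and $|x-y|=b$, and four with $|x|=c$ and $|y|=d$.
- The polygon enclosed by these points is taken to be their convex hull.
- The area of the hull is computed with the shoelace formula, using exact fractions.
- The default parameters a=10, b=10, c=8, d=8 are assumed to match the original problem.
aug_q8 loops up to 10000 times, picking the parameters a, b, c, d at random and solving with solve_q8.
- As soon as the answer is an integer greater than 1 and below 10000, its remainder modulo 1000 becomes the answer and the loop stops.
The parameters are filled into the problem template, and the problem/answer pair is returned.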
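
As a hand check of the area computation, the polygon can be written out explicitly for one parameter set. The sketch below assumes a = b = 10 and c = d = 8 (values believed to match the original problem, so treat them as an assumption), for which the eight solution points form an octagon; shoelace_area is a hypothetical helper name.

```python
def shoelace_area(vertices):
    """Area of a simple polygon whose vertices are listed in order around the boundary."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

# |x + y| = 10 and |x - y| = 10 give (10, 0), (0, 10), (-10, 0), (0, -10);
# |x| = 8 and |y| = 8 give (8, 8), (-8, 8), (-8, -8), (8, -8).
octagon = [(10, 0), (8, 8), (0, 10), (-8, 8), (-10, 0), (-8, -8), (0, -10), (8, -8)]
print(shoelace_area(octagon))  # 320.0
```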

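Data augmentation for Problem 9

Problem 9 asks for a ratio of areas among the regions into which two segments divide a square. New problems are generated by changing the parameters that fix the positions of the points.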

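import sympy as sp
import random

def solve_q9(a=20, b=24):
    # Vertices of the unit square, and the points P on AB and Q on AD
    A = sp.Point(0, 0)
    B = sp.Point(0, 1)
    C = sp.Point(1, 1)
    D = sp.Point(1, 0)
    # Use exact rationals instead of float division
    P = sp.Point(0, sp.Rational(1, a))
    Q = sp.Point(sp.Rational(1, b), 0)

    # Define the lines DP and BQ
    DP = sp.Line(D, P)
    BQ = sp.Line(B, Q)

    # Find their intersection point X
    intersection = DP.intersection(BQ)[0]

    # Area of the triangle with vertices p1, p2, p3
    def area(p1, p2, p3):
        return abs(p1.x*(p2.y-p3.y) + p2.x*(p3.y-p1.y) + p3.x*(p1.y-p2.y))/2

    # Areas of the four regions
    area1 = area(A, P, intersection) + area(A, intersection, Q)  # quadrilateral APXQ
    area2 = area(P, B, intersection)   # triangle PBX
    area3 = area(intersection, Q, D)   # triangle QXD
    area4 = 1 - area1 - area2 - area3  # quadrilateral BCDX, the rest of the unit square

    # Ratio of the largest area to the smallest area
    max_area = max(area1, area2, area3, area4)
    min_area = min(area1, area2, area3, area4)
    answer = max_area / min_area
    return answer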

q9_template = "Let $ABCD$ be a unit square. Let $P$ be the point on $AB$ such that $|AP| = 1/{{{a}}}$ and let $Q$ be the point on $AD$ such that $|AQ| = 1/{{{b}}}$. The lines $DP$ and $BQ$ divide the square into four regions. Find the ratio between the areas of the largest region and the smallest region."

def aug_q9(q9_template=q9_template, n_rep=10000):
    for i in range(n_rep):
        # Pick the parameters at random; 1 is excluded because P = B or Q = D makes a region degenerate
        a = random.randint(2, 100)
        b = random.randint(2, 100)
        # Solve the problem with solve_q9
        answer = solve_q9(a, b)
        # If the answer is an integer greater than 1 and below 10000, reduce it modulo 1000 and stop
        if (int(answer) == answer) and (answer > 1) and (answer < 10000):
            answer = int(answer) % 1000
            break
    # Fill the parameters into the problem template
    problem = q9_template.format(a=a, b=b)
    return problem, answer

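Explanation:

solve_q9 computes the area ratio among the four regions of the square for the given parameters.
- The coordinates of the points A, B, C, D, P, Q are defined.
- The lines DP and BQ are defined and their intersection is computed.
- The helper area computes the area of a triangle from three points.
- The areas of the four regions into which the two lines divide the square are computed.
- The ratio of the largest area to the smallest area is returned as the answer.
aug_q9 loops up to 10000 times, picking the parameters a and b at random and solving with solve_q9.
- As soon as the answer is an integer greater than 1 and below 10000, its remainder modulo 1000 becomes the answer and the loop stops.
The parameters are filled into the problem template, and the problem/answer pair is returned.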
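
The four regions are two triangles and two quadrilaterals, so it is worth checking the areas without any geometry library. With A = (0, 0), B = (0, 1), D = (1, 0), the lines DP and BQ are y = (1 - x)/a and y = 1 - b*x, and their intersection can be written in closed form. The sketch below is an illustration under those coordinates; exact_ratio_q9 is a hypothetical helper and assumes integers a, b >= 2.

```python
from fractions import Fraction

def exact_ratio_q9(a=20, b=24):
    """Ratio of the largest to the smallest region, for integers a, b >= 2."""
    # Intersection X = (x, y) of DP: y = (1 - x)/a and BQ: y = 1 - b*x
    x = Fraction(a - 1, a * b - 1)
    y = Fraction(b - 1, a * b - 1)
    # Quadrilateral A-P-X-Q, split into triangles A-P-X and A-X-Q
    corner = (x / a + y / b) / 2
    # Triangle P-B-X: base PB = 1 - 1/a on the y-axis, height x
    left = (1 - Fraction(1, a)) * x / 2
    # Triangle Q-X-D: base QD = 1 - 1/b on the x-axis, height y
    bottom = (1 - Fraction(1, b)) * y / 2
    # Quadrilateral B-C-D-X is the rest of the unit square
    rest = 1 - corner - left - bottom
    areas = [corner, left, bottom, rest]
    return max(areas) / min(areas)

print(exact_ratio_q9(20, 24))  # 480
```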

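Data augmentation for Problem 10

Problem 10 asks for a value of a function defined by functional equations. New problems are generated by changing the numbers that appear in the definition.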

import random

def create_memo(n):
    # Build the memo dictionary for powers of two, starting from f(1) = 1
    memo = {1: 1}
    value = 1
    while value < n:
        new_value = 2 * value
        memo[new_value] = 2 * memo[value] + 1
        value = new_value
    return memo

def solve_q10(n=100, a=8, b=7):
    def compute_f_third(n, a=a, b=b):
        # Value of f(f(f(n))) = a*n - b
        return a * n - b

    def compute_f(n, memo, memo_third_reverse):
        # Compute f(n), returning -1 when it cannot be determined
        if n in memo:
            return memo[n]

        if n % 2 == 0:
            # f(2k) = 2*f(k) + 1
            value = 2 * compute_f(n // 2, memo, memo_third_reverse) + 1
        elif n in memo_third_reverse:
            # n = a*k - b, so f(n) = f(f(f(f(k)))) = a*f(k) - b
            value = compute_f(memo_third_reverse[n], memo, memo_third_reverse)
            if value != -1:
                value = compute_f_third(value)
        else:
            value = -1  # undetermined

        if value > 0:
            memo[n] = value

        return value

    # Build the inverse lookup of k -> a*k - b
    memo_third_reverse = {1: 1}
    value = 1
    while value < n:
        new_value = value + 1
        y = compute_f_third(new_value)
        memo_third_reverse[y] = new_value
        value = new_value

    # Build the memo dictionary
    memo = create_memo(n)
    # Compute f(n)
    answer = compute_f(n, memo, memo_third_reverse)
    return answer

q10_template = "A function $f: \\mathbb N \\to \\mathbb N$ satisfies the following two conditions for all positive integers $n$:$f(f(f(n)))={a}n-{b}$ and $f(2n)=2f(n)+1$. Calculate $f({n})$."

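def aug_q10(q10_template=q10_template, n_rep=10000):
    for i in range(n_rep):
        # Pick the parameters at random. solve_q10 hard-codes f(1) = 1, which is
        # consistent with f(f(f(1))) = a - b only when a - b = 1, so b is tied to a.
        a = random.randint(3, 20)
        b = a - 1
        n = random.randint(50, 1000)
        try:
            # Solve the problem with solve_q10
            answer = solve_q10(n, a, b)
            # If the answer is an integer greater than 1 and below 10000, reduce it modulo 1000 and stop
            if (int(answer) == answer) and (answer > 1) and (answer < 10000):
                answer = int(answer) % 1000
                break
        except Exception:
            # The computation can fail for some parameters; skip them
            pass
    # Fill the parameters into the problem template
    problem = q10_template.format(n=n, a=a, b=b)
    return problem, answer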

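Explanation:

create_memo uses the relation $f(2n)=2f(n)+1$ to build a memo dictionary that makes computing $f(n)$ efficient.
solve_q10 computes the value of $f(n)$ for the given parameters.
- compute_f_third returns the value of $f(f(f(n)))$.
- compute_f computes $f(n)$ with memoization.
  - When $n$ is even, it uses the relation $f(2n)=2f(n)+1$.
  - When $n$ is odd, it uses the inverse lookup of $f(f(f(n)))$.
  - The result is stored in the memo dictionary.
- The inverse lookup dictionary for $f(f(f(n)))$ is built.
- The memo dictionary is built and $f(n)$ is computed.
Note that the solver assumes $f(1)=1$, which is consistent with the two conditions only when $a-b=1$. Even then, a function satisfying both conditions is not guaranteed to exist for every $a$, so generated problems should be treated with some care.
aug_q10 loops up to 10000 times, picking the parameters a, b, n at random and solving with solve_q10.
- As soon as the answer is an integer greater than 1 and below 10000, its remainder modulo 1000 becomes the answer and the loop stops.
- Because a solution may not exist, a try-except statement handles the resulting exceptions.
The parameters are filled into the problem template, and the problem/answer pair is returned.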
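
For the default parameters a = 8 and b = 7, a candidate closed form can be tested directly against both conditions. The sketch below checks the guess f(n) = 2n - 1 on a finite range; the guess is an assumption to be verified by the assertions, not something stated in the original notebook.

```python
def f(n):
    # Candidate closed form for a = 8, b = 7 (an assumption, verified below)
    return 2 * n - 1

# Check both conditions on a finite range of positive integers
for n in range(1, 1001):
    assert f(f(f(n))) == 8 * n - 7
    assert f(2 * n) == 2 * f(n) + 1

print(f(100))  # 199
```

The factor 8 = 2^3 is what makes this linear form work, which suggests that other values of a may not admit such a simple solution.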

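Combining the augmented data

Finally, the augmented problems are gathered into a single DataFrame with the same columns as the original one.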

aug_functions = [aug_q1, aug_q2, aug_q3, aug_q4, aug_q5,
                 aug_q6, aug_q7, aug_q8, aug_q9, aug_q10]

aug_df = []
for i in tqdm(range(9)):
    # Generate one new problem/answer pair per original problem
    results = [aug() for aug in aug_functions]
    aug_df_ = pd.DataFrame()
    aug_df_['id'] = df.id.values
    aug_df_['problem'] = [problem for problem, _ in results]
    aug_df_['answer'] = [answer for _, answer in results]
    aug_df.append(aug_df_)
aug_df = pd.concat(aug_df).reset_index(drop=True)