The Road to Passive Income with AI (14): Automatically Retrieving Model Parameters

Introduction

I want to trade strategically, even as an amateur

Many people, myself included, lack the knowledge of stocks and FX needed to trade strategically.
So I'm going to borrow the power of reinforcement learning to trade strategically.

The AIncome Project

I named the project "AIncome" because the goal is to earn income with AI.

Previous Installments

Building the environment for automated-trading reinforcement learning

Previously, I explained reinforcement learning and OpenAI Gym, which is used to simulate it*1. I also tried buying and selling with randomly chosen actions, and even random trading happened to turn a profit*1. From there I replaced random trading with a PPO2 deep reinforcement learning model*2, added Tensorboard visualization*3, and accounted for trading fees*4.

Speeding up training

Next, I increased the number of training steps tenfold, which made processing time balloon*5, so I learned about parallelization*6 and applied it to FX data*7 and Bitcoin data*8.

Setting up for long-term training

Just as the results were looking promising and I was about to start longer training runs, I hit a string of problems: additional training didn't work (solved*9), models couldn't be saved periodically (solved*10), and Tensorboard misbehaved (solved*10). With those out of the way, the next step is setting up long-term training on Google Colab.

Overview of This Installment

Google Colab's free plan limits how long a session can run, so long training runs aren't possible out of the box. Training continuously for, say, a week therefore takes some workarounds. In this installment I build the machinery for saving the model partway through training and resuming from a saved checkpoint, so that even if the connection to Google Colab drops, I can simply restart and continue long-term training from the step count already reached.
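
Before the actual code, here is a minimal, self-contained sketch of that pattern using plain files instead of RL models; the demo_ckpt folder and file names are made up purely for illustration. The rest of this post implements the same idea with PPO2 checkpoints in logs/.

# Minimal sketch of "save every N steps, resume from the newest checkpoint".
# Plain files stand in for model checkpoints; only the pattern matters here.
import glob, os

os.makedirs("demo_ckpt", exist_ok=True)

def latest_step():
    files = sorted(glob.glob("demo_ckpt/model_*_steps.txt"))
    return int(files[-1].split("_")[-2]) if files else 0

start = latest_step()                 # 0 on the first run, the saved step count after a restart
for step in range(start + 1, start + 501):
    if step % 100 == 0:               # plays the role of save_freq
        with open("demo_ckpt/model_{:09d}_steps.txt".format(step), "w") as f:
            f.write("dummy weights")

print("resumed from step", start, "and stopped at step", start + 500)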

What You Need

A Google Colab environment

Official documentation

Examples — Stable Baselines 2.10.2 documentation

Implementation files from the previous installment

The Road to Passive Income with AI (13): Plotting Additional Training on the Same Tensorboard Graph
There, I set up an environment in which Tensorboard displays correctly, using the LunarLander sample.

Implementation Highlights

Improving CheckpointCallback

The default CheckpointCallback does not fix the number of digits in the step count, so the width of the number in the saved file name varies with its value.
This is a real problem: loading the latest model then requires awkward extra handling, so I modify the callback to pad the step count to a fixed width.
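
To see why the padding matters: the latest checkpoint is later picked with sorted(glob.glob(...))[-1], and string sorting misorders unpadded step counts. A quick illustration with hypothetical file names:

# Unpadded step counts sort lexicographically, so 9000 ends up "after" 10000.
unpadded = ["rl_model_9000_steps.zip", "rl_model_10000_steps.zip"]
print(sorted(unpadded)[-1])   # rl_model_9000_steps.zip  <- wrongly treated as the latest

# Zero-padding to a fixed width makes string order match numeric order.
padded = ["rl_model_000009000_steps.zip", "rl_model_000010000_steps.zip"]
print(sorted(padded)[-1])     # rl_model_000010000_steps.zip

With that in mind, the modified callback below zero-pads num_timesteps to nine digits: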

class CheckpointCallback2(BaseCallback):
    """
    Callback for saving a model every `save_freq` steps

    :param save_freq: (int)
    :param save_path: (str) Path to the folder where the model will be saved.
    :param name_prefix: (str) Common prefix to the saved models
    """
    def __init__(self, save_freq: int, save_path: str, name_prefix='rl_model', verbose=0):
        super(CheckpointCallback2, self).__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path
        self.name_prefix = name_prefix

    def _init_callback(self) -> None:
        # Create folder if needed
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.save_freq == 0:
            path = os.path.join(self.save_path, '{}_{:09d}_steps'.format(self.name_prefix, self.num_timesteps))
            self.model.save(path)
            if self.verbose > 1:
                print("Saving model checkpoint to {}".format(path))
        return True

Defining a Function That Retrieves the Latest Model Parameters

Next, I define a function that extracts the model name, step count, and so on from the names of the ZIP files written by the periodic-save callback above.

With this in place, when Google Colab stops, simply restarting the notebook is enough to pick up the parameters of the latest saved model.

def get_latest_model_param():
  # ---------
  # load save model
  model_list = sorted(glob.glob("logs/*.zip"))
  resume_FLAG = False

  if(len(model_list)>0):
    model_latest = model_list[-1].split("/")[-1].split(".zip")[0].split("_step")[0]
    resume_FLAG = True
    return model_latest.split("_")[0], int(model_latest.split("_")[1]), int(model_latest.split("_")[2]), resume_FLAG

  return None, None, None, resume_FLAG
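
As a usage illustration, suppose the newest ZIP in logs/ happens to be a hypothetical checkpoint named PPO2_000000001_000204800_steps.zip; the call would then behave like this:

# Hypothetical usage: the return values depend on whatever checkpoint is newest.
model_name, resume_idx, train_num, resume_FLAG = get_latest_model_param()
print(model_name, resume_idx, train_num, resume_FLAG)
# -> PPO2 1 204800 True   (or None, None, None, False when logs/ has no ZIP yet)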

Improving the Training Loop

I modify the training loop so that it fetches its parameters from get_latest_model_param() and resumes training from there.
This removes the need to type the parameters in by hand every time.

for i in tqdm(range(int(resume_idx),  5)):

  # ====================================================
  # callback
  #
  # -------
  # eval callback
  #
  eval_callback = EvalCallback2(env_marker_test2, best_model_save_path='./logs/best_model',
                              log_path='./logs/results', eval_freq=eval_freq, verbose=2, name_prefix='PPO2_{:09d}_'.format(i))
  # -------
  # checkpoint callback
  #
  checkpoint_callback = CheckpointCallback2(save_freq=save_freq, save_path='./logs/', name_prefix='PPO2_{:09d}'.format(i), verbose=2)

  # -------
  # merge callback
  #
  #callback = CallbackList([checkpoint_callback, eval_callback, ProgressBarCallback(total_timesteps)])
  callback = CallbackList([checkpoint_callback, eval_callback, ])

  _model_name, _resume_idx, _train_num, _resume_FLAG = get_latest_model_param()
  if(_resume_FLAG):
    # ====================================================
    # Model setting
    #
    model = PPO2.load("./logs/{}_{:09d}_{:09d}_steps".format(_model_name, int(_resume_idx), int(_train_num)))
    model.set_env(env)
    model.train_num = _train_num

    # -------
    # tensorboard
    #
    model.tensorboard_log = "logs"
    model.num_timesteps = _train_num

  #tb_log_name = "sampleE{}v".format(i)
  tb_log_name = "AutoLoopTrainTB"
  print("tb_log_name : {}".format(tb_log_name))

  model.learn(total_timesteps=total_timesteps , log_interval=100, tb_log_name=tb_log_name, reset_num_timesteps=False, callback=callback)

  # ====================================================
  # Save model
  #
  model.save("./logs/PPO2_{:09d}_{:09d}_steps".format(i, total_timesteps*(i+1)*env_num))
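
The load path rebuilt inside the loop has to match the zero-padded pattern that CheckpointCallback2 and model.save() write out. A quick sanity check of the format string, using made-up numbers:

# Hypothetical values, just to show the resulting path format.
_model_name, _resume_idx, _train_num = "PPO2", 1, 204800
path = "./logs/{}_{:09d}_{:09d}_steps".format(_model_name, int(_resume_idx), int(_train_num))
print(path)   # ./logs/PPO2_000000001_000204800_steps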

Full Code

Google Drive setting

Connect to Google Drive

from google.colab import drive
drive.mount('/content/drive')

Create the project folder

%cd /content/drive/MyDrive/
!mkdir AIncome

Move into the project folder

%cd /content/drive/MyDrive/AIncome

Handy Colab Commands

Check how much runtime has elapsed (to gauge the remaining session time)

!cat /proc/uptime | awk '{print $1 /60 /60 /24 "days (" $1 / 60 / 60 "h)"}'

Keep-alive snippet (JavaScript to run in the browser console so the session does not disconnect)

"""
function ClickConnect(){

console.log("Working3");
document.querySelector("#comments > span").click()
}
setInterval(ClickConnect,5000)
"""

Working folder

!mkdir AutoTrade14_V7_01
%cd  AutoTrade14_V7_01

Install Required Packages

Reinforcement learning (core)

!pip install "gym==0.19.0"
!pip install stable-baselines[mpi]

Deep learning

!pip uninstall -y tensorflow-gpu
!pip uninstall -y tensorflow
!pip install tensorflow-gpu==1.14.0

Reinforcement learning (trading)

!pip install gym-anytrading

Trade analysis

!pip install QuantStats

Tensorboard

!pip uninstall tensorboard-plugin-wit --yes

TA-Lib

!curl -L http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz -O && tar xzvf ta-lib-0.4.0-src.tar.gz
!cd ta-lib && ./configure --prefix=/usr && make && make install && cd - && pip install ta-lib

Yahoo API

!pip install yahoo_finance_api2
!pip install yfinance

Historic Crypto

!pip install Historic-Crypto

Visualization helpers

!pip install mplfinance

Imports


import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

#from tqdm import tqdm
from tqdm.notebook import tqdm

import warnings
import typing
from typing import Union, List, Dict, Any, Optional

import gym
import gym_anytrading
from gym_anytrading.envs import TradingEnv, ForexEnv, StocksEnv, Actions, Positions
from gym_anytrading.datasets import FOREX_EURUSD_1H_ASK, STOCKS_GOOGL

from stable_baselines.bench import Monitor

from stable_baselines.common.vec_env import VecEnv, sync_envs_normalization, DummyVecEnv
from stable_baselines.common.vec_env import DummyVecEnv, SubprocVecEnv
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2
from stable_baselines import ACKTR
from stable_baselines import A2C
from stable_baselines.common.callbacks import CallbackList, CheckpointCallback, EvalCallback, EventCallback
from stable_baselines.common.callbacks import BaseCallback
from stable_baselines.common.evaluation import evaluate_policy

import quantstats as qs
import mplfinance as mpf

import talib
from yahoo_finance_api2 import share
import yfinance
from Historic_Crypto import Cryptocurrencies
from Historic_Crypto import HistoricalData

import tensorflow as tf

Configuration


# Create the log and dataset folders
log_dir = './logs/'
datasets_dir = '../datasets/'

os.makedirs(log_dir, exist_ok=True)
os.makedirs(datasets_dir, exist_ok=True)

# train data idx
idx1 = 100
idx2 = 5000

# test data idx
idx3 = 6000

window_size = 100

trade_fee = 0

env_num = 10

total_timesteps=100000

# tb_log_name = "PPO2_feat19"
tb_log_name = "PPO2_feat57_100_lstm128"

DATASET_GET_FLAG = False

n_steps = 128

save_freq = 100
eval_freq = 1000

Create datasets

Historic Crypto Ver

Check ID

# Check which currency pairs offer historical data
crypto_data = Cryptocurrencies().find_crypto_pairs()
# Print the result
print(crypto_data)
Setting param

# Arguments
symbol      = 'XLM-USD'          # currency pair
granularity = 300                # interval between data points in seconds (60, 300, 900, 3600, 21600, 86400 are accepted)
start_date  = '2020-01-01-00-00' # start of the data range
end_date    = '2022-08-29-00-00' # end of the data range

dataset_name = "../datasets/symbol-{}_granularity-{}_start-{}_end-{}.csv".format(symbol, granularity, start_date, end_date)
Get data & Save data

# Fetch the data
if DATASET_GET_FLAG :
  data = HistoricalData(symbol,granularity,start_date,end_date).retrieve_data()

# Check, adjust, and save the data
if DATASET_GET_FLAG :
  #### Check data
  data.head(5)

  #### Adjust data

  data2 = data.reset_index()
  data2.columns = ["Time", "Low", "High",  "Open", "Close", "volume"]

  data2.head(5)

  # Save dataset
  data2.to_csv(dataset_name)

Read dataset

Load the cryptocurrency data saved above.

#df_dataset = pd.read_csv('./datasets/btc_1M.csv', header=None)
df_dataset = pd.read_csv(dataset_name, index_col=0)
df_dataset

Parse the Time column as datetime and set it as the index.

#df_dataset.columns = ["Time", "Open", "High", "Low", "Close", "volume"]
df_dataset["Time"] = pd.to_datetime(df_dataset["Time"])
df_dataset = df_dataset.set_index("Time")
print(df_dataset)

Split data

df_train = df_dataset[0:int(len(df_dataset)*0.7)]
df_train
df_test = df_dataset[int(len(df_dataset)*0.7):]
df_test

The index is now a datetime-typed time index.

Create technical indicator

## Create technical indicator func
def create_technical_indicator(_df):

  df = _df.copy()
  """
  # ---------
  # SMA
  df['SMA'] = talib.SMA(df['Close'], timeperiod=5)

  # ---------
  # EMA
  df['EMA'] = talib.EMA(df['Close'], timeperiod=5)

  # ---------
  # BB
  df["BB_UPPER"], df["BB_MIDDLE"], df["BB_LOWER"] = talib.BBANDS(df["Close"], timeperiod=20, matype=talib.MA_Type.EMA)

  # ---------
  # RSI
  df["RSI"] = talib.RSI(df["Close"], timeperiod=5)

  # ---------
  # stochastics
  df["SLOW_K"], df["SLOW_D"] = talib.STOCH(df["High"], df["Low"], df["Close"], fastk_period=5, slowk_period=3, slowd_period=3)

  # ---------
  # DMI
  df["DMI_p_DI"] = talib.PLUS_DI(df["High"], df["Low"], df["Close"], timeperiod=14)
  df["DMI_m_DI"] = talib.MINUS_DI(df["High"], df["Low"], df["Close"], timeperiod=14)
  df["DMI_ADX"] = talib.ADX(df["High"], df["Low"], df["Close"], timeperiod=14)

  # ---------
  # MACD
  df["MACD"], df["SIGNAL"], df["HIST"] = talib.MACD(df["Close"], fastperiod=12, slowperiod=26, signalperiod=9)
  """

  df.dropna(inplace=True)
  return df

df_train_ti = create_technical_indicator(df_train)
df_test_ti = create_technical_indicator(df_test)

def plot_chart(df, term):

  indicators = [
      mpf.make_addplot(df['SMA'][term:], color='skyblue', width=2.5),
      mpf.make_addplot(df['EMA'][term:], color='pink', width=2.5),
  ]

  mpf.plot(df[term:], figratio=(12,4), type='candle', style="yahoo", volume=False, addplot=indicators)

df_train_ti.head(5)
#plot_chart(df=df_train_ti, term=-100)

Environment Setup and Creation

Data loading

#####################################
# Data loading
#
def my_process_data(env):
    start = env.frame_bound[0] - env.window_size
    end = env.frame_bound[1]
    prices = env.df.loc[:, 'Low'].to_numpy()[start:end]

    # -----------------------------
    # Feature generation
    #
    df_features = env.df.copy()
    # df_features.drop("")
    ohlc_features = df_features.loc[:, :].to_numpy()[start:end]

    #print(ohlc_features.shape)
    #print(np.diff(ohlc_features, axis=0).shape)
    diff1 = np.insert(np.diff(ohlc_features, axis=0), 0, 0, axis=0)
    diff2 = np.insert(np.diff(diff1, axis=0), 0, 0, axis=0)
    #print(diff)
    #signal_features = env.df.loc[:, ['Close', 'Open', 'High', 'Low', 'Vol']].to_numpy()[start:end]

    signal_features = np.column_stack((ohlc_features, diff1, diff2))
    #signal_features = np.column_stack((ohlc_features, ))

    print(">>> signal_features.shape")
    print( signal_features.shape)
    #print( signal_features.head(5))

    return prices, signal_features
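
The two difference features above each drop one row, so np.insert(..., 0, 0, axis=0) pads a zero row back in to keep the row counts aligned with prices. A tiny check with dummy numbers (nothing here comes from the trading data):

import numpy as np

# Dummy single-column "features" to show how the diff padding behaves.
x = np.array([[1.0], [3.0], [6.0]])
diff1 = np.insert(np.diff(x, axis=0), 0, 0, axis=0)      # [[0.], [2.], [3.]]
diff2 = np.insert(np.diff(diff1, axis=0), 0, 0, axis=0)  # [[0.], [2.], [1.]]
print(np.column_stack((x, diff1, diff2)).shape)          # (3, 3) -- rows still line up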

Environment class definition

#####################################
# Environment class
#
class MyBTCEnv(ForexEnv):
    _process_data = my_process_data

Creating the environments

len(df_train_ti)
#####################################
# Create the environments
#
env_marker_train = lambda:  MyBTCEnv(df=df_train_ti, window_size=window_size, frame_bound=(window_size, len(df_train_ti)))
env_marker_train.trade_fee = trade_fee

env_marker_test = lambda:  MyBTCEnv(df=df_test_ti, window_size=window_size, frame_bound=(window_size, len(df_test_ti)))
env_marker_test.trade_fee = trade_fee

env_marker_test2 = env_marker_test()
env_marker_test2.trade_fee = trade_fee

env = DummyVecEnv([env_marker_train for _ in range(env_num)])
#env = SubprocVecEnv([env_marker_train for i in range(env_num)])

env_test = DummyVecEnv([env_marker_test for _ in range(1)])

#env = Monitor(env, log_dir, allow_early_resets=True)

Train trade

Callback

CheckpointCallback

This callback saves the model at regular intervals.

Because the default CheckpointCallback lets the number of digits in the step count vary, I tweak it slightly.

class CheckpointCallback2(BaseCallback):
    """
    Callback for saving a model every `save_freq` steps

    :param save_freq: (int)
    :param save_path: (str) Path to the folder where the model will be saved.
    :param name_prefix: (str) Common prefix to the saved models
    """
    def __init__(self, save_freq: int, save_path: str, name_prefix='rl_model', verbose=0):
        super(CheckpointCallback2, self).__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path
        self.name_prefix = name_prefix

    def _init_callback(self) -> None:
        # Create folder if needed
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)

    def _on_step(self) -> bool:
        if self.n_calls % self.save_freq == 0:
            path = os.path.join(self.save_path, '{}_{:09d}_steps'.format(self.name_prefix, self.num_timesteps))
            self.model.save(path)
            if self.verbose > 1:
                print("Saving model checkpoint to {}".format(path))
        return True

EvalCallback

class EvalCallback2(EventCallback):
    """
    Callback for evaluating an agent.

    :param eval_env: (Union[gym.Env, VecEnv]) The environment used for initialization
    :param callback_on_new_best: (Optional[BaseCallback]) Callback to trigger
        when there is a new best model according to the `mean_reward`
    :param n_eval_episodes: (int) The number of episodes to test the agent
    :param eval_freq: (int) Evaluate the agent every eval_freq call of the callback.
    :param log_path: (str) Path to a folder where the evaluations (`evaluations.npz`)
        will be saved. It will be updated at each evaluation.
    :param best_model_save_path: (str) Path to a folder where the best model
        according to performance on the eval env will be saved.
    :param deterministic: (bool) Whether the evaluation should
        use a stochastic or deterministic actions.
    :param render: (bool) Whether to render or not the environment during evaluation
    :param verbose: (int)
    """
    def __init__(self, eval_env: Union[gym.Env, VecEnv],
                 callback_on_new_best: Optional[BaseCallback] = None,
                 n_eval_episodes: int = 5,
                 eval_freq: int = 10000,
                 log_path: str = None,
                 best_model_save_path: str = None,
                 deterministic: bool = True,
                 render: bool = False,
                 verbose: int = 1,
                 name_prefix:str = None):

        super(EvalCallback2, self).__init__(callback_on_new_best, verbose=verbose)
        self.n_eval_episodes = n_eval_episodes
        self.eval_freq = eval_freq
        self.best_mean_reward = -np.inf
        self.last_mean_reward = -np.inf
        self.deterministic = deterministic
        self.render = render
        self.name_prefix = name_prefix

        # Convert to VecEnv for consistency
        if not isinstance(eval_env, VecEnv):
            eval_env = DummyVecEnv([lambda: eval_env])

        assert eval_env.num_envs == 1, "You must pass only one environment for evaluation"

        self.eval_env = eval_env
        self.best_model_save_path = best_model_save_path
        # Logs will be written in `evaluations.npz`
        if log_path is not None:
            log_path = os.path.join(log_path, 'evaluations')
        self.log_path = log_path
        self.evaluations_results = []
        self.evaluations_timesteps = []
        self.evaluations_length = []

    def _init_callback(self):
        # Does not work in some corner cases, where the wrapper is not the same
        if not type(self.training_env) is type(self.eval_env):
            warnings.warn("Training and eval env are not of the same type"
                          "{} != {}".format(self.training_env, self.eval_env))

        # Create folders if needed
        if self.best_model_save_path is not None:
            os.makedirs(self.best_model_save_path, exist_ok=True)
        if self.log_path is not None:
            os.makedirs(os.path.dirname(self.log_path), exist_ok=True)

    def _on_step(self) -> bool:

        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            # Sync training and eval env if there is VecNormalize
            sync_envs_normalization(self.training_env, self.eval_env)

            episode_rewards, episode_lengths = evaluate_policy(self.model, self.eval_env,
                                                               n_eval_episodes=self.n_eval_episodes,
                                                               render=self.render,
                                                               deterministic=self.deterministic,
                                                               return_episode_rewards=True)

            print(">>> log_path >>>")
            self.log_path2 = self.log_path + "-{:09d}".format(self.num_timesteps)
            print(self.log_path)
            print(self.log_path2)

            if self.log_path2 is not None:
                self.evaluations_timesteps.append(self.num_timesteps)
                self.evaluations_results.append(episode_rewards)
                self.evaluations_length.append(episode_lengths)
                np.savez(self.log_path2, timesteps=self.evaluations_timesteps,
                         results=self.evaluations_results, ep_lengths=self.evaluations_length)

            print(">>> episode_rewards >>>")
            print(episode_rewards)

            mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)
            mean_ep_length, std_ep_length = np.mean(episode_lengths), np.std(episode_lengths)
            # Keep track of the last evaluation, useful for classes that derive from this callback
            self.last_mean_reward = mean_reward

            if self.verbose > 0:
                print("Eval num_timesteps={}, "
                      "episode_reward={:.2f} +/- {:.2f}".format(self.num_timesteps, mean_reward, std_reward))
                print("Episode length: {:.2f} +/- {:.2f}".format(mean_ep_length, std_ep_length))

            if mean_reward > self.best_mean_reward:
                if self.verbose > 0:
                    print("New best mean reward!")
                if self.best_model_save_path is not None:
                    self.model.save(os.path.join(self.best_model_save_path, 'best_model'))
                self.best_mean_reward = mean_reward
                # Trigger callback if needed
                if self.callback is not None:
                    return self._on_event()

        return True

Set model

# ------
# first train
#
policy_kwargs = dict(net_arch=[64*2, 'lstm', dict(vf=[128, 128, 128], pi=[64*2, 64*2])])
#model = PPO2('MlpLstmPolicy', env, verbose=0, policy_kwargs=policy_kwargs, nminibatches=env_num, tensorboard_log=log_dir, n_steps=n_steps)

# ------
# resume
#
#model = PPO2.load("/content/drive/MyDrive/AIncome/AutoTrade13/logs/rl_model_90000_steps")
#model.set_env(env)

total_episode_reward_logger

def total_episode_reward_logger2(rew_acc, rewards, masks, writer, steps, n_steps=128, train_num=None):
    """
    calculates the cumulated episode reward, and prints to tensorflow log the output

    :param rew_acc: (np.array float) the total running reward
    :param rewards: (np.array float) the rewards
    :param masks: (np.array bool) the end of episodes
    :param writer: (TensorFlow Session.writer) the writer to log to
    :param steps: (int) the current timestep
    :return: (np.array float) the updated total running reward
    :return: (np.array float) the updated total running reward
    """

    #print(">>>>> step : {}".format(steps))
    #print("rewards: {}".format(rewards))

    with tf.variable_scope("environment_info", reuse=True):
        for env_idx in range(rewards.shape[0]):
            dones_idx = np.sort(np.argwhere(masks[env_idx]))

            """
            print("masks    : {}".format(masks))
            print("dones_idx: {}".format(dones_idx))
            print("dones_idx: {}".format(len(dones_idx)))
            """

            if len(dones_idx) == 0:
                rew_acc[env_idx] += sum(rewards[env_idx])
            else:
                rew_acc[env_idx] += sum(rewards[env_idx, :dones_idx[0, 0]])
                summary = tf.Summary(value=[tf.Summary.Value(tag="episode_reward", simple_value=rew_acc[env_idx])])

                stepA = int((int(steps/(train_num)) -1)*train_num + (env_idx+1)*(train_num/rewards.shape[0]))

                """
                print("---=== step+   : {}".format(steps + dones_idx[0, 0]))
                print("--->>> step    : {}".format(steps))
                print("------ env_idx : {}".format(env_idx))
                print("---*** stepA   : {}".format(stepA))
                """

                writer.add_summary(summary, stepA)

                for k in range(1, len(dones_idx[:, 0])):
                    rew_acc[env_idx] = sum(rewards[env_idx, dones_idx[k - 1, 0]:dones_idx[k, 0]])
                    summary = tf.Summary(value=[tf.Summary.Value(tag="episode_reward", simple_value=rew_acc[env_idx])])
                    #writer.add_summary(summary, steps + dones_idx[k, 0])
                    writer.add_summary(summary, steps + dones_idx[k, 0])
                    print("---*** step : {}".format(steps + dones_idx[k, 0]))
                rew_acc[env_idx] = sum(rewards[env_idx, dones_idx[-1, 0]:])

    return rew_acc

PPO2

import time

import gym
import numpy as np
import tensorflow as tf

from stable_baselines import logger
from stable_baselines.common import explained_variance, ActorCriticRLModel, tf_util, SetVerbosity, TensorboardWriter
from stable_baselines.common.runners import AbstractEnvRunner
from stable_baselines.common.policies import ActorCriticPolicy, RecurrentActorCriticPolicy
from stable_baselines.common.schedules import get_schedule_fn
#from stable_baselines.common.tf_util import total_episode_reward_logger
from stable_baselines.common.math_util import safe_mean

class PPO2(ActorCriticRLModel):
    """
    Proximal Policy Optimization algorithm (GPU version).
    Paper: https://arxiv.org/abs/1707.06347

    :param policy: (ActorCriticPolicy or str) The policy model to use (MlpPolicy, CnnPolicy, CnnLstmPolicy, ...)
    :param env: (Gym environment or str) The environment to learn from (if registered in Gym, can be str)
    :param gamma: (float) Discount factor
    :param n_steps: (int) The number of steps to run for each environment per update
        (i.e. batch size is n_steps * n_env where n_env is number of environment copies running in parallel)
    :param ent_coef: (float) Entropy coefficient for the loss calculation
    :param learning_rate: (float or callable) The learning rate, it can be a function
    :param vf_coef: (float) Value function coefficient for the loss calculation
    :param max_grad_norm: (float) The maximum value for the gradient clipping
    :param lam: (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
    :param nminibatches: (int) Number of training minibatches per update. For recurrent policies,
        the number of environments run in parallel should be a multiple of nminibatches.
    :param noptepochs: (int) Number of epoch when optimizing the surrogate
    :param cliprange: (float or callable) Clipping parameter, it can be a function
    :param cliprange_vf: (float or callable) Clipping parameter for the value function, it can be a function.
        This is a parameter specific to the OpenAI implementation. If None is passed (default),
        then `cliprange` (that is used for the policy) will be used.
        IMPORTANT: this clipping depends on the reward scaling.
        To deactivate value function clipping (and recover the original PPO implementation),
        you have to pass a negative value (e.g. -1).
    :param verbose: (int) the verbosity level: 0 none, 1 training information, 2 tensorflow debug
    :param tensorboard_log: (str) the log location for tensorboard (if None, no logging)
    :param _init_setup_model: (bool) Whether or not to build the network at the creation of the instance
    :param policy_kwargs: (dict) additional arguments to be passed to the policy on creation
    :param full_tensorboard_log: (bool) enable additional logging when using tensorboard
        WARNING: this logging can take a lot of space quickly
    :param seed: (int) Seed for the pseudo-random generators (python, numpy, tensorflow).
        If None (default), use random seed. Note that if you want completely deterministic
        results, you must set `n_cpu_tf_sess` to 1.
    :param n_cpu_tf_sess: (int) The number of threads for TensorFlow operations
        If None, the number of cpu of the current machine will be used.
    """
    def __init__(self, policy, env, gamma=0.99, n_steps=128, ent_coef=0.01, learning_rate=2.5e-4, vf_coef=0.5,
                 max_grad_norm=0.5, lam=0.95, nminibatches=4, noptepochs=4, cliprange=0.2, cliprange_vf=None,
                 verbose=0, tensorboard_log=None, _init_setup_model=True, policy_kwargs=None,
                 full_tensorboard_log=False, seed=None, n_cpu_tf_sess=None, train_num=None):

        self.learning_rate = learning_rate
        self.cliprange = cliprange
        self.cliprange_vf = cliprange_vf
        self.n_steps = n_steps
        self.ent_coef = ent_coef
        self.vf_coef = vf_coef
        self.max_grad_norm = max_grad_norm
        self.gamma = gamma
        self.lam = lam
        self.nminibatches = nminibatches
        self.noptepochs = noptepochs
        self.tensorboard_log = tensorboard_log
        self.full_tensorboard_log = full_tensorboard_log

        self.action_ph = None
        self.advs_ph = None
        self.rewards_ph = None
        self.old_neglog_pac_ph = None
        self.old_vpred_ph = None
        self.learning_rate_ph = None
        self.clip_range_ph = None
        self.entropy = None
        self.vf_loss = None
        self.pg_loss = None
        self.approxkl = None
        self.clipfrac = None
        self._train = None
        self.loss_names = None
        self.train_model = None
        self.act_model = None
        self.value = None
        self.n_batch = None
        self.summary = None
        self.p_id    = 0
        self.train_num = train_num

        super().__init__(policy=policy, env=env, verbose=verbose, requires_vec_env=True,
                         _init_setup_model=_init_setup_model, policy_kwargs=policy_kwargs,
                         seed=seed, n_cpu_tf_sess=n_cpu_tf_sess)

        if _init_setup_model:
            self.setup_model()

    def _make_runner(self):
        return Runner(env=self.env, model=self, n_steps=self.n_steps,
                      gamma=self.gamma, lam=self.lam)

    def _get_pretrain_placeholders(self):
        policy = self.act_model
        if isinstance(self.action_space, gym.spaces.Discrete):
            return policy.obs_ph, self.action_ph, policy.policy
        return policy.obs_ph, self.action_ph, policy.deterministic_action

    def setup_model(self):
        with SetVerbosity(self.verbose):

            assert issubclass(self.policy, ActorCriticPolicy), "Error: the input policy for the PPO2 model must be " \
                                                               "an instance of common.policies.ActorCriticPolicy."

            self.n_batch = self.n_envs * self.n_steps

            self.graph = tf.Graph()
            with self.graph.as_default():
                self.set_random_seed(self.seed)
                self.sess = tf_util.make_session(num_cpu=self.n_cpu_tf_sess, graph=self.graph)

                n_batch_step = None
                n_batch_train = None
                if issubclass(self.policy, RecurrentActorCriticPolicy):
                    assert self.n_envs % self.nminibatches == 0, "For recurrent policies, "\
                        "the number of environments run in parallel should be a multiple of nminibatches."
                    n_batch_step = self.n_envs
                    n_batch_train = self.n_batch // self.nminibatches

                act_model = self.policy(self.sess, self.observation_space, self.action_space, self.n_envs, 1,
                                        n_batch_step, reuse=False, **self.policy_kwargs)
                with tf.variable_scope("train_model", reuse=True,
                                       custom_getter=tf_util.outer_scope_getter("train_model")):
                    train_model = self.policy(self.sess, self.observation_space, self.action_space,
                                              self.n_envs // self.nminibatches, self.n_steps, n_batch_train,
                                              reuse=True, **self.policy_kwargs)

                with tf.variable_scope("loss", reuse=False):
                    self.action_ph = train_model.pdtype.sample_placeholder([None], name="action_ph")
                    self.advs_ph = tf.placeholder(tf.float32, [None], name="advs_ph")
                    self.rewards_ph = tf.placeholder(tf.float32, [None], name="rewards_ph")
                    self.old_neglog_pac_ph = tf.placeholder(tf.float32, [None], name="old_neglog_pac_ph")
                    self.old_vpred_ph = tf.placeholder(tf.float32, [None], name="old_vpred_ph")
                    self.learning_rate_ph = tf.placeholder(tf.float32, [], name="learning_rate_ph")
                    self.clip_range_ph = tf.placeholder(tf.float32, [], name="clip_range_ph")

                    neglogpac = train_model.proba_distribution.neglogp(self.action_ph)
                    self.entropy = tf.reduce_mean(train_model.proba_distribution.entropy())

                    vpred = train_model.value_flat

                    # Value function clipping: not present in the original PPO
                    if self.cliprange_vf is None:
                        # Default behavior (legacy from OpenAI baselines):
                        # use the same clipping as for the policy
                        self.clip_range_vf_ph = self.clip_range_ph
                        self.cliprange_vf = self.cliprange
                    elif isinstance(self.cliprange_vf, (float, int)) and self.cliprange_vf < 0:
                        # Original PPO implementation: no value function clipping
                        self.clip_range_vf_ph = None
                    else:
                        # Last possible behavior: clipping range
                        # specific to the value function
                        self.clip_range_vf_ph = tf.placeholder(tf.float32, [], name="clip_range_vf_ph")

                    if self.clip_range_vf_ph is None:
                        # No clipping
                        vpred_clipped = train_model.value_flat
                    else:
                        # Clip the different between old and new value
                        # NOTE: this depends on the reward scaling
                        vpred_clipped = self.old_vpred_ph + \
                            tf.clip_by_value(train_model.value_flat - self.old_vpred_ph,
                                             - self.clip_range_vf_ph, self.clip_range_vf_ph)

                    vf_losses1 = tf.square(vpred - self.rewards_ph)
                    vf_losses2 = tf.square(vpred_clipped - self.rewards_ph)
                    self.vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))

                    ratio = tf.exp(self.old_neglog_pac_ph - neglogpac)
                    pg_losses = -self.advs_ph * ratio
                    pg_losses2 = -self.advs_ph * tf.clip_by_value(ratio, 1.0 - self.clip_range_ph, 1.0 +
                                                                  self.clip_range_ph)
                    self.pg_loss = tf.reduce_mean(tf.maximum(pg_losses, pg_losses2))
                    self.approxkl = .5 * tf.reduce_mean(tf.square(neglogpac - self.old_neglog_pac_ph))
                    self.clipfrac = tf.reduce_mean(tf.cast(tf.greater(tf.abs(ratio - 1.0),
                                                                      self.clip_range_ph), tf.float32))
                    loss = self.pg_loss - self.entropy * self.ent_coef + self.vf_loss * self.vf_coef

                    tf.summary.scalar('entropy_loss', self.entropy)
                    """
                    print("------------------------------")
                    print(" self.entropy ")
                    print(self.entropy)
                    """

                    tf.summary.scalar('policy_gradient_loss', self.pg_loss)
                    tf.summary.scalar('value_function_loss', self.vf_loss)
                    tf.summary.scalar('approximate_kullback-leibler', self.approxkl)
                    tf.summary.scalar('clip_factor', self.clipfrac)
                    tf.summary.scalar('loss', loss)

                    with tf.variable_scope('model'):
                        self.params = tf.trainable_variables()
                        if self.full_tensorboard_log:
                            for var in self.params:
                                tf.summary.histogram(var.name, var)
                    grads = tf.gradients(loss, self.params)
                    if self.max_grad_norm is not None:
                        grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
                    grads = list(zip(grads, self.params))
                trainer = tf.train.AdamOptimizer(learning_rate=self.learning_rate_ph, epsilon=1e-5)
                self._train = trainer.apply_gradients(grads)

                self.loss_names = ['policy_loss', 'value_loss', 'policy_entropy', 'approxkl', 'clipfrac']

                with tf.variable_scope("input_info", reuse=False):
                    tf.summary.scalar('discounted_rewards', tf.reduce_mean(self.rewards_ph))
                    tf.summary.scalar('learning_rate', tf.reduce_mean(self.learning_rate_ph))
                    tf.summary.scalar('advantage', tf.reduce_mean(self.advs_ph))
                    tf.summary.scalar('clip_range', tf.reduce_mean(self.clip_range_ph))
                    if self.clip_range_vf_ph is not None:
                        tf.summary.scalar('clip_range_vf', tf.reduce_mean(self.clip_range_vf_ph))

                    tf.summary.scalar('old_neglog_action_probability', tf.reduce_mean(self.old_neglog_pac_ph))
                    tf.summary.scalar('old_value_pred', tf.reduce_mean(self.old_vpred_ph))

                    if self.full_tensorboard_log:
                        tf.summary.histogram('discounted_rewards', self.rewards_ph)
                        tf.summary.histogram('learning_rate', self.learning_rate_ph)
                        tf.summary.histogram('advantage', self.advs_ph)
                        tf.summary.histogram('clip_range', self.clip_range_ph)
                        tf.summary.histogram('old_neglog_action_probability', self.old_neglog_pac_ph)
                        tf.summary.histogram('old_value_pred', self.old_vpred_ph)
                        if tf_util.is_image(self.observation_space):
                            tf.summary.image('observation', train_model.obs_ph)
                        else:
                            tf.summary.histogram('observation', train_model.obs_ph)

                self.train_model = train_model
                self.act_model = act_model
                self.step = act_model.step
                self.proba_step = act_model.proba_step
                self.value = act_model.value
                self.initial_state = act_model.initial_state
                tf.global_variables_initializer().run(session=self.sess)  # pylint: disable=E1101

                self.summary = tf.summary.merge_all()

    def _train_step(self, learning_rate, cliprange, obs, returns, masks, actions, values, neglogpacs, update,
                    writer, states=None, cliprange_vf=None):
        """
        Training of PPO2 Algorithm

        :param learning_rate: (float) learning rate
        :param cliprange: (float) Clipping factor
        :param obs: (np.ndarray) The current observation of the environment
        :param returns: (np.ndarray) the rewards
        :param masks: (np.ndarray) The last masks for done episodes (used in recurent policies)
        :param actions: (np.ndarray) the actions
        :param values: (np.ndarray) the values
        :param neglogpacs: (np.ndarray) Negative Log-likelihood probability of Actions
        :param update: (int) the current step iteration
        :param writer: (TensorFlow Summary.writer) the writer for tensorboard
        :param states: (np.ndarray) For recurrent policies, the internal state of the recurrent model
        :return: policy gradient loss, value function loss, policy entropy,
                approximation of kl divergence, updated clipping range, training update operation
        :param cliprange_vf: (float) Clipping factor for the value function
        """
        advs = returns - values
        advs = (advs - advs.mean()) / (advs.std() + 1e-8)
        td_map = {self.train_model.obs_ph: obs, self.action_ph: actions,
                  self.advs_ph: advs, self.rewards_ph: returns,
                  self.learning_rate_ph: learning_rate, self.clip_range_ph: cliprange,
                  self.old_neglog_pac_ph: neglogpacs, self.old_vpred_ph: values}
        if states is not None:
            td_map[self.train_model.states_ph] = states
            td_map[self.train_model.dones_ph] = masks

        if cliprange_vf is not None and cliprange_vf >= 0:
            td_map[self.clip_range_vf_ph] = cliprange_vf

        if states is None:
            update_fac = max(self.n_batch // self.nminibatches // self.noptepochs, 1)
        else:
            update_fac = max(self.n_batch // self.nminibatches // self.noptepochs // self.n_steps, 1)

        if writer is not None:
            # run loss backprop with summary, but once every 10 runs save the metadata (memory, compute time, ...)
            if self.full_tensorboard_log and (1 + update) % 10 == 0:
                run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
                run_metadata = tf.RunMetadata()
                summary, policy_loss, value_loss, policy_entropy, approxkl, clipfrac, _ = self.sess.run(
                    [self.summary, self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train],
                    td_map, options=run_options, run_metadata=run_metadata)
                #print("update---->{}".format(update))
                writer.add_run_metadata(run_metadata, 'step%d' % (update * update_fac))
            else:
                summary, policy_loss, value_loss, policy_entropy, approxkl, clipfrac, _ = self.sess.run(
                    [self.summary, self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train],
                    td_map)
            writer.add_summary(summary, (update * update_fac))
            """
            print(">>>>>>>>>>>>>>>>>>>>>>>")
            print("update             : {}".format(update))
            print("update_fac         : {}".format(update_fac))
            print("update*update_fac  : {}".format(update * update_fac))
            print("self.entropy:      : {}".format(self.entropy))
            """
        else:
            policy_loss, value_loss, policy_entropy, approxkl, clipfrac, _ = self.sess.run(
                [self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train], td_map)

        return policy_loss, value_loss, policy_entropy, approxkl, clipfrac

    def learn(self, total_timesteps, callback=None, log_interval=1, tb_log_name="PPO2",
              reset_num_timesteps=True):
        # Transform to callable if needed
        self.learning_rate = get_schedule_fn(self.learning_rate)
        self.cliprange = get_schedule_fn(self.cliprange)
        cliprange_vf = get_schedule_fn(self.cliprange_vf)

        new_tb_log = self._init_num_timesteps(reset_num_timesteps)
        callback = self._init_callback(callback)

        with SetVerbosity(self.verbose), TensorboardWriter(self.graph, self.tensorboard_log, tb_log_name, new_tb_log) \
                as writer:
            self._setup_learn()

            t_first_start = time.time()
            #n_updates = total_timesteps // self.n_batch
            n_updates = total_timesteps

            callback.on_training_start(locals(), globals())

            """
            print("---------------------------------")
            print("n_updates + 1 : {}".format(n_updates + 1))
            """

            for update in tqdm(range(1, n_updates + 1)):
                assert self.n_batch % self.nminibatches == 0, ("The number of minibatches (`nminibatches`) "
                                                               "is not a factor of the total number of samples "
                                                               "collected per rollout (`n_batch`), "
                                                               "some samples won't be used."
                                                               )
                batch_size = self.n_batch // self.nminibatches
                t_start = time.time()
                frac = 1.0 - (update - 1.0) / n_updates
                lr_now = self.learning_rate(frac)
                cliprange_now = self.cliprange(frac)
                cliprange_vf_now = cliprange_vf(frac)

                callback.on_rollout_start()
                # true_reward is the reward without discount
                rollout = self.runner.run(callback)
                # Unpack
                obs, returns, masks, actions, values, neglogpacs, states, ep_infos, true_reward = rollout

                callback.on_rollout_end()

                # Early stopping due to the callback
                if not self.runner.continue_training:
                    break

                self.ep_info_buf.extend(ep_infos)
                mb_loss_vals = []
                if states is None:  # nonrecurrent version
                    update_fac = max(self.n_batch // self.nminibatches // self.noptepochs, 1)
                    inds = np.arange(self.n_batch)
                    for epoch_num in range(self.noptepochs):
                        np.random.shuffle(inds)
                        for start in range(0, self.n_batch, batch_size):
                            timestep = self.num_timesteps // update_fac + ((epoch_num *
                                                                            self.n_batch + start) // batch_size)
                            end = start + batch_size
                            mbinds = inds[start:end]
                            slices = (arr[mbinds] for arr in (obs, returns, masks, actions, values, neglogpacs))
                            mb_loss_vals.append(self._train_step(lr_now, cliprange_now, *slices, writer=writer,
                                                                 update=timestep, cliprange_vf=cliprange_vf_now))
                else:  # recurrent version
                    update_fac = max(self.n_batch // self.nminibatches // self.noptepochs // self.n_steps, 1)
                    assert self.n_envs % self.nminibatches == 0
                    env_indices = np.arange(self.n_envs)
                    flat_indices = np.arange(self.n_envs * self.n_steps).reshape(self.n_envs, self.n_steps)
                    envs_per_batch = batch_size // self.n_steps
                    for epoch_num in range(1):
                        np.random.shuffle(env_indices)
                        for start in range(0, self.n_envs, envs_per_batch):
                            timestep = self.num_timesteps // update_fac + ((epoch_num *
                                                                            self.n_envs + start) // envs_per_batch)

                            """
                            print(">>>> timestep          :{}".format(timestep))
                            print(">>>> self.num_timesteps:{}".format(self.num_timesteps))
                            print(">>>> update_fac        :{}".format(update_fac))
                            print(">>>> epoch_num         :{}".format(epoch_num))
                            print(">>>> self.n_envs       :{}".format(self.n_envs))
                            print(">>>> start             :{}".format(start))
                            print(">>>> envs_per_batch    :{}".format(envs_per_batch))
                            print(">>>> self.noptepochs   :{}".format(self.noptepochs))
                            print(">>>>>>>> self.num_timesteps // update_fac + ((epoch_num * self.n_envs + start) // envs_per_batch)")
                            """

                            end = start + envs_per_batch
                            mb_env_inds = env_indices[start:end]
                            mb_flat_inds = flat_indices[mb_env_inds].ravel()
                            slices = (arr[mb_flat_inds] for arr in (obs, returns, masks, actions, values, neglogpacs))
                            mb_states = states[mb_env_inds]
                            mb_loss_vals.append(self._train_step(lr_now, cliprange_now, *slices, update=timestep,
                                                                 writer=writer, states=mb_states,
                                                                 cliprange_vf=cliprange_vf_now))

                loss_vals = np.mean(mb_loss_vals, axis=0)
                t_now = time.time()
                fps = int(self.n_batch / (t_now - t_start))

                if writer is not None:

                    #print(">>>>>>> writer >>>>>>> ")

                    total_episode_reward_logger2(self.episode_reward,
                                                true_reward.reshape((self.n_envs, self.n_steps)),
                                                masks.reshape((self.n_envs, self.n_steps)),
                                                writer, self.num_timesteps, n_steps=self.n_steps, train_num=self.train_num)

                if self.verbose >= 1 and (update % log_interval == 0 or update == 1):
                    explained_var = explained_variance(values, returns)
                    logger.logkv("serial_timesteps", update * self.n_steps)
                    logger.logkv("n_updates", update)
                    logger.logkv("total_timesteps", self.num_timesteps)
                    logger.logkv("fps", fps)
                    logger.logkv("explained_variance", float(explained_var))
                    if len(self.ep_info_buf) > 0 and len(self.ep_info_buf[0]) > 0:
                        logger.logkv('ep_reward_mean', safe_mean([ep_info['r'] for ep_info in self.ep_info_buf]))
                        logger.logkv('ep_len_mean', safe_mean([ep_info['l'] for ep_info in self.ep_info_buf]))
                    logger.logkv('time_elapsed', t_start - t_first_start)
                    for (loss_val, loss_name) in zip(loss_vals, self.loss_names):
                        logger.logkv(loss_name, loss_val)
                    logger.dumpkvs()

            callback.on_training_end()
            return self

    def save(self, save_path, cloudpickle=False):
        data = {
            "gamma": self.gamma,
            "n_steps": self.n_steps,
            "vf_coef": self.vf_coef,
            "ent_coef": self.ent_coef,
            "max_grad_norm": self.max_grad_norm,
            "learning_rate": self.learning_rate,
            "lam": self.lam,
            "nminibatches": self.nminibatches,
            "noptepochs": self.noptepochs,
            "cliprange": self.cliprange,
            "cliprange_vf": self.cliprange_vf,
            "verbose": self.verbose,
            "policy": self.policy,
            "observation_space": self.observation_space,
            "action_space": self.action_space,
            "n_envs": self.n_envs,
            "n_cpu_tf_sess": self.n_cpu_tf_sess,
            "seed": self.seed,
            "_vectorize_action": self._vectorize_action,
            "policy_kwargs": self.policy_kwargs
        }

        params_to_save = self.get_parameters()

        self._save_to_file(save_path, data=data, params=params_to_save, cloudpickle=cloudpickle)

class Runner(AbstractEnvRunner):
    def __init__(self, *, env, model, n_steps, gamma, lam):
        """
        A runner to learn the policy of an environment for a model

        :param env: (Gym environment) The environment to learn from
        :param model: (Model) The model to learn
        :param n_steps: (int) The number of steps to run for each environment
        :param gamma: (float) Discount factor
        :param lam: (float) Factor for trade-off of bias vs variance for Generalized Advantage Estimator
        """
        super().__init__(env=env, model=model, n_steps=n_steps)
        self.lam = lam
        self.gamma = gamma

    def _run(self):
        """
        Run a learning step of the model

        :return:
            - observations: (np.ndarray) the observations
            - rewards: (np.ndarray) the rewards
            - masks: (numpy bool) whether an episode is over or not
            - actions: (np.ndarray) the actions
            - values: (np.ndarray) the value function output
            - negative log probabilities: (np.ndarray)
            - states: (np.ndarray) the internal states of the recurrent policies
            - infos: (dict) the extra information of the model
        """
        # mb stands for minibatch
        mb_obs, mb_rewards, mb_actions, mb_values, mb_dones, mb_neglogpacs = [], [], [], [], [], []
        mb_states = self.states
        ep_infos = []

        #print("-----------------------------")
        #print("self.model.num_timesteps : {}".format(self.model.num_timesteps))
        self.model.num_timesteps += self.n_envs       
        ##print("self.model.num_timesteps : {}".format(self.model.num_timesteps))
        #print("self.n_envs : {}".format(self.n_envs))

        #self.model.num_timesteps += 1
        #self.model.num_timesteps += 4

        for _ in range(self.n_steps):
            actions, values, self.states, neglogpacs = self.model.step(self.obs, self.states, self.dones)  # pytype: disable=attribute-error
            mb_obs.append(self.obs.copy())
            mb_actions.append(actions)
            mb_values.append(values)
            mb_neglogpacs.append(neglogpacs)
            mb_dones.append(self.dones)
            clipped_actions = actions
            # Clip the actions to avoid out of bound error
            if isinstance(self.env.action_space, gym.spaces.Box):
                clipped_actions = np.clip(actions, self.env.action_space.low, self.env.action_space.high)
            self.obs[:], rewards, self.dones, infos = self.env.step(clipped_actions)

            #print("**** num_timesteps before : {}".format(self.model.num_timesteps))
            #self.model.num_timesteps += self.n_envs
            #print("**** num_timesteps after  : {}".format(self.model.num_timesteps))

            """
            if self.callback is not None:
                # Abort training early
                self.callback.update_locals(locals())
                if self.callback.on_step() is False:
                    self.continue_training = False
                    # Return dummy values
                    return [None] * 9
            """

            for info in infos:
                maybe_ep_info = info.get('episode')
                if maybe_ep_info is not None:
                    ep_infos.append(maybe_ep_info)
            mb_rewards.append(rewards)
        # batch of steps to batch of rollouts

        if self.callback is not None:
            # Abort training early
            self.callback.update_locals(locals())
            if self.callback.on_step() is False:
                self.continue_training = False
                # Return dummy values
                return [None] * 9

        mb_obs = np.asarray(mb_obs, dtype=self.obs.dtype)
        mb_rewards = np.asarray(mb_rewards, dtype=np.float32)
        mb_actions = np.asarray(mb_actions)
        mb_values = np.asarray(mb_values, dtype=np.float32)
        mb_neglogpacs = np.asarray(mb_neglogpacs, dtype=np.float32)
        mb_dones = np.asarray(mb_dones, dtype=np.bool)
        last_values = self.model.value(self.obs, self.states, self.dones)  # pytype: disable=attribute-error
        # discount/bootstrap off value fn
        mb_advs = np.zeros_like(mb_rewards)
        true_reward = np.copy(mb_rewards)
        last_gae_lam = 0
        for step in reversed(range(self.n_steps)):
            if step == self.n_steps - 1:
                nextnonterminal = 1.0 - self.dones
                nextvalues = last_values
            else:
                nextnonterminal = 1.0 - mb_dones[step + 1]
                nextvalues = mb_values[step + 1]
            delta = mb_rewards[step] + self.gamma * nextvalues * nextnonterminal - mb_values[step]
            mb_advs[step] = last_gae_lam = delta + self.gamma * self.lam * nextnonterminal * last_gae_lam
        mb_returns = mb_advs + mb_values

        mb_obs, mb_returns, mb_dones, mb_actions, mb_values, mb_neglogpacs, true_reward = \
            map(swap_and_flatten, (mb_obs, mb_returns, mb_dones, mb_actions, mb_values, mb_neglogpacs, true_reward))

        return mb_obs, mb_returns, mb_dones, mb_actions, mb_values, mb_neglogpacs, mb_states, ep_infos, true_reward

# obs, returns, masks, actions, values, neglogpacs, states = runner.run()
def swap_and_flatten(arr):
    """
    swap and then flatten axes 0 and 1

    :param arr: (np.ndarray)
    :return: (np.ndarray)
    """
    shape = arr.shape
    return arr.swapaxes(0, 1).reshape(shape[0] * shape[1], *shape[2:])

Learn model

Load model

import glob

# ---------
# load save model
model_list = sorted(glob.glob("logs/*.zip"))
model_list

Extract latest param

if(len(model_list)>0):
  model_latest = model_list[-1].split("/")[-1].split(".zip")[0].split("_step")[0]

else:
  model_latest = None

model_latest

Define get latest model

def get_latest_model_param():
  # ---------
  # load save model
  model_list = sorted(glob.glob("logs/*.zip"))
  resume_FLAG = False

  if(len(model_list)>0):
    model_latest = model_list[-1].split("/")[-1].split(".zip")[0].split("_step")[0]
    resume_FLAG = True
    return model_latest.split("_")[0], int(model_latest.split("_")[1]), int(model_latest.split("_")[2]), resume_FLAG

  return None, None, None, resume_FLAG
get_latest_model_param()

Init param

# ====================================================
# Init 
#
if(model_latest):
  resume_FLAG = True
  model_name, resume_idx, train_num = model_latest.split("_")
else:
  resume_FLAG = False
  resume_idx    = 0
  train_num     = 1

mean_reward_list = []

print("train_num : {}".format(train_num))

Create Env & Model

# ====================================================
# Create Env & Model
#
# env = gym.make('LunarLander-v2')
#model = DQN('MlpPolicy', env, learning_rate=1e-3, prioritized_replay=True, verbose=0, tensorboard_log=log_dir, full_tensorboard_log=True)
model = PPO2('MlpLstmPolicy', env, verbose=1, policy_kwargs=policy_kwargs, nminibatches=env_num, tensorboard_log=log_dir, n_steps=n_steps, noptepochs=4, train_num=train_num)

Train loop

for i in tqdm(range(int(resume_idx),  5)):

  # ====================================================
  # callback
  #
  # -------
  # eval callback
  #
  eval_callback = EvalCallback2(env_marker_test2, best_model_save_path='./logs/best_model',
                              log_path='./logs/results', eval_freq=eval_freq, verbose=2, name_prefix='PPO2_{:09d}_'.format(i))
  # -------
  # checkpoint callback
  #
  checkpoint_callback = CheckpointCallback2(save_freq=save_freq, save_path='./logs/', name_prefix='PPO2_{:09d}'.format(i), verbose=2)

  # -------
  # merge callback
  #
  #callback = CallbackList([checkpoint_callback, eval_callback, ProgressBarCallback(total_timesteps)])
  callback = CallbackList([checkpoint_callback, eval_callback, ])

  _model_name, _resume_idx, _train_num, _resume_FLAG = get_latest_model_param()
  if(_resume_FLAG):
    # ====================================================
    # Model setting
    #
    model = PPO2.load("./logs/{}_{:09d}_{:09d}_steps".format(_model_name, int(_resume_idx), int(_train_num)))
    model.set_env(env)
    model.train_num = _train_num

    # -------
    # tensorboard
    #
    model.tensorboard_log = "logs"
    model.num_timesteps = _train_num

  #tb_log_name = "sampleE{}v".format(i)
  tb_log_name = "AutoLoopTrainTB"
  print("tb_log_name : {}".format(tb_log_name))

  model.learn(total_timesteps=total_timesteps , log_interval=100, tb_log_name=tb_log_name, reset_num_timesteps=False, callback=callback)

  # ====================================================
  # Save model
  #
  model.save("./logs/PPO2_{:09d}_{:09d}_steps".format(i, total_timesteps*(i+1)*env_num))

Save model

model.save("ppo2_forex-v0_lstm")

Load model

model = PPO2.load("ppo2_forex-v0_lstm")

Test trade

import numpy as np
from datetime import datetime
import seaborn as sns
#sns.set_theme(style="darkgrid")
#sns.set_theme(style="whitegrid", palette="RdYlBu")
palette = sns.color_palette("mako_r", 3)
sns.set_theme(style="whitegrid", palette=palette)

Define simulator

def simulator(action_num=1):
  df_info_list = []

  for i_env in tqdm(range(action_num)):

    env = env_marker_test()
    observation = env.reset()

    step=0
    total_reward = 0
    info_box = []

    while True:
        # observation = observation[np.newaxis, ...]
        observation = np.tile(observation, (env_num, 1, 1))

        # ----------------------------
        # observation
        #
        _act, _states = model.predict(observation)

        if(action_num == 1):
          action = 1 if np.mean(_act) >= 0.5 else 0
        else:
          action = _act[i_env]

        observation, reward, done, info = env.step(action)
        total_reward = total_reward + reward
        info_box.append([step, action, reward, total_reward])

        step = step + 1

        if done:
            print("info:", info)
            break

    df_info_box = pd.DataFrame(info_box)

    if(action_num == 1):
      df_info_box.columns = ["time_step", "action0", "reward0", "total_reward0"]
    else:
      df_info_box.columns = ["time_step", "action{}".format(i_env+1), "reward{}".format(i_env+1), "total_reward{}".format(i_env+1)]

    print("------------------")
    print(df_info_box)
    df_info_list.append(df_info_box)

  return pd.concat(df_info_list, axis=1)

Single action

df_info_box_single  = simulator(action_num=env_num)
df_info_box_single.head(5)

Mean action

df_info_box_mean    = simulator(action_num=1)
df_info_box_mean.head(5)

Merge action

df_info_box_merge = pd.concat([df_info_box_single, df_info_box_mean], axis=1)
df_info_box_merge.head(5)
df_info_box_merge.to_csv("{}.csv".format(tb_log_name))
df_reward_box = df_info_box_merge[["reward{}".format(i) for i in range(env_num+1)]]
df_reward_box.head(5)
df_total_reward_box = df_info_box_merge[["total_reward{}".format(i) for i in range(env_num+1)]]
df_total_reward_box.head(5)
df_action_box = df_info_box_merge[["action{}".format(i) for i in range(env_num+1)]]
df_action_box.head(5)
plt.figure(figsize=(30,5))
ax = sns.lineplot(data=df_total_reward_box)
ax.set_title("Total_reward")
plt.xlim(0, len(df_total_reward_box))
plt.figure(figsize=(30,5))
ax = sns.lineplot(x="time_step", y="total_reward0", data=df_info_box_mean, dashes=False)
ax = sns.scatterplot(data=df_info_box_mean, x="time_step", y="total_reward0", hue="action0", sizes=(10, 10), ax=ax)
ax.set_title("Total_reward")
plt.xlim(0, len(df_info_box_mean))

Analysis Using quantstats

Profile the portfolio and compute detailed analytics and risk metrics.

"""
print(env)
print(len(env.history['total_profit']))
print(len(df_dataset.index))
print(len(df_dataset.index[idx1:idx2]))
"""
"""
qs.extend_pandas()

net_worth = pd.Series(env.history['total_profit'], index=df_test.index[idx1+1:idx2])
returns = net_worth.pct_change().iloc[1:]

# qs.reports.full(returns)
qs.reports.html(returns, output='a2c_quantstats.html')
"""

Plot Results

"""
plt.figure(figsize=(30, 10))
env.render_all()
plt.show()
"""

Tensorboard

# Load the TensorBoard notebook extension
%load_ext tensorboard
#!kill 6654
%tensorboard --logdir logs/

Closing Thoughts

This installment added automatic retrieval of the latest model parameters, a prerequisite for long-term training.
For now, I'll keep the restarts manual and let the current code run for a while.
If that goes smoothly, I'd like to automate the restart step as well.
