Predicting the Stock Market | How to Avoid p-Hacking, and Why You Should Be Bullish

Original | QIML Editorial Team, 量化投资与机器学习 (Quantitative Investing and Machine Learning) | 2022-05-14

Author | Branko Blagojevic

Compiled by | 1+1=6

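We start by computing the S&P 500's performance over the past year along with its daily returns. Over the past two years, though, valuations have not been all that unstable.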

import datetime as dt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
# SPY.csv is expected to hold daily prices with "Date" and "Close" columns
stock = pd.read_csv("SPY.csv", index_col="Date")
# first half of the sample fits the parameters; the second half is what we "predict"
cutoff = len(stock)//2
prices = pd.Series(stock.Close)
log_prices = np.log(prices)
deltas = pd.Series(np.diff(prices), index=stock.index[1:])
log_deltas = pd.Series(np.diff(log_prices), index=stock.index[1:])
latest_prices = stock.Close[cutoff:]
latest_log_prices = np.log(latest_prices)
latest_log_deltas = log_deltas[cutoff:]
# number of daily returns needed to span the latest window
latest_size = len(latest_log_deltas)
prior_log_deltas = log_deltas[:cutoff]
prior_log_deltas_mean = np.mean(prior_log_deltas)
prior_log_deltas_std = np.std(prior_log_deltas)
# left: price history; right: histogram of daily price changes
f, axes = plt.subplots(ncols=2, figsize=(15,5))
prices.plot(ax=axes[0])
deltas.hist(bins=50, ax=axes[1])
f.autofmt_xdate()
f.tight_layout()

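Some people have tried using neural networks, recurrent neural networks in particular, to predict market returns. Because a recurrent network takes history into account, it is useful for time-series data. But that seems like overkill. A neural network is needlessly complex. Let's see whether we can fit a simpler model using random numbers!

The model: a random number generator

The predict function below creates a set of random, normally distributed daily returns from the historical standard deviation and mean return.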

def predict(mean, std, size, seed=None):
    """ Returns a normal distribution based on given mean, standard deviation and size"""
    np.random.seed(seed)
    return np.random.normal(loc=mean, scale=std, size=size)
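
A quick usage sketch, repeating the predict definition so that it runs on its own. The mean and standard deviation below are made-up placeholders, not estimates from SPY.csv. It illustrates that the seed alone determines the entire "forecast":

```python
import numpy as np

def predict(mean, std, size, seed=None):
    """Draw `size` normally distributed returns; same definition as above."""
    np.random.seed(seed)
    return np.random.normal(loc=mean, scale=std, size=size)

# Placeholder parameters, assumed for illustration only.
mean, std = 0.0004, 0.01

a = predict(mean, std, size=5, seed=0)
b = predict(mean, std, size=5, seed=0)
c = predict(mean, std, size=5, seed=1)

print(np.array_equal(a, b))  # same seed gives the same "forecast"
print(np.array_equal(a, c))  # a different seed gives a different one
```

Since nothing but the seed changes between runs, choosing a seed is the only "training" this model can undergo.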

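The apply_returns function simply applies our returns to a starting price to produce a predicted price series. Because we work with log prices and log returns, applying a return is plain addition.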

def apply_returns(start, returns):
    """ Applies given periodic returns """
    cur = start
    prices = [start]
    for r in returns:
        cur += r
        prices.append(cur)
    return prices
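
As a small worked example (the start price and the three returns are invented for illustration), the sketch below applies log returns to a log start price, converts the result back to prices, and checks it against an equivalent cumulative-sum formulation:

```python
import numpy as np

def apply_returns(start, returns):
    """Same definition as above: add each return to a running level."""
    cur = start
    prices = [start]
    for r in returns:
        cur += r
        prices.append(cur)
    return prices

# Made-up inputs: a start price of 100 and three daily log returns.
start_price = 100.0
log_returns = np.array([0.01, -0.02, 0.005])

log_path = apply_returns(np.log(start_price), log_returns)
price_path = np.exp(log_path)  # back from log prices to prices

# Equivalent vectorized form: start level plus the cumulative sum of returns.
vectorized = np.log(start_price) + np.concatenate([[0.0], np.cumsum(log_returns)])

print(price_path)
print(np.allclose(log_path, vectorized))
```

Note that the output has one more element than the input, because the start value is included as the first point of the path.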

Finally, we want to score our predictions. There are several scores we could use, but we will use mean squared error (MSE).

def score(actual, prediction):
    # mean square error
    return np.mean(np.square(actual - prediction))
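
Written out, with $y_t$ the actual log price and $\hat{y}_t$ the predicted log price on day $t$ over $T$ observations (symbols introduced here for illustration, not taken from the original article), the score function computes:

```latex
\mathrm{MSE} = \frac{1}{T} \sum_{t=1}^{T} \left( y_t - \hat{y}_t \right)^2
```

A lower value means the predicted path stays closer to the actual path, and a value of zero would mean the two coincide exactly.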

It is always useful to visualize a prediction against the actual outcome. That is what compare does.

def compare(prediction, actual):
    # plots a prediction over the actual series
    plt.plot(prediction, label="prediction")
    plt.plot(actual, label="actual")
    plt.legend()

Let's see how seed=0 turns out.

predict_deltas = predict(prior_log_deltas_mean, prior_log_deltas_std, latest_size, seed = 0)
start = latest_log_prices.iloc[0]  # positional access; the series is indexed by date
prediction = apply_returns(start, predict_deltas)
print("MSE: {:0.08f}".format(score(latest_log_prices, prediction)))
compare(prediction=prediction, actual=latest_log_prices.values)
MSE: 0.00797138

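Not great, but it is a start. Our model predicted the growth earlier in the year, but it clearly overshot. This is where we want to optimize the model's seed value so that it predicts the market better.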

# bind the fitted mean/std and the window size so that only the seed varies
predict_partial = lambda s: predict(mean = prior_log_deltas_mean, std = prior_log_deltas_std, size = latest_size, seed = s)
def find_best_seed(actual, predict_partial, score, start_seed, end_seed):
    """ Tries every seed in [start_seed, end_seed) and returns the one with the lowest score """
    best_so_far = None
    best_score = float("inf")
    start = actual.iloc[0]  # positional access; the series is indexed by date
    for s in range(start_seed, end_seed):
        # progress indicator, overwritten in place
        print('\r{} / {}'.format(s, end_seed), end="")
        predict_deltas = predict_partial(s)
        predict_prices = apply_returns(start, predict_deltas)
        predict_score = score(actual, predict_prices)
        if predict_score < best_score:
            best_score = predict_score
            best_so_far = s
    return best_so_far, best_score
best_seed, best_score = find_best_seed(latest_log_prices, predict_partial, score, start_seed=0, end_seed=500000)
print("best seed: {} best MSE: {:0.08f}".format(best_seed, best_score))
best seed: 68105 best MSE: 0.00035640

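After 500k trials, we brought the MSE down from 0.00797 to 0.0003564. That is a big improvement.

The historical fit is nice, but we want to see how the magic seed performs over the coming months.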

returns = predict(mean=prior_log_deltas_mean, std=prior_log_deltas_std, size=400, seed=best_seed)
prediction = apply_returns(start, returns)
compare(prediction, latest_log_prices.values)
compare(prediction, log_prices.values)

According to our model, the market should start to recover by the end of the year and go on to reach new highs.

p-Hacking

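Here is a short primer from this account on what p-hacking is:

P values come up constantly in statistics, and P ≤ 0.05 is conventionally taken to be statistically significant. Many statisticians no longer see it that way, however, and have sharply criticized the abuse and misuse of P values. Hence a new term: P-hacking.

The term P-hacking appears to have first been proposed by Simmons and his team at the University of Pennsylvania:

Read literally, P-hacking means "P-value hacker", but 科研动力 takes its actual meaning to be "P-value tampering" or "P-value manipulation". It may be the first statistics term to be included in the online Urban Dictionary: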

Exploiting - perhaps unconsciously - researcher degrees of freedom until p<.05.

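Going by the dictionary's definition, P-hacking is when researchers keep trying different statistical analyses until p<.05, sometimes without realizing it. The Urban Dictionary also gives example sentences: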

That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05.
She is a p-hacker, she always monitors data while it is being collected.

Simmons et al. have also been credited with a definition of P-hacking:

P-hacking refers to the practice of reanalyzing data in many different ways to yield a target result. They, and more recently Motulsky, have described the variations on P-hacking, and the hazards, notably the likelihood of false positives—findings that statistics suggest are meaningful when they are not.

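There is something of a reproducibility crisis, with scientists unable to replicate a number of key experiments:

In August, the journal Science published the results of an initiative called the Reproducibility Project, a collaborative effort coordinated by the nonprofit Center for Open Science. Participants attempted to replicate 100 experimental and correlational psychology studies that had been published in three prominent psychology journals.

The results, widely reported in the media, were sobering. Only 39 of the studies were replicated successfully.

My guess is that much of this comes from researchers running a large number of trials before publishing a statistically significant result, or changing various parameters along the way. This is known as the garden of forking paths, and it is not always deliberate.

Paper link: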

http://101.96.10.63/www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf

Researcher degrees of freedom can lead to a multiple comparisons problem, even in settings where researchers perform only a single analysis on their data. The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values

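Economic and financial forecasting is very susceptible to these biases. I often hear economists say that the N-month lag of factor X is an indicator of Y. Why the N-month lag? Well, lags 1 through N-1 did not work out.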
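
The lag-hunting problem can be sketched with pure noise. In the example below, all names, the sample length (120 months), and the number of candidate lags (24) are assumptions chosen for illustration. x and y are independent by construction, yet picking the best of 24 lags often yields a correlation that would look significant had that lag been the only one tested:

```python
import numpy as np

rng = np.random.default_rng(42)

n_months = 120   # assumed sample length: 10 years of monthly data
max_lag = 24     # candidate lags of 1..24 months

# Rough 5% two-sided threshold for a single, pre-specified correlation test.
threshold = 1.96 / np.sqrt(n_months)

def lag_correlation(x, y, lag):
    """Correlation between y[t] and x[t - lag]; x carries max_lag extra leading points."""
    start = max_lag - lag
    return np.corrcoef(x[start:start + len(y)], y)[0, 1]

def best_lag_correlation(x, y):
    """Search all candidate lags and return (lag, correlation) with the largest |correlation|."""
    corrs = {lag: lag_correlation(x, y, lag) for lag in range(1, max_lag + 1)}
    best = max(corrs, key=lambda k: abs(corrs[k]))
    return best, corrs[best]

# One draw: two independent noise series, so x has no predictive power for y.
x = rng.normal(size=n_months + max_lag)
y = rng.normal(size=n_months)
best_lag, best_corr = best_lag_correlation(x, y)
print("best lag: {} months, correlation: {:+.3f}".format(best_lag, best_corr))
print("single-test 5% threshold: +/-{:.3f}".format(threshold))

# Repeat the search many times to see how often noise alone clears the threshold.
n_trials = 500
hits = 0
for _ in range(n_trials):
    xs = rng.normal(size=n_months + max_lag)
    ys = rng.normal(size=n_months)
    hits += int(abs(best_lag_correlation(xs, ys)[1]) > threshold)
share = hits / n_trials
print("share of noise-only searches with a 'significant' lag: {:.2f}".format(share))
```

With this many candidate lags, the share of searches that turn up a "significant" lag should come out far above the nominal 5%, which is the multiple comparisons problem described above.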

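Avoiding p-Hacking

A good way to avoid p-hacking in your own research is to be honest with yourself from the start. Think carefully about everything you want to test, and write it down. If you want to test 20 different factors, specify them before you begin testing, and account for all 20 when you evaluate your metrics.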
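
As a rough illustration of what accounting for all 20 factors can mean, the sketch below assumes independent tests at the conventional 0.05 level and uses the Bonferroni correction, which is one common adjustment and not something the original article prescribes:

```python
alpha = 0.05      # conventional per-test significance level
n_factors = 20    # number of factors, fixed before looking at any results

# Chance of at least one false positive if all 20 factors are pure noise
# and the tests are independent.
familywise_error = 1 - (1 - alpha) ** n_factors

# Bonferroni correction: each individual test must clear alpha / n_factors.
bonferroni_alpha = alpha / n_factors

print("P(at least one false positive): {:.3f}".format(familywise_error))  # about 0.64
print("Bonferroni per-test threshold: {:.4f}".format(bonferroni_alpha))   # 0.0025
```

In other words, under these assumptions an uncorrected sweep over 20 useless factors is more likely than not to produce at least one "significant" result.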

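But most importantly, think about what your model is actually doing. Neural networks are sometimes considered black boxes, and in a sense they are, but you should review each step critically. If you are doing image recognition, look at each layer's activations and get a rough idea of what that layer is responding to. If you are doing reinforcement learning to play a game, see whether you can roughly understand how the logic works. If you are doing natural language processing, look at the word vectors for synonyms, antonyms, and related words.

If you are doing stock market analysis, ask yourself what you really expect from the model. Why would the Nth lag of some factor be a predictor? Why would past returns affect future returns? Why are you considering only the most recent N returns? Why predict at a daily, hourly, or minute frequency? Why are you looking at the period from X to Y? Why validate up to Z?

Feeding stock deltas into a recurrent neural network may well reduce your loss, but without an explanation for why it works, you might as well be fitting those values to a random number generator.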
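
One way to check a claim like this is to hold data back. The sketch below is an assumption-laden stand-in, not the article's experiment: it uses a synthetic random-walk "market" with made-up parameters, searches only 2,000 seeds, and uses hypothetical helper names. It picks the seed with the lowest in-sample MSE and then scores every seed on a holdout window that restarts from the true price level:

```python
import numpy as np

# Made-up parameters for a synthetic market; not estimated from SPY data.
mean, std = 0.0004, 0.01
n_fit, n_holdout = 125, 125

def returns_for(seed, size):
    """Seeded normal log returns, analogous to predict above."""
    return np.random.RandomState(seed).normal(loc=mean, scale=std, size=size)

def path_from(start, returns):
    """Log-price path, analogous to apply_returns above."""
    return start + np.concatenate([[0.0], np.cumsum(returns)])

def mse(actual, prediction):
    return np.mean(np.square(actual - prediction))

# The "actual" market uses a seed outside the searched range.
actual = path_from(0.0, returns_for(seed=123456, size=n_fit + n_holdout))

n_seeds = 2000
in_sample = np.empty(n_seeds)
out_of_sample = np.empty(n_seeds)
for s in range(n_seeds):
    r = returns_for(s, n_fit + n_holdout)
    # Fit window: path starts at the first actual value.
    in_sample[s] = mse(actual[:n_fit + 1], path_from(actual[0], r[:n_fit]))
    # Holdout window: path restarts at the true level where the fit window ends.
    out_of_sample[s] = mse(actual[n_fit:], path_from(actual[n_fit], r[n_fit:]))

best = int(np.argmin(in_sample))
print("best in-sample seed:", best)
print("in-sample MSE: best {:.6f}, median {:.6f}".format(in_sample[best], np.median(in_sample)))
print("holdout MSE:   best {:.6f}, median {:.6f}".format(out_of_sample[best], np.median(out_of_sample)))
```

In this setup the holdout returns are independent of the returns that won the search, so the winning seed's holdout score is just another draw from the same distribution as every other seed's. The in-sample improvement says nothing about what comes next.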

Source:

https://medium.com/ml-everything/predicting-the-stock-market-p-hacking-and-why-you-should-be-bullish-90fddc583838
