30,000 Rows of Data: Analyzing Kobe Bryant's Twenty-Year Career with Python | 原力计划

Original 高羊羊羊羊羊杨 CSDN 2020-10-16

Author | 高羊羊羊羊羊杨

Source | CSDN Blog

Header image | Licensed download from Visual China Group

Produced by | CSDN (ID: CSDNnews)

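Not long ago, Lakers franchise star Kobe Bryant was killed in an accident. For countless fans the news was a bolt from the blue. He defied the odds yet could not change fate in the end, but fate never managed to change him either, and the Mamba Mentality will live on. With the arrival of the big-data era, it seems almost anything can be linked to the words "big data". Big-data analysis has long been widely used in athlete career planning, healthcare, finance and other fields. In this article we use Python to analyze Kobe from multiple dimensions, as a tribute to "老大" ("the Boss")!

Background

It was the early hours of January 27, 2020, and I could not sleep. I tossed and turned until 4 a.m., then unlocked my phone and squinted at the glaring screen, planning to scroll through Weibo. Instead I came across shocking news: Kobe had died in an accident. Normally I would have reported the post on the spot, since clickbait like that deserves it, but more and more similar posts kept appearing, until I saw the official announcement.

As I wrote in my caption at the time: I have never seen Los Angeles at four in the morning, but at four in the morning I heard the news of your passing. 1978-2020.

As fans, all we can do is mourn and remember: spread no rumors, and do not cash in on the "Mamba Mentality".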

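Data acquisition

Source: the official NBA data set covering Kobe Bryant's roughly twenty-year career (a fairly large data set, about 30,000 rows).

Data processing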

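Browsing the file, it is easy to see that it contains many missing values. The quick and dirty fix is to drop every row that contains a missing value, but that comes at the cost of sample completeness and prediction accuracy.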
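As a rough illustration of that trade-off, the sketch below counts missing values and compares dropping every incomplete row with dropping rows only when one chosen column is missing. The toy DataFrame and its values are made up for the example; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the shot log; the values are invented for illustration.
toy = pd.DataFrame({
    'shot_distance': [15, 22, np.nan, 3],
    'shot_made_flag': [1.0, np.nan, 0.0, 1.0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()

# Crude approach: drop any row with a missing value anywhere
dropped_any = toy.dropna()

# Gentler approach: drop a row only when the chosen column is missing
dropped_label_only = toy.dropna(subset=['shot_made_flag'])

print(missing_per_column)
print(len(toy), len(dropped_any), len(dropped_label_only))
```

Restricting the drop to the column you actually need keeps more rows, which is the motivation for not deleting everything wholesale.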

First we run a simple outlier check on the shot distance, presented as a box plot:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

shot_file = '2.csv'
data = pd.read_csv(shot_file, index_col='shot_id')  # read the data, using "shot_id" as the index column

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

plt.figure()  # create the figure
# Draw the box plot with the DataFrame method, restricted to the shot-distance column
p = data[['shot_distance']].boxplot(return_type='dict')
x = p['fliers'][0].get_xdata()  # 'fliers' holds the outlier points
y = np.sort(p['fliers'][0].get_ydata())  # sorted copy, smallest first
print('{} records in total, {} of them outliers'.format(len(data), len(y)))

# Label each outlier with annotate.
# Labels of nearby points overlap and become unreadable, so shift them sideways.
for i in range(len(x)):
    gap = y[i] - y[i - 1] if i > 0 else 0
    # Tied values (gap == 0) get the default offset to avoid dividing by zero
    offset = 0.05 - 0.8 / gap if gap > 0 else 0.08
    plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + offset, y[i]))

plt.show()  # display the box plot

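We get the following result:

By this criterion, the column contains 68 outliers. Here we delete the rows that contain them, and the other columns are handled the same way.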
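The deletion step is not shown in the article. A minimal sketch of one way to do it, assuming the usual box-plot convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles (Matplotlib's default whisker setting), might look like this. The function name `iqr_outlier_mask` and the sample distances are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Return a boolean mask that is True where a value lies outside the box-plot fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)

# Made-up shot distances in feet; 70 plays the role of a half-court heave
distances = np.array([0, 2, 5, 8, 12, 15, 18, 20, 24, 26, 70])
mask = iqr_outlier_mask(distances)
kept = distances[~mask]

print(mask.sum(), kept)
```

With a DataFrame, the same mask can be used to filter rows, for example `data[~iqr_outlier_mask(data['shot_distance'])]`.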

Data integration

Load the data, then filter it and add the new columns we need:

import pandas as pd


allData = pd.read_csv('data.csv')
# Keep only the shots whose outcome is known
data = allData[allData['shot_made_flag'].notnull()].reset_index()

# Add new columns
data['game_date_DT'] = pd.to_datetime(data['game_date'])
data['dayOfWeek'] = data['game_date_DT'].dt.dayofweek
data['dayOfYear'] = data['game_date_DT'].dt.dayofyear
data['secondsFromPeriodEnd'] = 60 * data['minutes_remaining'] + data['seconds_remaining']
data['secondsFromPeriodStart'] = 60 * (11 - data['minutes_remaining']) + (60 - data['seconds_remaining'])
data['secondsFromGameStart'] = (data['period'] <= 4).astype(int) * (data['period'] - 1) * 12 * 60 + (
        data['period'] > 4).astype(int) * ((data['period'] - 4) * 5 * 60 + 3 * 12 * 60) + data['secondsFromPeriodStart']

'''
Where:
secondsFromPeriodEnd    seconds left until the end of the period
secondsFromPeriodStart  seconds since the start of the period (computed as if every period lasted
                        12 minutes; the overtime offset in secondsFromGameStart compensates for this)
secondsFromGameStart    seconds elapsed since the start of the game
'''

# Sanity-check the new columns
print(data.loc[:10, ['period', 'minutes_remaining', 'seconds_remaining', 'secondsFromGameStart']])

Running it gives the following result:

Everything looks normal.
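To go one step beyond eyeballing the output, the time conversion can be checked on single shots. The sketch below restates the same arithmetic as a plain function, assuming 12-minute quarters and 5-minute overtime periods, and compares it with the expression used in the article. Both function names are made up for this example.

```python
def seconds_from_game_start(period, minutes_remaining, seconds_remaining):
    """Seconds elapsed since tip-off, assuming 12-minute quarters and 5-minute overtimes."""
    if period <= 4:
        period_length = 12 * 60
        period_offset = (period - 1) * 12 * 60
    else:
        period_length = 5 * 60
        period_offset = 4 * 12 * 60 + (period - 5) * 5 * 60
    elapsed_in_period = period_length - (60 * minutes_remaining + seconds_remaining)
    return period_offset + elapsed_in_period


def article_formula(period, minutes_remaining, seconds_remaining):
    """The article's expression, written for a single shot instead of whole columns."""
    from_period_start = 60 * (11 - minutes_remaining) + (60 - seconds_remaining)
    if period <= 4:
        return (period - 1) * 12 * 60 + from_period_start
    return (period - 4) * 5 * 60 + 3 * 12 * 60 + from_period_start


# A shot with 10:27 left in the first quarter happens 93 seconds into the game
print(seconds_from_game_start(1, 10, 27))
# The start of the first overtime is the end of regulation: 48 * 60 seconds
print(seconds_from_game_start(5, 5, 0))
```

The two functions agree for regulation and overtime, which confirms that the `11 -` term in `secondsFromPeriodStart` is offset by the `3 * 12 * 60` term for overtime periods.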

Plotting shot attempts

We plot the number of shot attempts as a function of time from the start of the game.

This uses the matplotlib package:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


plt.rcParams['figure.figsize'] = (16, 16)
plt.rcParams['font.size'] = 16
binsSizes = [24, 12, 6]  # bin widths in seconds
plt.figure()

for k, binSizeInSeconds in enumerate(binsSizes):
    # Bin edges covering four quarters plus up to three overtime periods
    timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
    attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)

    maxHeight = max(attemptsAsFunctionOfTime) + 30
    barWidth = 0.999 * (timeBins[1] - timeBins[0])
    plt.subplot(len(binsSizes), 1, k + 1)
    plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
    plt.title(str(binSizeInSeconds) + ' second time bins')
    # Red vertical lines mark the start and end of each period
    plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
                  4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
    plt.xlim((-20, 3200))
    plt.ylim((0, maxHeight))
    plt.ylabel('attempts')
plt.xlabel('time [seconds from start of game]')
plt.show()

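Here is the result:

The plot shows that Kobe's shot attempts increase as the game goes on.

Plotting shooting accuracy

Here we put attempts and accuracy side by side to see how accurate Kobe's shooting is: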

# Plot shooting accuracy as a function of time during the game
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 16

binSizeInSeconds = 20
timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)
madeAttemptsAsFunctionOfTime, b = np.histogram(data.loc[data['shot_made_flag'] == 1, 'secondsFromGameStart'],
                                               bins=timeBins)
attemptsAsFunctionOfTime[attemptsAsFunctionOfTime < 1] = 1  # avoid division by zero in empty bins
accuracyAsFunctionOfTime = madeAttemptsAsFunctionOfTime.astype(float) / attemptsAsFunctionOfTime
accuracyAsFunctionOfTime[attemptsAsFunctionOfTime <= 50] = 0  # zero accuracy in bins that don't have enough samples

maxHeight = max(attemptsAsFunctionOfTime) + 30
barWidth = 0.999 * (timeBins[1] - timeBins[0])

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
plt.ylim((0, maxHeight))

# y-axis of the upper plot: number of attempts
plt.ylabel('attempts')
plt.title(str(binSizeInSeconds) + ' second time bins')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], accuracyAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
# y-axis of the lower plot: accuracy
plt.ylabel('accuracy')
plt.xlabel('time [seconds from start of game]')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0.0, ymax=0.7, colors='r')
plt.show()

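Let's see how it looks:

The analysis shows Kobe's shooting accuracy hovering around 0.4, but this is not yet the insight we are after.

To dig further into the data, we need to bring in some algorithms.

GMM clustering

So what is GMM clustering?

GMM is short for Gaussian mixture model. The rough idea is that a distribution can be viewed as the combined result of several Gaussian distributions, so a complicated distribution can be approximated by a mixture of Gaussians.

We know that many natural phenomena follow a Gaussian (normal) distribution. In practice, though, a distribution is shaped by several factors, some of them man-made. Each factor may contribute its own Gaussian, and the combination of factors yields a mixture of several Gaussians. (This is my personal understanding.)

Hence the principle behind GMM clustering: from the samples, estimate the mean and variance of each of the K Gaussians (together with its mixing weight), which fully determines the K components. During clustering a sample is not hard-assigned to a single class; instead, the model computes the probability that the sample belongs to each component.

A Gaussian mixture model is usually fitted with the EM algorithm, which carries out the maximum-likelihood estimation.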
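The "probability that a sample belongs to each component" is what the E-step of EM computes. The sketch below works it out by hand for a one-dimensional mixture with two components whose parameters are simply assumed (they are not fitted, and they have nothing to do with the shot data). The function names are made up for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def responsibilities(x, weights, means, variances):
    """E-step of EM: posterior probability of each component for each sample."""
    x = np.asarray(x, dtype=float)[:, None]                   # shape (n_samples, 1)
    weighted = weights * gaussian_pdf(x, means, variances)    # shape (n_samples, K)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Assumed mixture: two equally weighted components centred at 0 and 10
weights = np.array([0.5, 0.5])
means = np.array([0.0, 10.0])
variances = np.array([4.0, 4.0])

resp = responsibilities([0.0, 5.0, 10.0], weights, means, variances)
print(resp.round(3))
```

Each row sums to 1. The sample halfway between the two means is split evenly, while the samples sitting on a mean are assigned almost entirely to that component. Taking the most probable component per row gives a hard cluster label.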

from sklearn import mixture

'''
Now let's continue our initial exploration and look at the spatial location of Kobe's shots.
We do this by building a Gaussian mixture model that tries to give a compact summary of his shot locations.
Cluster the shot attempts by location using a GMM.
'''

numGaussians = 13
gaussianMixtureModel = mixture.GaussianMixture(n_components=numGaussians, covariance_type='full',
                                               init_params='kmeans', n_init=50,
                                               verbose=0, random_state=5)
gaussianMixtureModel.fit(data.loc[:, ['loc_x', 'loc_y']])

# Add the GMM cluster label to the data set as a new column
data['shotLocationCluster'] = gaussianMixtureModel.predict(data.loc[:, ['loc_x', 'loc_y']])

Court visualization

Here we borrow the draw_court() function from MichaelKrueger's excellent script.

The draw_court() function

from matplotlib.patches import Circle, Rectangle, Arc


def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    # If no axes object is supplied to plot onto, use the current one
    if ax is None:
        ax = plt.gca()

    # Create the parts of an NBA court
    # The hoop: its diameter is 18 inches, so its radius is 9 inches,
    # which is 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # The backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # The paint
    # Outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # Free-throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)

    # Free-throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')

    # Restricted area: an arc with a 4ft radius around the center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three-point line
    # The straight corner three-point lines, 14ft long
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)

    # The three-point arc is centered on the hoop, 23'9" away from it
    # The theta values were adjusted until the arc lined up with the corner lines
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # Center court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be drawn onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half-court line, baseline and sidelines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements to the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax

2D Gaussian plot

Next we define a function that draws the 2D Gaussians.

Draw2DGaussians()

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches  # ensures mpl.patches is loaded


def Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages):
    fig, h = plt.subplots()
    for i, (mean, covarianceMatrix) in enumerate(zip(gaussianMixtureModel.means_, gaussianMixtureModel.covariances_)):
        # Eigenvalues (v) and eigenvectors (columns of w) of the covariance matrix
        v, w = np.linalg.eigh(covarianceMatrix)
        v = 2.5 * np.sqrt(v)  # go to units of standard deviation instead of variance

        # Compute the ellipse angle and its two axis lengths, then draw it
        u = w[:, 0] / np.linalg.norm(w[:, 0])  # eigh stores eigenvectors as columns
        angle = np.degrees(np.arctan2(u[1], u[0]))  # angle of the first axis, in degrees
        # angle is passed by keyword, which recent Matplotlib versions require
        currEllipse = mpl.patches.Ellipse(mean, v[0], v[1], angle=180 + angle, color=ellipseColors[i])
        currEllipse.set_alpha(0.5)
        h.add_artist(currEllipse)
        h.text(mean[0] + 7, mean[1] - 1, ellipseTextMessages[i], fontsize=13, color='blue')
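The geometry inside Draw2DGaussians() can be checked on a covariance matrix whose answer is known in advance. The helper below is a hypothetical stand-alone version of that step: it turns a 2x2 covariance matrix into the two axis lengths and the rotation angle of an ellipse. It follows the NumPy convention that `np.linalg.eigh` returns eigenvalues in ascending order and stores the matching eigenvectors as columns.

```python
import numpy as np

def ellipse_from_covariance(cov, n_std=2.5):
    """Return (width, height, angle_in_degrees) of an ellipse for a 2x2 covariance matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order
    width, height = n_std * np.sqrt(eigenvalues)      # axis lengths in units of standard deviation
    first_axis = eigenvectors[:, 0]                   # eigenvectors are stored as columns
    angle = np.degrees(np.arctan2(first_axis[1], first_axis[0]))
    return width, height, angle

# Axis-aligned covariance: variance 9 along x, variance 1 along y
cov = np.array([[9.0, 0.0], [0.0, 1.0]])
width, height, angle = ellipse_from_covariance(cov)
print(width, height, angle)
```

The smaller eigenvalue (variance 1) belongs to the y direction, so the short axis comes first and is rotated by 90 degrees (the sign depends on the eigenvector's orientation, which does not matter for an ellipse), while the long axis measures 2.5 * 3 = 7.5.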

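Now let's draw the 2D Gaussian map of shot attempts. Each ellipse in the figure is drawn from one Gaussian's covariance, with axes 2.5 standard deviations long, and each blue number is the percentage of shots attributed to that Gaussian.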

# Show the Gaussian mixture ellipses for shot attempts
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

# Mixture weight of each component, as a percentage label
ellipseTextMessages = [str(100 * gaussianMixtureModel.weights_[x])[:4] + '%' for x in range(numGaussians)]
ellipseColors = ['red', 'green', 'purple', 'cyan', 'magenta', 'yellow', 'blue', 'orange', 'silver', 'maroon', 'lime',
                 'olive', 'brown', 'darkblue']
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot attempts')
plt.show()

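Here is the result:

In the colored 2D Gaussian map we can see that Kobe took more shots from the left side of the court (his right, from his point of view), possibly because he is right-handed. We can also see that a large share of his attempts (16.8%) came from directly under the basket, and another 5.06% from very close to it.

It does not look perfect, but it does show something useful.

Next we plot the shooting accuracy of each Gaussian cluster. The blue numbers now show the accuracy of the shots taken from that cluster, which tells us which shots are easy and which are hard.

For each cluster, compute its accuracy and plot it: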

plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

variableCategories = data['shotLocationCluster'].value_counts().index.tolist()

# Accuracy per cluster: made shots divided by attempted shots
clusterAccuracy = {}
for category in variableCategories:
    shotsAttempted = np.array(data['shotLocationCluster'] == category).sum()
    shotsMade = np.array(data.loc[data['shotLocationCluster'] == category, 'shot_made_flag'] == 1).sum()
    clusterAccuracy[category] = float(shotsMade) / shotsAttempted

ellipseTextMessages = [str(100 * clusterAccuracy[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot accuracy')
plt.show()

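Here is the resulting figure:

We can clearly see the relationship between shot distance and accuracy.

2D space-time plot

Another interesting fact: Kobe not only took more shots from the right side (from his point of view), he was also more accurate on those shots.

Now let's plot a 2D space-time map of Kobe's career. The x-axis is the time from the start of the game; the y-axis is the cluster index of the shot (sorted by cluster accuracy); the color intensity reflects how many shots Kobe attempted from that cluster at that time; the red vertical lines separate the periods of the game.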
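The counting behind such a map is a two-dimensional histogram: every shot adds one to the cell given by its cluster row and its time-bin column. A minimal sketch with made-up indices, using `np.add.at` as one possible vectorised way to do the accumulation, could look like this.

```python
import numpy as np

# Made-up indices for six shots: row = cluster rank, column = time bin
cluster_rank = np.array([0, 0, 1, 2, 2, 2])
time_bin = np.array([0, 0, 1, 1, 3, 3])

counts = np.zeros((cluster_rank.max() + 1, time_bin.max() + 1))
# np.add.at accumulates correctly even when the same cell appears several times
np.add.at(counts, (cluster_rank, time_bin), 1)

print(counts)
```

A plain `counts[cluster_rank, time_bin] += 1` would count repeated cells only once, which is why an explicit loop or `np.add.at` is needed.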

import operator

# Draw the 2D space-time histogram of Kobe's shots over his entire career
plt.rcParams['figure.figsize'] = (18, 10)  # figure size
plt.rcParams['font.size'] = 18  # font size


# Sort the clusters by their accuracy
sortedClustersByAccuracyTuple = sorted(clusterAccuracy.items(), key=operator.itemgetter(1), reverse=True)
sortedClustersByAccuracy = [x[0] for x in sortedClustersByAccuracyTuple]

binSizeInSeconds = 12
timeInUnitsOfBins = ((data['secondsFromGameStart'] + 0.0001) / binSizeInSeconds).astype(int)
locationInUintsOfClusters = np.array(
    [sortedClustersByAccuracy.index(data.loc[x, 'shotLocationCluster']) for x in range(data.shape[0])])


# Build the space-time histogram of Kobe's shots
shotAttempts = np.zeros((gaussianMixtureModel.n_components, 1 + max(timeInUnitsOfBins)))
for shot in range(data.shape[0]):
    shotAttempts[locationInUintsOfClusters[shot], timeInUnitsOfBins[shot]] += 1


# Stretch the y-axis (repeat each row 5 times) so the rows are easier to see
shotAttempts = np.kron(shotAttempts, np.ones((5, 1)))

# Positions where each period ends
vlinesList = 0.5001 + np.array([0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60]).astype(
    int) / binSizeInSeconds

plt.figure(figsize=(13, 8))  # width and height
plt.imshow(shotAttempts, cmap='copper', interpolation="nearest")  # "nearest" keeps cell edges sharp
plt.xlim(0, float(4 * 12 * 60 + 6 * 60) / binSizeInSeconds)
plt.vlines(x=vlinesList, ymin=-0.5, ymax=shotAttempts.shape[0] - 0.5, colors='r')
plt.xlabel('time from start of game [sec]')
plt.ylabel('cluster (sorted by accuracy)')
plt.show()

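Here is the output:

The clusters are sorted by accuracy in descending order: high-accuracy shots are at the top and the low-accuracy half-court shots are at the bottom. We can now see that the "last-second shots" in the first, second and third periods are actually desperation heaves from very far away. Interestingly, in the fourth period the last-second shots do not belong to that desperation cluster but are regular three-pointers (still hard to make, but not hopeless).

In the analysis that follows, we estimate shot difficulty from shot attributes (such as shot type and shot distance).

First we build a new table for the shot difficulty model: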

def FactorizeCategoricalVariable(inputDB, categoricalVarName):
    # One 0/1 indicator column per category of the given variable
    opponentCategories = inputDB[categoricalVarName].value_counts().index.tolist()

    outputDB = pd.DataFrame()
    for category in opponentCategories:
        featureName = categoricalVarName + ': ' + str(category)
        outputDB[featureName] = (inputDB[categoricalVarName] == category).astype(int)

    return outputDB


featuresDB = pd.DataFrame()
# Away games are marked with '@' in the matchup string
featuresDB['homeGame'] = data['matchup'].apply(lambda x: 1 if (x.find('@') < 0) else 0)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'opponent')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'action_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'combined_shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_basic')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_area')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_range')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shotLocationCluster')], axis=1)

featuresDB['playoffGame'] = data['playoffs']
featuresDB['locX'] = data['loc_x']
featuresDB['locY'] = data['loc_y']
featuresDB['distanceFromBasket'] = data['shot_distance']
featuresDB['secondsFromPeriodEnd'] = data['secondsFromPeriodEnd']

# Encode the cyclic time features as points on a circle
featuresDB['dayOfWeek_cycX'] = np.sin(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['dayOfWeek_cycY'] = np.cos(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['timeOfYear_cycX'] = np.sin(2 * np.pi * (data['dayOfYear'] / 365))
featuresDB['timeOfYear_cycY'] = np.cos(2 * np.pi * (data['dayOfYear'] / 365))

labelsDB = data['shot_made_flag']
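The sine/cosine pairs for `dayOfWeek` and `dayOfYear` deserve a word of explanation. A plausible reading is that they place each day on a circle, so that the last day of a cycle ends up next to the first one instead of at the opposite end of a number line. The sketch below illustrates this for the day of the week (pandas numbers the days 0 to 6); `cyclic_encode` is a name made up for the example.

```python
import numpy as np

def cyclic_encode(value, period):
    """Map a periodic value to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)

days = np.arange(7)                      # 0 .. 6
x, y = cyclic_encode(days, 7)
points = np.column_stack([x, y])

# Distance between day 6 and day 0, compared with day 3 and day 0
dist_6_to_0 = np.linalg.norm(points[6] - points[0])
dist_3_to_0 = np.linalg.norm(points[3] - points[0])
print(dist_6_to_0, dist_3_to_0)
```

With the raw integers, day 6 would be six units away from day 0. On the circle it is exactly as close as day 1, which is what a model should see for adjacent days.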

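Build a model from the featuresDB table and make sure it does not overfit (that is, the training error is about the same as the validation error).

We use an ExtraTrees classifier.

Build a simple model and make sure it does not overfit: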

import time

from sklearn import ensemble, model_selection
from sklearn.metrics import accuracy_score as accuracy, log_loss

randomSeed = 1
numFolds = 4

stratifiedCV = model_selection.StratifiedKFold(n_splits=numFolds, shuffle=True, random_state=randomSeed)

mainLearner = ensemble.ExtraTreesClassifier(n_estimators=500, max_depth=5,
                                            min_samples_leaf=120, max_features=120,
                                            criterion='entropy', bootstrap=False,
                                            n_jobs=-1, random_state=randomSeed)

startTime = time.time()
trainAccuracy = []
validAccuracy = []
trainLogLosses = []
validLogLosses = []
for trainInds, validInds in stratifiedCV.split(featuresDB, labelsDB):
    # Split into training and validation sets
    X_train_CV = featuresDB.iloc[trainInds, :]
    y_train_CV = labelsDB.iloc[trainInds]
    X_valid_CV = featuresDB.iloc[validInds, :]
    y_valid_CV = labelsDB.iloc[validInds]

    # Train
    mainLearner.fit(X_train_CV, y_train_CV)

    # Predict the probability of a made shot
    y_train_hat_mainLearner = mainLearner.predict_proba(X_train_CV)[:, 1]
    y_valid_hat_mainLearner = mainLearner.predict_proba(X_valid_CV)[:, 1]

    # Store the results
    trainAccuracy.append(accuracy(y_train_CV, y_train_hat_mainLearner > 0.5))
    validAccuracy.append(accuracy(y_valid_CV, y_valid_hat_mainLearner > 0.5))
    trainLogLosses.append(log_loss(y_train_CV, y_train_hat_mainLearner))
    validLogLosses.append(log_loss(y_valid_CV, y_valid_hat_mainLearner))

print("-----------------------------------------------------")
print("total (train,valid) Accuracy = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainAccuracy), np.mean(validAccuracy), (time.time() - startTime) / 60))
print("total (train,valid) Log Loss = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainLogLosses), np.mean(validLogLosses), (time.time() - startTime) / 60))
print("-----------------------------------------------------")

# Refit on all the data and store the predicted probability of making each shot
mainLearner.fit(featuresDB, labelsDB)
data['shotDifficulty'] = mainLearner.predict_proba(featuresDB)[:, 1]

# To get more insight, look at the feature importances
featureInds = mainLearner.feature_importances_.argsort()[::-1]
featureImportance = pd.DataFrame({
    'featureName': featuresDB.columns[featureInds],
    'importanceET': mainLearner.feature_importances_[featureInds]})

print(featureImportance.iloc[:30, :])

Let's see the output:

total (train,valid) Accuracy = (0.67912,0.67860). took 0.29 minutes
total (train,valid) Log Loss = (0.60812,0.61100). took 0.29 minutes
-----------------------------------------------------
                         featureName importanceET
0             action_type: Jump Shot     0.578036
1            action_type: Layup Shot     0.173274
2           combined_shot_type: Dunk     0.113341
3                           homeGame    0.0288043
4             action_type: Dunk Shot    0.0161591
5             shotLocationCluster: 9    0.0136386
6          combined_shot_type: Layup   0.00949568
7                 distanceFromBasket    0.0084703
8         shot_zone_range: 16-24 ft.    0.0072107
9        action_type: Slam Dunk Shot   0.00690316
10     combined_shot_type: Jump Shot   0.00592586
11              secondsFromPeriodEnd   0.00589391
12    action_type: Running Jump Shot   0.00544904
13           shotLocationCluster: 11   0.00449125
14                              locY   0.00388509
15   action_type: Driving Layup Shot   0.00364757
16  shot_zone_range: Less Than 8 ft.   0.00349615
17      combined_shot_type: Tip Shot   0.00260399
18         shot_zone_area: Center(C)    0.0011585
19                     opponent: DEN  0.000882106
20    action_type: Driving Dunk Shot  0.000848156
21  shot_zone_basic: Restricted Area  0.000650022
22            shotLocationCluster: 2  0.000513476
23             action_type: Tip Shot  0.000489918
24        shot_zone_basic: Mid-Range  0.000487306
25     action_type: Pullup Jump shot  0.000453641
26         shot_zone_range: 8-16 ft.  0.000452574
27                   timeOfYear_cycX  0.000432267
28                    dayOfWeek_cycX   0.00039668
29            shotLocationCluster: 8  0.000254077

Process finished with exit code 0
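To put the two reported metrics in context, here is a small hand-rolled version of both on four made-up predictions. Accuracy only looks at which side of 0.5 a prediction falls on, while log loss also penalises how confident a wrong prediction was. The function names are hypothetical; the formulas are the standard ones for binary classification.

```python
import numpy as np

def accuracy_at_half(y_true, p_hat):
    """Share of samples where thresholding the probability at 0.5 gives the true label."""
    return np.mean((np.asarray(p_hat) > 0.5) == (np.asarray(y_true) == 1))

def binary_log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up labels and predicted probabilities of a made shot
y_true = np.array([1, 0, 1, 0])
p_hat = np.array([0.8, 0.3, 0.4, 0.6])

print(accuracy_at_half(y_true, p_hat))
print(binary_log_loss(y_true, p_hat))
print(binary_log_loss(y_true, np.full(4, 0.5)))   # an uninformed 50/50 guess
```

An uninformed 50/50 guess always scores ln 2, about 0.693, so the validation log loss of roughly 0.611 reported above is an improvement over that reference, though a modest one.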

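Here I want to look at some aspects of Kobe Bryant's decision making. To do that, we collect two different groups of shots and analyze the differences between them:

Shots taken right after a made shot
Shots taken right after a missed shot

I collected some data conditioned on whether Kobe made or missed his previous shot: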

timeBetweenShotsDict = {}
timeBetweenShotsDict['madeLast'] = []
timeBetweenShotsDict['missedLast'] = []

changeInDistFromBasketDict = {}
changeInDistFromBasketDict['madeLast'] = []
changeInDistFromBasketDict['missedLast'] = []

changeInShotDifficultyDict = {}
changeInShotDifficultyDict['madeLast'] = []
changeInShotDifficultyDict['missedLast'] = []

afterMadeShotsList = []
afterMissedShotsList = []

for shot in range(1, data.shape[0]):

    # Make sure the current shot and the previous shot are from the same period of the same game
    sameGame = data.loc[shot, 'game_date'] == data.loc[shot - 1, 'game_date']
    samePeriod = data.loc[shot, 'period'] == data.loc[shot - 1, 'period']

    if samePeriod and sameGame:
        madeLastShot = data.loc[shot - 1, 'shot_made_flag'] == 1
        missedLastShot = data.loc[shot - 1, 'shot_made_flag'] == 0

        timeDifferenceFromLastShot = data.loc[shot, 'secondsFromGameStart'] - data.loc[shot - 1, 'secondsFromGameStart']
        distDifferenceFromLastShot = data.loc[shot, 'shot_distance'] - data.loc[shot - 1, 'shot_distance']
        shotDifficultyDifferenceFromLastShot = data.loc[shot, 'shotDifficulty'] - data.loc[shot - 1, 'shotDifficulty']

        # Check for corrupt data points (assuming all samples should be in chronological order)
        if timeDifferenceFromLastShot < 0:
            continue

        if madeLastShot:
            timeBetweenShotsDict['madeLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['madeLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['madeLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMadeShotsList.append(shot)

        if missedLastShot:
            timeBetweenShotsDict['missedLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['missedLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['missedLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMissedShotsList.append(shot)

# Shots that followed a miss, and shots that followed a make
afterMissedData = data.iloc[afterMissedShotsList, :]
afterMadeData = data.iloc[afterMadeShotsList, :]

shotChancesListAfterMade = afterMadeData['shotDifficulty'].tolist()
totalAttemptsAfterMade = afterMadeData.shape[0]
totalMadeAfterMade = np.array(afterMadeData['shot_made_flag'] == 1).sum()

shotChancesListAfterMissed = afterMissedData['shotDifficulty'].tolist()
totalAttemptsAfterMissed = afterMissedData.shape[0]
totalMadeAfterMissed = np.array(afterMissedData['shot_made_flag'] == 1).sum()
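The row-by-row loop is easy to follow but slow on tens of thousands of shots. As an alternative sketch (not the article's method), pandas can look up the previous shot within each game and period with `groupby(...).shift(1)`. This assumes the rows are already in chronological order, and it omits the article's check for negative time differences. The toy data below is invented.

```python
import pandas as pd

# Made-up shot log: two games, outcomes in chronological order
toy = pd.DataFrame({
    'game_date': ['2000-01-01'] * 3 + ['2000-01-03'] * 2,
    'period': [1, 1, 2, 1, 1],
    'shot_made_flag': [1, 0, 1, 0, 1],
})

# Outcome of the previous shot in the same game and period
# (NaN for the first shot of each group, which therefore drops out below)
toy['prev_made'] = toy.groupby(['game_date', 'period'])['shot_made_flag'].shift(1)

after_made = toy[toy['prev_made'] == 1]
after_missed = toy[toy['prev_made'] == 0]

print(after_made.index.tolist(), after_missed.index.tolist())
```

Only the second shot of each game-period group has a predecessor here, so one row lands in each subset.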

Histograms

Plot histograms of the "time since last shot" for both groups:

plt.rcParams['figure.figsize'] = (13, 10)

# Common bin edges computed from both groups together
jointHist, timeBins = np.histogram(timeBetweenShotsDict['madeLast'] + timeBetweenShotsDict['missedLast'], bins=200)
barWidth = 0.999 * (timeBins[1] - timeBins[0])

timeDiffHist_GivenMadeLastShot, b = np.histogram(timeBetweenShotsDict['madeLast'], bins=timeBins)
timeDiffHist_GivenMissedLastShot, b = np.histogram(timeBetweenShotsDict['missedLast'], bins=timeBins)
maxHeight = max(max(timeDiffHist_GivenMadeLastShot), max(timeDiffHist_GivenMissedLastShot)) + 30

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], timeDiffHist_GivenMadeLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('made last shot')
plt.ylabel('counts')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], timeDiffHist_GivenMissedLastShot, width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('missed last shot')
plt.xlabel('time since last shot')
plt.ylabel('counts')
plt.show()

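Here is the output:

The plots suggest that after making a shot Kobe is somewhat eager to take the next one. The quiet stretches in the plots are probably periods when the other team has the ball and it takes a while to get it back.

Cumulative histograms

To visualize the difference between the histograms better, let's look at the cumulative histograms.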
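A cumulative histogram is just the running sum of the bin counts, divided by the total so that it ends at 1. The toy example below shows the idea with made-up gaps between shots and an arbitrary bin width of 50 seconds.

```python
import numpy as np

# Made-up gaps between consecutive shots, in seconds
gaps = np.array([5, 12, 20, 20, 45, 90, 130, 300])
bins = np.arange(0, 301, 50)                 # 50-second bins (an arbitrary choice)

counts, edges = np.histogram(gaps, bins=bins)
cdf = np.cumsum(counts) / counts.sum()       # empirical cumulative distribution per bin

print(counts)
print(cdf)
```

Because each curve is normalised by its own total, two groups of different sizes can be compared directly on the same axes.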

plt.rcParams['figure.figsize'] = (13, 6)

# Running sum of each histogram, normalised so that it ends at 1
timeDiffCumHist_GivenMadeLastShot = np.cumsum(timeDiffHist_GivenMadeLastShot).astype(float)
timeDiffCumHist_GivenMadeLastShot = timeDiffCumHist_GivenMadeLastShot / max(timeDiffCumHist_GivenMadeLastShot)
timeDiffCumHist_GivenMissedLastShot = np.cumsum(timeDiffHist_GivenMissedLastShot).astype(float)
timeDiffCumHist_GivenMissedLastShot = timeDiffCumHist_GivenMissedLastShot / max(timeDiffCumHist_GivenMissedLastShot)

maxHeight = max(timeDiffCumHist_GivenMadeLastShot[-1], timeDiffCumHist_GivenMissedLastShot[-1])

plt.figure()
madePrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMadeLastShot, label='made Prev')
plt.xlim((0, 500))
missedPrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMissedLastShot, label='missed Prev')
plt.xlim((0, 500))
plt.ylim((0, 1))
plt.title('cumulative density function - CDF')
plt.xlabel('time since last shot')
plt.legend(loc='lower right')
plt.show()

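The result looks like this:

A difference between the two curves is observable but not very clear, so let's go back to the Gaussian cluster map to display the data: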

# Show where shots are taken after a made shot and after a missed shot
plt.rcParams['figure.figsize'] = (13, 10)

variableCategories = afterMadeData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMadeData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMadeData.shape[0]

# .get(x, 0.0) covers clusters that have no shots in this subset
ellipseTextMessages = [str(100 * clusterFrequency.get(x, 0.0))[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after made shots')

variableCategories = afterMissedData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMissedData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMissedData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency.get(x, 0.0))[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after missed shots')
plt.show()