Analysing NFL Big Data Bowl and predicting Yards Gained.

  • Image courtesy of medium.com
In [ ]:
import numpy as np # linear algebra
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
In [ ]:
import pandas_profiling

In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Reference Image

  • For column details visit here
In [3]:
# low_memory=False reads the file in a single pass, which avoids the mixed-type DtypeWarning raised for column 47
nfl = pd.read_csv('/content/drive/My Drive/CSV files/nfl_train.csv', low_memory=False)
nfl.head()
Out[3]:
GameId PlayId Team X Y S A Dis Orientation Dir NflId DisplayName JerseyNumber Season YardLine Quarter GameClock PossessionTeam Down Distance FieldPosition HomeScoreBeforePlay VisitorScoreBeforePlay NflIdRusher OffenseFormation OffensePersonnel DefendersInTheBox DefensePersonnel PlayDirection TimeHandoff TimeSnap Yards PlayerHeight PlayerWeight PlayerBirthDate PlayerCollegeName Position HomeTeamAbbr VisitorTeamAbbr Week Stadium Location StadiumType Turf GameWeather Temperature Humidity WindSpeed WindDirection
0 2017090700 20170907000118 away 73.91 34.84 1.69 1.13 0.40 81.99 177.18 496723 Eric Berry 29 2017 35 1 14:14:00 NE 3 2 NE 0 0 2543773 SHOTGUN 1 RB, 1 TE, 3 WR 6.0 2 DL, 3 LB, 6 DB left 2017-09-08T00:44:06.000Z 2017-09-08T00:44:05.000Z 8 6-0 212 12/29/1988 Tennessee SS NE KC 1 Gillette Stadium Foxborough, MA Outdoor Field Turf Clear and warm 63.0 77.0 8 SW
1 2017090700 20170907000118 away 74.67 32.64 0.42 1.35 0.01 27.61 198.70 2495116 Allen Bailey 97 2017 35 1 14:14:00 NE 3 2 NE 0 0 2543773 SHOTGUN 1 RB, 1 TE, 3 WR 6.0 2 DL, 3 LB, 6 DB left 2017-09-08T00:44:06.000Z 2017-09-08T00:44:05.000Z 8 6-3 288 03/25/1989 Miami DE NE KC 1 Gillette Stadium Foxborough, MA Outdoor Field Turf Clear and warm 63.0 77.0 8 SW
2 2017090700 20170907000118 away 74.00 33.20 1.22 0.59 0.31 3.01 202.73 2495493 Justin Houston 50 2017 35 1 14:14:00 NE 3 2 NE 0 0 2543773 SHOTGUN 1 RB, 1 TE, 3 WR 6.0 2 DL, 3 LB, 6 DB left 2017-09-08T00:44:06.000Z 2017-09-08T00:44:05.000Z 8 6-3 270 01/21/1989 Georgia DE NE KC 1 Gillette Stadium Foxborough, MA Outdoor Field Turf Clear and warm 63.0 77.0 8 SW
3 2017090700 20170907000118 away 71.46 27.70 0.42 0.54 0.02 359.77 105.64 2506353 Derrick Johnson 56 2017 35 1 14:14:00 NE 3 2 NE 0 0 2543773 SHOTGUN 1 RB, 1 TE, 3 WR 6.0 2 DL, 3 LB, 6 DB left 2017-09-08T00:44:06.000Z 2017-09-08T00:44:05.000Z 8 6-3 245 11/22/1982 Texas ILB NE KC 1 Gillette Stadium Foxborough, MA Outdoor Field Turf Clear and warm 63.0 77.0 8 SW
4 2017090700 20170907000118 away 69.32 35.42 1.82 2.43 0.16 12.63 164.31 2530794 Ron Parker 38 2017 35 1 14:14:00 NE 3 2 NE 0 0 2543773 SHOTGUN 1 RB, 1 TE, 3 WR 6.0 2 DL, 3 LB, 6 DB left 2017-09-08T00:44:06.000Z 2017-09-08T00:44:05.000Z 8 6-0 206 08/17/1987 Newberry FS NE KC 1 Gillette Stadium Foxborough, MA Outdoor Field Turf Clear and warm 63.0 77.0 8 SW
In [ ]:
profile=pandas_profiling.ProfileReport(nfl)
profile.to_file('nfl_overview.html')
In [ ]:
nfl.shape
Out[ ]:
(509762, 49)
In [ ]:
nfl.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 509762 entries, 0 to 509761
Data columns (total 49 columns):
GameId                    509762 non-null int64
PlayId                    509762 non-null int64
Team                      509762 non-null object
X                         509762 non-null float64
Y                         509762 non-null float64
S                         509762 non-null float64
A                         509762 non-null float64
Dis                       509762 non-null float64
Orientation               509744 non-null float64
Dir                       509748 non-null float64
NflId                     509762 non-null int64
DisplayName               509762 non-null object
JerseyNumber              509762 non-null int64
Season                    509762 non-null int64
YardLine                  509762 non-null int64
Quarter                   509762 non-null int64
GameClock                 509762 non-null object
PossessionTeam            509762 non-null object
Down                      509762 non-null int64
Distance                  509762 non-null int64
FieldPosition             503338 non-null object
HomeScoreBeforePlay       509762 non-null int64
VisitorScoreBeforePlay    509762 non-null int64
NflIdRusher               509762 non-null int64
OffenseFormation          509652 non-null object
OffensePersonnel          509762 non-null object
DefendersInTheBox         509696 non-null float64
DefensePersonnel          509762 non-null object
PlayDirection             509762 non-null object
TimeHandoff               509762 non-null object
TimeSnap                  509762 non-null object
Yards                     509762 non-null int64
PlayerHeight              509762 non-null object
PlayerWeight              509762 non-null int64
PlayerBirthDate           509762 non-null object
PlayerCollegeName         509762 non-null object
Position                  509762 non-null object
HomeTeamAbbr              509762 non-null object
VisitorTeamAbbr           509762 non-null object
Week                      509762 non-null int64
Stadium                   509762 non-null object
Location                  509762 non-null object
StadiumType               476828 non-null object
Turf                      509762 non-null object
GameWeather               466114 non-null object
Temperature               461230 non-null float64
Humidity                  503602 non-null float64
WindSpeed                 442332 non-null object
WindDirection             429528 non-null object
dtypes: float64(10), int64(15), object(24)
memory usage: 190.6+ MB
In [ ]:
nfl.columns
Out[ ]:
Index(['GameId', 'PlayId', 'Team', 'X', 'Y', 'S', 'A', 'Dis', 'Orientation',
       'Dir', 'NflId', 'DisplayName', 'JerseyNumber', 'Season', 'YardLine',
       'Quarter', 'GameClock', 'PossessionTeam', 'Down', 'Distance',
       'FieldPosition', 'HomeScoreBeforePlay', 'VisitorScoreBeforePlay',
       'NflIdRusher', 'OffenseFormation', 'OffensePersonnel',
       'DefendersInTheBox', 'DefensePersonnel', 'PlayDirection', 'TimeHandoff',
       'TimeSnap', 'Yards', 'PlayerHeight', 'PlayerWeight', 'PlayerBirthDate',
       'PlayerCollegeName', 'Position', 'HomeTeamAbbr', 'VisitorTeamAbbr',
       'Week', 'Stadium', 'Location', 'StadiumType', 'Turf', 'GameWeather',
       'Temperature', 'Humidity', 'WindSpeed', 'WindDirection'],
      dtype='object')
In [ ]:
plt.figure(1,figsize=(8,6))
sns.countplot(x='Team',data=nfl)
plt.title('Count plot of home and away teams:')
plt.show()
In [ ]:
sns.set_style('darkgrid')
In [ ]:
plt.figure(1,figsize=(12,6))
sns.barplot(y='DisplayName',x='S',data=nfl.sort_values(by='S',ascending=False)[:6])
plt.title('Top 6 fastest observed player speeds')
plt.xlim(8.5,9.5)
plt.xlabel('Speed')
plt.ylabel('Player name')
plt.show()
In [ ]:
plt.figure(1,figsize=(16,10))
sns.barplot(y='DisplayName',x='A',data=nfl.sort_values(by='A',ascending=False)[:15])
plt.title('Top 15 highest observed player accelerations')
plt.xlim(8,15)
plt.xlabel('Acceleration')
plt.ylabel('Player name')
plt.show()

Player with highest weight.

In [ ]:
nfl.loc[nfl.PlayerWeight == nfl.PlayerWeight.max(), ['DisplayName', 'PlayerWeight']].head(1)
Out[ ]:
DisplayName PlayerWeight
10888 Trent Brown 380
In [ ]:
plt.figure(figsize=(16,8))
plot = sns.countplot(y ="PlayerCollegeName",data=nfl,order=nfl['PlayerCollegeName'].value_counts().iloc[:10].index, palette = "Set1")
plt.xlim(7500,16500)
plt.title('Top 10 colleges by number of player rows')
plt.show()
In [ ]:
plt.figure(figsize=(20,12))
plot = sns.countplot(y ="Stadium",data=nfl,order=nfl['Stadium'].value_counts().iloc[:20].index, palette = "Set1")
plt.xlim(13200,25000)
plt.title('Top 20 stadiums by number of rows recorded')
plt.show()
In [ ]:
nfl.Stadium.unique()
Out[ ]:
array(['Gillette Stadium', 'New Era Field', 'Soldier Field',
       'Paul Brown Stadium', 'FirstEnergy', 'Ford Field', 'NRG Stadium',
       'Nissan Stadium', 'FedExField', 'Los Angeles Memorial Coliseum',
       'Lambeau Field', 'Levis Stadium', 'AT&T Stadium',
       'U.S. Bank Stadium', 'Sports Authority Field at Mile High',
       'M&T Bank Stadium', 'Bank of America Stadium', 'Lucas Oil Stadium',
       'Everbank Field', 'Arrowhead Stadium', 'Mercedes-Benz Superdome',
       'Heinz Field', 'Raymond James Stadium', 'StubHub Center',
       'Oakland-Alameda County Coliseum', 'CenturyLink Field',
       'Mercedes-Benz Dome', 'MetLife Stadium', 'Wembley Stadium',
       'Lincoln Financial Field', 'University of Phoenix Stadium',
       'Mercedes-Benz Stadium', 'M&T Stadium', 'First Energy Stadium',
       'NRG', 'MetLife', 'CenturyLink', 'FirstEnergy Stadium',
       'Hard Rock Stadium', 'EverBank Field', 'Twickenham',
       'Twickenham Stadium', 'Estadio Azteca', 'M & T Bank Stadium',
       'Oakland Alameda-County Coliseum', 'State Farm Stadium',
       'Broncos Stadium At Mile High', 'Los Angeles Memorial Coliesum',
       'Broncos Stadium at Mile High', 'TIAA Bank Field', 'CenturyField',
       'FirstEnergyStadium', 'Paul Brown Stdium', 'Lambeau field',
       'Metlife Stadium'], dtype=object)
In [ ]:
plt.figure(figsize=(20,12))
plot = sns.countplot(y ="Location",data=nfl,order=nfl['Location'].value_counts().iloc[:20].index, palette = "Set3")
plt.xlim(12200,20000)
plt.title('Top 20 cities by number of rows recorded')
plt.show()

Types of stadiums where games were played

In [ ]:
nfl[nfl.StadiumType=='Outdoor'].shape
Out[ ]:
(267696, 49)
In [ ]:
plt.figure(figsize=(20,12))
plot = sns.countplot(y ='StadiumType',data=nfl,order=nfl['StadiumType'].value_counts().iloc[:20].index, palette = "Set3")
plt.xlim(0,20000)
plt.title('Top 20 stadium types by number of rows recorded')
plt.show()

Distribution of temperature when games played on different Turf.

In [ ]:
plt.figure(figsize=(20,10))
sns.violinplot(y='Temperature',x='Turf',data=nfl[nfl.Turf.isin(['Natural Grass','Artificial','Field Turf','UBU Sports Speed S5-M'])])

plt.title('Distribution of temperature when games played on different Turf.')
plt.show()
In [ ]:
nfl.Turf.unique()
Out[ ]:
array(['Field Turf', 'A-Turf Titan', 'Grass', 'UBU Sports Speed S5-M',
       'Artificial', 'DD GrassMaster', 'Natural Grass',
       'UBU Speed Series-S5-M', 'FieldTurf', 'FieldTurf 360',
       'Natural grass', 'grass', 'Natural', 'Artifical', 'FieldTurf360',
       'Naturall Grass', 'Field turf', 'SISGrass',
       'Twenty-Four/Seven Turf', 'natural grass'], dtype=object)

Distribution of temperature across stadium types when games were played.

In [ ]:
plt.figure(figsize=(20,8))
sns.violinplot(y='Temperature',x='StadiumType',data=nfl[nfl.StadiumType.isin(['Outdoors','Indoors','Dome','Retractable Roof','Open'])])

plt.title('Distribution of temperature when games played on different Stadium Types.')
plt.show()
In [ ]:
nfl.StadiumType.unique()
Out[ ]:
array(['Outdoor', 'Outdoors', 'Indoors', 'Retractable Roof', 'Indoor',
       'Retr. Roof-Closed', 'Open', nan, 'Indoor, Open Roof',
       'Retr. Roof - Closed', 'Outddors', 'Dome', 'Domed, closed',
       'Indoor, Roof Closed', 'Retr. Roof Closed',
       'Outdoor Retr Roof-Open', 'Closed Dome', 'Oudoor', 'Ourdoor',
       'Dome, closed', 'Retr. Roof-Open', 'Heinz Field', 'Outdor',
       'Retr. Roof - Open', 'Domed, Open', 'Domed, open', 'Cloudy',
       'Bowl', 'Outside', 'Domed'], dtype=object)
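
The unique values above contain many spelling variants of the same stadium type (for example 'Outdoor', 'Outdoors', 'Oudoor'), so filtering on a single exact label keeps only part of the data. One possible way to collapse them is sketched below. The function name, the keyword rules, and the canonical labels are assumptions chosen by eye from the list above, not part of the original analysis; treating 'Open', 'Bowl' and 'Heinz Field' as outdoor is likewise a guess.

```python
import pandas as pd

# Hypothetical set of spellings treated as "Outdoor" (picked from the unique values above).
OUTDOOR_VARIANTS = {
    "outdoor", "outdoors", "outddors", "oudoor", "ourdoor",
    "outdor", "outside", "open", "bowl", "heinz field",
}

def normalize_stadium_type(value):
    """Collapse spelling variants of StadiumType into a few canonical labels."""
    if pd.isna(value):
        return None
    text = str(value).strip().lower()
    if "retr" in text:
        # Retractable roofs: keep the open/closed state when it is stated.
        if "open" in text:
            return "Retractable Roof (open)"
        if "closed" in text:
            return "Retractable Roof (closed)"
        return "Retractable Roof"
    if "dome" in text or "indoor" in text:
        return "Indoor"
    if text in OUTDOOR_VARIANTS:
        return "Outdoor"
    # Anything else (e.g. 'Cloudy', which looks like a weather value) is left unknown.
    return None

sample = pd.Series(["Outdoor", "Outdoors", "Oudoor",
                    "Retr. Roof - Closed", "Domed, closed", "Cloudy"])
cleaned = sample.map(normalize_stadium_type)
print(cleaned.value_counts(dropna=False))
```

On the real frame this could be applied as `nfl['StadiumType'].map(normalize_stadium_type)` before plotting, so that the violin plots group all outdoor rows together.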
In [ ]:
plt.figure(figsize=(20,8))
sns.violinplot(y='Humidity',x='StadiumType',data=nfl[nfl.StadiumType.isin(['Outdoors','Indoors','Dome','Retractable Roof','Open'])])

plt.title('Distribution of Humidity when games played on different Stadium Types.')
plt.show()
In [ ]:
sns.set_style('darkgrid')
plt.figure(figsize=(20,8))
sns.kdeplot(nfl.Yards,shade=True)
plt.xlim(-10,25)
plt.title('Distribution of Yards gained')
plt.show()
In [ ]:
plt.figure(1,figsize=(18,8))
sns.lineplot(x='Yards',y='DefendersInTheBox',data=nfl[:1000],label='Defenders in the box',ci=50)
plt.ylim(0,15)
plt.xlim(-5,15)
plt.title('lineplot showing relation between defenders in the box vs yards gained')
plt.xlabel('yards gained')
plt.ylabel('Defenders in The Box')
plt.show()
In [ ]:
features=['X', 'Y', 'S', 'A', 'Dis', 'Orientation',
       'Dir', 'YardLine',
       'Quarter',   'Down', 'Distance',
        'HomeScoreBeforePlay', 'VisitorScoreBeforePlay', 'OffenseFormation',
       'DefendersInTheBox', 'Yards', 'PlayerWeight', 'Position', 'Week','Temperature', 'Humidity']
In [ ]:
fea_dum=[ 'OffenseFormation','Position']
In [ ]:
nfl.shape
Out[ ]:
(509762, 49)
In [ ]:
nfl=nfl.dropna()
In [ ]:
nfl.shape
Out[ ]:
(375786, 49)
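
Calling `dropna()` with no arguments removes every row that has a missing value in any of the 49 columns, including columns such as WindSpeed and WindDirection that are not in `features`. A minimal sketch of restricting the check to the modelling columns is shown below on a toy frame; the toy values and the `model_columns` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; the values are invented.
toy = pd.DataFrame({
    "Yards": [3, 5, -1, 8],
    "Temperature": [63.0, np.nan, 70.0, 55.0],
    "WindDirection": ["SW", "N", None, None],  # not used as a model feature
})
model_columns = ["Yards", "Temperature"]  # hypothetical subset of `features`

dropped_any = toy.dropna()                         # drops a row if ANY column is missing
dropped_subset = toy.dropna(subset=model_columns)  # only checks the model columns

print(len(dropped_any), len(dropped_subset))
```

On the real data the equivalent call would be `nfl.dropna(subset=features)`, which may keep noticeably more rows for training.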

Correlation matrix of features

In [ ]:
nfl_c=nfl[features].corr()
nfl_c
Out[ ]:
X Y S A Dis Orientation Dir YardLine Quarter Down Distance HomeScoreBeforePlay VisitorScoreBeforePlay DefendersInTheBox Yards PlayerWeight Week Temperature Humidity
X 1.000000 0.003343 -0.008061 -0.003337 -0.006689 0.026353 0.011966 -0.010656 0.006914 -0.013143 -0.010360 0.004804 0.014879 0.003334 -0.003435 -0.000343 0.002112 -0.003599 -0.005078
Y 0.003343 1.000000 0.001678 0.003074 0.000827 -0.067024 -0.013942 0.000689 0.004816 0.000792 -0.001104 -0.000281 0.002422 -0.001275 0.000421 0.000883 -0.001847 0.002513 -0.001487
S -0.008061 0.001678 1.000000 0.330777 0.932037 0.002138 0.000445 0.046976 -0.005997 -0.046882 0.041838 0.001316 0.002982 -0.001306 0.001152 -0.271308 -0.013287 0.026394 -0.003300
A -0.003337 0.003074 0.330777 1.000000 0.276445 0.001723 0.001488 0.033918 -0.019451 -0.005401 0.037007 -0.003252 -0.011821 -0.077009 0.022749 -0.354285 -0.004165 0.007678 0.011537
Dis -0.006689 0.000827 0.932037 0.276445 1.000000 0.003937 0.001102 0.043329 -0.007217 -0.038510 0.037423 -0.007021 -0.000847 0.002201 -0.000399 -0.250230 -0.020155 0.027174 -0.009425
Orientation 0.026353 -0.067024 0.002138 0.001723 0.003937 1.000000 0.143198 0.003472 -0.001039 -0.002435 0.006008 -0.000348 -0.001040 -0.001023 0.001277 -0.000945 0.000595 -0.001425 0.003275
Dir 0.011966 -0.013942 0.000445 0.001488 0.001102 0.143198 1.000000 -0.002451 0.000284 0.001544 -0.004081 0.000972 0.000987 0.002627 -0.000028 -0.001146 0.000756 -0.002005 -0.000979
YardLine -0.010656 0.000689 0.046976 0.033918 0.043329 0.003472 -0.002451 1.000000 0.005252 0.022947 0.075478 0.003657 0.028863 -0.171998 0.062848 -0.013554 0.019253 -0.012573 -0.003986
Quarter 0.006914 0.004816 -0.005997 -0.019451 -0.007217 -0.001039 0.000284 0.005252 1.000000 0.030819 -0.008115 0.665337 0.657335 0.045046 -0.005070 0.001266 -0.010302 0.001616 -0.002786
Down -0.013143 0.000792 -0.046882 -0.005401 -0.038510 -0.002435 0.001544 0.022947 0.030819 1.000000 -0.498482 0.016061 0.022849 0.015324 -0.025843 -0.002064 0.017018 -0.015818 -0.013128
Distance -0.010360 -0.001104 0.041838 0.037007 0.037423 0.006008 -0.004081 0.075478 -0.008115 -0.498482 1.000000 -0.001244 0.006880 -0.255616 0.071535 -0.019063 -0.020895 0.027455 -0.009115
HomeScoreBeforePlay 0.004804 -0.000281 0.001316 -0.003252 -0.007021 -0.000348 0.000972 0.003657 0.665337 0.016061 -0.001244 1.000000 0.400406 0.007361 -0.006435 -0.002410 -0.045967 0.029781 0.013266
VisitorScoreBeforePlay 0.014879 0.002422 0.002982 -0.011821 -0.000847 -0.001040 0.000987 0.028863 0.657335 0.022849 0.006880 0.400406 1.000000 0.025700 0.011669 -0.002399 -0.034387 0.007968 -0.023962
DefendersInTheBox 0.003334 -0.001275 -0.001306 -0.077009 0.002201 -0.001023 0.002627 -0.171998 0.045046 0.015324 -0.255616 0.007361 0.025700 1.000000 -0.106521 0.069761 0.009451 -0.003849 -0.011440
Yards -0.003435 0.000421 0.001152 0.022749 -0.000399 0.001277 -0.000028 0.062848 -0.005070 -0.025843 0.071535 -0.006435 0.011669 -0.106521 1.000000 -0.008596 0.003864 -0.008292 -0.004580
PlayerWeight -0.000343 0.000883 -0.271308 -0.354285 -0.250230 -0.000945 -0.001146 -0.013554 0.001266 -0.002064 -0.019063 -0.002410 -0.002399 0.069761 -0.008596 1.000000 -0.001571 -0.000722 0.002353
Week 0.002112 -0.001847 -0.013287 -0.004165 -0.020155 0.000595 0.000756 0.019253 -0.010302 0.017018 -0.020895 -0.045967 -0.034387 0.009451 0.003864 -0.001571 1.000000 -0.621173 -0.018272
Temperature -0.003599 0.002513 0.026394 0.007678 0.027174 -0.001425 -0.002005 -0.012573 0.001616 -0.015818 0.027455 0.029781 0.007968 -0.003849 -0.008292 -0.000722 -0.621173 1.000000 -0.118701
Humidity -0.005078 -0.001487 -0.003300 0.011537 -0.009425 0.003275 -0.000979 -0.003986 -0.002786 -0.013128 -0.009115 0.013266 -0.023962 -0.011440 -0.004580 0.002353 -0.018272 -0.118701 1.000000

Heat map of the correlations

In [ ]:
sns.set_style('whitegrid')
In [ ]:
plt.figure(1,figsize=(20,10))

mask = np.zeros_like(nfl_c)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(nfl_c,annot=True,cmap="YlGnBu",mask=mask)
plt.title('Heatmap of features of Nfl dataset')
plt.show()

Let's clean the data and train a model to predict yards gained.

In [ ]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
In [ ]:
sample_data=nfl.copy()
In [ ]:
features
Out[ ]:
['X',
 'Y',
 'S',
 'A',
 'Dis',
 'Orientation',
 'Dir',
 'YardLine',
 'Quarter',
 'Down',
 'Distance',
 'HomeScoreBeforePlay',
 'VisitorScoreBeforePlay',
 'OffenseFormation',
 'DefendersInTheBox',
 'Yards',
 'PlayerWeight',
 'Position',
 'Week',
 'Temperature',
 'Humidity']
In [ ]:
X=sample_data[features].copy()
In [ ]:
X.head()
Out[ ]:
X Y S A Dis Orientation Dir YardLine Quarter Down ... HomeScoreBeforePlay VisitorScoreBeforePlay OffenseFormation DefendersInTheBox Yards PlayerWeight Position Week Temperature Humidity
0 73.91 34.84 1.69 1.13 0.40 81.99 177.18 35 1 3 ... 0 0 SHOTGUN 6.0 8 212 SS 1 63.0 77.0
1 74.67 32.64 0.42 1.35 0.01 27.61 198.70 35 1 3 ... 0 0 SHOTGUN 6.0 8 288 DE 1 63.0 77.0
2 74.00 33.20 1.22 0.59 0.31 3.01 202.73 35 1 3 ... 0 0 SHOTGUN 6.0 8 270 DE 1 63.0 77.0
3 71.46 27.70 0.42 0.54 0.02 359.77 105.64 35 1 3 ... 0 0 SHOTGUN 6.0 8 245 ILB 1 63.0 77.0
4 69.32 35.42 1.82 2.43 0.16 12.63 164.31 35 1 3 ... 0 0 SHOTGUN 6.0 8 206 FS 1 63.0 77.0

5 rows × 21 columns

In [ ]:
X=X.drop(['Yards'],axis=1)
In [ ]:
X.dtypes
Out[ ]:
X                         float64
Y                         float64
S                         float64
A                         float64
Dis                       float64
Orientation               float64
Dir                       float64
YardLine                    int64
Quarter                     int64
Down                        int64
Distance                    int64
HomeScoreBeforePlay         int64
VisitorScoreBeforePlay      int64
OffenseFormation           object
DefendersInTheBox         float64
PlayerWeight                int64
Position                   object
Week                        int64
Temperature               float64
Humidity                  float64
dtype: object
In [ ]:
y=sample_data[['Yards']].copy()
In [ ]:
y.shape
Out[ ]:
(375786, 1)
In [ ]:
X.shape
Out[ ]:
(375786, 20)
In [ ]:
fea_dum
Out[ ]:
['OffenseFormation', 'Position']
In [ ]:
X=pd.get_dummies(X,columns=fea_dum)
In [ ]:
X.head()
Out[ ]:
X Y S A Dis Orientation Dir YardLine Quarter Down ... Position_OLB Position_OT Position_QB Position_RB Position_S Position_SAF Position_SS Position_T Position_TE Position_WR
0 73.91 34.84 1.69 1.13 0.40 81.99 177.18 35 1 3 ... 0 0 0 0 0 0 1 0 0 0
1 74.67 32.64 0.42 1.35 0.01 27.61 198.70 35 1 3 ... 0 0 0 0 0 0 0 0 0 0
2 74.00 33.20 1.22 0.59 0.31 3.01 202.73 35 1 3 ... 0 0 0 0 0 0 0 0 0 0
3 71.46 27.70 0.42 0.54 0.02 359.77 105.64 35 1 3 ... 0 0 0 0 0 0 0 0 0 0
4 69.32 35.42 1.82 2.43 0.16 12.63 164.31 35 1 3 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 51 columns

In [ ]:
nfl.Position.unique()
Out[ ]:
array(['SS', 'DE', 'ILB', 'FS', 'CB', 'DT', 'WR', 'TE', 'T', 'QB', 'RB',
       'G', 'C', 'OLB', 'NT', 'FB', 'MLB', 'LB', 'OT', 'OG', 'HB', 'DB',
       'S', 'DL', 'SAF'], dtype=object)
In [ ]:
X.dtypes
Out[ ]:
X                              float64
Y                              float64
S                              float64
A                              float64
Dis                            float64
Orientation                    float64
Dir                            float64
YardLine                         int64
Quarter                          int64
Down                             int64
Distance                         int64
HomeScoreBeforePlay              int64
VisitorScoreBeforePlay           int64
DefendersInTheBox              float64
PlayerWeight                     int64
Week                             int64
Temperature                    float64
Humidity                       float64
OffenseFormation_ACE             uint8
OffenseFormation_EMPTY           uint8
OffenseFormation_I_FORM          uint8
OffenseFormation_JUMBO           uint8
OffenseFormation_PISTOL          uint8
OffenseFormation_SHOTGUN         uint8
OffenseFormation_SINGLEBACK      uint8
OffenseFormation_WILDCAT         uint8
Position_C                       uint8
Position_CB                      uint8
Position_DB                      uint8
Position_DE                      uint8
Position_DL                      uint8
Position_DT                      uint8
Position_FB                      uint8
Position_FS                      uint8
Position_G                       uint8
Position_HB                      uint8
Position_ILB                     uint8
Position_LB                      uint8
Position_MLB                     uint8
Position_NT                      uint8
Position_OG                      uint8
Position_OLB                     uint8
Position_OT                      uint8
Position_QB                      uint8
Position_RB                      uint8
Position_S                       uint8
Position_SAF                     uint8
Position_SS                      uint8
Position_T                       uint8
Position_TE                      uint8
Position_WR                      uint8
dtype: object
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=324)
In [ ]:
X_train.shape
Out[ ]:
(263050, 51)
In [ ]:
X_test.shape
Out[ ]:
(112736, 51)
In [ ]:
y_test.shape
Out[ ]:
(112736, 1)

First, let's try a linear regressor

In [ ]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [ ]:
y_prediction = regressor.predict(X_test)
y_prediction
Out[ ]:
array([[5.60823465],
       [3.48037438],
       [6.94031386],
       ...,
       [4.41882401],
       [4.6356321 ],
       [3.72012659]])
In [ ]:
y_test.describe()
Out[ ]:
Yards
count 112736.000000
mean 4.235985
std 6.467165
min -14.000000
25% 1.000000
50% 3.000000
75% 6.000000
max 99.000000
In [ ]:
RMSE_L = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))

The mean of y_test is 4.23 and its standard deviation is 6.467, so an RMSE of 6.416 is barely better than always predicting the mean. That is not acceptable.

In [ ]:
print(RMSE_L)
6.416476173835028
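
One way to put the RMSE above in context is to compare it with a baseline that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on synthetic targets; the random data and the variable names are assumptions for illustration only, and the numbers it prints are not results from the NFL data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic targets with roughly the same mean and spread as y_test above.
rng = np.random.default_rng(0)
y_train_toy = rng.normal(4.2, 6.5, size=1000)
y_test_toy = rng.normal(4.2, 6.5, size=500)

# The baseline ignores the features, so a column of zeros is enough.
X_train_toy = np.zeros((1000, 1))
X_test_toy = np.zeros((500, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_rmse = np.sqrt(mean_squared_error(y_test_toy, baseline.predict(X_test_toy)))

# The RMSE of a constant-mean predictor is close to the standard deviation of the targets.
print(baseline_rmse, y_test_toy.std())
```

With the real split, fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` gives the number a model has to beat.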

Let's check with Decision tree regressor.

In [ ]:
regressor_d = DecisionTreeRegressor()
regressor_d.fit(X_train, y_train)
Out[ ]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [ ]:
y_prediction_d = regressor_d.predict(X_test)
y_prediction_d
Out[ ]:
array([-3.,  1.,  0., ..., -3., -1.,  8.])
In [ ]:
RMSE_D = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction_d))

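Eureka! Finally some progress.


The decision tree regressor has an RMSE of 0.8457.


Given the y_test mean of 4.23, it's a good start. Note, however, that each play contributes one row per player with the same Yards value, and the random row-level split above places rows from the same play in both the training and test sets, so this score is likely optimistic.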

In [ ]:
print(RMSE_D)
0.8456960964766238
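
The first rows of the dataset share a PlayId and the same Yards value, so a row-level `train_test_split` can put rows from one play on both sides of the split. A group-aware split is one way to check how much of the score survives when that is prevented. The sketch below uses scikit-learn's `GroupShuffleSplit` on a toy frame; the toy data, the 22-rows-per-play layout, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame: 6 plays with 22 player rows each; every row of a play shares its Yards.
rng = np.random.default_rng(0)
play_ids = np.repeat(np.arange(6), 22)
toy = pd.DataFrame({
    "PlayId": play_ids,
    "S": rng.random(play_ids.size),
    "Yards": np.repeat(rng.integers(-2, 10, size=6), 22),
})

# Split by play rather than by row, so that no play appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=324)
train_idx, test_idx = next(splitter.split(toy, groups=toy["PlayId"]))
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

shared = set(train["PlayId"]) & set(test["PlayId"])
print(len(shared))  # number of plays present in both train and test
```

On the real data the groups would be `sample_data['PlayId']`, and the regressors could be refit on the resulting train rows to see whether the RMSE changes.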
In [ ]:
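# Reuse the test index so the predictions stay aligned with X_test and y_test when plotted together
y_predict_lin_data=pd.DataFrame(y_prediction,columns=['predict_linear_regressor'],index=X_test.index)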
In [ ]:
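# Reuse the test index so the predictions stay aligned with X_test and y_test when plotted together
y_predict_dec_data=pd.DataFrame(y_prediction_d,columns=['predict_decision_tree_regressor'],index=X_test.index)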

Here is the original line plot of yards gained vs. defenders in the box

In [ ]:
plt.figure(1,figsize=(18,8))
sns.lineplot(x='Yards',y='DefendersInTheBox',data=nfl[:1000],label='Defenders in the box',ci=50,color='red')
plt.ylim(0,15)
plt.title('lineplot showing relation between defenders in the box vs yards gained')
plt.xlabel('yards gained')
plt.ylabel('Defenders in The Box')
plt.show()

Now let's plot the predictions of the decision tree regressor.

This helps to see visually how the predicted data differ from the test data.

In [ ]:
sns.set_style('darkgrid')
plt.figure(1,figsize=(18,8))
sns.lineplot(x=y_test.Yards,y=X_test.DefendersInTheBox,label='test data',color='red')
sns.lineplot(x=y_predict_dec_data.predict_decision_tree_regressor,y=X_test.DefendersInTheBox,label='decision tree regressor predicted data',color='blue')
plt.ylim(0,14)
plt.xlim(-8,20)
plt.title('lineplot showing relation between defenders in the box vs yards gained')
plt.xlabel('yards gained')
plt.ylabel('Defenders in The Box')
plt.show()

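Summary:


The linear regressor failed to predict the yards gained: its RMSE of 6.416 is close to the standard deviation of y_test (6.467).


The decision tree regressor gave a much lower RMSE of 0.8457.


That looks acceptable against the y_test mean of 4.23 yards, but it should be confirmed with a split that keeps all rows of a play on the same side, because the row-level split used here shares plays between the training and test data.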

