Project Background: The following Jupyter Notebook uses several classification machine learning techniques to try to correctly identify an NBA player's position from their physical traits and game stats. With the NBA shifting toward a positionless style of play, I thought it would be interesting to apply statistical Python code to this problem.
Approach: The approach is simple: take data from an open-source Kaggle data set that covers the 1950 through 2017 NBA seasons. Using this data I will create four different types of models (KNN, SVM, Random Forest, AdaBoost) and try to optimize each model using the modeling techniques learned in module 5.
An OSEMN framework was followed for this project:
Obtain: Gather the data from the relevant resources and stakeholders.
Scrub: Clean the data into formats that can be digested by Python packages such as Sklearn or Statsmodels. Remember: "Garbage in, garbage out".
Explore: Use statistical methods and data analytic techniques to explore the data and find significant patterns or trends.
Model: Construct models to predict and forecast the data. Here we focus on our target variable, which is the player's position (the Pos column).
Interpret: Take the results of the analysis and models and create meaningful visualizations or presentations.
This project was eye-opening. It cemented many of the machine learning techniques learned throughout module 5. I truly believe I now have both a technical and fundamental grasp of the following models: Random Forest, KNN, and SVM. I finally got to choose a topic for a project, and of course I chose the NBA. I am a very passionate LeBron James fan and will always say he is the GOAT. Anyway, I thought it would be perfect to classify NBA positions, especially in the era of positionless basketball we see in today's game. I will now walk you through my analysis with the code snippets that were necessary for the success of this project.
The data set used can be found here: https://www.kaggle.com/drgilermo/nba-players-stats. It is an archive containing three CSV files. I used the player data and player stats CSVs to run classification models on. The first step was to divide the data into different eras, as not every year has the same stats. I decided to divide the data into the following:
- 1950-1956
- 1957-1979
- 1980-1998
- 1999-2017
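The era split listed above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column name Year, the helper name split_by_era, and the demo rows are assumptions made up for the example. It also drops columns that are completely empty within an era, which is one way to handle the fact that not every year has the same stats.

```python
import numpy as np
import pandas as pd

# Era boundaries taken from the list above (inclusive on both ends).
ERAS = [(1950, 1956), (1957, 1979), (1980, 1998), (1999, 2017)]

def split_by_era(stats, year_col="Year"):
    """Return one frame per era, keyed like "1950_1956".

    `year_col` is an assumed column name; columns with no values at all
    inside an era are dropped from that era's frame.
    """
    frames = {}
    for start, end in ERAS:
        era = stats[stats[year_col].between(start, end)]
        frames[f"{start}_{end}"] = era.dropna(axis=1, how="all").reset_index(drop=True)
    return frames

# Made-up rows for illustration only; these are not real player stats.
demo = pd.DataFrame({
    "Year": [1952, 1960, 1985, 2010],
    "PTS": [10.0, 12.0, 15.0, 20.0],
    "3P": [np.nan, np.nan, 1.0, 2.0],  # a stat missing from the early rows
})
eras = split_by_era(demo)
for name, frame in eras.items():
    print(name, list(frame.columns))
```

Each resulting frame can then be cleaned and modeled on its own.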
From here I cleaned each of the data frames and ensured all of the data was digestible by the models I wanted to build. The four models I chose were K-Nearest Neighbors, AdaBoost, Random Forest, and Support Vector Machine. Below are the functions I built to model each of the four data sets.
Random Forest
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

def random_forest(df, X, y):  # Random Forest model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)  # train/test split
    clf = RandomForestClassifier(random_state=0)  # classifier defined
    clf.fit(X_train, y_train)  # fit on the training data
    preds = clf.predict(X_test)  # predictions on the test data
    acc = round(accuracy_score(y_test, preds) * 100, 2)  # basic accuracy score
    print("Accuracy is: {0}%".format(acc))
    feat_importances = pd.DataFrame(clf.feature_importances_, index=X.columns, columns=['Score'])  # importance score for every feature
    feat_importances = feat_importances.sort_values(by='Score', ascending=True)  # ascending, so the most important features come last
    feat_importances.plot(kind='barh')  # plot the features in a horizontal bar chart
    plt.show()
    print(pd.crosstab(y_test, preds, rownames=['Actual Result'], colnames=['Predicted Result']))  # confusion matrix showing which positions were predicted
    Refined_X = feat_importances.index[-10:]  # rerun the model using only the top 10 features
    X_key_features = df[Refined_X]
    y_key_features = df['Pos']
    X_train, X_test, y_train, y_test = train_test_split(X_key_features, y_key_features, test_size=.2)  # note: this is a new random split
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    acc = round(accuracy_score(y_test, preds) * 100, 2)
    print("Accuracy is: {0}%".format(acc))
    feat_importances = pd.DataFrame(clf.feature_importances_, index=X_key_features.columns, columns=['Score'])
    feat_importances = feat_importances.sort_values(by='Score', ascending=True)
    feat_importances.plot(kind='barh')
    plt.show()
    print(classification_report(y_test, preds))
    return pd.crosstab(y_test, preds, rownames=['Actual Result'], colnames=['Predicted Result'])
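One design point worth noting: the function draws a fresh random split before refitting on the top 10 features, so the two accuracy numbers come from different test rows. A hedged sketch of an alternative is shown here, where the same split is reused so the full and reduced models are compared on identical data. The helper name compare_top_k and the synthetic data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def compare_top_k(X, y, k=10, seed=0):
    """Fit on all features, then refit on the k most important ones, reusing one split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    full = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    # Rank features by importance learned from the training rows only.
    ranked = pd.Series(full.feature_importances_, index=X.columns).sort_values(ascending=False)
    top = list(ranked.index[:k])
    reduced = RandomForestClassifier(random_state=seed).fit(X_train[top], y_train)
    return {
        "top_features": top,
        "full_accuracy": accuracy_score(y_test, full.predict(X_test)),
        "reduced_accuracy": accuracy_score(y_test, reduced.predict(X_test[top])),
    }

# Synthetic stand-in for the player stats (illustration only).
features, labels = make_classification(
    n_samples=300, n_features=15, n_informative=5, n_classes=3, random_state=0)
X_demo = pd.DataFrame(features, columns=[f"f{i}" for i in range(15)])
y_demo = pd.Series(labels, name="Pos")
result = compare_top_k(X_demo, y_demo, k=10)
print(result["full_accuracy"], result["reduced_accuracy"])
```

Stratifying the split keeps the position classes in similar proportions in the training and test rows.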
SVM
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def SVM(df, X, y):  # Support Vector Machine model (df is kept for a consistent signature but is not used)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)  # train/test split
    svclassifier = SVC(kernel='linear')  # defining the classifier
    svclassifier.fit(X_train, y_train)  # fitting the SVM classifier
    y_pred = svclassifier.predict(X_test)  # predicting y values from X_test
    acc = round(accuracy_score(y_test, y_pred) * 100, 2)  # accuracy score
    print("Accuracy is: {0}%".format(acc))
    print(classification_report(y_test, y_pred))  # classification report with recall, precision, and F1 score
    return pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])  # confusion matrix
KNN
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def KNN(X, y):  # K-Nearest Neighbors model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)  # train/test split
    clf = KNeighborsClassifier()  # classifier with the default k
    clf.fit(X_train, y_train)  # classifier fitted
    test_preds = clf.predict(X_test)  # predicting y values from X_test
    acc = round(accuracy_score(y_test, test_preds) * 100, 2)  # accuracy score
    print("Accuracy is: {0}%".format(acc))
    print(classification_report(y_test, test_preds))  # classification report with recall, precision, and F1 score
    k_range = range(1, 25)  # candidate values of k: 1 through 24
    scores = {}  # test accuracy for each k
    for k in k_range:  # fit and score a model for each value of k
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        scores[k] = accuracy_score(y_test, knn.predict(X_test))
    plt.plot(k_range, list(scores.values()))
    plt.xlabel('Value of K')
    plt.ylabel('Testing Accuracy')
    plt.show()  # graph showing how the accuracy changes as k goes from 1 to 24
    optimal_k = max(scores, key=scores.get)  # k with the highest test accuracy (smallest k wins ties)
    print("Best k is {0} with accuracy {1:.4f}".format(optimal_k, scores[optimal_k]))
    clf = KNeighborsClassifier(n_neighbors=optimal_k)  # rerunning the model with the optimal k value
    clf.fit(X_train, y_train)
    test_preds = clf.predict(X_test)
    acc = round(accuracy_score(y_test, test_preds) * 100, 2)
    print("Accuracy is: {0}%".format(acc))
    print(classification_report(y_test, test_preds))
    return pd.crosstab(y_test, test_preds, rownames=['Actual Result'], colnames=['Predicted Result'])
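The function above picks k by looking at test accuracy, which tends to make the final test score look a little better than it would on unseen data. A common alternative, sketched here under assumptions, is to choose k by cross-validation on the training rows and touch the test rows only once at the end. The helper name tune_knn and the synthetic data are made up for this example; GridSearchCV is scikit-learn's standard grid search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(X, y, k_values=range(1, 25), seed=0):
    """Pick k with 5-fold cross-validation on the training rows, then score once on the test rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    search = GridSearchCV(
        KNeighborsClassifier(),
        {"n_neighbors": list(k_values)},
        cv=5,
        scoring="accuracy",
    )
    search.fit(X_train, y_train)  # refits the best model on all training rows
    return search.best_params_["n_neighbors"], search.score(X_test, y_test)

# Synthetic stand-in for the scaled player stats (illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0)
best_k, test_accuracy = tune_knn(X_demo, y_demo)
print("best k:", best_k, "test accuracy:", round(test_accuracy, 3))
```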
AdaBoost
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

def ADABOOST(X, y):  # AdaBoost model
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)  # train/test split
    adaboost_clf = AdaBoostClassifier()  # AdaBoost classifier
    adaboost_clf.fit(X_train, y_train)  # fit the model
    y_pred = adaboost_clf.predict(X_test)  # predict the value of y using X_test
    acc = round(accuracy_score(y_test, y_pred) * 100, 2)  # accuracy score
    print("Accuracy is: {0}%".format(acc))
    print(classification_report(y_test, y_pred))  # classification report
    return pd.crosstab(y_test, y_pred, rownames=['Actual Result'], colnames=['Predicted Result'])  # confusion matrix
The models above were run on scaled data for each of the four eras of basketball. In my analysis, I saw the accuracy and the classification report metrics (F1-score, precision, and recall) all increase as we progressed through the historical data. The best models were Random Forest and SVM on the later eras (1980-1998 and 1999-2017). In those eras the two models reached roughly 75-80 percent accuracy, with precision, recall, and F1-scores averaging a similar 77-79 percent.
In my opinion, around 80 percent accuracy is very good for this dataset. Still, I wanted to try to improve it further, so I decided to reduce the number of position classes by combining SG/SF and PF/C. The code below accomplishes this:
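The scaling step itself is not shown in this write-up. One common way to do it, sketched here as an assumption rather than the notebook's actual method, is to put a StandardScaler and the classifier in a single scikit-learn pipeline so the scaler is fit only on the training portion of each fold. The helper name scaled_svm_scores and the synthetic five-class data are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def scaled_svm_scores(X, y, folds=5):
    """Cross-validated accuracy for a linear SVM with standardized features."""
    # The pipeline refits the scaler inside every fold, so no test-fold
    # statistics leak into training.
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")

# Synthetic five-class data standing in for the five positions (illustration only).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=5, random_state=0)
scores = scaled_svm_scores(X_demo, y_demo)
print("mean accuracy:", round(scores.mean(), 3))
```

Averaging over several folds also gives a steadier estimate than a single 80/20 split.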
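# Collapse the five position classes into three. Selecting the 'Pos' column with .loc
# changes only the label; a bare boolean mask would overwrite every column in the matching rows.
NBA_1999_2017.loc[NBA_1999_2017['Pos'] == 3, 'Pos'] = 2  # SF joins SG
NBA_1999_2017.loc[NBA_1999_2017['Pos'] == 4, 'Pos'] = 3  # PF becomes class 3
NBA_1999_2017.loc[NBA_1999_2017['Pos'] == 5, 'Pos'] = 3  # C joins PF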
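When relabeling with a boolean mask, the column label matters. Assigning a value to df[mask] writes it into every column of the matching rows, features included, which would make the merged classes trivially separable, so any accuracy measured after such an assignment is worth rechecking. The small sketch below contrasts that with a mapping applied to the Pos column only. The demo frame and its height column are made-up values, and the 1-5 encoding (PG, SG, SF, PF, C) is assumed from the text.

```python
import pandas as pd

# Made-up rows; Pos uses the assumed 1-5 encoding (PG, SG, SF, PF, C).
demo = pd.DataFrame({
    "Pos": [1, 2, 3, 4, 5],
    "height": [185, 193, 201, 206, 213],
})

# Whole-row assignment: every column of the matching row becomes 2.
overwritten = demo.copy()
overwritten[overwritten["Pos"] == 3] = 2

# Column-only relabeling: the features are left untouched.
merged = demo.copy()
merged["Pos"] = merged["Pos"].map({1: 1, 2: 2, 3: 2, 4: 3, 5: 3})

print(overwritten.loc[2].tolist())  # [2, 2]
print(merged["Pos"].tolist())       # [1, 2, 2, 3, 3]
print(merged["height"].tolist())    # [185, 193, 201, 206, 213]
```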
The new Position column now has three classes: 1 (PG), 2 (SG/SF), and 3 (PF/C). When I reran all of the models the results were astonishing: the models improved to about 95 percent accuracy on average, with C/PF almost always predicted perfectly. What does this tell us? I believe the NBA has always been positionless, although it is now more evident than ever. I want to do further analysis and try to improve the five-class models through hyperparameter tuning and other preprocessing techniques. In addition, I plan to run models on the data as a whole, without the era split, to see if I end up with different results.
As always, there is more you can do with your data. Data is non-exhaustive. My journey through machine learning modeling has just begun, and I look forward to future endeavors, as I feel I have just scratched the surface with this first project. There are multidimensional levels to the data science world.