- Project Overview
- Project Requirements
- Project Proposal
- Machine Learning Overview
- Future Improvements
- Tech Stack
- Languages
- Libraries
- Dataset
- IDE
- Local Environment Setup Instructions
- Command Line Interface (CLI) Version
- Graphical User Interface (GUI) Version (in progress)
Design and develop a fully functional data product (application) addressing your identified
business problem or organizational need.
The deliverables include the application and
a written report, also located in this repository. The report contains a Letter of Transmittal to Commissioner Goodell,
a project proposal plan, and a post-implementation report.
Data Methods – provide one descriptive method that discerns relationships and characteristics of the past data in at least three forms of visualization. Also provide one non-descriptive method from which a decision or trend could be inferred. The descriptive method should be in the domain of cluster or association analysis; the non-descriptive method could include a pruning algorithm, discriminant analysis, regression analysis (linear, logistic), Bayesian methods, neural networks, or support vector machines.
Datasets – The use of dataset(s) is a critical element and involves the gathering and measuring of information on targeted variables in a systematic fashion. This could be student collected (Please consider IRB ramifications.) or publicly accessible such as websites (e.g. Kaggle.com), governmental (e.g. Department of Labor), or software related (e.g. GitHub.com).
Analytics – Using the given data, your application needs to enable decisions to be formulated or support for given trends to be provided.
Data Cleaning – if applicable, create a function that makes the data usable before the application actually consumes it. This includes feature engineering, parsing, cleaning, and wrangling the datasets.
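A cleaning function for this project might look like the sketch below. The column names (`qb_cap`, `offense_cap`, `defense_cap`, `total_cap`) are illustrative assumptions, not the real dataset's schema:

```python
import pandas as pd

def clean_cap_data(df: pd.DataFrame) -> pd.DataFrame:
    """Convert raw salary strings into numeric percentage-of-cap features.

    Column names here are hypothetical; adjust to the actual CSV headers.
    """
    df = df.copy()
    # Strip currency formatting like "$12,345,678" and coerce to float.
    for col in ("qb_cap", "offense_cap", "defense_cap", "total_cap"):
        df[col] = (
            df[col].astype(str)
                   .str.replace(r"[$,]", "", regex=True)
                   .astype(float)
        )
    # Derive the percentage-of-cap features the model trains on.
    df["qb_pct"] = df["qb_cap"] / df["total_cap"] * 100
    df["off_pct"] = df["offense_cap"] / df["total_cap"] * 100
    df["def_pct"] = df["defense_cap"] / df["total_cap"] * 100
    # Drop rows with missing values so the model never sees NaNs.
    return df.dropna()

sample = pd.DataFrame({
    "qb_cap": ["$30,000,000"],
    "offense_cap": ["$100,000,000"],
    "defense_cap": ["$90,000,000"],
    "total_cap": ["$208,200,000"],
})
cleaned = clean_cap_data(sample)
```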
Data Visualization – You need at least three real-time (e.g. using the GUI/dashboard) formats to visualize the data in a graphic format. Look at things like charting, mapping, color theory, plots, diagrams, or other methods (tables must include heat mapping).
Real-Time Queries – As part of your GUI, enable users to access and manipulate data in real time, including data maintenance. This does not deal with data "freshness" but with the query response time being in seconds.
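Since SQLite appears in the Tech Stack, a real-time query with data maintenance could be sketched as below. The table name, columns, and values are illustrative, not the project's actual schema:

```python
import sqlite3

# In-memory database standing in for the real salary-cap store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cap (team TEXT, year INTEGER, qb_pct REAL)")
conn.executemany(
    "INSERT INTO cap VALUES (?, ?, ?)",
    [("KC", 2022, 16.8), ("PHI", 2022, 2.1), ("KC", 2021, 17.5)],
)

# A real-time query: parameterized, with a sub-second response.
rows = conn.execute(
    "SELECT year, qb_pct FROM cap WHERE team = ? ORDER BY year", ("KC",)
).fetchall()

# Data maintenance: correcting a stored value in place.
conn.execute(
    "UPDATE cap SET qb_pct = ? WHERE team = ? AND year = ?", (17.0, "KC", 2021)
)
conn.commit()
```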
Adaptive Element – if appropriate for the business need, provide the implementation of machine-learning methods and algorithms to enable the application to improve with experience.
Outcome Accuracy – provide functionalities that evaluate the accuracy of the information/outcomes given by the application. What are the parameters for valid output data and how will those be checked by the application?
Dashboard – include a user-friendly, functional dashboard that enables the query and display of the data, as well as other functionality described in this section. This could be stand-alone, CLI, Web-based, or a mobile application interface.
My application aims to utilize machine learning to assist the NFL in predicting the playoff likelihood of any NFL team based on how they allocate their salaries by position. The dataset consists of salary cap data from 2013-2022.
The ML model chosen is a Random Forest Classification model, which falls under the supervised learning branch of machine learning. The features used to build the model were the percentage of the cap allocated to the QB position, the percentage allocated to the offense as a whole, and the percentage allocated to the defense as a whole. I split the data into training and testing subsets (70/30 split) and fit it to a RandomForestClassifier imported from scikit-learn. The model then made predictions on random samples of the testing data, and the results were used to compute the accuracy score, classification report, and confusion matrix.
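The pipeline described above can be sketched as follows. The feature values and playoff labels here are synthetic stand-ins, since the real cleaned 2013-2022 dataset is not reproduced in this README:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cap dataset: ~32 teams x 10 seasons.
rng = np.random.default_rng(42)
n = 320
X = pd.DataFrame({
    "qb_pct": rng.uniform(2, 18, n),    # % of cap spent on the QB
    "off_pct": rng.uniform(35, 60, n),  # % of cap spent on offense
    "def_pct": rng.uniform(35, 60, n),  # % of cap spent on defense
})
y = rng.integers(0, 2, n)  # 1 = made playoffs, 0 = missed

# 70/30 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out testing subset.
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
```

With random labels the accuracy hovers near chance; on the real cap data the same calls report the genuine model performance.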
The main improvement I want to make is increasing the accuracy of the application, likely by introducing more data
and tuning the ML model's hyperparameters.
Other improvements include:
- Switching to a regression model
- Developing the program into a web application (frontend UI and backend API)
- Introducing ensemble methods
- Adding more methods to monitor reliability
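The regression switch could look something like this minimal sketch, swapping RandomForestClassifier for scikit-learn's RandomForestRegressor so the target becomes a continuous playoff-likelihood score rather than a yes/no label (the data here is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Same hypothetical cap-percentage features, but the target is now a
# continuous playoff-likelihood score in [0, 1] instead of a binary label.
rng = np.random.default_rng(7)
X = rng.uniform(0, 60, size=(320, 3))  # qb_pct, off_pct, def_pct
y = rng.uniform(0, 1, size=320)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)
reg = RandomForestRegressor(random_state=7).fit(X_train, y_train)
scores = reg.predict(X_test)  # each prediction is a likelihood-style score
```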
Languages: Python, SQL (database)
Libraries: Pandas, Scikit-learn, Matplotlib, NumPy, Seaborn, SQLite
Dataset: NFL Salary Cap Spending 2013-2022 (link)
IDE: PyCharm 2023.1.12 (Community)
These instructions assume that Git is installed on your computer and that you have a basic knowledge of Git and terminal navigation.
1. Clone the repository to your local machine.
git clone <repository SSH URL>
2. Open in your chosen IDE. I recommend PyCharm since that is what was used to develop this program.
3. Install 'pip' if you don't already have it.
4. Navigate to the project directory and run the following command:
pip install scikit-learn matplotlib numpy seaborn pandas