Data Science

Predicting NFL Total Score and Point Spread Bets

This project took on the challenge of developing machine learning models aimed at forecasting NFL game outcomes, specifically focusing on total scores and point spreads. Jacob Conrad ’24 utilized a dataset of NFL games spanning from 2018 to 2023, encompassing pre-game player projections, weather conditions, and stadium details. On this dataset, he employed various techniques including linear regression, gradient boosting regression and neural networks.

Overview

This study explores predictive factors such as weather and player skill to develop a robust betting strategy. By avoiding emotional biases in favor of data-driven approaches, this study seeks to provide bettors with tools to place profitable NFL sports bets.

School

College of Arts & Sciences

Program

BS in Data Science

Author

Jacob Conrad '24

Predicting NFL Total Score and Point Spread Bets Through Data Science Methods & Machine Learning Models

Introduction

NFL sports betting has grown into a significant market driven by online accessibility and promotion. As of 2003, it ranked as the 11th largest industry in the U.S., and NFL betting has since grown exponentially. By 2016, over $90 billion was wagered on NFL games, with the Super Bowl alone amassing $4.7 billion in bets. As states continue to legalize both online and in-person sports betting, the trajectory of this market appears poised for continued growth.

Some of the most popular bets that sports bettors place are pre-game spread and total score bets. A spread bet is a bet on the point differential of the game. For example, if a bet is placed on a team at -2.5, the bet wins if that team wins by 3 or more points and loses otherwise. For a total score bet, bettors can choose to bet “over” or “under” a designated total number of points scored in the game.
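The two bet types described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names are hypothetical, and the "push" outcome (a tie against a whole-number line, where the stake is refunded) is a standard sportsbook convention that the project summary does not discuss.

```python
def grade_spread_bet(team_points, opponent_points, spread):
    """Grade a spread bet on a team at the given line (e.g. -2.5 for a favorite)."""
    # Add the spread to the team's actual margin of victory.
    adjusted_margin = (team_points - opponent_points) + spread
    if adjusted_margin > 0:
        return "win"
    if adjusted_margin < 0:
        return "loss"
    return "push"  # only possible with whole-number lines (assumed convention)


def grade_total_bet(home_points, away_points, line, side):
    """Grade an over/under bet; side must be "over" or "under"."""
    if side not in ("over", "under"):
        raise ValueError("side must be 'over' or 'under'")
    total = home_points + away_points
    if total == line:
        return "push"
    went_over = total > line
    return "win" if went_over == (side == "over") else "loss"


print(grade_spread_bet(24, 21, -2.5))          # favorite wins by 3 -> win
print(grade_spread_bet(23, 21, -2.5))          # favorite wins by 2 -> loss
print(grade_total_bet(24, 21, 47.5, "under"))  # 45 total points -> win
```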

Additionally, sports bettors need to understand their payout. Notably, bettors face a vigorish, or commission, which means they must win 52.38% of their bets to break even. Despite the market's apparent efficiency, research suggests that bettor tendencies, rather than team strengths, influence line setting.
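The 52.38% figure can be reproduced with a short calculation. The sketch below assumes the common -110 American odds (risk 110 to win 100); the project summary does not state the odds used, so treat that value and the function name as assumptions.

```python
def break_even_win_rate(american_odds):
    """Win probability at which a bet at these American odds has zero expected profit."""
    if american_odds == 0:
        raise ValueError("American odds cannot be zero")
    if american_odds < 0:
        risk, win = -american_odds, 100  # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds   # e.g. +150: risk 100 to win 150
    # Break-even when p * win == (1 - p) * risk, so p = risk / (risk + win).
    return risk / (risk + win)


print(f"{break_even_win_rate(-110):.2%}")  # 52.38%
```

Any win rate above this value yields a positive expected profit at those odds, which is why a model's classification accuracy can be compared directly against it.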

However, most bettors struggle to profit due to emotional decision-making and the temptation of chasing losses. This study explores predictive factors like weather, player skill, and other factors to develop a robust betting strategy. By avoiding emotional biases in favor of data-driven approaches, this study seeks to provide bettors with tools to place profitable NFL sports bets.

Approach

Dataset

The dataset includes 1,560 NFL games from 2018-2023. All NFL game information originated from Pro Football Focus.
Pre-game player performance projections were acquired through fantasydata.com. Performance projections include offensive, defensive, and special teams player projections.
All over/under lines and spreads were acquired through rotowire.com.
All stadiums were determined using retroseasons.com and then stadium coordinates were determined using latitude.to. All weather data was collected using weatherapi.com.

Predicting NFL total scores and spreads

The number of training variables was limited by selecting only the most relevant features sequentially (forward selection) or by penalizing less important ones (lasso).
The dataset was split into 'K' subsets, and the model was trained and tested 'K' times, using a different subset each time for testing and the remaining subsets for training (K-Fold cross-validation).
Each of the three feature sets (all features, forward-selection features, lasso-selected features) was used to fit linear regression, gradient boosting regression, and feed-forward neural network models to predict total score and point differential.
Models were restricted to betting only on games where their prediction differed from the set line by a specified percentage (threshold).
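Two of the steps above, the K-Fold split and the threshold rule, can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's implementation: the function names are hypothetical, the folds are not shuffled, and the percentage difference is assumed to be measured relative to the sportsbook line and applied with "at least" semantics, since the summary does not give the exact formula. The example covers total score lines only.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def select_bets(predictions, lines, threshold_pct):
    """Return (game_index, side) for games whose predicted total differs from
    the line by at least threshold_pct percent (assumed definition)."""
    bets = []
    for i, (pred, line) in enumerate(zip(predictions, lines)):
        pct_diff = abs(pred - line) / abs(line) * 100
        if pct_diff >= threshold_pct:
            bets.append((i, "over" if pred > line else "under"))
    return bets


# Hypothetical predicted totals and over/under lines for three games.
predicted_totals = [48.0, 44.0, 51.0]
total_lines = [47.5, 40.0, 50.5]
print(select_bets(predicted_totals, total_lines, threshold_pct=5))  # [(1, 'over')]

for train_idx, test_idx in k_fold_indices(n_samples=10, k=3):
    print(len(train_idx), len(test_idx))
```

Raising the threshold leaves fewer games to bet on, which is the trade-off between selectivity and sample size that the results examine.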

These were all created in Python. The linear regression model was created using the statsmodels package, the neural network was created using the TensorFlow package, and the rest of the models were created using the scikit-learn (sklearn) package.

Identifying the best model of prediction and evaluating profitability

Any models that exceed 52.4% test accuracy are considered profitable and any models achieving lower test accuracy are considered unprofitable.
Root mean squared error gives an idea of how close a model was to predicting the correct total score or point differential.
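The two evaluation measures above can be computed as follows. This is a minimal sketch with made-up numbers: the function names and sample data are hypothetical, the break-even constant assumes -110 odds, and skipping pushes is an assumption about how ties against the line are handled.

```python
import math

BREAK_EVEN = 110 / 210  # about 52.38%, assuming standard -110 odds


def rmse(predictions, actuals):
    """Root mean squared error between predicted and actual values."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def over_under_accuracy(predictions, lines, actuals):
    """Share of graded games where the model picked the correct side of the line."""
    correct = graded = 0
    for pred, line, actual in zip(predictions, lines, actuals):
        if actual == line or pred == line:
            continue  # push or no pick: not graded (assumed handling)
        graded += 1
        if (pred > line) == (actual > line):
            correct += 1
    return correct / graded if graded else float("nan")


# Hypothetical predicted totals, sportsbook lines, and final totals.
predicted = [48.0, 44.0, 51.0, 39.0]
lines = [47.5, 40.0, 50.5, 42.0]
actual = [52, 38, 55, 35]

model_rmse = rmse(predicted, actual)
book_rmse = rmse(lines, actual)  # the sportsbook line serves as the baseline
accuracy = over_under_accuracy(predicted, lines, actual)
print(f"model RMSE {model_rmse:.2f}, sportsbook RMSE {book_rmse:.2f}")
print(f"accuracy {accuracy:.1%}, profitable: {accuracy > BREAK_EVEN}")
```

Comparing the model's RMSE against the RMSE of the lines themselves mirrors how the results report the sportsbook figure as the benchmark.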

Results

Models were fit to the data using three different techniques: linear regression, gradient boosting regression, and feedforward neural networks. The models were trained using predictor variables selected through a forward-selection process, a lasso-selection process, and a set of all predictor variables. This resulted in a total of nine unique combinations for both the total score and the spread. Thresholds were applied to these models to only predict games where the projected outcomes deviated by a specified percentage from the over/under and spread lines established by sportsbooks.

Predicting Total Score

Table 1 illustrates the results of the total score models. Only linear regression and gradient boosting regression models demonstrated a root mean squared error better than that of sportsbooks, which was 13.22. However, all models obtained profitable classification accuracy for at least one threshold.

Predicting Spread

Table 2 illustrates the results of the spread models. Only a few models demonstrated a root mean squared error better than that of sportsbooks, which was 12.87. However, all models except the neural network trained on the set of all variables obtained profitable classification accuracy for at least one threshold.

Results Data

Table 1. Best root mean squared errors and classification accuracies that were achieved across all total score models. Thresholds are in parentheses. A minimum sample size of 500 games bet was required.

Table 2. Best root mean squared errors and classification accuracies that were achieved across all spread models. Thresholds are in parentheses. A minimum sample size of 500 games bet was required.

Figure 1. A line plot of average root mean squared error against threshold % by model type for total score predictions.

Figure 2. A line plot of average classification accuracy against threshold % by model type for total score predictions.

Figure 3. A line plot of average root mean squared error against threshold % by model type for spread predictions.

Figure 4. A line plot of average classification accuracy against threshold % by model type for spread predictions.

Figure 5. Line plots of average root mean squared error against threshold % by model type for correct and incorrect predictions. A) Total score predictions, B) Spread predictions.

Discussion

All three types of models achieved profitable accuracy for total score and spread.
Gradient boosting regression achieved the highest accuracy on average, especially at higher thresholds.
Root mean squared error increases with larger thresholds, but this is expected: at higher thresholds, the models bet a larger share of uncommon lines and sample sizes decrease.
Forward selection tended to produce the best results, then lasso, then the set of all features.
Spread models saw less success than total score models.

Conclusions

Profit can be made using linear regression, gradient boosting regression, and neural networks.
For optimal results betting on the total score, linear regression with forward selection should be used up to a 5% threshold, and gradient boosting regression with forward selection should be used for games above the 5% threshold.
For optimal results betting on the spread, gradient boosting regression with forward selection should always be used.

Future Directions

Add data on referee tendencies and coaching ability to the training data.
Test models on unseen data (2024-2025 NFL season) for validity.
Test model effectiveness on alternate pre-game total point and spread lines.
Try similar methods on different sports to create year-long profit opportunities.

Professional Application

"This project will help me in the future as I work to build my own website where I will license my NFL sports betting predictions. Additionally, it will serve as proof of competence in the data science field when I eventually transition professions from software development to data science." - Jacob Conrad ’24

For Further Discussion

This serves as an overview of the project and does not include the complete work. To further discuss this project, please email Jacob Conrad.

Course Overview

DS 480: Data Science Capstone serves as a culminating experience for the data science major. Students work on an independent project that will allow them to integrate knowledge from their previous courses in the major and apply that knowledge to a problem in a domain of their interest.
