Damian Costantino, Senior Systems Engineer, explains how he built an AI-enabled engine that can predict the outcomes of NFL games with remarkable statistical accuracy.
Before I joined the Information Technology field in 1997, I had always been fascinated by the mathematics of how we make decisions. I was also an athlete, having played baseball in college and been a member of the Red Sox farm system. My lifelong love of sports, combined with my interest in Artificial Intelligence (AI) and statistical analysis, got me thinking about a way to apply Machine Learning and AI to predict NFL game outcomes.
What started as a series of conversations between friends has morphed into something much larger over the years: an AI-enabled engine that can predict the outcomes of NFL games with remarkable statistical accuracy. I present Project Moosh.
Setting the Stage
In its simplest form, machine learning is about taking in data, processing it, learning to recognize patterns and figure out what’s useful, and making decisions. The engine I built consists of two different statistical models: a logistic regression model and a linear regression model.
The logistic regression model was built to determine a qualitative outcome of the game: which team would win. The linear regression model was built to determine a quantitative outcome of the game: the point spread, meaning the margin by which the favored team must win for a bet on that team to pay off in point spread betting.
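To make the difference between the two models concrete, here is a minimal sketch in plain Python. It is not the Moosh code: it uses a single hypothetical feature (rating_diff, a made-up "home team strength minus away team strength" number) and made-up game margins, where the real engine uses many variables. The logistic fit returns a win probability, and the linear fit returns a predicted margin.

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def win_probability(w, b, x):
    """Logistic function applied to the linear score w * x + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Logistic regression for one feature, fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = win_probability(w, b, x) - y  # gradient of the log loss
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

# Made-up illustrative numbers, NOT real NFL data.
rating_diff = [-6.0, -3.0, -1.0, 0.5, 2.0, 4.0, 7.0]
home_margin = [-10.0, -7.0, 3.0, -3.0, 6.0, 3.0, 14.0]
home_won = [1 if margin > 0 else 0 for margin in home_margin]

slope, intercept = fit_linear(rating_diff, home_margin)   # quantitative: margin
w, b = fit_logistic(rating_diff, home_won)                 # qualitative: winner

predicted_margin = slope * 3.0 + intercept
predicted_win_prob = win_probability(w, b, 3.0)
```

The same inputs feed both fits; only the target changes (a 0/1 win label versus a point margin), which is why the two models answer different betting questions.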
The very first step in building these models was to pick the variables that would have a significant impact on the game’s overall scoring and outcome. The choice of variables stemmed from my own football knowledge, plus a series of (often heated) conversations with my friends about which variables matter most to a game’s outcome. Eventually, we selected the following sets of variables to be fed into their respective statistical models:
Gathering the Data
The next step was to gather data for the above variables. We pulled historical data from a few different sources, but the majority of the statistics were sourced from www.pro-football-reference.com. The data covered a period of 15 years: we went as far back as 2005 and pulled all the game stats we could up to the present day. Instead of pulling these statistics manually off the sites, we built a Python scraper to go in and pull the data from the site.
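For readers curious what the parsing half of such a scraper can look like, here is a minimal sketch using only Python's standard library. It is not our actual scraper. The table layout, the column names (week, winner, loser, pts_w, pts_l) and the placeholder teams in SAMPLE_HTML are all assumptions made up for illustration, and a real page's markup will differ.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None

def parse_table(html):
    """Turn an HTML table into a list of dicts keyed by the header row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical markup with placeholder teams and scores, NOT real game data.
SAMPLE_HTML = """
<table>
  <tr><th>week</th><th>winner</th><th>loser</th><th>pts_w</th><th>pts_l</th></tr>
  <tr><td>1</td><td>Team A</td><td>Team B</td><td>24</td><td>17</td></tr>
  <tr><td>1</td><td>Team C</td><td>Team D</td><td>31</td><td>28</td></tr>
</table>
"""

games = parse_table(SAMPLE_HTML)
```

Downloading the pages is deliberately left out of the sketch. In practice that step also needs polite rate limiting and a check of the site's terms of use.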
Training and Testing the Models
To train the models, we took data from the odd years within the selected time period (2005, 2007, 2009, etc.) and used the game stats from those years as the training set. Then, we took the even years (2006, 2008, 2010, etc.) and used the game stats from those years for cross-validation, to make sure the math we were doing was correct. Finally, we took the last three full years of the data set (2017, 2018, and 2019) and used that data as a test set to check the performance of the models against the already known outcomes of those NFL games.
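The split described above can be sketched in a few lines of Python. One assumption is baked in and worth flagging: the sketch holds the three test seasons out of the training and validation sets, so 2017 and 2019 do not also count as odd training years. The game records are placeholders, and the field names season and game_id are hypothetical.

```python
TEST_SEASONS = {2017, 2018, 2019}

def split_by_season(games, test_seasons=TEST_SEASONS):
    """Split games into (train, validation, test) lists by season year.

    Assumption: test seasons are removed first, then the remaining odd
    seasons go to training and the remaining even seasons to validation.
    """
    train, validation, test = [], [], []
    for game in games:
        season = game["season"]
        if season in test_seasons:
            test.append(game)
        elif season % 2 == 1:
            train.append(game)
        else:
            validation.append(game)
    return train, validation, test

# One placeholder record per season, standing in for the real game stats.
games = [{"season": year, "game_id": f"{year}-placeholder"}
         for year in range(2005, 2020)]

train, validation, test = split_by_season(games)
train_seasons = [game["season"] for game in train]
validation_seasons = [game["season"] for game in validation]
test_seasons = [game["season"] for game in test]
```

Splitting by whole seasons rather than by random games keeps every game from a given year in the same set, so the test seasons stay untouched until the final check.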
Results and Outcomes
When we tested our models against the actual results of the games, we discovered that they correctly predicted the outright winner of the game 65% of the time and were correct 49% of the time against the spread. In terms of sports betting, the ability to bet against Vegas odds with this level of accuracy is highly improbable. Here’s what the final output from the models looks like.
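Before getting to that output, a quick aside on how two hit rates like these can be computed. This is a sketch under stated assumptions, not our evaluation code. Margins and lines are expressed as home points minus away points, a pick "against the spread" is correct when the model and the actual result land on the same side of the betting line, and pushes (results that land exactly on the line) are left out. All numbers and team names below are made up.

```python
def outright_accuracy(predicted_winners, actual_winners):
    """Fraction of games where the predicted winner actually won."""
    hits = sum(p == a for p, a in zip(predicted_winners, actual_winners))
    return hits / len(actual_winners)

def ats_accuracy(predicted_margins, actual_margins, lines):
    """Fraction of graded games where the model was on the right side of the line.

    All three inputs use the same convention: home points minus away points.
    Games with no model pick (prediction equals the line) or a push
    (actual margin equals the line) are not graded.
    """
    hits = graded = 0
    for predicted, actual, line in zip(predicted_margins, actual_margins, lines):
        if predicted == line or actual == line:
            continue
        graded += 1
        if (predicted > line) == (actual > line):
            hits += 1
    return hits / graded if graded else float("nan")

# Made-up illustrative numbers, NOT real games or real Vegas lines.
predicted_winners = ["Team A", "Team C", "Team E"]
actual_winners = ["Team A", "Team D", "Team E"]

predicted_margins = [3.5, -2.0, 7.0, 1.0]
actual_margins = [7.0, 3.0, 3.0, 6.0]
lines = [3.0, -3.0, 3.0, 2.5]

winner_rate = outright_accuracy(predicted_winners, actual_winners)
spread_rate = ats_accuracy(predicted_margins, actual_margins, lines)
```

In this toy data the third game is a push and is skipped, so the spread rate is computed over three graded games rather than four.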
The logistic regression model predicts the top 5 bets for winners of each week of the NFL season. Here is the logistic model’s prediction for last week’s (Week 8, 10/29 – 11/2) series of games:
The linear regression model predicts the favored team, the underdog, and the point spread for each of the games during each week of the NFL season.
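Both kinds of output can be derived from raw model predictions with very little code. The sketch below assumes that the "top 5" are simply the five games where the logistic model is most confident, which is one plausible ranking rule rather than a description of Moosh's exact logic. The function names, team names and probabilities are hypothetical.

```python
def top_picks(win_probs, k=5):
    """Return the k most confident winner picks.

    win_probs maps (home, away) to the predicted probability that the
    home team wins. Confidence is that probability for a home pick, or
    one minus it for an away pick.
    """
    picks = []
    for (home, away), p_home in win_probs.items():
        if p_home >= 0.5:
            picks.append((p_home, home))
        else:
            picks.append((1.0 - p_home, away))
    picks.sort(reverse=True)
    return [(winner, round(confidence, 3)) for confidence, winner in picks[:k]]

def describe_spread(home, away, predicted_home_margin):
    """Turn a predicted home-minus-away margin into favorite, underdog, spread."""
    if predicted_home_margin >= 0:
        return {"favorite": home, "underdog": away, "spread": predicted_home_margin}
    return {"favorite": away, "underdog": home, "spread": -predicted_home_margin}

# Made-up probabilities for six placeholder games, NOT real predictions.
week_probs = {
    ("Team A", "Team B"): 0.81,
    ("Team C", "Team D"): 0.35,
    ("Team E", "Team F"): 0.55,
    ("Team G", "Team H"): 0.92,
    ("Team I", "Team J"): 0.48,
    ("Team K", "Team L"): 0.70,
}

picks = top_picks(week_probs)
spread_line = describe_spread("Team A", "Team B", -3.5)
```

A negative predicted home margin flips the roles, so in the last line the away side, Team B, comes out as the favorite by 3.5 points.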
We named the model Moosh after the famous gambler character Eddie Mush from the movie A Bronx Tale:
“Eddie Mush was a degenerate gambler. He was also the biggest loser in the whole world. They called him mush because everything he touched turned to mush.”
- Robert De Niro, A Bronx Tale
Here is the linear model’s prediction for last week’s (Week 8, 10/29 – 11/2) series of games, as compared to Vegas odds:
Overall, building this predictive engine was a great experience for me and my friends. By combining our love of sports with our passion for statistics and AI, we built a model that can successfully predict game outcomes with an impressive level of accuracy. We’re looking forward to watching the model continue to make predictions throughout the rest of the 2020 season (Go Pats!).