Data
Sample datasets
- CDC’s BRFSS data
- Description: A sample of 20,000 observations with nine variables, taken from OpenIntro.
- CSV
- College Football
- Games by Conference and Season
- Description: Game results with team names, scores, conference, and year going back to 2017. Obtained from Massey Ratings.
- CSV
- CFB Stats
- Games by Conference and Season
- Colors
- Description: This is a small, manufactured data set of RGB values that are close to primary colors; there are also a few missing values. The intention is to illustrate imputation with, for example, a KNN imputer.
- CSV
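As a rough sketch of the kind of imputation the Colors data is meant to illustrate, the snippet below runs scikit-learn's `KNNImputer` on a tiny made-up array. The RGB values here are hypothetical stand-ins, not the actual file contents, and `n_neighbors=1` is chosen only because the sample is so small.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical stand-in for the Colors data: RGB rows close to the
# primary colors, with np.nan marking the missing entries.
colors = np.array([
    [250.0, 10.0, 5.0],     # near red, complete
    [245.0, np.nan, 12.0],  # near red, green channel missing
    [10.0, 245.0, 15.0],    # near green, complete
    [8.0, 240.0, np.nan],   # near green, blue channel missing
    [12.0, 6.0, 250.0],     # near blue, complete
    [np.nan, 14.0, 248.0],  # near blue, red channel missing
])

# Each missing entry is filled from the nearest row that has that
# channel, with distance measured on the channels both rows share.
imputer = KNNImputer(n_neighbors=1)
filled = imputer.fit_transform(colors)

print(filled)
```

With real data, a larger `n_neighbors` would average over several similar colors rather than copying a single neighbor.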
- Galileo’s Ramp
- Description: Supposedly, some of Galileo’s actual inclined ramp data taken from Teaching Statistics with Data of Historic Significance: Galileo’s Gravity and Motion Experiments.
- CSV
- Housing prices
- Description: Data from Kaggle’s House Prices competition. There’s also training data for my own variation on the contest.
- Training CSV
- Testing CSV
- Training variation CSV
- Testing variation CSV
- NCAA Basketball data (multiple files for sports analytics)
- The Big South Regular Season with Massey ratings 2026
- Description: Game-by-game results of the Men’s 2026 Big South regular season up to but not including the Big South Tournament. Includes Massey Ratings.
- CSV
- The Big South 2026 Partial Season
- Description: Game-by-game results of the Men’s 2026 Big South regular season (through Feb 12).
- CSV
- The Big South 2025 Regular Season
- Description: Game-by-game results of the Men’s 2025 Big South regular season.
- CSV
- Big South Games with Eigen ratings
- Big South games with score differences and Eigen ratings for the 2023/24 Season
- CSV
- Big South Games with Massey ratings
- Big South games with score differences and Massey ratings for the 2023/24 Season
- CSV
- Paired Tournament Games
- Lists all Men’s NCAA Tournament games from 2010 through 2023. Each game lists the difference of the Massey ratings between Team 1 and Team 2, as well as the seed difference between the teams. There’s also a Boolean label indicating whether Team 1 defeated Team 2. The intention is to illustrate a machine learning approach to sports prediction; each game appears twice to maximize the data set.
- CSV
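One way to picture how a game can appear twice in the Paired Tournament Games data is to mirror each row by swapping the roles of Team 1 and Team 2. The sketch below assumes that is what the pairing means; the column names (`rating_diff`, `seed_diff`, `team1_won`) and the values are hypothetical, not taken from the actual file.

```python
import pandas as pd

# Hypothetical games, one row per game from Team 1's point of view.
games = pd.DataFrame({
    "rating_diff": [4.2, -1.5, 7.8],    # Team 1 rating minus Team 2 rating
    "seed_diff": [-3, 2, -8],           # Team 1 seed minus Team 2 seed
    "team1_won": [True, False, True],
})

# Swapping Team 1 and Team 2 negates both differences and flips the label.
mirrored = games.assign(
    rating_diff=-games["rating_diff"],
    seed_diff=-games["seed_diff"],
    team1_won=~games["team1_won"],
)

# Stack the original and mirrored rows so every game appears twice.
paired = pd.concat([games, mirrored], ignore_index=True)

print(paired)
```

A side effect of this construction is that the two classes are perfectly balanced, since every win for Team 1 is matched by a loss in the mirrored row.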
- Processed Kaggle data
- There are four files here: two for the men’s tournament and two for the women’s. The prior tournament games files record actual tournament games from 2010 through the 2024 tournament. Each game has a slew of team stats and a label indicating who won the game. The potential tournament games files list all possible pairings for the 2025 tournament with the data but no labels. The idea was to generate a Kaggle submission file for the 2025 competition.
- Men’s priors
- Men’s potentials
- Women’s priors
- Women’s potentials
- Palmer Penguins
- Wine
- Description: This is a modified version of the data obtained from SciKit Learn’s sklearn.datasets.load_wine function. I simply renamed the “target” variable to “variety” and placed it all in one CSV file to obtain a nice classification example.
- CSV
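For reference, a file like the Wine CSV can be rebuilt along these lines. This is a minimal sketch of the renaming step described above, not necessarily the exact script used; the output file name `wine.csv` is an assumption.

```python
from sklearn.datasets import load_wine

# as_frame=True returns the features and the "target" column together
# in a single pandas DataFrame under the .frame attribute.
wine = load_wine(as_frame=True).frame

# Rename the class label column from "target" to "variety".
wine = wine.rename(columns={"target": "variety"})

# Write everything to one CSV file (hypothetical file name).
wine.to_csv("wine.csv", index=False)
```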