I love it when you call me Big Data.
A Data Science project analyzing the "Masterpiece Gap", the statistical difference between an average movie and the top 1% (p99) across different genres.
As software engineers, we often look at averages, but averages lie. If a genre has a polarizing audience (half 1/10, half 10/10), the average is 5/10—same as a "boring" genre.
This project uses Python, Pandas, and Bayesian Statistics to:
- Ingest & Clean raw IMDb datasets (streaming >200MB of data).
- Analyze the risk/reward profile of every major film genre.
- Recommend movies using a CLI tool with adjustable "Risk Tolerance" profiles.
Our analysis revealed distinct "personalities" for each genre:
- The Safe Bets (Documentary, Biography): These have high averages and low variance. It is statistically difficult to make a "bad" documentary.
- The High-Risk/High-Reward (Horror, Sci-Fi): These have the lowest average ratings, but their p99 (Top 1%) ceiling is as high as Drama. This huge "gap" implies execution difficulty.
- The Animation Anomaly: The only genre that consistently maintains both a high floor (average) and a high ceiling (p99).
To prevent movies with 5 votes from skewing the data, we use the official IMDb formula. This penalizes low-volume outliers and pulls them toward the global mean.
Where:
-
$R$ = Average Rating of the movie -
$v$ = Number of votes for the movie -
$m$ = Minimum votes required to be listed (Threshold: 100) -
$C$ = Mean vote across the whole dataset
We calculate the 99th percentile rating for each genre. This represents the "Masterpiece Ceiling"—the maximum potential quality a genre typically achieves.
We use uv, a blazing-fast Python package manager written in Rust, to handle dependencies and environments.
- Language: Python 3.10+
- Manager:
uv(replaces pip/conda) - Libraries: Pandas, Seaborn, Matplotlib, Scipy
You only need to install uv. It will handle Python and all libraries for you.
Mac / Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
Windows (PowerShell):
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Clone the repo and sync the dependencies. uv will create a virtual environment (.venv) automatically.
# 1. Sync dependencies
uv sync
You must run this script first.
The repository does not contain the dataset (it is ignored via .gitignore). You need to stream the data from IMDb, clean it, and build the local cache.
# Downloads title.basics and title.ratings, merges them, and filters for movies
uv run src/ingest_data.py
Output: Creates data/processed/movies_merged.csv (~50-100MB)
To see the boxplots, gap analysis charts, and statistical breakdown, open the Jupyter Notebook.
- Open VS Code.
- Open
notebooks/1_exploration.ipynb. - Select the Kernel:
Python Environments->.venv. - Run All Cells.
We built a custom CLI tool to recommend movies based on the statistical findings.
Option A: Low Risk (Crowd Pleasers) Finds movies with high Weighted Ratings. Safe bets.
uv run src/recommender.py --genre "Sci-Fi" --risk low
Option B: High Risk (Hidden Gems) Finds movies with high Raw Ratings but lower vote counts (< 5000). Good for finding cult classics.
uv run src/recommender.py --genre "Horror" --risk high
Option C: Pagination Browse through the results.
uv run src/recommender.py --genre "Comedy" --page 2 -n 10
IMDbayes/
├── data/
│ ├── raw/ # Temp folder for downloads (ignored)
│ └── processed/ # The final CSV used by the app (ignored)
├── notebooks/
│ └── 1_exploration.ipynb # The Data Science Lab (Charts & Analysis)
├── src/
│ ├── ingest_data.py # ETL Script (Download -> Clean -> Save)
│ └── recommender.py # CLI Application Logic
├── .gitignore # Ensures data is never committed
├── pyproject.toml # Dependencies managed by uv
└── README.md # This file
