Python project demonstrating the complete data analysis lifecycle: collection, cleaning, EDA, ML, and reporting.

EngMoheb/Python-Full-Data-Analysis

Dataset Story: U.S. Baby Names (1980s–2010s)

Why We Chose This Dataset

Names are more than labels: they reflect culture, identity, and social change. This dataset contains 2.2 million records from U.S. Social Security card applications from the 1980s to the 2010s, broken down by state, gender, year, and name. Its cultural relevance and approachability make it well suited to Python-based exploratory analysis and storytelling.

What Makes It Special

  • Cultural resonance: Everyone connects to names, making insights relatable.
  • Scale: Millions of records across decades and regions.
  • Diversity: Gender, geography, and time dimensions allow for rich comparisons.
  • Trend potential: Names rise and fall with cultural events, celebrities, and societal shifts.

What We’ll Learn

  • The most popular names of each decade and how they change over time.
  • Names with the biggest jumps and drops in popularity.
  • Regional differences in naming across U.S. states.
  • The rise of gender‑neutral names and evolving cultural preferences.
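The first and last questions above can be phrased as pandas group-by operations. The following is a minimal sketch, not the project's actual notebook code: the column names (`Name`, `Year`, `Gender`, `State`, `Count`) are an assumption about the CSV layout, the rows are made up for illustration, and the "minority share" score is just one possible way to quantify how gender-neutral a name is.

```python
import pandas as pd

# Tiny made-up sample. The column names are an assumption about the
# dataset layout and are not confirmed by this README.
records = pd.DataFrame(
    {
        "Name":   ["Jessica", "Jessica", "Michael", "Michael",
                   "Jordan", "Jordan", "Emma", "Jordan"],
        "Year":   [1985, 1995, 1985, 1995, 1995, 1995, 2005, 2005],
        "Gender": ["F", "F", "M", "M", "M", "F", "F", "M"],
        "State":  ["CA", "CA", "TX", "TX", "NY", "NY", "CA", "NY"],
        "Count":  [500, 300, 600, 450, 200, 120, 400, 150],
    }
)

# Bucket each year into its decade, e.g. 1985 -> 1980.
records["Decade"] = (records["Year"] // 10) * 10

# Total count per (decade, gender, name), then keep the largest per group.
totals = records.groupby(["Decade", "Gender", "Name"], as_index=False)["Count"].sum()
top_names = (
    totals.sort_values("Count", ascending=False)
    .groupby(["Decade", "Gender"], as_index=False)
    .first()
)

# Hypothetical gender-balance score: the share of a name's total count that
# belongs to its less common gender (0 = one gender only, 0.5 = even split).
by_gender = records.pivot_table(
    index="Name", columns="Gender", values="Count", aggfunc="sum", fill_value=0
)
minority_share = by_gender.min(axis=1) / by_gender.sum(axis=1)

print(top_names)
print(minority_share.sort_values(ascending=False))
```

On real data, a threshold on `minority_share` (and a minimum total count, to filter out rare names) would be a judgment call rather than something the dataset dictates.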

Planned Actions

  1. Data Cleaning
    • Normalize state codes and gender labels, and handle missing values.
    • Aggregate counts by decade, gender, and state.
  2. Exploratory Analysis
    • Identify top names by decade, gender, and region.
    • Detect names with sharp rises or declines in popularity.
  3. Visualization
    • Line charts for name popularity over time.
    • Heatmaps for state‑wise trends.
    • Word clouds for most popular names per decade.
  4. Advanced Analysis
    • Forecast future name popularity using time series models.
    • Detect cultural spikes linked to events or celebrities.
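The cleaning, aggregation, and rise/decline steps above could look roughly like the sketch below. It is illustrative only: the column names, the `GENDER_MAP` lookup, the `clean` helper, and the 50% change threshold are all assumptions introduced here, and the rows are invented to show the kinds of defects the cleaning step is meant to handle. Forecasting is left out; a proper time series model would build on the yearly series computed at the end.

```python
import pandas as pd

# Made-up rows with inconsistent casing, stray whitespace, and missing values.
# Column names are assumed, not taken from the actual dataset.
raw = pd.DataFrame(
    {
        "Name":   ["Ashley", "ashley ", "Ashley", "Ashley", None, "Ashley"],
        "Year":   [1988, 1989, 1990, 1991, 1991, 1992],
        "Gender": ["F", "female", "f", "F", "M", "F"],
        "State":  [" ca", "CA", "Ca", "CA", "TX", "CA"],
        "Count":  [100, 110, 120, 300, 50, None],
    }
)

# Hypothetical mapping from observed gender spellings to canonical labels.
GENDER_MAP = {"f": "F", "female": "F", "m": "M", "male": "M"}


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize labels, drop incomplete rows, and add a Decade column."""
    out = df.copy()
    out["State"] = out["State"].str.strip().str.upper()
    out["Gender"] = out["Gender"].str.strip().str.lower().map(GENDER_MAP)
    out["Name"] = out["Name"].str.strip().str.title()
    # Simplest missing-value policy: drop rows lacking any key field.
    out = out.dropna(subset=["Name", "Gender", "State", "Year", "Count"]).copy()
    out["Count"] = out["Count"].astype(int)
    out["Decade"] = (out["Year"] // 10) * 10
    return out.reset_index(drop=True)


cleaned = clean(raw)

# Aggregate counts by decade, gender, and state.
decade_totals = cleaned.groupby(["Decade", "Gender", "State"], as_index=False)["Count"].sum()

# Year-over-year relative change per name (assumes consecutive years are present).
yearly = cleaned.groupby(["Name", "Year"])["Count"].sum().sort_index()
previous = yearly.groupby(level="Name").shift(1)
relative_change = (yearly - previous) / previous

# Flag sharp rises or declines; the 50% cutoff is an arbitrary example value.
spikes = relative_change[relative_change.abs() > 0.5]

print(decade_totals)
print(spikes)
```

Dropping incomplete rows is the bluntest option; depending on how many values are missing in the real file, imputing or flagging them may be preferable.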

Expected Results

  • A clear picture of naming trends across decades.
  • Regional storytelling that highlights cultural diversity.
  • Insights into societal shifts (e.g., gender‑neutral naming).
  • Engaging visualizations that make the analysis accessible to all audiences.

Repository Structure

📂 us-baby-names-analysis
│
├── 📁 data
│   ├── raw/ → Original dataset (CSV)
│   └── processed/ → Cleaned and aggregated data (decade, gender, state)
│
├── 📁 notebooks
│   ├── eda.ipynb → Exploratory Data Analysis (popular names, jumps/drops, gender differences)
│   ├── visualization.ipynb → Trend charts, heatmaps, word clouds
│   └── forecasting.ipynb → Predictive modeling for future name popularity
│
├── 📁 visuals
│   ├── charts/ → Line charts, heatmaps, bar plots
│   └── wordclouds/ → Word clouds of popular names per decade
│
├── 📁 docs
│   ├── dataset_story.md → Narrative introduction (Dataset Story section)
│   └── analysis_report.md → Final written report with insights and impact
│
└── README.md → Project overview, Dataset Story, workflow, and results

Tools We’ll Use

  • Python (pandas, NumPy) for data cleaning and analysis.
  • Matplotlib/Seaborn/Plotly for visualizations.
  • Jupyter Notebooks for interactive exploration.
  • WordCloud/NLP libraries for text‑based insights.
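As a rough illustration of how these tools fit together, pandas can shape the data into the grid that a plotting library then draws. The sketch below prepares a state-by-decade table of the kind a heatmap consumes; the column names and numbers are invented for illustration, and normalizing each row to shares is one possible design choice, not necessarily the one used in the notebooks.

```python
import pandas as pd

# Made-up counts for a single name; column names are assumed for illustration.
counts = pd.DataFrame(
    {
        "State":  ["CA", "CA", "CA", "TX", "TX", "TX"],
        "Decade": [1980, 1990, 2000, 1980, 1990, 2000],
        "Count":  [900, 600, 300, 400, 400, 200],
    }
)

# Rows = states, columns = decades: the 2-D layout a heatmap expects.
grid = counts.pivot_table(
    index="State", columns="Decade", values="Count", aggfunc="sum", fill_value=0
)

# Convert each row to shares of that state's total, so populous states
# do not dominate the color scale.
share = grid.div(grid.sum(axis=1), axis=0)

print(share.round(3))
```

A table like `share` can be passed directly to `seaborn.heatmap`, or its transpose can be drawn as one line per state with `DataFrame.plot`.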
