Building a relational database on simulated data.
A Udacity Data Engineer Nanodegree project
In this project I used Python to build an ETL pipeline and PostgreSQL for data modeling. The subject of the project is the fictional music streaming provider "Sparkify", which needs a database to analyze the data generated by its app.
The final database 'sparkifydb' is built as a star schema with fact and dimension tables. The main goal is to create a database with tables designed to optimize queries for songplay analysis.
The data used for this project consists of two directories. log_data contains .json files with data on user activity in the app, and song_data contains .json files with metadata on the songs in the app. The song data files are only a small subset of the Million Song Dataset. The log data was generated with an event simulator based on the song data. All data was provided by Udacity as part of the Data Engineer Nanodegree.
The database is named "sparkifydb" and consists of the following tables:
- fact table: songplays (records in the log data associated with song plays)
  columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- dimension table: users (data on app users)
  columns: user_id, first_name, last_name, gender, level
- dimension table: songs (data on all songs in the library/app)
  columns: song_id, title, artist_id, year, duration
- dimension table: artists (data on all artists in the library/app)
  columns: artist_id, name, location, latitude, longitude
- dimension table: time (timestamps of songplays in the log data, broken down into different units)
  columns: start_time, hour, day, week, month, year, weekday
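To give an idea of the kind of songplay analysis this schema is meant to speed up, a query like the one sketched below counts plays per subscription level by joining the songplays fact table with the users dimension. This is only an illustration and not part of the project files; the connection parameters are assumptions for a local setup and may differ from yours.

    # Sketch of an analytical query against the star schema (not part of the
    # project files). Connection parameters are assumptions for a local setup.
    import psycopg2

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # How many songplays come from free vs. paid users?
    cur.execute("""
        SELECT u.level, COUNT(*) AS plays
        FROM songplays s
        JOIN users u ON s.user_id = u.user_id
        GROUP BY u.level
        ORDER BY plays DESC;
    """)
    for level, plays in cur.fetchall():
        print(level, plays)

    conn.close()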
I started by writing the SQL queries that create the tables in sql_queries.py, for example:

    songplay_table_create = ("""
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id SERIAL PRIMARY KEY,
            start_time bigint NOT NULL,
            user_id int NOT NULL,
            level varchar,
            song_id varchar,
            artist_id varchar,
            session_id int,
            location varchar,
            user_agent varchar
        )
    """)
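The other tables follow the same pattern. As a sketch (the exact column types live in sql_queries.py and may differ from what I assume here), the users dimension table could be defined like this:

    # Sketch of a dimension table definition following the same pattern;
    # the exact types used in sql_queries.py may differ.
    user_table_create = ("""
        CREATE TABLE IF NOT EXISTS users (
            user_id int PRIMARY KEY,
            first_name varchar,
            last_name varchar,
            gender varchar,
            level varchar
        )
    """)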
Afterwards I added the drop table queries. They are needed in create_tables.py to drop any existing tables before they are created again, for example:

    songplay_table_drop = "DROP TABLE IF EXISTS songplays"
In etl.ipynb I developed the ETL process for the individual tables and then used that code to finalize the pipeline in etl.py. In between I used test.ipynb to verify the individual steps and check that the code worked as expected.
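As a simplified sketch of one such step, inserting a single song record could look like the snippet below. The file path and connection parameters are placeholders, conflict handling is omitted, and the real etl.py walks every .json file and uses the insert queries defined in sql_queries.py.

    # Simplified sketch of processing one song file; the real etl.py iterates
    # over all .json files in data/song_data and uses queries from sql_queries.py.
    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Each song file contains a single JSON record with song and artist metadata.
    df = pd.read_json("data/song_data/A/A/A/some_song_file.json", lines=True)  # example path

    # Pick the columns for the songs table and insert the row.
    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s)",
        song_data,
    )

    conn.commit()
    conn.close()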
In the end I confirmed that the process ran successfully by executing the files in the following order (a minimal verification query is sketched after the list):
- create_tables.py
- etl.py
- test.ipynb (for confirmation)
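A minimal check in the spirit of test.ipynb is simply to count the rows in every table. The following is only a sketch; the connection parameters are assumptions for a local setup.

    # Minimal sanity check after running create_tables.py and etl.py;
    # connection parameters are assumptions for a local setup.
    import psycopg2

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    for table in ["songplays", "users", "songs", "artists", "time"]:
        cur.execute(f"SELECT COUNT(*) FROM {table};")
        print(table, cur.fetchone()[0])

    conn.close()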
- The data directory contains separate subdirectories for the song data and the log data.
- The notebooks directory contains two Jupyter notebooks: etl.ipynb was used to build and test the ETL code step by step, which was then used in etl.py to build the pipeline, and test.ipynb was used to test the code from etl.ipynb, sql_queries.py and etl.py.
- The Python files are create_tables.py, which creates the sparkify database and the necessary tables, etl.py, which extracts the data from the data files and inserts it into the database, and sql_queries.py, which holds the PostgreSQL queries used by create_tables.py and etl.py.
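To sketch how these files fit together: create_tables.py essentially drops and recreates every table by looping over the statements collected in sql_queries.py. The list names create_table_queries and drop_table_queries and the connection parameters below are assumptions; the actual script also drops and recreates the sparkifydb database itself before connecting to it.

    # Simplified sketch of the table setup in create_tables.py; list names and
    # connection parameters are assumptions.
    import psycopg2
    from sql_queries import create_table_queries, drop_table_queries

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Drop existing tables first so the schema can be rebuilt from scratch.
    for query in drop_table_queries:
        cur.execute(query)

    # Then create the fact and dimension tables.
    for query in create_table_queries:
        cur.execute(query)

    conn.commit()
    conn.close()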