Authors:
- Annie Wang
- Hans Baumberger
- Mason Lee
- Sileshi Hirpa
Predict the total compensation of a prospective employee from their professional background, the company and its location, and macroeconomic factors.
Goals:
- Give prospective employees a reasonable expectation for compensation negotiation.
- Give companies a benchmark for offering competitive compensation packages in recruitment.
The technical goal of the project is to maximize R^2 and minimize the mean squared error (MSE).
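As a reminder of what these two metrics measure, here is a minimal sketch in plain Python. The function names `mse` and `r2` and the sample values are illustrative assumptions, not part of the project code. Note that R^2 compares the model against always predicting the mean, so it can be negative when a model does worse than that baseline.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical compensation values (illustrative only, not project data).
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

print("MSE:", mse(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))

# A model that is far off everywhere scores below zero.
print("R^2 of a poor model:", r2(y_true, [1000.0] * 4))
```

In practice, `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` compute the same quantities.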
Our project is based on datasets obtained from:
- Levels.fyi (a site that lets users compare career levels and compensation packages across companies), scraped with permission from the site's administrators.
- Inflation rates from rateinflation.com.
- Unemployment data from Data World.
The three datasets were merged after data cleaning and EDA.
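The merge step might look roughly like the following pandas sketch. The join keys (`year_month` for inflation, `state` plus `year_month` for employment) are assumptions inferred from the column names in the data dictionary, and the tiny frames below hold made-up values rather than project data.

```python
import pandas as pd

# Hypothetical miniature versions of the three cleaned datasets.
salaries = pd.DataFrame({
    "company": ["A", "B", "C"],
    "state": ["CA", "WA", "CA"],
    "year_month": ["2020-01", "2020-01", "2020-02"],
    "totalyearlycompensation": [200, 180, 220],
})
inflation = pd.DataFrame({
    "year_month": ["2020-01", "2020-02"],
    "inflation_rate": [1.0, 1.1],
})
employment = pd.DataFrame({
    "state": ["CA", "WA", "CA"],
    "year_month": ["2020-01", "2020-01", "2020-02"],
    "employment_rate": [90.0, 91.0, 92.0],
})

# Left joins keep every compensation record, even when a macro value is missing.
merged = (
    salaries
    .merge(inflation, on="year_month", how="left")
    .merge(employment, on=["state", "year_month"], how="left")
)
print(merged)
```

A left join is used here so that no compensation record is dropped; rows without a matching macroeconomic value would show up as `NaN` and could be handled during cleaning.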
| Term | Description |
|---|---|
| timestamp | timestamp of compensation record submission |
| company | company names |
| title | employee's job title |
| totalyearlycompensation | total compensation that an employee gets annually |
| location | cities where the companies are located |
| yearsofexperience | years of experience in a career |
| yearsatcompany | years the employee has worked at the current company |
| year | year of timestamp |
| month | month of timestamp |
| year_month | year and month of timestamp |
| inflation_rate | the percentage at which a currency is devalued during a period |
| inflation_rate_3mos | inflation rate 3 months prior to record timestamp |
| state | states in the US |
| employment_rate | the percentage of the labor force that is employed |
| employment_rate_3mos | employment rate 3 months prior to record timestamp |
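One way the `_3mos` lag columns in the table above could be derived is by shifting a monthly series by three rows. This is a sketch under the assumption that the series has exactly one row per month with no gaps; the values are made up and the frame name `monthly` is hypothetical.

```python
import pandas as pd

# Hypothetical monthly inflation series (made-up values, one row per month).
monthly = pd.DataFrame({
    "year_month": ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05", "2020-06"],
    "inflation_rate": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
})

# Sort chronologically, then shift down by 3 rows so each month
# sees the value recorded three months earlier.
monthly = monthly.sort_values("year_month").reset_index(drop=True)
monthly["inflation_rate_3mos"] = monthly["inflation_rate"].shift(3)
print(monthly)
```

The first three months have no earlier value and come out as `NaN`. The same pattern would apply to `employment_rate_3mos`, grouped by `state` if the employment data is per state.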
Some of the EDA plots we produced include:
- Top 10 total compensations by title
- Workers' location (top 10)
- Nationwide inflation rate
- Nationwide unemployment rate
Hyperparameter tuning took longer than anticipated for most of our models, so we ran one model (Random Forest Regression) on AWS. The following table summarizes the models we evaluated. The team selected the Gradient Boosting Regressor (with GridSearch) as the best model for compensation prediction.
| Model | Training Score (R^2) | Testing Score (R^2) | MSE (Train) | MSE (Test) | Comment |
|---|---|---|---|---|---|
| Linear Regression (with no penalty) | 0.5193 | -7.2931 × 10^28 | 8286.35 | 1.22 × 10^27 | |
| Lasso Regularization (CV) | 0.5182 | 0.5143 | 8305.30 | 8157.45 | |
| Ridge Regularization (CV) | 0.52 | 0.5097 | 8274.20 | 8234.28 | |
| Elastic Net Regularization (CV) | 0.4483 | 0.4499 | 9511.19 | 9238.88 | |
| Random Forest Regression (with GridSearch) | 0.466 | 0.410 | 9060 | 10319 | |
| KNN Regressor (with GridSearch) | 0.9907 | 0.4762 | 158.35 | 9172.55 | |
| Gradient Boosting Regressor (no GridSearch) | 0.5973 | 0.5318 | 6834.12 | 8198.40 | |
| Gradient Boosting Regressor (with GridSearch) | 0.7131 | 0.5477 | 4867.52 | 7919.98 | Best Model |
| Support Vector Regression (SVR) (without GridSearch) | 0.1368 | -0.1287 | | | |
| Support Vector Regression (SVR) (with GridSearch) | 0.5029 | 0.4745 | | | |
| AdaBoost (with GridSearch) | 0.1930 | 0.1276 | 13693 | 15276 | |
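For readers unfamiliar with the tuning setup, the following is a minimal sketch of a grid search over a Gradient Boosting Regressor with scikit-learn. The synthetic data, the parameter grid, and the 3-fold cross-validation are all assumptions chosen to keep the example small and fast; they are not the settings used to produce the scores in the table above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the project dataset).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hypothetical, deliberately small grid: 8 combinations in total.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X_train, y_train)

# Score the refitted best estimator on both splits to expose overfitting.
best = search.best_estimator_
train_r2 = r2_score(y_train, best.predict(X_train))
test_r2 = r2_score(y_test, best.predict(X_test))
test_mse = mean_squared_error(y_test, best.predict(X_test))

print("Best parameters:", search.best_params_)
print("Train R^2:", train_r2)
print("Test R^2:", test_r2)
print("Test MSE:", test_mse)
```

Reporting both training and testing scores, as the table does, makes overfitting visible: a large gap between the two (as in the KNN row) suggests the model memorized the training data.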
Possible future improvements:
- Incorporate more personal background features of the employee into the analysis (e.g. education).
- Incorporate more company and industry background information (e.g. stock price, company size, industry sector).
- Include more recent data (after Sep. 2020).
- Do more hyperparameter tuning (GridSearch, RandomizedSearch, BayesSearch).




