The goal of this project is to develop a foundational understanding of the insurance dataset, assess its quality, and uncover patterns in risk and profitability.
Insurance_Risk-Analysis_Predictive_Modelling/
- `.github/workflows/`
  - `main.yml`: GitHub Actions CI/CD
- `.vscode/`
  - `settings.json`: IDE configuration
- `.dvc/`: Data version control
- `.venv/`: Virtual environment
- `data/`
  - `outputs.csv`: Processed data
  - `raw.txt`: Raw datasets
- `notebooks/`
  - `insurance_analysis_eda.ipynb`: Exploratory analysis
- `src/`: Python modules
  - `__init__.py`
  - `data_loader.py`: Data ingestion
  - `data_stats.py`: Statistical analysis
  - `data_visualization.py`: Plotting utilities
- `tests/`
  - `__init__.py`
  - `test_data_stats.py`: Unit tests
- `README.md`: This file
- `requirements.txt`: Dependencies
- `.gitignore`: Version control exclusions
Clone the repository, then install the required Python packages from `requirements.txt` to set up the notebook for data profiling and EDA:
- `git clone https://github.com/nanecha/Insurance_Risk-Analysis_Predictive_Modelling.git`
- `pip install -r requirements.txt`
The dataset is provided in TXT format, converted to CSV, and stored in the data folder. It is loaded from output2.csv and has 1,000,098 rows and 52 columns, including `UnderwrittenCoverID`, `PolicyID`, `TotalPremium`, and `TotalClaims`. Memory usage is approximately 390.1 MB.
- Data Types: 1 boolean, 11 floats, 4 integers, 36 objects.
- Descriptive Statistics: Key numerical columns (`TotalPremium`, `TotalClaims`, `SumInsured`, etc.) show:
  - `TotalPremium`: Mean ~61.91, Max ~65,282.60
  - `TotalClaims`: Mean ~64.86, Max ~393,092.10
  - `SumInsured`: Mean ~604,172.70, Max ~12,636,200
- Missing Values: Significant in `NumberOfVehiclesInFleet` (100%), `CrossBorder` (~99.9%), `CustomValueEstimate` (~78%).
- Duplicates: None found.
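
The quality checks listed above (data types, missing values, duplicates) can be reproduced with a few pandas calls. The following is a minimal sketch, not the project's actual code: the `profile` helper and the four-row sample frame are hypothetical stand-ins for the real dataset.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Summarise shape, dtypes, missing values, and duplicate rows."""
    # Share of missing values per column, as a percentage
    missing_pct = df.isna().mean().mul(100).round(2)
    return {
        "shape": df.shape,
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        # Keep only columns that actually have gaps, worst first
        "missing_pct": missing_pct[missing_pct > 0]
        .sort_values(ascending=False)
        .to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Tiny made-up frame standing in for the real data (assumption)
sample = pd.DataFrame(
    {
        "PolicyID": [1, 2, 3, 4],
        "TotalPremium": [21.9, 0.0, 512.8, 25.0],
        "CrossBorder": [None, None, None, "No"],
        "NumberOfVehiclesInFleet": [float("nan")] * 4,
    }
)

summary = profile(sample)
print(summary)
```

Columns that are entirely empty, such as `NumberOfVehiclesInFleet` in the real data, show up at 100% and are natural candidates for dropping before modelling.
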
- Total Claims: ~64.87M
- Total Premium: ~61.91M
- Overall Loss Ratio: 104.77% (TotalClaims / TotalPremium)
- By Province: Gauteng (1.22), KwaZulu-Natal (1.08), Western Cape (1.06).
- By Vehicle Type: Heavy Commercial (1.63), Passenger Vehicle (1.05), Bus (0.14).
- By Gender: Not specified (1.06), Male (0.88), Female (0.82).
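
The loss ratios above follow from dividing summed claims by summed premiums, overall or within each group. A minimal sketch under that reading is shown here; the `loss_ratio_by` helper and the sample values are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd


def loss_ratio_by(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Loss ratio per group: sum of claims divided by sum of premiums."""
    totals = df.groupby(group_col)[["TotalClaims", "TotalPremium"]].sum()
    ratio = totals["TotalClaims"] / totals["TotalPremium"]
    return ratio.sort_values(ascending=False)


# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "Province": ["Gauteng", "Gauteng", "Western Cape", "Western Cape"],
        "TotalPremium": [100.0, 100.0, 50.0, 150.0],
        "TotalClaims": [0.0, 250.0, 0.0, 180.0],
    }
)

overall = sample["TotalClaims"].sum() / sample["TotalPremium"].sum()
by_province = loss_ratio_by(sample, "Province")
print(f"Overall loss ratio: {overall:.4f}")
print(by_province)
```

Using the ratio of sums rather than the mean of per-row ratios avoids division by zero on rows with no premium and weights each group by its actual premium volume.
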
- Numerical Distributions: Visualized for `TotalPremium`, `TotalClaims`, `CustomValueEstimate`.
- Categorical Distributions: Analyzed for `Province` and `VehicleType`.
- Outlier Detection: Box plots for `TotalPremium`, `TotalClaims`, `CustomValueEstimate` with strict outlier detection (whis=2.0).
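
To make the `whis=2.0` setting concrete, the sketch below computes the same kind of fences a box plot uses: values further than `whis` times the interquartile range from the quartiles are flagged. The `iqr_outliers` helper and the sample values are hypothetical and only illustrate the rule.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 2.0) -> pd.Series:
    """Return the values lying outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Made-up premiums with one obvious extreme value
premiums = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200], name="TotalPremium")

flagged = iqr_outliers(premiums)
print(flagged.tolist())
```

Matplotlib's default is `whis=1.5`, so a value of 2.0 widens the fences and flags only the more extreme points.
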
- Correlation Matrix: Heatmap of financial variables (`TotalPremium`, `TotalClaims`, `SumInsured`, `CustomValueEstimate`).
- Scatterplot: `TotalPremium` vs. `TotalClaims` colored by `PostalCode`.
- Claims and Premiums Over Time: Monthly trends show increasing averages from 2013-10 to 2015-08.
- Claim Frequency and Severity: Calculated using `RegistrationYear` (known issue: the `'M'` frequency alias used in `resample` is deprecated in recent pandas versions in favor of `'ME'`).
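
The source does not spell out how frequency and severity are defined. A common convention, assumed here, is that frequency is the share of policies with a positive claim and severity is the average claim among those policies. The `frequency_severity` helper and the sample rows below are hypothetical.

```python
import pandas as pd


def frequency_severity(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Claim frequency and severity per group (assumed definitions)."""
    has_claim = df["TotalClaims"] > 0
    # Frequency: fraction of rows in each group with a positive claim
    frequency = has_claim.groupby(df[group_col]).mean()
    # Severity: mean claim amount among rows that have a claim
    severity = df.loc[has_claim].groupby(group_col)["TotalClaims"].mean()
    return pd.DataFrame({"ClaimFrequency": frequency, "ClaimSeverity": severity})


# Made-up rows for illustration only
sample = pd.DataFrame(
    {
        "RegistrationYear": [2010, 2010, 2010, 2010, 2014, 2014],
        "TotalClaims": [0.0, 0.0, 3000.0, 5000.0, 0.0, 0.0],
    }
)

result = frequency_severity(sample, "RegistrationYear")
print(result)
```

Groups without any claims get a frequency of 0 and a missing severity, which is usually preferable to reporting a severity of 0.
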
- High Claim Vehicles: Toyota Quantum models dominate (e.g., 2.7 SESFIKILE 16s: ~12.04M claims).
- Low Claim Vehicles: Models like Chevrolet Optra 1.6 L and Mercedes-Benz C200K CLASSIC A/T have minimal or negative claims.
- Loss Ratio by Province: Bar plot showing highest ratios in Gauteng.
- Loss Ratio by Province and Vehicle Type: Heatmap with annotations.
- Temporal Trends: Dual-axis plot of average claims (red) and premiums (blue).
- Vehicle Make/Model Performance: Interactive bubble chart with `AvgClaim`, `LossRatio`, and `PolicyID`.
- Top Vehicle Makes by Claims: Bar plot of top 10 makes by total claim amounts.
Results saved to F:/Insurance_Risk-Analysis_Predictive_Modelling/data/eda_summary.txt, including overall loss ratio and breakdowns by province, vehicle type, and gender.
- Significant missing data in `CustomValueEstimate`, `NumberOfVehiclesInFleet`, and `CrossBorder`.
- Visualizations saved in `F:/Insurance_Risk-Analysis_Predictive_Modelling/data/outputs/`.
This test examines whether there are significant differences in risk levels (measured by Total Claims) across provinces. The goal is to understand how risk varies regionally, which can inform province-specific policies or risk management strategies.
- Null Hypothesis (H₀): No risk differences across provinces.
- Alternative Hypothesis (H₁): Risk differences exist across provinces.
- Test Type: ANOVA (Analysis of Variance)
- Compares the mean `Total Claims` across multiple provinces.
- F-Statistic: 8.626
- Indicates that the variance in `Total Claims` between provinces is significantly greater than the variance within provinces.
- p-Value: 0.000193
- This value is much smaller than the common significance level of 0.05, suggesting the observed differences are highly unlikely to be due to random chance.
- Decision: Reject the null hypothesis (H₀).
There is strong evidence to conclude that significant risk differences exist across provinces.
- Risk Management: Provinces with higher average claims may require stricter risk mitigation measures, while lower-risk provinces could benefit from premium reductions.
- Pricing Strategy: Develop province-specific premium structures to reflect the risk profile of each province.
- Further Analysis: Investigate the factors contributing to risk differences, such as demographic, geographic, or economic factors.
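
To show what the F-statistic measures, the sketch below computes it by hand as the ratio of between-group to within-group mean squares. This is an illustration with made-up numbers, not the code behind the reported result; the `one_way_anova_f` helper is hypothetical.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """One-way ANOVA F-statistic for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up claim amounts for three provinces
claims_by_province = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_stat = one_way_anova_f(claims_by_province)
print(f"F = {f_stat:.3f}")
```

In practice `scipy.stats.f_oneway(*groups)` returns both this statistic and the corresponding p-value.
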
This test examines whether there are significant differences in risk levels (measured by Total Claims) between zip codes. The goal is to evaluate how risk varies geographically at a finer level, providing insights for localized strategies.
- Null Hypothesis (H₀): No risk differences between zip codes.
- Alternative Hypothesis (H₁): Risk differences exist between zip codes.
This test examines whether there are significant differences in risk levels (measured by Total Claims) between genders. Understanding risk differences by gender can help insurers design gender-specific policies or adjust premiums based on claims data.
- Null Hypothesis (H₀): No significant risk differences between women and men.
- Alternative Hypothesis (H₁): Significant risk differences exist between women and men.
- Test Type: T-Test (Independent Samples)
- Compares the mean `Total Claims` between two independent groups: women and men.
- T-Statistic: 3.569
- Indicates the magnitude of the difference between the means relative to the variability within groups.
- p-Value: 0.000375
- This value is much smaller than the common significance level of 0.05, suggesting that the observed differences are highly unlikely to be due to random chance.
- Decision: Reject the null hypothesis (H₀).
There is strong evidence to conclude that significant risk differences exist between women and men.
- Gender-Specific Strategies: If men or women exhibit consistently higher claims, tailor policies, premiums, or risk mitigation strategies accordingly.
- Premium Adjustments: For the gender with lower average claims, consider offering reduced premiums to attract more clients.
- Further Analysis: Investigate underlying factors contributing to the differences, such as claim frequency, type of coverage, or demographic influences.
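
The t-statistic can be illustrated with a hand computation. The source does not say whether equal variances were assumed, so the sketch below uses the pooled-variance (Student's) form as an assumption; the `pooled_t_statistic` helper and the sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def pooled_t_statistic(a, b):
    """Independent two-sample t-statistic with pooled variance."""
    n_a, n_b = len(a), len(b)
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (
        n_a + n_b - 2
    )
    # Difference in means relative to its standard error
    return (fmean(a) - fmean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Made-up claim amounts for two groups
group_a = [2, 4, 6, 8]
group_b = [1, 2, 3, 4]

t_stat = pooled_t_statistic(group_a, group_b)
print(f"t = {t_stat:.3f}")
```

`scipy.stats.ttest_ind(a, b)` computes the same statistic by default and also returns the p-value; passing `equal_var=False` switches to Welch's test, which is often safer when group sizes and variances differ.
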
| Model | RMSE | R² |
|---|---|---|
| Linear Regression | 33733.4 | 0.292435 |
| Random Forest | 34910.2 | 0.242204 |
| XGBoost | 37746.6 | 0.114064 |
| Model | RMSE | R² |
|---|---|---|
| Random Forest | 41.7195 | 0.978197 |
| XGBoost | 37.3255 | 0.982548 |
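
For readers unfamiliar with the two regression metrics in the tables above, the following sketch computes them from first principles. The `rmse` and `r_squared` helpers and the sample values are illustrative only and are not taken from the project.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up actual and predicted values
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

error = rmse(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"RMSE = {error:.2f}, R² = {r2:.3f}")
```

Because RMSE carries the units of the target, values from tables with different targets cannot be compared directly, whereas R² is unitless.
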
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Random Forest | 0.997215 | 1 | 0.00179211 | 0.00357782 |
| XGBoost | 0.99721 | 0 | 0 | 0 |
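
The table above pairs very high accuracy with near-zero recall, which is typical when positives are rare. The sketch below shows how the four metrics follow from confusion-matrix counts. The counts are hypothetical, chosen only so the output resembles the Random Forest row; they are not taken from the source.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Report 0.0 when a ratio is undefined (no predicted or actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 558 actual positives, only one of them detected
metrics = classification_metrics(tp=1, fp=0, fn=557, tn=199_462)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With counts like these, a model that almost never predicts the positive class still scores above 99.7% accuracy, so recall and F1 are the more informative columns.
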
- VehicleAge: Increases claim by ~2000 Rand per year older
- SumInsured: Increases claim by ~1.5 Rand per 1000 Rand insured
- IsHighRiskProvince: Increases claim by ~5000 Rand in high-risk provinces
- VehicleType_Heavy Commercial: Increases claim by ~10000 Rand vs. Passenger
- PremiumToSumInsuredRatio: Lower ratios increase claim by ~3000 Rand
- VehicleAge: Raise premiums for older vehicles to cover higher claims.
- SumInsured: Scale premiums with insured amounts.
- IsHighRiskProvince: Apply surcharges in high-risk provinces (e.g., Gauteng).
- VehicleType: Higher rates for commercial vehicles.
- PremiumToSumInsuredRatio: Avoid underpricing to reduce claim exposure.