A comprehensive machine learning project for detecting fraudulent financial transactions using Python and scikit-learn.
This project implements a fraud detection system that analyzes financial transaction data to identify potentially fraudulent activities. The system uses a Random Forest classifier to predict fraudulent transactions with high accuracy.
- Data Analysis: Comprehensive exploratory data analysis of 6.3M+ transactions
- Feature Engineering: Creation of relevant features like balance changes and zero balance indicators
- Machine Learning: Random Forest classifier for fraud detection
- Performance Metrics: 99.97% accuracy, with 96% precision and 76% recall on the fraud class
- Visualization: Multiple charts and plots for data insights
- Total Transactions: 6,362,620
- Fraudulent Transactions: 8,213 (0.13% fraud rate)
- Features: 11 columns including transaction type, amount, balances, and fraud indicators
- Data Quality: Clean dataset with no missing values or duplicates
| Feature | Description |
|---|---|
| `step` | Time step (hour) |
| `type` | Transaction type (PAYMENT, TRANSFER, CASH_OUT, DEBIT, CASH_IN) |
| `amount` | Transaction amount |
| `nameOrig` | Origin customer ID |
| `oldbalanceOrg` | Origin account balance before transaction |
| `newbalanceOrig` | Origin account balance after transaction |
| `nameDest` | Destination customer ID |
| `oldbalanceDest` | Destination account balance before transaction |
| `newbalanceDest` | Destination account balance after transaction |
| `isFraud` | Fraud indicator (target variable) |
| `isFlaggedFraud` | Flagged fraud indicator |
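The dataset figures above (row count, fraud count, missing values, duplicates) can be reproduced with a small helper along these lines. This is a minimal sketch, not the notebook's own code: the function name `summarize_transactions` is hypothetical, and it assumes only that the frame has the `isFraud` column listed in the table.

```python
import pandas as pd


def summarize_transactions(df: pd.DataFrame) -> dict:
    """Return basic size and quality figures for a transactions frame."""
    n_rows = len(df)
    n_fraud = int(df["isFraud"].sum())
    return {
        "rows": n_rows,
        "columns": df.shape[1],
        "fraud_count": n_fraud,
        # Share of rows labelled as fraud, in percent
        "fraud_rate_pct": 100.0 * n_fraud / n_rows if n_rows else 0.0,
        # Total number of missing cells across all columns
        "missing_values": int(df.isna().sum().sum()),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Usage, assuming Fraud.csv sits in the project directory:
# df = pd.read_csv("Fraud.csv")
# print(summarize_transactions(df))
```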
- CASH_OUT: Most common transaction type (2.2M transactions)
- TRANSFER: Highest fraud rate (0.77%)
- CASH_IN, DEBIT, PAYMENT: Zero fraud rate
- Average Amount: Fraudulent transactions average $1.47M vs $178K for legitimate
- Balance Changes: Moderate correlation (0.36) with the fraud label
- Time Patterns: Fraud occurs in specific time intervals
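Per-type fraud rates and average amounts like the ones listed above come from simple group-by aggregations. The sketch below shows one way to compute them. The helper names are hypothetical, and it assumes the `type`, `amount`, and `isFraud` columns from the feature table.

```python
import pandas as pd


def fraud_rate_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count transactions and fraud cases per transaction type."""
    grouped = df.groupby("type")["isFraud"].agg(transactions="count", fraud="sum")
    grouped["fraud_rate_pct"] = 100.0 * grouped["fraud"] / grouped["transactions"]
    # Highest fraud rate first
    return grouped.sort_values("fraud_rate_pct", ascending=False)


def mean_amount_by_label(df: pd.DataFrame) -> pd.Series:
    """Average transaction amount for legitimate (0) and fraudulent (1) rows."""
    return df.groupby("isFraud")["amount"].mean()
```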
- Handled categorical variables using Label Encoding
- Created engineered features:
  - `balance_change`: Difference between old and new balance
  - `is_zero_balance`: Indicator for zero balance transactions
- Applied StandardScaler for feature normalization
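The preprocessing steps above can be sketched as follows. Several details are assumptions rather than facts from the notebook: the helper names, the `type_encoded` column name, computing `balance_change` on the origin account as new minus old balance, and defining `is_zero_balance` as a zero origin balance after the transaction. Fitting the scaler on the training split only is also a choice made here, not something the README states.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with an encoded type and two engineered columns."""
    out = df.copy()
    # Label Encoding maps each transaction type to an integer (alphabetical order)
    out["type_encoded"] = LabelEncoder().fit_transform(out["type"])
    # ASSUMPTION: balance change is measured on the origin account, new minus old
    out["balance_change"] = out["newbalanceOrig"] - out["oldbalanceOrg"]
    # ASSUMPTION: "zero balance" means the origin account is empty afterwards
    out["is_zero_balance"] = (out["newbalanceOrig"] == 0).astype(int)
    return out


def scale_features(train_X, test_X):
    """Standardize features using statistics from the training split only."""
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_X)
    test_scaled = scaler.transform(test_X)
    return train_scaled, test_scaled, scaler
```

Tree-based models such as Random Forest do not strictly need scaled inputs, so the scaling step mainly matters if other algorithms are tried on the same features.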
- Algorithm: Random Forest Classifier
- Parameters: n_estimators=2, random_state=42
- Train/Test Split: 80/20 with stratification
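A minimal sketch of this configuration in scikit-learn is shown below. The function name `train_fraud_model` is hypothetical, and reusing `random_state=42` for the split is an assumption (the README gives it only as a model parameter). `n_estimators=2` is taken from the configuration above as-is.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_fraud_model(X, y, n_estimators=2, random_state=42):
    """Split 80/20 with stratification and fit a Random Forest classifier."""
    # stratify=y keeps the fraud share the same in the train and test splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(
        n_estimators=n_estimators, random_state=random_state
    )
    model.fit(X_train, y_train)
    return model, X_test, y_test
```

Stratification matters here because fraud makes up only 0.13% of the rows. An unstratified split could leave the test set with a noticeably different fraud share than the training set.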
Accuracy: 99.97%
Precision: 96% (fraud class)
Recall: 76% (fraud class)
F1-Score: 85% (fraud class)
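As a quick consistency check, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two figures above. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Fraud-class figures reported above
f1 = f1_from(0.96, 0.76)
print(f"F1 = {f1:.3f}")  # about 0.848, which rounds to the reported 85%
```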
The project includes several visualizations:
- Transaction type distribution
- Amount distribution histograms
- Fraud rate over time
- Balance distribution plots
- Correlation heatmaps
`pip install pandas numpy matplotlib seaborn scikit-learn`
- Ensure you have the `Fraud.csv` dataset in the project directory
- Open `Fraud.ipynb` in Jupyter Notebook or JupyterLab
- Run all cells to execute the complete analysis
- Data exploration results
- Feature engineering insights
- Model training and evaluation
- Performance metrics and predictions
Fraud/
├── Fraud.ipynb # Main Jupyter notebook
├── Fraud.csv # Dataset (not included in repo)
└── README.md # This file
The Random Forest model demonstrates excellent performance:
- High Accuracy: 99.97% overall accuracy
- Low False Positives: Only 47 false alarms out of 1.27M predictions
- Good Recall: Captures 76% of actual fraud cases
- Strong Precision: 96% of predicted fraud cases are actual fraud
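To put the accuracy figure in context, it helps to compare it with a trivial baseline that labels every transaction as legitimate. The sketch below computes that baseline from the dataset counts reported in this README. The function name is hypothetical.

```python
def majority_class_accuracy(total: int, fraud: int) -> float:
    """Accuracy of a model that always predicts 'not fraud'."""
    return (total - fraud) / total


# Dataset counts reported in this README
baseline = majority_class_accuracy(total=6_362_620, fraud=8_213)
print(f"Baseline accuracy: {baseline:.4%}")
```

Because the classes are so imbalanced, this baseline already lands near 99.87%, which is why the fraud-class precision and recall are the more informative figures.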
This fraud detection system can:
- Reduce Financial Losses: Early detection of fraudulent transactions
- Improve Customer Trust: Minimize false positives that affect legitimate customers
- Enhance Security: Real-time monitoring capabilities
- Scale Operations: Handle large transaction volumes efficiently
Potential improvements include:
- Real-time transaction monitoring
- Integration with existing banking systems
- Additional ML algorithms (XGBoost, Neural Networks)
- API development for production deployment
- Advanced feature engineering
- Ensemble methods for improved performance
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit issues and enhancement requests.
Note: The dataset (Fraud.csv) is not included in this repository due to size constraints. Please ensure you have the dataset before running the notebook.