JavaMLBugDetective


JavaMLBugDetective is a machine learning-aided bug prediction framework for Java projects. It combines static code analysis, process metrics, and evolutionary context modeling to predict defect-prone code.

Developed as part of Ph.D. research at Dokuz Eylül University, this framework is actively maintained and continues to evolve.


🚀 Quick Start

# Clone the repository
git clone https://github.com/ttaymaz/JavaMLBugDetective.git
cd JavaMLBugDetective

# Configure your target repository
cp sample.config.properties config.properties
# Edit config.properties with your settings

# Run the analysis pipeline
chmod +x clean_and_run.sh
./clean_and_run.sh

✨ Key Features

  • SZZ Algorithm: Identifies bug-introducing commits via enhanced pattern matching
  • Version-Based Validation: Uses Git tags for realistic, chronological evaluation
  • Hybrid Metrics: Combines process, static, and diff/churn metrics
    • Process: NR, NDEV, AGE, EXP
    • Static (CK suite): WMC, CBO, RFC, LCOM, CYCLO
    • Diff/Churn: LINES_ADDED, LINES_DELETED, HUNK_COUNT
  • ML Pipeline: RandomForest, J48, NaiveBayes, SMO (via Weka)
  • Class Balancing: SMOTE and ClassBalancer
  • Cost-Sensitive Learning: Configurable FN/FP cost matrix
  • Automated Reporting: Scientific validation and prediction reports
  • Green AI: 32,000x more energy-efficient than LLM-based approaches
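
SZZ-style labeling usually starts by picking out bug-fix commits from their commit messages. The sketch below shows one plausible way to do that with a keyword list. The class name BugFixCommitFilter, its methods, and the sample keywords are illustrative assumptions, not the framework's actual code.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical helper: flags commit messages that look like bug fixes. */
public class BugFixCommitFilter {

    private final Pattern keywordPattern;

    public BugFixCommitFilter(List<String> keywords) {
        // Join the keywords into one alternation, quoting each so it is matched literally.
        String alternation = keywords.stream()
                .map(String::trim)
                .filter(k -> !k.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("At least one keyword is required");
        }
        // \b anchors the match at the start of a word: "fixes" matches, "prefix" does not.
        this.keywordPattern =
                Pattern.compile("\\b(?:" + alternation + ")", Pattern.CASE_INSENSITIVE);
    }

    /** Returns true if any keyword occurs at the start of a word in the message. */
    public boolean looksLikeBugFix(String commitMessage) {
        return commitMessage != null && keywordPattern.matcher(commitMessage).find();
    }

    public static void main(String[] args) {
        BugFixCommitFilter filter = new BugFixCommitFilter(List.of("fix", "bug", "defect"));
        System.out.println(filter.looksLikeBugFix("Fixed NPE in parser")); // true
        System.out.println(filter.looksLikeBugFix("Add prefix support"));  // false
    }
}
```

Matching only at the start of a word keeps "fixes" and "bugfix" while skipping "prefix" and "debug", which a plain substring search would flag.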

📁 Project Structure

JavaMLBugDetective/
├── src/main/java/org/tymz/
│   ├── config/        # Configuration management
│   ├── db/            # SQLite database operations
│   ├── feature/       # Data preprocessing
│   ├── git/           # JGit repository operations
│   ├── main/          # Application entry point
│   ├── metric/        # Metric calculators
│   ├── ml/            # Weka ML training
│   ├── report/        # Report generation
│   ├── szz/           # SZZ algorithm
│   └── version/       # Version management
├── src/test/          # Unit tests
├── pom.xml            # Maven configuration
├── config.properties  # Analysis settings (copied from sample.config.properties, git-ignored)
└── clean_and_run.sh   # Pipeline script

⚙️ Configuration

Edit config.properties to configure your analysis:

# Target repository
repository.url=https://github.com/your-org/your-project.git
repository.local.path=./repositories/your-project
project.name=your-project

# SZZ settings
szz.bug_fix_keywords=fix,bug,issue,defect,error,fault,problem,crash,exception

# ML settings
ml.algorithm=all
ml.balance.classes=true
ml.validation.strategy=version-based
ml.smote.enabled=true

# Cost-sensitive learning
# False Negative cost
ml.cost.fn=10.0
# False Positive cost
ml.cost.fp=1.0
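
With ml.cost.fn=10.0 and ml.cost.fp=1.0, a missed bug is treated as ten times as costly as a false alarm. One common way to apply such a cost matrix is to choose the prediction with the lower expected cost, which amounts to lowering the decision threshold on the predicted bug probability. The sketch below illustrates that rule only. The README does not say how the framework applies the costs internally, and the class and method names here are hypothetical.

```java
/** Hypothetical illustration of a minimum-expected-cost decision rule. */
public class CostSensitiveDecision {

    /** Probability above which predicting "buggy" has the lower expected cost. */
    public static double threshold(double costFn, double costFp) {
        if (costFn <= 0 || costFp <= 0) {
            throw new IllegalArgumentException("Costs must be positive");
        }
        return costFp / (costFp + costFn);
    }

    /** Chooses the label with the lower expected misclassification cost. */
    public static boolean predictBuggy(double pBuggy, double costFn, double costFp) {
        // Predicting "clean" is wrong with probability pBuggy (a false negative).
        double costIfPredictClean = pBuggy * costFn;
        // Predicting "buggy" is wrong with probability 1 - pBuggy (a false positive).
        double costIfPredictBuggy = (1.0 - pBuggy) * costFp;
        return costIfPredictBuggy < costIfPredictClean;
    }

    public static void main(String[] args) {
        double costFn = 10.0; // value of ml.cost.fn in the sample configuration
        double costFp = 1.0;  // value of ml.cost.fp in the sample configuration
        System.out.println("threshold = " + threshold(costFn, costFp));
        System.out.println("p=0.20 -> buggy? " + predictBuggy(0.20, costFn, costFp));
        System.out.println("p=0.05 -> buggy? " + predictBuggy(0.05, costFn, costFp));
    }
}
```

For costs of 10 and 1 the threshold is 1/11, about 0.09, so under this rule a file is flagged once its predicted bug probability exceeds roughly 9%. That trades precision for recall.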

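Private Repository Support

To analyze a private repository, add your GitHub credentials to config.properties:

github.username=your-username
github.token=ghp_your_token_here

Note: config.properties is excluded from Git via .gitignore, so the token is not committed.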


📊 Outputs

| Output | Description |
|---|---|
| [project]-dataset.arff | ML dataset with all metrics |
| reports/[project]-report-*.md | Scientific validation report |
| reports/[project]-prediction-*.md | Bug prediction report |

📈 Verified Results

Cross-project validation results (Hybrid Model with Cost-Sensitive Learning):

| Project | F1-Score | Precision | Recall | Instances |
|---|---|---|---|---|
| Apache Kafka | 0.742 | 0.61 | 0.94 | 72,705 |
| Google Gson | 0.685 | 0.52 | 0.99 | 6,034 |
| Apache Commons-IO | 0.570 | 0.40 | 0.99 | 12,920 |
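
F1 is the harmonic mean of precision and recall, so the table can be sanity-checked from its own columns. The sketch below (the class name F1Check is illustrative) recomputes F1 from the rounded precision and recall values. The results agree with the reported scores to within about 0.005, the slack expected from two-decimal rounding.

```java
/** Illustrative check: recompute F1 from the precision and recall columns. */
public class F1Check {

    /** F1 = 2 * P * R / (P + R), defined as 0 when both inputs are 0. */
    public static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        String[] projects = {"Apache Kafka", "Google Gson", "Apache Commons-IO"};
        double[] precision = {0.61, 0.52, 0.40};
        double[] recall = {0.94, 0.99, 0.99};
        double[] reportedF1 = {0.742, 0.685, 0.570};

        for (int i = 0; i < projects.length; i++) {
            double recomputed = f1(precision[i], recall[i]);
            System.out.println(projects[i]
                    + ": recomputed F1 = " + recomputed
                    + ", reported = " + reportedF1[i]);
        }
    }
}
```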

Ablation Study Highlights:

  • Hybrid model outperforms static-only by up to 128% (Commons-IO)
  • Process metrics consistently outperform static metrics
  • Model maintains robust performance despite 70.8% label noise

🔧 Requirements

  • Java: JDK 21+
  • Maven: 3.9+
  • Git: For repository operations
  • RAM: 4GB+ (recommended for large repos)

📦 Dependencies

  • Eclipse JGit: Git operations
  • PMD: Static code analysis
  • Weka: Machine learning
  • SQLite JDBC: Data persistence

📚 Dataset & Replication Package

The JML-BugDB dataset and complete replication package are permanently archived at Zenodo; the DOI is listed in the Citation section below.

The package includes:

  • JML-BugDB dataset (91,633 instances across 3 Java projects)
  • Manual validation data and methodology
  • Framework source code snapshot
  • Replication instructions

📖 Citation

If you use this work in your research, please cite:

@software{taymaz2026jmlbugdetective,
  author    = {Taymaz, Turgay and Birant, Kökten Ulaş},
  title     = {JavaMLBugDetective: ML-Aided Bug Prediction Framework},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.18161123},
  url       = {https://doi.org/10.5281/zenodo.18161123}
}

👥 Authors

Turgay Taymaz — Developer & Researcher
Assoc. Prof. Dr. Kökten Ulaş Birant — Advisor

Dokuz Eylül University, The Graduate School of Natural and Applied Sciences


🤝 Contributing

Contributions are welcome! Please:

  1. Open an issue for bugs or feature requests
  2. Submit pull requests for improvements

Contact: turgay[at]taymaz.org


📄 License

This project is released under the MIT License.


Last Updated: January 2026
