DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation Through DocString Generation
A comprehensive framework for evaluating Large Language Models (LLMs) in automated docstring generation tasks for Python classes.
Authors: Gireesh Sundaram, Balaji Venktesh V, Sundharakumar K B
This project evaluates the performance of various LLMs in generating docstrings for Python classes. The framework includes:
- Data Extraction: Automated extraction of Python classes from repositories
- Code Preprocessing: Cleaning and standardizing code for consistent evaluation (sketched after this list)
- Docstring Generation: Using multiple LLMs to generate docstrings
- Evaluation Metrics: Comprehensive evaluation using ROUGE, BLEU, and custom metrics
- Analysis Tools: Detailed analysis and comparison of model performance
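The cleaning step matters for fair scoring: if the original docstrings were left in the source, a model could simply echo them back. As a rough illustration of that idea only (the project's actual logic lives in `cleanup.py`, and the helper name below is hypothetical), existing docstrings can be stripped from a class with `ast` before the code is sent to a model:

```python
# Minimal sketch of the preprocessing idea (hypothetical helper, not the exact cleanup.py logic):
# strip existing docstrings so the LLM has to write its own.
import ast


def strip_docstrings(source: str) -> str:
    """Return `source` with module/class/function docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                body.pop(0)          # drop the leading docstring expression
                if not body:         # keep the block syntactically valid
                    body.append(ast.Pass())
    return ast.unparse(tree)         # ast.unparse requires Python 3.9+


if __name__ == "__main__":
    code = 'class Foo:\n    """Old docstring."""\n    def bar(self):\n        return 1\n'
    print(strip_docstrings(code))
```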
Project structure:

DocStringEval/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── LICENSE # MIT License
├── classes/ # Python class files for evaluation
│ ├── _BaseEncoder.py
│ ├── Adamax.py
│ ├── AgglomerationTransform.py
│ ├── AveragePooling1D.py
│ ├── AveragePooling2D.py
│ ├── AveragePooling3D.py
│ ├── BayesianGaussianMixture.py
│ ├── Conv.py
│ ├── Conv1D.py
│ ├── Conv1DTranspose.py
│ ├── Conv2D.py
│ ├── Conv2DTranspose.py
│ ├── Conv3D.py
│ ├── Conv3DTranspose.py
│ ├── Cropping1D.py
│ ├── Cropping2D.py
│ ├── Cropping3D.py
│ ├── DBSCAN.py
│ ├── DepthwiseConv2D.py
│ ├── Embedding.py
│ ├── Flask.py
│ ├── FunctionTransformer.py
│ ├── GaussianMixture.py
│ ├── GlobalAveragePooling1D.py
│ ├── GlobalAveragePooling2D.py
│ ├── GlobalAveragePooling3D.py
│ ├── GlobalMaxPooling1D.py
│ ├── GlobalMaxPooling2D.py
│ ├── GlobalMaxPooling3D.py
│ ├── GlobalPooling1D.py
│ ├── GlobalPooling2D.py
│ ├── GlobalPooling3D.py
│ ├── GroupTimeSeriesSplit.py
│ ├── Kmeans.py
│ ├── LabelBinarizer.py
│ ├── LabelEncoder.py
│ ├── LinearRegression.py
│ ├── LogisticRegression.py
│ ├── Loss.py
│ ├── MaxPooling1D.py
│ ├── MaxPooling2D.py
│ ├── MaxPooling3D.py
│ ├── Metric.py
│ ├── MultiLabelBinarizer.py
│ ├── OneHotEncoder.py
│ ├── OPTICS.py
│ ├── OrdinalEncoder.py
│ ├── Pooling1D.py
│ ├── Pooling2D.py
│ ├── Pooling3D.py
│ ├── PrincipalComponentAnalysis.py
│ ├── RMSprop.py
│ ├── SelfTrainingClassifier.py
│ ├── SeparableConv.py
│ ├── SeparableConv1D.py
│ ├── SeparableConv2D.py
│ ├── SequentialFeatureSelector.py
│ ├── SGD.py
│ ├── SoftmaxRegression.py
│ ├── TargetEncoder.py
│ ├── TransactionEncoder.py
│ ├── UpSampling1D.py
│ ├── UpSampling2D.py
│ ├── UpSampling3D.py
│ ├── ZeroPadding1D.py
│ ├── ZeroPadding2D.py
│ └── ZeroPadding3D.py
├── Output/ # Generated docstring outputs
│ ├── code-to-docstring-clean_codegemma-7b-it-dfe.json
│ ├── code-to-docstring-clean_codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-clean_deepseek-coder-7b-instruct-v-riq.json
│ ├── code-to-docstring-clean_qwen2-5-coder-7b-instruct-ljc.json
│ ├── code-to-docstring-codegemma-7b-it-dfe.json
│ ├── code-to-docstring-codellama-34b-instruct-hf-kzi_NEW.json
│ ├── code-to-docstring-codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-COT_codegemma-7b-it-dfe.json
│ ├── code-to-docstring-COT_codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-COT_deepseek-coder-7b-instruct-v-fkg.json
│ ├── code-to-docstring-COT_qwen2-5-coder-7b-instruct-ljc.json
│ ├── code-to-docstring-deepseek-coder-7b-instruct-v-riq.json
│ ├── code-to-docstring-meta-llama-3-8b-instruct-gtq_NEW.json
│ ├── code-to-docstring-qwen2-5-7b-instruct-qne_NEW.json
│ ├── code-to-docstring-qwen2-5-coder-32b-instruct-qyb_NEW.json
│ └── code-to-docstring-qwen2-5-coder-7b-instruct-ljc.json
├── backup/ # Backup files and previous outputs
│ ├── clean_code.json
│ ├── code-to-comments-clean-code-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-comments-clean-code-qwen2-5-coder-7b-instruct-iul.json
│ ├── code-to-comments-codegemma-1-1-7b-it-uof.json
│ ├── code-to-comments-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-comments-deepseek-coder-v2-lite-instr-zxn.json
│ ├── code-to-comments-llama-3-2-3b-instruct-msv.json
│ ├── code-to-comments-mistral-7b-instruct-v0-3-uly.json
│ ├── code-to-comments-qwen2-5-coder-7b-instruct-iul.json
│ ├── code-to-docstring-clean-code-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-docstring-clean-codellama-13b-instruct-hf-abo.json
│ ├── code-to-docstring-clean-deepseek-coder-v2-lite-instr-atp.json
│ ├── code-to-docstring-clean-mistral-7b-instruct-v0-3-uly.json
│ └── code-to-docstring-clean-qwen2-5-coder-7b-instruct-iul-xfe.json
├── Core Scripts/ # Main evaluation and processing scripts
│ ├── analysis.py # Analysis of evaluation results
│ ├── cleanup.py # Code cleaning and preprocessing
│ ├── code to comments evaluation.py
│ ├── code to comments.py
│ ├── code to docstring evaluation.py
│ ├── code to docstring.py
│ ├── docstring generation.py
│ ├── extract_classes.py
│ ├── scoring.py
│ └── sample_class.py
├── Data Files/ # Input data and processed files
│ ├── class collection.csv
│ ├── class collection.xlsx
│ ├── class_files_df.pkl
│ ├── clean_code
│ ├── LLM Hard questions.csv
│ ├── new scoring.xlsx
│ └── sdfsdf.csv
└── Results/ # Evaluation results and reports
├── all llm scoring.xlsx
└── all_scoiring.pkl
To get started:

- Clone the repository: `git clone https://github.com/GireeshS22/DocStringEval.git`, then `cd DocStringEval`
- Install dependencies: `pip install -r requirements.txt`
- Create Environment File: copy the example environment file with `cp .env.example .env`
- Add Your API Keys: edit the `.env` file and set your Hugging Face API key, e.g. `HUGGINGFACE_API_KEY=your_actual_api_key_here`

Note: Never commit your `.env` file to version control; it is already listed in `.gitignore`.
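The scripts are then expected to pick the key up from the environment at runtime. A minimal sketch of how that typically looks with `python-dotenv` (assuming it is among the dependencies; the exact loading code in the project may differ):

```python
# Minimal sketch: load HUGGINGFACE_API_KEY from .env (assumes python-dotenv is installed).
import os

from dotenv import load_dotenv

load_dotenv()  # copies variables from a local .env file into os.environ
hf_api_key = os.getenv("HUGGINGFACE_API_KEY")
if not hf_api_key:
    raise RuntimeError("HUGGINGFACE_API_KEY is not set; see the .env setup above.")
```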
Then run the pipeline:

- Extract Classes: `python extract_classes.py`
- Generate Docstrings: `python "code to docstring.py"`
- Evaluate Results: `python "code to docstring evaluation.py"`
- Analyze Results: `python analysis.py`

The following models were evaluated:

- CodeLlama (7B, 13B, 34B)
- Qwen2.5 Coder (7B, 32B)
- DeepSeek Coder (7B)
- CodeGemma (7B)
- Mistral (7B)
- Meta Llama 3 (8B)
Generated docstrings are scored with the following metrics (a minimal scoring sketch follows this list):

- ROUGE-1: Measures word overlap between generated and reference docstrings
- BLEU: Evaluates n-gram precision and brevity penalty
- Conciseness: Measures the length efficiency of generated docstrings
- Custom Metrics: Domain-specific evaluation criteria
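For orientation, here is a minimal sketch of scoring a single generated/reference pair with the `rouge-score` and `nltk` packages. The authoritative evaluation lives in `code to docstring evaluation.py` and `scoring.py`; the conciseness ratio below is an illustrative assumption, not necessarily the project's exact definition:

```python
# Minimal scoring sketch (assumes the rouge-score and nltk packages are installed).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "Applies 1D average pooling over the temporal dimension of the input."
generated = "Performs average pooling over the temporal dimension for 1D inputs."

# ROUGE-1: unigram overlap between the generated and reference docstrings.
rouge1 = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True).score(reference, generated)["rouge1"]

# BLEU: n-gram precision with a brevity penalty (smoothing avoids zero scores on short texts).
bleu = sentence_bleu(
    [reference.split()],
    generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Conciseness (illustrative definition): reference length over generated length, capped at 1.0.
conciseness = min(1.0, len(reference.split()) / len(generated.split()))

print(f"ROUGE-1 F1: {rouge1.fmeasure:.3f}  BLEU: {bleu:.3f}  Conciseness: {conciseness:.3f}")
```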
Key components of the data pipeline (an extraction sketch follows this list):

- `extract_classes.py`: Extracts Python classes from repositories
- `cleanup.py`: Cleans and standardizes code for evaluation
- `class_files_df.pkl`: Processed dataset of Python classes
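As a rough sketch of the extraction idea (the real logic is in `extract_classes.py`; the helper below is hypothetical), a checked-out repository can be walked with `ast` and every top-level class written to its own file, which is how the `classes/` directory above is organized:

```python
# Minimal extraction sketch (hypothetical helper; see extract_classes.py for the project's logic).
import ast
from pathlib import Path


def extract_classes(repo_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for py_file in Path(repo_dir).rglob("*.py"):
        source = py_file.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in tree.body:
            if isinstance(node, ast.ClassDef):
                segment = ast.get_source_segment(source, node)  # exact source text of the class
                if segment:
                    (out / f"{node.name}.py").write_text(segment, encoding="utf-8")


if __name__ == "__main__":
    extract_classes("some_cloned_repo", "classes")
```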
Docstring generation (a generation sketch follows this list):

- `code to docstring.py`: Main script for generating docstrings using LLMs
- `docstring generation.py`: Alternative generation approach
- `Output/`: Directory containing generated docstrings for each model
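One possible shape of the generation call, using the `huggingface_hub` `InferenceClient` chat API (the model id and prompt wording here are illustrative assumptions; the project's actual prompts, including the Chain-of-Thought variants behind the `COT_*` output files, live in `code to docstring.py`):

```python
# Minimal generation sketch (illustrative prompt and model id; not the project's exact prompts).
import os

from huggingface_hub import InferenceClient

client = InferenceClient(token=os.getenv("HUGGINGFACE_API_KEY"))

class_source = open("classes/Adamax.py", encoding="utf-8").read()
prompt = (
    "Write a concise, PEP 257-style docstring for the following Python class. "
    "Return only the docstring.\n\n" + class_source
)

response = client.chat_completion(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```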
Evaluation and analysis (an aggregation sketch follows this list):

- `code to docstring evaluation.py`: Evaluates generated docstrings using ROUGE and BLEU
- `scoring.py`: Custom scoring mechanisms
- `analysis.py`: Comprehensive analysis of evaluation results
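For the analysis step, here is a minimal sketch of how per-model result files could be aggregated into a comparison table with pandas. The JSON schema assumed here (a list of records carrying per-class scores) is an assumption for illustration; the authoritative logic is in `scoring.py` and `analysis.py`:

```python
# Minimal aggregation sketch (assumed JSON schema; see analysis.py for the project's real logic).
import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path("Output").glob("code-to-docstring-*.json"):
    records = json.loads(path.read_text(encoding="utf-8"))
    for record in records:  # assumed fields per record: "class_name", "rouge1", "bleu"
        rows.append({"model": path.stem, **record})

df = pd.DataFrame(rows)
print(df.groupby("model")[["rouge1", "bleu"]].mean().sort_values("rouge1", ascending=False))
```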
Results:

- `all llm scoring.xlsx`: Comprehensive scoring results
- `all_scoiring.pkl`: Pickled scoring data for further analysis
Detailed evaluation results and model comparisons are available in the `Results/` and `Output/` directories.
Contributions are welcome:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this framework in your research, please cite:
@INPROCEEDINGS{11108633,
author={Sundaram, Gireesh and Venktesh V, Balaji and K B, Sundharakumar},
booktitle={2025 International Conference on Emerging Technologies in Computing and Communication (ETCC)},
title={DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation Through DocString Generation},
year={2025},
volume={},
number={},
pages={1-7},
keywords={Measurement;Codes;Accuracy;Large language models;Computational modeling;Computer architecture;Benchmark testing;Python;Large Language Models (LLMs);Code Explanation;Docstring Generation;Chain of Thought (CoT) Prompting;Code Summarization},
doi={10.1109/ETCC65847.2025.11108633}
}

🚀 Paper Published! Our work "DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation through DocString Generation" has been published at IEEE ETCC 2025!
> 🧵 🚀 NEW PAPER: "DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation through DocString Generation"
>
> Published at IEEE ETCC 2025! 📄
>
> With @Balajivenky4288
>
> We benchmarked how well LLMs can explain Python code by generating DocStrings.
>
> — Gireesh (@GireeshS22) August 14, 2025
For questions and support, please:
- Reply to our Twitter thread for project-related questions and discussions
- Open an issue on GitHub for technical bugs and feature requests
- Follow us on X (Twitter) for updates and discussions