DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation Through DocString Generation
A comprehensive framework for evaluating Large Language Models (LLMs) in automated docstring generation tasks for Python classes.
Authors: Gireesh Sundaram, Balaji Venktesh V, Sundharakumar K B
This project evaluates the performance of various LLMs in generating docstrings for Python classes. The framework includes:
- Data Extraction: Automated extraction of Python classes from repositories
- Code Preprocessing: Cleaning and standardizing code for consistent evaluation (sketched after this list)
- Docstring Generation: Using multiple LLMs to generate docstrings
- Evaluation Metrics: Comprehensive evaluation using ROUGE, BLEU, and custom metrics
- Analysis Tools: Detailed analysis and comparison of model performance
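The cleaning step matters for fair scoring: if the original docstrings were left in the source, a model could simply echo them back. As a rough illustration of that idea only (the project's actual logic lives in `cleanup.py`, and the helper name below is hypothetical), existing docstrings can be stripped from a class with `ast` before the code is sent to a model:

```python
# Minimal sketch of the preprocessing idea (hypothetical helper, not the exact cleanup.py logic):
# strip existing docstrings so the LLM has to write its own.
import ast


def strip_docstrings(source: str) -> str:
    """Return `source` with module/class/function docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                body.pop(0)          # drop the leading docstring expression
                if not body:         # keep the block syntactically valid
                    body.append(ast.Pass())
    return ast.unparse(tree)         # ast.unparse requires Python 3.9+


if __name__ == "__main__":
    code = 'class Foo:\n    """Old docstring."""\n    def bar(self):\n        return 1\n'
    print(strip_docstrings(code))
```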
Project structure:

DocStringEval/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
├── LICENSE # MIT License
├── classes/ # Python class files for evaluation
│ ├── _BaseEncoder.py
│ ├── Adamax.py
│ ├── AgglomerationTransform.py
│ ├── AveragePooling1D.py
│ ├── AveragePooling2D.py
│ ├── AveragePooling3D.py
│ ├── BayesianGaussianMixture.py
│ ├── Conv.py
│ ├── Conv1D.py
│ ├── Conv1DTranspose.py
│ ├── Conv2D.py
│ ├── Conv2DTranspose.py
│ ├── Conv3D.py
│ ├── Conv3DTranspose.py
│ ├── Cropping1D.py
│ ├── Cropping2D.py
│ ├── Cropping3D.py
│ ├── DBSCAN.py
│ ├── DepthwiseConv2D.py
│ ├── Embedding.py
│ ├── Flask.py
│ ├── FunctionTransformer.py
│ ├── GaussianMixture.py
│ ├── GlobalAveragePooling1D.py
│ ├── GlobalAveragePooling2D.py
│ ├── GlobalAveragePooling3D.py
│ ├── GlobalMaxPooling1D.py
│ ├── GlobalMaxPooling2D.py
│ ├── GlobalMaxPooling3D.py
│ ├── GlobalPooling1D.py
│ ├── GlobalPooling2D.py
│ ├── GlobalPooling3D.py
│ ├── GroupTimeSeriesSplit.py
│ ├── Kmeans.py
│ ├── LabelBinarizer.py
│ ├── LabelEncoder.py
│ ├── LinearRegression.py
│ ├── LogisticRegression.py
│ ├── Loss.py
│ ├── MaxPooling1D.py
│ ├── MaxPooling2D.py
│ ├── MaxPooling3D.py
│ ├── Metric.py
│ ├── MultiLabelBinarizer.py
│ ├── OneHotEncoder.py
│ ├── OPTICS.py
│ ├── OrdinalEncoder.py
│ ├── Pooling1D.py
│ ├── Pooling2D.py
│ ├── Pooling3D.py
│ ├── PrincipalComponentAnalysis.py
│ ├── RMSprop.py
│ ├── SelfTrainingClassifier.py
│ ├── SeparableConv.py
│ ├── SeparableConv1D.py
│ ├── SeparableConv2D.py
│ ├── SequentialFeatureSelector.py
│ ├── SGD.py
│ ├── SoftmaxRegression.py
│ ├── TargetEncoder.py
│ ├── TransactionEncoder.py
│ ├── UpSampling1D.py
│ ├── UpSampling2D.py
│ ├── UpSampling3D.py
│ ├── ZeroPadding1D.py
│ ├── ZeroPadding2D.py
│ └── ZeroPadding3D.py
├── Output/ # Generated docstring outputs
│ ├── code-to-docstring-clean_codegemma-7b-it-dfe.json
│ ├── code-to-docstring-clean_codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-clean_deepseek-coder-7b-instruct-v-riq.json
│ ├── code-to-docstring-clean_qwen2-5-coder-7b-instruct-ljc.json
│ ├── code-to-docstring-codegemma-7b-it-dfe.json
│ ├── code-to-docstring-codellama-34b-instruct-hf-kzi_NEW.json
│ ├── code-to-docstring-codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-COT_codegemma-7b-it-dfe.json
│ ├── code-to-docstring-COT_codellama-7b-instruct-hf-yig.json
│ ├── code-to-docstring-COT_deepseek-coder-7b-instruct-v-fkg.json
│ ├── code-to-docstring-COT_qwen2-5-coder-7b-instruct-ljc.json
│ ├── code-to-docstring-deepseek-coder-7b-instruct-v-riq.json
│ ├── code-to-docstring-meta-llama-3-8b-instruct-gtq_NEW.json
│ ├── code-to-docstring-qwen2-5-7b-instruct-qne_NEW.json
│ ├── code-to-docstring-qwen2-5-coder-32b-instruct-qyb_NEW.json
│ └── code-to-docstring-qwen2-5-coder-7b-instruct-ljc.json
├── backup/ # Backup files and previous outputs
│ ├── clean_code.json
│ ├── code-to-comments-clean-code-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-comments-clean-code-qwen2-5-coder-7b-instruct-iul.json
│ ├── code-to-comments-codegemma-1-1-7b-it-uof.json
│ ├── code-to-comments-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-comments-deepseek-coder-v2-lite-instr-zxn.json
│ ├── code-to-comments-llama-3-2-3b-instruct-msv.json
│ ├── code-to-comments-mistral-7b-instruct-v0-3-uly.json
│ ├── code-to-comments-qwen2-5-coder-7b-instruct-iul.json
│ ├── code-to-docstring-clean-code-codellama-7b-instruct-hf-xfe.json
│ ├── code-to-docstring-clean-codellama-13b-instruct-hf-abo.json
│ ├── code-to-docstring-clean-deepseek-coder-v2-lite-instr-atp.json
│ ├── code-to-docstring-clean-mistral-7b-instruct-v0-3-uly.json
│ └── code-to-docstring-clean-qwen2-5-coder-7b-instruct-iul-xfe.json
├── Core Scripts/ # Main evaluation and processing scripts
│ ├── analysis.py # Analysis of evaluation results
│ ├── cleanup.py # Code cleaning and preprocessing
│ ├── code to comments evaluation.py
│ ├── code to comments.py
│ ├── code to docstring evaluation.py
│ ├── code to docstring.py
│ ├── docstring generation.py
│ ├── extract_classes.py
│ ├── scoring.py
│ └── sample_class.py
├── Data Files/ # Input data and processed files
│ ├── class collection.csv
│ ├── class collection.xlsx
│ ├── class_files_df.pkl
│ ├── clean_code
│ ├── LLM Hard questions.csv
│ ├── new scoring.xlsx
│ └── sdfsdf.csv
└── Results/ # Evaluation results and reports
├── all llm scoring.xlsx
└── all_scoiring.pkl
To get started:

- Clone the repository: `git clone https://github.com/GireeshS22/DocStringEval.git`, then `cd DocStringEval`
- Install dependencies: `pip install -r requirements.txt`
- Create Environment File: copy the example environment file with `cp .env.example .env`
- Add Your API Keys: edit the `.env` file and set your Hugging Face API key, e.g. `HUGGINGFACE_API_KEY=your_actual_api_key_here`

Note: Never commit your `.env` file to version control; it is already listed in `.gitignore`.
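The scripts are then expected to pick the key up from the environment at runtime. A minimal sketch of how that typically looks with `python-dotenv` (assuming it is among the dependencies; the exact loading code in the project may differ):

```python
# Minimal sketch: load HUGGINGFACE_API_KEY from .env (assumes python-dotenv is installed).
import os

from dotenv import load_dotenv

load_dotenv()  # copies variables from a local .env file into os.environ
hf_api_key = os.getenv("HUGGINGFACE_API_KEY")
if not hf_api_key:
    raise RuntimeError("HUGGINGFACE_API_KEY is not set; see the .env setup above.")
```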
Then run the pipeline:

- Extract Classes: `python extract_classes.py`
- Generate Docstrings: `python "code to docstring.py"`
- Evaluate Results: `python "code to docstring evaluation.py"`
- Analyze Results: `python analysis.py`

The following models were evaluated:

- CodeLlama (7B, 13B, 34B)
- Qwen2.5 Coder (7B, 32B)
- DeepSeek Coder (7B)
- CodeGemma (7B)
- Mistral (7B)
- Meta Llama 3 (8B)
Generated docstrings are scored with the following metrics (a minimal scoring sketch follows this list):

- ROUGE-1: Measures word overlap between generated and reference docstrings
- BLEU: Evaluates n-gram precision and brevity penalty
- Conciseness: Measures the length efficiency of generated docstrings
- Custom Metrics: Domain-specific evaluation criteria
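For orientation, here is a minimal sketch of scoring a single generated/reference pair with the `rouge-score` and `nltk` packages. The authoritative evaluation lives in `code to docstring evaluation.py` and `scoring.py`; the conciseness ratio below is an illustrative assumption, not necessarily the project's exact definition:

```python
# Minimal scoring sketch (assumes the rouge-score and nltk packages are installed).
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "Applies 1D average pooling over the temporal dimension of the input."
generated = "Performs average pooling over the temporal dimension for 1D inputs."

# ROUGE-1: unigram overlap between the generated and reference docstrings.
rouge1 = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True).score(reference, generated)["rouge1"]

# BLEU: n-gram precision with a brevity penalty (smoothing avoids zero scores on short texts).
bleu = sentence_bleu(
    [reference.split()],
    generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Conciseness (illustrative definition): reference length over generated length, capped at 1.0.
conciseness = min(1.0, len(reference.split()) / len(generated.split()))

print(f"ROUGE-1 F1: {rouge1.fmeasure:.3f}  BLEU: {bleu:.3f}  Conciseness: {conciseness:.3f}")
```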
Key components of the data pipeline (an extraction sketch follows this list):

- `extract_classes.py`: Extracts Python classes from repositories
- `cleanup.py`: Cleans and standardizes code for evaluation
- `class_files_df.pkl`: Processed dataset of Python classes
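As a rough sketch of the extraction idea (the real logic is in `extract_classes.py`; the helper below is hypothetical), a checked-out repository can be walked with `ast` and every top-level class written to its own file, which is how the `classes/` directory above is organized:

```python
# Minimal extraction sketch (hypothetical helper; see extract_classes.py for the project's logic).
import ast
from pathlib import Path


def extract_classes(repo_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for py_file in Path(repo_dir).rglob("*.py"):
        source = py_file.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        for node in tree.body:
            if isinstance(node, ast.ClassDef):
                segment = ast.get_source_segment(source, node)  # exact source text of the class
                if segment:
                    (out / f"{node.name}.py").write_text(segment, encoding="utf-8")


if __name__ == "__main__":
    extract_classes("some_cloned_repo", "classes")
```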
Docstring generation (a generation sketch follows this list):

- `code to docstring.py`: Main script for generating docstrings using LLMs
- `docstring generation.py`: Alternative generation approach
- `Output/`: Directory containing generated docstrings for each model
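One possible shape of the generation call, using the `huggingface_hub` `InferenceClient` chat API (the model id and prompt wording here are illustrative assumptions; the project's actual prompts, including the Chain-of-Thought variants behind the `COT_*` output files, live in `code to docstring.py`):

```python
# Minimal generation sketch (illustrative prompt and model id; not the project's exact prompts).
import os

from huggingface_hub import InferenceClient

client = InferenceClient(token=os.getenv("HUGGINGFACE_API_KEY"))

class_source = open("classes/Adamax.py", encoding="utf-8").read()
prompt = (
    "Write a concise, PEP 257-style docstring for the following Python class. "
    "Return only the docstring.\n\n" + class_source
)

response = client.chat_completion(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```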
Evaluation and analysis (an aggregation sketch follows this list):

- `code to docstring evaluation.py`: Evaluates generated docstrings using ROUGE and BLEU
- `scoring.py`: Custom scoring mechanisms
- `analysis.py`: Comprehensive analysis of evaluation results
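For the analysis step, here is a minimal sketch of how per-model result files could be aggregated into a comparison table with pandas. The JSON schema assumed here (a list of records carrying per-class scores) is an assumption for illustration; the authoritative logic is in `scoring.py` and `analysis.py`:

```python
# Minimal aggregation sketch (assumed JSON schema; see analysis.py for the project's real logic).
import json
from pathlib import Path

import pandas as pd

rows = []
for path in Path("Output").glob("code-to-docstring-*.json"):
    records = json.loads(path.read_text(encoding="utf-8"))
    for record in records:  # assumed fields per record: "class_name", "rouge1", "bleu"
        rows.append({"model": path.stem, **record})

df = pd.DataFrame(rows)
print(df.groupby("model")[["rouge1", "bleu"]].mean().sort_values("rouge1", ascending=False))
```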
Results:

- `all llm scoring.xlsx`: Comprehensive scoring results
- `all_scoiring.pkl`: Pickled scoring data for further analysis
Detailed evaluation results and model comparisons are available in the `Results/` and `Output/` directories.
Contributions are welcome:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this framework in your research, please cite:
@INPROCEEDINGS{11108633,
author={Sundaram, Gireesh and Venktesh V, Balaji and K B, Sundharakumar},
booktitle={2025 International Conference on Emerging Technologies in Computing and Communication (ETCC)},
title={DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation Through DocString Generation},
year={2025},
volume={},
number={},
pages={1-7},
keywords={Measurement;Codes;Accuracy;Large language models;Computational modeling;Computer architecture;Benchmark testing;Python;Large Language Models (LLMs);Code Explanation;Docstring Generation;Chain of Thought (CoT) Prompting;Code Summarization},
doi={10.1109/ETCC65847.2025.11108633}
}

🚀 Paper Published! Our work "DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation through DocString Generation" has been published at IEEE ETCC 2025!
> 🧵 🚀 NEW PAPER: "DocStringEval: Evaluating the Effectiveness of Language Models for Code Explanation through DocString Generation"
>
> Published at IEEE ETCC 2025! 📄
>
> With @Balajivenky4288
>
> We benchmarked how well LLMs can explain Python code by generating DocStrings.
>
> — Gireesh (@GireeshS22) August 14, 2025
For questions and support, please:
- Reply to our Twitter thread for project-related questions and discussions
- Open an issue on GitHub for technical bugs and feature requests
- Follow us on X (Twitter) for updates and discussions