Reference Hallucination Checker

A multi-step verification pipeline that extracts bibliographic references from research papers (PDFs) using GROBID and verifies their existence using the DBLP API to detect "hallucinated" or incorrect citations.

🚀 Quick Start

Prerequisites

GROBID must be running locally on port 8070:

docker pull grobid/grobid:0.8.2-full
docker run --rm --init -p 8070:8070 grobid/grobid:0.8.2-full

Verify GROBID is running:
```
curl http://localhost:8070/api/isalive
```
(Optional) Set up Gemini API key for advanced verification:
```
export GEMINI_API_KEY="your-api-key"
```

Run the Tool

python3 main_pipeline.py <path_to_pdf>

Example:

python3 main_pipeline.py data/raw/paper.pdf

📂 Project Structure

Reference_Halucinations/
├── assets/                 # Images and static assets
├── data/
│   ├── raw/                # Input PDFs
│   └── output/             # JSON output files (if enabled)
├── extraction/             # Reference extraction modules
│   ├── extractRefData.py   # Sends PDF to GROBID, returns XML
│   ├── extractMetadata.py  # Parses XML to extract full metadata
│   ├── extractTitle.py     # Parses XML to extract paper titles
│   ├── pdfplumber_extract.py # Fallback PDF text extraction
│   └── parser.py           # Reference parsing utilities
├── verification/           # Verification modules
│   ├── dblp.py             # DBLP API queries & classification
│   ├── gemini.py           # Gemini API for advanced verification
│   └── utils.py            # Title cleaning & author matching
├── fluff/                  # Verification reports output
├── tests/                  # Test suite
│   ├── unit/
│   └── integration/
├── main_pipeline.py        # Multi-step verification pipeline (Entry Point)
└── requirements.txt        # Project dependencies

🔄 Verification Pipeline

The verification process follows a multi-stage pipeline designed to minimize false positives while detecting hallucinations. The logic flows from strict API matching to fuzzy metadata comparisons, and finally to AI-based verification if needed.

Step 1: Title Matching

Extracts references from PDF via GROBID
Queries DBLP API with normalized titles
Applies length penalty for short/generic titles
Classifies as: VERIFIED, REVIEW, UNVERIFIED, or SUSPICIOUS

Step 2: Metadata Check (Author & Year)

For references with DBLP candidates, compares author lists
Uses last-name matching with fuzzy comparison
Boosts confidence for matching authors/years
Re-queries DBLP for UNVERIFIED refs (handles transient failures)

Step 3: Regex Re-extraction (Conditional)

Activated if DBLP match is not found (UNVERIFIED)
Extracts raw text from PDF using pdfplumber
Applies regex patterns to find reference titles
Re-verifies against DBLP with corrected titles

Step 4: Gemini Verification (Conditional)

Processed if enabled and references remain REVIEW or UNVERIFIED
Batch processing to avoid rate limiting
Returns verification status based on AI analysis

Step 5: Display Results

Generates comprehensive report with statistics
Sorts references by verification status
Outputs to fluff/verification_report_<timestamp>.txt

📊 Classification Labels

Label	Description
`VERIFIED`	High confidence match (Score ≥0.9 OR ≥0.75 with Author Match)
`REVIEW`	Medium confidence (Score 0.5-0.9) or Ambiguous match
`UNVERIFIED`	No match found or Low confidence (Score <0.5)

🛠 Installation

Create a virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```

📦 Dependencies

requests - HTTP client for GROBID and DBLP APIs
beautifulsoup4 - XML parsing for GROBID output
lxml - XML parser backend
pdfplumber - PDF text extraction for fallback
python-dotenv - Environment variable management for Gemini API keys

⚙️ Configuration

Key thresholds in verification/dblp.py:

Parameter	Value	Description
`SIMILARITY_THRESHOLD`	0.7	Minimum score to consider a match
`AMBIGUITY_GAP`	0.05	Gap between top matches to flag ambiguity

GROBID Title Error Fixes

The tool automatically fixes common GROBID extraction errors:

schemabased → schema-based
prompttuning → prompt-tuning
lowresource → low-resource
And many more compound word fixes

🐳 GROBID Setup

Pull and run GROBID with Docker:

# Pull the full image (includes all models)
docker pull grobid/grobid:0.8.2-full

# Run GROBID server
docker run --rm --init -p 8070:8070 grobid/grobid:0.8.2-full

# Verify it's running
curl http://localhost:8070/api/isalive

GROBID will be available at http://localhost:8070.

📈 Typical Results

On academic CS papers, the tool typically achieves:

70-90% VERIFIED - References found in DBLP
0-5% REVIEW - Need manual verification
10-25% UNVERIFIED - Not in DBLP (statistics journals, books, tech reports)
0-2% SUSPICIOUS - Low confidence matches

Common UNVERIFIED Categories (Not Hallucinations)

Statistics journals (JASA, Annals of Statistics)
GIS/Photogrammetry venues (ISPRS)
Pre-1970 classic papers
Books and book chapters
Technical reports
Dataset citations (e.g., Kaggle)

🚀 Usage

Full pipeline

python3 main_pipeline.py paper.pdf

Skip Gemini steps (if API quota exceeded)

python3 main_pipeline.py paper.pdf --skip-gemini

Skip regex re-extraction

python3 main_pipeline.py paper.pdf --skip-regex

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
assets		assets
extraction		extraction
verification		verification
.gitignore		.gitignore
README.md		README.md
main.py		main.py
main_pipeline.py		main_pipeline.py
paper.pdf		paper.pdf
paper2.pdf		paper2.pdf
paper3.pdf		paper3.pdf
paper4.pdf		paper4.pdf
paper5.pdf		paper5.pdf
paper6.pdf		paper6.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reference Hallucination Checker

🚀 Quick Start

Prerequisites

Run the Tool

📂 Project Structure

🔄 Verification Pipeline

Step 1: Title Matching

Step 2: Metadata Check (Author & Year)

Step 3: Regex Re-extraction (Conditional)

Step 4: Gemini Verification (Conditional)

Step 5: Display Results

📊 Classification Labels

🛠 Installation

📦 Dependencies

⚙️ Configuration

GROBID Title Error Fixes

🐳 GROBID Setup

📈 Typical Results

Common UNVERIFIED Categories (Not Hallucinations)

🚀 Usage

Full pipeline

Skip Gemini steps (if API quota exceeded)

Skip regex re-extraction

About

Uh oh!

Releases

Packages

Languages

Ad1th/Reference_Hallucinations

Folders and files

Latest commit

History

Repository files navigation

Reference Hallucination Checker

🚀 Quick Start

Prerequisites

Run the Tool

📂 Project Structure

🔄 Verification Pipeline

Step 1: Title Matching

Step 2: Metadata Check (Author & Year)

Step 3: Regex Re-extraction (Conditional)

Step 4: Gemini Verification (Conditional)

Step 5: Display Results

📊 Classification Labels

🛠 Installation

📦 Dependencies

⚙️ Configuration

GROBID Title Error Fixes

🐳 GROBID Setup

📈 Typical Results

Common UNVERIFIED Categories (Not Hallucinations)

🚀 Usage

Full pipeline

Skip Gemini steps (if API quota exceeded)

Skip regex re-extraction

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages