A multi-step verification pipeline that extracts bibliographic references from research papers (PDFs) using GROBID and verifies their existence using the DBLP API to detect "hallucinated" or incorrect citations.
- GROBID must be running locally on port 8070:

      docker pull grobid/grobid:0.8.2-full
      docker run --rm --init -p 8070:8070 grobid/grobid:0.8.2-full

- Verify GROBID is running:

      curl http://localhost:8070/api/isalive

- (Optional) Set up a Gemini API key for advanced verification:

      export GEMINI_API_KEY="your-api-key"
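
The same liveness check can be done from Python before starting a run. This is a minimal sketch, not code from the repository: the function name `grobid_is_alive` and its `opener` parameter are hypothetical, and it assumes the `/api/isalive` endpoint answers with the plain text `true` when the server is up.

```python
from urllib.request import urlopen


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0, opener=urlopen):
    """Return True if GROBID's /api/isalive endpoint answers with 'true'.

    `opener` is injectable so the check can be exercised without a server.
    """
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError and HTTPError are OSError subclasses; connection refused,
        # timeouts, and DNS failures are all treated as "not running".
        return False
    return body.strip().lower() == "true"
```

Calling `grobid_is_alive()` with no arguments checks the default local server on port 8070.
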
Run the pipeline on a PDF:

    python3 main_pipeline.py <path_to_pdf>

Example:

    python3 main_pipeline.py data/raw/paper.pdf

Reference_Halucinations/
├── assets/ # Images and static assets
├── data/
│ ├── raw/ # Input PDFs
│ └── output/ # JSON output files (if enabled)
├── extraction/ # Reference extraction modules
│ ├── extractRefData.py # Sends PDF to GROBID, returns XML
│ ├── extractMetadata.py # Parses XML to extract full metadata
│ ├── extractTitle.py # Parses XML to extract paper titles
│ ├── pdfplumber_extract.py # Fallback PDF text extraction
│ └── parser.py # Reference parsing utilities
├── verification/ # Verification modules
│ ├── dblp.py # DBLP API queries & classification
│ ├── gemini.py # Gemini API for advanced verification
│ └── utils.py # Title cleaning & author matching
├── fluff/ # Verification reports output
├── tests/ # Test suite
│ ├── unit/
│ └── integration/
├── main_pipeline.py # Multi-step verification pipeline (Entry Point)
└── requirements.txt # Project dependencies
The verification process follows a multi-stage pipeline designed to minimize false positives while detecting hallucinations. The logic flows from strict API matching to fuzzy metadata comparisons, and finally to AI-based verification if needed.
- Extracts references from PDF via GROBID
- Queries DBLP API with normalized titles
- Applies length penalty for short/generic titles
- Classifies as: VERIFIED, REVIEW, UNVERIFIED, or SUSPICIOUS
- For references with DBLP candidates, compares author lists
- Uses last-name matching with fuzzy comparison
- Boosts confidence for matching authors/years
- Re-queries DBLP for UNVERIFIED refs (handles transient failures)
- Activated if DBLP match is not found (`UNVERIFIED`)
- Extracts raw text from PDF using pdfplumber
- Applies regex patterns to find reference titles
- Re-verifies against DBLP with corrected titles
- Processed if enabled and references remain `REVIEW` or `UNVERIFIED`
- Batch processing to avoid rate limiting
- Returns verification status based on AI analysis
- Generates comprehensive report with statistics
- Sorts references by verification status
- Outputs to `fluff/verification_report_<timestamp>.txt`
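
The title and author matching steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the code in `verification/dblp.py` or `verification/utils.py`: it uses `difflib.SequenceMatcher` as the similarity measure, and the constants `MIN_WORDS` and `PENALTY_PER_WORD`, the 0.85 author threshold, and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

MIN_WORDS = 4            # assumption: shorter titles count as "short/generic"
PENALTY_PER_WORD = 0.05  # assumption: score deducted per missing word


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def title_score(query, candidate):
    """Similarity in [0, 1], reduced when the query title is very short."""
    q, c = normalize_title(query), normalize_title(candidate)
    if not q or not c:
        return 0.0
    score = SequenceMatcher(None, q, c).ratio()
    missing_words = max(0, MIN_WORDS - len(q.split()))
    return max(0.0, score - PENALTY_PER_WORD * missing_words)


def last_name(author):
    """Extract a lowercase last name from 'First Last' or 'Last, First'."""
    author = author.strip()
    if "," in author:
        return author.split(",", 1)[0].strip().lower()
    parts = author.split()
    return parts[-1].lower() if parts else ""


def authors_match(cited, found, threshold=0.85):
    """True if any cited last name fuzzily matches any candidate last name."""
    cited_names = [last_name(a) for a in cited]
    found_names = [last_name(a) for a in found]
    return any(
        c and f and SequenceMatcher(None, c, f).ratio() >= threshold
        for c in cited_names
        for f in found_names
    )
```

Subtracting a penalty from short titles keeps one- or two-word titles, which match many unrelated DBLP entries, from reaching a high score on title similarity alone.
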
| Label | Description |
|---|---|
| `VERIFIED` | High confidence match (score ≥ 0.9, or ≥ 0.75 with an author match) |
| `REVIEW` | Medium confidence (score 0.5-0.9) or ambiguous match |
| `UNVERIFIED` | No match found or low confidence (score < 0.5) |
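
The thresholds in this table translate into a small decision function. The sketch below is one possible reading of the table, not the repository's implementation: the function name `classify` is hypothetical, and the order of the checks (a low score wins over ambiguity, ambiguity wins over a high score) is an assumption. `SUSPICIOUS` is left out because the table does not define it.

```python
def classify(score, author_match=False, ambiguous=False):
    """Map a similarity score to a label using the thresholds in the table."""
    if score < 0.5:
        return "UNVERIFIED"   # no match or low confidence
    if ambiguous:
        return "REVIEW"       # assumption: ambiguity overrides a high score
    if score >= 0.9 or (score >= 0.75 and author_match):
        return "VERIFIED"
    return "REVIEW"           # medium confidence
```

For example, a score of 0.8 yields `REVIEW` on its own and `VERIFIED` once the author lists also match.
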

- Create a virtual environment:

      python3 -m venv .venv
      source .venv/bin/activate

- Install dependencies:

      pip install -r requirements.txt

- `requests` - HTTP client for GROBID and DBLP APIs
- `beautifulsoup4` - XML parsing for GROBID output
- `lxml` - XML parser backend
- `pdfplumber` - PDF text extraction for fallback
- `python-dotenv` - Environment variable management for Gemini API keys

Key thresholds in verification/dblp.py:
| Parameter | Value | Description |
|---|---|---|
| `SIMILARITY_THRESHOLD` | 0.7 | Minimum score to consider a match |
| `AMBIGUITY_GAP` | 0.05 | Gap between top matches to flag ambiguity |
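
To illustrate how these two parameters could interact, here is a minimal sketch. The constant names and values come from the table; the function `pick_candidate` and its return shape are hypothetical, and the actual logic in `verification/dblp.py` may differ.

```python
SIMILARITY_THRESHOLD = 0.7  # minimum score to consider a match
AMBIGUITY_GAP = 0.05        # top two scores closer than this are ambiguous


def pick_candidate(scored):
    """Pick the best candidate from a list of (title, score) pairs.

    Returns (best_title, ambiguous); best_title is None when no candidate
    reaches SIMILARITY_THRESHOLD.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < SIMILARITY_THRESHOLD:
        return None, False
    # Flag ambiguity when the runner-up scores almost as high as the winner.
    ambiguous = len(ranked) > 1 and (ranked[0][1] - ranked[1][1]) < AMBIGUITY_GAP
    return ranked[0][0], ambiguous
```

An ambiguous result would then be a natural candidate for the `REVIEW` label rather than `VERIFIED`.
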
The tool automatically fixes common GROBID extraction errors:
- `schemabased` → `schema-based`
- `prompttuning` → `prompt-tuning`
- `lowresource` → `low-resource`
- And many more compound word fixes

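
A lookup-table approach is one simple way to apply such fixes. The sketch below is an assumption about how this could work, not the repository's code: it covers only the three examples listed above, and the names `COMPOUND_FIXES` and `fix_compounds` are hypothetical.

```python
import re

# Only the three fixes named in this README; the real list is longer.
COMPOUND_FIXES = {
    "schemabased": "schema-based",
    "prompttuning": "prompt-tuning",
    "lowresource": "low-resource",
}

# Match any known run-together compound as a whole word, ignoring case.
_COMPOUND_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, COMPOUND_FIXES)) + r")\b",
    re.IGNORECASE,
)


def fix_compounds(title):
    """Re-insert hyphens that were dropped during extraction."""
    def _replace(match):
        word = match.group(0)
        fixed = COMPOUND_FIXES[word.lower()]
        # Preserve a leading capital, e.g. at the start of a title.
        if word[0].isupper():
            fixed = fixed[0].upper() + fixed[1:]
        return fixed

    return _COMPOUND_RE.sub(_replace, title)
```

Matching on word boundaries avoids rewriting longer words that merely contain one of the listed strings.
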
Pull and run GROBID with Docker:

    # Pull the full image (includes all models)
    docker pull grobid/grobid:0.8.2-full

    # Run GROBID server
    docker run --rm --init -p 8070:8070 grobid/grobid:0.8.2-full

    # Verify it's running
    curl http://localhost:8070/api/isalive

GROBID will be available at http://localhost:8070.

On academic CS papers, the tool typically achieves:
- 70-90% VERIFIED - References found in DBLP
- 0-5% REVIEW - Need manual verification
- 10-25% UNVERIFIED - Not in DBLP (statistics journals, books, tech reports)
- 0-2% SUSPICIOUS - Low confidence matches

Reference types that are typically not indexed in DBLP:
- Statistics journals (JASA, Annals of Statistics)
- GIS/Photogrammetry venues (ISPRS)
- Pre-1970 classic papers
- Books and book chapters
- Technical reports
- Dataset citations (e.g., Kaggle)
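
To compare a run against the typical status percentages given earlier in this section, the share of each label can be computed from the list of per-reference results. This is a small sketch; `status_breakdown` and `STATUS_ORDER` are hypothetical names and not part of the report generator.

```python
from collections import Counter

STATUS_ORDER = ["VERIFIED", "REVIEW", "UNVERIFIED", "SUSPICIOUS"]


def status_breakdown(labels):
    """Return the percentage of references per status label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        status: (100.0 * counts[status] / total if total else 0.0)
        for status in STATUS_ORDER
    }
```

A paper whose `UNVERIFIED` share is far above the typical range is not necessarily suspect; it may simply cite many of the source types listed above.
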
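
Usage examples:

    # Run the full pipeline
    python3 main_pipeline.py paper.pdf

    # Skip the Gemini verification stage
    python3 main_pipeline.py paper.pdf --skip-gemini

    # Skip the regex fallback stage
    python3 main_pipeline.py paper.pdf --skip-regex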
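
For readers curious how these flags might map to pipeline stages, here is a hedged sketch using `argparse`. Only the positional PDF path and the `--skip-gemini` and `--skip-regex` flags come from this README; the stage names and the functions `build_parser` and `enabled_stages` are hypothetical, and `main_pipeline.py` may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser accepting the PDF path and the two skip flags."""
    parser = argparse.ArgumentParser(
        prog="main_pipeline.py",
        description="Verify the references of a PDF against DBLP.",
    )
    parser.add_argument("pdf", help="path to the input PDF")
    parser.add_argument("--skip-gemini", action="store_true",
                        help="do not run the Gemini verification stage")
    parser.add_argument("--skip-regex", action="store_true",
                        help="do not run the regex fallback stage")
    return parser


def enabled_stages(args):
    """List the stages that would run, in order (stage names are illustrative)."""
    stages = ["dblp", "author_match"]
    if not args.skip_regex:
        stages.append("regex_fallback")
    if not args.skip_gemini:
        stages.append("gemini")
    stages.append("report")
    return stages
```

With `--skip-gemini`, the run never calls the Gemini API, so any references left as `REVIEW` or `UNVERIFIED` go to the report without the AI-based stage.
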
