Certification Challenge Project – AI Engineering Bootcamp Cohort 8
An intelligent question-answering system for GDELT (Global Database of Events, Language, and Tone) documentation, powered by Retrieval-Augmented Generation.
Core Docs
- CLAUDE.md – canonical developer guide for this repository
- Full Deliverables – complete certification submissions
- Task Rubric – 100-point grading breakdown
- docs/initial-architecture.md – early conceptual design (frozen)
Architecture Documentation (Auto-Generated)
Located in the architecture/ directory – produced by the Claude Agent SDK Analyzer:
| File | Purpose |
|---|---|
| architecture/README.md | Top-level architecture summary and system lifecycle |
| architecture/docs/01_component_inventory.md | Detailed module inventory |
| architecture/docs/03_data_flows.md | Ingestion → retrieval → evaluation flows |
| architecture/docs/04_api_reference.md | Public API reference |
| architecture/diagrams/02_architecture_diagrams.md | Layered system & runtime dependency diagrams |
Note: docs/initial-architecture.md captures the original design sketch before automation and is no longer updated. The current architecture/ tree is generated automatically from the live codebase.
- Python 3.11+
- uv package manager
- OPENAI_API_KEY (required)
- COHERE_API_KEY (optional, for reranking)
```bash
git clone https://github.com/aie8-cert-challenge/gdelt-knowledge-base.git
cd cert-challenge
uv venv --python 3.11
source .venv/bin/activate   # Linux/WSL/Mac
# .venv\Scripts\activate    # Windows
uv pip install -e .

cp .env.example .env
# Edit .env with:
# OPENAI_API_KEY=your_key_here
# COHERE_API_KEY=optional

# Interactive LangGraph Studio UI
uv add langgraph-cli[inmem]
uv run langgraph dev --allow-blocking
# → http://localhost:2024
# Studio: https://smith.langchain.com/studio/?baseUrl=http://127.0.0.1:2024

# CLI evaluation
python scripts/run_eval_harness.py
# or
make eval

# Quick validation
make validate
```

This system follows a 5-layer architecture:
| Layer | Purpose | Key Modules |
|---|---|---|
| Configuration | External services (OpenAI, Qdrant, Cohere) | src/config.py |
| Data | Ingestion + persistence (HF datasets + manifest) | src/utils/ |
| Retrieval | Multi-strategy search (naive, BM25, ensemble, rerank) | src/retrievers.py |
| Orchestration | LangGraph workflows (retrieve → generate) | src/graph.py, src/state.py |
| Execution | Scripts and LangGraph Server entrypoints | scripts/, app/graph_app.py |
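A minimal sketch of the retrieve → generate workflow in the Orchestration layer; the node bodies below are placeholders, and the real state schema and nodes live in src/state.py and src/graph.py:

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class RAGState(TypedDict):
    """Illustrative state schema; the real one is defined in src/state.py."""
    question: str
    context: List[str]
    answer: str


def retrieve(state: RAGState) -> dict:
    # Placeholder: the real node calls one of the retriever strategies.
    return {"context": ["...retrieved passages..."]}


def generate(state: RAGState) -> dict:
    # Placeholder: the real node prompts the LLM with question + context.
    return {"answer": f"Answer to: {state['question']}"}


builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("generate", generate)
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "generate")
builder.add_edge("generate", END)
graph = builder.compile()

result = graph.invoke({"question": "What is the GDELT GKG?", "context": [], "answer": ""})
print(result["answer"])
```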
Design Principles
- Factory pattern – deferred initialization of retrievers/graphs
- Singleton pattern – resource caching (@lru_cache)
- Strategy pattern – interchangeable retriever implementations

See the generated diagrams in architecture/diagrams/02_architecture_diagrams.md.
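To illustrate the Factory and Singleton principles, here is a hedged sketch that reuses the function names from the design-patterns table near the end of this README; the actual implementations in src/config.py and src/retrievers.py may differ in detail:

```python
from functools import lru_cache

from langchain_openai import ChatOpenAI, OpenAIEmbeddings


@lru_cache(maxsize=1)
def get_llm() -> ChatOpenAI:
    """Singleton: one cached client, reused everywhere (deterministic generation)."""
    return ChatOpenAI(model="gpt-4.1-mini", temperature=0)


@lru_cache(maxsize=1)
def get_embeddings() -> OpenAIEmbeddings:
    """Singleton: one cached embedding client for the 1536-dim vectors."""
    return OpenAIEmbeddings(model="text-embedding-3-small")


def create_retrievers(vector_store) -> dict:
    """Factory: build retriever strategies on demand, keyed by name (sketch only)."""
    # The real factory also wires up BM25, ensemble, and Cohere rerank retrievers.
    return {"naive": vector_store.as_retriever(search_kwargs={"k": 5})}
```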
The evaluation harness runs the following pipeline:

- Load 12 QA pairs (golden testset)
- Load 38 source docs from Hugging Face
- Create Qdrant vector store
- Build 4 retriever strategies
- Execute 48 RAG queries
- Evaluate with RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall); a sketch of this step follows the list
- Persist results to deliverables/evaluation_evidence/

Cost: roughly $5–6 and 20–30 minutes per run.
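In outline, the RAGAS step of the harness looks roughly like this. This is a minimal sketch, assuming the ragas 0.2.x API, an OPENAI_API_KEY in the environment, and the column names of the published evaluation-inputs dataset:

```python
from datasets import load_dataset
from ragas import EvaluationDataset, evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Evaluation inputs for one retriever (user_input, retrieved_contexts, response, reference)
rows = load_dataset("dwb2023/gdelt-rag-evaluation-inputs", split="train")
rows = rows.filter(lambda r: r["retriever"] == "naive")

eval_ds = EvaluationDataset.from_list(
    [
        {
            "user_input": r["user_input"],
            "retrieved_contexts": r["retrieved_contexts"],
            "response": r["response"],
            "reference": r["reference"],
        }
        for r in rows
    ]
)

result = evaluate(
    eval_ds,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```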
Every major stage in this project, from data ingestion to evaluation runs, is signed by machine-readable manifests to ensure traceability, data integrity, and AI assistant accountability.
The ingestion manifest (data/interim/manifest.json) records the creation of intermediate artifacts (raw → interim → HF datasets) with:

- Environment fingerprint (langchain, ragas, pyarrow, etc.)
- Source & golden testset paths and SHA-256 hashes
- Quick schema preview of extracted columns
- Hugging Face lineage metadata linking to:
  - dwb2023/gdelt-rag-sources-v2
  - dwb2023/gdelt-rag-golden-testset-v2
This ensures that every RAGAS evaluation run references a reproducible, signed dataset snapshot.
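The fingerprinting itself is ordinary SHA-256 hashing plus package-version capture. A hedged sketch of the idea follows; the field names are illustrative, not the exact manifest schema:

```python
import hashlib
import json
from importlib.metadata import version
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large Parquet files stay memory-friendly."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


manifest = {
    "environment": {pkg: version(pkg) for pkg in ("langchain", "ragas", "pyarrow")},
    "artifacts": {str(p): sha256_of(p) for p in Path("data/interim").glob("*.parquet")},
}
Path("data/interim/manifest.json").write_text(json.dumps(manifest, indent=2))
```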
The run manifest (RUN_MANIFEST.json) documents the execution of each evaluation:

- Models: gpt-4.1-mini (RAG generation), text-embedding-3-small (embeddings)
- Retrievers: naive, BM25, ensemble, cohere_rerank
- Deterministic configuration (temperature=0)
- Metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall
- SHA links back to the ingestion manifest for complete lineage
Together, these manifests create a verifiable audit trail between:
source PDFs → testset → vector store → evaluation results

By preserving SHA-256 fingerprints and HF lineage, this mechanism "signs" the dataset state, keeping automated assistants and evaluation scripts consistent, comparable, and honest.
| Component | Technology | Purpose |
|---|---|---|
| LLM | OpenAI GPT-4.1-mini | Deterministic RAG generation |
| Embeddings | text-embedding-3-small | 1536-dim semantic vectors |
| Vector DB | Qdrant | Fast cosine search |
| Orchestration | LangGraph 0.6.7 + LangChain 0.3.19+ | Graph-based workflows |
| Evaluation | RAGAS 0.2.10 (pinned) | Stable evaluation API |
| Monitoring | LangSmith | LLM trace observability |
| Data | Hugging Face Datasets | Reproducible versioned sources |
| UI | LangGraph Studio UI | Prototype chat interface |
Source of truth: deliverables/evaluation_evidence/RUN_MANIFEST.json (auto-generated each run).
| Retriever | Faithfulness | Answer Relevancy | Context Precision | Context Recall | Avg |
|---|---|---|---|---|---|
| Cohere Rerank | 95.8% | 94.8% | 93.1% | 96.7% | 95.1% |
| Ensemble | 93.4% | 94.6% | 87.5% | 98.8% | 93.6% |
| BM25 | 94.2% | 94.8% | 85.8% | 98.8% | 93.4% |
| Naive | 94.0% | 94.4% | 88.5% | 98.8% | 93.9% |
Note on retriever terminology: The current codebase implements 4 retrievers (naive, bm25, ensemble, cohere_rerank). In evaluation reporting and published datasets, "baseline" may appear as an alias for the "naive" retriever, which serves as the performance comparison reference using simple dense vector search.
Provenance: dataset paths & SHA-256 fingerprints are recorded in data/interim/manifest.json.
Reproducibility: these numbers are written by the evaluation run into RUN_MANIFEST.json.
TLDR summary: data/processed/comparative_ragas_results.csv
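To skim that summary without rerunning anything, something like the following works; a minimal sketch assuming the CSV holds one row per retriever with the four metric columns:

```python
import pandas as pd

summary = pd.read_csv("data/processed/comparative_ragas_results.csv")
print(summary.to_string(index=False))
```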
This project publishes 4 datasets to HuggingFace Hub for reproducibility and benchmarking.
Scientific Value: These datasets provide the first publicly available evaluation suite for GDELT-focused RAG systems, enabling reproducible benchmarking of retrieval strategies with complete evaluation transparency.
1. dwb2023/gdelt-rag-sources-v2 - 38 GDELT documentation pages
- Content: GDELT GKG 2.1 architecture docs, knowledge graph construction guides, Baltimore Bridge Collapse case study
- Format: Parquet (analytics), JSONL (human-readable), HF Datasets (fast loading)
- Schema:
page_content(1.5k-5.2k chars),metadata(author, title, page, creation_date, etc.) - Use: Populate vector stores, document chunking experiments, GDELT research
- License: Apache 2.0
2. dwb2023/gdelt-rag-golden-testset-v2 - 12 QA pairs
- Content: Synthetically generated questions (RAGAS 0.2.10), ground truth answers, reference contexts
- Topics: GDELT data formats, Translingual features (65 languages), date extraction, proximity context, emotions
- Schema:
user_input(question),reference_contexts(ground truth passages),reference(answer),synthesizer_name - Use: Benchmark RAG systems using RAGAS metrics, validate retrieval performance
- License: Apache 2.0
3. dwb2023/gdelt-rag-evaluation-inputs - 60 evaluation records
- Content: Consolidated RAGAS inputs from 5 retrieval strategies (baseline, naive, BM25, ensemble, cohere_rerank)
- Schema:
retriever,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name - Use: Benchmark new retrievers, analyze retrieval quality, reproduce certification results, debug RAG pipelines
- License: Apache 2.0
4. dwb2023/gdelt-rag-evaluation-metrics - 60 evaluation records with RAGAS scores
- Content: Detailed RAGAS evaluation results with per-question metric scores
- Schema: All evaluation-inputs fields PLUS
faithfulness,answer_relevancy,context_precision,context_recall(float64, 0-1) - Key Findings: Cohere Rerank winner (95.08% avg), Baseline (93.92% avg), Best Precision: Cohere (+4.55% vs baseline)
- Use: Performance analysis, error analysis, train retrieval models with RAGAS scores as quality labels, RAG evaluation research
- License: Apache 2.0
Why These Datasets Matter:
- Reproducibility: Complete evaluation pipeline with versioned datasets and SHA-256 checksums
- Benchmarking: Standard testset for comparing retrieval strategies across 4 RAGAS metrics
- Quality Labels: RAGAS scores serve as training labels for learning-to-rank models
- Domain-Specific: GDELT knowledge graph QA pairs rare in existing RAG datasets
- Evaluation Transparency: Full evaluation inputs + metrics for analysis and debugging
- Multi-Format: Parquet (analytics), JSONL (human-readable), HF Datasets (fast loading)
Research Applications:
- RAG Researchers: Benchmark retrieval strategies, analyze failure modes, validate hypotheses
- GDELT Analysts: Build Q&A systems, train domain-specific embeddings, extend to other GDELT resources
- Evaluation Researchers: Study RAGAS behavior, compare automatic vs human metrics, develop new methodologies
- Educators: Teach RAG best practices, demonstrate comparative analysis, illustrate data provenance
```python
from datasets import load_dataset
import pandas as pd

# Load evaluation inputs
eval_ds = load_dataset("dwb2023/gdelt-rag-evaluation-inputs")

# Filter by retriever
cohere_evals = eval_ds['train'].filter(lambda x: x['retriever'] == 'cohere_rerank')
print(f"Cohere Rerank: {len(cohere_evals)} examples")

# Load detailed results with RAGAS metric scores
results_ds = load_dataset("dwb2023/gdelt-rag-evaluation-metrics")

# Analyze performance by retriever
df = results_ds['train'].to_pandas()
performance = df.groupby('retriever')[
    ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
].mean()
print(performance)
```

Output:
```
               faithfulness  answer_relevancy  context_precision  context_recall
retriever
baseline             0.9351            0.9335             0.9459          0.9410
bm25                 0.9462            0.9583             0.9519          0.9511
cohere_rerank        0.9508            0.9321             0.9670          0.9668
ensemble             0.9424            0.9542             0.9477          0.9486
naive                0.9351            0.9335             0.9459          0.9410
```
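The source corpus and golden testset load the same way; a short sketch, assuming both datasets expose a default train split:

```python
from datasets import load_dataset

sources = load_dataset("dwb2023/gdelt-rag-sources-v2", split="train")
golden = load_dataset("dwb2023/gdelt-rag-golden-testset-v2", split="train")

print(len(sources), "source pages,", len(golden), "QA pairs")
print(golden[0]["user_input"])  # one synthetically generated question
```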
If you use these datasets in your research, please cite:
```bibtex
@misc{branson2025gdelt-rag-datasets,
  author       = {Branson, Don},
  title        = {GDELT RAG Evaluation Datasets: Benchmarking Retrieval Strategies for Knowledge Graph Q\&A},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dwb2023}},
  note         = {Datasets: gdelt-rag-sources-v2, gdelt-rag-golden-testset-v2, gdelt-rag-evaluation-inputs, gdelt-rag-evaluation-metrics}
}

@article{myers2025gdelt,
  title   = {Talking to GDELT Through Knowledge Graphs},
  author  = {Myers, A. and Vargas, M. and Aksoy, S. G. and Joslyn, C. and Wilson, B. and Burke, L. and Grimes, T.},
  journal = {arXiv preprint arXiv:2503.07584v3},
  year    = {2025}
}
```

Provenance Chain:
- Source: arXiv:2503.07584v3 "Talking to GDELT Through Knowledge Graphs" (PDF)
- Extraction: PyMuPDFLoader (page-level chunking)
- Testset Generation: RAGAS 0.2.10 synthetic data generation
- Evaluation: GPT-4.1-mini (LLM), text-embedding-3-small (embeddings), Cohere rerank-v3.5
- Validation: SHA-256 checksums in data/interim/manifest.json
Quality Guarantees:
- ✅ RAGAS 0.2.10 schema validation
- ✅ SHA-256 fingerprints for data integrity
- ✅ Manifest tracking (timestamps, model versions, package versions)
- ✅ 100% validation pass rate (make validate)
- ✅ Apache 2.0 licensed (open access)
Versioning: the -v2 suffix indicates the second iteration after a fresh ingestion. Pin to a specific revision for reproducibility: load_dataset("dwb2023/gdelt-rag-sources-v2", revision="abc123")
Known Limitations:
- Domain-specific (GDELT documentation), may not generalize to other domains
- Synthetic questions (RAGAS-generated, not human-authored)
- English-only (despite GDELT's multilingual capabilities)
- Small scale (12 evaluation questions - sufficient for comparative analysis, not large-scale benchmarking)
- Model bias (RAGAS metrics computed using GPT-4, inherits model biases)
- Temporal snapshot (based on GDELT documentation as of January 2025)
Dataset Cards: See HuggingFace for complete metadata and schemas
| Path | Purpose |
|---|---|
| src/ | Core modular RAG framework (config, retrievers, graph, state, utils) |
| scripts/ | Executable workflows for data ingestion, evaluation, and validation |
| architecture/ | Auto-generated architecture snapshots (Claude Agent SDK Analyzer) |
| ├── 00_README.md | System overview and lifecycle summary |
| ├── docs/ | Component inventory, data flows, and API reference |
| └── diagrams/ | Mermaid dependency and system diagrams |
| docs/ | Certification artifacts and legacy design documentation |
| └── initial-architecture.md | Original hand-drawn architecture sketch (frozen, not updated) |
| data/ | Complete RAG dataset lineage and provenance chain (Parquet-first) |
| ├── raw/ | Original GDELT PDFs |
| ├── interim/ | Extracted text, Hugging Face datasets, and manifest fingerprints |
| │   └── manifest.json | Ingestion provenance manifest (dataset lineage, SHA-256 hashes) |
| └── processed/ | Working data (Parquet, ZSTD compressed): evaluation results + RUN_MANIFEST.json |
| deliverables/ | High-level evaluation reports and comparative analyses; derived data (CSV for human review, regenerable via make deliverables) |
| └── evaluation_evidence/ | Human-readable CSV files generated from data/processed/ |
|     └── RUN_MANIFEST.json | Copied from data/processed/ |
| app/ | Lightweight LangGraph API (graph_app.py) and entrypoint |
| Makefile | Task automation for environment setup, validation, and architecture snapshots |
| docker-compose.yml | Local container configuration for LangGraph + Qdrant stack |
I created a Claude Agent SDK-based multi-agent process to produce reproducible architecture snapshots. (Still a prototype, but it has already helped refine my architectural workflow.)

```bash
python -m ra_orchestrators.architecture_orchestrator "GDELT architecture"
```

It generates a comprehensive repository analysis in ra_output/:
- Component inventory
- Architecture diagrams
- Data flow analysis
- API documentation
- Final synthesis
| Path | Purpose |
|---|---|
| ra_agents/ | Individual agent definitions for design and architecture workflows |
| ra_orchestrators/ | Multi-agent orchestration logic coordinating Claude SDK agents |
| ra_tools/ | Tools that extend agent capabilities via MCP and external APIs |
| ra_output/ | Generated artifacts and agentic outputs (documentation drafts, diagrams, etc.) |
| Pattern | Purpose | Example |
|---|---|---|
| Factory | Deferred initialization | create_retrievers() / build_graph() |
| Singleton | Cached resources (@lru_cache) | get_llm(), get_qdrant() |
| Strategy | Swap retrieval algorithms | 4 retrievers share .invoke() API |
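Because every strategy satisfies the same runnable interface, callers never branch on the retriever type. A self-contained sketch using LangChain's BM25Retriever (assumes langchain-community and rank_bm25 are installed); the dense, ensemble, and rerank strategies expose the identical invoke call:

```python
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

docs = [
    Document(page_content="GDELT GKG 2.1 records themes with proximity context."),
    Document(page_content="The golden testset contains 12 synthetic QA pairs."),
]

# Any strategy object -- sparse BM25 here, dense / ensemble / Cohere rerank in the
# real app -- is called the same way, so swapping strategies is configuration-only.
retriever = BM25Retriever.from_documents(docs, k=1)
results = retriever.invoke("How is proximity context encoded?")
print(results[0].page_content)
```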
Apache 2.0 – see LICENSE
Contact: Don Branson (dwb2023) – AI Engineering Bootcamp Cohort 8
This repository distinguishes between historical design artifacts and current architecture snapshots:
- docs/initial-architecture.md – conceptual blueprint (frozen)
- architecture/ – live system documentation (auto-generated)
This separation ensures long-term clarity and traceability of architectural evolution.
