This repository is part of a Master's thesis in AI: "Enhancing the Reasoning of LLMs in Materials Discovery via Causal Contracts".
This repository contains a modular, DAG-based framework for experimenting with pipelines that test and improve the reasoning capabilities of large language models (LLMs). The system offers two end-to-end pipelines:
- Document Pipeline – downloads papers, extracts pages, performs multi-stage analysis, and synthesizes structured artifacts (semantic summaries and contracts) before generating QA datasets.
- Answer & Evaluation Pipeline – answers the generated questions under configurable context settings and evaluates performance with transparent, reproducible artifacts.
The pipelines are built from interoperable agents orchestrated by a lightweight DAG runner, making it easy to plug in new components, models, or evaluation strategies. Code quality, reproducibility, and documentation are priorities.
Paper: coming soon
- DAG-orchestrated research pipelines with explicit data flow configuration between tasks.
- Agentic design: discrete agents for download, extraction, analysis, contract writing, QA generation, answering, and evaluation.
- Reproducible runs: each experiment writes all inputs/outputs to a timestamped folder under `runs/`.
- Configurable model runtime via CLI flags (`--model`, `--temperature`, `--retries`).
- Context controls for ablations (e.g., raw text vs. additional context flags).
- Python 3.10+
- macOS/Linux/WSL2 (Windows native should also work)
- Recommended: virtual environment (Conda/venv)
- Clone the repository

  ```bash
  git clone https://github.com/stable-reasoning/ai4materials
  cd ai4materials
  ```

- Create and activate a Conda environment

  ```bash
  conda create -n ai4materials python=3.10 -y
  conda activate ai4materials
  ```

- Install dependencies

  Using pip (from `requirements.txt`):

  ```bash
  pip install -r requirements.txt
  ```

  Or, if you maintain an `environment.yml`:

  ```bash
  conda env update -n ai4materials -f environment.yml
  ```

Tip: If you're on Apple Silicon or using CUDA, prefer installing any platform-specific packages via conda first, then fall back to pip.
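For example (the package name below is only a placeholder for whatever platform-specific dependency you actually need; it is not taken from `requirements.txt`):

```bash
# Hypothetical conda-first install: grab the platform-specific wheel from conda-forge,
# then let pip fill in the remaining pure-Python dependencies.
conda install -c conda-forge pymupdf -y
pip install -r requirements.txt
```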
Minimal configuration is required for local runs. By default, artifacts are written to ./runs.
Ensure the directories under data/ and test_data/ exist and contain the input files referenced below.
Since the code calls external LLM services, an API key is required. The software accesses LLM services via LiteLLM, and at minimum `OPENAI_API_KEY` must be set. Copy `.env.copy` to `.env` and set the environment variables there.
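A minimal sketch of that setup (the key value is a placeholder for your own):

```bash
# Create the local .env from the template and add the key named in this README.
cp .env.copy .env
echo 'OPENAI_API_KEY=<your-openai-key>' >> .env
```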
Document Pipeline – given a list of paper URLs:

```bash
python app.py document --working-dir runs \
  --papers ./test_data/papers_1.lst \
  --run-id doc_pipeline_test \
  --model openai/o4-mini --temperature 1.0 --retries 3
```

Answer & Evaluation Pipeline – given a QA dataset and contracts:

```bash
python app.py answer --working-dir runs \
  --contracts ./runs/doc_pipeline_test/contract_generation/contracts.json \
  --dataset ./runs/doc_pipeline_test/question_generation/qa_dataset.json \
  --run-id qa_test \
  --flags RAW_TEXT \
  --model openai/o4-mini --temperature 1.0 --retries 3
```

Design Pipeline – generates a list of candidate materials on the basis of contracts:

```bash
python app.py design --working-dir runs \
  --contracts ./runs/doc_pipeline_test/contract_generation/contract_1.json \
  --contract_id 1-0 \
  --run-id design_pipeline \
  --model openai/o3 --temperature 1.0 --retries 3
```

Each command creates a timestamped experiment folder inside `runs/` that contains all intermediate and final artifacts.
The LLM client is implemented with the LiteLLM library, which makes the tool model-agnostic. When using more than one provider, the appropriate API key must be supplied for each.
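For instance, routing the answer step to a second provider only requires that provider's key plus a different `--model` string in LiteLLM's `provider/model_name` format. The Anthropic model and environment variable below are illustrative assumptions, not part of this repository:

```bash
# Assumption: ANTHROPIC_API_KEY is LiteLLM's standard variable for Anthropic;
# add it to .env alongside OPENAI_API_KEY before running.
python app.py answer --working-dir runs \
  --contracts ./runs/doc_pipeline_test/contract_generation/contracts.json \
  --dataset ./runs/doc_pipeline_test/question_generation/qa_dataset.json \
  --run-id qa_test_claude \
  --flags RAW_TEXT \
  --model anthropic/claude-3-5-sonnet-20240620 --temperature 1.0 --retries 3
```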
Goal: transform raw papers into structured, analysisβready artifacts and generate a highβquality QA dataset for reasoning evaluation.
Stages (agents):
- DownloadAgent – fetches documents from the URLs listed in `--papers`.
- ExtractionAgent – extracts pages and creates normalized document IDs.
- DocumentAnalyzerAgent – runs first-pass analysis on the processed documents.
- SemanticAnalyzerAgent – performs deeper semantic analysis to derive structured representations.
- ContractWriterAgent – synthesizes contracts (concise, schema-constrained summaries) to support controlled QA.
- QADatasetGeneratorAgent – produces a QA dataset aligned with the contracts and selected context flags.
- QADesignerAgent – a demo agent that produces a list of candidate materials from a contract.
Outputs: processed IDs, semantic documents, contracts, and a generated QA dataset – all stored under the current run directory.
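A hedged sketch of inspecting those artifacts after a run. The paths reuse the `--run-id doc_pipeline_test` example from the Usage section; the JSON schemas are defined by the agents, so only the top-level shape is printed:

```python
# Peek at the contracts and QA dataset produced by a finished document-pipeline run.
import json
from pathlib import Path

run_dir = Path("runs/doc_pipeline_test")

contracts = json.loads((run_dir / "contract_generation" / "contracts.json").read_text())
qa_dataset = json.loads((run_dir / "question_generation" / "qa_dataset.json").read_text())

print("contracts:", type(contracts).__name__, len(contracts))
print("qa_dataset:", type(qa_dataset).__name__, len(qa_dataset))
```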
Goal: answer questions under specified context settings and evaluate performance.
Stages (agents):
- QAAnswerAgent – answers questions from a QA dataset using the selected model configuration and context flags.
- QAEvaluationAgent – computes evaluation metrics and produces analysis artifacts.
Outputs: model answers and evaluation reports (JSON/CSV/plots as configured) under the current run directory.
- DAG / DAGRunner – schedules and executes agents given explicit data dependencies; ensures artifact lineage, reproducibility, and fault isolation (a toy illustration of the idea follows this list).
- Composability – use the DAG builders in `app.py` as examples to assemble new experimental pipelines.
- `utils.prompt_manager.PromptManager` – centralizes prompts used by analysis/answering agents.
- `utils.common.ModelConfig` – encapsulates model runtime knobs (`name`, `temperature`, `retries`).
- `middleware.ImageStorage` – manages image assets produced along the pipeline.
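The sketch below is illustrative only: a self-contained toy runner that shows the schedule-by-dependencies idea in a few lines. The names `Task` and `run_dag` are hypothetical and do not mirror the classes in `core/`; see the builders in `app.py` for the real pipelines.

```python
# Toy DAG runner: each task declares its upstream dependencies, and a task runs
# only once all of its dependencies have produced their artifacts.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    name: str
    fn: Callable[[dict], dict]           # consumes upstream artifacts, returns new ones
    deps: list[str] = field(default_factory=list)

def run_dag(tasks: list[Task]) -> dict:
    done: dict[str, dict] = {}
    pending = {t.name: t for t in tasks}
    while pending:
        ready = [t for t in pending.values() if all(d in done for d in t.deps)]
        if not ready:
            raise ValueError("cycle or missing dependency in DAG")
        for task in ready:
            inputs = {d: done[d] for d in task.deps}   # explicit data flow between tasks
            done[task.name] = task.fn(inputs)
            del pending[task.name]
    return done

# Example mirroring the shape of the document pipeline: download -> extract -> analyze.
artifacts = run_dag([
    Task("download", lambda _: {"pdfs": ["1706.03762.pdf"]}),
    Task("extract", lambda a: {"pages": len(a["download"]["pdfs"])}, deps=["download"]),
    Task("analyze", lambda a: {"summary": f"{a['extract']['pages']} document(s) analyzed"}, deps=["extract"]),
])
print(artifacts["analyze"]["summary"])
```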
```
<repo-root>/
├── agents/            # Agent implementations (download, extract, analyze, QA, eval, ...)
├── core/              # DAG & runtime abstractions
├── data/              # Generated artifacts (contracts, datasets, etc.)
├── docs/              # Technical documentation
├── docucache/         # Document cache (see layout below)
├── middleware/        # Shared services (e.g., image storage)
├── notebooks/         # Analytics for thesis
├── runs/              # Experiment outputs (created at runtime)
├── test_data/         # Test lists / small fixtures (e.g., paper URLs)
├── utils/             # Prompt manager, settings, model config, logging helpers
├── requirements.txt   # Python dependencies
└── app.py             # CLI entrypoint: build & run DAG pipelines
```
```
docucache/
├── metadata.db          <-- The SQLite database file
├── 1/                   <-- First paper's folder (ID from DB)
│   ├── assets/
│   ├── pages/
│   ├── tmp/
│   └── 1706.03762.pdf
├── 2/
│   ├── assets/
│   ├── pages/
│   ├── tmp/
│   └── 2203.02155.pdf
└── 3/
    ├── assets/
    ├── pages/
    ├── tmp/
    └── 2307.09288.pdf
```

Every run produces a folder `runs/<pipeline-name>-YYYYMMDD-HHMMSS/` containing:
- Inputs: exact copies or references to all inputs used by each agent.
- Intermediate Artifacts: JSON manifests (e.g., processed document IDs, semantic docs, contracts, model answers).
- Final Reports: evaluation summaries and any generated plots/tables.
- Logs: structured logs with timestamps for traceability.
You can pass a custom `--run-id` to name the experiment folder deterministically.
Common flags (all pipelines):
- `--working-dir` (default: `runs`) – base directory for artifacts.
- `--model` (default: `openai/o4-mini`) – model name/ID in the standard `provider/model_name` format.
- `--temperature` (default: `1.0`) – sampling temperature.
- `--retries` (default: `3`) – retry attempts for model calls.

Document pipeline:
- `--papers` – path to a text file containing one URL per line (default: `./test_data/papers_1.lst`).
- `--run-id` – optional experiment/run identifier.

Answer & evaluation pipeline:
- `--dataset` – path to a QA dataset JSON (default: `./data/test_dataset.json`).
- `--contracts` – path to a contracts JSON (default: `./data/test_contracts.json`).
- `--flags` – context flags (e.g., `RAW_TEXT`, `CC`, or `CC+RAW_TEXT`).
- `--run-id` – optional experiment/run identifier.
Import and extend the DAG builders from app.py:
```python
from app import get_document_pipeline_dag, get_answer_pipeline_dag

# Build a custom DAG and run it with your own runner/configuration
```

If you use this repository in academic work, please cite it as:
```bibtex
@misc{kaliutau2025,
  author       = {Kaliutau, A.},
  title        = {{AI in Materials Science}},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/stable-reasoning/ai4materials}},
}
```

- Keep the structure of your input files stable; the agents rely on consistent schemas.
- For ablations, duplicate a prior run with only one change (e.g., `--flags`, `--temperature`); see the sketch after this list.
- Use `--run-id` to align artifacts across multiple pipelines when running a batched study.
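A hedged sketch of such an ablation pair, reusing the answer-pipeline example from the Usage section. Only `--flags` and `--run-id` differ between the two runs; `CC` is one of the flag values listed in the CLI reference above:

```bash
# Two otherwise-identical answer runs that differ only in the context flag.
python app.py answer --working-dir runs \
  --contracts ./runs/doc_pipeline_test/contract_generation/contracts.json \
  --dataset ./runs/doc_pipeline_test/question_generation/qa_dataset.json \
  --run-id ablation_raw_text --flags RAW_TEXT \
  --model openai/o4-mini --temperature 1.0 --retries 3

python app.py answer --working-dir runs \
  --contracts ./runs/doc_pipeline_test/contract_generation/contracts.json \
  --dataset ./runs/doc_pipeline_test/question_generation/qa_dataset.json \
  --run-id ablation_cc --flags CC \
  --model openai/o4-mini --temperature 1.0 --retries 3
```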