MWI is a reproducible research toolkit to collect web corpora, qualify/enrich them (NLP/LLM-assisted, auditable), and export interpretable outputs (CSV/JSON/GEXF) for digital methods in social sciences and communication studies.
Use this repository first:
- mwi (local reproducible “desktop-lab”): https://github.com/MyWebIntelligence/mwi
Do not start with mywebapi unless you explicitly need a scalable backend.
Recommended: Docker Compose.
git clone https://github.com/MyWebIntelligence/mwi.git
cd mwi
# Choose one mode
./scripts/docker-compose-setup.sh basic # minimal local setup
# ./scripts/docker-compose-setup.sh api # API-oriented mode
# ./scripts/docker-compose-setup.sh llm # ML/embeddings/LLM mode
# Sanity check (example command)
docker compose exec mwi python mywi.py land list
Full installation details:
- https://github.com/MyWebIntelligence/mwi
- https://github.com/MyWebIntelligence/mwi/blob/master/docs/INSTALL_ZERO_bis.md
Collect → Qualify → Analyze → Export
- Collect
  Build a corpus from seed URLs and curated sources, keep crawl traces, store pages + metadata.
- Qualify
  Extract readable content, enrich with NLP and optional LLM-based relevance gating.
  Auditability is a design goal: raw traces are kept and decisions can be inspected.
- Analyze
  Produce socio-semantic structures: documents, expressions/entities, similarity links, networks.
- Export
  Generate outputs for analysis and visualization:
  - CSV / JSON
  - GEXF (Gephi)
  - structured datasets / reports
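To make the Analyze and Export stages concrete, here is a minimal sketch that derives similarity links between a few documents and serializes them as a GEXF file that Gephi can open. It is an illustration only: the Jaccard measure on whitespace tokens, the function names (`jaccard`, `similarity_links`, `to_gexf`), the threshold, and the sample documents are all assumptions, not MWI's actual method or API.

```python
import itertools
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"  # GEXF 1.2 namespace


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_links(docs: dict, threshold: float = 0.2):
    """Yield (doc_a, doc_b, weight) for document pairs at or above the threshold."""
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    for a, b in itertools.combinations(sorted(tokens), 2):
        weight = jaccard(tokens[a], tokens[b])
        if weight >= threshold:
            yield a, b, round(weight, 3)


def to_gexf(docs: dict, links) -> str:
    """Serialize documents as nodes and similarity links as weighted edges."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes = ET.SubElement(graph, "nodes")
    for doc_id in sorted(docs):
        ET.SubElement(nodes, "node", {"id": doc_id, "label": doc_id})
    edges = ET.SubElement(graph, "edges")
    for i, (a, b, weight) in enumerate(links):
        ET.SubElement(
            edges,
            "edge",
            {"id": str(i), "source": a, "target": b, "weight": str(weight)},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample corpus (not real MWI data).
docs = {
    "page-1": "climate policy debate in parliament",
    "page-2": "parliament debate on climate law",
    "page-3": "recipe for apple pie",
}
links = list(similarity_links(docs))
gexf_xml = to_gexf(docs, links)
```

Writing `gexf_xml` to a `.gexf` file gives a graph with three nodes and one weighted edge. A real pipeline would likely use a richer similarity measure; the point here is only the shape of the data flowing from documents to a network export.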
A Land is a research project container (topic) holding:
- terms, seed URLs, crawls
- extracted content + metadata
- enrichment layers
- exports
Think: one Land = one case study / one dataset / one pipeline run.
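As a mental model, the contents of a Land can be pictured as a small container type. The sketch below is hypothetical: the class, field names, and methods are assumptions made for illustration and do not reflect MWI's actual database schema or Python API.

```python
from dataclasses import dataclass, field


@dataclass
class Land:
    """Hypothetical in-memory model of a Land; not MWI's actual schema."""

    name: str
    terms: list = field(default_factory=list)        # topic vocabulary
    seed_urls: list = field(default_factory=list)    # crawl starting points
    pages: dict = field(default_factory=dict)        # url -> content + metadata
    enrichments: dict = field(default_factory=dict)  # layer name -> {url: value}
    exports: list = field(default_factory=list)      # paths of generated files

    def add_page(self, url: str, text: str, **metadata) -> None:
        """Store extracted content together with its metadata."""
        self.pages[url] = {"text": text, **metadata}

    def enrich(self, layer: str, url: str, value) -> None:
        """Attach one enrichment value (e.g. a relevance score) to a page."""
        self.enrichments.setdefault(layer, {})[url] = value


# Hypothetical usage: one Land for one case study.
land = Land(
    name="climate-debate",
    terms=["climate", "policy"],
    seed_urls=["https://example.org/"],
)
land.add_page("https://example.org/", "Climate policy debate ...", depth=0)
land.enrich("relevance", "https://example.org/", 0.9)
```

Keeping enrichment layers separate from the extracted pages mirrors the auditability goal: the raw content stays untouched while each layer can be inspected or recomputed on its own.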
- mwi
  Local reproducible research tool (Python + SQLite + Docker Compose).
  https://github.com/MyWebIntelligence/mwi
- mwiR
  R package for analysis and R-friendly workflows (bridge for R users).
  https://github.com/MyWebIntelligence/mwiR
- mywebapi
  Scalable backend (FastAPI + PostgreSQL + Celery + Redis).
  Note: this repository contains components “in transition” (API + legacy parts).
  https://github.com/MyWebIntelligence/mywebapi
            ┌──────────────────────────┐
            │  mwi (flagship, local)   │
            │ CLI + reproducible setup │
            └─────────────┬────────────┘
                          │
              SQLite DB + corpus files
                          │
            ┌─────────────┴────────────┐
            │                          │
Exports (CSV/JSON/GEXF)     Optional scale-out
for R / Gephi / notebooks   (mywebapi: Postgres/API/Celery)
            │                          │
     mwiR as bridge         external clients/pipelines
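The left-hand branch of the diagram (SQLite DB to CSV export) can be sketched with the Python standard library alone. Everything specific here is an assumption: the `page` table, its columns, the `export_csv` helper, and the relevance cut-off are invented for illustration, and MWI's real schema and export commands may differ.

```python
import csv
import io
import sqlite3

# Hypothetical schema for illustration only; MWI's real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT, relevance REAL)")
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [
        ("https://example.org/a", "Page A", 0.9),
        ("https://example.org/b", "Page B", 0.1),
    ],
)


def export_csv(conn: sqlite3.Connection, min_relevance: float = 0.5) -> str:
    """Return pages at or above the relevance cut-off as CSV text with a header."""
    cur = conn.execute(
        "SELECT url, title, relevance FROM page "
        "WHERE relevance >= ? ORDER BY url",
        (min_relevance,),
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)
    return buffer.getvalue()


csv_text = export_csv(conn)
```

A flat file like this is what R, Gephi, or a notebook would consume downstream, which is why the local SQLite database sits at the center of the diagram.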
Recommended practice (until stable releases are published everywhere):
- Cite the relevant paper(s) (HAL/publications).
- Cite the software using either:
  - a GitHub Release tag (preferred), or
  - a commit hash.
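The tag-over-commit preference can be captured in a small helper that builds a pinned software reference. This is a sketch under stated assumptions: `cite_software` is a hypothetical function, the commit hash below is a placeholder rather than a real revision, and `v0.1.0` is only the example tag mentioned in this document. In a local clone, the current commit hash can be read with `git rev-parse HEAD`.

```python
from typing import Optional

REPO_URL = "https://github.com/MyWebIntelligence/mwi"


def cite_software(
    repo_url: str, *, tag: Optional[str] = None, commit: Optional[str] = None
) -> str:
    """Build a pinned software reference, preferring a release tag over a commit."""
    if tag:
        return f"{repo_url}/releases/tag/{tag} (release {tag})"
    if commit:
        # Show the abbreviated hash in the label, keep the full hash in the URL.
        return f"{repo_url}/tree/{commit} (commit {commit[:7]})"
    raise ValueError("Provide a release tag or a commit hash")


# Placeholder hash for illustration; replace with the revision actually used.
PLACEHOLDER_COMMIT = "0123456789abcdef0123456789abcdef01234567"

by_tag = cite_software(REPO_URL, tag="v0.1.0", commit=PLACEHOLDER_COMMIT)
by_commit = cite_software(REPO_URL, commit=PLACEHOLDER_COMMIT)
```

Either form points readers to the exact code state behind a result, which is what makes a corpus build reproducible.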
Recommended professionalization steps:
- Add CITATION.cff to mwi and mwiR
- Publish GitHub Releases (e.g., v0.1.0)
- Archive releases to Zenodo (DOI)
For research collaborations, deployments at scale, or reproducible case studies, open an issue on the flagship repository: https://github.com/MyWebIntelligence/mwi/issues
See each repository for licensing details (MIT where specified).