MWI is a reproducible research tool to collect web corpora, qualify them, and produce interpretable outputs for communication studies and digital methods.

MyWebIntelligence/.github

My Web Intelligence (MWI)

MWI is a reproducible research toolkit to collect web corpora, qualify/enrich them (NLP/LLM-assisted, auditable), and export interpretable outputs (CSV/JSON/GEXF) for digital methods in social sciences and communication studies.

Start here (flagship)

Use this repository first: mwi (https://github.com/MyWebIntelligence/mwi)

Do not start with mywebapi unless you explicitly need a scalable backend.


Quickstart (get a first result fast)

Recommended: Docker Compose.

git clone https://github.com/MyWebIntelligence/mwi.git
cd mwi

# Choose one mode
./scripts/docker-compose-setup.sh basic   # minimal local setup
# ./scripts/docker-compose-setup.sh api   # API-oriented mode
# ./scripts/docker-compose-setup.sh llm   # ML/embeddings/LLM mode

# Sanity check (example command)
docker compose exec mwi python mywi.py land list

Full installation details:


What MWI does (workflow)

Collect → Qualify → Analyze → Export

  1. Collect
    Build a corpus from seed URLs and curated sources, keep crawl traces, store pages + metadata.

  2. Qualify
    Extract readable content, enrich with NLP and optional LLM-based relevance gating.
    Auditability is a design goal: raw traces are kept and decisions can be inspected.

  3. Analyze
    Produce socio-semantic structures: documents, expressions/entities, similarity links, networks.

  4. Export
    Generate outputs for analysis and visualization:
    • CSV / JSON
    • GEXF (Gephi)
    • structured datasets / reports
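
The Qualify → Analyze → Export steps above can be illustrated with a small, self-contained sketch. It is not MWI's implementation: the toy corpus, the term-matching relevance gate (a stand-in for NLP/LLM gating), the Jaccard similarity measure, and every function name are assumptions made for illustration. Only the idea of keeping an inspectable decision trace and emitting GEXF comes from the description above.

```python
import xml.etree.ElementTree as ET
from itertools import combinations

# Hypothetical toy corpus: page URL -> already-extracted readable text.
PAGES = {
    "https://example.org/a": "climate policy debate in the european parliament",
    "https://example.org/b": "european climate policy and carbon markets",
    "https://example.org/c": "recipe for a quick tomato soup",
}
LAND_TERMS = {"climate", "policy", "carbon"}  # assumed topic terms


def qualify(pages, terms, min_hits=1):
    """Keep pages that mention enough topic terms; record every decision."""
    kept, audit = {}, []
    for url, text in pages.items():
        words = set(text.split())
        hits = sorted(terms & words)
        decision = len(hits) >= min_hits
        # The audit entry is written whether or not the page is kept.
        audit.append({"url": url, "hits": hits, "kept": decision})
        if decision:
            kept[url] = words
    return kept, audit


def similarity_links(docs, threshold=0.2):
    """Link two documents when the Jaccard overlap of their words is high enough."""
    links = []
    for (u, a), (v, b) in combinations(docs.items(), 2):
        score = len(a & b) / len(a | b)
        if score >= threshold:
            links.append((u, v, round(score, 3)))
    return links


def to_gexf(docs, links):
    """Serialize nodes and weighted edges as a minimal GEXF 1.2 document."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="undirected")
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    for url in docs:
        ET.SubElement(nodes, "node", id=url, label=url)
    for i, (u, v, w) in enumerate(links):
        ET.SubElement(edges, "edge", id=str(i), source=u, target=v, weight=str(w))
    return ET.tostring(gexf, encoding="unicode")


kept, audit = qualify(PAGES, LAND_TERMS)
links = similarity_links(kept)
gexf_xml = to_gexf(kept, links)
```

The `audit` list keeps one record per page, including rejected ones, so a gating decision can be reviewed later. The resulting `gexf_xml` string can be written to a `.gexf` file and opened in Gephi.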

Key concept: “Land”

A Land is a research project container (topic) holding:

  • terms, seed URLs, crawls
  • extracted content + metadata
  • enrichment layers
  • exports

Think: one Land = one case study / one dataset / one pipeline run.
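
As a rough mental model, a Land can be pictured as a small container object. The sketch below is hypothetical: the class names, field names, and the `add_page` helper are illustrative assumptions and do not reflect MWI's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One collected page with its extracted content and enrichment layers."""
    url: str
    text: str = ""
    metadata: dict = field(default_factory=dict)
    enrichments: dict = field(default_factory=dict)  # e.g. {"relevance": 0.8}


@dataclass
class Land:
    """Hypothetical research project container (one topic, one dataset)."""
    name: str
    terms: list = field(default_factory=list)
    seed_urls: list = field(default_factory=list)
    pages: dict = field(default_factory=dict)    # url -> Page
    exports: list = field(default_factory=list)  # paths of generated files

    def add_page(self, page):
        # Pages are keyed by URL, so re-adding a URL replaces the old entry.
        self.pages[page.url] = page


land = Land(
    name="climate-debate",
    terms=["climate", "policy"],
    seed_urls=["https://example.org/"],
)
land.add_page(Page(url="https://example.org/a", text="climate policy debate"))
```

Keeping terms, seeds, pages, and exports under one object mirrors the "one Land = one case study" rule of thumb: everything needed to reproduce a dataset travels together.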


Repository map (what each repo is for)

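Flagship (start here)

  • mwi: local CLI with a reproducible setup; stores a SQLite DB plus corpus files.

Components (use when relevant)

  • mywebapi: optional scale-out backend (Postgres/API/Celery).
  • mwiR: bridge from MWI exports to R.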


Architecture (high-level)

        ┌──────────────────────────┐
        │ mwi (flagship, local)     │
        │ CLI + reproducible setup  │
        └─────────────┬────────────┘
                      │
          SQLite DB + corpus files
                      │
        ┌─────────────┴────────────┐
        │                          │
  Exports (CSV/JSON/GEXF)     Optional scale-out
  for R / Gephi / notebooks   (mywebapi: Postgres/API/Celery)
        │                          │
      mwiR as bridge          external clients/pipelines
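
To make the "SQLite DB → exports" path in the diagram concrete, here is a minimal sketch using only the Python standard library. The `page` table, its columns, and the helper names are assumptions for illustration; the real MWI database schema may differ, and an in-memory database stands in for the on-disk file.

```python
import csv
import io
import json
import sqlite3

# In-memory stand-in for the SQLite file; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE page (id INTEGER PRIMARY KEY, land TEXT, url TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO page (land, url, title) VALUES (?, ?, ?)",
    [
        ("demo", "https://example.org/a", "Page A"),
        ("demo", "https://example.org/b", "Page B"),
        ("other", "https://example.org/c", "Page C"),
    ],
)


def fetch_pages(conn, land):
    """Return the pages of one land as plain dictionaries, ordered by id."""
    cur = conn.execute(
        "SELECT id, url, title FROM page WHERE land = ? ORDER BY id", (land,)
    )
    return [dict(row) for row in cur]


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "url", "title"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = fetch_pages(conn, "demo")
csv_text = to_csv(rows)                 # for R, spreadsheets, notebooks
json_text = json.dumps(rows, indent=2)  # for external clients/pipelines
```

Flat files like these are what downstream tools (R, Gephi, notebooks) consume, which is why the local SQLite store sits at the center of the diagram.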

Academic citation

Recommended practice (until stable releases are published everywhere):

  1. Cite the relevant paper(s) (HAL/publications).
  2. Cite the software using either:
    • a GitHub Release tag (preferred), or
    • a commit hash.

Recommended professionalization steps:

  • Add CITATION.cff to mwi and mwiR
  • Publish GitHub Releases (e.g., v0.1.0)
  • Archive releases to Zenodo (DOI)

Support / Contact

For research collaborations, deployments at scale, or reproducible case studies, open an issue on the flagship repository: https://github.com/MyWebIntelligence/mwi/issues


License

See each repository for licensing details (MIT where specified).
