My Web Intelligence (MWI)

MWI is a reproducible research toolkit to collect web corpora, qualify/enrich them (NLP/LLM-assisted, auditable), and export interpretable outputs (CSV/JSON/GEXF) for digital methods in social sciences and communication studies.

Start here (flagship)

Use this repository first: https://github.com/MyWebIntelligence/mwi

Do not start with mywebapi unless you explicitly need a scalable backend.


Quickstart (get a first result fast)

Recommended: Docker Compose.

git clone https://github.com/MyWebIntelligence/mwi.git
cd mwi

# Choose one mode
./scripts/docker-compose-setup.sh basic   # minimal local setup
# ./scripts/docker-compose-setup.sh api   # API-oriented mode
# ./scripts/docker-compose-setup.sh llm   # ML/embeddings/LLM mode

# Sanity check (example command)
docker compose exec mwi python mywi.py land list

Full installation details: see the flagship repository, https://github.com/MyWebIntelligence/mwi


What MWI does (workflow)

Collect → Qualify → Analyze → Export

  1. Collect
    Build a corpus from seed URLs and curated sources, keep crawl traces, store pages + metadata.

  2. Qualify
    Extract readable content, enrich with NLP and optional LLM-based relevance gating.
    Auditability is a design goal: raw traces are kept and decisions can be inspected.

  3. Analyze
    Produce socio-semantic structures: documents, expressions/entities, similarity links, networks.

  4. Export
    Generate outputs for analysis and visualization:
    • CSV / JSON
    • GEXF (Gephi)
    • structured datasets / reports
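
The Qualify and Analyze steps can be illustrated with a small sketch. This is not MWI's implementation: the `Decision` record, the term-share score, the thresholds, and the Jaccard-based similarity links are all assumptions chosen to show the idea of a relevance gate whose decisions stay inspectable, followed by link construction between the retained pages.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Decision:
    """One auditable gating decision (hypothetical structure, not MWI's schema)."""
    url: str
    matched_terms: list
    score: float
    kept: bool


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def gate(pages, terms, threshold=0.5):
    """Score each page by the share of terms it contains.

    Every decision is recorded, kept or rejected, so it can be reviewed later.
    """
    wanted = {t.lower() for t in terms}
    if not wanted:
        raise ValueError("at least one term is required")
    audit = []
    for url, text in pages.items():
        matched = sorted(wanted & tokenize(text))
        score = len(matched) / len(wanted)
        audit.append(Decision(url, matched, score, score >= threshold))
    return audit


def similarity_links(pages, kept_urls, min_jaccard=0.2):
    """Link kept pages whose token sets overlap (Jaccard index)."""
    tokens = {u: tokenize(pages[u]) for u in kept_urls}
    links = []
    for a, b in combinations(sorted(tokens), 2):
        union = tokens[a] | tokens[b]
        jaccard = len(tokens[a] & tokens[b]) / len(union) if union else 0.0
        if jaccard >= min_jaccard:
            links.append((a, b, round(jaccard, 3)))
    return links


# Toy corpus with placeholder URLs.
pages = {
    "https://example.org/a": "Digital methods for web corpora and networks",
    "https://example.org/b": "Web corpora in digital sociology",
    "https://example.org/c": "A recipe for apple pie",
}
audit = gate(pages, terms=["web", "corpora"])
kept = [d.url for d in audit if d.kept]
links = similarity_links(pages, kept)
```

The `audit` list holds one record per page, including rejected ones, which is what makes the gate reviewable after the fact. An LLM-based gate could in principle fill the same record with a model score in place of the term share.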

Key concept: “Land”

A Land is a research project container (topic) holding:

  • terms, seed URLs, crawls
  • extracted content + metadata
  • enrichment layers
  • exports

Think: one Land = one case study / one dataset / one pipeline run.
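
As a rough mental model, a Land can be pictured as a single container object. The sketch below is purely illustrative: the class, its field names, and its methods are hypothetical and do not reflect MWI's actual database schema or CLI.

```python
from dataclasses import dataclass, field


@dataclass
class Land:
    """Hypothetical in-memory model of a Land; all field names are illustrative."""
    name: str
    terms: list = field(default_factory=list)
    seed_urls: list = field(default_factory=list)
    crawls: list = field(default_factory=list)        # one entry per crawl run
    pages: dict = field(default_factory=dict)         # url -> content + metadata
    enrichments: dict = field(default_factory=dict)   # layer name -> {url: value}
    exports: list = field(default_factory=list)       # paths of generated files

    def add_page(self, url, content, **metadata):
        """Store extracted content and its metadata under the page URL."""
        self.pages[url] = {"content": content, "metadata": metadata}

    def enrich(self, layer, url, value):
        """Attach a value to a known page under a named enrichment layer."""
        if url not in self.pages:
            raise KeyError(f"unknown page: {url}")
        self.enrichments.setdefault(layer, {})[url] = value


# Placeholder project name and URL.
land = Land(name="climate-debate", terms=["climate", "policy"],
            seed_urls=["https://example.org/"])
land.add_page("https://example.org/", "Climate policy overview", lang="en")
land.enrich("relevance", "https://example.org/", 0.9)
```

Keeping enrichment layers separate from the extracted content means a layer can be recomputed or discarded without touching the collected pages.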


Repository map (what each repo is for)

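Flagship (start here)

  • mwi: main repository. CLI and reproducible local setup for collecting, qualifying and analyzing web corpora. https://github.com/MyWebIntelligence/mwi

Components (use when relevant)

  • mwiR: R analysis bridge for MWI exports.
  • mywebapi: experimental scalable backend (API).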


Architecture (high-level)

        ┌──────────────────────────┐
        │ mwi (flagship, local)     │
        │ CLI + reproducible setup  │
        └─────────────┬────────────┘
                      │
          SQLite DB + corpus files
                      │
        ┌─────────────┴────────────┐
        │                          │
  Exports (CSV/JSON/GEXF)     Optional scale-out
  for R / Gephi / notebooks   (mywebapi: Postgres/API/Celery)
        │                          │
      mwiR as bridge          external clients/pipelines
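
The left-hand branch of the diagram (SQLite to CSV/GEXF exports) can be sketched with the Python standard library alone. The `page` and `link` tables below are an assumed toy schema, not MWI's real one, and the two export functions are hypothetical helpers; the GEXF layout follows the public GEXF 1.2 draft format that Gephi reads.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Assumed toy schema in an in-memory database (not MWI's actual schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
    CREATE TABLE link (source_id INTEGER, target_id INTEGER);
    INSERT INTO page VALUES (1, 'https://example.org/a', 'Page A'),
                            (2, 'https://example.org/b', 'Page B');
    INSERT INTO link VALUES (1, 2);
""")


def export_csv(conn):
    """Return the page table as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "url", "title"])
    writer.writerows(conn.execute("SELECT id, url, title FROM page ORDER BY id"))
    return buf.getvalue()


def export_gexf(conn):
    """Return the page/link graph as a GEXF 1.2 document string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", mode="static", defaultedgetype="directed")
    nodes = ET.SubElement(graph, "nodes")
    for page_id, url in conn.execute("SELECT id, url FROM page ORDER BY id"):
        ET.SubElement(nodes, "node", id=str(page_id), label=url)
    edges = ET.SubElement(graph, "edges")
    rows = conn.execute("SELECT source_id, target_id FROM link")
    for i, (src, dst) in enumerate(rows):
        ET.SubElement(edges, "edge", id=str(i), source=str(src), target=str(dst))
    return ET.tostring(gexf, encoding="unicode")


csv_text = export_csv(conn)
gexf_text = export_gexf(conn)
```

Writing `csv_text` to a `.csv` file and `gexf_text` to a `.gexf` file gives inputs that R, notebooks, and Gephi can load directly, which is the role the diagram assigns to exports.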

Academic citation

Recommended practice (until stable releases are published everywhere):

  1. Cite the relevant paper(s) (HAL/publications).
  2. Cite the software using either:
    • a GitHub Release tag (preferred), or
    • a commit hash.

Recommended professionalization steps:

  • Add CITATION.cff to mwi and mwiR
  • Publish GitHub Releases (e.g., v0.1.0)
  • Archive releases to Zenodo (DOI)

Support / Contact

For research collaborations, deployments at scale, or reproducible case studies, open an issue on the flagship repository: https://github.com/MyWebIntelligence/mwi/issues


License

See each repository for licensing details (MIT where specified).

Pinned repositories

  1. mwi (Python)
    Main repository (flagship). Reproducible research tool for collecting, qualifying and analyzing web corpora.

  2. mwiR (R)
    Component repository: R analysis bridge for MWI exports. Start here: https://github.com/MyWebIntelligence/mwi

  3. mywebapi (Python)
    Component repository: experimental scalable backend (API). Start here: https://github.com/MyWebIntelligence/mwi
