[FEATURE] Generalize Dataset Indexing Within Tasks #33

@federetyk

Description

Problem

The current WorkRB architecture indexes datasets within each task by the Language enum. This limits each task to at most one dataset per language.

This constraint prevents supporting tasks with:

  • Multiple monolingual datasets per language (e.g., regional variants, domain-specific subsets)
  • Cross-lingual datasets (e.g., query language differs from corpus language)
  • Multilingual datasets (e.g., corpus spans multiple languages)

The limitation is enforced by the data loading, evaluation iteration, and result aggregation code.

This issue follows up on the architectural discussion in #30.

Proposal

Generalize dataset indexing from Language to arbitrary string identifiers (dataset_id).

Key changes in workrb.tasks.abstract.Task (a sketch of the resulting interface follows the list):

  • The attribute lang_datasets: dict[Language, Dataset] becomes datasets: dict[str, Dataset], indexing datasets by an arbitrary string
  • Add a new method languages_to_dataset_ids(languages) -> list[str]
    • Default implementation: [lang.value for lang in languages] (1:1 mapping), which makes the refactor backward compatible for existing tasks
    • Tasks with more complex language-dataset mappings can override this method to return custom identifiers
  • Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split)
  • Add get_dataset_language(dataset_id) -> Language | None to enable per-language result aggregation
    • Returns the language for monolingual datasets
    • Returns None for cross-lingual or multilingual datasets
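
The sketch below illustrates what the resulting abstract interface could look like. It is a rough illustration of the proposal, not the actual implementation; Language and Dataset are stand-in placeholders for the existing WorkRB types, and the default body of get_dataset_language() is an assumption about how the 1:1 case could be handled.

    from abc import ABC, abstractmethod
    from enum import Enum
    from typing import Any

    Dataset = Any  # placeholder for the existing WorkRB Dataset type

    class Language(str, Enum):  # placeholder for the existing WorkRB Language enum
        EN = "en"
        DE = "de"

    class Task(ABC):
        # Previously lang_datasets: dict[Language, Dataset]; now keyed by dataset_id.
        datasets: dict[str, Dataset]

        def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
            # Default 1:1 mapping keeps existing monolingual tasks working unchanged.
            return [lang.value for lang in languages]

        @abstractmethod
        def load_dataset(self, dataset_id: str, split: str) -> Dataset:
            # Replaces load_monolingual_data(language, split).
            ...

        def get_dataset_language(self, dataset_id: str) -> Language | None:
            # Default: treat dataset_id as a language code; cross-lingual and
            # multilingual tasks override this and return None.
            try:
                return Language(dataset_id)
            except ValueError:
                return None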

Key changes in results.py (a sketch of these changes follows the list):

  • Add a language field to MetricsResult
  • Update _aggregate_per_language() to group by the language field, skipping datasets where language is None
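
A minimal sketch of how this could look in results.py, assuming MetricsResult is a dataclass-like record and that aggregation means averaging metric values per language; the dataset_id and metrics fields are illustrative, and only the language field is actually part of the proposal.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class MetricsResult:
        dataset_id: str
        metrics: dict[str, float]
        language: str | None = None  # None for cross-lingual / multilingual datasets

    def _aggregate_per_language(results: list[MetricsResult]) -> dict[str, dict[str, float]]:
        # Group results by the new language field, skipping datasets that have
        # no single language (cross-lingual / multilingual).
        grouped: dict[str, list[MetricsResult]] = defaultdict(list)
        for result in results:
            if result.language is not None:
                grouped[result.language].append(result)

        # Average each metric within a language group (illustrative aggregation).
        aggregated: dict[str, dict[str, float]] = {}
        for language, group in grouped.items():
            metric_names = group[0].metrics.keys()
            aggregated[language] = {
                name: sum(r.metrics[name] for r in group) / len(group)
                for name in metric_names
            }
        return aggregated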

The user-facing API remains unchanged. Users continue to specify languages when instantiating tasks:

task = SomeTask(languages=["en", "de"], split="test")

Internally, languages_to_dataset_ids() maps languages to dataset identifiers.
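
For illustration, a hypothetical cross-lingual task could override the mapping as follows (building on the Task sketch above; the class name and the "query-corpus" identifier scheme are made up for this example and are not part of the proposal):

    class CrossLingualRetrievalTask(Task):
        def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
            # One dataset per (query language, English corpus) pair, e.g. "de-en".
            return [f"{lang.value}-en" for lang in languages]

        def get_dataset_language(self, dataset_id: str) -> Language | None:
            # Query and corpus languages differ, so there is no single language
            # for per-language aggregation.
            return None

        def load_dataset(self, dataset_id: str, split: str) -> Dataset:
            # Parse the compound identifier and load the matching query/corpus pair.
            query_lang, corpus_lang = dataset_id.split("-")
            raise NotImplementedError(f"load {query_lang}->{corpus_lang} data for {split}")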

This proposal is non-breaking: existing tasks work without modification thanks to the default 1:1 mapping, and per-language result aggregation behaves exactly as before for standard tasks, simply excluding datasets marked as cross-lingual or multilingual.

  • Type:

    • New Ontology (data source for multiple tasks)
    • New Task(s)
    • New Model(s)
    • New Metric(s)
    • Other
  • Area(s) of code: paths, modules, or APIs you expect to touch

    • src/workrb/tasks/abstract/base.py
    • src/workrb/tasks/abstract/ranking_base.py
    • src/workrb/tasks/abstract/classification_base.py
    • src/workrb/run.py
    • src/workrb/config.py
    • src/workrb/results.py
    • All task implementations (method rename only)
    • Test files

Alternatives

An alternative approach proposed by @Mattdl is being discussed in #30: indexing datasets by (query_language, corpus_language) pairs with compound identifiers such as "en-en" or "es-nl". This would enable aggregation by query language, by corpus language, or by specific cross-lingual scenarios. A comparison of both approaches is included in that discussion.

Implementation

  • I plan to implement this in a PR
  • I am proposing the idea and would like someone else to pick it up

Labels: enhancement (New feature or request)
