Description
Problem
The current WorkRB architecture indexes datasets within each task by Language enum. This limits each task to at most one dataset per language.
This constraint prevents supporting tasks with:
- Multiple monolingual datasets per language (e.g., regional variants, domain-specific subsets)
- Cross-lingual datasets (e.g., query language differs from corpus language)
- Multilingual datasets (e.g., corpus spans multiple languages)
The limitation is baked into the data-loading, evaluation-iteration, and result-aggregation code.
This issue follows up on the architectural discussion in #30.
Proposal
Generalize dataset indexing from Language to arbitrary string identifiers (dataset_id).
Key changes in workrb.tasks.abstract.Task:
- The attribute `lang_datasets: dict[Language, Dataset]` becomes `datasets: dict[str, Dataset]`, indexing datasets by an arbitrary string
- Add a new method `languages_to_dataset_ids(languages) -> list[str]`
  - Default implementation: `[lang.value for lang in languages]` (1:1 mapping), which keeps the refactor backward compatible for existing tasks
  - Tasks with more complex language-dataset mappings can override this method to return custom identifiers
- Rename `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)`
- Add `get_dataset_language(dataset_id) -> Language | None` to enable per-language result aggregation
  - Returns the language for monolingual datasets
  - Returns `None` for cross-lingual or multilingual datasets
Key changes in results.py:
- Add a `language` field to `MetricsResult`
- Update `_aggregate_per_language()` to group by the `language` field, skipping datasets where `language` is `None`
The user-facing API remains unchanged. Users continue to specify languages when instantiating tasks:
```python
task = SomeTask(languages=["en", "de"], split="test")
```
Internally, `languages_to_dataset_ids()` maps languages to dataset identifiers.
This proposal is non-breaking: existing tasks work without modification due to the default 1:1 mapping. Result aggregation behavior is preserved for standard tasks. Per-language aggregation remains backward compatible and simply excludes datasets marked as cross-lingual or multilingual.
-
Type:
- New Ontology (data source for multiple tasks)
- New Task(s)
- New Model(s)
- New Metric(s)
- Other
-
Area(s) of code: paths, modules, or APIs you expect to touch
- `src/workrb/tasks/abstract/base.py`
- `src/workrb/tasks/abstract/ranking_base.py`
- `src/workrb/tasks/abstract/classification_base.py`
- `src/workrb/run.py`
- `src/workrb/config.py`
- `src/workrb/results.py`
- All task implementations (method rename only)
- Test files
Alternatives
An alternative approach proposed by @Mattdl is being discussed in #30: indexing datasets by (query_language, corpus_language) pairs with compound identifiers like "en-en" or "es-nl". This would enable aggregation by query language, by corpus language, or by specific cross-lingual scenario. A comparison of both approaches is included in that discussion.
Implementation
- I plan to implement this in a PR
- I am proposing the idea and would like someone else to pick it up