[FEATURE] Generalize Dataset Indexing Within Tasks #33

@federetyk

Description

Problem

The current WorkRB architecture indexes datasets within each task by the Language enum. This limits each task to at most one dataset per language.

This constraint prevents supporting tasks with:

  • Multiple monolingual datasets per language (e.g., regional variants, domain-specific subsets)
  • Cross-lingual datasets (e.g., query language differs from corpus language)
  • Multilingual datasets (e.g., corpus spans multiple languages)

The limitation is enforced by the data loading, evaluation iteration, and result aggregation code.

This issue follows up on the architectural discussion in #30.

Proposal

Generalize dataset indexing from Language to arbitrary string identifiers (dataset_id).

Key changes in workrb.tasks.abstract.Task (a sketch of the resulting interface follows the list):

  • The attribute lang_datasets: dict[Language, Dataset] becomes datasets: dict[str, Dataset], indexing datasets by an arbitrary string
  • Add a new method languages_to_dataset_ids(languages) -> list[str]
    • Default implementation: [lang.value for lang in languages] (1:1 mapping), which makes the refactor backward compatible for existing tasks
    • Tasks with more complex language-dataset mappings can override this method to return custom identifiers
  • Rename load_monolingual_data(language, split) to load_dataset(dataset_id, split)
  • Add get_dataset_language(dataset_id) -> Language | None to enable per-language result aggregation
    • Returns the language for monolingual datasets
    • Returns None for cross-lingual or multilingual datasets
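
The sketch below illustrates what the resulting abstract interface could look like. It is a rough illustration of the proposal, not the actual implementation; Language and Dataset are stand-in placeholders for the existing WorkRB types, and the default body of get_dataset_language() is an assumption about how the 1:1 case could be handled.

    from abc import ABC, abstractmethod
    from enum import Enum
    from typing import Any

    Dataset = Any  # placeholder for the existing WorkRB Dataset type

    class Language(str, Enum):  # placeholder for the existing WorkRB Language enum
        EN = "en"
        DE = "de"

    class Task(ABC):
        # Previously lang_datasets: dict[Language, Dataset]; now keyed by dataset_id.
        datasets: dict[str, Dataset]

        def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
            # Default 1:1 mapping keeps existing monolingual tasks working unchanged.
            return [lang.value for lang in languages]

        @abstractmethod
        def load_dataset(self, dataset_id: str, split: str) -> Dataset:
            # Replaces load_monolingual_data(language, split).
            ...

        def get_dataset_language(self, dataset_id: str) -> Language | None:
            # Default: treat dataset_id as a language code; cross-lingual and
            # multilingual tasks override this and return None.
            try:
                return Language(dataset_id)
            except ValueError:
                return None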

Key changes in results.py (a sketch of these changes follows the list):

  • Add a language field to MetricsResult
  • Update _aggregate_per_language() to group by the language field, skipping datasets where language is None
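
A minimal sketch of how this could look in results.py, assuming MetricsResult is a dataclass-like record and that aggregation means averaging metric values per language; the dataset_id and metrics fields are illustrative, and only the language field is actually part of the proposal.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class MetricsResult:
        dataset_id: str
        metrics: dict[str, float]
        language: str | None = None  # None for cross-lingual / multilingual datasets

    def _aggregate_per_language(results: list[MetricsResult]) -> dict[str, dict[str, float]]:
        # Group results by the new language field, skipping datasets that have
        # no single language (cross-lingual / multilingual).
        grouped: dict[str, list[MetricsResult]] = defaultdict(list)
        for result in results:
            if result.language is not None:
                grouped[result.language].append(result)

        # Average each metric within a language group (illustrative aggregation).
        aggregated: dict[str, dict[str, float]] = {}
        for language, group in grouped.items():
            metric_names = group[0].metrics.keys()
            aggregated[language] = {
                name: sum(r.metrics[name] for r in group) / len(group)
                for name in metric_names
            }
        return aggregated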

The user-facing API remains unchanged. Users continue to specify languages when instantiating tasks:

task = SomeTask(languages=["en", "de"], split="test")

Internally, languages_to_dataset_ids() maps languages to dataset identifiers.
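
For illustration, a hypothetical cross-lingual task could override the mapping as follows (building on the Task sketch above; the class name and the "query-corpus" identifier scheme are made up for this example and are not part of the proposal):

    class CrossLingualRetrievalTask(Task):
        def languages_to_dataset_ids(self, languages: list[Language]) -> list[str]:
            # One dataset per (query language, English corpus) pair, e.g. "de-en".
            return [f"{lang.value}-en" for lang in languages]

        def get_dataset_language(self, dataset_id: str) -> Language | None:
            # Query and corpus languages differ, so there is no single language
            # for per-language aggregation.
            return None

        def load_dataset(self, dataset_id: str, split: str) -> Dataset:
            # Parse the compound identifier and load the matching query/corpus pair.
            query_lang, corpus_lang = dataset_id.split("-")
            raise NotImplementedError(f"load {query_lang}->{corpus_lang} data for {split}")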

This proposal is non-breaking: existing tasks work without modification thanks to the default 1:1 mapping, and per-language result aggregation behaves exactly as before for standard tasks, simply excluding datasets marked as cross-lingual or multilingual.

  • Type:

    • New Ontology (data source for multiple tasks)
    • New Task(s)
    • New Model(s)
    • New Metric(s)
    • Other
  • Area(s) of code: paths, modules, or APIs you expect to touch

    • src/workrb/tasks/abstract/base.py
    • src/workrb/tasks/abstract/ranking_base.py
    • src/workrb/tasks/abstract/classification_base.py
    • src/workrb/run.py
    • src/workrb/config.py
    • src/workrb/results.py
    • All task implementations (method rename only)
    • Test files

Alternatives

An alternative approach proposed by @Mattdl is being discussed in #30: indexing datasets by (query_language, corpus_language) pairs with compound identifiers such as "en-en" or "es-nl". This would enable aggregation by query language, by corpus language, or by specific cross-lingual scenarios. A comparison of both approaches is included in that discussion.

Implementation

  • I plan to implement this in a PR
  • I am proposing the idea and would like someone else to pick it up

Labels: enhancement (New feature or request)
