ErgoX TokenX ML NLP Tokenizers

.NET bindings for HuggingFace Tokenizers with comprehensive testing and multi-platform support.

Why ErgoX.TokenX?

TL;DR: Microsoft.ML.Tokenizers requires manual configuration per model. ErgoX.TokenX provides seamless AutoTokenizer.Load() with HuggingFace ecosystem compatibility, verified byte-for-byte against Python across 2,500+ Python-.NET parity tests.

The Problem with Microsoft.ML.Tokenizers

While Microsoft.ML.Tokenizers offers exceptional raw performance for GPT models (see benchmarks), working with it reveals significant friction:

🔍 HuggingFace Tokenizer Structure

When you download a HuggingFace tokenizer (AutoTokenizer.from_pretrained in Python), you typically get:

model-name/
├── tokenizer.json          # Serialized tokenizer (model, pre-tokenizer, normalizer, post-processor)
├── tokenizer_config.json   # Metadata (model type, special tokens, casing, padding, truncation)
├── vocab.json              # BPE-based tokenizers (GPT-2, RoBERTa)
├── merges.txt              # BPE merge rules
└── special_tokens_map.json # Maps [CLS], [SEP], [PAD], [BOS], [EOS], etc.

HuggingFace's Python library merges these automatically. In Microsoft.ML.Tokenizers, you must be explicit — manually loading vocabulary files, configuring special tokens, and handling model-specific quirks.

⚠️ Pain Points Found

  1. Missing Special Tokens: Special tokens required by models are not configured automatically; each model needs manual attention to get right.
  2. No AutoTokenizer: The AutoTokenizer(...) loading pattern was missing while HuggingFace's ecosystem grew rapidly, leaving .NET behind.
  3. No Chat Templates: Instruction-tuned models (Llama, Mistral, Qwen) require chat templates — outside tokenizer scope in Microsoft.ML, but essential for real-world use.
  4. Limited Pre/Post-Processing: Advanced tokenization pipelines (preprocessing, postprocessing) are difficult to work with.
  5. Complex Overflow Handling: Token overflow scenarios and advanced use cases require significant boilerplate.

The ErgoX.TokenX Solution

The approach is simple: HuggingFace's Rust Tokenizers implementation is exposed to C# through a C FFI binding layer.

What You Get

  • One-Line Loading: AutoTokenizer.Load("model-name") — works like Python
  • 2,500+ Tests Verified: Byte-for-byte parity with Python HuggingFace Transformers
  • SHA256 Hash Verification: Every token output verified against Python reference (2,500+ test cases)
  • Chat Templates: Built-in support for HuggingFace chat templates
  • Multi-Modal Support: Whisper (ASR), CLIP (vision), LayoutLM (documents), TrOCR (OCR)
  • Advanced Features:
    • Token offsets for NER/question answering
    • Truncation and padding strategies
    • Attention masks, type IDs, special token handling
    • Pre-tokenization and post-processing pipelines
  • Production-Proven: Internally used since May 2025 without revisiting alternatives
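
The difference in practice: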
// Microsoft.ML.Tokenizers - Manual configuration required
var vocab = File.ReadAllText("vocab.json");
var merges = File.ReadAllText("merges.txt");
var tokenizer = /* ...manual setup... */;

// ErgoX.TokenX - One line
var tokenizer = AutoTokenizer.Load("bert-base-uncased");

⏱️ When My Observations May Be Outdated

This project was developed internally in May 2025 for key internal workloads. Microsoft.ML.Tokenizers may have evolved since then, but I never revisited alternatives; ErgoX.TokenX met all my needs.

If you're choosing today:

  • High-throughput GPT-only services → Consider Microsoft.ML (accept manual config)
  • HuggingFace ecosystem compatibility → Use ErgoX.TokenX (for productivity)
  • Multi-modal models (Whisper, CLIP, etc.) → Use ErgoX.TokenX (only option)

Features

Cross-platform - Linux, Windows, macOS (x64 & ARM64)
Extensive test coverage across Linux, Windows, and macOS
Rust FFI bindings - High-performance C bindings layer
CI/CD integration - Automated testing and releases
Test reports - Published with every release
Code coverage - Tracked via Codecov
Sequence decoder combinator - Compose native decoders from .NET

Quick Start

Installation

Option 1: NuGet Package (Recommended)

dotnet add package ErgoX.TokenX.HuggingFace

The package includes pre-built native libraries for all supported platforms (Windows, Linux, macOS x64/ARM64).

Option 2: Manual Installation from Releases

Download the latest release from GitHub Releases:

  • Windows x64: tokenizers-c-win-x64.zip
  • Linux x64: tokenizers-c-linux-x64.tar.gz
  • macOS x64: tokenizers-c-osx-x64.tar.gz
  • macOS ARM64: tokenizers-c-osx-arm64.tar.gz

Extract and place native libraries in your project:

YourProject/
└── runtimes/
    ├── win-x64/native/tokenx_bridge.dll
    ├── linux-x64/native/libtokenx_bridge.so
    └── osx-x64/native/libtokenx_bridge.dylib
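
To check at runtime that the bridge resolves, here is a minimal sanity check; the base name tokenx_bridge matches the tree above, and the assembly-based NativeLibrary overload lets the runtime try each platform's prefix and extension:

using System.Reflection;
using System.Runtime.InteropServices;

// Sanity check: can the native bridge be resolved? The runtime probes the
// platform variants (tokenx_bridge.dll, libtokenx_bridge.so,
// libtokenx_bridge.dylib) along its native search paths.
if (NativeLibrary.TryLoad("tokenx_bridge", Assembly.GetExecutingAssembly(), null, out var handle))
{
    Console.WriteLine("Native bridge resolved.");
    NativeLibrary.Free(handle);
}
else
{
    Console.WriteLine("Place the library under runtimes/<rid>/native and retry.");
}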

Basic Usage

HuggingFace Tokenizers

using ErgoX.TokenX.HuggingFace;

// Load tokenizer automatically (like Python's AutoTokenizer)
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");

// Encode text
var encoding = tokenizer.Tokenizer.Encode("Hello, world!", addSpecialTokens: true);
Console.WriteLine($"Tokens: {string.Join(", ", encoding.Tokens)}");
Console.WriteLine($"IDs: {string.Join(", ", encoding.Ids)}");

// Decode
var decoded = tokenizer.Tokenizer.Decode(encoding.Ids, skipSpecialTokens: true);
Console.WriteLine($"Decoded: {decoded}");
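
Token offsets map each token back to a character span in the source text, which NER and question-answering pipelines need. Continuing from the snippet above, a minimal sketch; Offsets and Count are assumed members mirroring the HuggingFace Rust API, so verify them against the package:

// Sketch: recover the source span of each token. Special tokens such as
// [CLS] typically map to an empty (0, 0) span.
var input = "Hello, world!";
var enc = tokenizer.Tokenizer.Encode(input, addSpecialTokens: true);
for (var i = 0; i < enc.Tokens.Count; i++)
{
    var (start, end) = enc.Offsets[i];
    Console.WriteLine($"{enc.Tokens[i]} -> \"{input[start..end]}\" [{start}, {end})");
}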

Note: For OpenAI GPT models, consider using Microsoft.ML.Tokenizers, which provides the TiktokenTokenizer class.
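
For comparison, a minimal Tiktoken sketch; TiktokenTokenizer.CreateForModel and EncodeToIds come from Microsoft.ML.Tokenizers' documented surface, but verify them against the package version you install:

using Microsoft.ML.Tokenizers;

// Tiktoken tokenizer for OpenAI-style models via Microsoft.ML.Tokenizers.
var gpt = TiktokenTokenizer.CreateForModel("gpt-4o");
var ids = gpt.EncodeToIds("Hello, world!");
Console.WriteLine($"IDs: {string.Join(", ", ids)}");
Console.WriteLine($"Decoded: {gpt.Decode(ids)}");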

Running Examples

The repository includes ready-to-run examples with pre-configured models:

# HuggingFace comprehensive quickstart (16 examples)
cd examples/HuggingFace/Quickstart
dotnet run

# Other examples (require model downloads)
dotnet run --project examples/HuggingFace/AllMiniLmL6V2Console
dotnet run --project examples/HuggingFace/E5SmallV2Console
dotnet run --project examples/HuggingFace/AutoTokenizerPipelineExplorer

Quickstart Examples Overview

📗 HuggingFace Tokenizer Quickstart

16 comprehensive examples demonstrating:

  • Basic tokenization (WordPiece, Unigram, BPE)
  • Padding and truncation strategies
  • Text pair encoding for classification
  • Attention masks, type IDs, offset mapping
  • Chat template rendering with Llama 3 (see the sketch below)
  • Vocabulary access, special tokens, batch processing

Models included: all-minilm-l6-v2, t5-small, meta-llama-3-8b-instruct

Documentation: Quickstart README | Full Docs
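
A hypothetical sketch of the chat-template flow referenced above; ApplyChatTemplate and the message shape are illustrative assumptions, so check the Quickstart README for the actual API:

using ErgoX.TokenX.HuggingFace;

// Hypothetical: render a chat template, then encode the resulting prompt.
using var tokenizer = AutoTokenizer.Load("meta-llama-3-8b-instruct");
var messages = new[]
{
    new { role = "system", content = "You are a helpful assistant." },
    new { role = "user", content = "What is a tokenizer?" },
};
var prompt = tokenizer.ApplyChatTemplate(messages, addGenerationPrompt: true); // assumed API
var encoding = tokenizer.Tokenizer.Encode(prompt, addSpecialTokens: false);
Console.WriteLine($"Prompt tokens: {string.Join(", ", encoding.Ids)}");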

Development

Prerequisites

  • .NET SDK 8.0+
  • Rust 1.70+
  • Visual Studio 2022 (Windows) or equivalent C++ toolchain

Building

# Build Rust library
cd .ext/hf_bridge
cargo build --release

# Copy to .NET runtime folder
Copy-Item target/release/tokenx_bridge.dll ../src/HuggingFace/runtimes/win-x64/native/ -Force

# Build .NET project
cd ../..
dotnet build --configuration Release

Testing

# Restore sanitized tokenizer fixtures (skips network downloads)
python tests/Py/Common/restore_test_data.py --force

# Run Rust tests
cd .ext/hf_bridge
cargo test --release

# Run .NET tests
dotnet test --configuration Release

# Refresh HuggingFace parity fixtures (requires transformers/tokenizers)
python tests/Py/Huggingface/generate_benchmarks.py

> Ensure the active Python environment includes the `transformers`, `tokenizers`, and `huggingface_hub` packages so the generators can materialize tokenizer pipelines directly from each model asset.

Running the .NET parity suite now also emits dotnet-benchmark.json alongside the Python fixtures in tests/_testdata_huggingface/<model> so you can inspect the full decoded tokens produced by the managed implementation.

Expected Results: 37 passed, 0 skipped, 0 failed

See TESTING-CHECKLIST.md for detailed instructions.

CI/CD

Automated Testing

Every push and pull request triggers:

  1. Rust C Bindings Tests - 20 FFI layer tests on Linux, Windows, macOS
  2. .NET Integration Tests - 185 end-to-end tests on all platforms
  3. Coverage Reports - Uploaded to Codecov

Releases

Create a release by tagging:

git tag c-v0.22.2
git push origin c-v0.22.2

The release workflow will:

  1. ✅ Build binaries for 7 platforms
  2. ✅ Run full test suite (205 tests)
  3. ✅ Package test reports
  4. ✅ Create GitHub Release with:
    • Multi-platform binaries
    • Test reports archive (test-reports.tar.gz)
    • Checksums
    • Release notes with test results

See CI-CD-WORKFLOWS.md for complete documentation.

Test Reports

Every release includes test-reports.tar.gz containing:

  • TRX files - Machine-readable test results
  • HTML reports - Human-readable test results
  • Coverage reports - Code coverage analysis

Download from the Releases page.

Project Structure

TokenX/
├── .ext/
│   └── hf_bridge/                      # HuggingFace native bridge crate (Rust)
├── .github/
│   ├── workflows/                      # CI/CD workflows
│   │   ├── test-c-bindings.yml         # Rust tests + coverage
│   │   ├── test-dotnet.yml             # .NET tests + coverage
│   │   └── release-c-bindings.yml      # Multi-platform release
│   ├── CI-CD-WORKFLOWS.md              # CI/CD documentation
│   └── TESTING-CHECKLIST.md            # Quick reference
├── src/
│   └── HuggingFace/                    # Managed HuggingFace tokenizer bindings and interop
└── tests/
   ├── ErgoX.VecraX.ML.NLP.Tokenizers.HuggingFace.Tests/
   └── ErgoX.VecraX.ML.NLP.Tokenizers.Testing/   # Shared testing infrastructure

Supported Platforms

Platform      Architecture   Status
Linux         x64            ✅ Tested
Windows       x64            ✅ Tested
macOS         x64            ✅ Tested
macOS         ARM64          ✅ Built
iOS           ARM64          ✅ Built
Android       ARM64          ✅ Built
WebAssembly   wasm32         ✅ Built

Note: "Tested" platforms run full .NET test suite in CI. "Built" platforms compile successfully but tests are not run in CI.

Contributing

We welcome contributions. This project maintains strict 1:1 parity with the upstream HuggingFace Tokenizers Rust implementation. Any change that affects tokenization semantics must preserve byte-for-byte parity with the Python/HuggingFace reference unless explicitly documented and reviewed.

Primary contribution patterns

  • Routine tokenizer updates (most common): When the upstream Rust tokenizer is updated but the C FFI surface is unchanged, update the Rust crate version and rebuild the native artifacts:

    1. Fork and create a feature branch.
    2. Bump the tokenizer crate/Cargo dependency and update Cargo.lock/Cargo.toml as needed.
    3. Rebuild the native bridge (.ext/hf_bridge) and copy the outputs to the appropriate runtimes/ folder.
    4. Run the full parity suite and unit tests (see Testing section).
    5. Commit changes and open a PR that references the upstream tokenizer release and lists test results.
  • FFI or ABI changes (infrequent, higher cost): If upstream changes require modifications to the C FFI or introduce new capabilities, you must:

    1. Open an issue describing the required change and proposed approach before implementation.
    2. Fork and create a feature branch.
    3. Update Rust FFI bindings and the managed interop layer. Mirror native structs exactly and add unit tests asserting sizeof/layout parity where possible (see the sketch after this list).
    4. Ensure every GCHandle, pinned buffer, and disposable resource has a verified disposal path.
    5. Add or update end-to-end parity tests, interop smoke tests, and any size/layout assertions.
    6. Update documentation, CHANGELOG, and bump package/native artifact versions.
    7. Submit a PR that includes a clear migration note and the downstream test artifacts demonstrating parity.
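
For step 3 of the FFI workflow above, a minimal sketch of a sizeof/layout parity test; the struct below is a hypothetical stand-in, not the project's actual FFI surface:

using System.Runtime.InteropServices;
using Xunit;

// Hypothetical managed mirror of a native view struct: two pointers plus a
// size_t length. Field order and packing must match the C declaration.
[StructLayout(LayoutKind.Sequential)]
internal struct TokenxEncodingView
{
    public IntPtr Ids;     // const uint32_t*
    public IntPtr Tokens;  // const char* const*
    public nuint Length;   // size_t
}

public sealed class LayoutParityTests
{
    [Fact]
    public void EncodingView_MatchesNativeSize()
    {
        // Two pointers plus size_t: 24 bytes on 64-bit, 12 on 32-bit.
        var expected = IntPtr.Size == 8 ? 24 : 12;
        Assert.Equal(expected, Marshal.SizeOf<TokenxEncodingView>());
    }
}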

Required checks for every PR

  • Run the full test suite locally:
    • Rust tests: cargo test --release
    • .NET tests: dotnet test --configuration Release
    • Parity/fixtures refresh (when applicable): run Python generators only if you are updating parity fixtures and include generated artifacts in the test run.
  • Verify native outputs are placed in runtimes/*/native and referenced artifacts are updated.
  • Ensure Sonar/Analyzer findings are addressed: 0 new security/bug findings and no build warnings (follow the project's coding standards).
  • Add unit tests with >= 80% coverage for new features or changed code paths.
  • Update documentation and any example projects affected.
  • Provide a short PR checklist in the PR description listing the items above and linking to relevant upstream release notes.

Guidelines and best practices

  • Prefer minimal, backward-compatible changes. Maintain byte-for-byte parity unless a breaking change is approved and documented.
  • For large or invasive changes (FFI, memory model, threading), coordinate with maintainers via an issue and include performance/interop tests.
  • Never commit secrets or credentials. Use configuration providers for secrets.
  • Keep changes small and well-tested; include reproducible steps to validate local parity.

Getting help

  • Open an issue for design discussions or if you are unsure whether a change requires FFI edits.
  • Reference the coding standards and acceptance criteria in your PR to speed review.

Please ensure documentation and README sections are updated for any user-visible or behavioral changes introduced by your PR.

License

This project follows the license terms of the HuggingFace Tokenizers library.

Acknowledgments

Built on top of HuggingFace Tokenizers, an incredibly fast and versatile tokenization library.


Maintained by: ErgoX TokenX Team
Last Updated: October 17, 2025
Version: 0.22.1
