.NET bindings for HuggingFace Tokenizers with comprehensive testing and multi-platform support.
TL;DR: Microsoft.ML.Tokenizers requires manual configuration per model. ErgoX.TokenX provides seamless AutoTokenizer.Load() with HuggingFace ecosystem compatibility — 2,500+ Python-.NET tests verified with byte-to-byte Python parity.
While Microsoft.ML.Tokenizers offers exceptional raw performance for GPT models (see benchmarks), working with it reveals significant friction:
When you download a HuggingFace tokenizer (AutoTokenizer.from_pretrained in Python), you typically get:
```
model-name/
├── tokenizer.json            # Serialized tokenizer (model, pre-tokenizer, normalizer, post-processor)
├── tokenizer_config.json     # Metadata (model type, special tokens, casing, padding, truncation)
├── vocab.json                # BPE-based tokenizers (GPT-2, RoBERTa)
├── merges.txt                # BPE merge rules
└── special_tokens_map.json   # Maps [CLS], [SEP], [PAD], [BOS], [EOS], etc.
```
HuggingFace's Python library merges these automatically. In Microsoft.ML.Tokenizers, you must be explicit — manually loading vocabulary files, configuring special tokens, and handling model-specific quirks.
- Missing Special Tokens: Special tokens required by models are not configured automatically; getting them right takes manual attention.
- No AutoTokenizer: The `AutoTokenizer(...)` pattern was missing while HuggingFace's ecosystem was growing rapidly; .NET lagged behind.
- No Chat Templates: Instruction-tuned models (Llama, Mistral, Qwen) require chat templates — outside tokenizer scope in Microsoft.ML, but essential for real-world use.
- Limited Pre/Post-Processing: Advanced tokenization pipelines (preprocessing, postprocessing) are difficult to work with.
- Complex Overflow Handling: Token overflow scenarios and advanced use cases require significant boilerplate.
The approach is simple: HuggingFace's Rust Tokenizers implementation is exposed to C# through C FFI bindings.
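In practice, that means the managed layer calls into the native bridge through P/Invoke. A minimal sketch of the pattern (the exported symbol names below are hypothetical; the real FFI surface is internal to the package):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Hypothetical export names for illustration only; the actual C symbols
    // exposed by tokenx_bridge are an internal implementation detail.
    [DllImport("tokenx_bridge", EntryPoint = "tokenizer_from_json")]
    internal static extern IntPtr TokenizerFromJson(byte[] utf8Json, nuint length);

    [DllImport("tokenx_bridge", EntryPoint = "tokenizer_free")]
    internal static extern void TokenizerFree(IntPtr handle);
}
```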
- One-Line Loading: `AutoTokenizer.Load("model-name")` — works like Python
- 2,500+ Tests Verified: Byte-to-byte parity with Python HuggingFace Transformers
- SHA256 Hash Verification: Every token output verified against Python reference (2,500+ test cases)
- Chat Templates: Built-in support for HuggingFace chat templates (see the sketch after this list)
- Multi-Modal Support: Whisper (ASR), CLIP (vision), LayoutLM (documents), TrOCR (OCR)
- Advanced Features:
- Token offsets for NER/question answering
- Truncation and padding strategies
- Attention masks, type IDs, special token handling
- Pre-tokenization and post-processing pipelines
- Production-Proven: Internally used since May 2025 without revisiting alternatives
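As a hedged illustration of the chat-template feature: the sketch below renders a conversation into a prompt and tokenizes it. The `ChatMessage` type and `ApplyChatTemplate` method names are assumptions, not verified API; consult the package documentation for the exact surface.

```csharp
using ErgoX.TokenX.HuggingFace;

// Assumption: ChatMessage and ApplyChatTemplate are illustrative names,
// not verified against the ErgoX.TokenX API.
using var tokenizer = AutoTokenizer.Load("meta-llama-3-8b-instruct");

var messages = new[]
{
    new ChatMessage("system", "You are a helpful assistant."),
    new ChatMessage("user", "Summarize BPE in one sentence."),
};

// Render the model's chat template into a prompt, then tokenize it.
var prompt = tokenizer.ApplyChatTemplate(messages, addGenerationPrompt: true);
var encoding = tokenizer.Tokenizer.Encode(prompt, addSpecialTokens: false);
```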
```csharp
// Microsoft.ML.Tokenizers - Manual configuration required
var vocab = File.ReadAllText("vocab.json");
var merges = File.ReadAllText("merges.txt");
var tokenizer = /* ...manual setup... */;

// ErgoX.TokenX - One line
var tokenizer = AutoTokenizer.Load("bert-base-uncased");
```

This project was developed internally in May 2025 for key implementations. Microsoft.ML.Tokenizers may have evolved since then, but I never revisited alternatives — ErgoX.TokenX met all my needs.
If you're choosing today:
- ✅ High-throughput GPT-only services → Consider Microsoft.ML (accept manual config)
- ✅ HuggingFace ecosystem compatibility → Use ErgoX.TokenX (for productivity)
- ✅ Multi-modal models (Whisper, CLIP, etc.) → Use ErgoX.TokenX (only option)
✅ Cross-platform - Linux, Windows, macOS (x64 & ARM64)
✅ Extensive test coverage across Linux, Windows, and macOS
✅ Rust FFI bindings - High-performance C bindings layer
✅ CI/CD integration - Automated testing and releases
✅ Test reports - Published with every release
✅ Code coverage - Tracked via Codecov
✅ Sequence decoder combinator - Compose native decoders from .NET
```bash
dotnet add package ErgoX.TokenX.HuggingFace
```

The package includes pre-built native libraries for all supported platforms (Windows, Linux, macOS x64/ARM64).
Download the latest release from GitHub Releases:
- Windows x64: `tokenizers-c-win-x64.zip`
- Linux x64: `tokenizers-c-linux-x64.tar.gz`
- macOS x64: `tokenizers-c-osx-x64.tar.gz`
- macOS ARM64: `tokenizers-c-osx-arm64.tar.gz`
Extract and place native libraries in your project:
```
YourProject/
└── runtimes/
    ├── win-x64/native/tokenx_bridge.dll
    ├── linux-x64/native/libtokenx_bridge.so
    └── osx-x64/native/libtokenx_bridge.dylib
```
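If the default probing misses the bridge in a non-standard layout, you can hook .NET's standard import resolver. This is a general .NET interop pattern, not an ErgoX.TokenX-specific API; the library name `tokenx_bridge` is inferred from the file names above.

```csharp
using System;
using System.Runtime.InteropServices;
using ErgoX.TokenX.HuggingFace;

// Redirect the "tokenx_bridge" import to an explicit path when the
// default probing fails (e.g., a custom output layout).
NativeLibrary.SetDllImportResolver(typeof(AutoTokenizer).Assembly, (name, assembly, searchPath) =>
{
    if (name == "tokenx_bridge" && OperatingSystem.IsLinux())
        return NativeLibrary.Load("runtimes/linux-x64/native/libtokenx_bridge.so");
    return IntPtr.Zero; // defer to the default resolution pipeline
});
```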
```csharp
using ErgoX.TokenX.HuggingFace;

// Load tokenizer automatically (like Python's AutoTokenizer)
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");

// Encode text
var encoding = tokenizer.Tokenizer.Encode("Hello, world!", addSpecialTokens: true);
Console.WriteLine($"Tokens: {string.Join(", ", encoding.Tokens)}");
Console.WriteLine($"IDs: {string.Join(", ", encoding.Ids)}");

// Decode
var decoded = tokenizer.Tokenizer.Decode(encoding.Ids, skipSpecialTokens: true);
Console.WriteLine($"Decoded: {decoded}");
```

Note: For OpenAI GPT models, consider using Microsoft.ML.Tokenizers, which provides the `TiktokenTokenizer` class.
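Beyond basic encode/decode, encodings expose richer metadata such as token offsets. A hedged sketch of offset mapping (the `Offsets` property name follows the HuggingFace `Encoding` convention and is an assumption here; verify against the ErgoX.TokenX API):

```csharp
using System;
using ErgoX.TokenX.HuggingFace;

// Assumption: Offsets yields (start, end) character spans per token,
// mirroring the HuggingFace Encoding convention; names are unverified.
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");
var enc = tokenizer.Tokenizer.Encode("Ada Lovelace wrote programs.", addSpecialTokens: true);

for (int i = 0; i < enc.Tokens.Count; i++)
{
    var (start, end) = enc.Offsets[i]; // span into the original string
    Console.WriteLine($"{enc.Tokens[i]} -> [{start}, {end})");
}
```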
The repository includes ready-to-run examples with pre-configured models:
```bash
# HuggingFace comprehensive quickstart (16 examples)
cd examples/HuggingFace/Quickstart
dotnet run

# Other examples (require model downloads)
dotnet run --project examples/HuggingFace/AllMiniLmL6V2Console
dotnet run --project examples/HuggingFace/E5SmallV2Console
dotnet run --project examples/HuggingFace/AutoTokenizerPipelineExplorer
```

16 comprehensive examples demonstrating:
- Basic tokenization (WordPiece, Unigram, BPE)
- Padding and truncation strategies (see the sketch below)
- Text pair encoding for classification
- Attention masks, type IDs, offset mapping
- Chat template rendering with Llama 3
- Vocabulary access, special tokens, batch processing
Models included: all-minilm-l6-v2, t5-small, meta-llama-3-8b-instruct
Documentation: Quickstart README | Full Docs
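As one illustration, a padding-and-truncation sketch; `EnableTruncation`, `EnablePadding`, and `EncodeBatch` mirror the upstream HuggingFace tokenizers API and are assumptions here, so check the Quickstart project for the exact surface:

```csharp
using ErgoX.TokenX.HuggingFace;

// Assumption: EnableTruncation/EnablePadding/EncodeBatch mirror the
// upstream HuggingFace tokenizers API; verify against the Quickstart.
using var tokenizer = AutoTokenizer.Load("all-minilm-l6-v2");
tokenizer.Tokenizer.EnableTruncation(maxLength: 128);
tokenizer.Tokenizer.EnablePadding(length: 128, padToken: "[PAD]");

var batch = tokenizer.Tokenizer.EncodeBatch(new[]
{
    "Short sentence.",
    "A much longer sentence that will be truncated and padded to length 128.",
});
```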
- .NET SDK 8.0+
- Rust 1.70+
- Visual Studio 2022 (Windows) or equivalent C++ toolchain
```bash
# Build Rust library
cd .ext/hf_bridge
cargo build --release

# Copy to .NET runtime folder
Copy-Item target/release/tokenx_bridge.dll ../src/HuggingFace/runtimes/win-x64/native/ -Force

# Build .NET project
cd ../..
dotnet build --configuration Release
```

```bash
# Restore sanitized tokenizer fixtures (skips network downloads)
python tests/Py/Common/restore_test_data.py --force

# Run Rust tests
cd .ext/hf_bridge
cargo test --release

# Run .NET tests
dotnet test --configuration Release

# Refresh HuggingFace parity fixtures (requires transformers/tokenizers)
python tests/Py/Huggingface/generate_benchmarks.py
```
> Ensure the active Python environment includes the `transformers`, `tokenizers`, and `huggingface_hub` packages so the generators can materialize tokenizer pipelines directly from each model asset.
Running the .NET parity suite now also emits `dotnet-benchmark.json` alongside the Python fixtures in `tests/_testdata_huggingface/<model>`, so you can inspect the full decoded tokens produced by the managed implementation.
Expected Results: 37 passed, 0 skipped, 0 failed
See TESTING-CHECKLIST.md for detailed instructions.
Every push and pull request triggers:
- Rust C Bindings Tests - 20 FFI layer tests on Linux, Windows, macOS
- .NET Integration Tests - 185 end-to-end tests on all platforms
- Coverage Reports - Uploaded to Codecov
Create a release by tagging:
```bash
git tag c-v0.22.2
git push origin c-v0.22.2
```

The release workflow will:
- ✅ Build binaries for 7 platforms
- ✅ Run full test suite (205 tests)
- ✅ Package test reports
- ✅ Create GitHub Release with:
- Multi-platform binaries
- Test reports archive (`test-reports.tar.gz`)
- Checksums
- Release notes with test results
See CI-CD-WORKFLOWS.md for complete documentation.
Every release includes `test-reports.tar.gz` containing:
- TRX files - Machine-readable test results
- HTML reports - Human-readable test results
- Coverage reports - Code coverage analysis
Download from the Releases page.
```
TokenX/
├── .ext/
│   └── hf_bridge/                   # HuggingFace native bridge crate (Rust)
├── .github/
│   ├── workflows/                   # CI/CD workflows
│   │   ├── test-c-bindings.yml      # Rust tests + coverage
│   │   ├── test-dotnet.yml          # .NET tests + coverage
│   │   └── release-c-bindings.yml   # Multi-platform release
│   ├── CI-CD-WORKFLOWS.md           # CI/CD documentation
│   └── TESTING-CHECKLIST.md         # Quick reference
├── src/
│   └── HuggingFace/                 # Managed HuggingFace tokenizer bindings and interop
└── tests/
    ├── ErgoX.VecraX.ML.NLP.Tokenizers.HuggingFace.Tests/
    └── ErgoX.VecraX.ML.NLP.Tokenizers.Testing/   # Shared testing infrastructure
```
| Platform | Architecture | Status |
|---|---|---|
| Linux | x64 | ✅ Tested |
| Windows | x64 | ✅ Tested |
| macOS | x64 | ✅ Tested |
| macOS | ARM64 | ✅ Built |
| iOS | ARM64 | ✅ Built |
| Android | ARM64 | ✅ Built |
| WebAssembly | wasm32 | ✅ Built |
Note: "Tested" platforms run full .NET test suite in CI. "Built" platforms compile successfully but tests are not run in CI.
We welcome contributions. This project maintains strict 1:1 parity with the upstream HuggingFace Tokenizers Rust implementation. Any change that affects tokenization semantics must preserve byte-for-byte parity with the Python/HuggingFace reference unless explicitly documented and reviewed.
Primary contribution patterns
- Routine tokenizer updates (most common): When the upstream Rust tokenizer is updated but the C FFI surface is unchanged, update the Rust crate version and rebuild the native artifacts:
  - Fork and create a feature branch.
  - Bump the tokenizer crate/Cargo dependency and update Cargo.lock/Cargo.toml as needed.
  - Rebuild the native bridge (`.ext/hf_bridge`) and copy the outputs to the appropriate runtimes/ folder.
  - Run the full parity suite and unit tests (see Testing section).
  - Commit changes and open a PR that references the upstream tokenizer release and lists test results.
- FFI or ABI changes (infrequent, higher cost): If upstream changes require modifications to the C FFI or introduce new capabilities, you must:
  - Open an issue describing the required change and proposed approach before implementation.
  - Fork and create a feature branch.
  - Update Rust FFI bindings and the managed interop layer. Mirror native structs exactly and add unit tests asserting sizeof/layout parity where possible (see the sketch after this list).
  - Ensure every GCHandle, pinned buffer, and disposable resource has a verified disposal path.
  - Add or update end-to-end parity tests, interop smoke tests, and any size/layout assertions.
  - Update documentation, CHANGELOG, and bump package/native artifact versions.
  - Submit a PR that includes a clear migration note and the downstream test artifacts demonstrating parity.
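A minimal sketch of a sizeof/layout parity assertion, assuming a hypothetical mirrored struct (the real struct definitions live in the interop layer):

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical mirror of a native bridge struct; the real definitions
// must match the C side field-for-field.
[StructLayout(LayoutKind.Sequential)]
internal struct NativeEncodingHeader
{
    public nuint TokenCount;
    public IntPtr IdsPtr;
}

internal static class LayoutChecks
{
    // Example unit-test assertion: the managed mirror must keep the same
    // size as the native struct (two pointer-sized fields on 64-bit).
    internal static void AssertLayoutParity()
    {
        int expected = 2 * IntPtr.Size;
        if (Marshal.SizeOf<NativeEncodingHeader>() != expected)
            throw new InvalidOperationException("NativeEncodingHeader layout drifted from the C definition.");
    }
}
```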
Required checks for every PR
- Run the full test suite locally:
  - Rust tests: `cargo test --release`
  - .NET tests: `dotnet test --configuration Release`
- Parity/fixtures refresh (when applicable): run Python generators only if you are updating parity fixtures and include generated artifacts in the test run.
- Verify native outputs are placed in runtimes/*/native and referenced artifacts are updated.
- Ensure Sonar/Analyzer findings are addressed: 0 new security/bug findings, and no build warnings (follow project's coding standards).
- Add unit tests with >= 80% coverage for new features or changed code paths.
- Update documentation and any example projects affected.
- Provide a short PR checklist in the PR description listing the items above and linking to relevant upstream release notes.
Guidelines and best practices
- Prefer minimal, backward-compatible changes. Maintain byte-for-byte parity unless a breaking change is approved and documented.
- For large or invasive changes (FFI, memory model, threading), coordinate with maintainers via an issue and include performance/interop tests.
- Never commit secrets or credentials. Use configuration providers for secrets.
- Keep changes small and well-tested; include reproducible steps to validate local parity.
Getting help
- Open an issue for design discussions or if you are unsure whether a change requires FFI edits.
- Reference the coding standards and acceptance criteria in your PR to speed review.
Please ensure documentation and README sections are updated for any user-visible or behavioral changes introduced by your PR.
This project follows the license terms of the HuggingFace Tokenizers library.
Built on top of HuggingFace Tokenizers - an incredibly fast and versatile tokenization library.
Maintained by: ErgoX TokenX Team
Last Updated: October 17, 2025
Version: 0.22.1