ErgoX TokenX ML NLP Tokenizers

.NET bindings for HuggingFace Tokenizers with comprehensive testing and multi-platform support.

Why ErgoX.TokenX?

TL;DR: Microsoft.ML.Tokenizers requires manual configuration per model. ErgoX.TokenX provides seamless AutoTokenizer.Load() with HuggingFace ecosystem compatibility, verified byte-for-byte against Python across 2,500+ Python-.NET parity tests.

The Problem with Microsoft.ML.Tokenizers

While Microsoft.ML.Tokenizers offers exceptional raw performance for GPT models (see benchmarks), working with it reveals significant friction:

🔍 HuggingFace Tokenizer Structure

When you download a HuggingFace tokenizer (AutoTokenizer.from_pretrained in Python), you typically get:

model-name/
├── tokenizer.json          # Serialized tokenizer (model, pre-tokenizer, normalizer, post-processor)
├── tokenizer_config.json   # Metadata (model type, special tokens, casing, padding, truncation)
├── vocab.json              # BPE-based tokenizers (GPT-2, RoBERTa)
├── merges.txt              # BPE merge rules
└── special_tokens_map.json # Maps [CLS], [SEP], [PAD], [BOS], [EOS], etc.

HuggingFace's Python library merges these automatically. In Microsoft.ML.Tokenizers, you must be explicit — manually loading vocabulary files, configuring special tokens, and handling model-specific quirks.

⚠️ Pain Points Found

  1. Missing Special Tokens: Special tokens required by models are not configured automatically; each model needs manual attention to get right.
  2. No AutoTokenizer: The AutoTokenizer(...) loading pattern was missing while HuggingFace's ecosystem grew rapidly, leaving .NET behind.
  3. No Chat Templates: Instruction-tuned models (Llama, Mistral, Qwen) require chat templates — outside tokenizer scope in Microsoft.ML, but essential for real-world use.
  4. Limited Pre/Post-Processing: Advanced tokenization pipelines (preprocessing, postprocessing) are difficult to work with.
  5. Complex Overflow Handling: Token overflow scenarios and advanced use cases require significant boilerplate.

The ErgoX.TokenX Solution

The approach is simple: HuggingFace's Rust Tokenizers implementation is exposed to C# through a C FFI binding layer.

What You Get

  • One-Line Loading: AutoTokenizer.Load("model-name") — works like Python
  • 2,500+ Tests Verified: Byte-for-byte parity with Python HuggingFace Transformers
  • SHA256 Hash Verification: Every token output verified against Python reference (2,500+ test cases)
  • Chat Templates: Built-in support for HuggingFace chat templates
  • Multi-Modal Support: Whisper (ASR), CLIP (vision), LayoutLM (documents), TrOCR (OCR)
  • Advanced Features:
    • Token offsets for NER/question answering
    • Truncation and padding strategies
    • Attention masks, type IDs, special token handling
    • Pre-tokenization and post-processing pipelines
  • Production-Proven: Internally used since May 2025 without revisiting alternatives
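
The difference in practice: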
// Microsoft.ML.Tokenizers - Manual configuration required
var vocab = File.ReadAllText("vocab.json");
var merges = File.ReadAllText("merges.txt");
var tokenizer = /* ...manual setup... */;

// ErgoX.TokenX - One line
var tokenizer = AutoTokenizer.Load("bert-base-uncased");

⏱️ When My Observations May Be Outdated

This project was developed internally in May 2025 for key internal workloads. Microsoft.ML.Tokenizers may have evolved since then, but I never revisited alternatives; ErgoX.TokenX met all my needs.

If you're choosing today:

  • High-throughput GPT-only services → Consider Microsoft.ML (accept manual config)
  • HuggingFace ecosystem compatibility → Use ErgoX.TokenX (for productivity)
  • Multi-modal models (Whisper, CLIP, etc.) → Use ErgoX.TokenX (only option)

Features

Cross-platform - Linux, Windows, macOS (x64 & ARM64)
Extensive test coverage across Linux, Windows, and macOS
Rust FFI bindings - High-performance C bindings layer
CI/CD integration - Automated testing and releases
Test reports - Published with every release
Code coverage - Tracked via Codecov
Sequence decoder combinator - Compose native decoders from .NET

Quick Start

Installation

Option 1: NuGet Package (Recommended)

dotnet add package ErgoX.TokenX.HuggingFace

The package includes pre-built native libraries for all supported platforms (Windows, Linux, macOS x64/ARM64).

Option 2: Manual Installation from Releases

Download the latest release from GitHub Releases:

  • Windows x64: tokenizers-c-win-x64.zip
  • Linux x64: tokenizers-c-linux-x64.tar.gz
  • macOS x64: tokenizers-c-osx-x64.tar.gz
  • macOS ARM64: tokenizers-c-osx-arm64.tar.gz

Extract and place native libraries in your project:

YourProject/
└── runtimes/
    ├── win-x64/native/tokenx_bridge.dll
    ├── linux-x64/native/libtokenx_bridge.so
    └── osx-x64/native/libtokenx_bridge.dylib
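
To check at runtime that the bridge resolves, here is a minimal sanity check; the base name tokenx_bridge matches the tree above, and the assembly-based NativeLibrary overload lets the runtime try each platform's prefix and extension:

using System.Reflection;
using System.Runtime.InteropServices;

// Sanity check: can the native bridge be resolved? The runtime probes the
// platform variants (tokenx_bridge.dll, libtokenx_bridge.so,
// libtokenx_bridge.dylib) along its native search paths.
if (NativeLibrary.TryLoad("tokenx_bridge", Assembly.GetExecutingAssembly(), null, out var handle))
{
    Console.WriteLine("Native bridge resolved.");
    NativeLibrary.Free(handle);
}
else
{
    Console.WriteLine("Place the library under runtimes/<rid>/native and retry.");
}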

Basic Usage

HuggingFace Tokenizers

using ErgoX.TokenX.HuggingFace;

// Load tokenizer automatically (like Python's AutoTokenizer)
using var tokenizer = AutoTokenizer.Load("bert-base-uncased");

// Encode text
var encoding = tokenizer.Tokenizer.Encode("Hello, world!", addSpecialTokens: true);
Console.WriteLine($"Tokens: {string.Join(", ", encoding.Tokens)}");
Console.WriteLine($"IDs: {string.Join(", ", encoding.Ids)}");

// Decode
var decoded = tokenizer.Tokenizer.Decode(encoding.Ids, skipSpecialTokens: true);
Console.WriteLine($"Decoded: {decoded}");
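
Token offsets map each token back to a character span in the source text, which NER and question-answering pipelines need. Continuing from the snippet above, a minimal sketch; Offsets and Count are assumed members mirroring the HuggingFace Rust API, so verify them against the package:

// Sketch: recover the source span of each token. Special tokens such as
// [CLS] typically map to an empty (0, 0) span.
var input = "Hello, world!";
var enc = tokenizer.Tokenizer.Encode(input, addSpecialTokens: true);
for (var i = 0; i < enc.Tokens.Count; i++)
{
    var (start, end) = enc.Offsets[i];
    Console.WriteLine($"{enc.Tokens[i]} -> \"{input[start..end]}\" [{start}, {end})");
}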

Note: For OpenAI GPT models, consider using Microsoft.ML.Tokenizers, which provides the TiktokenTokenizer class.
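
For comparison, a minimal Tiktoken sketch; TiktokenTokenizer.CreateForModel and EncodeToIds come from Microsoft.ML.Tokenizers' documented surface, but verify them against the package version you install:

using Microsoft.ML.Tokenizers;

// Tiktoken tokenizer for OpenAI-style models via Microsoft.ML.Tokenizers.
var gpt = TiktokenTokenizer.CreateForModel("gpt-4o");
var ids = gpt.EncodeToIds("Hello, world!");
Console.WriteLine($"IDs: {string.Join(", ", ids)}");
Console.WriteLine($"Decoded: {gpt.Decode(ids)}");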

Running Examples

The repository includes ready-to-run examples with pre-configured models:

# HuggingFace comprehensive quickstart (16 examples)
cd examples/HuggingFace/Quickstart
dotnet run

# Other examples (require model downloads)
dotnet run --project examples/HuggingFace/AllMiniLmL6V2Console
dotnet run --project examples/HuggingFace/E5SmallV2Console
dotnet run --project examples/HuggingFace/AutoTokenizerPipelineExplorer

Quickstart Examples Overview

📗 HuggingFace Tokenizer Quickstart

16 comprehensive examples demonstrating:

  • Basic tokenization (WordPiece, Unigram, BPE)
  • Padding and truncation strategies
  • Text pair encoding for classification
  • Attention masks, type IDs, offset mapping
  • Chat template rendering with Llama 3 (see the sketch below)
  • Vocabulary access, special tokens, batch processing

Models included: all-minilm-l6-v2, t5-small, meta-llama-3-8b-instruct

Documentation: Quickstart README | Full Docs
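
A hypothetical sketch of the chat-template flow referenced above; ApplyChatTemplate and the message shape are illustrative assumptions, so check the Quickstart README for the actual API:

using ErgoX.TokenX.HuggingFace;

// Hypothetical: render a chat template, then encode the resulting prompt.
using var tokenizer = AutoTokenizer.Load("meta-llama-3-8b-instruct");
var messages = new[]
{
    new { role = "system", content = "You are a helpful assistant." },
    new { role = "user", content = "What is a tokenizer?" },
};
var prompt = tokenizer.ApplyChatTemplate(messages, addGenerationPrompt: true); // assumed API
var encoding = tokenizer.Tokenizer.Encode(prompt, addSpecialTokens: false);
Console.WriteLine($"Prompt tokens: {string.Join(", ", encoding.Ids)}");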

Development

Prerequisites

  • .NET SDK 8.0+
  • Rust 1.70+
  • Visual Studio 2022 (Windows) or equivalent C++ toolchain

Building

# Build Rust library
cd .ext/hf_bridge
cargo build --release

# Copy to .NET runtime folder
Copy-Item target/release/tokenx_bridge.dll ../src/HuggingFace/runtimes/win-x64/native/ -Force

# Build .NET project
cd ../..
dotnet build --configuration Release

Testing

# Restore sanitized tokenizer fixtures (skips network downloads)
python tests/Py/Common/restore_test_data.py --force

# Run Rust tests
cd .ext/hf_bridge
cargo test --release

# Run .NET tests
dotnet test --configuration Release

# Refresh HuggingFace parity fixtures (requires transformers/tokenizers)
python tests/Py/Huggingface/generate_benchmarks.py

> Ensure the active Python environment includes the `transformers`, `tokenizers`, and `huggingface_hub` packages so the generators can materialize tokenizer pipelines directly from each model asset.

Running the .NET parity suite now also emits dotnet-benchmark.json alongside the Python fixtures in tests/_testdata_huggingface/<model> so you can inspect the full decoded tokens produced by the managed implementation.

Expected Results: 37 passed, 0 skipped, 0 failed

See TESTING-CHECKLIST.md for detailed instructions.

CI/CD

Automated Testing

Every push and pull request triggers:

  1. Rust C Bindings Tests - 20 FFI layer tests on Linux, Windows, macOS
  2. .NET Integration Tests - 185 end-to-end tests on all platforms
  3. Coverage Reports - Uploaded to Codecov

Releases

Create a release by tagging:

git tag c-v0.22.2
git push origin c-v0.22.2

The release workflow will:

  1. ✅ Build binaries for 7 platforms
  2. ✅ Run full test suite (205 tests)
  3. ✅ Package test reports
  4. ✅ Create GitHub Release with:
    • Multi-platform binaries
    • Test reports archive (test-reports.tar.gz)
    • Checksums
    • Release notes with test results

See CI-CD-WORKFLOWS.md for complete documentation.

Test Reports

Every release includes test-reports.tar.gz containing:

  • TRX files - Machine-readable test results
  • HTML reports - Human-readable test results
  • Coverage reports - Code coverage analysis

Download from the Releases page.

Project Structure

TokenX/
├── .ext/
│   └── hf_bridge/                      # HuggingFace native bridge crate (Rust)
├── .github/
│   ├── workflows/                      # CI/CD workflows
│   │   ├── test-c-bindings.yml         # Rust tests + coverage
│   │   ├── test-dotnet.yml             # .NET tests + coverage
│   │   └── release-c-bindings.yml      # Multi-platform release
│   ├── CI-CD-WORKFLOWS.md              # CI/CD documentation
│   └── TESTING-CHECKLIST.md            # Quick reference
├── src/
│   └── HuggingFace/                    # Managed HuggingFace tokenizer bindings and interop
└── tests/
   ├── ErgoX.VecraX.ML.NLP.Tokenizers.HuggingFace.Tests/
   └── ErgoX.VecraX.ML.NLP.Tokenizers.Testing/   # Shared testing infrastructure

Supported Platforms

Platform      Architecture   Status
Linux         x64            ✅ Tested
Windows       x64            ✅ Tested
macOS         x64            ✅ Tested
macOS         ARM64          ✅ Built
iOS           ARM64          ✅ Built
Android       ARM64          ✅ Built
WebAssembly   wasm32         ✅ Built

Note: "Tested" platforms run full .NET test suite in CI. "Built" platforms compile successfully but tests are not run in CI.

Contributing

We welcome contributions. This project maintains strict 1:1 parity with the upstream HuggingFace Tokenizers Rust implementation. Any change that affects tokenization semantics must preserve byte-for-byte parity with the Python/HuggingFace reference unless explicitly documented and reviewed.

Primary contribution patterns

  • Routine tokenizer updates (most common): When the upstream Rust tokenizer is updated but the C FFI surface is unchanged, update the Rust crate version and rebuild the native artifacts:

    1. Fork and create a feature branch.
    2. Bump the tokenizer crate/Cargo dependency and update Cargo.lock/Cargo.toml as needed.
    3. Rebuild the native bridge (.ext/hf_bridge) and copy the outputs to the appropriate runtimes/ folder.
    4. Run the full parity suite and unit tests (see Testing section).
    5. Commit changes and open a PR that references the upstream tokenizer release and lists test results.
  • FFI or ABI changes (infrequent, higher cost): If upstream changes require modifications to the C FFI or introduce new capabilities, you must:

    1. Open an issue describing the required change and proposed approach before implementation.
    2. Fork and create a feature branch.
    3. Update Rust FFI bindings and the managed interop layer. Mirror native structs exactly and add unit tests asserting sizeof/layout parity where possible (see the sketch after this list).
    4. Ensure every GCHandle, pinned buffer, and disposable resource has a verified disposal path.
    5. Add or update end-to-end parity tests, interop smoke tests, and any size/layout assertions.
    6. Update documentation, CHANGELOG, and bump package/native artifact versions.
    7. Submit a PR that includes a clear migration note and the downstream test artifacts demonstrating parity.
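
For step 3 of the FFI workflow above, a minimal sketch of a sizeof/layout parity test; the struct below is a hypothetical stand-in, not the project's actual FFI surface:

using System.Runtime.InteropServices;
using Xunit;

// Hypothetical managed mirror of a native view struct: two pointers plus a
// size_t length. Field order and packing must match the C declaration.
[StructLayout(LayoutKind.Sequential)]
internal struct TokenxEncodingView
{
    public IntPtr Ids;     // const uint32_t*
    public IntPtr Tokens;  // const char* const*
    public nuint Length;   // size_t
}

public sealed class LayoutParityTests
{
    [Fact]
    public void EncodingView_MatchesNativeSize()
    {
        // Two pointers plus size_t: 24 bytes on 64-bit, 12 on 32-bit.
        var expected = IntPtr.Size == 8 ? 24 : 12;
        Assert.Equal(expected, Marshal.SizeOf<TokenxEncodingView>());
    }
}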

Required checks for every PR

  • Run the full test suite locally:
    • Rust tests: cargo test --release
    • .NET tests: dotnet test --configuration Release
    • Parity/fixtures refresh (when applicable): run Python generators only if you are updating parity fixtures and include generated artifacts in the test run.
  • Verify native outputs are placed in runtimes/*/native and referenced artifacts are updated.
  • Ensure Sonar/Analyzer findings are addressed: 0 new security/bug findings and no build warnings (follow the project's coding standards).
  • Add unit tests with >= 80% coverage for new features or changed code paths.
  • Update documentation and any example projects affected.
  • Provide a short PR checklist in the PR description listing the items above and linking to relevant upstream release notes.

Guidelines and best practices

  • Prefer minimal, backward-compatible changes. Maintain byte-for-byte parity unless a breaking change is approved and documented.
  • For large or invasive changes (FFI, memory model, threading), coordinate with maintainers via an issue and include performance/interop tests.
  • Never commit secrets or credentials. Use configuration providers for secrets.
  • Keep changes small and well-tested; include reproducible steps to validate local parity.

Getting help

  • Open an issue for design discussions or if you are unsure whether a change requires FFI edits.
  • Reference the coding standards and acceptance criteria in your PR to speed review.

Please ensure documentation and README sections are updated for any user-visible or behavioral changes introduced by your PR.

License

This project follows the license terms of the HuggingFace Tokenizers library.

Acknowledgments

Built on top of HuggingFace Tokenizers, an incredibly fast and versatile tokenization library.


Maintained by: ErgoX TokenX Team
Last Updated: October 17, 2025
Version: 0.22.1
