
TwinSpect - Near-Duplicate Benchmark

A comprehensive benchmarking framework for evaluating near-duplicate matching and similarity search of text, audio, image, and video content based on compact binary codes.

Overview

TwinSpect was built to evaluate the International Standard Content Code (ISCC) and inform the ISO community about its capabilities and performance characteristics across different media types.

The framework provides end-to-end evaluation of compact binary code algorithms, computing information retrieval metrics over real-world and synthetically augmented media datasets.
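
To make the metrics concrete, the sketch below scores a tiny set of 64-bit codes matched at a fixed Hamming threshold. It is a minimal, self-contained illustration with made-up codes and hypothetical helper names, not TwinSpect's actual pipeline:

# Minimal sketch: precision/recall/F1 for Hamming-threshold matching of
# binary codes. Example codes and ground truth are made up for illustration.

def hamming(a: int, b: int) -> int:
    """Hamming distance between two equal-length binary codes held as ints."""
    return (a ^ b).bit_count()

# 64-bit codes for three files; files 0 and 1 are near-duplicates (ground truth).
codes = [0x9F3A5C7E12B4D608, 0x9F3A5C7E12B4D708, 0x13579BDF2468ACE0]
ground_truth = {(0, 1)}

threshold = 8  # declare a match if the Hamming distance is <= 8 bits
predicted = {
    (i, j)
    for i in range(len(codes))
    for j in range(i + 1, len(codes))
    if hamming(codes[i], codes[j]) <= threshold
}

tp = len(predicted & ground_truth)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(ground_truth) if ground_truth else 1.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")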

Live results: https://eval.iscc.codes

Features

  • Configurable benchmarks - YAML-based configuration for algorithms, datasets, and metrics
  • Multi-modal support - Text, audio, image, and video content types
  • Dataset management - Automatic acquisition and transformation of public media collections
  • Fast similarity search - HNSW-based indexing for approximate nearest-neighbor queries (see the indexing sketch after this list)
  • Effectiveness metrics - Precision, recall, and F1 scores at configurable Hamming thresholds
  • Result visualization - Auto-generated documentation with charts and tables
  • Extensible architecture - Plugin system for custom algorithms, datasets, and transformations
  • Performance optimized - Parallel processing and intelligent caching of intermediate results
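
As a flavor of the search layer, the sketch below builds a binary HNSW index and queries it for Hamming-space nearest neighbors. It uses FAISS's IndexBinaryHNSW purely as a stand-in (assuming numpy and faiss-cpu are installed); TwinSpect's own indexing code and library choice may differ:

# Sketch: approximate nearest-neighbor search over binary codes with an HNSW
# index, here via FAISS's IndexBinaryHNSW (a stand-in, not TwinSpect's code).
import numpy as np
import faiss  # pip install faiss-cpu

bits = 64  # code length in bits; FAISS stores binary codes as bits // 8 bytes
codes = np.random.randint(0, 256, size=(10_000, bits // 8), dtype=np.uint8)

index = faiss.IndexBinaryHNSW(bits, 32)  # 32 = HNSW graph connectivity (M)
index.add(codes)

queries = codes[:5]  # query with the first five indexed codes
distances, labels = index.search(queries, 10)  # 10 nearest neighbors each
print(distances[0])  # Hamming distances for the first query's neighbors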

Quick Start

Requirements: Python 3.11+, uv, and ffmpeg (for audio/video)

# Clone and install
git clone https://github.com/iscc/twinspect
cd twinspect
uv sync

# Run the full benchmark suite
uv run twinspect run

CLI Usage

# List available components
uv run twinspect algorithms       # Show registered algorithms
uv run twinspect datasets         # Show available datasets
uv run twinspect benchmarks       # Show benchmark configurations
uv run twinspect transformations  # Show media transformations

# Run benchmarks
uv run twinspect run              # Execute all configured benchmarks

# Utilities
uv run twinspect version          # Show version
uv run twinspect info             # Show data folder information
uv run twinspect checksum <path>  # Compute folder checksum

Documentation

The benchmark results and methodology are documented at https://eval.iscc.codes, including:

  • Algorithm descriptions and configurations
  • Dataset specifications and transformations
  • Effectiveness metrics and interpretation
  • Distribution analysis charts (see the sketch after this list)
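
To give a sense of what the distribution charts show, the sketch below plots a histogram of pairwise Hamming distances for random 64-bit codes (assuming numpy and matplotlib); it mimics the shape of those charts but is not TwinSpect's rendering code:

# Sketch: pairwise Hamming distance distribution over random 64-bit codes,
# the kind of analysis the distribution charts visualize. Illustration only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
codes = rng.integers(0, 256, size=(500, 8), dtype=np.uint8)  # 500 x 64 bits

# All pairwise distances: XOR the byte matrices, then count differing bits.
xor = codes[:, None, :] ^ codes[None, :, :]
dist = np.unpackbits(xor, axis=-1).sum(axis=-1)
upper = np.triu_indices(len(codes), k=1)  # unique pairs only

plt.hist(dist[upper], bins=64, range=(0, 64))
plt.xlabel("Hamming distance")
plt.ylabel("pair count")
plt.title("Pairwise Hamming distance distribution (random codes)")
plt.show()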

Development

# Install with dev dependencies
uv sync

# Run development tasks
uv run poe all              # Run all formatting and validation tasks
uv run poe format-code      # Format Python code with ruff
uv run poe format-yaml      # Format YAML files
uv run poe validate-schema  # Validate OpenAPI schema
uv run poe generate-code    # Generate Pydantic models from schema

# Preview documentation locally
uv run mkdocs serve

Project Structure

twinspect/
├── algos/            # Algorithm implementations and processing
├── datasets/         # Dataset acquisition and management
├── metrics/          # Effectiveness and distribution metrics
├── render/           # Result rendering (Markdown, charts)
├── transformations/  # Media transformation functions
├── config.yml        # Main benchmark configuration
└── schema.yml        # OpenAPI data model specification

Changelog

See CHANGELOG.md for version history and release notes.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
