Skip to content

Fast, standalone embeddings in Go for text similarity, semantic search, and vector retrieval.

Notifications You must be signed in to change notification settings

tahcohcat/same-same

Repository files navigation

Same-Same Vector Database Microservice

Go Version License Build Status Tests Docker Pulls API Vector DB


gemini huggingface nginx docker prometheus grafana

A lightweight RESTful microservice for storing and searching vectors using cosine similarity, with multimodal embedding support for both text and images.

Designed and optimized for quick prototyping and exploration of the vector space with minimal setup requirements.

Features

Storage Options

  • In-memory vector storage with thread safety (default)
  • Local file system storage with schema-driven persistence, metadata indexing, and multimodal support

Multimodal Embedding Support

  • CLIP - Embed images and text into the same vector space for cross-modal search (Pure Go, no Python!)
  • Text Embedders:
    • Local TF-IDF (default, no external dependencies)
    • Google Gemini API
    • HuggingFace API
  • Search images with text queries and vice versa

Core Features

  • RESTful API for CRUD operations
  • Vector similarity search using cosine similarity
  • Advanced metadata filtering and search
  • Flexible data ingestion from multiple sources
  • Pluggable embedder interface
  • JSON API responses

Documentation

Quick Start

Installation

# Clone the repository
git clone https://github.com/tahcohcat/same-same.git
cd same-same

# Build the CLI
go build ./cmd/same-same

Start the Server

# Start server on default port 8080
same-same serve

# Or with custom port
same-same serve -a :9000

# With debug logging
same-same serve -d

Ingest Data

# Ingest built-in demo dataset
same-same ingest demo

# Ingest with namespace
same-same ingest -n quotes demo

# Ingest CSV file
same-same ingest --text-col description data.csv

# Ingest images (Pure Go CLIP - no Python!)
same-same ingest -e clip images:./photos

# Ingest HuggingFace dataset
same-same ingest hf:imdb --split train --sample 1000

Search

# Search by text
curl -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"text": "machine learning", "limit": 5}'

# Search images with text (after ingesting images)
curl -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"text": "sunset over ocean", "limit": 10}'

CLI Commands

Same-Same provides a unified CLI powered by Cobra:

Available Commands

same-same --help              # Show all commands
same-same serve [flags]       # Start the server
same-same ingest <source>     # Ingest data from various sources

Common Usage Examples

# Start server
same-same serve                           # Default port 8080
same-same serve -a :9000 -d              # Custom port with debug

# Ingest text data
same-same ingest demo                     # Built-in dataset
same-same ingest -n quotes -v demo       # With namespace and verbose
same-same ingest data.csv                # CSV file
same-same ingest data.jsonl              # JSONL file
same-same ingest hf:imdb                 # HuggingFace dataset

# Ingest images (no Python required!)
same-same ingest -e clip images:./photos           # Image directory
same-same ingest -e clip image-list:images.txt    # Image list file
same-same ingest -e clip -n vacation images:./trip # With namespace

Global Flags

  • -v, --verbose - Verbose output
  • -n, --namespace <string> - Namespace for vectors (default: "default")
  • --dry-run - Perform dry run without making changes
  • --version - Show version

Image Embedding (Pure Go)

Same-Same supports multimodal embeddings using CLIP - embed both images and text into the same vector space.

No Python Required! The default CLIP embedder is pure Go with zero external dependencies.

# Ingest images - works out of the box!
same-same ingest -e clip images:./photos

# Search images with text
curl -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"text": "beach sunset", "limit": 5}'

Optional: Use Python-based OpenCLIP for higher accuracy:

# Install Python dependencies
pip install open_clip_torch pillow torch

# Use Python CLIP
export CLIP_USE_PYTHON=true
same-same ingest -e clip --clip-model ViT-L-14 images:./photos

See IMAGE_EMBEDDING_GUIDE.md for detailed documentation.

Data Ingestion Sources

Built-in Datasets

same-same ingest demo              # 20 quotes (quick test)
same-same ingest quotes            # Full quotes dataset
same-same ingest quotes-small      # Same as demo

HuggingFace Datasets

same-same ingest hf:imdb                           # IMDB reviews
same-same ingest hf:squad:v2 --split train        # SQuAD v2
same-same ingest hf:ag_news --sample 1000         # AG News (sampled)

CSV Files

same-same ingest data.csv                          # Default "text" column
same-same ingest --text-col content data.csv      # Custom column
same-same ingest -n products products.csv         # With namespace

JSONL Files

same-same ingest data.jsonl                        # JSON lines
same-same ingest -v data.ndjson                   # With verbose output

Images

same-same ingest -e clip images:./photos          # Directory (recursive)
same-same ingest -e clip -r=false images:./dir    # Non-recursive
same-same ingest -e clip image-list:list.txt      # From list file

Supported formats: JPEG, PNG, GIF, BMP, WebP

API Endpoints

Vectors

  • POST /api/v1/vectors/embed - Create vector from text (auto-generates embedding)
  • GET /api/v1/vectors/count - Get total number of vectors
  • POST /api/v1/vectors - Create vector manually
  • GET /api/v1/vectors - List all vectors
  • GET /api/v1/vectors/{id} - Get specific vector
  • PUT /api/v1/vectors/{id} - Update vector
  • DELETE /api/v1/vectors/{id} - Delete vector
  • POST /api/v1/vectors/search - Search by vector similarity
  • POST /api/v1/search - Search by text (auto-embedding)

Health

  • GET /health - Health check endpoint

Example API Usage

# Create vector from text
curl -X POST http://localhost:8080/api/v1/vectors/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "artificial intelligence", "author": "AI"}'

# Search similar vectors
curl -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{
    "text": "machine learning",
    "limit": 5,
    "namespace": "quotes"
  }'

# Get vector count
curl http://localhost:8080/api/v1/vectors/count

# Manual vector creation
curl -X POST http://localhost:8080/api/v1/vectors \
  -H "Content-Type: application/json" \
  -d '{
    "id": "custom1",
    "embedding": [0.1, 0.2, 0.3, ...],
    "metadata": {"type": "custom"}
  }'

Architecture

System Architecture

graph TD
    subgraph Client
        A[User / API Client]
    end

    subgraph Server
        B[HTTP Server]
        C[Handlers]
        D[Embedders]
        E[Storage]
        F[Models]
    end

    subgraph Embedders
        D1[TF-IDF]
        D2[Gemini]
        D3[HuggingFace]
        D4[CLIP]
    end

    subgraph Storage
        E1[Memory]
        E2[Local File]
    end

    A -->|REST API| B
    B --> C
    C --> D
    C --> E
    C --> F
    D --> D1
    D --> D2
    D --> D3
    D --> D4
    E --> E1
    E --> E2
Loading

Project Structure

same-same/
├── cmd/
│   └── same-same/              # Main CLI application
│       ├── main.go            # Entry point
│       └── cmd/               # Cobra commands
│           ├── root.go        # Root command
│           ├── serve.go       # Server command
│           └── ingest.go      # Ingest command
├── internal/
│   ├── embedders/             # Embedding implementations
│   │   ├── embedder.go       # Base interface
│   │   ├── multimodal.go     # Multimodal interfaces
│   │   ├── clip/             # CLIP embedders
│   │   │   ├── simple.go     # Pure Go CLIP (default)
│   │   │   ├── clip.go       # Python OpenCLIP (optional)
│   │   │   └── native.go     # Advanced Go CLIP
│   │   └── quotes/           # Text embedders
│   │       ├── gemini/       # Google Gemini
│   │       ├── huggingface/  # HuggingFace
│   │       └── local/tfidf/  # Local TF-IDF
│   ├── handlers/             # HTTP handlers
│   ├── ingestion/            # Data ingestion
│   │   ├── source.go         # Source interface
│   │   ├── builtin.go        # Built-in datasets
│   │   ├── file.go           # CSV/JSONL
│   │   ├── image.go          # Image sources
│   │   ├── huggingface.go    # HuggingFace
│   │   └── ingestor.go       # Main ingestion logic
│   ├── models/               # Data models
│   ├── server/               # HTTP server
│   └── storage/              # Storage implementations
│       ├── memory/           # In-memory
│       └── local/            # File-based
├── .examples/                # Example data and scripts
│   ├── data/                 # Sample datasets
│   ├── images/               # Sample images
│   └── test_clip.py          # CLIP installation test
└── docs/                     # Documentation

Embedder Interface

The system uses a pluggable embedder interface:

// Text embedder
type Embedder interface {
    Embed(text string) ([]float64, error)
    Name() string
}

// Image embedder
type ImageEmbedder interface {
    EmbedImage(imagePath string) ([]float64, error)
    EmbedImageBytes(imageData []byte) ([]float64, error)
    Name() string
}

// Multimodal embedder (text + images)
type MultiModalEmbedder interface {
    Embedder
    ImageEmbedder
    Dimensions() int
}

Supported Embedders:

  • TF-IDF (local, no dependencies) - Text only
  • Gemini (Google API) - Text only
  • HuggingFace (API) - Text only
  • CLIP (Pure Go or Python) - Text + Images

Environment Variables

# Embedder selection (optional, defaults to local)
export EMBEDDER_TYPE=local        # Options: local, gemini, huggingface, clip

# API keys (if using external embedders)
export GEMINI_API_KEY=your_key
export HUGGINGFACE_API_KEY=your_key

# CLIP mode (optional, defaults to Pure Go)
export CLIP_USE_PYTHON=true       # Use Python OpenCLIP for higher accuracy

Development

Build

go build ./cmd/same-same

Run Tests

go test ./...

Test Embedders

# Test TF-IDF (default)
same-same ingest -v demo

# Test CLIP (Pure Go)
same-same ingest -e clip -v images:.examples/images

# Test CLIP (Python - requires installation)
export CLIP_USE_PYTHON=true
same-same ingest -e clip -v images:.examples/images

# Test Gemini
export GEMINI_API_KEY=your_key
export EMBEDDER_TYPE=gemini
same-same ingest -v demo

Adding New Embedders

  1. Implement the embedders.Embedder interface (or MultiModalEmbedder)
  2. Add your implementation to internal/embedders/
  3. Update cmd/same-same/cmd/ingest.go to include your embedder

Example:

package myembedder

import "github.com/tahcohcat/same-same/internal/embedders"

type MyEmbedder struct {
    // fields
}

func NewMyEmbedder() embedders.Embedder {
    return &MyEmbedder{}
}

func (m *MyEmbedder) Embed(text string) ([]float64, error) {
    // implementation
}

func (m *MyEmbedder) Name() string {
    return "my-embedder"
}

Docker

# Build image
docker build -t same-same .

# Run container
docker run -d \
  --name same-same \
  -p 8080:8080 \
  -e EMBEDDER_TYPE=local \
  same-same

# With Gemini embedder
docker run -d \
  --name same-same \
  -p 8080:8080 \
  -e EMBEDDER_TYPE=gemini \
  -e GEMINI_API_KEY=your_key \
  same-same

Performance

Ingestion Speed

Source Embedder Speed Notes
Built-in TF-IDF ~39k records/sec Pure Go
CSV TF-IDF ~10k records/sec Depends on file I/O
Images CLIP (Go) ~1k images/sec Pure Go, no Python
Images CLIP (Python) ~50-100 images/sec Higher accuracy
HuggingFace TF-IDF Varies Network dependent

Storage Options

Type Speed Persistence Use Case
Memory Fastest No Development, testing
Local File Fast Yes Production, single instance

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: go test ./...
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Support


Made with love for the vector search community