GitHub Repository Crawler

A high-performance, concurrent GitHub repository crawler that efficiently collects repository metadata using GitHub's GraphQL API. Built with clean architecture principles and designed to scale.

🚀 Features

  • Concurrent crawling with configurable parallelism (15 concurrent queries by default)
  • Smart query generation using multi-dimensional search strategies
  • Rate limit handling with automatic pausing and retry mechanisms
  • Clean architecture with separated domain, application, and infrastructure layers
  • Immutable data structures throughout the codebase
  • Anti-corruption layer for GitHub API translation
  • Coverage tracking to ensure comprehensive data collection
  • PostgreSQL storage with efficient upsert operations

📊 Performance

  • Collects 100,000 repositories in under 8 minutes
  • Handles GitHub API rate limits gracefully
  • Minimizes duplicate API calls through intelligent query generation
  • Processes ~200 repositories per second

🏗️ Architecture

The project follows clean architecture principles with three main layers:

Domain Layer

  • Entities: Repository, SearchDimension
  • Value Objects: QueryStrategy, CoverageStats, CrawlerStats
  • Pure data models with immutable dataclasses
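
To make the immutable-dataclass style concrete, here is a minimal sketch of what the Repository entity could look like. The field names are assumptions that mirror the repositories table described below, not necessarily the project's exact definition:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass(frozen=True)
class Repository:
    """Immutable domain entity; fields mirror the repositories table (assumed)."""
    id: int                                   # GitHub's databaseId
    full_name: str                            # "owner/repo"
    stars: int
    scraped_at: datetime
    extra: dict[str, Any] = field(default_factory=dict)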

Application Layer

  • QueryGenerator: Generates non-overlapping search queries
  • CrawlerService: Orchestrates individual query execution
  • CrawlerOrchestrator: Manages the overall crawling process

Infrastructure Layer

  • GitHubClient: Handles GitHub GraphQL API communication (rate-limit handling sketched below)
  • RepoStorage: Manages PostgreSQL database operations
  • Anti-corruption Layer: Translates between GitHub API and domain models
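
As an illustration of how a GraphQL client can combine queries with the rate-limit pausing noted in the features, here is a minimal sketch. The rateLimit fields (remaining, resetAt) exist in GitHub's GraphQL schema when selected in the query; the function shape, threshold, and use of the requests library are assumptions:

import time
from datetime import datetime, timezone

import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def run_query(token: str, query: str, variables: dict) -> dict:
    """Execute one GraphQL query, pausing if the rate limit is nearly spent.

    Assumes the query selects `rateLimit { remaining resetAt }` alongside
    the search results.
    """
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": query, "variables": variables},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    payload = resp.json()

    limit = payload["data"]["rateLimit"]
    if limit["remaining"] < 10:
        # Sleep until the rate-limit window resets, then continue crawling.
        reset_at = datetime.fromisoformat(limit["resetAt"].replace("Z", "+00:00"))
        wait = (reset_at - datetime.now(timezone.utc)).total_seconds()
        time.sleep(max(wait, 0) + 1)
    return payload["data"]

The anti-corruption layer would then map the raw search nodes returned here into Repository entities before anything reaches storage.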

📦 Data Schema

Current Schema

repositories (
    id          BIGINT       PRIMARY KEY,    -- GitHub's databaseId
    full_name   TEXT         UNIQUE,         -- owner/repo format
    stars       INT,                         -- Current star count
    scraped_at  TIMESTAMPTZ  DEFAULT NOW(),  -- Last update time
    extra       JSONB        DEFAULT '{}'    -- Flexible metadata storage
)

crawl_runs (
    id              SERIAL PRIMARY KEY,
    completed_at    TIMESTAMPTZ DEFAULT NOW(),
    coverage_report JSONB,
    total_repos     INT
)
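
The "efficient upsert operations" mentioned in the features typically rely on PostgreSQL's ON CONFLICT clause. A sketch of how RepoStorage might apply it, with the driver choice and function name being assumptions:

import json

import psycopg  # assumed driver; the project may use a different one

UPSERT_SQL = """
INSERT INTO repositories (id, full_name, stars, scraped_at, extra)
VALUES (%s, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE SET
    full_name  = EXCLUDED.full_name,
    stars      = EXCLUDED.stars,
    scraped_at = NOW(),
    extra      = EXCLUDED.extra
"""

def upsert_repos(conn: psycopg.Connection, repos: list[dict]) -> None:
    """Insert new rows or refresh existing ones, keyed on GitHub's databaseId."""
    rows = [
        (r["id"], r["full_name"], r["stars"], json.dumps(r.get("extra", {})))
        for r in repos
    ]
    with conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    conn.commit()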

Schema Evolution Strategy

The schema is designed to evolve efficiently as new metadata requirements emerge:

  1. Normalized Approach (Recommended for structured data):

    -- Separate tables for each entity type
    repositories (id, full_name, stars, extra)
    pull_requests (id, repo_id, number, title, state)
    pr_comments (id, pr_id, body, created_at, author)
    issues (id, repo_id, number, title, state)
  2. Event-Driven Approach (For audit trails and flexibility):

    -- Core data with immutable event history
    repositories (id, full_name, current_stars)
    repository_events (
        id, repo_id, event_type, event_data JSONB, 
        occurred_at, processed_at
    )
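
As a sketch of how the event-driven approach would be used in practice, a star-count change could be appended as an immutable event while the current value is refreshed in place. Table and column names follow the outline above; the helper name and driver are hypothetical:

import json

import psycopg

def record_star_change(conn: psycopg.Connection, repo_id: int, new_stars: int) -> None:
    """Append an immutable event, then refresh the current value in place."""
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO repository_events (repo_id, event_type, event_data, occurred_at)"
            " VALUES (%s, %s, %s, NOW())",
            (repo_id, "stars_changed", json.dumps({"stars": new_stars})),
        )
        cur.execute(
            "UPDATE repositories SET current_stars = %s WHERE id = %s",
            (new_stars, repo_id),
        )
    conn.commit()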

🔍 Query Generation Strategy

The crawler uses a sophisticated multi-dimensional query generation system:

Search Dimensions

  • Language: Python, JavaScript, Java, Go, TypeScript, etc.
  • Stars: Bucketed ranges (0-10, 11-50, 51-100, etc.)
  • Creation Date: Quarterly and yearly ranges
  • Repository Size: Small to large codebases
  • Activity Metrics: Forks, issues, archived status
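
Combining these dimensions yields narrow, non-overlapping search qualifiers, which matters because GitHub search returns at most 1,000 results per query. A minimal sketch, with illustrative buckets that may not match the crawler's actual ones:

from itertools import product

# Illustrative slices of each dimension; the real crawler's buckets may differ.
LANGUAGES = ["Python", "JavaScript", "Go"]
STAR_BUCKETS = ["0..10", "11..50", "51..100"]
CREATED_RANGES = ["2023-01-01..2023-03-31", "2023-04-01..2023-06-30"]

def build_queries() -> list[str]:
    """Combine dimensions into disjoint GitHub search query strings."""
    return [
        f"language:{lang} stars:{stars} created:{created}"
        for lang, stars, created in product(LANGUAGES, STAR_BUCKETS, CREATED_RANGES)
    ]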

Coverage Optimization

  • Tracks which dimension combinations have been queried
  • Prioritizes under-explored areas of the search space
  • Prevents duplicate queries through combination tracking
  • Provides detailed coverage reports
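
Combination tracking can be as simple as a set of dimension tuples; a minimal sketch (the real coverage model likely records more detail per combination):

class CoverageTracker:
    """Remembers which dimension combinations have already been queried."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, ...]] = set()

    def mark(self, combo: tuple[str, ...]) -> None:
        self._seen.add(combo)

    def is_covered(self, combo: tuple[str, ...]) -> bool:
        return combo in self._seen

    def coverage(self, total_combos: int) -> float:
        """Fraction of the search space explored so far."""
        return len(self._seen) / total_combos if total_combos else 0.0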

See query_builder_logic.md for a detailed explanation with examples.

🚦 Getting Started

Prerequisites

  • Python 3.12+
  • PostgreSQL 16+
  • GitHub Personal Access Token

Installation

# Clone the repository
git clone https://github.com/yourusername/github-crawler-stars.git
cd github-crawler-stars

# Install dependencies
pip install poetry
poetry install

# Set up environment variables
export GITHUB_TOKEN="your-github-token"
export DATABASE_URL="postgresql://user:pass@localhost:5432/dbname"

Database Setup

# Create the database schema
psql -d your_database -f schema.sql
psql -d your_database -f crawl_runs.sql

Running the Crawler

# Run the crawler
poetry run python crawl.py

🔄 GitHub Actions Workflow

The project includes a complete CI/CD pipeline that:

  1. Sets up a PostgreSQL service container
  2. Initializes the database schema
  3. Runs the crawler with automatic error handling
  4. Exports results as CSV artifacts

🚀 Future Enhancements

Scaling to 500M+ Repositories

  1. Enhanced Query Generation

    • Query result analysis for coverage optimization
    • Machine learning for query effectiveness prediction
    • Dynamic query adjustment based on result density
  2. Distributed Architecture

    • Multiple worker nodes with centralized queue
    • Sharding by repository creation date
    • Delta crawling for update detection
  3. Infrastructure Improvements

    • Caching layer for popular repositories
    • Columnar storage for analytics
    • Data lake integration for raw data archival
  4. Advanced Features

    • Real-time update streaming
    • Change detection and notification system
    • API for querying collected data
    • Data quality monitoring and alerting

Monitoring & Observability

  • Metrics dashboard for crawl performance
  • API token health monitoring
  • Data distribution analysis
  • Automated data quality checks

📈 Metrics & Monitoring

The crawler tracks:

  • Total API calls and success rates
  • Rate limit utilization
  • Query effectiveness (repos found per query)
  • Dimension coverage percentages
  • Crawl duration and throughput
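
The CrawlerStats value object named in the Domain Layer plausibly bundles these counters. An assumed shape, not the project's actual definition:

from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerStats:
    """Assumed metrics container; fields mirror the list above."""
    api_calls: int
    successful_calls: int
    repos_found: int
    queries_run: int
    duration_seconds: float

    @property
    def success_rate(self) -> float:
        return self.successful_calls / self.api_calls if self.api_calls else 0.0

    @property
    def repos_per_second(self) -> float:
        return self.repos_found / self.duration_seconds if self.duration_seconds else 0.0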

🤝 Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.

📄 License

This project is open source and available under the MIT License.
