A high-performance, concurrent GitHub repository crawler that efficiently collects repository metadata using GitHub's GraphQL API. Built with clean architecture principles and designed to scale.
- Concurrent crawling with configurable parallelism (15 concurrent queries by default)
- Smart query generation using multi-dimensional search strategies
- Rate limit handling with automatic pausing and retry mechanisms
- Clean architecture with separated domain, application, and infrastructure layers
- Immutable data structures throughout the codebase
- Anti-corruption layer for GitHub API translation
- Coverage tracking to ensure comprehensive data collection
- PostgreSQL storage with efficient upsert operations
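As a rough sketch of how the concurrent crawling and rate-limit handling above can fit together, the snippet below caps parallelism with an asyncio semaphore and backs off when a query hits the rate limit. The `run_query` stub and the fixed 60-second back-off are illustrative assumptions, not the project's actual client code.

```python
import asyncio
import random

MAX_CONCURRENT_QUERIES = 15  # default parallelism noted above


async def run_query(query: str) -> list[dict]:
    """Stub for a GraphQL search call; the real client would hit GitHub's API here."""
    await asyncio.sleep(random.uniform(0.1, 0.3))  # simulate network latency
    return []


async def run_with_limit(sem: asyncio.Semaphore, query: str) -> list[dict]:
    """Execute one query under the shared concurrency cap, pausing on rate-limit errors."""
    async with sem:
        try:
            return await run_query(query)
        except RuntimeError:  # stand-in for whatever rate-limit error the client raises
            await asyncio.sleep(60)  # pause until the rate-limit window resets, then retry once
            return await run_query(query)


async def crawl(queries: list[str]) -> list[list[dict]]:
    sem = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)
    return await asyncio.gather(*(run_with_limit(sem, q) for q in queries))


if __name__ == "__main__":
    asyncio.run(crawl(["language:Python stars:0..10", "language:Go stars:11..50"]))
```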
- Collects 100,000 repositories in under 8 minutes
- Handles GitHub API rate limits gracefully
- Minimizes duplicate API calls through intelligent query generation
- Processes ~200 repositories per second
The project follows clean architecture principles with three main layers:
- Entities: Repository, SearchDimension
- Value Objects: QueryStrategy, CoverageStats, CrawlerStats
- Pure data models with immutable dataclasses
- QueryGenerator: Generates non-overlapping search queries
- CrawlerService: Orchestrates individual query execution
- CrawlerOrchestrator: Manages the overall crawling process
- GitHubClient: Handles GitHub GraphQL API communication
- RepoStorage: Manages PostgreSQL database operations
- Anti-corruption Layer: Translates between GitHub API and domain models
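A minimal sketch of how the domain model and anti-corruption layer might fit together, assuming frozen dataclasses for immutability. The GraphQL field names (databaseId, nameWithOwner, stargazerCount) come from GitHub's schema, but the Repository shape and helper below are illustrative rather than the project's actual code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)  # frozen=True enforces immutability
class Repository:
    id: int                                        # GitHub's databaseId
    full_name: str                                 # owner/repo
    stars: int
    extra: dict[str, Any] = field(default_factory=dict)


def repository_from_node(node: dict[str, Any]) -> Repository:
    """Anti-corruption translation: map a raw GraphQL node onto the domain model."""
    return Repository(
        id=node["databaseId"],
        full_name=node["nameWithOwner"],
        stars=node["stargazerCount"],
        extra={"primaryLanguage": (node.get("primaryLanguage") or {}).get("name")},
    )
```

Collected repositories are then persisted in PostgreSQL using the schema below.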
```sql
repositories (
    id          BIGINT PRIMARY KEY,           -- GitHub's databaseId
    full_name   TEXT UNIQUE,                  -- owner/repo format
    stars       INT,                          -- Current star count
    scraped_at  TIMESTAMPTZ DEFAULT NOW(),    -- Last update time
    extra       JSONB DEFAULT '{}'            -- Flexible metadata storage
)
```

```sql
crawl_runs (
    id              SERIAL PRIMARY KEY,
    completed_at    TIMESTAMPTZ DEFAULT NOW(),
    coverage_report JSONB,
    total_repos     INT
)
```

The schema is designed to evolve efficiently as new metadata requirements emerge:
- Normalized Approach (Recommended for structured data):

  ```sql
  -- Separate tables for each entity type
  repositories  (id, full_name, stars, extra)
  pull_requests (id, repo_id, number, title, state)
  pr_comments   (id, pr_id, body, created_at, author)
  issues        (id, repo_id, number, title, state)
  ```
- Event-Driven Approach (For audit trails and flexibility):

  ```sql
  -- Core data with immutable event history
  repositories (id, full_name, current_stars)
  repository_events (
      id, repo_id, event_type,
      event_data JSONB,
      occurred_at, processed_at
  )
  ```
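The efficient upsert operations mentioned in the feature list typically map to PostgreSQL's INSERT ... ON CONFLICT. Below is a minimal sketch against the repositories table, written with psycopg 3; the actual RepoStorage implementation may differ.

```python
import psycopg
from psycopg.types.json import Jsonb

UPSERT_SQL = """
INSERT INTO repositories (id, full_name, stars, scraped_at, extra)
VALUES (%s, %s, %s, NOW(), %s)
ON CONFLICT (id) DO UPDATE
SET full_name  = EXCLUDED.full_name,
    stars      = EXCLUDED.stars,
    scraped_at = EXCLUDED.scraped_at,
    extra      = EXCLUDED.extra
"""


def upsert_repositories(conn: psycopg.Connection, repos: list[dict]) -> None:
    """Insert new repositories and refresh existing rows in a single batch."""
    with conn.cursor() as cur:
        cur.executemany(
            UPSERT_SQL,
            [(r["id"], r["full_name"], r["stars"], Jsonb(r.get("extra", {}))) for r in repos],
        )
    conn.commit()
```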
The crawler uses a sophisticated multi-dimensional query generation system:
- Language: Python, JavaScript, Java, Go, TypeScript, etc.
- Stars: Bucketed ranges (0-10, 11-50, 51-100, etc.)
- Creation Date: Quarterly and yearly ranges
- Repository Size: Small to large codebases
- Activity Metrics: Forks, issues, archived status
- Tracks which dimension combinations have been queried
- Prioritizes under-explored areas of the search space
- Prevents duplicate queries through combination tracking
- Provides detailed coverage reports
See query_builder_logic.md for a detailed explanation with examples.
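As a rough illustration, the sketch below combines dimension values into GitHub search qualifiers while recording which combinations have already been issued. The dimension values shown are a small subset, and the real QueryGenerator is more elaborate.

```python
from itertools import product

# Illustrative subset of dimension values; the real crawler uses many more buckets.
LANGUAGES = ["Python", "JavaScript", "Go"]
STAR_BUCKETS = ["0..10", "11..50", "51..100"]
CREATED_RANGES = ["2023-01-01..2023-03-31", "2023-04-01..2023-06-30"]


def generate_queries(seen: set[tuple[str, str, str]]) -> list[str]:
    """Build non-overlapping search queries, skipping combinations already queried."""
    queries = []
    for combo in product(LANGUAGES, STAR_BUCKETS, CREATED_RANGES):
        if combo in seen:
            continue  # combination tracking prevents duplicate queries
        seen.add(combo)
        language, stars, created = combo
        queries.append(f"language:{language} stars:{stars} created:{created}")
    return queries


if __name__ == "__main__":
    seen: set[tuple[str, str, str]] = set()
    for query in generate_queries(seen):
        print(query)  # e.g. "language:Python stars:0..10 created:2023-01-01..2023-03-31"
```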
- Python 3.12+
- PostgreSQL 16+
- GitHub Personal Access Token
```bash
# Clone the repository
git clone https://github.com/yourusername/github-crawler-stars.git
cd github-crawler-stars

# Install dependencies
pip install poetry
poetry install

# Set up environment variables
export GITHUB_TOKEN="your-github-token"
export DATABASE_URL="postgresql://user:pass@localhost:5432/dbname"

# Create the database schema
psql -d your_database -f schema.sql
psql -d your_database -f crawl_runs.sql

# Run the crawler
poetry run python crawl.py
```

The project includes a complete CI/CD pipeline that:
- Sets up a PostgreSQL service container
- Initializes the database schema
- Runs the crawler with automatic error handling
- Exports results as CSV artifacts
- Enhanced Query Generation
  - Query result analysis for coverage optimization
  - Machine learning for query effectiveness prediction
  - Dynamic query adjustment based on result density
- Distributed Architecture
  - Multiple worker nodes with centralized queue
  - Sharding by repository creation date
  - Delta crawling for update detection
- Infrastructure Improvements
  - Caching layer for popular repositories
  - Columnar storage for analytics
  - Data lake integration for raw data archival
- Advanced Features
  - Real-time update streaming
  - Change detection and notification system
  - API for querying collected data
- Data quality monitoring and alerting
- Metrics dashboard for crawl performance
- API token health monitoring
- Data distribution analysis
- Automated data quality checks
The crawler tracks:
- Total API calls and success rates
- Rate limit utilization
- Query effectiveness (repos found per query)
- Dimension coverage percentages
- Crawl duration and throughput
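One way to carry these numbers around is an immutable value object like the CrawlerStats named in the domain layer; the field names and derived properties below are inferred from the list above and are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerStats:
    """Illustrative snapshot of crawl metrics; field names are assumptions."""
    total_api_calls: int
    successful_calls: int
    rate_limit_remaining: int
    repos_found: int
    queries_executed: int
    duration_seconds: float

    @property
    def success_rate(self) -> float:
        return self.successful_calls / self.total_api_calls if self.total_api_calls else 0.0

    @property
    def repos_per_query(self) -> float:
        """Query effectiveness: repositories found per executed query."""
        return self.repos_found / self.queries_executed if self.queries_executed else 0.0

    @property
    def repos_per_second(self) -> float:
        return self.repos_found / self.duration_seconds if self.duration_seconds else 0.0
```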
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.
This project is open source and available under the MIT License.