hacktoolkit/chameleon

Chameleon

Adaptive web crawler designed to bypass bot detection and enable comprehensive AI-assisted research.

Features

  • Multiple fetch strategies - Automatically tries different methods to fetch content successfully:
      • Basic HTTP fetch - Fast, lightweight fetching for unprotected sites
      • Stealth mode - Playwright-based browser automation with anti-detection measures
      • Wayback Machine fallback - Fetch cached versions when live sites are blocked
  • Content extraction - Clean text extraction with CSS selector support
  • Metadata extraction - Extract page titles, descriptions, and OpenGraph data
  • Link extraction - Extract all links from a page

Installation

cd ~/code/chameleon
npm install
npm link  # Makes 'chameleon' available globally

Usage

Basic fetch

# Simple fetch (auto-selects best method)
chameleon fetch https://example.com

# Save to file
chameleon fetch https://example.com -o output.txt

# Get HTML instead of text
chameleon fetch https://example.com --format html

# Get JSON output with metadata
chameleon fetch https://example.com --format json --metadata

Force specific fetch method

# Force basic HTTP fetch
chameleon fetch https://example.com --basic

# Force stealth mode (Playwright)
chameleon fetch https://example.com --stealth

# Force Wayback Machine
chameleon fetch https://g2.com/products/crunchtime/reviews --wayback

Content extraction

# Extract specific elements
chameleon fetch https://example.com --extract "h1"
chameleon fetch https://example.com --extract ".review-content"
chameleon fetch https://example.com --extract "blockquote"

# Extract with metadata
chameleon fetch https://example.com --extract "article" --metadata

# Extract all links
chameleon fetch https://example.com --links

Wayback Machine

# List available snapshots
chameleon snapshots https://g2.com/products/crunchtime/reviews

# Fetch from Wayback
chameleon fetch https://g2.com/products/crunchtime/reviews --wayback
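
Under the hood, a Wayback fallback can locate snapshots through archive.org's public availability API. A minimal sketch in Node.js (the helper names are assumptions for illustration, not Chameleon's actual internals):

```javascript
// Query archive.org's availability API for the snapshot closest to a target
// URL. Helper names (availabilityUrl, latestSnapshot) are illustrative only.
function availabilityUrl(target, timestamp) {
  const u = new URL('https://archive.org/wayback/available');
  u.searchParams.set('url', target);
  if (timestamp) u.searchParams.set('timestamp', timestamp); // YYYYMMDD
  return u.toString();
}

async function latestSnapshot(target) {
  const res = await fetch(availabilityUrl(target)); // global fetch, Node 18+
  const data = await res.json();
  // Response shape: { archived_snapshots: { closest: { url, timestamp, available } } }
  return data.archived_snapshots?.closest?.url ?? null;
}
```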

Test which methods work

# Test all fetch methods for a URL
chameleon test https://example.com

Debugging

# Run browser in visible mode
chameleon fetch https://example.com --stealth --headful

# Take screenshot
chameleon fetch https://example.com --stealth --screenshot debug.png

# Increase timeout
chameleon fetch https://slow-site.com --timeout 60000

# Increase wait time for JS rendering
chameleon fetch https://spa-site.com --stealth --wait 5000

Output Formats

Text (default)

Clean, extracted text content without HTML tags.

HTML

Raw HTML content as received.

JSON

Structured output including:

  • url - Final URL after redirects
  • status - HTTP status code
  • method - Fetch method used (basic/stealth/wayback)
  • text - Extracted text content
  • metadata - Page metadata (with --metadata flag)
  • links - Extracted links (with --links flag)
  • extracted - Content matching CSS selector (with --extract flag)
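
For illustration, a plausible shape of the JSON output for a fetch of https://example.com with --metadata, --links, and --extract "h1" enabled (field values here are invented; the exact shape may differ):

```json
{
  "url": "https://example.com/",
  "status": 200,
  "method": "basic",
  "text": "Example Domain. This domain is for use in illustrative examples.",
  "metadata": { "title": "Example Domain" },
  "links": ["https://www.iana.org/domains/example"],
  "extracted": ["Example Domain"]
}
```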

Fetch Methods

Basic (--basic)

  • Uses node-fetch with browser-like headers
  • Fastest method, lowest resource usage
  • Works for sites without bot protection
  • Includes automatic retry with backoff
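
The "retry with backoff" behavior described above can be sketched as follows; the function names (fetchWithRetry, backoffDelay) and parameters are illustrative assumptions, not Chameleon's actual API:

```javascript
// Exponential backoff: attempt 0 waits 1s before retrying, then 2s, 4s, ...
const backoffDelay = (attempt, baseMs = 1000) => baseMs * 2 ** attempt;

async function fetchWithRetry(url, { retries = 3, headers = {} } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url, { headers }); // global fetch, Node 18+
      if (res.ok) return res;
      if (attempt >= retries) throw new Error(`HTTP ${res.status}`);
    } catch (err) {
      if (attempt >= retries) throw err;
    }
    // Wait longer after each failure before trying again.
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt)));
  }
}
```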

Stealth (--stealth)

  • Uses Playwright with anti-detection measures
  • Bypasses basic JavaScript challenges
  • Handles SPAs and JS-rendered content
  • Overrides navigator.webdriver detection
  • Randomizes viewport size and user agent
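
The navigator.webdriver override boils down to redefining the property before any site script runs; in Playwright this kind of function is passed to context.addInitScript(). A minimal sketch, with maskWebdriver as an assumed helper name:

```javascript
// Redefine navigator.webdriver so detection scripts see `undefined` instead
// of `true`. In a real setup this body would run inside the page, before any
// site script, via Playwright's addInitScript().
function maskWebdriver(nav) {
  Object.defineProperty(nav, 'webdriver', { get: () => undefined });
  return nav;
}
```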

Wayback (--wayback)

  • Fetches cached versions from archive.org
  • Useful when the live site has heavy protection
  • May have outdated content
  • Doesn't include JS-rendered content

Bot Detection Methods (and how we handle them)

Detection Method       Our Counter-Measure
--------------------   --------------------------------------------
User-Agent check       Realistic UA rotation
navigator.webdriver    Override to undefined
Headless detection     Chrome args to mask headless
JavaScript challenge   Playwright execution
Rate limiting          Retry with backoff
IP blocking            Wayback fallback
CAPTCHA                Wayback fallback (manual intervention needed)
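
"Realistic UA rotation" amounts to picking a plausible desktop user agent per request. A sketch (the pool below is illustrative; it is not Chameleon's actual list, and real crawlers keep a larger, up-to-date set):

```javascript
// A small pool of realistic desktop user-agent strings to rotate through.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

// Pick one at random for each outgoing request.
const randomUserAgent = () =>
  USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
```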

Site Compatibility

Site             Protection   Recommended Method
--------------   ----------   ------------------
example.com      None         basic
crunchtime.com   Low          basic
trustpilot.com   Medium       stealth
g2.com           High         wayback
capterra.com     High         wayback
indeed.com       High         wayback
glassdoor.com    Very High    wayback (limited)

Legal & Ethical Notes

This tool is intended for:

  • Educational purposes
  • Research with proper authorization
  • Accessing your own content
  • Fetching publicly archived content

Please:

  • Respect robots.txt
  • Don't hammer servers (built-in delays help)
  • Check site Terms of Service
  • Consider using official APIs when available
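
As a starting point for respecting robots.txt, here is a deliberately minimal disallow check. It is not shipped with Chameleon, and a real parser should also handle Allow rules, wildcards, and per-agent groups:

```javascript
// Return true if `path` is disallowed for `agent` by the given robots.txt
// text. Only plain Disallow prefix rules are handled in this sketch.
function isDisallowed(robotsTxt, path, agent = '*') {
  let applies = false;
  const rules = [];
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.split('#')[0].trim(); // strip comments and whitespace
    const [key, ...rest] = line.split(':');
    const value = rest.join(':').trim();
    if (/^user-agent$/i.test(key)) {
      applies = value === agent || value === '*';
    } else if (applies && /^disallow$/i.test(key) && value) {
      rules.push(value);
    }
  }
  return rules.some((prefix) => path.startsWith(prefix));
}
```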

Development

# Run directly without global install
node src/index.js fetch https://example.com

# Run tests
npm test

Troubleshooting

"Bot detection triggered"

  • Try --stealth mode
  • Try --wayback for cached version
  • The site may require residential proxies (not yet implemented)

"Timeout"

  • Increase timeout: --timeout 60000
  • The site may be slow, or it may be actively blocking requests

"No Wayback snapshot"

  • URL may not have been archived
  • Try chameleon snapshots <url> to see available dates

Stealth mode fails

  • Try --headful to see what's happening
  • Take --screenshot for debugging
  • Some sites detect even stealth browsers

Future Enhancements

  • Proxy rotation support
  • CAPTCHA solving integration
  • Cookie persistence between sessions
  • Rate limiting controls
  • Parallel fetching
  • Site-specific extractors (G2, Capterra, etc.)
