A Python web application for web scraping with a visual interface. It combines a pywebview GUI with a Flask backend and uses Playwright, with stealth capabilities, for browser automation that can bypass anti-bot detection.
- Visual interface: Desktop GUI powered by pywebview
- Cascade scraping: Multi-method fallback system (HTTP → Playwright → Puppeteer → Agent-browser → Browser-use)
- Anti-bot bypass: Playwright-stealth integration for handling Cloudflare, CAPTCHAs, and rate limiting
- Smart detection: Automatic poison pill detection (paywalls, rate limiting, anti-bot patterns, dead links)
- Data extraction: CSS selectors, XPath expressions, and vision-based OCR fallback
- Export options: SQLite database and Google Sheets integration
- AI-driven scraping: Optional browser-use integration for LLM-controlled browser automation
- Video transcription: Extract and transcribe videos from YouTube, Twitter/X, TikTok, and 1000+ platforms
- Local LLM integration: Free local inference via Ollama for summarization, entity extraction, and classification
- GUI: pywebview
- Backend: Flask + Flask-SocketIO
- Scraping: Playwright, requests, pyppeteer, browser-use (optional)
- Extraction: BeautifulSoup, lxml, Tesseract OCR (optional)
- Database: SQLite via SQLAlchemy
- Export: Google Sheets (gspread)
- Video: yt-dlp + faster-whisper for transcription
- LLM: Ollama (local) with OpenAI/Anthropic fallback
git clone https://github.com/jamditis/scrapefruit.git
cd scrapefruit
# Create virtual environment
python -m venv venv
# Activate (Linux/Mac)
source venv/bin/activate
# Activate (Windows)
venv\Scripts\activate
pip install -r requirements.txt
playwright install  # Install browser binaries
cp .env.example .env
# Edit .env with your settings
python main.py
The app runs on http://127.0.0.1:5150
The engine uses a configurable cascade strategy that automatically falls back between scraping methods:
| Method | Speed | JS Support | Use case |
|---|---|---|---|
| HTTP | Fastest | No | Static pages, APIs |
| Playwright | Medium | Yes | JavaScript-heavy sites, stealth mode |
| Puppeteer | Medium | Yes | Alternative browser fingerprint |
| Agent-browser | Slower | Yes | AI-optimized with accessibility tree |
| Browser-use | Slowest | Yes | LLM-controlled automation |
| Video | Varies | N/A | YouTube, Twitter/X, TikTok, 1000+ sites |
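As a rough illustration of the cascade idea, the loop below tries fetchers in order and returns the first result that passes an acceptance check. This is a minimal sketch, not the project's engine: `FetchResult`, `run_cascade`, and the `is_acceptable` callback are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class FetchResult:
    """Hypothetical result type; the project's real class may differ."""
    method: str
    status_code: int
    html: str


# A fetcher is any callable that takes a URL and returns a FetchResult.
Fetcher = Callable[[str], FetchResult]


def run_cascade(
    url: str,
    fetchers: Sequence[Tuple[str, Fetcher]],
    is_acceptable: Callable[[FetchResult], bool],
) -> Optional[FetchResult]:
    """Try each fetcher in order; return the first acceptable result, or None."""
    for _name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # A crashed fetcher is treated like a failed attempt: move on.
            continue
        if is_acceptable(result):
            return result
    return None
```

Ordering the list from cheapest to most expensive method (HTTP first, LLM-controlled automation last) keeps the common case fast and reserves the slow methods for pages that need them.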
Fallback triggers:
- Blocked status codes (403, 429, 503)
- Anti-bot detection patterns (Cloudflare, CAPTCHA)
- Empty or minimal content (<500 chars)
- JavaScript-heavy SPA markers
- Poison pill detection
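The first three triggers above can be sketched as a single predicate. This is an assumption-laden sketch: `fallback_reason` and the pattern list are hypothetical, the tag stripping is deliberately crude, and the SPA-marker and poison-pill checks are omitted.

```python
import re
from typing import Optional

BLOCKED_STATUS_CODES = {403, 429, 503}  # from the trigger list above
MIN_CONTENT_CHARS = 500                 # from the trigger list above

# Illustrative patterns only; a real detector would need a more careful list.
ANTI_BOT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"cloudflare", r"captcha", r"checking your browser")
]


def visible_text(html: str) -> str:
    """Crudely strip tags and collapse whitespace to estimate visible content."""
    without_tags = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", without_tags).strip()


def fallback_reason(status_code: int, html: str) -> Optional[str]:
    """Return a short reason string if the next method should be tried, else None."""
    if status_code in BLOCKED_STATUS_CODES:
        return "blocked_status"
    if any(p.search(html) for p in ANTI_BOT_PATTERNS):
        return "anti_bot"
    if len(visible_text(html)) < MIN_CONTENT_CHARS:
        return "minimal_content"
    return None
```

Returning a reason rather than a bare boolean makes it easier to log why a given method was abandoned.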
| Method | Description |
|---|---|
| CSS selectors | Standard CSS selector syntax |
| XPath | Full XPath expression support |
| Vision/OCR | Screenshot + Tesseract for anti-scraping bypasses |
When DOM extraction fails, the engine can automatically capture a screenshot and use OCR to extract text content.
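The DOM-then-OCR fallback can be pictured as follows. This is a sketch under stated assumptions: `extract_with_ocr_fallback` is a hypothetical helper, and the three callables are injected so the flow can be shown without a browser or Tesseract installed.

```python
from typing import Callable, Optional, Tuple


def extract_with_ocr_fallback(
    html: str,
    dom_extract: Callable[[str], Optional[str]],
    capture_screenshot: Callable[[], bytes],
    ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is "dom" or "ocr".

    dom_extract stands in for a CSS or XPath extractor. In a real setup the
    ocr callable could wrap pytesseract.image_to_string; that wiring is an
    assumption, not something this sketch performs.
    """
    text = dom_extract(html)
    if text and text.strip():
        return text.strip(), "dom"
    # DOM extraction produced nothing usable: take a screenshot and read it.
    image = capture_screenshot()
    return ocr(image).strip(), "ocr"
```

The screenshot is only captured when the DOM path fails, so the slower OCR step adds no cost to pages that extract cleanly.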
Key settings in config.py:
| Setting | Default | Description |
|---|---|---|
| FLASK_PORT | 5150 | Server port |
| DEFAULT_TIMEOUT | 30000 ms | Request timeout |
| DEFAULT_RETRY_COUNT | 3 | Retry attempts |
| CASCADE_ENABLED | True | Enable cascade fallback |
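One way to picture how such settings could be loaded is shown below, with the defaults taken from the table. Whether the project's config.py actually reads environment variables is an assumption; `load_settings` and `DEFAULTS` are hypothetical names used only for this sketch.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the settings table.
DEFAULTS = {
    "FLASK_PORT": 5150,
    "DEFAULT_TIMEOUT": 30000,  # milliseconds
    "DEFAULT_RETRY_COUNT": 3,
    "CASCADE_ENABLED": True,
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build a settings dict, letting environment values override the defaults."""
    env = os.environ if env is None else env
    settings = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        if raw is None:
            settings[key] = default
        elif isinstance(default, bool):
            # Check bool before int, because bool is a subclass of int.
            settings[key] = raw.strip().lower() in {"1", "true", "yes", "on"}
        else:
            settings[key] = int(raw)
    return settings
```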
The project includes a comprehensive test suite:
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific test categories
pytest tests/unit/ # Unit tests
pytest tests/integration/ # Integration tests
pytest tests/stress/ # Stress tests
Test coverage:
- 170+ tests across unit, integration, and stress testing
- Poison pill detection (paywall, rate limiting, anti-bot, CAPTCHA, dead links)
- Extractors (CSS, XPath, Vision/OCR)
- Fetchers (HTTP, Playwright, Puppeteer, Agent-browser, Browser-use)
- Edge cases (large content, malformed HTML, concurrency, unicode)
- Python 3.11+
- Chromium (installed via Playwright)
- Tesseract OCR (optional, for vision extraction)
| Package | Purpose | Install |
|---|---|---|
| pyppeteer | Puppeteer browser automation | pip install pyppeteer |
| browser-use | AI-driven browser control | pip install browser-use |
| pytesseract | Vision/OCR extraction | pip install pytesseract, plus a system install of Tesseract |
| agent-browser | Accessibility tree scraping | npm install -g agent-browser |
| yt-dlp | Video/audio extraction | pip install yt-dlp |
| faster-whisper | Audio transcription | pip install faster-whisper |
| ollama | Local LLM inference | Install from ollama.ai, then ollama pull qwen2.5:0.5b |
Extract and transcribe videos from 1000+ platforms:
from core.scraping.fetchers.video_fetcher import VideoFetcher
fetcher = VideoFetcher(whisper_model="tiny", use_2x_speed=True)
result = fetcher.fetch("https://youtube.com/watch?v=...")
print(result.transcript) # Plain text
print(result.to_srt()) # SRT subtitles
print(result.metadata.title) # Video metadata
Supported platforms: YouTube, Vimeo, Twitter/X, TikTok, Facebook, Instagram, Twitch, Dailymotion, and 1000+ more via yt-dlp.
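For readers unfamiliar with the SRT format that `to_srt()` produces, the sketch below shows how timed transcript segments map onto SRT blocks. It is not the project's implementation: `format_timestamp`, `segments_to_srt`, and the `(start, end, text)` tuple shape are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}"
        )
    # SRT blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"
```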
Free local inference via Ollama (no API costs):
from core.llm import get_llm_service
llm = get_llm_service()
result = llm.summarize("Long article text...")
entities = llm.extract_entities("Text with names and dates...")
Setup:
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5:0.5b # 400MB, good for low-memory systems
The service auto-detects Ollama and falls back to OpenAI/Anthropic if API keys are set.
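The auto-detection and fallback behaviour could look roughly like the sketch below. Several things here are assumptions rather than the project's code: the `choose_provider` and `ollama_available` helpers are hypothetical, the environment variable names `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` follow common convention, and the OpenAI-before-Anthropic order is arbitrary. The probe queries Ollama's local `/api/tags` endpoint on its default port.

```python
import os
import urllib.request
from typing import Callable, Mapping, Optional

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def ollama_available(base_url: str = OLLAMA_URL, timeout: float = 1.0) -> bool:
    """Return True if a local Ollama server answers on its model-list endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refusals, and timeouts are all OSError subclasses.
        return False


def choose_provider(
    env: Optional[Mapping[str, str]] = None,
    probe: Callable[[], bool] = ollama_available,
) -> Optional[str]:
    """Prefer local Ollama; otherwise use a hosted provider whose key is set."""
    env = os.environ if env is None else env
    if probe():
        return "ollama"
    if env.get("OPENAI_API_KEY"):       # assumed variable name
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):    # assumed variable name
        return "anthropic"
    return None
```

Passing the probe in as a parameter keeps the selection logic testable without a running Ollama server.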
Contributions are welcome! Please feel free to submit a Pull Request.
