
# Scrapefruit


A Python desktop application for web scraping with a visual interface. It combines a pywebview GUI with a Flask backend and uses Playwright for browser automation, with stealth capabilities to bypass anti-bot detection.

## Features

  • Visual interface: Desktop GUI powered by pywebview
  • Cascade scraping: Multi-method fallback system (HTTP → Playwright → Puppeteer → Agent-browser → Browser-use)
  • Anti-bot bypass: Playwright-stealth integration for handling Cloudflare, CAPTCHAs, and rate limiting
  • Smart detection: Automatic poison pill detection (paywalls, rate limiting, anti-bot patterns, dead links)
  • Data extraction: CSS selectors, XPath expressions, and vision-based OCR fallback
  • Export options: SQLite database and Google Sheets integration
  • AI-driven scraping: Optional browser-use integration for LLM-controlled browser automation
  • Video transcription: Extract and transcribe videos from YouTube, Twitter/X, TikTok, and 1000+ platforms
  • Local LLM integration: Free local inference via Ollama for summarization, entity extraction, and classification

## Tech stack

  • GUI: pywebview
  • Backend: Flask + Flask-SocketIO
  • Scraping: Playwright, requests, pyppeteer, browser-use (optional)
  • Extraction: BeautifulSoup, lxml, Tesseract OCR (optional)
  • Database: SQLite via SQLAlchemy
  • Export: Google Sheets (gspread)
  • Video: yt-dlp + faster-whisper for transcription
  • LLM: Ollama (local) with OpenAI/Anthropic fallback

## Quick start

### 1. Clone and setup

```bash
git clone https://github.com/jamditis/scrapefruit.git
cd scrapefruit

# Create virtual environment
python -m venv venv

# Activate (Linux/Mac)
source venv/bin/activate

# Activate (Windows)
venv\Scripts\activate
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
playwright install  # Install browser binaries
```

### 3. Configure environment

```bash
cp .env.example .env
# Edit .env with your settings
```

### 4. Run

```bash
python main.py
```

The app runs on http://127.0.0.1:5150.

## Cascade scraping

The engine uses a configurable cascade strategy that automatically falls back between scraping methods:

| Method | Speed | JS support | Use case |
| --- | --- | --- | --- |
| HTTP | Fastest | No | Static pages, APIs |
| Playwright | Medium | Yes | JavaScript-heavy sites, stealth mode |
| Puppeteer | Medium | Yes | Alternative browser fingerprint |
| Agent-browser | Slower | Yes | AI-optimized with accessibility tree |
| Browser-use | Slowest | Yes | LLM-controlled automation |
| Video | Varies | N/A | YouTube, Twitter/X, TikTok, 1000+ sites |

Fallback triggers (see the sketch after this list):

  • Blocked status codes (403, 429, 503)
  • Anti-bot detection patterns (Cloudflare, CAPTCHA)
  • Empty or minimal content (<500 chars)
  • JavaScript-heavy SPA markers
  • Poison pill detection
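
As a rough illustration of how such a loop can work, here is a minimal sketch; the fetcher interface and result attributes are assumptions for illustration, not Scrapefruit's actual API:

```python
# Minimal sketch of a cascade fallback loop. The fetcher objects and the
# attributes on their results (status_code, text, poison_pill) are assumed
# for illustration and may not match Scrapefruit's internals.
BLOCKED_STATUS_CODES = {403, 429, 503}
MIN_CONTENT_LENGTH = 500  # "empty or minimal content" threshold from above

def cascade_fetch(url, fetchers):
    """Try each fetcher in order, falling back when a result looks blocked."""
    last_result = None
    for fetcher in fetchers:
        result = fetcher.fetch(url)
        last_result = result
        if result.status_code in BLOCKED_STATUS_CODES:
            continue  # 403/429/503: anti-bot or rate limiting
        if len(result.text or "") < MIN_CONTENT_LENGTH:
            continue  # likely a JS-heavy SPA shell; try a real browser next
        if result.poison_pill:
            continue  # paywall, CAPTCHA, or dead-link pattern detected
        return result
    return last_result  # every method failed; surface the last attempt
```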

## Extraction methods

| Method | Description |
| --- | --- |
| CSS selectors | Standard CSS selector syntax |
| XPath | Full XPath expression support |
| Vision/OCR | Screenshot + Tesseract for anti-scraping bypasses |

When DOM extraction fails, the engine can automatically capture a screenshot and use OCR to extract text content.
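
For the vision path, a standalone sketch using Playwright and Tesseract looks roughly like this (it mirrors the idea described above, not the project's internal code):

```python
# Standalone OCR fallback sketch: screenshot the page with Playwright, then
# run Tesseract over the image. Requires `playwright install` plus the
# optional pytesseract/Tesseract dependencies listed below.
from playwright.sync_api import sync_playwright
from PIL import Image
import pytesseract

def ocr_fallback(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        page.screenshot(path="page.png", full_page=True)
        browser.close()
    return pytesseract.image_to_string(Image.open("page.png"))
```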

## Configuration

Key settings in `config.py`:

| Setting | Default | Description |
| --- | --- | --- |
| FLASK_PORT | 5150 | Server port |
| DEFAULT_TIMEOUT | 30000 ms | Request timeout |
| DEFAULT_RETRY_COUNT | 3 | Retry attempts |
| CASCADE_ENABLED | True | Enable cascade fallback |
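
Assuming these are plain module-level constants in `config.py` (an assumption based on the table above), they can be inspected from code:

```python
# Hypothetical usage, assuming config.py exposes these settings as
# module-level constants as the table above suggests.
import config

print(config.FLASK_PORT)           # 5150
print(config.DEFAULT_TIMEOUT)      # 30000 (ms)
print(config.DEFAULT_RETRY_COUNT)  # 3
print(config.CASCADE_ENABLED)      # True
```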

## Testing

The project includes a comprehensive test suite:

```bash
# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific test categories
pytest tests/unit/           # Unit tests
pytest tests/integration/    # Integration tests
pytest tests/stress/         # Stress tests
```

Test coverage (see the example sketch after this list):

  • 170+ tests across unit, integration, and stress testing
  • Poison pill detection (paywall, rate limiting, anti-bot, CAPTCHA, dead links)
  • Extractors (CSS, XPath, Vision/OCR)
  • Fetchers (HTTP, Playwright, Puppeteer, Agent-browser, Browser-use)
  • Edge cases (large content, malformed HTML, concurrency, unicode)
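
A new test can follow the same pattern; the import path and function name below are hypothetical placeholders, not the project's actual modules:

```python
# Hypothetical example of a unit test for poison pill detection. The module
# path and function name are placeholders and may differ from the real code.
import pytest

from core.scraping.poison_pills import detect_poison_pill  # assumed import

@pytest.mark.parametrize("html,expected", [
    ("<title>Attention Required! | Cloudflare</title>", True),   # anti-bot
    ("<h1>Subscribe to continue reading</h1>", True),            # paywall
    ("<article><p>Plain article body.</p></article>", False),    # clean page
])
def test_poison_pill_detection(html, expected):
    assert detect_poison_pill(html) is expected
```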

## Requirements

  • Python 3.11+
  • Chromium (installed via Playwright)
  • Tesseract OCR (optional, for vision extraction)

## Optional dependencies

| Package | Purpose | Install |
| --- | --- | --- |
| pyppeteer | Puppeteer browser automation | `pip install pyppeteer` |
| browser-use | AI-driven browser control | `pip install browser-use` |
| pytesseract | Vision/OCR extraction | `pip install pytesseract` + install Tesseract |
| agent-browser | Accessibility tree scraping | `npm install -g agent-browser` |
| yt-dlp | Video/audio extraction | `pip install yt-dlp` |
| faster-whisper | Audio transcription | `pip install faster-whisper` |
| ollama | Local LLM inference | ollama.ai + `ollama pull qwen2.5:0.5b` |

## Video transcription

Extract and transcribe videos from 1000+ platforms:

```python
from core.scraping.fetchers.video_fetcher import VideoFetcher

fetcher = VideoFetcher(whisper_model="tiny", use_2x_speed=True)
result = fetcher.fetch("https://youtube.com/watch?v=...")

print(result.transcript)       # Plain text
print(result.to_srt())         # SRT subtitles
print(result.metadata.title)   # Video metadata
```

Supported platforms: YouTube, Vimeo, Twitter/X, TikTok, Facebook, Instagram, Twitch, Dailymotion, and 1000+ more via yt-dlp.
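
Building on the example above, the transcript and subtitles can be written straight to disk:

```python
# Usage sketch building on the VideoFetcher example above; uses only the
# attributes shown there (transcript and to_srt()).
from pathlib import Path

Path("transcript.txt").write_text(result.transcript, encoding="utf-8")
Path("subtitles.srt").write_text(result.to_srt(), encoding="utf-8")
```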

## Local LLM integration

Free local inference via Ollama (no API costs):

```python
from core.llm import get_llm_service

llm = get_llm_service()
result = llm.summarize("Long article text...")
entities = llm.extract_entities("Text with names and dates...")
```

Setup:

```bash
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5:0.5b  # 400MB, good for low-memory systems
```

The service auto-detects Ollama and falls back to OpenAI/Anthropic if API keys are set.
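
For example, with Ollama stopped and a hosted key exported, the same calls should route to the fallback provider. The environment variable names below are the providers' conventional ones; exactly which variables Scrapefruit reads is an assumption here:

```python
# Hedged sketch of the fallback path: if Ollama is unavailable, the service
# is described as falling back to OpenAI/Anthropic when API keys are set.
# OPENAI_API_KEY / ANTHROPIC_API_KEY are the providers' standard variable
# names; how Scrapefruit picks them up is assumed.
import os

os.environ.setdefault("OPENAI_API_KEY", "sk-...")  # or ANTHROPIC_API_KEY

from core.llm import get_llm_service

llm = get_llm_service()
print(llm.summarize("Long article text..."))
```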

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

PolyForm Noncommercial