Open Source Python Scraper

The Best AI Web Scraper
for Developers

Just give Extracto a URL and describe what you want in plain English. No rigid CSS selectors. No brittle XPath. Pure structured data extraction powered by LLMs like GPT-4o and Mistral.


Everything you need.
Pre-configured.

A production-ready scraper out of the box. Single pages or millions of URLs.

AI-Powered Extraction

Define what you want in natural language. Our engine figures out the DOM structure and pulls the exact data points reliably.

Batch Mode & Cache

Pass hundreds of URLs in a text file. Cache rendered pages locally so you don't waste API credits re-processing the same pages.
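The caching idea can be sketched in a few lines: key the cache by a hash of the URL and only call the (expensive) fetch when there is no hit. This is an illustrative sketch, not Extracto's actual cache implementation; `CACHE_DIR`, `cache_path`, and `get_or_fetch` are hypothetical names.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".extracto_cache")  # illustrative location

def cache_path(url: str) -> Path:
    # Hash the URL so any characters are safe as a filename
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def get_or_fetch(url: str, fetch) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    path = cache_path(url)
    if path.exists():          # cache hit: no render, no API credits spent
        return path.read_text()
    html = fetch(url)          # cache miss: render the page once
    path.write_text(html)
    return html
```

On a second run over the same URL list, every hit is served from disk and `fetch` is never invoked.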

Dynamic Link Extraction

Natively parses non-standard navigation such as `onclick="window.open()"` handlers and `location.href` assignments, so links on legacy or custom-routed sites aren't missed.
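The core of this technique is scanning markup for JavaScript navigation targets that never appear in `href` attributes. Below is a minimal regex sketch of the idea; the pattern and function name are assumptions for illustration, not Extracto's actual rules.

```python
import re

# Match URLs passed to window.open(...) or assigned to location.href
JS_NAV = re.compile(r"""(?:window\.open|location\.href\s*=)\s*\(?['"]([^'"]+)['"]""")

def extract_js_links(html: str) -> list[str]:
    """Return navigation targets hidden inside inline JS handlers."""
    return JS_NAV.findall(html)
```

A real implementation would also resolve relative paths against the page URL and deduplicate before queueing.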

Proxy Rotation

Built-in round-robin proxy management out of the box, spreading requests across IPs to avoid rate-limiting and reduce the risk of IP bans.
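Round-robin rotation simply cycles through a proxy pool so consecutive requests leave from different IPs. A minimal sketch, assuming a user-supplied proxy list (the `ProxyRotator` name is illustrative):

```python
from itertools import cycle

class ProxyRotator:
    """Hand out proxies in strict round-robin order."""

    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)  # endless iterator over the pool

    def next_proxy(self) -> str:
        return next(self._pool)
```

Each page fetch would then ask the rotator for the next proxy before opening a browser context.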

Resume Checkpoints

Because things fail. Extracto saves progress continuously. If your server dies, it resumes exactly where it left off.
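Checkpointing amounts to persisting the set of completed URLs and filtering the queue against it on startup. A sketch of that pattern, with an assumed `checkpoint.json` filename (Extracto's actual checkpoint format may differ):

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # illustrative filename

def save_progress(done: set[str]) -> None:
    # Persist completed URLs after each page so a crash loses at most one
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def remaining(urls: list[str]) -> list[str]:
    # On restart, skip everything already recorded as done
    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
    return [u for u in urls if u not in done]
```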

Sitemap Auto-Discovery

Point it at a domain and use the --sitemap flag. It will recursively find every single page and queue it up for scraping.
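Sitemaps are plain XML with `<loc>` entries, so discovery boils down to fetching `sitemap.xml` and collecting every `<loc>`. A minimal parsing sketch using the stdlib (fetching and recursion into nested sitemap indexes are left out; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{NS}loc")]
```

A full crawler would also detect `<sitemapindex>` documents and recurse into each child sitemap.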

Ethical by Default

Automatic robots.txt compliance. Scrapes politely with configurable rate-limiting between network requests.
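robots.txt compliance can be checked with Python's standard library alone. A sketch of the check, assuming a default `extracto` user agent (Extracto's internals may differ):

```python
from urllib import robotparser

def allowed(robots_txt: str, url: str, agent: str = "extracto") -> bool:
    """Return True if robots.txt permits this agent to fetch the URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

The crawler would fetch each domain's robots.txt once, cache the parsed rules, and skip any URL the parser disallows.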

Experience it natively

Import Extracto directly into your own asyncio applications.

example.py
import asyncio
from extracto import CrawlerConfig, CrawlerEngine

async def main():
    # 1. Define your crawl job
    config = CrawlerConfig(
        start_url="https://news.ycombinator.com/",
        user_prompt="Extract top 5 post titles and their links.",
        llm_provider="mistral",
        output_format="json",
        max_depth=0
    )

    # 2. Initialize the engine
    engine = CrawlerEngine(config)

    # 3. Run it and get the results directly
    print("Crawling...")
    results = await engine.run()
    
    # 4. Process the data
    for page in results:
        print(f"Scraped {page['source_url']}:")
        print(page["data"])

if __name__ == "__main__":
    asyncio.run(main())

LLM Agnostic

Bring your own API key, or run locally.

OpenAI
Mistral
Groq
Gemini
Ollama

Deploy in Seconds

Step 01

Install Package

# Python >= 3.9 required
pip install extracto-scraper==2.0.5
# Install browser binaries
playwright install chromium
Step 02

Configure LLM

cp .env.example .env
# Edit .env and paste your API key
# Or just use Ollama locally
Step 03

Ignition

# Run interactive setup wizard
extracto

# Or run via pure CLI flags
extracto "example.com" "Get data"

Beautifully Engineered

Completely modular. Swap out the browser engine. Swap out the LLM.

main.py (CLI entrypoint + wizard initialization)
├─ crawler_engine.py      # Core execution loop, batch processing, checkpoints
│  ├─ browser_engine      # Stealth Playwright + proxy rotation + screenshots
│  ├─ ai_extractor        # Langchain / ScrapeGraphAI integration + providers
│  ├─ robots_check        # robots.txt fetcher & domain compliance
│  ├─ page_cache          # Local request caching system
│  └─ sitemap             # Automatic XML link discovery
├─ data_exporter.py       # Pipeline to JSON / CSV / XML / SQLite / Markdown
├─ server.py              # FastAPI REST API (run with python main.py serve)
└─ webhooks.py            # Completion streaming to Slack & Discord