Open Source Python Scraper

The Best AI Web Scraper
for Developers

Just give Extracto a URL and describe what you want in plain English. No rigid CSS selectors. No brittle XPath. Pure structured data extraction powered by LLMs like GPT-4o and Mistral.


Everything you need.
Pre-configured.

A production-ready scraper out of the box. Single pages or millions of URLs.

AI-Powered Extraction

Define what you want in natural language. Our engine figures out the DOM structure and pulls the exact data points reliably.

Batch Mode & Cache

Pass hundreds of URLs in a text file. Cache rendered pages locally so you don't waste API credits re-processing the same pages.
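The caching idea can be sketched in a few lines: key the cache by a hash of the URL and only call the (expensive) fetch when there is no hit. This is an illustrative sketch, not Extracto's actual cache implementation; `CACHE_DIR`, `cache_path`, and `get_or_fetch` are hypothetical names.

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path(".extracto_cache")  # illustrative location

def cache_path(url: str) -> Path:
    # Hash the URL so any characters are safe as a filename
    return CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")

def get_or_fetch(url: str, fetch) -> str:
    CACHE_DIR.mkdir(exist_ok=True)
    path = cache_path(url)
    if path.exists():          # cache hit: no render, no API credits spent
        return path.read_text()
    html = fetch(url)          # cache miss: render the page once
    path.write_text(html)
    return html
```

On a second run over the same URL list, every hit is served from disk and `fetch` is never invoked.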

Dynamic Link Extraction

Natively parses non-standard navigation such as `onclick="window.open()"` handlers and `location.href` assignments, so links on legacy or custom-routed sites aren't missed.
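The core of this technique is scanning markup for JavaScript navigation targets that never appear in `href` attributes. Below is a minimal regex sketch of the idea; the pattern and function name are assumptions for illustration, not Extracto's actual rules.

```python
import re

# Match URLs passed to window.open(...) or assigned to location.href
JS_NAV = re.compile(r"""(?:window\.open|location\.href\s*=)\s*\(?['"]([^'"]+)['"]""")

def extract_js_links(html: str) -> list[str]:
    """Return navigation targets hidden inside inline JS handlers."""
    return JS_NAV.findall(html)
```

A real implementation would also resolve relative paths against the page URL and deduplicate before queueing.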

Proxy Rotation

Built-in round-robin proxy management out of the box, spreading requests across IPs to avoid rate-limiting and reduce the risk of IP bans.
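Round-robin rotation simply cycles through a proxy pool so consecutive requests leave from different IPs. A minimal sketch, assuming a user-supplied proxy list (the `ProxyRotator` name is illustrative):

```python
from itertools import cycle

class ProxyRotator:
    """Hand out proxies in strict round-robin order."""

    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)  # endless iterator over the pool

    def next_proxy(self) -> str:
        return next(self._pool)
```

Each page fetch would then ask the rotator for the next proxy before opening a browser context.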

Resume Checkpoints

Because things fail. Extracto saves progress continuously. If your server dies, it resumes exactly where it left off.
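Checkpointing amounts to persisting the set of completed URLs and filtering the queue against it on startup. A sketch of that pattern, with an assumed `checkpoint.json` filename (Extracto's actual checkpoint format may differ):

```python
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")  # illustrative filename

def save_progress(done: set[str]) -> None:
    # Persist completed URLs after each page so a crash loses at most one
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def remaining(urls: list[str]) -> list[str]:
    # On restart, skip everything already recorded as done
    done = set(json.loads(CHECKPOINT.read_text())) if CHECKPOINT.exists() else set()
    return [u for u in urls if u not in done]
```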

Sitemap Auto-Discovery

Point it at a domain and use the --sitemap flag. It will recursively find every single page and queue it up for scraping.
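Sitemaps are plain XML with `<loc>` entries, so discovery boils down to fetching `sitemap.xml` and collecting every `<loc>`. A minimal parsing sketch using the stdlib (fetching and recursion into nested sitemap indexes are left out; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text: str) -> list[str]:
    """Collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(f"{NS}loc")]
```

A full crawler would also detect `<sitemapindex>` documents and recurse into each child sitemap.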

Ethical by Default

Automatic robots.txt compliance. Scrapes politely with configurable rate-limiting between network requests.
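robots.txt compliance can be checked with Python's standard library alone. A sketch of the check, assuming a default `extracto` user agent (Extracto's internals may differ):

```python
from urllib import robotparser

def allowed(robots_txt: str, url: str, agent: str = "extracto") -> bool:
    """Return True if robots.txt permits this agent to fetch the URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```

The crawler would fetch each domain's robots.txt once, cache the parsed rules, and skip any URL the parser disallows.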

Experience it natively

Import Extracto directly into your own asyncio applications.

example.py
import asyncio
from extracto import CrawlerConfig, CrawlerEngine

async def main():
    # 1. Define your crawl job
    config = CrawlerConfig(
        start_url="https://news.ycombinator.com/",
        user_prompt="Extract top 5 post titles and their links.",
        llm_provider="mistral",
        output_format="json",
        max_depth=0
    )

    # 2. Initialize the engine
    engine = CrawlerEngine(config)

    # 3. Run it and get the results directly
    print("Crawling...")
    results = await engine.run()
    
    # 4. Process the data
    for page in results:
        print(f"Scraped {page['source_url']}:")
        print(page["data"])

if __name__ == "__main__":
    asyncio.run(main())

LLM Agnostic

Bring your own API key, or run locally.

OpenAI
Mistral
Groq
Gemini
Ollama

Deploy in Seconds

Step 01

Install Package

# Python >= 3.9 required
pip install extracto-scraper==2.0.5
# Install browser binaries
playwright install chromium
Step 02

Configure LLM

cp .env.example .env
# Edit .env and paste your API key
# Or just use Ollama locally
Step 03

Ignition

# Run interactive setup wizard
extracto

# Or run via pure CLI flags
extracto "example.com" "Get data"

Beautifully Engineered

Completely modular. Swap out the browser engine. Swap out the LLM.

main.py (CLI entrypoint + wizard initialization)
├─ crawler_engine.py      # Core execution loop, batch processing, checkpoints
│  ├─ browser_engine      # Stealth Playwright + proxy rotation + screenshots
│  ├─ ai_extractor        # Langchain / ScrapeGraphAI integration + providers
│  ├─ robots_check        # robots.txt fetcher & domain compliance
│  ├─ page_cache          # Local request caching system
│  └─ sitemap             # Automatic XML link discovery
├─ data_exporter.py       # Pipeline to JSON / CSV / XML / SQLite / Markdown
├─ server.py              # FastAPI REST API (run with python main.py serve)
└─ webhooks.py            # Completion streaming to Slack & Discord