Just give Extracto a URL and describe what you want in plain English. No rigid CSS selectors. No brittle XPath. Pure structured data extraction powered by LLMs like GPT-4o and Mistral.
A production-ready scraper out of the box. Single pages or millions of URLs.
Define what you want in natural language. Our engine figures out the DOM structure and pulls the exact data points reliably.
Pass hundreds of URLs in a text file. Cache pages locally so you don't waste API credits re-rendering the same pages twice.
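To illustrate the caching idea, here is a minimal sketch of a disk cache keyed by URL hash. It is not Extracto's actual implementation: `cache_path`, `fetch_cached`, `read_url_list`, and the `render` callback are hypothetical names invented for this example, and the one-URL-per-line file format is an assumption.

```python
import hashlib
from pathlib import Path
from typing import Callable


def read_url_list(text: str) -> list[str]:
    """Parse one URL per line, skipping blank lines and '#' comments (assumed format)."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name using its SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def fetch_cached(url: str, cache_dir: Path, render: Callable[[str], str]) -> str:
    """Return cached HTML if present; otherwise render once and store the result."""
    path = cache_path(cache_dir, url)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = render(url)  # hypothetical callback, e.g. a browser render
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

With a cache like this, a second run over the same URL list reads from disk instead of re-rendering, so only new pages cost browser time and API credits.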
Natively parses non-standard navigation such as `onclick="window.open()"` handlers and `location.href` assignments, so links on legacy or custom-routed sites are still discovered.
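As a rough illustration of what this kind of link discovery involves, the sketch below pulls target URLs out of inline JavaScript with a regular expression. It is an assumption-laden example, not Extracto's parser: `js_links` and `_JS_NAV` are hypothetical names, and the pattern only handles plainly quoted string literals (no HTML-escaped quotes or computed URLs).

```python
import re
from urllib.parse import urljoin

# Matches window.open('...') and location / location.href = '...' with
# either quote style. The backreference \1 requires the same closing quote.
_JS_NAV = re.compile(
    r"""(?:window\.open\(\s*|location(?:\.href)?\s*=\s*)(['"])(?P<url>.+?)\1"""
)


def js_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs referenced by inline JS navigation, in document order."""
    return [urljoin(base_url, m.group("url")) for m in _JS_NAV.finditer(html)]
```

Relative targets are resolved against the page URL with `urljoin`, so the results can be queued like ordinary `href` links.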
Built-in round-robin proxy management to reduce rate-limiting and lower the risk of IP bans.
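Round-robin rotation simply hands out proxies in a fixed repeating order. A minimal sketch, assuming a plain list of proxy URLs (the `ProxyRotator` class is hypothetical and not part of Extracto's API):

```python
from itertools import cycle
from typing import Iterable


class ProxyRotator:
    """Hand out proxies in a repeating, fixed order (round-robin)."""

    def __init__(self, proxies: Iterable[str]) -> None:
        pool = list(proxies)
        if not pool:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(pool)

    def next_proxy(self) -> str:
        """Return the next proxy, wrapping around at the end of the list."""
        return next(self._pool)
```

Each outgoing request would call `next_proxy()` once, which spreads traffic evenly across the pool.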
Because things fail. Extracto saves progress continuously. If your server dies, it resumes exactly where it left off.
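One common way to get this behavior is an append-only checkpoint file. The sketch below is illustrative only and does not describe Extracto's on-disk format: the JSON Lines layout and the `load_done`, `mark_done`, and `pending` helpers are assumptions made for the example.

```python
import json
from pathlib import Path


def load_done(checkpoint: Path) -> set[str]:
    """Read finished URLs, ignoring a line truncated by a crash mid-write."""
    done: set[str] = set()
    if not checkpoint.exists():
        return done
    with checkpoint.open(encoding="utf-8") as fh:
        for line in fh:
            try:
                done.add(json.loads(line)["url"])
            except (ValueError, KeyError, TypeError):
                continue  # blank, partial, or malformed record: skip it
    return done


def mark_done(checkpoint: Path, url: str) -> None:
    """Append one record per finished URL so earlier progress is never rewritten."""
    with checkpoint.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"url": url}) + "\n")


def pending(urls: list[str], checkpoint: Path) -> list[str]:
    """Return the URLs that still need scraping, preserving input order."""
    done = load_done(checkpoint)
    return [u for u in urls if u not in done]
```

On restart, a job would call `pending(...)` and continue with whatever is left. A URL whose record was only partially written is treated as unfinished and retried.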
Point it at a domain and use the `--sitemap` flag. It will recursively find every single page and queue it up for scraping.
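For readers curious how recursive sitemap discovery can work, here is a small sketch based on the public sitemaps.org XML format. It is not Extracto's code: `collect_urls` is a hypothetical helper, and `fetch` stands in for whatever downloads a sitemap and returns its XML as text.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[set] = None,
) -> list[str]:
    """Follow <sitemapindex> entries recursively and return all page URLs."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        urls: list[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, seen))
        return urls
    # Otherwise treat the document as a <urlset> of page entries.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
```

The `seen` set keeps the recursion from looping when index files point back at one another.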
Automatic robots.txt compliance. Scrapes politely with configurable rate-limiting between network requests.
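The same kind of check can be reproduced with Python's standard-library `urllib.robotparser`. The sketch below is only an illustration of the concept, not Extracto's internals: `build_parser`, `delay_for`, and the `extracto-example` user agent are made-up names for this example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def delay_for(parser: RobotFileParser, user_agent: str, default_delay: float) -> float:
    """Use the site's Crawl-delay when it is stricter than our own default."""
    declared = parser.crawl_delay(user_agent)
    if declared is None:
        return default_delay
    return max(default_delay, float(declared))


robots = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
parser = build_parser(robots)
allowed = parser.can_fetch("extracto-example", "https://example.com/public")
blocked = parser.can_fetch("extracto-example", "https://example.com/private/x")
```

A crawler would skip any URL where `can_fetch` returns `False` and sleep for `delay_for(...)` seconds between requests to the same host.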
Import Extracto directly into your own asyncio applications.
import asyncio

from extracto import CrawlerConfig, CrawlerEngine


async def main():
    # 1. Define your crawl job
    config = CrawlerConfig(
        start_url="https://news.ycombinator.com/",
        user_prompt="Extract top 5 post titles and their links.",
        llm_provider="mistral",
        output_format="json",
        max_depth=0,
    )

    # 2. Initialize the engine
    engine = CrawlerEngine(config)

    # 3. Run it and get the results directly
    print("Crawling...")
    results = await engine.run()

    # 4. Process the data
    for page in results:
        print(f"Scraped {page['source_url']}:")
        print(page["data"])


if __name__ == "__main__":
    asyncio.run(main())
Bring your own API key, or run locally.
# Python >= 3.9 required
pip install extracto-scraper==2.0.5

# Install browser binaries
playwright install chromium
cp .env.example .env
# Edit .env and paste your API key
# Or just use Ollama locally
# Run interactive setup wizard
extracto

# Or run via pure CLI flags
extracto "example.com" "Get data"
Completely modular. Swap out the browser engine. Swap out the LLM.