Scraper API Reference
Base URL, authentication, output formats, and endpoint overview for the Geonode Scraper API.
Use the Scraper API reference when you already know what you want to call and need the exact endpoint, request fields, response shape, and error behavior. If you're using the API for the first time, start with the Quick Start Guide.
Base URL
Use the current Scraper API base URL:
https://scraper.geonode.io
The examples in this reference use an environment variable for the base URL:
export SCRAPER_API_BASE_URL="https://scraper.geonode.io"
Authentication
Send your Scraper API key in the X-Api-Key header.
X-Api-Key: YOUR_API_KEY
When you run the examples from a terminal, store the API key in an environment variable first:
export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"
Keep your API key private. Do not expose it in frontend code, public repositories, logs, screenshots, or support messages.
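As a minimal sketch of the setup above, the following Python snippet reads both environment variables and attaches the key as the X-Api-Key header. The helper name `build_request` is hypothetical and not part of any Geonode SDK. The request is only constructed, not sent, so it can be inspected without touching the network.

```python
import os
import urllib.request


def build_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build (but do not send) a request that carries the X-Api-Key header.

    build_request is an illustrative helper, not part of any Geonode SDK.
    """
    # Fall back to the documented base URL when the variable is not set.
    base_url = os.environ.get("SCRAPER_API_BASE_URL", "https://scraper.geonode.io")
    # Raises KeyError if the key is missing, so a misconfigured shell fails early.
    api_key = os.environ["GEONODE_SCRAPER_API_KEY"]
    request = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    request.add_header("X-Api-Key", api_key)
    return request
```

To send the request, pass the result to `urllib.request.urlopen`, for example with `build_request("/health")`. Reading the key from the environment keeps it out of source files, in line with the guidance above.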
Endpoints
The current production API exposes endpoints for single-page extraction, URL discovery, multi-URL batch jobs, site crawls, usage statistics, webhooks, and service health checks.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Check whether the Scraper API service is healthy. |
| POST | /v1/extract | Extract Markdown and/or HTML from a single webpage. |
| GET | /v1/extract/{job_id} | Poll one async extraction job and retrieve the result when it is ready. |
| GET | /v1/extract/jobs | List and filter previous extraction jobs. |
| POST | /v1/batch | Start an asynchronous extraction job for multiple URLs. |
| GET | /v1/batch/{job_id} | Poll a batch job and retrieve paginated item results. |
| DELETE | /v1/batch/{job_id} | Cancel a batch job that is still running. |
| POST | /v1/map | Discover URLs from a base URL using sitemap and HTML link discovery. |
| GET | /v1/statistics | Retrieve aggregated extraction statistics for a date range. |
| POST | /v1/crawl | Start a site crawl from one seed URL. |
| GET | /v1/crawl/{job_id} | Poll a crawl job and retrieve paginated page results. |
| DELETE | /v1/crawl/{job_id} | Cancel a crawl job that is still running. |
| POST | /v1/webhooks | Register a webhook subscription. |
| GET | /v1/webhooks | List registered webhooks. |
| GET | /v1/webhooks/{webhook_id} | Retrieve one webhook subscription. |
| PATCH | /v1/webhooks/{webhook_id} | Update a webhook subscription. |
| DELETE | /v1/webhooks/{webhook_id} | Delete a webhook subscription. |
| POST | /v1/webhooks/{webhook_id}/rotate-secret | Rotate a webhook signing secret. |
| GET | /v1/webhooks/{webhook_id}/deliveries | List delivery attempts for a webhook. |
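Several endpoints in the table take a path parameter such as {job_id} or {webhook_id}. The sketch below shows one way to fill those templates safely in Python. The helper name `endpoint_url` and the sample ID in the usage note are illustrative assumptions, not values defined by the API.

```python
from urllib.parse import quote

BASE_URL = "https://scraper.geonode.io"


def endpoint_url(template: str, **params: str) -> str:
    """Fill a path template from the endpoint table, e.g. /v1/batch/{job_id}.

    endpoint_url is a hypothetical helper, not part of any Geonode SDK.
    """
    # Percent-encode each value so an ID cannot change the path structure.
    encoded = {name: quote(value, safe="") for name, value in params.items()}
    # str.format raises KeyError if the template needs a parameter that was not given.
    return BASE_URL + template.format(**encoded)
```

For example, `endpoint_url("/v1/crawl/{job_id}", job_id="abc123")` yields the URL used both to poll a crawl (GET) and to cancel it (DELETE). The HTTP method decides which operation runs.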
Output Formats
The API response envelope is always JSON. The extracted page content can be returned as Markdown, HTML, or both.
{
  "formats": ["markdown"]
}
{
  "formats": ["html"]
}
{
  "formats": ["markdown", "html"]
}
Use Markdown when you need readable page content for LLM workflows, indexing, review, or text processing. Use HTML when you need structure closer to the original page.
There is no structured JSON content extraction format in the current public POST /v1/extract schema.
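The following Python sketch builds a POST /v1/extract request body and rejects any format other than the two described above. The `url` field name and the helper name `extract_body` are assumptions made for illustration. Check the endpoint page for the exact schema.

```python
import json

# The two content formats described in this section.
ALLOWED_FORMATS = ("markdown", "html")


def extract_body(url: str, formats: list) -> str:
    """Serialize a request body for POST /v1/extract.

    Assumption: the target page is sent in a field named "url".
    """
    unknown = [name for name in formats if name not in ALLOWED_FORMATS]
    if unknown:
        # Catch values such as "json" locally instead of sending them to the API.
        raise ValueError(f"unsupported formats: {unknown}")
    return json.dumps({"url": url, "formats": formats})
```

Validating on the client side is a design choice rather than a requirement. It surfaces a typo in the format name before a request is made.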
Common Options
Most requests start with a URL and an output format. Add options only when the target page needs them.
| Option | Use it when |
|---|---|
| render_js | The returned content looks like a shell, loading state, or navigation-only document. |
| proxy | You need a specific country or proxy type. |
| processing_mode | You want to use async mode for slow pages. |
| headers | You need to send additional target-request headers. |
| extract_links | You want links found on the extracted page included in the response. |
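One way to follow the "add options only when needed" advice is to start from a minimal body and merge in options explicitly. The Python sketch below accepts only the option names listed in the table. The helper name `with_options`, the placement of options at the top level of the body, and the value shapes in the usage note are all assumptions, because this page does not specify them.

```python
# Option names taken from the table above.
KNOWN_OPTIONS = {"render_js", "proxy", "processing_mode", "headers", "extract_links"}


def with_options(body: dict, **options) -> dict:
    """Return a copy of body with the given options added.

    Assumptions: options sit at the top level of the request body, and
    values are forwarded unchanged because their shapes are not given here.
    """
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    # Skip options left as None so the request stays minimal.
    chosen = {name: value for name, value in options.items() if value is not None}
    return {**body, **chosen}
```

For instance, `with_options({"url": "https://example.com", "formats": ["markdown"]}, render_js=True)` adds a single option, treating render_js as a boolean for the sake of the example.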
Next Steps
Open the endpoint page that matches what you want to do: