Scraper API Reference

Base URL, authentication, output formats, and endpoint overview for the Geonode Scraper API.

Use the Scraper API reference when you already know what you want to call and need the exact endpoint, request fields, response shape, and error behavior. If you're using the API for the first time, start with the Quick Start Guide.

Base URL

Use the current Scraper API base URL:

https://scraper.geonode.io

The examples in this reference use an environment variable for the base URL:

export SCRAPER_API_BASE_URL="https://scraper.geonode.io"

Authentication

Send your Scraper API key in the X-Api-Key header.

X-Api-Key: YOUR_API_KEY

When you run the examples from a terminal, store the API key in an environment variable first:

export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"

Keep your API key private. Do not expose it in frontend code, public repositories, logs, screenshots, or support messages.
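As an illustration of the header and environment variables described above, here is a minimal Python sketch that builds (but does not send) an authenticated request. The helper name `build_request` is hypothetical, and the fallback values are placeholders; only the base URL, the `X-Api-Key` header, and the two environment variable names come from this reference.

```python
import os
import urllib.request

# Read the same environment variables used elsewhere in this reference.
# The fallbacks are placeholders so the sketch runs without any setup.
BASE_URL = os.environ.get("SCRAPER_API_BASE_URL", "https://scraper.geonode.io")
API_KEY = os.environ.get("GEONODE_SCRAPER_API_KEY", "YOUR_API_KEY")


def build_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Return an unsent request for `path` with the X-Api-Key header attached.

    Hypothetical helper for illustration; not part of any official client.
    """
    url = BASE_URL.rstrip("/") + path
    # urllib stores the header name as "X-api-key"; HTTP header names are
    # case-insensitive, so the server sees the same header.
    return urllib.request.Request(url, method=method, headers={"X-Api-Key": API_KEY})


req = build_request("/v1/webhooks")
# Print only the method and URL, never the key itself.
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Reading the key from the environment rather than hard-coding it keeps it out of source files and repositories, in line with the guidance above.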

Endpoints

The current production API exposes endpoints for single-page extraction, URL discovery, multi-URL batch jobs, site crawls, usage statistics, webhooks, and service health checks.

Method | Endpoint | Description
GET | /health | Check whether the Scraper API service is healthy.
POST | /v1/extract | Extract Markdown and/or HTML from a single webpage.
GET | /v1/extract/{job_id} | Poll one async extraction job and retrieve the result when it is ready.
GET | /v1/extract/jobs | List and filter previous extraction jobs.
POST | /v1/batch | Start an asynchronous extraction job for multiple URLs.
GET | /v1/batch/{job_id} | Poll a batch job and retrieve paginated item results.
DELETE | /v1/batch/{job_id} | Cancel a batch job that is still running.
POST | /v1/map | Discover URLs from a base URL using sitemap and HTML link discovery.
GET | /v1/statistics | Retrieve aggregated extraction statistics for a date range.
POST | /v1/crawl | Start a site crawl from one seed URL.
GET | /v1/crawl/{job_id} | Poll a crawl job and retrieve paginated page results.
DELETE | /v1/crawl/{job_id} | Cancel a crawl job that is still running.
POST | /v1/webhooks | Register a webhook subscription.
GET | /v1/webhooks | List registered webhooks.
GET | /v1/webhooks/{webhook_id} | Retrieve one webhook subscription.
PATCH | /v1/webhooks/{webhook_id} | Update a webhook subscription.
DELETE | /v1/webhooks/{webhook_id} | Delete a webhook subscription.
POST | /v1/webhooks/{webhook_id}/rotate-secret | Rotate a webhook signing secret.
GET | /v1/webhooks/{webhook_id}/deliveries | List delivery attempts for a webhook.
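Several paths in the endpoint table contain placeholders such as `{job_id}` and `{webhook_id}`. The sketch below shows one way to fill them in safely. The dictionary keys (`poll_extract` and so on), the `resolve` helper, and the ID `job_123` are made up for illustration; only the methods and path templates are copied from the table.

```python
from urllib.parse import quote

BASE_URL = "https://scraper.geonode.io"

# A few (method, path template) pairs copied from the endpoint table.
# The dictionary keys are hypothetical labels, not API names.
ENDPOINTS = {
    "poll_extract": ("GET", "/v1/extract/{job_id}"),
    "cancel_batch": ("DELETE", "/v1/batch/{job_id}"),
    "poll_crawl": ("GET", "/v1/crawl/{job_id}"),
    "webhook_deliveries": ("GET", "/v1/webhooks/{webhook_id}/deliveries"),
}


def resolve(name: str, **params: str) -> tuple[str, str]:
    """Return (method, full URL) with the path placeholders filled in."""
    method, template = ENDPOINTS[name]
    # Percent-encode each value so it cannot introduce extra path segments.
    safe = {key: quote(value, safe="") for key, value in params.items()}
    return method, BASE_URL + template.format(**safe)


# "job_123" is a made-up ID; real IDs come from the API responses.
print(resolve("poll_extract", job_id="job_123"))
```

Keeping the templates in one place means a placeholder is always substituted and encoded the same way, whichever endpoint is called.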

Output Formats

The API response envelope is always JSON. The extracted page content can be returned as Markdown, HTML, or both.

{
  "formats": ["markdown"]
}

{
  "formats": ["html"]
}

{
  "formats": ["markdown", "html"]
}

Use Markdown when you need readable page content for LLM workflows, indexing, review, or text processing. Use HTML when you need structure closer to the original page.

There is no structured JSON content extraction format in the current public POST /v1/extract schema.
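Because only Markdown and HTML are available, a client can reject other format names before sending anything. The following is a minimal sketch of that check. The `formats` field comes from the examples above; the `url` field name and the `extract_body` helper are assumptions made for illustration, so confirm the real request fields on the POST /v1/extract endpoint page.

```python
import json

# The two content formats described in this section.
ALLOWED_FORMATS = ("markdown", "html")


def extract_body(url: str, formats: list[str]) -> str:
    """Serialize a request body after checking every entry in `formats`."""
    unknown = [name for name in formats if name not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported format(s): {unknown}")
    # "url" is an assumed field name; "formats" is taken from this section.
    return json.dumps({"url": url, "formats": formats})


print(extract_body("https://example.com", ["markdown", "html"]))
```

Passing a value such as `"json"` raises `ValueError` locally instead of producing a request the service would not accept.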

Common Options

Most requests start with a URL and an output format. Add options only when the target page needs them.

Option | Use it when
render_js | The returned content looks like a shell, loading state, or navigation-only document.
proxy | You need a specific country or proxy type.
processing_mode | You want to use async mode for slow pages.
headers | You need to send additional target-request headers.
extract_links | You want links found on the extracted page included in the response.
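The "add options only when needed" approach can be sketched as a small helper that starts from a minimal payload and layers on options one at a time. The option names come from the table above. Everything else is an assumption for illustration: the `with_options` helper, the `url` field name, the idea that options sit at the top level of the request body, and the boolean value used for `render_js`. Check the endpoint pages for the actual value shapes.

```python
# Option names copied from the Common Options table.
KNOWN_OPTIONS = {"render_js", "proxy", "processing_mode", "headers", "extract_links"}


def with_options(base: dict, **options) -> dict:
    """Return a copy of `base` plus any options that were explicitly set."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    # Skip None so the payload only carries options the caller chose.
    chosen = {name: value for name, value in options.items() if value is not None}
    return {**base, **chosen}


# Start with just a URL and a format ("url" is an assumed field name).
base = {"url": "https://example.com", "formats": ["markdown"]}

# If the result looks like an empty shell, retry with rendering enabled.
# A boolean value for render_js is an assumption.
retry = with_options(base, render_js=True)
print(retry)
```

The base payload is left unchanged, so the first attempt and the retry can be compared side by side.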

Next Steps

Open the endpoint page that matches what you want to do:
