Extract Content

POST /v1/extract extracts content from a single webpage URL. Use this endpoint when you already know the page you want to scrape and you want the API to return the extracted content as Markdown, HTML, or both.

Request

The example below is a shell command. Run it from a terminal on macOS or Linux, or on Windows in a POSIX-style shell such as Git Bash or WSL, with curl installed. The export syntax does not work in cmd.exe or PowerShell. The two export lines set environment variables so you do not have to paste the base URL and API key into every command. Replace YOUR_API_KEY with your actual Scraper API key before running it.

This request extracts the Hacker News front page as Markdown, does not use JavaScript rendering, waits for the result in the same HTTP response, and routes the request through a US residential proxy.

export SCRAPER_API_BASE_URL="https://scraper.geonode.io"
export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"

curl -X POST "$SCRAPER_API_BASE_URL/v1/extract" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/",
    "formats": ["markdown"],
    "render_js": false,
    "processing_mode": "sync",
    "proxy": {
      "country": "US",
      "type": "residential"
    }
  }'

If the request succeeds, the API returns a JSON response with the extracted content inside data.markdown. If you want HTML instead, change formats to ["html"]. If the page needs browser rendering, set render_js to true.
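
For integrations that do not shell out to curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the function names build_extract_request and extract, and the 60-second timeout, are assumptions made for illustration. The endpoint, headers, and body fields are the ones shown in the curl example.

```python
import json
import os
import urllib.request

# Same environment variables as the curl example; the fallbacks are placeholders.
BASE_URL = os.environ.get("SCRAPER_API_BASE_URL", "https://scraper.geonode.io")
API_KEY = os.environ.get("GEONODE_SCRAPER_API_KEY", "YOUR_API_KEY")


def build_extract_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/extract request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


def extract(payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response.

    The timeout value is an illustrative assumption, not an API requirement.
    """
    request = build_extract_request(payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = {
    "url": "https://news.ycombinator.com/",
    "formats": ["markdown"],
    "render_js": False,
    "processing_mode": "sync",
    "proxy": {"country": "US", "type": "residential"},
}

# To run the extraction for real (needs a valid API key and network access):
# result = extract(payload)
# print(result["data"]["markdown"][:500])
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without spending a request.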

Request Body

The extraction request body tells the API which page to fetch, which output format to return, and whether the extraction should use browser rendering or specific proxy settings. Only url is required. If you're making your first request, start with formats: ["markdown"], render_js: false, and processing_mode: "sync". You can add JavaScript rendering, proxy overrides, custom headers, or link extraction once you know what the target page needs.

The defaults are meant to keep the first request small. If you omit formats, the API defaults to HTML output. If you omit proxy, the API applies its default residential proxy routing. If you omit processing_mode, the request runs synchronously and returns the result in the same response.

Field	Type	Required	Default	Description
`url`	string	Yes	None	The webpage URL to extract. Must be a valid URI with a hostname.
`formats`	array	No	`["html"]`	Output formats to return. Supported values are `markdown` and `html`.
`render_js`	boolean	No	`false`	Uses a headless browser before extraction. Useful for JavaScript-heavy pages, but usually slower.
`processing_mode`	string	No	`sync`	Use `sync` to wait for the result in the same request, or `async` to create a background job.
`proxy`	object or null	No	Residential proxy defaults	Controls proxy country and type. If omitted or `null`, default proxy routing is applied.
`headers`	object or null	No	`null`	Custom HTTP headers to include in the extraction request.
`extract_links`	boolean	No	`false`	When `true`, returns discovered page links in `data.links`.

Most integrations only need url, formats, and render_js at first. Add proxy when you need a specific country or proxy type, and add processing_mode: "async" when the target page may take too long for a blocking request.
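
The field table can be mirrored in a small client-side helper that fills in the documented defaults and rejects values the table does not list. This is a sketch under stated assumptions: build_extract_body is a hypothetical name, and the checks are a local convenience, not a description of how the API validates requests. proxy and headers are left out so the API applies its own defaults.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = {"markdown", "html"}
ALLOWED_MODES = {"sync", "async"}


def build_extract_body(url, formats=None, render_js=False,
                       processing_mode="sync", extract_links=False):
    """Return a JSON-serializable body for POST /v1/extract.

    proxy and headers are omitted on purpose so the API uses its defaults.
    """
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.hostname:
        raise ValueError(f"url must be a valid URI with a hostname: {url!r}")

    # Documented default: HTML output when formats is omitted.
    formats = ["html"] if formats is None else list(formats)
    unknown = sorted(set(formats) - ALLOWED_FORMATS)
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")

    if processing_mode not in ALLOWED_MODES:
        raise ValueError(f"processing_mode must be one of {sorted(ALLOWED_MODES)}")

    return {
        "url": url,
        "formats": formats,
        "render_js": bool(render_js),
        "processing_mode": processing_mode,
        "extract_links": bool(extract_links),
    }
```

Calling build_extract_body("https://news.ycombinator.com/") with no other arguments produces the same settings the API would apply to a request that contains only url.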

JavaScript Rendering

Set render_js to true when the target page builds its content in the browser with client-side JavaScript.

{
  "url": "https://quotes.toscrape.com/js/",
  "formats": ["markdown"],
  "render_js": true,
  "processing_mode": "sync"
}

Use render_js: false for normal static pages. It is faster and is usually enough for documentation pages, articles, simple catalogs, and basic HTML sites.

Use render_js: true when the response looks like a page shell, loading skeleton, or navigation-only document. JavaScript rendering is included in the same request model and does not use extra requests.
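
One way to apply this guidance in code is to try the faster static extraction first and retry with rendering only when the output looks like a shell. The sketch below is a client-side heuristic, not API behavior: looks_like_page_shell, extract_with_js_fallback, and the 200-character threshold are all assumptions chosen for illustration. It expects markdown to be one of the requested formats, and each attempt is a separate API call.

```python
def looks_like_page_shell(markdown, min_chars=200):
    """Heuristic (an assumption, not an API rule): very little text
    suggests the page was not rendered."""
    return markdown is None or len(markdown.strip()) < min_chars


def extract_with_js_fallback(extract, body):
    """Try without rendering first; retry once with render_js=True.

    `extract` is any callable that takes a request body dict and returns
    the decoded JSON response, for example a thin wrapper around
    POST /v1/extract.
    """
    first = extract({**body, "render_js": False})
    if not looks_like_page_shell(first["data"].get("markdown")):
        return first
    # The static result looked empty, so make a second call with a browser.
    return extract({**body, "render_js": True})
```

Passing the sending function in as an argument keeps the fallback logic independent of any particular HTTP client and easy to exercise with a stub.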

Proxy and Geo-Targeting

The Scraper API can route requests through Geonode proxies. If you omit proxy, the API applies default residential proxy routing and tries to infer a useful country from the target URL. If no country can be inferred, the upstream proxy provider picks the exit region.

To control location, pass a proxy object:

{
  "proxy": {
    "country": "US",
    "type": "residential"
  }
}

Field	Type	Description
`proxy.country`	string or null	Two-letter ISO country code, such as `US`, `GB`, or `DE`. If `null`, the API auto-resolves the country when possible.
`proxy.type`	string	Proxy type. Supported values are `residential`, `datacenter`, and `mix`. Defaults to `residential`.
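
A proxy object can be assembled and checked locally before it is sent. The helper below is a hypothetical sketch: build_proxy is not part of the API, and uppercasing the country code is an assumption based on the uppercase examples in the table. The check only confirms that the code has two ASCII letters, not that it is an assigned ISO country code.

```python
PROXY_TYPES = {"residential", "datacenter", "mix"}


def build_proxy(country=None, proxy_type="residential"):
    """Return a proxy object for the request body.

    country=None leaves country selection to the API.
    """
    if proxy_type not in PROXY_TYPES:
        raise ValueError(f"proxy type must be one of {sorted(PROXY_TYPES)}")
    if country is not None:
        if len(country) != 2 or not (country.isascii() and country.isalpha()):
            raise ValueError(f"country must be a two-letter code: {country!r}")
        # Assumption: send the code in uppercase, as in the documented examples.
        country = country.upper()
    return {"country": country, "type": proxy_type}
```

For example, build_proxy("de", "datacenter") yields a datacenter proxy object for Germany, and build_proxy() yields the residential type with no fixed country.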

Custom Headers

Use the headers object to send custom HTTP headers with the extraction request.

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}

Do not put your Geonode API key inside headers. Authentication belongs in the X-Api-Key request header sent to the Scraper API.
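
A small guard can catch this mistake before a request is sent. The function name check_custom_headers is hypothetical, and the check is a client-side sketch, not something the API is documented to enforce.

```python
def check_custom_headers(headers):
    """Return a copy of the custom headers, refusing the API key header."""
    for name in headers:
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == "x-api-key":
            raise ValueError(
                "X-Api-Key belongs on the request to the Scraper API, "
                "not inside the headers object"
            )
    return dict(headers)
```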

Link Extraction

Set extract_links to true when you want both the extracted page content and the links found on that page.

curl -X POST "$SCRAPER_API_BASE_URL/v1/extract" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "formats": ["markdown"],
    "render_js": false,
    "processing_mode": "sync",
    "extract_links": true
  }'

The response includes links in data.links.

{
  "data": {
    "markdown": "...",
    "html": null,
    "links": [
      "https://quotes.toscrape.com/login",
      "https://quotes.toscrape.com/author/Albert-Einstein",
      "https://www.goodreads.com/quotes"
    ]
  }
}

extract_links returns links found on the extracted page. It is not a crawler. If you want to discover URLs under a site before deciding what to extract, use the map endpoint.
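
Because data.links can contain both on-site and off-site URLs, a common next step is to keep only links on the same host as the extracted page. The sketch below uses the sample response from this section; same_host_links is a hypothetical helper name, and matching on the exact hostname is an assumption (subdomains are treated as different hosts).

```python
from urllib.parse import urlparse


def same_host_links(response, page_url):
    """Return the links in data.links whose hostname matches page_url."""
    # data.links may be null when extract_links was not requested.
    links = (response.get("data") or {}).get("links") or []
    host = urlparse(page_url).hostname
    return [link for link in links if urlparse(link).hostname == host]


sample_response = {
    "data": {
        "markdown": "...",
        "html": None,
        "links": [
            "https://quotes.toscrape.com/login",
            "https://quotes.toscrape.com/author/Albert-Einstein",
            "https://www.goodreads.com/quotes",
        ],
    }
}

internal = same_host_links(sample_response, "https://quotes.toscrape.com/")
```

With the sample data, internal holds the two quotes.toscrape.com URLs and drops the goodreads.com link.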

Response

In sync mode, a successful extraction returns HTTP 200 with a JSON body like this:

{
  "data": {
    "html": null,
    "markdown": "...",
    "links": null
  },
  "metadata": {
    "url": "https://news.ycombinator.com/",
    "render_js": false,
    "http_status": 200,
    "duration_ms": 799,
    "retry_count": 0,
    "formats": ["markdown"],
    "proxy": {
      "country": "US",
      "type": "residential"
    },
    "processing_mode": "sync",
    "headers": null
  },
  "tokens_charged": 1
}

The response is split into three parts: data contains the extracted content, metadata describes how the request ran, and tokens_charged shows how many requests were charged for the extraction.

Field	Description
`data.markdown`	Extracted Markdown when `markdown` is requested. Otherwise usually `null`.
`data.html`	Extracted sanitized HTML when `html` is requested. Otherwise usually `null`.
`data.links`	Links found on the page when `extract_links` is `true`. Otherwise usually `null`.
`metadata.url`	URL that was extracted.
`metadata.render_js`	Whether a headless browser was used.
`metadata.http_status`	HTTP status code observed from the target site, when available.
`metadata.duration_ms`	Extraction duration in milliseconds.
`metadata.retry_count`	Number of retries attempted.
`metadata.formats`	Output formats requested.
`metadata.proxy`	Proxy settings used for the request, when available.
`metadata.processing_mode`	`sync` or `async`.
`metadata.headers`	Custom headers attached to the extraction request, when provided.
`tokens_charged`	Number of requests charged for this extraction.

The response field is currently named tokens_charged in the API schema. In Scraper API docs and billing language, treat this value as the number of requests charged.
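
The three-part response shape maps naturally onto a small typed record. The sketch below parses the sample response from this section; ExtractResult and parse_extract_response are hypothetical names, and the requests_charged attribute is simply a local label for tokens_charged that follows the billing language described above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractResult:
    markdown: Optional[str]
    html: Optional[str]
    links: Optional[list]
    http_status: Optional[int]
    duration_ms: Optional[int]
    requests_charged: int  # read from the tokens_charged field


def parse_extract_response(raw: str) -> ExtractResult:
    """Turn a sync-mode JSON response into an ExtractResult."""
    body = json.loads(raw)
    data = body.get("data") or {}
    meta = body.get("metadata") or {}
    return ExtractResult(
        markdown=data.get("markdown"),
        html=data.get("html"),
        links=data.get("links"),
        http_status=meta.get("http_status"),
        duration_ms=meta.get("duration_ms"),
        requests_charged=body.get("tokens_charged", 0),
    )


sample = """
{
  "data": {"html": null, "markdown": "...", "links": null},
  "metadata": {
    "url": "https://news.ycombinator.com/",
    "render_js": false,
    "http_status": 200,
    "duration_ms": 799,
    "retry_count": 0,
    "formats": ["markdown"],
    "proxy": {"country": "US", "type": "residential"},
    "processing_mode": "sync",
    "headers": null
  },
  "tokens_charged": 1
}
"""

result = parse_extract_response(sample)
```

Summing requests_charged across results gives a running count of requests charged, which can be compared against billing.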