Extract Content

Extract Markdown or HTML from a single webpage.

POST /v1/extract extracts content from a single webpage URL. Use this endpoint when you already know the page you want to scrape and you want the API to return the extracted content as Markdown, HTML, or both.

Request

The example below is a shell command. You can run it from a terminal on macOS, Linux, or Windows with a shell that supports curl. The first two lines set environment variables so you do not have to paste the base URL and API key into every command. Replace YOUR_API_KEY with your actual Scraper API key before running it.

This request extracts the Hacker News front page as Markdown, does not use JavaScript rendering, waits for the result in the same HTTP response, and routes the request through a US residential proxy.

export SCRAPER_API_BASE_URL="https://scraper.geonode.io"
export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"

curl -X POST "$SCRAPER_API_BASE_URL/v1/extract" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com/",
    "formats": ["markdown"],
    "render_js": false,
    "processing_mode": "sync",
    "proxy": {
      "country": "US",
      "type": "residential"
    }
  }'

If the request succeeds, the API returns a JSON response with the extracted content inside data.markdown. If you want HTML instead, change formats to ["html"]. If the page needs browser rendering, set render_js to true.
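
For callers who prefer Python to curl, the same request can be sketched with the standard library. This is a minimal sketch, not an official client: the helper name build_extract_request is hypothetical, and it assumes the base URL, X-Api-Key header, and body fields from the curl example above. The helper only builds the request; sending it is left as a commented-out step because that needs network access and a valid key.

```python
import json
import os
import urllib.request

# Assumption: the same environment variables as the curl example above.
BASE_URL = os.environ.get("SCRAPER_API_BASE_URL", "https://scraper.geonode.io")
API_KEY = os.environ.get("GEONODE_SCRAPER_API_KEY", "YOUR_API_KEY")


def build_extract_request(url, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a POST /v1/extract request. Hypothetical helper."""
    body = {
        "url": url,
        "formats": ["markdown"],
        "render_js": False,
        "processing_mode": "sync",
        "proxy": {"country": "US", "type": "residential"},
    }
    return urllib.request.Request(
        f"{base_url}/v1/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request("https://news.ycombinator.com/")
# To send it and read the Markdown (requires network access and a valid key):
# with urllib.request.urlopen(request) as response:
#     markdown = json.load(response)["data"]["markdown"]
```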

Request Body

The extraction request body tells the API which page to fetch, which output formats to return, and whether the extraction should use browser rendering or custom proxy settings. Only url is required. If you're making your first request, start with formats: ["markdown"], render_js: false, and processing_mode: "sync". You can add JavaScript rendering, proxy overrides, custom headers, or link extraction once you know what the target page needs.

The defaults are meant to keep the first request small. If you omit formats, the API defaults to HTML output. If you omit proxy, the API applies its default residential proxy routing. If you omit processing_mode, the request runs synchronously and returns the result in the same response.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | None | The webpage URL to extract. Must be a valid URI with a hostname. |
| formats | array | No | ["html"] | Output formats to return. Supported values are markdown and html. |
| render_js | boolean | No | false | Uses a headless browser before extraction. Useful for JavaScript-heavy pages, but usually slower. |
| processing_mode | string | No | sync | Use sync to wait for the result in the same request, or async to create a background job. |
| proxy | object or null | No | Residential proxy defaults | Controls proxy country and type. If omitted or null, default proxy routing is applied. |
| headers | object or null | No | null | Custom HTTP headers to include in the extraction request. |
| extract_links | boolean | No | false | When true, returns discovered page links in data.links. |

Most integrations only need url, formats, and render_js at first. Add proxy when you need a specific country or proxy type, and add processing_mode: "async" when the target page may take too long for a blocking request.
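
To make the defaults concrete, the sketch below fills a partial request body with the documented default values and rejects obviously invalid input before anything is sent. The helper name effective_body is hypothetical, and the checks are a client-side convenience only; the API applies these defaults itself and its own validation may differ.

```python
from urllib.parse import urlparse

# Defaults copied from the Request Body table above.
DEFAULTS = {
    "formats": ["html"],
    "render_js": False,
    "processing_mode": "sync",
    "proxy": None,  # null or omitted means default proxy routing
    "headers": None,
    "extract_links": False,
}
SUPPORTED_FORMATS = {"markdown", "html"}
SUPPORTED_MODES = {"sync", "async"}


def effective_body(body):
    """Return the request body with the documented defaults filled in.

    Hypothetical client-side helper; the API applies these defaults itself.
    """
    if not urlparse(body.get("url") or "").hostname:
        raise ValueError("url is required and must include a hostname")
    merged = {**DEFAULTS, **body}
    unknown = set(merged["formats"]) - SUPPORTED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    if merged["processing_mode"] not in SUPPORTED_MODES:
        raise ValueError("processing_mode must be 'sync' or 'async'")
    return merged


print(effective_body({"url": "https://example.com"}))
```

Printing the merged body before a first request is a quick way to confirm that omitting formats yields HTML rather than Markdown.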

JavaScript Rendering

Set render_js to true when the target page loads content in the browser.

{
  "url": "https://quotes.toscrape.com/js/",
  "formats": ["markdown"],
  "render_js": true,
  "processing_mode": "sync"
}

Use render_js: false for normal static pages. It is faster and is usually enough for documentation pages, articles, simple catalogs, and basic HTML sites.

Use render_js: true when the response looks like a page shell, loading skeleton, or navigation-only document. JavaScript rendering uses the same request model as a static extraction and does not consume extra requests.
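
One way to apply this guidance is to try the faster static extraction first and retry with render_js only when the result looks like an empty shell. The sketch below is an illustration under stated assumptions: the 200-character threshold in looks_like_shell is an arbitrary heuristic, not an API rule, and send stands in for whatever function you use to POST a body to /v1/extract and decode the JSON response. A fake send function is used here so the example runs without network access.

```python
def looks_like_shell(markdown, min_chars=200):
    """Heuristic (assumption): very short Markdown suggests an unrendered shell."""
    return markdown is None or len(markdown.strip()) < min_chars


def extract_with_fallback(url, send):
    """Try a static extraction first; retry with render_js only if needed.

    `send` is a caller-supplied function that POSTs a body to /v1/extract
    and returns the decoded JSON response.
    """
    body = {"url": url, "formats": ["markdown"], "render_js": False}
    response = send(body)
    if looks_like_shell(response["data"]["markdown"]):
        response = send({**body, "render_js": True})
    return response


def fake_send(body):
    # Stand-in for a real HTTP call: the static fetch returns a near-empty page.
    if body["render_js"]:
        text = "# Quotes\n" + "quote text " * 40
    else:
        text = "Loading..."
    return {"data": {"markdown": text}, "metadata": {"render_js": body["render_js"]}}


result = extract_with_fallback("https://quotes.toscrape.com/js/", fake_send)
print(result["metadata"]["render_js"])  # prints True
```

Note that the fallback issues a second API call, which this sketch assumes is charged as its own request.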

Proxy and Geo-Targeting

The Scraper API can route requests through Geonode proxies. If you omit proxy, the API applies default residential proxy routing and tries to infer a useful country from the target URL. If no country can be inferred, the upstream proxy provider picks the exit region.

To control location, pass a proxy object:

{
  "proxy": {
    "country": "US",
    "type": "residential"
  }
}

| Field | Type | Description |
| --- | --- | --- |
| proxy.country | string or null | Two-letter ISO country code, such as US, GB, or DE. If null, the API auto-resolves the country when possible. |
| proxy.type | string | Proxy type. Supported values are residential, datacenter, and mix. Defaults to residential. |
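
The proxy fields above can be checked on the client before a request is sent. The following is a minimal sketch: make_proxy is a hypothetical helper, and uppercasing the country code is an assumption made to match the examples in this page, since the source does not say whether the API accepts lowercase codes.

```python
PROXY_TYPES = {"residential", "datacenter", "mix"}


def make_proxy(country=None, proxy_type="residential"):
    """Build a proxy object for the request body. Hypothetical helper."""
    if proxy_type not in PROXY_TYPES:
        raise ValueError(f"proxy type must be one of {sorted(PROXY_TYPES)}")
    if country is not None:
        # Assumption: normalise to uppercase to match the documented examples.
        country = country.upper()
        if len(country) != 2 or not (country.isascii() and country.isalpha()):
            raise ValueError("country must be a two-letter ISO code such as US")
    # country=None leaves the country for the API to resolve automatically.
    return {"country": country, "type": proxy_type}


print(make_proxy("de", "datacenter"))  # {'country': 'DE', 'type': 'datacenter'}
```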

Custom Headers

Use the headers object to send custom HTTP headers with the extraction request.

{
  "url": "https://example.com",
  "formats": ["markdown"],
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}

Do not put your Geonode API key inside headers. Authentication belongs in the X-Api-Key request header sent to the Scraper API.
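
A small client-side guard can catch the mistake described above before a request leaves your code. This is a sketch only: check_custom_headers is a hypothetical helper, and it compares header names case-insensitively because HTTP header names are not case-sensitive.

```python
def check_custom_headers(headers):
    """Reject the API key header inside the body's `headers` object.

    Hypothetical client-side guard; names are compared case-insensitively.
    """
    for name in headers:
        if name.lower() == "x-api-key":
            raise ValueError("send X-Api-Key as a request header, not in body headers")
    return dict(headers)


body = {
    "url": "https://example.com",
    "formats": ["markdown"],
    "headers": check_custom_headers({"Accept-Language": "en-US,en;q=0.9"}),
}
print(body["headers"])
```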

Link Extraction

Set extract_links to true when you want the extracted page content and the links found on that page.

curl -X POST "$SCRAPER_API_BASE_URL/v1/extract" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "formats": ["markdown"],
    "render_js": false,
    "processing_mode": "sync",
    "extract_links": true
  }'

The response includes links in data.links.

{
  "data": {
    "markdown": "...",
    "html": null,
    "links": [
      "https://quotes.toscrape.com/login",
      "https://quotes.toscrape.com/author/Albert-Einstein",
      "https://www.goodreads.com/quotes"
    ]
  }
}

extract_links returns links found on the extracted page. It is not a crawler. If you want to discover URLs under a site before deciding what to extract, use the map endpoint.
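
Because data.links mixes internal and external URLs, a common next step is to keep only links on the same host as the extracted page. The sketch below does this with the standard library, using the sample response shown above; same_site_links is a hypothetical helper, and matching on exact hostname is a design choice that treats subdomains as different sites.

```python
from urllib.parse import urlparse


def same_site_links(response, page_url):
    """Return data.links entries that share the hostname of page_url."""
    host = urlparse(page_url).hostname
    links = response["data"].get("links") or []  # links is null when not requested
    return [link for link in links if urlparse(link).hostname == host]


response = {
    "data": {
        "markdown": "...",
        "html": None,
        "links": [
            "https://quotes.toscrape.com/login",
            "https://quotes.toscrape.com/author/Albert-Einstein",
            "https://www.goodreads.com/quotes",
        ],
    }
}
print(same_site_links(response, "https://quotes.toscrape.com/"))
```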

Response

In sync mode, a successful extraction returns 200.

{
  "data": {
    "html": null,
    "markdown": "...",
    "links": null
  },
  "metadata": {
    "url": "https://news.ycombinator.com/",
    "render_js": false,
    "http_status": 200,
    "duration_ms": 799,
    "retry_count": 0,
    "formats": ["markdown"],
    "proxy": {
      "country": "US",
      "type": "residential"
    },
    "processing_mode": "sync",
    "headers": null
  },
  "tokens_charged": 1
}

The response is split into three parts: data contains the extracted content, metadata describes how the request ran, and tokens_charged shows how many requests were charged for the extraction.

| Field | Description |
| --- | --- |
| data.markdown | Extracted Markdown when markdown is requested. Otherwise usually null. |
| data.html | Extracted sanitized HTML when html is requested. Otherwise usually null. |
| data.links | Links found on the page when extract_links is true. Otherwise usually null. |
| metadata.url | URL that was extracted. |
| metadata.render_js | Whether a headless browser was used. |
| metadata.http_status | HTTP status code observed from the target site, when available. |
| metadata.duration_ms | Extraction duration in milliseconds. |
| metadata.retry_count | Number of retries attempted. |
| metadata.formats | Output formats requested. |
| metadata.proxy | Proxy settings used for the request, when available. |
| metadata.processing_mode | sync or async. |
| metadata.headers | Custom headers attached to the extraction request, when provided. |
| tokens_charged | Number of requests charged for this extraction. |

The response field is currently named tokens_charged in the API schema. In Scraper API docs and billing language, treat this value as the number of requests charged.
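
As an illustration of how the three parts of the response fit together, the sketch below pulls out the fields a caller typically checks first, using the sample sync response above. The helper name summarize and the output key names are hypothetical; note that it reports tokens_charged under a requests_charged key, following the naming guidance in the previous paragraph.

```python
def summarize(response):
    """Pick out the fields most callers check first. Hypothetical helper."""
    data = response["data"]
    meta = response["metadata"]
    return {
        # Prefer Markdown when present, otherwise fall back to HTML.
        "content": data.get("markdown") or data.get("html"),
        # Status code observed from the target site, not from the Scraper API.
        "target_status": meta.get("http_status"),
        "seconds": meta["duration_ms"] / 1000,
        "requests_charged": response["tokens_charged"],
    }


sample = {
    "data": {"html": None, "markdown": "...", "links": None},
    "metadata": {
        "url": "https://news.ycombinator.com/",
        "render_js": False,
        "http_status": 200,
        "duration_ms": 799,
        "retry_count": 0,
    },
    "tokens_charged": 1,
}
print(summarize(sample))
```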
