Start a Crawl Job

Coming soon

This endpoint is documented but not yet available in production. The contract below reflects the planned behavior. Contact support to request early access or a launch notification.

POST /v1/crawl starts from one seed URL, follows discovered links, and extracts content from each crawled page. Use it when you want the API to discover pages for you instead of providing a fixed list of URLs.

Crawls are asynchronous. The create request returns a job_id, and you poll that job for progress and paginated page results.
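The polling flow described above can be sketched as a small loop. This is a minimal illustration, not the official client: `fetch_status` is a hypothetical callable that you supply (for example, a function that requests the job's status endpoint and returns the parsed JSON), and the terminal status names are assumptions, since only `queued` appears in the documented create response.

```python
import time
from typing import Callable

# Assumption: these terminal status names are hypothetical. Only "queued"
# appears in the documented create response.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_crawl_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status(job_id) until the job reports a terminal status."""
    for attempt in range(max_attempts):
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Wait between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(
        f"crawl job {job_id} did not finish after {max_attempts} polls"
    )
```

Passing `fetch_status` and `sleep` as arguments keeps the loop independent of any HTTP library and makes it easy to exercise without network access.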

Request

The example below starts a crawl at quotes.toscrape.com, restricts discovery to the same domain, and extracts Markdown from at most 25 pages within a maximum depth of 2. Page requests go through a US residential proxy.

export SCRAPER_API_BASE_URL="https://scraper.geonode.io"
export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"

curl -X POST "$SCRAPER_API_BASE_URL/v1/crawl" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "depth": 2,
    "limit": 25,
    "formats": ["markdown"],
    "render_js": false,
    "same_domain_only": true,
    "include_subdomains": false,
    "proxy": {
      "country": "US",
      "type": "residential"
    }
  }'
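For readers who prefer Python, the same request can be built with the standard library. This is a sketch that mirrors the curl command above; the helper name `build_crawl_request` is made up for illustration, and the actual send is left commented out because the endpoint is not yet available.

```python
import json
import os
import urllib.request

# Same environment variables as the curl example, with the same fallbacks.
BASE_URL = os.environ.get("SCRAPER_API_BASE_URL", "https://scraper.geonode.io").rstrip("/")
API_KEY = os.environ.get("GEONODE_SCRAPER_API_KEY", "YOUR_API_KEY")


def build_crawl_request(payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/crawl request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "url": "https://quotes.toscrape.com/",
    "depth": 2,
    "limit": 25,
    "formats": ["markdown"],
    "render_js": False,
    "same_domain_only": True,
    "include_subdomains": False,
    "proxy": {"country": "US", "type": "residential"},
}
request = build_crawl_request(payload)

# Sending is commented out because the endpoint is not yet live:
# with urllib.request.urlopen(request) as response:
#     job = json.load(response)
```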

Request Body

The crawl request controls both discovery and extraction. url tells the API where to start, while depth, limit, same_domain_only, and include_subdomains define how far the crawl can expand.
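To make the two scoping flags concrete, here is one possible reading of them as a link filter. The matching rule is an assumption: this sketch compares hostnames exactly and treats a subdomain as any host ending in `.` plus the seed host. The service may define "same domain" differently, and `in_scope` is a hypothetical helper, not part of the API.

```python
from urllib.parse import urlsplit


def in_scope(
    seed_url: str,
    link_url: str,
    same_domain_only: bool = True,
    include_subdomains: bool = False,
) -> bool:
    """Decide whether a discovered link may be followed (assumed rule)."""
    if not same_domain_only:
        return True  # no domain restriction at all
    seed_host = urlsplit(seed_url).hostname or ""
    link_host = urlsplit(link_url).hostname or ""
    if link_host == seed_host:
        return True
    # Assumption: a subdomain is any host that ends with ".<seed host>".
    return include_subdomains and link_host.endswith("." + seed_host)
```

Under these assumptions, a link from `https://example.com/` to `https://blog.example.com/` is followed only when `include_subdomains` is true, and a link to another site is followed only when `same_domain_only` is false.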

Field	Type	Required	Default	Description
`url`	string	Yes	None	Seed URL to start crawling from.
`depth`	integer	No	`2`	Maximum breadth-first depth from the seed URL. `1` means the seed page only. Maximum is `10`.
`limit`	integer	No	`50`	Maximum number of pages to crawl. Maximum is `10000`.
`formats`	array	No	`["markdown"]`	Output formats to extract per page. Supported values are `markdown` and `html`.
`render_js`	boolean	No	`false`	Uses a headless browser for each page. Useful for JavaScript-heavy sites, but slower.
`same_domain_only`	boolean	No	`true`	Follows only links that stay on the seed domain.
`include_subdomains`	boolean	No	`false`	Also follows links to subdomains of the seed domain. Applies only when `same_domain_only` is `true`.
`proxy`	object or null	No	Residential proxy defaults	Proxy settings for page requests.
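The interaction between `depth` and `limit` is easiest to see on a toy link graph. The sketch below illustrates the documented semantics (breadth-first, with the seed page at depth 1). It is not the service's implementation, and the graph, the `crawl_order` helper, and the tie-breaking order are made up for the example.

```python
from collections import deque


def crawl_order(
    links: dict[str, list[str]], seed: str, depth: int = 2, limit: int = 50
) -> list[str]:
    """Return pages in breadth-first order, counting the seed as depth 1."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 1)])
    while queue and len(visited) < limit:
        page, page_depth = queue.popleft()
        if page_depth >= depth:
            continue  # do not follow links beyond the maximum depth
        for target in links.get(page, []):
            if target not in seen and len(visited) < limit:
                seen.add(target)
                visited.append(target)
                queue.append((target, page_depth + 1))
    return visited


# Hypothetical site: "/" links to "/a" and "/b", which link further.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/c", "/d"]}

print(crawl_order(site, "/", depth=1))           # ['/']
print(crawl_order(site, "/", depth=2))           # ['/', '/a', '/b']
print(crawl_order(site, "/", depth=3, limit=4))  # ['/', '/a', '/b', '/c']
```

With `depth` set to 1 only the seed is returned, matching the table above, and `limit` cuts the crawl short even when deeper pages are still reachable.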

Start with a conservative limit and increase it after you understand the target site's structure. Large crawls can create many page extraction jobs.
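A client can check a request body against the documented limits before sending it. The following is a minimal sketch under a few assumptions: the maximums and supported formats come from the table above, while the lower bound of 1 for `depth` and `limit` and the rejection of an empty `formats` list are guesses. `validate_crawl_body` is a hypothetical helper.

```python
SUPPORTED_FORMATS = ("markdown", "html")
MAX_DEPTH = 10
MAX_LIMIT = 10_000


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_crawl_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means the body looks valid."""
    errors = []
    url = body.get("url")
    if not isinstance(url, str) or not url:
        errors.append("url is required and must be a non-empty string")
    depth = body.get("depth", 2)
    # Assumption: minimum depth is 1 (the seed page only).
    if not _is_int(depth) or not 1 <= depth <= MAX_DEPTH:
        errors.append(f"depth must be an integer between 1 and {MAX_DEPTH}")
    limit = body.get("limit", 50)
    # Assumption: minimum limit is 1.
    if not _is_int(limit) or not 1 <= limit <= MAX_LIMIT:
        errors.append(f"limit must be an integer between 1 and {MAX_LIMIT}")
    formats = body.get("formats", ["markdown"])
    if (
        not isinstance(formats, list)
        or not formats
        or any(f not in SUPPORTED_FORMATS for f in formats)
    ):
        errors.append("formats must be a non-empty list of 'markdown' and/or 'html'")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in one pass.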

Response

A successful create request returns HTTP 202 Accepted with a crawl job ID.

{
  "job_id": "9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "url": "https://quotes.toscrape.com/",
  "status": "queued",
  "status_url": "/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "estimated_pages": 25
}

Use the returned job_id with Get Crawl Job Status.
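The `status_url` in the response is a relative path, so a client needs to resolve it against the API base URL before polling. A small sketch using the sample response above; `status_url_for` is a hypothetical helper name, and the base URL is the one from the request example.

```python
import json
from urllib.parse import urljoin

# The sample create response from this page.
create_response = json.loads("""
{
  "job_id": "9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "url": "https://quotes.toscrape.com/",
  "status": "queued",
  "status_url": "/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "estimated_pages": 25
}
""")


def status_url_for(base_url: str, response: dict) -> str:
    """Resolve the relative status_url against the API base URL."""
    return urljoin(base_url, response["status_url"])


print(status_url_for("https://scraper.geonode.io", create_response))
# https://scraper.geonode.io/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019
```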
