Start a Crawl Job
Crawl a website from a seed URL and extract page content.
POST /v1/crawl starts from one seed URL, follows discovered links, and extracts content from each crawled page. Use it when you want the API to discover pages for you instead of providing a fixed list of URLs.
Crawls are asynchronous. The create request returns a job_id, and you poll that job for progress and paginated page results.
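The polling pattern can be sketched roughly as follows. This is a minimal illustration, not an official client: `poll_crawl_job` and its parameters are hypothetical names, and the terminal status values `completed` and `failed` are assumptions, since this page only shows `queued`. The status request itself is passed in as a callable so the loop stays independent of any HTTP library.

```python
import time
from typing import Callable

# Assumption: these terminal status names are hypothetical. This page only
# shows "queued"; adjust the set to the values the status endpoint returns.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_crawl_job(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reports a terminal status.

    fetch_status is any zero-argument callable that returns the decoded
    JSON body of the job status endpoint as a dict.
    """
    for attempt in range(max_attempts):
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        # Wait before the next poll, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"crawl job still running after {max_attempts} polls")
```

A fixed interval keeps the sketch short. For long crawls, a growing interval with an upper bound may reduce the number of status requests.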
Request
The example below starts a crawl from quotes.toscrape.com, limits discovery to the same domain, and extracts Markdown from up to 25 pages, with a maximum crawl depth of 2. Page requests go through a US residential proxy.
export SCRAPER_API_BASE_URL="https://scraper.geonode.io"
export GEONODE_SCRAPER_API_KEY="YOUR_API_KEY"
curl -X POST "$SCRAPER_API_BASE_URL/v1/crawl" \
  -H "X-Api-Key: $GEONODE_SCRAPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://quotes.toscrape.com/",
    "depth": 2,
    "limit": 25,
    "formats": ["markdown"],
    "render_js": false,
    "same_domain_only": true,
    "include_subdomains": false,
    "proxy": {
      "country": "US",
      "type": "residential"
    }
  }'
Request Body
The crawl request controls both discovery and extraction. url tells the API where to start, while depth, limit, same_domain_only, and include_subdomains define how far the crawl can expand.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | None | Seed URL to start crawling from. |
| depth | integer | No | 2 | Maximum breadth-first depth from the seed URL. 1 means the seed page only. Maximum is 10. |
| limit | integer | No | 50 | Maximum number of pages to crawl. Maximum is 10000. |
| formats | array | No | ["markdown"] | Output formats to extract per page. Supported values are markdown and html. |
| render_js | boolean | No | false | Uses a headless browser for each page. Useful for JavaScript-heavy sites, but slower. |
| same_domain_only | boolean | No | true | Follows only links that stay on the seed domain. |
| include_subdomains | boolean | No | false | Includes subdomains when same_domain_only is true. |
| proxy | object or null | No | Residential proxy defaults | Proxy settings for page requests. |
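To make the discovery fields concrete, here is a small in-memory model of how depth, limit, same_domain_only, and include_subdomains could bound a breadth-first crawl. It is an illustrative sketch only: `in_scope` and `plan_crawl` are hypothetical helpers, the link graph is supplied by hand, and the plain hostname comparison is an assumption, since this page does not specify how the API matches domains.

```python
from collections import deque
from urllib.parse import urlsplit


def in_scope(seed_url, link, same_domain_only=True, include_subdomains=False):
    """Decide whether a discovered link may be followed (assumed matching rules)."""
    if not same_domain_only:
        return True
    seed_host = urlsplit(seed_url).hostname or ""
    link_host = urlsplit(link).hostname or ""
    if link_host == seed_host:
        return True
    # Subdomains only count when include_subdomains is enabled.
    return include_subdomains and link_host.endswith("." + seed_host)


def plan_crawl(seed_url, links_by_page, depth=2, limit=50, **scope):
    """Return the URLs a breadth-first crawl would visit, in visit order."""
    visited = [seed_url]
    seen = {seed_url}
    queue = deque([(seed_url, 1)])  # the seed page counts as depth 1
    while queue:
        page, page_depth = queue.popleft()
        if page_depth >= depth:
            continue  # do not follow links beyond the maximum depth
        for link in links_by_page.get(page, []):
            if len(visited) >= limit:
                return visited  # page budget exhausted
            if link not in seen and in_scope(seed_url, link, **scope):
                seen.add(link)
                visited.append(link)
                queue.append((link, page_depth + 1))
    return visited
```

With depth set to 1 the plan contains only the seed page, which matches the table's description of that value.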
Start with a conservative limit and increase it after you understand the target site's structure. Large crawls can create many page extraction jobs.
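One way to catch out-of-range values before sending a request is to validate the body on the client. The sketch below uses only the Python standard library and the limits documented in the table. `build_crawl_payload` and `build_crawl_request` are hypothetical helper names, the lower bound of 1 for depth and limit is an assumption, and omitting proxy to get the server defaults is also assumed rather than documented.

```python
import json
import urllib.request

SUPPORTED_FORMATS = {"markdown", "html"}  # values listed in the table
MAX_DEPTH = 10
MAX_LIMIT = 10000


def build_crawl_payload(url, depth=2, limit=50, formats=("markdown",),
                        render_js=False, same_domain_only=True,
                        include_subdomains=False, proxy=None):
    """Build the JSON body for POST /v1/crawl, rejecting out-of-range values."""
    if not url:
        raise ValueError("url is required")
    # Assumption: 1 is the smallest useful value for depth and limit.
    if not 1 <= depth <= MAX_DEPTH:
        raise ValueError(f"depth must be between 1 and {MAX_DEPTH}")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    unsupported = set(formats) - SUPPORTED_FORMATS
    if unsupported:
        raise ValueError(f"unsupported formats: {sorted(unsupported)}")
    payload = {
        "url": url,
        "depth": depth,
        "limit": limit,
        "formats": list(formats),
        "render_js": render_js,
        "same_domain_only": same_domain_only,
        "include_subdomains": include_subdomains,
    }
    # Assumption: leaving proxy out lets the API apply its defaults.
    if proxy is not None:
        payload["proxy"] = proxy
    return payload


def build_crawl_request(base_url, api_key, payload):
    """Prepare (but do not send) the create request with the documented headers."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the body easy to inspect while you tune limit.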
Response
A successful create request returns HTTP 202 (Accepted) with a crawl job ID. The job is queued at this point, not finished.
{
  "job_id": "9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "url": "https://quotes.toscrape.com/",
  "status": "queued",
  "status_url": "/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
  "estimated_pages": 25
}
Use the returned job_id with Get Crawl Job Status.
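Because status_url in the response is a path rather than a full URL, a client needs to resolve it before polling. The helper below is a small sketch: `status_url_for` is a hypothetical name, and it assumes the path is relative to the host in SCRAPER_API_BASE_URL.

```python
from urllib.parse import urljoin


def status_url_for(base_url: str, create_response: dict) -> str:
    """Resolve the status_url path from the create response to a full URL.

    Assumption: status_url is rooted at the API host. A leading "/" makes
    urljoin replace any path component already present in base_url.
    """
    return urljoin(base_url.rstrip("/") + "/", create_response["status_url"])


create_response = {
    "job_id": "9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
    "status": "queued",
    "status_url": "/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019",
}
full_status_url = status_url_for("https://scraper.geonode.io", create_response)
# full_status_url is
# "https://scraper.geonode.io/v1/crawl/9d7b2c8e-8a4b-4c10-9af3-65f4f8f6c019"
```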