Understanding Extraction
The Extraction API converts webpages into clean, structured content that can be consumed by applications, AI systems, search pipelines, and automation workflows.
Instead of downloading a webpage and manually parsing raw HTML, you can send a URL and receive the extracted content in a format that is easier to process.
How It Works
Submit a URL
Send the URL of the webpage you want to extract.
The Page Is Processed
Geonode fetches the webpage and extracts the primary content.
Output Is Generated
The extracted content is returned in one or more supported output formats.
Use the Result
Store, analyze, search, or process the extracted content in your application.
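The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive client: the base URL `https://api.example.com`, the `url` body field, and the bearer-token header are hypothetical placeholders. Only the `POST /v1/extract` path and the `formats` field come from this page. The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical base URL; replace it with the real API host.
BASE_URL = "https://api.example.com"


def build_extract_request(page_url, api_key, formats=("markdown",)):
    """Build (but do not send) a POST /v1/extract request."""
    body = {
        "url": page_url,           # assumed field name for the target page
        "formats": list(formats),  # "markdown", "html", or both
    }
    return urllib.request.Request(
        BASE_URL + "/v1/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed auth scheme; check the authentication documentation.
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


request = build_extract_request("https://example.com/article", "YOUR_API_KEY")
# To send it, pass `request` to urllib.request.urlopen() and parse the
# JSON body of the response.
```

Building the request separately from sending it keeps the payload easy to inspect and test before any network call is made.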
Extraction Endpoints
The Extraction API consists of three endpoints.
| Endpoint | Purpose |
|---|---|
| POST /v1/extract | Extract Markdown and/or HTML from a webpage. |
| GET /v1/extract/jobs | List and filter previous extraction jobs. |
| GET /v1/extract/{job_id} | Retrieve the status or result of a specific extraction job. |
Most extraction workflows begin with POST /v1/extract. The remaining endpoints are primarily used to monitor and retrieve asynchronous extraction jobs.
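As a rough illustration of how the three paths fit together, the helpers below build a URL for each endpoint. The host is a hypothetical placeholder, and the filter parameter used in the usage lines (`status`) is an assumption: this page says jobs can be filtered but does not name the query parameters.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; replace it with the real API host.
BASE_URL = "https://api.example.com"


def extract_url():
    """URL for POST /v1/extract."""
    return f"{BASE_URL}/v1/extract"


def jobs_url(**filters):
    """URL for GET /v1/extract/jobs.

    Any keyword arguments are sent as query parameters. The parameter
    names are hypothetical; consult the endpoint reference for real ones.
    """
    query = urlencode(filters)
    return f"{BASE_URL}/v1/extract/jobs" + (f"?{query}" if query else "")


def job_url(job_id):
    """URL for GET /v1/extract/{job_id}, with the ID percent-encoded."""
    return f"{BASE_URL}/v1/extract/{quote(str(job_id), safe='')}"


print(extract_url())
print(jobs_url(status="completed"))  # "status" is an assumed filter name
print(job_url("abc123"))
```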
Processing Modes
The Extraction API supports both synchronous and asynchronous processing.
Synchronous
The request remains open until extraction is complete, and the extracted content is returned directly in the response.
Asynchronous
The request immediately returns a job_id. Extraction continues in the background, and the result can be retrieved later using GET /v1/extract/{job_id}.
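For the asynchronous mode, one common pattern is to poll the job endpoint until the job finishes. The sketch below assumes the job response carries a `status` field with the values `"completed"` and `"failed"`. Those names are hypothetical and are not confirmed by this page. The HTTP call itself is passed in as `fetch_job`, so the loop stays independent of any particular HTTP client.

```python
import time


def wait_for_job(job_id, fetch_job, interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll an extraction job until it finishes or the timeout expires.

    fetch_job(job_id) is expected to perform GET /v1/extract/{job_id}
    and return the parsed JSON response as a dict.
    The "status" field and its values are assumptions for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"extraction job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extraction job {job_id} did not finish in {timeout}s")
        sleep(interval)  # wait before asking again
```

A caller would submit the page with POST /v1/extract, read the job_id from the response, and then call `wait_for_job(job_id, fetch_job)` with a function that wraps the GET request.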
Extraction Workflow
Synchronous
POST /v1/extract
↓
Extraction completes
↓
Content returned
Asynchronous
POST /v1/extract
↓
job_id returned
↓
GET /v1/extract/{job_id}
↓
Content returned
Output Formats
The Extraction API supports Markdown, HTML, or both formats in a single request.
Markdown
The extracted content is returned in the data.markdown field.
{
  "formats": ["markdown"]
}
Markdown returns the extracted content as plain text with lightweight formatting.
HTML
The extracted content is returned in the data.html field.
{
  "formats": ["html"]
}
HTML returns the extracted content with a structure closer to the original webpage.
Markdown and HTML
The extracted content is returned in both the data.markdown and data.html fields.
{
  "formats": ["markdown", "html"]
}
Both Markdown and HTML are returned in the same response.
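Because the fields present under `data` depend on the requested formats, client code can read whichever ones are there. The sketch below assumes the response has already been parsed into a dict. The sample response is made up for illustration, and only the `data.markdown` and `data.html` field names come from this page.

```python
def extracted_content(response):
    """Return the formats present under "data" as a {format: content} dict."""
    data = response.get("data") or {}
    return {fmt: data[fmt] for fmt in ("markdown", "html") if fmt in data}


# Illustrative response for a request with "formats": ["markdown", "html"].
sample = {"data": {"markdown": "# Title", "html": "<h1>Title</h1>"}}

for fmt, content in extracted_content(sample).items():
    print(fmt, len(content))
```

Checking for each field instead of indexing it directly avoids a KeyError when only one format was requested.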
Next Steps
Continue to Your First Extraction to send your first extraction request and retrieve content from a webpage.