Understanding Extraction

The Extraction API converts webpages into clean, structured content that can be consumed by applications, AI systems, search pipelines, and automation workflows.

Instead of downloading a webpage and manually parsing raw HTML, you can send a URL and receive the extracted content in a format that is easier to process.

How It Works

Submit a URL

Send the URL of the webpage you want to extract.

The Page Is Processed

Geonode fetches the webpage and extracts the primary content.

Output Is Generated

The extracted content is returned in one or more supported output formats.

Use the Result

Store, analyze, search, or process the extracted content in your application.
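The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference client: the base URL `https://api.example.com`, the bearer-token `Authorization` header, and the `url` request field are placeholders chosen for illustration. Only the `POST /v1/extract` path, the `formats` field, and the `data.markdown` response field come from this page.

```python
import json
import urllib.request

# Assumption: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_extract_request(page_url, api_key, formats=("markdown",)):
    """Step 1: build (but do not send) a POST /v1/extract request."""
    # Assumption: the target page is passed in a "url" field.
    body = json.dumps({"url": page_url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/extract",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {api_key}",
        },
    )


def extract_markdown(page_url, api_key):
    """Steps 2-4: send the request and return the extracted Markdown."""
    request = build_extract_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    # The page states that Markdown output is returned in data.markdown.
    return payload["data"]["markdown"]
```

Separating request construction from sending keeps the request shape easy to inspect before any network call is made.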

Extraction Endpoints

The Extraction API consists of three endpoints.

Endpoint	Purpose
`POST /v1/extract`	Extract Markdown and/or HTML from a webpage.
`GET /v1/extract/jobs`	List and filter previous extraction jobs.
`GET /v1/extract/{job_id}`	Retrieve the status or result of a specific extraction job.

Most extraction workflows begin with POST /v1/extract. The remaining endpoints are primarily used to monitor and retrieve asynchronous extraction jobs.
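As an illustration, the three endpoints could be wrapped in small URL helpers like the ones below. The host `https://api.example.com` is a placeholder, and passing job filters as query parameters (including the `status` name in the usage note) is an assumption. This page only says that jobs can be listed and filtered.

```python
from urllib.parse import quote, urlencode

# Assumption: placeholder host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def extract_url():
    """URL for POST /v1/extract."""
    return f"{BASE_URL}/v1/extract"


def jobs_url(**filters):
    """URL for GET /v1/extract/jobs."""
    # Assumption: filters are sent as query parameters; names are hypothetical.
    query = urlencode(filters)
    return f"{BASE_URL}/v1/extract/jobs" + (f"?{query}" if query else "")


def job_url(job_id):
    """URL for GET /v1/extract/{job_id}, with the ID percent-encoded."""
    return f"{BASE_URL}/v1/extract/{quote(str(job_id), safe='')}"
```

For example, `jobs_url(status="completed")` would build a filtered listing URL, provided the API accepts a filter by that name.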

Processing Modes

The Extraction API supports both synchronous and asynchronous processing.

Synchronous

The request remains open until extraction is complete, and the extracted content is returned directly in the response.

Asynchronous

The request immediately returns a job_id. The extraction continues in the background, and the result can be retrieved later using GET /v1/extract/{job_id}.

Extraction Workflow

Synchronous

POST /v1/extract
        ↓
Extraction completes
        ↓
Content returned

Asynchronous

POST /v1/extract
        ↓
job_id returned
        ↓
GET /v1/extract/{job_id}
        ↓
Content returned
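The asynchronous flow above amounts to a polling loop, sketched here in Python. The `status` field and its `completed` and `failed` values are assumptions, since this page does not list the job states. `fetch_job` is a hypothetical callable that performs `GET /v1/extract/{job_id}` and returns the decoded JSON body as a dict.

```python
import time


def wait_for_job(job_id, fetch_job, poll_interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll a job through fetch_job until it finishes, fails, or times out."""
    for attempt in range(max_attempts):
        job = fetch_job(job_id)
        # Assumption: a "status" field with these values; check the API reference.
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"extraction job {job_id} failed")
        # Wait between polls, but not after the final attempt.
        if attempt < max_attempts - 1:
            sleep(poll_interval)
    raise TimeoutError(f"extraction job {job_id} did not finish after {max_attempts} polls")
```

Passing `fetch_job` and `sleep` as arguments keeps the loop independent of any HTTP library and lets it be exercised without network access.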

Output Formats

The Extraction API can return Markdown, HTML, or both in a single request. The formats field of the request selects the output.

Markdown

Markdown returns the extracted content as plain text with lightweight formatting. The extracted content is returned in the data.markdown field.

{
  "formats": ["markdown"]
}

HTML

HTML returns the extracted content with a structure closer to the original webpage. The extracted content is returned in the data.html field.

{
  "formats": ["html"]
}

Markdown and HTML

Both formats are returned in the same response, in the data.markdown and data.html fields.

{
  "formats": ["markdown", "html"]
}
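The format options above can be handled with two small helpers, sketched here in Python. The function names are hypothetical, and the assumption that a response omits the field for a format that was not requested is illustrative. The `formats`, `data.markdown`, and `data.html` names come from this page.

```python
SUPPORTED_FORMATS = ("markdown", "html")


def build_formats(*formats):
    """Return a validated {"formats": [...]} request fragment."""
    unknown = [name for name in formats if name not in SUPPORTED_FORMATS]
    if not formats or unknown:
        raise ValueError(f"formats must be drawn from {SUPPORTED_FORMATS}, got {formats}")
    # Remove duplicates while keeping the caller's order.
    return {"formats": list(dict.fromkeys(formats))}


def read_content(response):
    """Map each returned format name to its content from the data object."""
    # Assumption: formats that were not requested are absent from "data".
    data = response.get("data", {})
    return {name: data[name] for name in SUPPORTED_FORMATS if name in data}
```

Validating the list before sending surfaces a misspelled format name locally instead of relying on an error response from the API.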

Next Steps

Continue to Your First Extraction to send your first extraction request and retrieve content from a webpage.
