Realtime Scraping

scrape(url: string, configurations?: IConfigurations)

Parameters

  • url (string): The URL of the webpage you want to scrape.
  • configurations (IConfigurations, optional): An object containing specific configurations for the scraping process.

Description

The .scrape() method uses the Realtime Mode of the Geonode Scraper. This mode is designed for quick results and does not require a separate callback API: the call itself returns the scraped data, within a maximum timeout of 150 seconds. Provide the URL of the web page you want to scrape and, if you need specific settings, pass them in the 'configurations' object.
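Because the call only settles once the scrape finishes or the 150-second limit is reached, a caller may want its own, shorter deadline. The following is a minimal sketch of that idea and is not part of the SDK: `ScraperLike`, `withDeadline`, `scrapeWithDeadline`, and the stub object are hypothetical names introduced for illustration, and the stub stands in for a real scraper instance.

```typescript
// Hypothetical minimal shape of the scraper object used in this page's examples.
interface ScraperLike {
  scrape(url: string, configurations?: Record<string, unknown>): Promise<{ data?: unknown }>;
}

// Upper bound documented for Realtime Mode, in milliseconds.
const REALTIME_TIMEOUT_MS = 150 * 1000;

// Rejects if `work` has not settled within `ms` milliseconds.
function withDeadline<T>(work: Promise<T>, ms: number = REALTIME_TIMEOUT_MS): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`No response within ${ms} ms`)), ms);
    // Whichever path settles first clears the timer so the process can exit.
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Stub standing in for a real scraper instance (assumption for illustration only).
const stubScraper: ScraperLike = {
  scrape: async (url) => ({ data: `<html><!-- ${url} --></html>` }),
};

// Scrape a URL and return the payload, enforcing a client-side deadline.
async function scrapeWithDeadline(scraper: ScraperLike, url: string, ms?: number) {
  const res = await withDeadline(scraper.scrape(url), ms);
  return res?.data;
}
```

Note that the client-side deadline only stops the caller from waiting; it does not cancel the request on the server.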

Usage

  • Basic: Simply provide the URL you want to scrape. The method will use default configurations.
scraper
  .scrape('https://example.com/')
  .then((res) => {
    console.log('Response:', res?.data);
  })
  .catch((err) => {
    console.error('Error:', err);
  });

Example Response:

 '<!DOCTYPE html><html><head>\n' +
    '    <title>Example Domain</title>\n' +
    '\n' +
    '    <meta charset="utf-8">\n' +
    '    <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n' +
    '    <meta name="viewport" content="width=device-width, initial-scale=1">\n' +
    '    <style type="text/css">\n' +
    '    body {\n' +
    '        background-color: #f0f0f2;\n' +
    '        margin: 0;\n' +
    '        padding: 0;\n' +
    '        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n' +
    '        \n' +
    '    }\n' +
    '    div {\n' +
    '        width: 600px;\n' +
    '        margin: 5em auto;\n' +
    '        padding: 2em;\n' +
    '        background-color: #fdfdff;\n' +
    '        border-radius: 0.5em;\n' +
    '        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n' +
    '    }\n' +
    '    a:link, a:visited {\n' +
    '        color: #38488f;\n' +
    '        text-decoration: none;\n' +
    '    }\n' +
    '    @media (max-width: 700px) {\n' +
    '        div {\n' +
    '            margin: 0 auto;\n' +
    '            width: auto;\n' +
    '        }\n' +
    '    }\n' +
    '    </style>    \n' +
    '</head>\n' +
    '\n' +
    '<body>\n' +
    '<div>\n' +
    '    <h1>Example Domain</h1>\n' +
    '    <p>This domain is for use in illustrative examples in documents. You may use this\n' +
    '    domain in literature without prior coordination or asking for permission.</p>\n' +
    '    <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n' +
    '</div>\n' +
    '\n' +
    '\n' +
    '</body></html>'
  • Using Custom Configuration: You can customize the scraping process by providing a configuration object. This object can override default settings.
const customConfig: IConfigurations = {
  js_render: true,
  response_format: 'json',
  device_type: 'desktop',
  country_code: 'us',
  HTMLMinifier: { useMinifier: true }
};
scraper
  .scrape('https://example.com/', customConfig)
  .then((res) => {
    console.log('JSON response:', res?.data);
  })
  .catch((error) => {
    console.error('Error:', error.message);
  });

Example Response:

Response: {
  headers: {
    'accept-ranges': 'bytes',
    age: '410515',
    'cache-control': 'max-age=604800',
    'content-encoding': 'gzip',
    'content-length': '648',
    'content-type': 'text/html; charset=UTF-8',
    date: 'Mon, 14 Aug 2023 21:00:49 GMT',
    etag: '"3147526947"',
    expires: 'Mon, 21 Aug 2023 21:00:49 GMT',
    'last-modified': 'Thu, 17 Oct 2019 07:18:26 GMT',
    server: 'ECS (bsa/EB18)',
    vary: 'Accept-Encoding',
    'x-cache': 'HIT'
  },
  html: '<!DOCTYPE html><html><head><title>Example Domain</title><meta charset=utf-8><meta http-equiv=Content-type content="text/html; charset=utf-8"><meta name=viewport content="width=device-width,initial-scale=1"><style type=text/css>body{background-color:#f0f0f2;margin:0;padding:0;font-family:-apple-system,system-ui,BlinkMacSystemFont,"Segoe UI","Open Sans","Helvetica Neue",Helvetica,Arial,sans-serif}div{width:600px;margin:5em auto;padding:2em;background-color:#fdfdff;border-radius:.5em;box-shadow:2px 3px 7px 2px rgba(0,0,0,.02)}a:link,a:visited{color:#38488f;text-decoration:none}@media (max-width:700px){div{margin:0 auto;width:auto}}</style></head><body><div><h1>Example Domain</h1><p>This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.</p><p><a href=https://www.iana.org/domains/example>More information...</a></p></div></body></html>',
  bandwidthInKb: 0.6328125,
  statusCode: 200,
  requestId: '36390037-6906-4c63-9120-90151f6d15b9',
  dataFromRequest: {}
}
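When the response format is JSON, it can help to validate the payload before using it. The sketch below is hedged: the `JsonScrapeResult` interface is inferred from the example output above and is not an official SDK type, `isJsonScrapeResult` and `htmlOrThrow` are hypothetical helpers, and treating `statusCode` as the HTTP status of the scraped page is an assumption.

```typescript
// Shape inferred from the example output above; not an official SDK type.
interface JsonScrapeResult {
  headers: Record<string, string>;
  html: string;
  bandwidthInKb: number;
  statusCode: number;
  requestId: string;
  dataFromRequest: Record<string, unknown>;
}

// Narrow an unknown payload by checking the fields that htmlOrThrow relies on.
function isJsonScrapeResult(value: unknown): value is JsonScrapeResult {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.html === 'string' &&
    typeof v.statusCode === 'number' &&
    typeof v.requestId === 'string'
  );
}

// Return the HTML for a 2xx status; otherwise throw, including the request ID.
// Assumption: statusCode is the HTTP status reported for the scraped page.
function htmlOrThrow(payload: unknown): string {
  if (!isJsonScrapeResult(payload)) {
    throw new Error('Unexpected response shape');
  }
  if (payload.statusCode < 200 || payload.statusCode >= 300) {
    throw new Error(`Request ${payload.requestId} returned status ${payload.statusCode}`);
  }
  return payload.html;
}
```

With the examples on this page, this would be called as `htmlOrThrow(res?.data)` inside the `.then()` handler.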
  • Using SDK Methods: Instead of a configuration object, you can use the SDK's built-in methods to set specific options. These methods can be chained for multiple configurations.
scraper
  .useJsRendering(true)
  .setResponseFormat('json')
  .setBlockResources(true)
  .setDeviceType('desktop')
  .setCountryCode('us')
  .useDebugger(true)
  .addScenario.onClickAndWaitForNavigation('a', { delay: 500 });
scraper
  .scrape('https://example.com/')
  .then((res) => {
    console.log('JSON response:', res?.data);
  })
  .catch((error) => {
    console.error('Error:', error.message);
  });

Example Response:

Response: {
  headers: {
    'accept-ranges': 'bytes',
    age: '411055',
    'cache-control': 'max-age=604800',
    'content-encoding': 'gzip',
    'content-length': '648',
    'content-type': 'text/html; charset=UTF-8',
    date: 'Mon, 14 Aug 2023 21:09:49 GMT',
    etag: '"3147526947"',
    expires: 'Mon, 21 Aug 2023 21:09:49 GMT',
    'last-modified': 'Thu, 17 Oct 2019 07:18:26 GMT',
    server: 'ECS (bsa/EB18)',
    vary: 'Accept-Encoding',
    'x-cache': 'HIT'
  },
  html: '<!DOCTYPE html><html dir="ltr" lang="en"><head>\n' +
    '  <meta charset="utf-8">\n' +
    '  <meta name="color-scheme" content="light dark">\n' +
    '  <meta name="theme-color" content="#fff">\n' +
    '  <meta name="viewport" content="width=device-width, initial-scale=1.0,\n' +
    '                                 maximum-scale=1.0, user-scalable=no">\n' +
    '  <title>www.iana.org</title>\n' +
    '  <style>/* Copyright 2017 The Chromium Authors\n' +
    ' * Use of this source code is governed by a BSD-style license that can be\n' +
    ' * found in the LICENSE file. */\n' +
    '\n' +
    'a {\n' +
    '  color: var(--link-color);\n' +
    '}\n' +
    '\n' +
    'body {\n' +
    '  --background-color: #fff;\n' +
    '  --error-code-color: var(--google-gray-700);\n' +
    '  --google-blue-100: rgb(210, 227, 252);\n' +
    '  --google-blue-300: rgb(138, 180, 248);\n' +
    '  --google-blue-600: rgb(26, 115, 232);\n' +
    '  --google-blue-700: rgb(25, 103, 210);\n' +
    '  --google-gray-100: rgb(241, 243, 244);\n' +
    '  --google-gray-300: rgb(218, 220, 224);\n' +
    '  --google-gray-500: rgb(154, 160, 166);\n' +
    '  --google-gray-50: rgb(248, 249, 250);\n' +
    '  --google-gray-600: rgb(128, 134, 139);\n' +
    '  --google-gray-700: rgb(95, 99, 104);\n'... 228236 more characters,
  bandwidthInKb: 8.3623046875,
  statusCode: 200,
  requestId: '03f25bd8-8e27-47a2-aca5-20bd9f50e5e3',
  dataFromRequest: {},
  debug: {
    loadRequests: [
      {
        url: 'https://example.com/',
        type: 'document',
        status: 200,
        duration: 0.5455569999814034,
        bandwidth: 0.6328125
      }
    ],
    blockedRequests: [
      { url: 'https://www.iana.org/domains/example', type: 'document' },
      {
        url: 'data:image/png;base64,iVBORw0KGgoAAAANSUhEU…',
        type: 'image'
      },
      {
        url: 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAACY4AAADCCAM…',
        type: 'image'
      },
    ],
    total: {
      load: { counter: 1, bandwidth: 0.6328125, latency: 0.5455569999814034 },
      block: { counter: 4 }
    }
  }
}
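The `debug` object returned when the debugger is enabled can be summarized in a few lines. This is a sketch under stated assumptions: the `DebugInfo`, `LoadedRequest`, and `BlockedRequest` interfaces are inferred from the example output above rather than taken from the SDK, and `countBlockedByType` and `totalLoadedBandwidth` are hypothetical helper names.

```typescript
// Shapes inferred from the example `debug` output above; not official SDK types.
interface LoadedRequest {
  url: string;
  type: string;
  status: number;
  duration: number;
  bandwidth: number;
}
interface BlockedRequest {
  url: string;
  type: string;
}
interface DebugInfo {
  loadRequests: LoadedRequest[];
  blockedRequests: BlockedRequest[];
}

// Count blocked requests per resource type (for example, document or image).
function countBlockedByType(debug: DebugInfo): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const req of debug.blockedRequests) {
    counts[req.type] = (counts[req.type] || 0) + 1;
  }
  return counts;
}

// Sum the bandwidth reported for the requests that were actually loaded.
function totalLoadedBandwidth(debug: DebugInfo): number {
  return debug.loadRequests.reduce((sum, req) => sum + req.bandwidth, 0);
}
```

A summary like this makes it easier to see whether `setBlockResources(true)` is filtering the resource types you expect.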