Updated Mar 2, 2026

webScraper Documentation

Overview

The webScraper is a component of the Vantage analytics and data platform for fetching web pages and extracting structured data from them. It employs multiple extraction strategies: Schema.org microdata, HTML tables, repeated card/item containers, optional custom extraction rules, and a page-level text fallback. It can also follow pagination links to scrape data across multiple pages.

Purpose

The primary purpose of the webScraper is to automate the extraction of relevant data from web pages into a structured format, supporting analysis, reporting, and data integration tasks. Typical applications include data aggregation, market research, and competitive analysis.

Settings

The webScraper has several configurable options that influence its operation:

1. urlColumn — the input column containing the URLs to scrape.

2. manualUrls — manually entered URLs, used as an alternative to urlColumn.

3. maxUrls — the maximum number of URLs fetched in a run.

4. timeoutMs — the per-request timeout, in milliseconds.

5. tableIndex — which HTML table on the page to extract when table extraction applies.

6. followPagination — whether to follow pagination links to subsequent pages.

7. maxPages — the maximum number of pages scraped per URL when pagination is followed.

8. rules — optional custom extraction rules; when empty, the built-in strategies are used.

9. extractAll — whether to extract all matching items found on a page.

How it Works

The webScraper operates by:

  1. Input Handling: Accepting input data, either from a specified column with URLs or from manually entered URLs.
  2. URL Processing: For the input URLs, the scraper prepares for fetching content by constructing appropriate request headers, including authentication if necessary.
  3. Page Fetching: It sends HTTP requests to each specified URL while respecting the maxUrls and timeoutMs settings.
  4. Data Extraction: Upon receiving the page content, it attempts to extract structured data using:
    • Schema.org microdata
    • HTML table elements (selected by tableIndex)
    • Repeated card/item containers
    • Custom extraction rules if provided
  5. Pagination: If enabled, it analyzes the page for pagination links and continues to scrape subsequent pages until the maxPages limit is reached.
  6. Output Generation: Finally, it consolidates the extracted data and outputs it in a structured format.
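The steps above can be sketched as a small fetch-and-extract loop. This is an illustrative sketch, not the component's actual implementation: the `fetch` callable and the helpers `extract_table_rows` and `find_next_page` are hypothetical stand-ins, and only one toy extraction strategy (first-table parsing) is shown in place of the real multi-strategy pipeline.

```python
import re

def extract_table_rows(html):
    # Toy stand-in for step 4: collect <tr> rows, treat the first row's
    # cells as headers, and return the remaining rows as dicts.
    trs = re.findall(r"<tr>(.*?)</tr>", html, re.S)
    rows = [re.findall(r"<t[dh]>(.*?)</t[dh]>", tr, re.S) for tr in trs]
    return [dict(zip(rows[0], r)) for r in rows[1:]] if len(rows) > 1 else []

def find_next_page(html):
    # Toy stand-in for step 5: look for a rel="next" pagination link.
    m = re.search(r'<a[^>]*rel="next"[^>]*href="([^"]+)"', html)
    return m.group(1) if m else None

def scrape(urls, fetch, max_urls=10, follow_pagination=False, max_pages=3):
    """Steps 1-6 in miniature: fetch each URL, extract rows, follow pagination."""
    out = []
    for url in urls[:max_urls]:                    # steps 1-2: input handling
        page, seen = url, 0
        limit = max_pages if follow_pagination else 1
        while page and seen < limit:
            html = fetch(page)                     # step 3: page fetching
            out.extend(extract_table_rows(html))   # step 4: data extraction
            page = find_next_page(html) if follow_pagination else None
            seen += 1
    return out                                     # step 6: consolidated output
```

Injecting `fetch` keeps the loop testable without network access; the real component would issue HTTP requests with the appropriate headers here.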

Expected Data

The webScraper expects input data formatted as an array of objects, where each object includes at least the URL to be scraped as defined by the urlColumn setting. The component handles different extraction methods based on the structure of the target web page.
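For example, with urlColumn set to productUrl, a minimal input might look like this (the field name and URLs are illustrative):

```json
[
  { "productUrl": "https://example.com/products/widget-a" },
  { "productUrl": "https://example.com/products/widget-b" }
]
```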

Use Cases & Examples

Use Cases

  1. Market Research: Scraping competitor product listings from e-commerce websites to analyze pricing and features, generating insights for business strategy.

  2. Data Aggregation: Aggregating real-time data from stock market websites to collect and analyze stock price movements across multiple sources.

  3. Content Monitoring: Automatically fetching and analyzing blog posts or news articles from multiple sources to track industry trends or sentiment analysis.

Example Configuration

Use Case: E-commerce Price Monitoring

Goal: Monitor competitor pricing on a given product across multiple e-commerce platforms.

Configuration Sample:

```json
{
    "urlColumn": "productUrl",
    "manualUrls": "",
    "maxUrls": 10,
    "timeoutMs": 15000,
    "tableIndex": 0,
    "followPagination": true,
    "maxPages": 3,
    "rules": [],
    "extractAll": true
}
```

In this configuration:

  • URLs are read from the productUrl column of the input data; manualUrls is left empty.
  • At most 10 URLs are fetched, with a 15-second timeout per request.
  • Pagination links are followed, up to 3 pages per URL.
  • No custom rules are supplied, so the built-in extraction strategies apply, and extractAll is enabled.

With this setup, the webScraper collects pricing data from the specified e-commerce sites across multiple pages, allowing the business to maintain competitive intelligence.
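Once collected, the rows typically feed a downstream analysis step. A minimal sketch, assuming each scraped row carries hypothetical product and price fields:

```python
def lowest_prices(rows):
    """Reduce scraped rows to the lowest observed price per product."""
    best = {}
    for row in rows:
        name, price = row["product"], float(row["price"])
        if name not in best or price < best[name]:
            best[name] = price
    return best

# Rows as the scraper might emit them across several pages and sites.
rows = [
    {"product": "Widget A", "price": "9.99"},
    {"product": "Widget A", "price": "8.49"},
    {"product": "Widget B", "price": "4.50"},
]
```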

AI Integrations & Billing Impact

The webScraper can be integrated with AI functionalities to enhance data processing, such as sentiment analysis or classification of scraped data. In high-volume scenarios, usage can affect billing, since pricing is based on the number of requests, the volume of extracted data, and any additional AI processing invoked.

It's essential for users to monitor their utilization and optimize the settings as necessary to control costs while obtaining the required data.