webScraper Documentation
Overview
The webScraper is a powerful component integrated into the Vantage analytics and data platform, designed for fetching and extracting structured data from web pages. It employs multiple extraction strategies, including Schema.org microdata, HTML tables, repeated card/item containers, optional custom extraction rules, and page-level text fallbacks. The component can also traverse pagination links to scrape data across multiple pages.
Purpose
The primary purpose of the webScraper is to enable automated extraction of relevant data from web pages in a structured format, facilitating analysis, reporting, and data integration tasks. This functionality is beneficial for a wide range of applications, including data aggregation, market research, competitive analysis, and more.
Settings
The webScraper has several configurable options that influence its operation:
1. urlColumn
- Input Type: String
- Purpose: Defines the name of the column that contains URLs in the input data source.
- Effect: By adjusting this value, you specify the column from which URLs will be extracted for scraping.
- Default Value:
"url"
2. manualUrls
- Input Type: String
- Purpose: Allows users to input a list of URLs manually, separated by new lines.
- Effect: If this setting is populated, it overrides the input data for URL extraction.
- Default Value:
"" (empty string)
3. maxUrls
- Input Type: Numeric
- Purpose: Sets a maximum limit on the number of URLs to scrape from the input data.
- Effect: This limits the total number of requests made by the scraper to avoid overwhelming the target server and to manage processing time.
- Default Value:
20
4. timeoutMs
- Input Type: Numeric
- Purpose: Defines how long (in milliseconds) the scraper will wait for a server response before aborting the request.
- Effect: Adjusting this value can help in situations where the target server is slow or unresponsive, affecting the efficiency of data scraping.
- Default Value:
10000 (10 seconds)
5. tableIndex
- Input Type: Numeric
- Purpose: Specifies which table to extract data from, with zero-based indexing.
- Effect: A value of 0 targets the first table; -1 extracts data from all tables found on the page.
- Default Value: 0
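The tableIndex semantics can be expressed as a short selection helper. This is an illustrative Python sketch of the documented behavior, not the component's actual implementation:

```python
def select_tables(tables, table_index):
    """Pick tables per the tableIndex setting.

    tables: parsed <table> elements in document order.
    table_index: 0-based index of one table, or -1 for all tables.
    """
    if table_index == -1:
        return tables  # -1: extract from every table on the page
    # Slice keeps the result a list and yields [] if the index is out of range.
    return tables[table_index:table_index + 1]
```

For example, `select_tables(parsed, 0)` returns only the first table, matching the default configuration.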
6. followPagination
- Input Type: Boolean
- Purpose: Determines whether the scraper should follow pagination links to scrape data from multiple pages.
- Effect: If set to true, the scraper will attempt to navigate through pagination links found on each scraped page, resulting in a broader data set.
- Default Value: false
7. maxPages
- Input Type: Numeric
- Purpose: Defines the maximum number of pages to scrape when pagination is followed.
- Effect: This setting can be used to control the extent of the data retrieval process, balancing thoroughness and resource consumption.
- Default Value:
5
8. rules
- Input Type: Array
- Purpose: Accepts an array of custom extraction rules (CSS selectors, regex) for tailored data extraction.
- Effect: Users can override the default extraction strategies by defining specific rules to capture non-standard data formats.
- Default Value:
[] (empty array)
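The exact rule schema is not documented here; as a purely hypothetical illustration, a rule might pair an output field name with a CSS selector and an optional regex (all key names below are assumptions, not the component's actual format):

```json
[
  { "field": "price", "selector": ".product-price", "regex": "[0-9.]+" },
  { "field": "title", "selector": "h1.product-title" }
]
```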
9. extractAll
- Input Type: Boolean
- Purpose: Indicates whether the scraper should attempt to extract all relevant data from identified elements or limit it.
- Effect: When set to true, it enhances the comprehensiveness of the scraped data but may increase processing time.
- Default Value: false
How it Works
The webScraper operates by:
- Input Handling: Accepting input data, either from a specified column with URLs or from manually entered URLs.
- URL Processing: For the input URLs, the scraper prepares for fetching content by constructing appropriate request headers, including authentication if necessary.
- Page Fetching: It sends HTTP requests to each specified URL while respecting the maxUrls and timeout settings.
- Data Extraction: Upon receiving the page content, it attempts to extract structured data using:
- Schema.org microdata
- HTML table elements
- Repeated card/item containers
- Custom extraction rules if provided
- Pagination: If enabled, it analyzes the page for pagination links and continues to scrape subsequent pages until the maxPages limit is reached.
- Output Generation: Finally, it consolidates the extracted data and outputs it in a structured format.
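The steps above can be sketched as a simple loop. This is illustrative Python only; fetching and extraction are injected as callables, and the real component's internals are not public:

```python
def scrape(urls, fetch, extract, find_next=None,
           max_urls=20, follow_pagination=False, max_pages=5):
    """Sketch of the documented flow: cap the URL list, fetch each page,
    extract rows, and optionally follow pagination up to max_pages.

    fetch(url) -> page content (bounded by timeoutMs in the real component)
    extract(page) -> list of structured rows (microdata / tables / rules)
    find_next(page) -> next-page URL or None
    """
    rows = []
    for url in urls[:max_urls]:          # respect maxUrls
        page_url, pages = url, 0
        while page_url and pages < max_pages:   # respect maxPages
            page = fetch(page_url)
            rows.extend(extract(page))
            pages += 1
            page_url = find_next(page) if (follow_pagination and find_next) else None
    return rows                          # consolidated structured output
```

With followPagination disabled, each URL yields exactly one fetched page; enabling it walks next-page links until maxPages is reached or no link is found.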
Expected Data
The webScraper expects input data formatted as an array of objects, where each object includes at least the URL to be scraped as defined by the urlColumn setting. The component handles different extraction methods based on the structure of the target web page.
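A minimal sketch of that input shape, assuming a urlColumn of "productUrl" (the column name and the "sku" field are illustrative, not requirements):

```python
# Input data: an array of objects, each carrying a URL under the
# column named by the urlColumn setting.
input_rows = [
    {"productUrl": "https://example.com/item/1", "sku": "A-100"},
    {"productUrl": "https://example.com/item/2", "sku": "A-101"},
]

url_column = "productUrl"                          # matches the urlColumn setting
urls = [row[url_column] for row in input_rows]     # URLs the scraper will fetch
```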
Use Cases & Examples
Use Cases
- Market Research: Scraping competitor product listings from e-commerce websites to analyze pricing and features, generating insights for business strategy.
- Data Aggregation: Aggregating real-time data from stock market websites to collect and analyze stock price movements across multiple sources.
- Content Monitoring: Automatically fetching and analyzing blog posts or news articles from multiple sources to track industry trends or perform sentiment analysis.
Example Configuration
Use Case: E-commerce Price Monitoring
Goal: Monitor competitor pricing on a given product across multiple e-commerce platforms.
Configuration Sample:
{
  "urlColumn": "productUrl",
  "manualUrls": "",
  "maxUrls": 10,
  "timeoutMs": 15000,
  "tableIndex": 0,
  "followPagination": true,
  "maxPages": 3,
  "rules": [],
  "extractAll": true
}
In this configuration:
- urlColumn is set to "productUrl" to align with the source data structure.
- maxUrls is limited to 10 to manage load.
- timeoutMs is increased to 15000 to accommodate slower networks.
- followPagination is enabled to scrape multiple pages of products.
- maxPages is set to 3 to limit the scope of the scraping.
- extractAll is set to true to ensure comprehensive data capture for each product.
With this setup, the webScraper will efficiently collect relevant pricing data from the specified e-commerce sites over multiple pages, allowing the business to maintain competitive intelligence.
AI Integrations & Billing Impact
The webScraper can be integrated with AI functionalities to enhance data processing, such as sentiment analysis or classification based on scraped data. If applied in high-volume scenarios, usage may impact billing due to the pricing model based on the number of requests, extracted data volume, and any additional AI processing invoked.
It's essential for users to monitor their utilization and optimize the settings as necessary to control costs while obtaining the required data.