webScraper Documentation
Overview
The webScraper is a powerful component integrated into the Vantage analytics and data platform, designed for fetching and extracting structured data from web pages. It employs multiple extraction strategies, including Schema.org microdata, HTML tables, repeated card/item containers, optional custom extraction rules, and page-level text fallbacks. The component can also traverse pagination links to scrape data across multiple pages.
Purpose
The primary purpose of the webScraper is to enable automated extraction of relevant data from web pages in a structured format, facilitating analysis, reporting, and data integration tasks. This functionality is beneficial for a wide range of applications, including data aggregation, market research, competitive analysis, and more.
Settings
The webScraper has several configurable options that influence its operation:
1. urlColumn
- Input Type: String
- Purpose: Defines the name of the column that contains URLs in the input data source.
- Effect: By adjusting this value, you specify the column from which URLs will be extracted for scraping.
- Default Value:
"url"
2. manualUrls
- Input Type: String
- Purpose: Allows users to input a list of URLs manually, separated by new lines.
- Effect: If this setting is populated, it overrides the input data for URL extraction.
- Default Value:
"" (empty string)
3. maxUrls
- Input Type: Numeric
- Purpose: Sets a maximum limit on the number of URLs to scrape from the input data.
- Effect: This limits the total number of requests made by the scraper to avoid overwhelming the target server and to manage processing time.
- Default Value:
20
4. timeoutMs
- Input Type: Numeric
- Purpose: Defines how long (in milliseconds) the scraper will wait for a server response before aborting the request.
- Effect: Adjusting this value can help in situations where the target server is slow or unresponsive, affecting the efficiency of data scraping.
- Default Value:
10000 (10 seconds)
5. tableIndex
- Input Type: Numeric
- Purpose: Specifies which table to extract data from, with zero-based indexing.
- Effect: A value of 0 targets the first table; -1 extracts data from all tables found on the page.
- Default Value: 0
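The tableIndex semantics can be expressed as a short selection helper. This is an illustrative Python sketch of the documented behavior, not the component's actual implementation:

```python
def select_tables(tables, table_index):
    """Pick tables per the tableIndex setting.

    tables: parsed <table> elements in document order.
    table_index: 0-based index of one table, or -1 for all tables.
    """
    if table_index == -1:
        return tables  # -1: extract from every table on the page
    # Slice keeps the result a list and yields [] if the index is out of range.
    return tables[table_index:table_index + 1]
```

For example, `select_tables(parsed, 0)` returns only the first table, matching the default configuration.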
6. followPagination
- Input Type: Boolean
- Purpose: Determines whether the scraper should follow pagination links to scrape data from multiple pages.
- Effect: If set to true, the scraper will attempt to navigate through pagination links found on each scraped page, resulting in a broader data set.
- Default Value: false
7. maxPages
- Input Type: Numeric
- Purpose: Defines the maximum number of pages to scrape when pagination is followed.
- Effect: This setting can be used to control the extent of the data retrieval process, balancing thoroughness and resource consumption.
- Default Value:
5
8. rules
- Input Type: Array
- Purpose: Accepts an array of custom extraction rules (CSS selectors, regex) for tailored data extraction.
- Effect: Users can override the default extraction strategies by defining specific rules to capture non-standard data formats.
- Default Value:
[] (empty array)
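The exact rule schema is not documented here; as a purely hypothetical illustration, a rule might pair an output field name with a CSS selector and an optional regex (all key names below are assumptions, not the component's actual format):

```json
[
  { "field": "price", "selector": ".product-price", "regex": "[0-9.]+" },
  { "field": "title", "selector": "h1.product-title" }
]
```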
9. extractAll
- Input Type: Boolean
- Purpose: Indicates whether the scraper should attempt to extract all relevant data from identified elements or limit it.
- Effect: When set to true, it enhances the comprehensiveness of the scraped data but may increase processing time.
- Default Value: false
How it Works
The webScraper operates by:
- Input Handling: Accepting input data, either from a specified column with URLs or from manually entered URLs.
- URL Processing: For the input URLs, the scraper prepares for fetching content by constructing appropriate request headers, including authentication if necessary.
- Page Fetching: It sends HTTP requests to each specified URL while respecting the maxUrls and timeout settings.
- Data Extraction: Upon receiving the page content, it attempts to extract structured data using:
- Schema.org microdata
- HTML table elements
- Repeated card/item containers
- Custom extraction rules if provided
- Pagination: If enabled, it analyzes the page for pagination links and continues to scrape subsequent pages until the maxPages limit is reached.
- Output Generation: Finally, it consolidates the extracted data and outputs it in a structured format.
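The steps above can be sketched as a simple loop. This is illustrative Python only; fetching and extraction are injected as callables, and the real component's internals are not public:

```python
def scrape(urls, fetch, extract, find_next=None,
           max_urls=20, follow_pagination=False, max_pages=5):
    """Sketch of the documented flow: cap the URL list, fetch each page,
    extract rows, and optionally follow pagination up to max_pages.

    fetch(url) -> page content (bounded by timeoutMs in the real component)
    extract(page) -> list of structured rows (microdata / tables / rules)
    find_next(page) -> next-page URL or None
    """
    rows = []
    for url in urls[:max_urls]:          # respect maxUrls
        page_url, pages = url, 0
        while page_url and pages < max_pages:   # respect maxPages
            page = fetch(page_url)
            rows.extend(extract(page))
            pages += 1
            page_url = find_next(page) if (follow_pagination and find_next) else None
    return rows                          # consolidated structured output
```

With followPagination disabled, each URL yields exactly one fetched page; enabling it walks next-page links until maxPages is reached or no link is found.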
Expected Data
The webScraper expects input data formatted as an array of objects, where each object includes at least the URL to be scraped as defined by the urlColumn setting. The component handles different extraction methods based on the structure of the target web page.
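A minimal sketch of that input shape, assuming a urlColumn of "productUrl" (the column name and the "sku" field are illustrative, not requirements):

```python
# Input data: an array of objects, each carrying a URL under the
# column named by the urlColumn setting.
input_rows = [
    {"productUrl": "https://example.com/item/1", "sku": "A-100"},
    {"productUrl": "https://example.com/item/2", "sku": "A-101"},
]

url_column = "productUrl"                          # matches the urlColumn setting
urls = [row[url_column] for row in input_rows]     # URLs the scraper will fetch
```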
Use Cases & Examples
Use Cases
- Market Research: Scraping competitor product listings from e-commerce websites to analyze pricing and features, generating insights for business strategy.
- Data Aggregation: Aggregating real-time data from stock market websites to collect and analyze stock price movements across multiple sources.
- Content Monitoring: Automatically fetching and analyzing blog posts or news articles from multiple sources to track industry trends or perform sentiment analysis.
Example Configuration
Use Case: E-commerce Price Monitoring
Goal: Monitor competitor pricing on a given product across multiple e-commerce platforms.
Configuration Sample:
{
  "urlColumn": "productUrl",
  "manualUrls": "",
  "maxUrls": 10,
  "timeoutMs": 15000,
  "tableIndex": 0,
  "followPagination": true,
  "maxPages": 3,
  "rules": [],
  "extractAll": true
}
In this configuration:
- urlColumn is set to "productUrl" to align with the source data structure.
- maxUrls is limited to 10 to manage load.
- timeoutMs is increased to 15000 to accommodate slower networks.
- followPagination is enabled to scrape multiple pages of products.
- maxPages is set to 3 to limit the scope of the scraping.
- extractAll is set to true to ensure comprehensive data capture for each product.
With this setup, the webScraper will efficiently collect relevant pricing data from the specified e-commerce sites over multiple pages, allowing the business to maintain competitive intelligence.
AI Integrations & Billing Impact
The webScraper can be integrated with AI functionalities to enhance data processing, such as sentiment analysis or classification based on scraped data. If applied in high-volume scenarios, usage may impact billing due to the pricing model based on the number of requests, extracted data volume, and any additional AI processing invoked.
It's essential for users to monitor their utilization and optimize the settings as necessary to control costs while obtaining the required data.