URL Reader Documentation
Overview
The urlReader is a component of the Vantage analytics platform that fetches content from specified URLs. For each page it extracts the title, meta description, plain text content, links, status code, and content type, so users can gather structured information from web pages efficiently.
Purpose
The primary function of the urlReader is to automate the retrieval and parsing of web page content for data analytics and reporting. It is particularly useful for tasks such as web scraping, content analysis, and search engine optimization (SEO).
Settings
1. urlColumn
- Input Type: String
- Description: Specifies the name of the column containing the URLs to be processed. This allows flexibility in naming conventions and enables the user to designate the correct input for URL reading.
- Default Value: "url"
2. manualUrls
- Input Type: String (newline-separated)
- Description: A list of URLs that can be used when there is no upstream data available. Each URL must be separated by a newline. This is particularly helpful in scenarios where URLs need to be specified manually.
- Default Value: "" (empty)
3. maxUrls
- Input Type: Numeric
- Description: Defines the maximum number of URLs to process in a single run. This cap prevents the component from fetching too many URLs at once, which could cause performance issues or trigger rate limiting.
- Default Value: 20
4. timeoutMs
- Input Type: Numeric
- Description: Specifies the maximum time, in milliseconds, to wait for a response from the URL before canceling the request. A longer timeout allows for slower responses, while a shorter timeout can help speed up processing if many URLs are being fetched.
- Default Value: 10000 (10 seconds)
5. extractLinks
- Input Type: Boolean
- Description: Indicates whether the component should extract links from the fetched HTML content. If true, the component captures up to 200 links; if false, no links will be extracted. This affects the amount of data output and can be crucial depending on the focus of the analysis.
- Default Value: true
6. extractMeta
- Input Type: Boolean
- Description: Specifies whether to extract the meta description of the web page. If true, the description will be included in the output; if false, an empty string will be returned for that field. This is beneficial for SEO and content summarization.
- Default Value: true
7. userAgent
- Input Type: String
- Description: Sets the User-Agent header for the HTTP request. This can help identify the source of the request and avoid being blocked by web servers. Different user agents might yield different responses from web servers based on their configuration.
- Default Value: "VantageBot/1.0"
How It Works
When executed, the urlReader follows these steps:
- Input Handling: It takes inputs defined by the users via a column indicated by urlColumn or manual URLs supplied directly via manualUrls.
- Data Preparation: If no input data is available and manualUrls is defined, the component uses that data. If neither is available, it outputs an empty dataset.
- URL Fetching: It processes the URLs (up to maxUrls) by fetching each page and applying a timeout defined by timeoutMs.
- Data Extraction: It extracts the page title, meta description, plain text content, links, and status code using internal helper functions. The extracted data is then structured into rows of output data.
- Error Handling: If a request fails due to a timeout or another error, it captures the error message and continues processing remaining URLs.
- Output Structure: The component returns an array of objects containing the extracted data for each URL.
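The steps above can be sketched as a short Python loop. This is an illustration under stated assumptions rather than the component's real code: select_urls, read_urls, and fake_fetch are hypothetical names, the fetcher is injected so the example runs without network access, and including a url field in each output row is an assumption.

```python
from typing import Callable, Dict, List


def select_urls(rows: List[Dict], url_column: str,
                manual_urls: str, max_urls: int) -> List[str]:
    """Use upstream rows if present, else manualUrls; cap the result at maxUrls."""
    if rows:
        urls = [str(row[url_column]) for row in rows if row.get(url_column)]
    else:
        urls = [u.strip() for u in manual_urls.splitlines() if u.strip()]
    return urls[:max_urls]  # an empty list means an empty output dataset


def read_urls(urls: List[str], fetch_page: Callable[[str], Dict]) -> List[Dict]:
    """Fetch each URL in turn; on failure, record the error and keep going."""
    output = []
    for url in urls:
        row = {"url": url, "page_error": ""}  # "url" key is an assumption
        try:
            row.update(fetch_page(url))
        except Exception as exc:  # timeouts and other request errors
            row["page_error"] = str(exc)
        output.append(row)
    return output


def fake_fetch(url: str) -> Dict:
    """Stand-in fetcher: simulates one timeout and one successful response."""
    if "slow" in url:
        raise TimeoutError("request timed out")
    return {"page_title": "Example", "page_status": 200}


urls = select_urls([], "url", "https://example.com/ok\nhttps://example.com/slow", 20)
rows = read_urls(urls, fake_fetch)
for row in rows:
    print(row)
```

Catching the exception inside the loop is what lets one failed request produce a row with page_error set while the remaining URLs are still processed.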
Expected Data
The urlReader expects data in a structured format, which could be:
- An array of objects where each object contains at least the URL mapped to the specified urlColumn.
- If using manual URLs, a newline-separated string of URLs.
The output will include:
- page_title: Title of the fetched page
- page_description: Meta description of the fetched page
- page_text: Plain text from the fetched page (up to 50,000 characters)
- page_links: JSON array of extracted links (up to 200)
- page_status: HTTP status code of the response
- page_content_type: Type of content returned (e.g., text/html)
- page_error: Error message if applicable
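To make the output fields concrete, here is a minimal Python sketch that builds one output row from an HTML string using the standard library's html.parser. The PageParser class and build_row function are hypothetical names, not the component's internal helpers. Skipping script and style content, collapsing whitespace, and returning "[]" when link extraction is disabled are assumptions made for this example.

```python
import json
import re
from html.parser import HTMLParser
from typing import Dict, List

MAX_TEXT_CHARS = 50_000  # documented cap on page_text
MAX_LINKS = 200          # documented cap on page_links


class PageParser(HTMLParser):
    """Collects the title, meta description, link targets, and visible text."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self.links: List[str] = []
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = attr.get("content") or ""
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.text_parts.append(data)


def build_row(html: str, status: int, content_type: str,
              extract_links: bool = True, extract_meta: bool = True) -> Dict:
    """Assemble one output row with the documented field names and caps."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    text = re.sub(r"\s+", " ", " ".join(parser.text_parts)).strip()
    links = parser.links[:MAX_LINKS] if extract_links else []
    return {
        "page_title": parser.title.strip(),
        "page_description": parser.description if extract_meta else "",
        "page_text": text[:MAX_TEXT_CHARS],
        "page_links": json.dumps(links),
        "page_status": status,
        "page_content_type": content_type,
        "page_error": "",
    }


html = (
    '<html><head><title>Demo Page</title>'
    '<meta name="description" content="A short demo."></head>'
    '<body><h1>Hello</h1><p>See <a href="/about">about</a>.</p>'
    '<script>var x = 1;</script></body></html>'
)
row = build_row(html, 200, "text/html")
print(row["page_title"], row["page_description"], row["page_links"])
```

Because page_links is stored as a JSON string, a consumer would call json.loads on it to recover the list of links.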
Use Cases & Examples
Use Case 1: SEO Analysis
A digital marketing agency needs to evaluate the SEO performance of multiple competitors' web pages. Using the urlReader, they can extract essential elements such as page titles and meta descriptions for analysis.
Use Case 2: Content Aggregation
A content aggregator platform wants to fetch and summarize news articles from various news sites. The urlReader can fetch each article and return its plain text content (page_text), which downstream steps can then use to generate summaries.
Example Configuration for SEO Analysis Use Case
To configure the urlReader for SEO analysis, the following settings might be used:
{
"urlColumn": "page_url",
"manualUrls": "https://example.com/article1\nhttps://example.com/article2",
"maxUrls": 20,
"timeoutMs": 8000,
"extractLinks": false,
"extractMeta": true,
"userAgent": "SEOAnalyzer/1.0"
}
In this example:
- The urlColumn is set to page_url, the column that contains the URLs for the analysis.
- Two manual URLs are provided as a fallback for runs where no upstream data is available.
- A maximum of 20 URLs will be processed with an 8-second timeout to ensure responsiveness.
- Links will not be extracted to focus purely on SEO elements, and a custom user agent is utilized for the requests.
This configuration will efficiently gather SEO data from the specified URLs, allowing the agency to analyze page elements quickly.
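As a rough illustration of how the userAgent and timeoutMs values from this configuration could translate into an HTTP request, the Python sketch below uses the standard library's urllib. The build_request and fetch functions are hypothetical names and do not describe how the urlReader issues requests internally. The fetch function is defined but not called here, so the example runs without network access.

```python
import json
import urllib.request

# Raw string so that \n reaches the JSON parser as an escape sequence.
config = json.loads(r"""
{
  "urlColumn": "page_url",
  "manualUrls": "https://example.com/article1\nhttps://example.com/article2",
  "maxUrls": 20,
  "timeoutMs": 8000,
  "extractLinks": false,
  "extractMeta": true,
  "userAgent": "SEOAnalyzer/1.0"
}
""")


def build_request(url: str, config: dict) -> urllib.request.Request:
    """Attach the configured User-Agent header to a GET request."""
    return urllib.request.Request(url, headers={"User-Agent": config["userAgent"]})


def fetch(url: str, config: dict) -> dict:
    """Fetch one URL; urlopen takes its timeout in seconds, not milliseconds."""
    timeout_s = config["timeoutMs"] / 1000
    with urllib.request.urlopen(build_request(url, config), timeout=timeout_s) as resp:
        return {
            "page_status": resp.status,
            "page_content_type": resp.headers.get("Content-Type", ""),
            "html": resp.read().decode("utf-8", errors="replace"),
        }


urls = config["manualUrls"].splitlines()[: config["maxUrls"]]
request = build_request(urls[0], config)
print(request.full_url, request.get_header("User-agent"), config["timeoutMs"] / 1000)
```

Note that urlopen raises an exception for HTTP error statuses such as 404, so a caller that wants to record page_status for failed pages would need to catch urllib.error.HTTPError and read its code attribute.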