urlParser Documentation
Overview
The urlParser is a specialized component within Vantage that parses URLs from a user-defined data column. It decomposes URL strings into various components, including protocol, hostname, pathname, search, hash, port, origin, query parameters, and a validity flag. This functionality is essential for data analysis tasks that require the extraction and manipulation of URL components.
Purpose
The primary purpose of the urlParser is to transform URL data into a structured format that can be easily analyzed and utilized in further data-processing tasks. It validates the URLs and extracts critical parts for analytical purposes, ensuring that users can work with clean, well-structured data.
Settings
The urlParser exposes two settings that control where its URLs come from. Each setting is described below.
1. urlColumn
- Input Type: String
- Description: The name of the column that contains the URL strings to be parsed.
- Default Value: "url"
- Impact: Changing this value points the parser at a different column in your dataset. If the specified column does not exist, the parser cannot extract any URLs.
2. manualUrls
- Input Type: String
- Description: This field allows the user to input URLs manually, separated by newline characters. This is especially useful when there is no upstream data available.
- Default Value: "" (empty string)
- Impact: If the input data is not available, this setting provides a fallback mechanism by using the manually specified URLs. If manualUrls is empty and no upstream data is provided, no output is generated.
How It Works
- Input Handling: The function first checks if the input data is provided in the correct format. If the data is nested, it attempts to extract the actual array of URLs.
- Manual URL Fallback: If no valid URL data is available, it checks the manualUrls field. If URLs are specified here, they are split by lines, trimmed of whitespace, and formatted into objects containing the URL under the specified urlColumn.
- URL Parsing: The function iterates over the array of URLs:
  - It checks if each URL is valid and whether it has the correct format (i.e., is a string).
  - It uses the URL constructor to decompose the URL into its components (protocol, hostname, pathname, etc.).
  - It adds a url_valid flag indicating whether the provided URL is valid.
  - If parsing fails, it sets default empty values for the components.
- Output: The function returns an array of objects, each enriched with parsed URL components or an indicator of validity.
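The steps above can be sketched roughly as follows. This is a minimal illustration built on the standard URL constructor, not the actual Vantage implementation. The function names (manualRows, parseRow, urlParser), the fixed url_ prefix, the skipping of blank manual lines, and the "{}" fallback for url_params are all assumptions made for the example. Unwrapping of nested input is left out because the source does not describe its shape.

```typescript
// Illustrative sketch only: names and fallback details are assumptions.
type Row = Record<string, unknown>;

interface UrlParserSettings {
  urlColumn: string;  // documented default: "url"
  manualUrls: string; // documented default: "", newline-separated
}

// Empty defaults applied when a value cannot be parsed.
// Assumption: url_params falls back to an empty JSON object string.
const EMPTY_COMPONENTS = {
  url_protocol: "",
  url_hostname: "",
  url_pathname: "",
  url_search: "",
  url_hash: "",
  url_port: "",
  url_origin: "",
  url_params: "{}",
};

// Manual URL fallback: split by lines, trim, and store each URL under urlColumn.
// Assumption: blank lines are skipped.
function manualRows(settings: UrlParserSettings): Row[] {
  return settings.manualUrls
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const row: Row = {};
      row[settings.urlColumn] = line;
      return row;
    });
}

// Parse one row; the original fields are kept and url_* fields are added.
function parseRow(row: Row, urlColumn: string): Row {
  const raw = row[urlColumn];
  if (typeof raw !== "string") {
    return { ...row, url_valid: false, ...EMPTY_COMPONENTS };
  }
  try {
    const parsed = new URL(raw); // throws on malformed input
    const params: Record<string, string> = {};
    parsed.searchParams.forEach((value, key) => {
      params[key] = value; // a repeated key keeps its last value in this sketch
    });
    return {
      ...row,
      url_valid: true,
      // The URL API reports "https:"; the colon is dropped to match the sample output.
      url_protocol: parsed.protocol.replace(/:$/, ""),
      url_hostname: parsed.hostname,
      url_pathname: parsed.pathname,
      url_search: parsed.search,
      url_hash: parsed.hash,
      url_port: parsed.port,
      url_origin: parsed.origin,
      url_params: JSON.stringify(params),
    };
  } catch (err) {
    return { ...row, url_valid: false, ...EMPTY_COMPONENTS };
  }
}

// Use upstream rows when present, otherwise fall back to manualUrls.
function urlParser(input: Row[], settings: UrlParserSettings): Row[] {
  const rows = input.length > 0 ? input : manualRows(settings);
  return rows.map((row) => parseRow(row, settings.urlColumn));
}
```

Wrapping the constructor call in try/catch is what lets a malformed value produce a row with url_valid set to false instead of aborting the whole batch.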
Data Expectations
Input Data
- Type: Array of Objects
- Structure: Each object must contain a property corresponding to the urlColumn (default: "url"), which holds the URL string to be parsed. The expected format is:

```json
[
  { "url": "https://example.com/path?query=1" }
]
```
Output Data
- Type: Array of Objects
- Structure: Each output object has additional parsed properties:

```json
[
  {
    "url": "https://example.com/path?query=1",
    "url_valid": true,
    "url_protocol": "https",
    "url_hostname": "example.com",
    "url_pathname": "/path",
    "url_search": "?query=1",
    "url_hash": "",
    "url_port": "",
    "url_origin": "https://example.com",
    "url_params": "{\"query\":\"1\"}"
  }
]
```
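Note that url_params in the sample output is a JSON-encoded string rather than a nested object, so downstream code most likely needs to decode it before reading individual parameters. The helper below is a hypothetical sketch of that step; readParams is not part of urlParser, and returning an empty object for missing or malformed values is an assumption.

```typescript
// Hypothetical helper: decode the url_params string of one output record.
function readParams(record: Record<string, unknown>): Record<string, string> {
  const raw = record["url_params"];
  if (typeof raw !== "string" || raw === "") {
    return {};
  }
  try {
    const decoded: unknown = JSON.parse(raw);
    // Only plain objects are accepted; anything else is treated as "no parameters".
    if (decoded !== null && typeof decoded === "object" && !Array.isArray(decoded)) {
      return decoded as Record<string, string>;
    }
    return {};
  } catch (err) {
    return {};
  }
}

const sampleRecord = {
  url: "https://example.com/path?query=1",
  url_params: "{\"query\":\"1\"}",
};
console.log(readParams(sampleRecord)["query"]);
```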
Use Cases & Examples
Use Case 1: Web Analytics
A marketing team needs to analyze traffic coming from various URLs to gauge the effectiveness of different campaigns. By using urlParser, they can extract campaign-related parameters from URLs stored in their database.
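As a rough sketch of this use case, the function below tallies parsed records by the value of one query parameter. The function name countByParam and the utm_campaign parameter in the usage lines are assumptions chosen for illustration; the source does not say which parameters the team tracks.

```typescript
type ParsedRecord = Record<string, unknown>;

// Count valid records per value of a single query parameter.
function countByParam(records: ParsedRecord[], paramName: string): Record<string, number> {
  const counts: Record<string, number> = {};
  records.forEach((record) => {
    const rawParams = record["url_params"];
    if (record["url_valid"] !== true || typeof rawParams !== "string") {
      return; // skip invalid rows
    }
    let params: unknown;
    try {
      params = JSON.parse(rawParams);
    } catch (err) {
      return; // skip rows whose url_params cannot be decoded
    }
    if (params === null || typeof params !== "object") {
      return;
    }
    const value = (params as Record<string, unknown>)[paramName];
    if (typeof value !== "string") {
      return; // parameter absent on this URL
    }
    counts[value] = Object.prototype.hasOwnProperty.call(counts, value)
      ? counts[value] + 1
      : 1;
  });
  return counts;
}

// Hypothetical records shaped like the urlParser output.
const campaignCounts = countByParam(
  [
    { url_valid: true, url_params: "{\"utm_campaign\":\"spring\"}" },
    { url_valid: true, url_params: "{\"utm_campaign\":\"spring\"}" },
    { url_valid: true, url_params: "{\"utm_campaign\":\"summer\"}" },
    { url_valid: false, url_params: "{}" },
  ],
  "utm_campaign"
);
console.log(campaignCounts);
```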
Use Case 2: Data Validation
A data engineer is integrating multiple data sources that contain URLs. They use urlParser to verify URL validity and to generate metrics on which sources contain errors or malformed URLs.
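One way such metrics might be computed from the parser output is sketched below. It assumes each record carries a field naming its origin; the field name "source", the "unknown" bucket, and the function validityBySource are hypothetical and not produced by urlParser itself.

```typescript
interface SourceStats {
  total: number;
  invalid: number;
}

// Group parsed records by a source field and count invalid URLs per group.
function validityBySource(
  records: Array<Record<string, unknown>>,
  sourceField: string
): Record<string, SourceStats> {
  const stats: Record<string, SourceStats> = {};
  records.forEach((record) => {
    const rawSource = record[sourceField];
    // Assumption: rows without a usable source label go into an "unknown" bucket.
    const source = typeof rawSource === "string" && rawSource !== "" ? rawSource : "unknown";
    if (!Object.prototype.hasOwnProperty.call(stats, source)) {
      stats[source] = { total: 0, invalid: 0 };
    }
    stats[source].total += 1;
    if (record["url_valid"] !== true) {
      stats[source].invalid += 1;
    }
  });
  return stats;
}

// Hypothetical records shaped like the urlParser output plus a "source" field.
const sourceReport = validityBySource(
  [
    { source: "crm", url: "https://example.com/a", url_valid: true },
    { source: "crm", url: "not a url", url_valid: false },
    { source: "web", url: "https://example.com/b", url_valid: true },
  ],
  "source"
);
console.log(sourceReport);
```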
Example Configuration
Scenario
A data analyst has a dataset with raw URLs in a column named "website" and needs to parse and analyze them for further reporting.
Configuration Data
```json
{
  "urlColumn": "website",
  "manualUrls": ""
}
```

In this configuration:
- The analyst specifies that the column holding the URLs is named "website".
- As no manual URLs are needed, the manualUrls field is left empty.
Output
After running the urlParser with this configuration on a dataset, the output would consist of enriched records containing valid/invalid status and structured components for each URL in the specified column.
Such structured data allows the analyst to easily generate reports and insights based on URL patterns and their associated attributes.
AI Integrations and Billing Impacts
AI Integrations
Currently, urlParser does not have direct AI integrations within its functionality. However, its structured output can be utilized in AI models for predictive analytics, enhancing user insights from parsed URLs.
Billing Impacts
Using the urlParser does not directly incur additional billing costs unless integrated within workflows that exceed plan limits for data processing or outputs. Users should monitor usage based on their Vantage subscription levels, especially if processing large datasets.