Vantage Pipeline Documentation
Overview
The "pipeline" function is a core component of the Vantage analytics and data platform. Its primary purpose is to enable the sequential execution of a series of operations (or nodes) on a dataset. Each node processes the output from the previous node, transforming the data until the final dataset is produced. This design allows for flexible data transformations and supports a variety of data processing tasks.
Purpose
The pipeline facilitates data manipulation by allowing users to define multiple processing steps that are executed in a specified order. Each step can modify, filter, aggregate, or enrich the dataset, tailoring the data output to specific analytical needs. It provides a structured yet versatile method for handling complex data workflows.
How It Works
The pipeline works by receiving an array of node objects and an initial dataset. Each node in the array is expected to implement an execute(data) method, which receives the current dataset and returns a processed version of it. The final output after running all nodes in sequence is the transformed dataset.
Execution Flow
- Initialization: The function initializes with the provided nodes and data.
- Sequential Processing: Using the reduce method, each node's execute function is called in sequence, passing the transformed data from one node to the next.
- Final Output: Once all nodes have processed the data, the final dataset is returned.
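The execution flow above can be sketched as follows. The name runPipeline and its exact signature are assumptions for illustration (the same name appears in the worked example later in this document):

```javascript
// Minimal sketch of the pipeline runner: each node's execute method is
// called in order via reduce, threading the dataset through every node.
const runPipeline = (nodes, data) =>
  nodes.reduce((current, node) => node.execute(current), data);

// Example: two simple nodes run in sequence on an array of numbers.
const double = { execute: (data) => data.map((x) => x * 2) };
const positiveOnly = { execute: (data) => data.filter((x) => x > 0) };

console.log(runPipeline([double, positiveOnly], [-2, 1, 3])); // → [2, 6]
```

Because reduce seeds the accumulator with the initial dataset, an empty nodes array simply returns the input unchanged.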
Settings
The pipeline has no traditional settings, since it is a function that simply processes its input. Its configuration is instead determined entirely by the array of nodes provided: each node must implement the execute(data) interface described above.
Node Settings
- nodes
  - Input Type: Array
  - Description: This array contains objects, each of which represents a data transformation step. Each object must have an execute(data) method. The ordering of nodes in the array defines the sequence in which they will process the data.
  - Default Value: [] (an empty array, meaning no processing will occur without defined nodes).
- data
  - Input Type: Array of Objects
  - Description: The initial dataset that is provided to the pipeline. It must be an array of objects, where each object represents a single record of data to be processed through the defined nodes.
  - Default Value: [] (an empty dataset, which will produce no output unless populated).
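Given those defaults, an empty nodes array returns the input data unchanged, while an empty dataset produces an empty result no matter which nodes are configured. A quick sketch (runPipeline is an illustrative runner name matching the worked example later in this document):

```javascript
// Illustrative pipeline runner (assumed signature).
const runPipeline = (nodes, data) =>
  nodes.reduce((current, node) => node.execute(current), data);

// A simple node that uppercases a (hypothetical) name field.
const upper = {
  execute: (data) => data.map((item) => ({ ...item, name: item.name.toUpperCase() }))
};

console.log(runPipeline([], [{ name: 'ada' }])); // no nodes → input passes through
console.log(runPipeline([upper], []));           // empty dataset → []
```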
Sample Node Implementation
While the pipeline itself does not create nodes, a typical node object might look like this:
const sampleNode = {
  execute: (data) => {
    // Transformation logic goes here; the node must return the processed dataset.
    const transformedData = data.map(item => ({ ...item }));
    return transformedData;
  }
};
Use Cases & Examples
Use Cases
- Data Enrichment: Enhance existing datasets by integrating additional information through a series of transformation nodes (e.g., joining with external APIs or databases).
- Data Cleaning: Sequentially removing anomalies or incorrect data points through a set of filtering and validation nodes, ensuring high-quality data is used for reporting and analysis.
- Aggregation: Summarizing data points across various dimensions (e.g., calculating averages, sums, or counts) to prepare datasets for visualization or further analysis.
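To make the aggregation use case concrete, here is a sketch of a summary node; the record field (amount) and the output shape are hypothetical:

```javascript
// Hypothetical aggregation node: collapses a list of order records into a
// single summary record with a count and an average amount.
const averageAmountNode = {
  execute: (data) => {
    if (data.length === 0) return []; // avoid division by zero on empty input
    const total = data.reduce((sum, item) => sum + item.amount, 0);
    return [{ count: data.length, averageAmount: total / data.length }];
  }
};

const orders = [{ amount: 10 }, { amount: 20 }, { amount: 30 }];
console.log(averageAmountNode.execute(orders)); // → [{ count: 3, averageAmount: 20 }]
```

Returning the summary wrapped in an array keeps the node's output compatible with the array-in, array-out contract, so further nodes could still run after it.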
Example Configuration: Data Cleaning Pipeline
Use Case: Data cleaning for customer data prior to a marketing campaign.
Desired Transformations:
- Step 1: Remove duplicates based on customer email.
- Step 2: Filter out customers with invalid email formats.
- Step 3: Normalize customer names to a standard format.
Pipeline Configuration
const dataCleaningPipeline = [
  {
    // Step 1: Remove duplicates based on customer email (keeps the first occurrence).
    execute: (data) => {
      const seen = new Set();
      return data.filter(item => {
        const duplicate = seen.has(item.email);
        seen.add(item.email);
        return !duplicate;
      });
    }
  },
  {
    // Step 2: Filter out customers with invalid email formats.
    execute: (data) => {
      const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
      return data.filter(item => emailPattern.test(item.email));
    }
  },
  {
    // Step 3: Normalize customer names to a standard format.
    execute: (data) => {
      return data.map(item => ({
        ...item,
        name: item.name.trim().toLowerCase() // Normalize name
      }));
    }
  }
];
// Sample execution
const initialCustomerData = [
  { email: 'abc@example.com', name: ' John Doe ' },
  { email: 'abc@example.com', name: 'John Doe' },
  { email: 'invalid-email', name: 'Jane Smith' },
  { email: 'xyz@example.com', name: 'Alice Jones' }
];

const cleanedData = runPipeline(dataCleaningPipeline, initialCustomerData);
Result of Pipeline Execution
The cleanedData would contain:
[
  { email: 'abc@example.com', name: 'john doe' },
  { email: 'xyz@example.com', name: 'alice jones' }
]
This configuration illustrates how the pipeline efficiently cleans and normalizes the customer dataset, improving its quality for downstream marketing analytics.