Vantage Pipeline Documentation
Overview
The "pipeline" function is a core component of the Vantage analytics and data platform. Its primary purpose is to enable the sequential execution of a series of operations (or nodes) on a dataset. Each node processes the output from the previous node, transforming the data until the final dataset is produced. This design allows for flexible data transformations and supports a variety of data processing tasks.
Purpose
The pipeline facilitates data manipulation by allowing users to define multiple processing steps that are executed in a specified order. Each step can modify, filter, aggregate, or enrich the dataset, tailoring the data output to specific analytical needs. It provides a structured yet versatile method for handling complex data workflows.
How It Works
The pipeline works by receiving an array of node objects and an initial dataset. Each node in the array is expected to implement an execute(data) method, which receives the current dataset and returns a processed version of it. The final output after running all nodes in sequence is the transformed dataset.
Execution Flow
- Initialization: The function initializes with the provided nodes and data.
- Sequential Processing: Using the reduce method, each node's execute function is called in sequence, passing the transformed data from one node to the next.
- Final Output: Once all nodes have processed the data, the final dataset is returned.
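The execution flow above can be sketched as follows. The name runPipeline and its exact signature are assumptions for illustration (the same name appears in the worked example later in this document):

```javascript
// Minimal sketch of the pipeline runner: each node's execute method is
// called in order via reduce, threading the dataset through every node.
const runPipeline = (nodes, data) =>
  nodes.reduce((current, node) => node.execute(current), data);

// Example: two simple nodes run in sequence on an array of numbers.
const double = { execute: (data) => data.map((x) => x * 2) };
const positiveOnly = { execute: (data) => data.filter((x) => x > 0) };

console.log(runPipeline([double, positiveOnly], [-2, 1, 3])); // → [2, 6]
```

Because reduce seeds the accumulator with the initial dataset, an empty nodes array simply returns the input unchanged.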
Settings
The pipeline has no traditional settings, since it is a function that simply processes its input. Its configuration is instead determined entirely by the array of nodes provided: each node must implement the execute(data) interface described above.
Node Settings
- nodes
  - Input Type: Array
  - Description: This array contains objects, each of which represents a data transformation step. Each object must have an execute(data) method. The ordering of nodes in the array defines the sequence in which they will process the data.
  - Default Value: [] (an empty array, meaning no processing will occur without defined nodes).
- data
  - Input Type: Array of Objects
  - Description: The initial dataset that is provided to the pipeline. It must be an array of objects, where each object represents a single record of data to be processed through the defined nodes.
  - Default Value: [] (an empty dataset, which will produce no output unless populated).
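Given those defaults, an empty nodes array returns the input data unchanged, while an empty dataset produces an empty result no matter which nodes are configured. A quick sketch (runPipeline is an illustrative runner name matching the worked example later in this document):

```javascript
// Illustrative pipeline runner (assumed signature).
const runPipeline = (nodes, data) =>
  nodes.reduce((current, node) => node.execute(current), data);

// A simple node that uppercases a (hypothetical) name field.
const upper = {
  execute: (data) => data.map((item) => ({ ...item, name: item.name.toUpperCase() }))
};

console.log(runPipeline([], [{ name: 'ada' }])); // no nodes → input passes through
console.log(runPipeline([upper], []));           // empty dataset → []
```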
Sample Node Implementation
While the pipeline itself does not create nodes, a typical node object might look like this:
const sampleNode = {
  execute: (data) => {
    // Transformation logic goes here; the node must return the processed dataset.
    const transformedData = data.map(item => ({ ...item }));
    return transformedData;
  }
};
Use Cases & Examples
Use Cases
- Data Enrichment: Enhance existing datasets by integrating additional information through a series of transformation nodes (e.g., joining with external APIs or databases).
- Data Cleaning: Sequentially removing anomalies or incorrect data points through a set of filtering and validation nodes, ensuring high-quality data is used for reporting and analysis.
- Aggregation: Summarizing data points across various dimensions (e.g., calculating averages, sums, or counts) to prepare datasets for visualization or further analysis.
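To make the aggregation use case concrete, here is a sketch of a summary node; the record field (amount) and the output shape are hypothetical:

```javascript
// Hypothetical aggregation node: collapses a list of order records into a
// single summary record with a count and an average amount.
const averageAmountNode = {
  execute: (data) => {
    if (data.length === 0) return []; // avoid division by zero on empty input
    const total = data.reduce((sum, item) => sum + item.amount, 0);
    return [{ count: data.length, averageAmount: total / data.length }];
  }
};

const orders = [{ amount: 10 }, { amount: 20 }, { amount: 30 }];
console.log(averageAmountNode.execute(orders)); // → [{ count: 3, averageAmount: 20 }]
```

Returning the summary wrapped in an array keeps the node's output compatible with the array-in, array-out contract, so further nodes could still run after it.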
Example Configuration: Data Cleaning Pipeline
Use Case: Data cleaning for customer data prior to a marketing campaign.
Desired Transformations:
- Step 1: Remove duplicates based on customer email.
- Step 2: Filter out customers with invalid email formats.
- Step 3: Normalize customer names to a standard format.
Pipeline Configuration
const dataCleaningPipeline = [
  {
    // Step 1: Remove duplicates based on customer email (keeps the first occurrence).
    execute: (data) => {
      const seen = new Set();
      return data.filter(item => {
        const duplicate = seen.has(item.email);
        seen.add(item.email);
        return !duplicate;
      });
    }
  },
  {
    // Step 2: Filter out customers with invalid email formats.
    execute: (data) => {
      const emailPattern = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
      return data.filter(item => emailPattern.test(item.email));
    }
  },
  {
    // Step 3: Normalize customer names to a standard format.
    execute: (data) => {
      return data.map(item => ({
        ...item,
        name: item.name.trim().toLowerCase() // Normalize name
      }));
    }
  }
];
// Sample execution
const initialCustomerData = [
  { email: 'abc@example.com', name: ' John Doe ' },
  { email: 'abc@example.com', name: 'John Doe' },
  { email: 'invalid-email', name: 'Jane Smith' },
  { email: 'xyz@example.com', name: 'Alice Jones' }
];

const cleanedData = runPipeline(dataCleaningPipeline, initialCustomerData);
Result of Pipeline Execution
The cleanedData would contain:
[
  { email: 'abc@example.com', name: 'john doe' },
  { email: 'xyz@example.com', name: 'alice jones' }
]
This configuration illustrates how the pipeline efficiently cleans and normalizes the customer dataset, improving its quality for downstream marketing analytics.