
dataValidation Documentation

Overview

The dataValidation node validates incoming data against a predefined set of rules. It checks the specified columns and produces two outputs: the original data, and an array of error or warning messages describing any validation issues found. It helps maintain data integrity and quality within the Vantage analytics platform.

Purpose

The main purpose of this node is to validate data to ensure it meets specific criteria before any further processing occurs. This helps identify issues early in the data pipeline, reducing the risk of errors in downstream analytics.

Expected Data

The dataValidation node expects the following input data format:

- An array of objects, where each object represents a row of data. Each attribute of the object corresponds to a column name.
- The column names specified in the validation rules must exist in the input data objects.
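The column requirement can be checked before validation runs. The following is a minimal sketch, not the node's own code: `missing_columns` is a hypothetical helper, and rows are modelled as Python dictionaries standing in for the input objects.

```python
from typing import Any

Row = dict[str, Any]


def missing_columns(rows: list[Row], rules: list[dict[str, Any]]) -> list[str]:
    """Return rule columns that are absent from at least one input row."""
    missing: list[str] = []
    for rule in rules:
        column = rule["column"]
        if column not in missing and any(column not in row for row in rows):
            missing.append(column)
    return missing


# Input shape described above: an array of objects, one per row.
rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "email": "grace@example.com"},
]
rules = [
    {"column": "name", "check": "notEmpty"},
    {"column": "phone", "check": "notEmpty"},
]

print(missing_columns(rows, rules))  # ['phone']
```

How the node itself reacts to a rule that names a missing column is not stated here, so the sketch only reports the mismatch.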

How It Works

Data Unwrapping: The function begins by extracting the input data, ensuring it is in array format. If the input data is empty or invalid, it defaults to an empty array.
Rule Resolution: The node picks the source of its validation rules from the rulesSource setting: manual input (rules), a pre-fetched Google Drive file (googleDriveRules), or a context snippet (contextSnippetContent).
Validation Checks: For each row in the input data, the validation checks are applied based on the specified rules. Each rule defines a column to check, a validation type (e.g., whether the value is not empty, matches a regex, falls within a specified range, etc.), and an optional severity level (error or warning).
Output Generation: The component generates two outputs:
- output1: The original unmodified data.
- output2: An array of validation issues found, including the row number, column name, the rule that failed, an error message, and the severity of the issue.
If the stopOnFirstError setting is enabled, validation halts at the first error encountered, and the node returns the original data together with that single error.
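The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the node's implementation: the function names, the 1-based row numbering, the message wording, and the choice to let warnings pass without stopping are all assumptions, and only the "notEmpty" and "regex" checks are modelled.

```python
import re
from typing import Any


def unwrap(data: Any) -> list[dict[str, Any]]:
    """Fall back to an empty list when the input is missing or not a list."""
    return data if isinstance(data, list) else []


def passes(check: str, cell: Any, value: Any) -> bool:
    """Evaluate one check against one cell; only two check types are sketched."""
    text = "" if cell is None else str(cell)
    if check == "notEmpty":
        return text.strip() != ""
    if check == "regex":
        return re.fullmatch(str(value), text) is not None
    return True  # checks this sketch does not model are skipped


def validate(data: Any, rules: list[dict[str, Any]],
             stop_on_first_error: bool = False) -> tuple[list, list]:
    rows = unwrap(data)
    issues: list[dict[str, Any]] = []
    for row_number, row in enumerate(rows, start=1):  # 1-based: an assumption
        for rule in rules:
            if passes(rule["check"], row.get(rule["column"]), rule.get("value")):
                continue
            severity = rule.get("severity", "error")
            issues.append({
                "row": row_number,
                "column": rule["column"],
                "rule": rule["check"],
                "message": f"{rule['column']} failed {rule['check']}",
                "severity": severity,
            })
            # Assumption: only issues with severity "error" stop the run.
            if stop_on_first_error and severity == "error":
                return rows, issues[-1:]
    return rows, issues


rows = [{"id": "A1", "name": ""}, {"id": "bad id", "name": "Bo"}]
rules = [
    {"column": "name", "check": "notEmpty", "severity": "warning"},
    {"column": "id", "check": "regex", "value": r"[A-Z]\d+"},
]
output1, output2 = validate(rows, rules)
first_only = validate(rows, rules, stop_on_first_error=True)[1]
```

With these sample rows, `output2` holds a warning for row 1 and an error for row 2, while `first_only` contains just the row 2 error.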

Settings

This node includes several settings, each with specific functions and implications:

1. `rulesSource`

Input Type: Dropdown (options: "manual", "googleDrive", "contextSnippet")
Description: Determines where the validation rules are loaded from: the rules setting ("manual"), googleDriveRules ("googleDrive"), or contextSnippetContent ("contextSnippet").
Default Value: "manual"
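One way to picture the dispatch on rulesSource is the sketch below. It is hypothetical: `resolve_rules` is not a documented function, and because the layout of the Google Drive file objects is not described in this page, the sketch assumes each file object carries a "rules" list.

```python
import json
from typing import Any

VALID_SOURCES = ("manual", "googleDrive", "contextSnippet")


def resolve_rules(settings: dict[str, Any]) -> list[dict[str, Any]]:
    """Pick the rule list that matches the rulesSource setting."""
    source = settings.get("rulesSource", "manual")  # documented default
    if source not in VALID_SOURCES:
        raise ValueError(f"unknown rulesSource: {source!r}")
    if source == "manual":
        return list(settings.get("rules", []))
    if source == "contextSnippet":
        # The snippet text is expected to be a JSON array of rule objects.
        return json.loads(settings.get("contextSnippetContent") or "[]")
    # googleDrive branch. ASSUMPTION: each file object has a "rules" list.
    rules: list[dict[str, Any]] = []
    for file_obj in settings.get("googleDriveRules", []):
        rules.extend(file_obj.get("rules", []))
    return rules


snippet_settings = {
    "rulesSource": "contextSnippet",
    "contextSnippetContent": '[{"column": "email", "check": "email"}]',
}
snippet_rules = resolve_rules(snippet_settings)
```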

2. `rules`

Input Type: Array of rule objects
Description: An array of validation rules. Each rule contains:
- column: The name of the column to validate (string).
- check: The type of check to perform (e.g., "notEmpty", "regex", "email").
- value: The value to compare against (can be a string, number, or array depending on check).
- severity: Either "error" or "warning" (string, default: "error").
Default Value: [] (empty array)
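To make the rule shape concrete, the sketch below models a rule object as a small data class that applies the documented default severity. The `Rule` class and its `from_dict` helper are illustrative names, not part of the node, and rejecting unknown severities is an assumption.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Rule:
    column: str
    check: str
    value: Any = ""
    severity: str = "error"  # documented default

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "Rule":
        """Build a Rule from a raw rule object, filling in optional fields."""
        severity = raw.get("severity", "error")
        if severity not in ("error", "warning"):
            # Assumption: anything other than the two documented levels is rejected.
            raise ValueError(f"severity must be 'error' or 'warning', got {severity!r}")
        return cls(raw["column"], raw["check"], raw.get("value", ""), severity)


# A regex rule: the pattern travels in the value field.
rule = Rule.from_dict({"column": "sku", "check": "regex", "value": r"[A-Z]{3}-\d{4}"})
```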

3. `googleDriveRules`

Input Type: Array of file objects
Description: Stores pre-fetched validation rules from Google Drive. When rulesSource is "googleDrive", these rules are used instead of the manual rules.
Default Value: [] (empty array)

4. `contextSnippetId`

Input Type: String or null
Description: The identifier of the context snippet that supplies the rules. It applies only when rulesSource is "contextSnippet".
Default Value: null

5. `contextSnippetContent`

Input Type: String
Description: Contains the pre-fetched snippet text. It's expected to be JSON representing an array of rule objects.
Default Value: "" (empty string)
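Since the snippet arrives as text, it has to be parsed before use. The following sketch shows one defensive way to do that. How the node handles malformed snippets is not documented, so returning an empty list and dropping entries that lack a column or check are assumptions, and `parse_snippet_rules` is a hypothetical name.

```python
import json
from typing import Any


def parse_snippet_rules(content: str) -> list[dict[str, Any]]:
    """Parse snippet text into rule objects; return [] when it is unusable."""
    if not content.strip():
        return []  # the default value "" yields no rules
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only entries that look like rules: objects naming a column and a check.
    return [
        entry for entry in parsed
        if isinstance(entry, dict) and "column" in entry and "check" in entry
    ]


snippet = '[{"column": "email", "check": "email", "severity": "warning"}]'
snippet_rules = parse_snippet_rules(snippet)
```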

6. `stopOnFirstError`

Input Type: Boolean
Description: If true, validation will stop at the first error found, returning only the first validation issue and the original data.
Default Value: false

Use Cases & Examples

Use Cases

Data Quality Control: Ensuring all critical fields in a dataset contain valid and correctly formatted data before analysis.
Regulatory Compliance: Validating datasets to ensure they adhere to specific business rules and compliance regulations, such as proper email formatting.
Duplicate Value Prevention: Checking for duplicate entries in a dataset to maintain uniqueness across records, such as customer IDs or product SKUs.
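The duplicate-prevention use case could work along the lines of the sketch below. The name of the corresponding check type is not given in this page, so `find_duplicates` is a stand-alone, hypothetical helper; it also assumes 1-based row numbers and hashable cell values.

```python
from typing import Any


def find_duplicates(rows: list[dict[str, Any]], column: str) -> list[dict[str, Any]]:
    """Report every row whose value in `column` already appeared in an earlier row."""
    first_seen: dict[Any, int] = {}
    issues: list[dict[str, Any]] = []
    for row_number, row in enumerate(rows, start=1):
        value = row.get(column)
        if value in first_seen:
            issues.append({
                "row": row_number,
                "column": column,
                "message": f"duplicate of row {first_seen[value]}",
            })
        else:
            first_seen[value] = row_number
    return issues


products = [{"sku": "A-1"}, {"sku": "B-2"}, {"sku": "A-1"}]
duplicates = find_duplicates(products, "sku")
```

Here the third row is reported as a duplicate of the first.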

Example Configuration

Use Case: Data Quality Control for User Registration

Consider a scenario where a business registers users and must ensure that certain fields are populated and formatted correctly, such as names and email addresses.

Sample Configuration:

```json
{
  "rulesSource": "manual",
  "rules": [
    {
      "column": "name",
      "check": "notEmpty",
      "value": "",
      "severity": "error"
    },
    {
      "column": "email",
      "check": "email",
      "value": "",
      "severity": "error"
    }
  ],
  "googleDriveRules": [],
  "contextSnippetId": null,
  "contextSnippetContent": "",
  "stopOnFirstError": false
}
```

In this configuration:

The node checks that the name field is populated and that the email field is a well-formed email address.
If any data does not meet the criteria, the validator will produce relevant error messages, allowing the input to be corrected before being processed further.
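To see what this configuration might report, the sketch below applies the two rules to a pair of sample rows. It is an approximation: the rows are invented for illustration, the issue fields are abbreviated, and the email pattern is deliberately simple because the node's actual email rule is not documented.

```python
import json
import re

config = json.loads("""
{
  "rulesSource": "manual",
  "rules": [
    {"column": "name", "check": "notEmpty", "value": "", "severity": "error"},
    {"column": "email", "check": "email", "value": "", "severity": "error"}
  ],
  "stopOnFirstError": false
}
""")

# ASSUMPTION: a simple "something@something.something" pattern stands in
# for whatever email rule the node really applies.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

CHECKS = {
    "notEmpty": lambda cell: str(cell).strip() != "",
    "email": lambda cell: EMAIL.fullmatch(str(cell)) is not None,
}

rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "", "email": "not-an-email"},
]

# One issue per (row, rule) pair whose check fails.
issues = [
    {"row": n, "column": rule["column"], "rule": rule["check"], "severity": rule["severity"]}
    for n, row in enumerate(rows, start=1)
    for rule in config["rules"]
    if not CHECKS[rule["check"]](row.get(rule["column"], ""))
]

for issue in issues:
    print(issue)
```

The first row passes both rules; the second row produces one issue for the empty name and one for the malformed email.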
