dataValidation Documentation
Overview
The dataValidation logic is designed to validate incoming data against a predefined set of rules. It performs checks on specified columns of the data and generates two outputs: the original data and an array of error or warning messages indicating any validation issues encountered. This component is essential for ensuring data integrity and quality within the Vantage analytics platform.
Purpose
The dataValidation logic checks that data meets specific criteria before any further processing occurs. Catching issues early in the data pipeline reduces the risk of errors in downstream analytics.
Expected Data
The dataValidation node expects the following input data format:
- An array of objects, where each object represents a row of data. Each attribute of the object corresponds to a column name.
- The column names specified in the validation rules must exist in the input data objects.
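To make the expected shape concrete, here is a minimal Python sketch of an input payload together with a pre-flight check that every rule's column appears in each row. The function name `missing_rule_columns`, the sample rows, and the `age` rule are illustrative assumptions, not part of the dataValidation implementation.

```python
# Hypothetical input: an array of objects, one per row, keyed by column name.
rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "", "email": "not-an-email"},
]

# Rules reference columns by name (rule fields as described under Settings).
rules = [
    {"column": "name", "check": "notEmpty"},
    {"column": "email", "check": "email"},
    {"column": "age", "check": "notEmpty"},  # "age" is absent from the rows
]


def missing_rule_columns(rows, rules):
    """Return rule columns that are absent from at least one row, in rule order."""
    missing = []
    for rule in rules:
        column = rule["column"]
        if column not in missing and any(column not in row for row in rows):
            missing.append(column)
    return missing


print(missing_rule_columns(rows, rules))  # ['age']
```

A check like this could run before validation so that a misspelled column name is reported once rather than as a failure on every row.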
How It Works
- Data Unwrapping: The function begins by extracting the input data, ensuring it is in array format. If the input data is empty or invalid, it defaults to an empty array.
- Rule Resolution: The logic determines the source of validation rules based on the rulesSource setting. It can derive rules from manual input, a Google Drive file, or a context snippet.
- Validation Checks: For each row in the input data, the validation checks are applied based on the specified rules. Each rule defines a column to check, a validation type (e.g., whether the value is not empty, matches a regex, falls within a specified range, etc.), and an optional severity level (error or warning).
- Output Generation: The component generates two outputs:
  - output1: The original unmodified data.
  - output2: An array of validation issues found, including the row number, column name, the rule that failed, an error message, and the severity of the issue.
- If the stopOnFirstError setting is enabled, the validation will halt at the first encountered error, returning the original data and the first error found.
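The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the function names, the issue field names (`row`, `column`, `rule`, `message`, `severity`), the 1-based row numbering, and the choice that only an issue with severity "error" triggers the early stop are all assumptions. Only the "notEmpty" and "regex" checks are sketched.

```python
import re


def passes(rule, value):
    """Evaluate one rule against one cell value.

    Only two check types are sketched; any other check is treated as passing.
    """
    text = "" if value is None else str(value)
    if rule["check"] == "notEmpty":
        return text.strip() != ""
    if rule["check"] == "regex":
        # Assumption: the rule's value holds the pattern, matched with re.search.
        return re.search(rule["value"], text) is not None
    return True


def data_validation(input_data, rules, stop_on_first_error=False):
    """Return (output1, output2): the untouched rows and the list of issues."""
    # Data unwrapping: anything that is not a list becomes an empty list.
    rows = input_data if isinstance(input_data, list) else []
    issues = []
    for row_number, row in enumerate(rows, start=1):  # 1-based: an assumption
        for rule in rules:
            if passes(rule, row.get(rule["column"])):
                continue
            severity = rule.get("severity", "error")
            issues.append({
                "row": row_number,
                "column": rule["column"],
                "rule": rule["check"],
                "message": f"{rule['column']} failed {rule['check']}",
                "severity": severity,
            })
            # Assumption: only an "error" issue halts validation, and only
            # that issue is returned.
            if stop_on_first_error and severity == "error":
                return rows, issues[-1:]
    return rows, issues


rows = [{"sku": "A-1"}, {"sku": ""}, {"sku": "bad sku"}]
rules = [
    {"column": "sku", "check": "notEmpty"},
    {"column": "sku", "check": "regex", "value": r"^[A-Z]-\d+$", "severity": "warning"},
]
output1, output2 = data_validation(rows, rules)
print(len(output2))  # 3
```

Here the second row produces one error and one warning, and the third row produces one warning. With `stop_on_first_error=True`, the same call would return only the error from the second row.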
Settings
The dataValidation logic exposes six settings:
1. rulesSource
- Input Type: Dropdown (options: "manual", "googleDrive", "contextSnippet")
- Description: Determines the source of the validation rules. Changing this field affects how the rules are loaded into the logic.
- Default Value: "manual"
2. rules
- Input Type: Array of rule objects
- Description: An array of validation rules. Each rule contains:
  - column: The name of the column to validate (string).
  - check: The type of check to perform (e.g., "notEmpty", "regex", etc.).
  - value: The value to compare against (can be a string, number, or array depending on check).
  - severity: Error or warning level (string, default: "error").
- Default Value: [] (empty array)
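As an illustration of the rule object described above, the following Python sketch models a rule and applies the documented severity default. The `Rule` class and `parse_rule` helper are hypothetical names. Rejecting severities other than "error" and "warning", and defaulting `value` to an empty string (modeled on the sample configuration later in this document), are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Rule:
    column: str              # name of the column to validate
    check: str               # e.g. "notEmpty" or "regex"
    value: Any = ""          # comparison value; its type depends on `check`
    severity: str = "error"  # "error" or "warning"


def parse_rule(raw: dict) -> Rule:
    """Build a Rule from a raw object, applying the documented severity default."""
    severity = raw.get("severity") or "error"
    if severity not in ("error", "warning"):
        # Assumption: unknown severities are rejected rather than coerced.
        raise ValueError(f"unknown severity: {severity!r}")
    return Rule(
        column=str(raw["column"]),
        check=str(raw["check"]),
        value=raw.get("value", ""),
        severity=severity,
    )


rule = parse_rule({"column": "email", "check": "regex", "value": r"^\S+@\S+$"})
print(rule.severity)  # error
```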
3. googleDriveRules
- Input Type: Array of file objects
- Description: Stores pre-fetched validation rules from Google Drive. If this setting is used, it will be preferred over manual rules.
- Default Value: [] (empty array)
4. contextSnippetId
- Input Type: String or null
- Description: An identifier for a context snippet if rules are provided from this source.
- Default Value: null
5. contextSnippetContent
- Input Type: String
- Description: Contains the pre-fetched snippet text. It's expected to be JSON representing an array of rule objects.
- Default Value: "" (empty string)
6. stopOnFirstError
- Input Type: Boolean
- Description: If true, validation will stop at the first error found, returning only the first validation issue and the original data.
- Default Value: false
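Taken together, the rule-source settings above suggest a resolution step along the following lines. This Python sketch is an assumption-laden illustration: `resolve_rules` is a hypothetical name, the sketch dispatches on rulesSource alone, and malformed snippet JSON is assumed to yield no rules. Because this document does not describe the shape of a Google Drive file object, that part is delegated to a caller-supplied `rules_from_file` function, which is also hypothetical.

```python
import json


def resolve_rules(settings, rules_from_file=None):
    """Pick the rule list according to rulesSource; fall back to an empty list."""
    source = settings.get("rulesSource", "manual")
    if source == "googleDrive":
        if rules_from_file is None:
            return []
        rules = []
        for file_obj in settings.get("googleDriveRules") or []:
            # The caller decides how a pre-fetched file object becomes rules.
            rules.extend(rules_from_file(file_obj))
        return rules
    if source == "contextSnippet":
        # contextSnippetContent is expected to be JSON for an array of rule objects.
        try:
            parsed = json.loads(settings.get("contextSnippetContent") or "[]")
        except json.JSONDecodeError:
            return []  # assumption: unparseable snippet text yields no rules
        return parsed if isinstance(parsed, list) else []
    # "manual" (the default): use the rules array as configured.
    return list(settings.get("rules") or [])


settings = {
    "rulesSource": "contextSnippet",
    "contextSnippetContent": '[{"column": "email", "check": "notEmpty"}]',
}
print(resolve_rules(settings))  # [{'column': 'email', 'check': 'notEmpty'}]
```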
Use Cases & Examples
Use Cases
- Data Quality Control: Ensuring all critical fields in a dataset contain valid and correctly formatted data before analysis.
- Regulatory Compliance: Validating datasets to ensure they adhere to specific business rules and compliance regulations, such as proper email formatting.
- Duplicate Value Prevention: Checking for duplicate entries in a dataset to maintain uniqueness across records, such as customer IDs or product SKUs.
Example Configuration
Use Case: Data Quality Control for User Registration
Consider a scenario where a business registers users and must ensure that certain fields are populated and formatted correctly, such as names and email addresses.
Sample Configuration:
{
  "rulesSource": "manual",
  "rules": [
    {
      "column": "name",
      "check": "notEmpty",
      "value": "",
      "severity": "error"
    },
    {
      "column": "email",
      "check": "email",
      "value": "",
      "severity": "error"
    }
  ],
  "googleDriveRules": [],
  "contextSnippetId": null,
  "contextSnippetContent": "",
  "stopOnFirstError": false
}
In this configuration:
- The logic will check if the name field is populated and if the email field conforms to email format rules.
- If any data does not meet the criteria, the validator will produce relevant error messages, allowing the input to be corrected before being processed further.
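To show what this configuration might report, the Python sketch below applies its two rules to a pair of made-up rows. The sample rows, the issue field names, the message wording, and the deliberately simple email pattern are all assumptions. The real "email" check may accept or reject different addresses.

```python
import re

# A deliberately simple email pattern; an assumption, not the actual rule.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

config_rules = [
    {"column": "name", "check": "notEmpty", "value": "", "severity": "error"},
    {"column": "email", "check": "email", "value": "", "severity": "error"},
]

# Hypothetical registration rows: the second has an empty name and a bad email.
rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "", "email": "grace@example"},
]


def as_text(value):
    """Treat a missing value as an empty string."""
    return "" if value is None else str(value)


CHECKS = {
    "notEmpty": lambda v: as_text(v).strip() != "",
    "email": lambda v: EMAIL_PATTERN.match(as_text(v)) is not None,
}

output2 = [
    {
        "row": row_number,
        "column": rule["column"],
        "rule": rule["check"],
        "message": f"{rule['column']} failed {rule['check']}",
        "severity": rule["severity"],
    }
    for row_number, row in enumerate(rows, start=1)
    for rule in config_rules
    if not CHECKS[rule["check"]](row.get(rule["column"]))
]

for issue in output2:
    print(issue["row"], issue["column"], issue["rule"], issue["severity"])
# 2 name notEmpty error
# 2 email email error
```

The first row passes both rules, so both reported issues belong to the second row, and output1 would carry the two rows through unchanged.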