fillclean Documentation
Overview
The fillclean logic, part of the Vantage analytics platform, is designed to clean and preprocess datasets by applying a series of transformations to the data contained within specified columns. The primary functions of the fillclean node include handling null values, trimming whitespace, and performing type conversions (e.g., strings to numbers and date parsing). This preprocessing step is crucial for ensuring data quality and consistency before further analysis.
Settings
The fillclean logic contains a configuration object that allows users to define various rules indicating how data cleaning should be applied. Below is a detailed breakdown of each setting available in the configuration.
Rules
- Name:
rules- Input Type: Array of Objects
- Description: This is a collection of rules to be applied to the dataset. Each rule specifies a transformation for a specific column (or all columns).
- Default Value:
[](empty array) - Rules Structure:
- Column:
- Name:
column - Input Type: String
- Description: The name of the column to which the transformation applies. Use
'__all__'to apply the rule to all columns. - Default Value: N/A
- Name:
- Action:
- Name:
action - Input Type: String
- Description: The action to take, which could be one of the following:
fillDefault,fillForward,fillBackward,dropNulls,trimWhitespace,toNumber,toDate,toLowerCase,toUpperCase. Each action defines how the data will be modified. - Default Value: N/A
- Name:
- Value:
- Name:
value - Input Type: Any
- Description: The default value that will be used when
fillDefaultaction is triggered. It is also utilized for certain types of transformations where specific values are necessary. - Default Value: N/A
- Name:
- DateFormat:
- Name:
dateFormat - Input Type: String (optional)
- Description: A hint for the date format when parsing dates in the
toDateaction. This helps ensure proper conversion of date formats that may not be in ISO format. - Default Value: N/A
- Name:
- Column:
Example of Rules Configuration
{
"rules": [
{
"column": "age",
"action": "fillDefault",
"value": 0
},
{
"column": "__all__",
"action": "trimWhitespace"
},
{
"column": "birthdate",
"action": "toDate",
"dateFormat": "MM/DD/YYYY"
}
]
}How It Works
- Data Input: The
fillcleanlogic begins by unwrapping the input data and determining if it is an array. An empty array is returned if no valid data is supplied. - Expand Rules: If any rules use the
'__all__'notation, these are replicated for each corresponding column in the dataset. - Rule Processing: Each rule is executed in the order provided, modifying the dataset according to the actions specified:
fillDefault: Replaces null or empty values with a specified default.fillForward: Fills null or empty values with the last non-null value from the previous rows.fillBackward: Fills null or empty values with the next available non-null value from subsequent rows.dropNulls: Removes rows where the specified column contains null or empty values.trimWhitespace: Trims whitespace from the start and end of string values in the specified column.toNumber: Converts string values to numbers, stripping non-numeric characters if necessary.toDate: Converts string representations of dates to ISO string format.toLowerCase/toUpperCase: Converts string values to lower or upper case as specified.
After applying all rules, the cleaned data is returned.
Expected Data
The fillclean node expects the following data structure:
- An array of objects, where each object represents a row of data. Each row object should contain key-value pairs corresponding to the column names and their respective values. The keys are strings, and the values can be of any type (e.g., string, number, date).
Example of Expected Data Format
[
{ "age": null, "name": " Alice ", "birthdate": "12/31/2020" },
{ "age": 30, "name": " Bob ", "birthdate": null },
{ "age": 25, "name": " Charlie", "birthdate": "01/01/2021" }
]Use Cases & Examples
Use Case 1: Handling Missing Age Values
In a user dataset where age might be missing, we can default missing values to 0 for analysis purposes—ensuring that age-related calculations are still viable.
Use Case 2: Standardizing Names
In a dataset containing names with varying casing (e.g., " JANE ", "joHN"), ensuring consistency by transforming all names to a standardized format (e.g., title case or all lowercase) can facilitate better grouping and sorting.
Use Case 3: Date Parsing for Analysis
When load data that includes dates in a non-standard format (e.g., MM/DD/YYYY), cleaning these dates to a consistent ISO format ensures proper chronological analysis.
Detailed Configuration Example for Use Case 1
To address the first use case of handling missing age values, use the following rules configuration:
{
"rules": [
{
"column": "age",
"action": "fillDefault",
"value": 0
},
{
"column": "name",
"action": "trimWhitespace"
}
]
}Using this configuration, any row with a null or empty age field will have that field filled with a default value of 0, while whitespace will be trimmed from the name field for cleaner results.
AI Integrations and Billing Impacts
Currently, the fillclean logic does not directly integrate with any AI features in Vantage. However, utilizing fillclean effectively to preprocess data can enhance the input quality for machine learning models or analytical tools that rely on complete and clean datasets.
As for billing, using this component does not incur additional costs beyond standard usage tiers of Vantage's services. However, extensive data preprocessing may influence computation time and resource allocation, depending on the size of the dataset being processed. It is advisable to monitor performance and resource usage if large datasets are involved.