5 min read

fillclean Documentation

Overview

This node, part of the Vantage analytics platform, is designed to clean and preprocess datasets by applying a series of transformations to the data contained within specified columns. The primary functions of the fillclean node include handling null values, trimming whitespace, and performing type conversions (e.g., strings to numbers and date parsing). This preprocessing step is crucial for ensuring data quality and consistency before further analysis.

Settings

This node contains a configuration object that allows users to define various rules indicating how data cleaning should be applied. Below is a detailed breakdown of each setting available in the configuration.

Rules

Name: rules
- Input Type: Array of Objects
- Description: This is a collection of rules to be applied to the dataset. Each rule specifies a transformation for a specific column (or all columns).
- Default Value: [] (empty array)
- Rules Structure:
  - Column:
    - Name: column
    - Input Type: String
    - Description: The name of the column to which the transformation applies. Use '__all__' to apply the rule to all columns.
    - Default Value: N/A
  - Action:
    - Name: action
    - Input Type: String
    - Description: The action to take, which could be one of the following: fillDefault, fillForward, fillBackward, dropNulls, trimWhitespace, toNumber, toDate, toLowerCase, toUpperCase. Each action defines how the data will be modified.
    - Default Value: N/A
  - Value:
    - Name: value
    - Input Type: Any
    - Description: The default value that will be used when fillDefault action is triggered. It is also utilized for certain types of transformations where specific values are necessary.
    - Default Value: N/A
  - DateFormat:
    - Name: dateFormat
    - Input Type: String (optional)
    - Description: A hint for the date format when parsing dates in the toDate action. This helps ensure proper conversion of date formats that may not be in ISO format.
    - Default Value: N/A

Example of Rules Configuration

json

{
  "rules": [
    {
      "column": "age",
      "action": "fillDefault",
      "value": 0
    },
    {
      "column": "__all__",
      "action": "trimWhitespace"
    },
    {
      "column": "birthdate",
      "action": "toDate",
      "dateFormat": "MM/DD/YYYY"
    }
  ]
}

How It Works

Data Input: This node begins by unwrapping the input data and determining if it is an array. An empty array is returned if no valid data is supplied.
Expand Rules: If any rules use the '__all__' notation, these are replicated for each corresponding column in the dataset.
Rule Processing: Each rule is executed in the order provided, modifying the dataset according to the actions specified:
- fillDefault: Replaces null or empty values with a specified default.
- fillForward: Fills null or empty values with the last non-null value from the previous rows.
- fillBackward: Fills null or empty values with the next available non-null value from subsequent rows.
- dropNulls: Removes rows where the specified column contains null or empty values.
- trimWhitespace: Trims whitespace from the start and end of string values in the specified column.
- toNumber: Converts string values to numbers, stripping non-numeric characters if necessary.
- toDate: Converts string representations of dates to ISO string format.
- toLowerCase / toUpperCase: Converts string values to lower or upper case as specified.

After applying all rules, the cleaned data is returned.

Expected Data

The fillclean node expects the following data structure:

An array of objects, where each object represents a row of data. Each row object should contain key-value pairs corresponding to the column names and their respective values. The keys are strings, and the values can be of any type (e.g., string, number, date).

Example of Expected Data Format

json

[
  { "age": null, "name": " Alice ", "birthdate": "12/31/2020" },
  { "age": 30, "name": " Bob ", "birthdate": null },
  { "age": 25, "name": " Charlie", "birthdate": "01/01/2021" }
]

Use Cases & Examples

Use Case 1: Handling Missing Age Values

In a user dataset where age might be missing, we can default missing values to 0 for analysis purposes—ensuring that age-related calculations are still viable.

Use Case 2: Standardizing Names

In a dataset containing names with varying casing (e.g., " JANE ", "joHN"), ensuring consistency by transforming all names to a standardized format (e.g., title case or all lowercase) can facilitate better grouping and sorting.

Use Case 3: Date Parsing for Analysis

When load data that includes dates in a non-standard format (e.g., MM/DD/YYYY), cleaning these dates to a consistent ISO format ensures proper chronological analysis.

Detailed Configuration Example for Use Case 1

To address the first use case of handling missing age values, use the following rules configuration:

json

{
  "rules": [
    {
      "column": "age",
      "action": "fillDefault",
      "value": 0
    },
    {
      "column": "name",
      "action": "trimWhitespace"
    }
  ]
}

Using this configuration, any row with a null or empty age field will have that field filled with a default value of 0, while whitespace will be trimmed from the name field for cleaner results.

AI Integrations and Billing Impacts

Currently, this node does not directly integrate with any AI features in Vantage. However, utilizing fillclean effectively to preprocess data can enhance the input quality for machine learning models or analytical tools that rely on complete and clean datasets.

As for billing, using this component does not incur additional costs beyond standard usage tiers of Vantage's services. However, extensive data preprocessing may influence computation time and resource allocation, depending on the size of the dataset being processed. It is advisable to monitor performance and resource usage if large datasets are involved.

← PreviousDeduplicate Next →Computed Column

fillclean Documentation

Overview

Settings

Rules

Example of Rules Configuration

How It Works

Expected Data

Example of Expected Data Format

Use Cases & Examples

Use Case 1: Handling Missing Age Values

Use Case 2: Standardizing Names

Use Case 3: Date Parsing for Analysis

Detailed Configuration Example for Use Case 1

AI Integrations and Billing Impacts

Related Pages