Deduplicate

Overview

The Deduplicate logic in Vantage removes duplicate rows from a dataset based on specified key columns. When no key columns are provided, each row is compared in its entirety, so only fully identical rows count as duplicates. You can choose to keep either the first or the last occurrence of each duplicate, which makes the component useful for data cleaning and preparation.

Settings

1. `keyColumns`

Input Type: Array of Strings
Description: This setting specifies the columns that will be used to determine the uniqueness of each row in the dataset. If this array is empty, the entire row will be treated as the key for deduplication.
Default Value: [] (empty array)
Impact of Changes: When you specify key columns, deduplication will only assess these fields to identify duplicates. Conversely, providing an empty array prompts the component to evaluate the entire row's content for uniqueness.
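The difference between the two modes can be illustrated with a small TypeScript sketch. The `rowKey` helper and the `Row` type are hypothetical names chosen for this example, and serializing with `JSON.stringify` is an assumption about how a comparison key might be built, not a description of Vantage's internals.

```typescript
// Hypothetical row type: a plain object with string property names.
type Row = Record<string, unknown>;

// Build the string used to compare two rows.
// With key columns, only those fields count; with none, the whole row counts.
function rowKey(row: Row, keyColumns: string[]): string {
  if (keyColumns.length === 0) {
    return JSON.stringify(row);
  }
  return JSON.stringify(keyColumns.map((col) => row[col]));
}

const a: Row = { customerID: "001", name: "John Doe" };
const b: Row = { customerID: "001", name: "J. Doe" };

console.log(rowKey(a, ["customerID"]) === rowKey(b, ["customerID"])); // true: same ID
console.log(rowKey(a, []) === rowKey(b, [])); // false: the names differ
```

One consequence of this particular approach is that whole-row comparison is sensitive to property order, so `{x: 1, y: 2}` and `{y: 2, x: 1}` would produce different keys.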

2. `keepStrategy`

Input Type: String (dropdown)
Description: This setting dictates which duplicate row to retain when duplicates are found. Users can select either "first" to keep the first occurrence or "last" to retain the last occurrence.
Default Value: "first"
Impact of Changes: With "first", rows are scanned in their original order and the initial instance of each duplicate is kept. With "last", the data is reversed before filtering so that the final occurrence of each duplicate is kept, and the filtered result is then reversed again so the surviving rows appear in their original relative order.
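As a rough illustration of how the two strategies differ, the TypeScript sketch below deduplicates on a single `id` field. The function name `dedupeById` and the sample rows are hypothetical; only the reverse, filter, reverse pattern comes from the description above.

```typescript
type Row = Record<string, unknown>;
type KeepStrategy = "first" | "last";

// Keep one row per `id` value; `id` stands in for whatever key is configured.
function dedupeById(rows: Row[], keep: KeepStrategy): Row[] {
  // For "last", walk a reversed copy so the final occurrence is seen first.
  const ordered = keep === "last" ? [...rows].reverse() : rows;
  const seen = new Set<unknown>();
  const kept = ordered.filter((row) => {
    if (seen.has(row["id"])) return false;
    seen.add(row["id"]);
    return true;
  });
  // Undo the reversal so survivors appear in their original relative order.
  return keep === "last" ? kept.reverse() : kept;
}

const rows: Row[] = [
  { id: 1, status: "new" },
  { id: 2, status: "new" },
  { id: 1, status: "contacted" },
];

console.log(dedupeById(rows, "first"));
// [{ id: 1, status: "new" }, { id: 2, status: "new" }]
console.log(dedupeById(rows, "last"));
// [{ id: 2, status: "new" }, { id: 1, status: "contacted" }]
```

Note that with "last" the surviving row for `id: 1` sits where its final occurrence was, so it now follows `id: 2` in the output.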

Functionality

The deduplicate logic processes the input data to identify duplicates based on the specified `keyColumns` or the entire row. It operates as follows:

1. Retrieves the input data from the first input slot (`input1`).
2. Checks whether the data is an array; if it is wrapped in an object, extracts the relevant array.
3. Returns an empty array as the output if the data is empty or not in an expected format.
4. Selects the deduplication strategy from `keepStrategy`. If it is set to "last", the data is reversed so that the last entries are kept.
5. Constructs a Set to track rows already seen, using a key generated from `keyColumns` or from the full row.
6. Filters the dataset, retaining only the first row encountered for each key.
7. If the data was reversed, reverses the filtered output again to restore the original sequence.
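The whole sequence can be sketched in TypeScript as shown below. This is a minimal illustration under stated assumptions, not the component's actual source: the names `deduplicate`, `toRows`, and `DedupConfig` are hypothetical, the wrapper property is assumed to be called `data` because the text does not name it, and keys are built with `JSON.stringify`. Rows are assumed to be plain objects.

```typescript
type Row = Record<string, unknown>;

interface DedupConfig {
  keyColumns?: string[]; // defaults to []
  keepStrategy?: "first" | "last"; // defaults to "first"
}

// Accept a bare array, or an object wrapping one.
// ASSUMPTION: the wrapper property is named `data`; the source does not say.
function toRows(input: unknown): Row[] {
  if (Array.isArray(input)) return input as Row[];
  if (input !== null && typeof input === "object") {
    const inner = (input as { data?: unknown }).data;
    if (Array.isArray(inner)) return inner as Row[];
  }
  return []; // empty or unexpected input yields an empty result
}

function deduplicate(input: unknown, config: DedupConfig = {}): Row[] {
  const keyColumns = config.keyColumns ?? [];
  const keepLast = config.keepStrategy === "last";

  const rows = toRows(input);
  // Reverse a copy so that the last occurrence is encountered first.
  const ordered = keepLast ? [...rows].reverse() : rows;

  // Remember each key and drop rows whose key was already seen.
  const seen = new Set<string>();
  const kept = ordered.filter((row) => {
    const key = keyColumns.length > 0
      ? JSON.stringify(keyColumns.map((col) => row[col]))
      : JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });

  // Restore the original direction if the data was reversed.
  return keepLast ? kept.reverse() : kept;
}

const result = deduplicate(
  { data: [{ sku: "A", qty: 1 }, { sku: "B", qty: 2 }, { sku: "A", qty: 5 }] },
  { keyColumns: ["sku"], keepStrategy: "last" },
);
console.log(result); // [{ sku: "B", qty: 2 }, { sku: "A", qty: 5 }]
console.log(deduplicate("not an array")); // []
```

Using a Set keeps each membership check close to constant time, so the filter makes a single pass over the rows regardless of how many duplicates exist.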

Expected Data

The deduplicate logic expects the input data to be in a well-structured array format, where each element corresponds to a row of data. Each row is expected to be an object that can contain various attributes defined by the user's dataset. If the input does not conform to these expectations, the component will return an empty output.

Use Cases & Examples

Use Case 1: Data Cleansing for CRM Systems

In a CRM application, sales representatives may inadvertently duplicate customer records while entering new leads. The deduplicate logic can be employed to clean these records before processing them for marketing campaigns.

Use Case 2: ETL Processes

During extract, transform, load (ETL) processes, it’s crucial to ensure that only unique records are transferred to the data warehouse. The deduplicate logic can help achieve this, maintaining data quality without manual intervention.

Use Case 3: Survey Data Analysis

When survey responses are collected, participants sometimes submit multiple entries. Deduplication lets analysts remove the repeated responses, making their findings more reliable.

Configuration Example

To implement the deduplicate logic targeting a dataset of customer entries based on unique customer IDs while maintaining the latest entries, you might configure it as follows:

```json
{
    "inputs": {
        "input1": [
            {"customerID": "001", "name": "John Doe"},
            {"customerID": "002", "name": "Jane Smith"},
            {"customerID": "001", "name": "John Doe"},
            {"customerID": "003", "name": "Emma Brown"}
        ]
    },
    "config": {
        "keyColumns": ["customerID"],
        "keepStrategy": "last"
    }
}
```

In this example, the deduplicate logic will:

Identify that the entries for "customerID" 001 are duplicates.
Retain the last occurrence of this entry, resulting in the following output: