4 min read · Updated Mar 2, 2026

Deduplicate Logic Documentation

Overview

The Deduplicate logic in Vantage removes duplicate rows from a dataset based on specified key columns. When no key columns are provided, the logic compares entire rows, so only fully unique entries remain. Depending on the configuration, users can keep either the first or the last occurrence of each duplicate, making it a versatile tool for data cleaning and preparation.

Settings

1. keyColumns — the columns used to identify duplicates. If no key columns are provided, the entire row is used for comparison.

2. keepStrategy — whether to keep the "first" or the "last" occurrence of each duplicate.

Functionality

The deduplicate logic processes the input data to identify duplicates based on the specified keyColumns or the entire row. It operates as follows:

  1. Retrieves the input data from the first input slot (input1).
  2. Checks if the data is an array; if it's wrapped in an object, it extracts the relevant array.
  3. If the data is empty or not in an expected format, it returns an empty array as the output.
  4. Defines the deduplication strategy based on the keepStrategy. If it’s set to “last,” the data is reversed to keep the last entries.
  5. Constructs a Set to track seen rows based on the generated key from keyColumns or the full row.
  6. Filters the dataset, retaining only unique rows according to the uniqueness criteria specified.
  7. If the keepStrategy is "last," the filtered data is reversed again to restore the original sequence.
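The steps above can be sketched in Python. This is a minimal illustration of the algorithm, not Vantage's actual implementation; the function name, parameters, and key-serialization details are assumptions:

```python
import json


def deduplicate(rows, key_columns=None, keep_strategy="first"):
    """Remove duplicate rows, keeping the first or last occurrence.

    Illustrative sketch of the deduplicate logic; names and details
    are assumptions, not Vantage's actual API.
    """
    # Empty or unexpected input yields an empty output.
    if not isinstance(rows, list) or not rows:
        return []

    # For keepStrategy "last", reverse the data so the last occurrence
    # of each key is the first one the filter below encounters.
    data = list(reversed(rows)) if keep_strategy == "last" else list(rows)

    seen = set()
    unique = []
    for row in data:
        # Build the uniqueness key from keyColumns, or from the full row.
        if key_columns:
            key = json.dumps([row.get(col) for col in key_columns])
        else:
            key = json.dumps(row, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(row)

    # Restore the original order after filtering.
    return list(reversed(unique)) if keep_strategy == "last" else unique
```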

Expected Data

The deduplicate logic expects the input data to be in a well-structured array format, where each element corresponds to a row of data. Each row is expected to be an object that can contain various attributes defined by the user's dataset. If the input does not conform to these expectations, the component will return an empty output.
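The input handling described above (steps 2 and 3 of the Functionality section) might look like the following sketch. The wrapper key name ("data") is an assumption for illustration; the actual key the component unwraps is not specified here:

```python
def normalize_input(value):
    """Unwrap the input into a list of row objects.

    Illustrative helper; the wrapper key "data" is an assumed name.
    """
    # Already a well-structured array of rows: use it as-is.
    if isinstance(value, list):
        return value
    # Data wrapped in an object: extract the relevant array.
    if isinstance(value, dict) and isinstance(value.get("data"), list):
        return value["data"]
    # Unexpected format: fall back to an empty output.
    return []
```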

Use Cases & Examples

Use Case 1: Data Cleansing for CRM Systems

In a CRM application, sales representatives may inadvertently duplicate customer records while entering new leads. The deduplicate logic can be employed to clean these records before processing them for marketing campaigns.

Use Case 2: ETL Processes

During data extraction and transformation (ETL), it’s crucial to ensure unique records are transferred to the data warehouse. The deduplicate logic can help achieve this, ensuring high data quality without manual intervention.

Use Case 3: Survey Data Analysis

In instances where survey responses are collected, it’s common for participants to submit multiple entries. By using deduplication, analysts can remove duplicate survey responses, enhancing the reliability of their findings.

Configuration Example

To implement the deduplicate logic targeting a dataset of customer entries based on unique customer IDs while maintaining the latest entries, you might configure it as follows:

```json
{
    "inputs": {
        "input1": [
            {"customerID": "001", "name": "John Doe"},
            {"customerID": "002", "name": "Jane Smith"},
            {"customerID": "001", "name": "John Doe"},
            {"customerID": "003", "name": "Emma Brown"}
        ]
    },
    "config": {
        "keyColumns": ["customerID"],
        "keepStrategy": "last"
    }
}
```

In this example, the deduplicate logic keeps the last occurrence of each customerID and produces the following output:

```json
{
    "output1": {
        "data": [
            {"customerID": "002", "name": "Jane Smith"},
            {"customerID": "001", "name": "John Doe"},
            {"customerID": "003", "name": "Emma Brown"}
        ]
    }
}
```

This ensures that unique entries are preserved according to the defined criteria, ready for further processing or reporting.