How to set up and activate Duplicate Guard Monitor

Overview

This tutorial explains how to configure, activate, and analyze results from a Duplicate Guard Monitor in CDQ.

Learning Goals

In this tutorial, you will learn how to:

  • set up a deduplication configuration
  • activate a duplicate monitor
  • generate and interpret deduplication reports

Prerequisites

CDQ supports a default matching configuration for Duplicate Monitors. It consists of three search attributes and five matching attributes.

Default configuration

The default configuration is designed to systematically identify duplicate records based on specified attributes and matching criteria, using various cleaning and comparison techniques to ensure accuracy.

Key components:

The default configuration for duplicate matching is expressed as JSON. A Duplicate Monitor can use it to identify and handle duplicate records.

Its key components are:

  • "name": "Default Matching Configuration": the name of the configuration

  • "configuration": the structure of the configuration, which contains at least two objects:

    • candidateSearchConfiguration
    • generalMatchingConfiguration
  • "candidateSearchConfiguration": defines how candidate records are searched for potential duplicates and includes:

    • "maxCandidates": 10: specifies the maximum number of candidate records to consider (in this case, 10)
    • "searchAttributes": an array of attributes used to search for candidates, each defined by a JSON path:
      • {"jsonPath": "$.names[0].value"}: searches using the value of the first entry in names
      • {"jsonPath": "$.addresses[0].country.shortName"}: searches using the country ISO code in the first address
      • {"jsonPath": "$.addresses[0].thoroughfares[0].value"}: searches using the thoroughfare value in the first address
  • "generalMatchingConfiguration": defines the criteria for matching records and includes:

    • "threshold": 0.85: the score above which records are considered a definite match
    • "thresholdMaybe": 0.65: the score above which records are considered a possible match; a possible match scores below threshold, in this case between 0.65 and 0.85
    • "matchingAttributes": an array of attributes used for matching:
      • "name": descriptive name of the attribute being matched; this name will be used in reports
      • "jsonPath": path to the attribute in the JSON structure
      • "cleaners": list of cleaning operations applied to the attribute before matching (e.g., LowerCaseNormalizer, PunctuationsCleaner)
      • "high" and "low": thresholds for matching scores specific to the attribute
      • "comparator": method used to compare the attribute values (e.g., QGramComparator, ExactComparator, Levenshtein)
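
To make the roles of cleaners and comparators more concrete, here is a minimal Python sketch of a q-gram comparison with q = 3 and a Dice formula, applied after lower-casing and punctuation removal. It is an illustration only, not CDQ's implementation: the function names are hypothetical, the "tokenizer" parameter is ignored, and the real LowerCaseNormalizer, PunctuationsCleaner, and QGramComparator may behave differently.

```python
import string


def clean(value: str) -> str:
    """Lower-case the value and strip ASCII punctuation.

    Hypothetical stand-in for LowerCaseNormalizer + PunctuationsCleaner.
    """
    lowered = value.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


def qgrams(value: str, q: int = 3) -> set[str]:
    """Return the set of overlapping q-character substrings of value."""
    if len(value) < q:
        # Values shorter than q are treated as a single gram (an assumption).
        return {value} if value else set()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def dice_similarity(left: str, right: str, q: int = 3) -> float:
    """Dice coefficient over q-gram sets: 2 * |A & B| / (|A| + |B|)."""
    a = qgrams(clean(left), q)
    b = qgrams(clean(right), q)
    if not a and not b:
        # Two empty values are treated as identical (an assumption).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    print(dice_similarity("ACME Corp.", "acme corp"))
    print(dice_similarity("Acme Corporation", "Acme Corp"))
```

In this sketch, values that differ only in case or punctuation compare as identical, while a truncated name yields a partial score. How such a score relates to the "high" and "low" values of an attribute is not specified here.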

Example default configuration:

{
    "name": "Default Matching Configuration",
    "configuration": {
        "candidateSearchConfiguration": {
            "maxCandidates": 10,
            "searchAttributes": [
                { "jsonPath": "$.names[0].value" },
                { "jsonPath": "$.addresses[0].country.shortName" },
                { "jsonPath": "$.addresses[0].thoroughfares[0].value" }
            ]
        },
        "generalMatchingConfiguration": {
            "threshold": 0.85,
            "thresholdMaybe": 0.65,
            "matchingAttributes": [
                {
                    "name": "Business Partner Name",
                    "jsonPath": "businessPartner.names[0].value",
                    "cleaners": [
                        { "name": "LowerCaseNormalizer" },
                        { "name": "PunctuationsCleaner" },
                        {
                            "name": "LegalFormCleaner",
                            "parameters": [
                                { "name": "configProperty", "value": "$.addresses[0].country.shortName" }
                            ]
                        }
                    ],
                    "high": 0.8,
                    "low": 0.3,
                    "comparator": {
                        "name": "QGramComparator",
                        "parameters": [
                            { "name": "tokenizer", "value": "BASIC" },
                            { "name": "formula", "value": "DICE" },
                            { "name": "q", "value": "3" }
                        ]
                    }
                },
                {
                    "name": "Business Partner Country",
                    "jsonPath": "businessPartner.addresses[0].country.shortName",
                    "cleaners": [ { "name": "LowerCaseNormalizer" } ],
                    "high": 0.5,
                    "low": 0,
                    "comparator": { "name": "ExactComparator" }
                },
                {
                    "name": "Business Partner City",
                    "jsonPath": "businessPartner.addresses[0].localities[0].value",
                    "cleaners": [
                        { "name": "LowerCaseNormalizer" },
                        { "name": "PunctuationsCleaner" }
                    ],
                    "high": 0.6,
                    "low": 0.2,
                    "comparator": { "name": "Levenshtein" }
                },
                {
                    "name": "Business Partner Street",
                    "jsonPath": "businessPartner.addresses[0].thoroughfares[0].value",
                    "cleaners": [
                        { "name": "LowerCaseNormalizer" },
                        { "name": "PunctuationsCleaner" }
                    ],
                    "high": 0.7,
                    "low": 0.3,
                    "comparator": { "name": "Levenshtein" }
                },
                {
                    "name": "Business Partner Postal Code",
                    "jsonPath": "businessPartner.addresses[0].postCodes[0].value",
                    "cleaners": [ { "name": "DigitsOnlyCleaner" } ],
                    "high": 0.6,
                    "low": 0.3,
                    "comparator": { "name": "ExactComparator" }
                }
            ]
        },
        "scopedMatchingConfiguration": {}
    }
}
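
The two thresholds in generalMatchingConfiguration split scores into three outcomes. The following Python sketch shows one way to read them from a configuration and classify a score. The classify function is hypothetical, and treating a score exactly equal to a threshold as belonging to the lower band is an assumption, since the description above only says "above".

```python
# Only the part of the default configuration needed for this sketch.
CONFIG = {
    "configuration": {
        "generalMatchingConfiguration": {
            "threshold": 0.85,
            "thresholdMaybe": 0.65,
        }
    }
}


def classify(score: float, threshold: float = 0.85, threshold_maybe: float = 0.65) -> str:
    """Map a similarity score to MATCH, MAYBE_MATCH, or NO_MATCH."""
    if not 0 <= threshold_maybe < threshold <= 1:
        raise ValueError("expected 0 <= thresholdMaybe < threshold <= 1")
    if score > threshold:
        return "MATCH"
    if score > threshold_maybe:
        return "MAYBE_MATCH"
    return "NO_MATCH"


if __name__ == "__main__":
    general = CONFIG["configuration"]["generalMatchingConfiguration"]
    for score in (0.9, 0.7, 0.5):
        label = classify(score, general["threshold"], general["thresholdMaybe"])
        print(score, label)
```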

Step 1: Log into CDQ Cloud Apps

Check your CDQ apps account.

  1. Log in to the CDQ Cloud Apps.

No account?

  1. If your organization is already a CDQ customer, ask your internal point of contact to create a dedicated CDQ account for you, or request an account. The account details will be sent by email.
  2. If your organization is not yet a CDQ customer, please contact us to get started.

Step 2: Set up a Deduplication Configuration

To create the default configuration for the Duplicate Guard Monitor:

  1. Navigate to the Duplicate Guard Configurator app.

Figure 1. Duplicate Guard Configurator

  2. Click the Create New Configuration button.
  3. Provide a Configuration Name and click Create.

Figure 2. Setting Duplicate Guard Monitor

Creation of default configuration

The default configuration is created. Browse and adjust if necessary (for this example, leave it as is).


Step 3: Create a Duplicate Monitor

A Duplicate Monitor asynchronously processes the data stored in a data mirror (which contains data sources, previously known as storage) to find duplicate business partners. A single Duplicate Monitor can scan multiple data sources, but each data source can have only one dedicated Duplicate Monitor.

Required attributes:

  • Duplicate Matching Configuration (see above)
  • Selection of data sources within one data mirror

Data Source for Tests

To simplify understanding, a sample data set is provided with 100 Business Partners:

  • 40 are different (according to the default configuration)
  • 60 are duplicates

The set is based on the CDQ sample data file, which contains 10 different Business Partners. We extended it by adding 9 entries for each existing Business Partner. As a result, each original Business Partner relates to its 9 additional entries as follows:

  • Three NO_MATCH results (2nd, 6th, and 7th after the original Business Partner)
  • Three MAYBE_MATCH links (1st, 8th, and 9th)
  • Three MATCH links (3rd, 4th, and 5th)

Worth noting

These results apply only to the default configuration. The matching configuration always affects the results, so different settings can produce completely different results.
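
The counts for the sample data set follow directly from the numbers above. As a quick check, this short Python sketch reproduces the arithmetic; the variable names are chosen for illustration only.

```python
# Figures taken from the description of the sample data set.
originals = 10                 # Business Partners in the CDQ sample data file
no_match_per_original = 3      # added entries that do not match the original
maybe_match_per_original = 3   # added entries linked as MAYBE_MATCH
match_per_original = 3         # added entries linked as MATCH

added_per_original = (
    no_match_per_original + maybe_match_per_original + match_per_original
)

total_records = originals * (1 + added_per_original)
different = originals + originals * no_match_per_original
duplicates = originals * (maybe_match_per_original + match_per_original)

print(total_records)  # 100
print(different)      # 40
print(duplicates)     # 60
```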

Set up a duplicate monitor

  1. Navigate to the Data Mirror Management app.
  2. Use the Create Data Source button to create a new data source:
    • Name it Example Deduplication Data.
    • Set the Default mapping.
  3. Upload the example data using the Import Data button.

Click the SampleDataForDeduplication.xlsx to download the example data set.

  4. Navigate to the Data Clinic app.
  5. Add a new Data Monitor using the Add New Data Monitor button:
    • Set the data source to Example Deduplication Data.
    • Choose monitor type: Duplicate.
    • Choose configuration: Example Matching Configuration.
    • Click Create New Data Monitor.

Figure 2. Setting Duplicate Guard Monitor

Monitor results

The example data set contains 60 duplicates: 30 MATCH links and 30 MAYBE_MATCH links. Each of the 10 original business partners has three links of each kind.


Step 4: Generate a Deduplication Report

To analyze the Duplicate Guard Monitor results, generate a report. Report generation follows the usual procedure in CDQ apps. Duplicate Guard reports contain the data columns defined in the Duplicate Guard configuration. For analysis purposes, you can also add up to 10 custom columns, each defined by a name and a path to a data field in the original data.

To generate a report:

  1. In the Data Clinic application, click the Reports tab.
  2. Click Generate New Report, then do the following:
    • Provide the report title (for example, Example Report).
    • Choose report type: Duplicate Matching Report.
    • Choose the appropriate data monitor: Example Deduplication Data (only Duplicate monitors are listed).
    • Select file format: Excel.
    • Leave other options as they are.
    • Click Generate.

Figure 3. Generating Duplicate Matching Report

Report availability

After a couple of minutes (depending on the size of the Duplicate Guard Monitor results), the report will be available for download.

Duplicate Guard Report Format

The structure of a Duplicate Guard report relies strictly on the configuration used by the Monitor. The report contains all data fields indicated in that configuration. With the default Duplicate Guard configuration, it includes fields such as the following:

  • Data Source ID - indicates the data source a record comes from
  • External ID - external identifier of a record
  • Matching Status - the status of the match between the pattern record and a duplicate
  • Linkage Status - the status of the link which connects the pattern record with a duplicate
  • Overall Score - the similarity score between the pattern record and a duplicate
  • Data Monitor Id - the identifier of a Duplicate Guard Monitor
  • The list of matching attributes with their original values and similarity scores
  • If the SHOW_MATCHING_EXPLANATION feature is selected, the report contains a Matching Explanation column that explains how the score was calculated. The feature is OFF by default.

Worth noting

Each group of Pattern+Duplicates is separated by a blank (empty) row. In addition, patterns have a light-green background, while duplicates have a white background.
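
Because groups are separated by blank rows, a downloaded report can be split back into groups programmatically. The Python sketch below assumes the rows have already been loaded as dictionaries keyed by column name (for example, with a spreadsheet library of your choice). The helper names and the sample rows, including the External ID values and the empty Matching Status on the pattern row, are made up for illustration and do not come from a real report.

```python
from collections import Counter


def is_blank(row: dict) -> bool:
    """A separator row has no value in any column."""
    return all(value is None or str(value).strip() == "" for value in row.values())


def split_groups(rows: list[dict]) -> list[list[dict]]:
    """Split report rows into groups, using blank rows as separators."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    for row in rows:
        if is_blank(row):
            if current:
                groups.append(current)
                current = []
        else:
            current.append(row)
    if current:
        groups.append(current)
    return groups


def status_counts(group: list[dict]) -> Counter:
    """Count the non-empty Matching Status values within one group."""
    return Counter(
        row["Matching Status"] for row in group if row.get("Matching Status")
    )


# Hypothetical rows, reduced to two of the report columns.
SAMPLE_ROWS = [
    {"External ID": "BP-1", "Matching Status": ""},
    {"External ID": "BP-1-3", "Matching Status": "MATCH"},
    {"External ID": "BP-1-8", "Matching Status": "MAYBE_MATCH"},
    {"External ID": "", "Matching Status": ""},
    {"External ID": "BP-2", "Matching Status": ""},
    {"External ID": "BP-2-4", "Matching Status": "MATCH"},
]

if __name__ == "__main__":
    for group in split_groups(SAMPLE_ROWS):
        print(len(group), dict(status_counts(group)))
```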

