This tutorial explains how to configure, activate, and analyze results from a Duplicate Guard Monitor in CDQ.
Learning Goals
In this tutorial, you will learn how to:
- set up a deduplication configuration
- activate a duplicate monitor
- generate and interpret deduplication reports
CDQ supports a default matching configuration for Duplicate Monitors. It consists of three search attributes and five matching attributes.
The default configuration is designed to systematically identify duplicate records based on specified attributes and matching criteria, using various cleaning and comparison techniques to ensure accuracy.
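To make "cleaning and comparison" more concrete, the following is a minimal sketch of cleaning two values and then comparing them with a q-gram Dice similarity (q = 3). It is a generic approximation, not CDQ's implementation: the functions `clean`, `qgrams`, and `dice_similarity` are hypothetical names, and the exact behavior of the cleaners and comparators named in the default configuration (for example the BASIC tokenizer) is not specified in this tutorial.

```python
import string


def clean(value: str) -> str:
    """Lower-case the value and strip ASCII punctuation.

    Assumption: a rough stand-in for the LowerCaseNormalizer and
    PunctuationsCleaner entries of the default configuration.
    """
    lowered = value.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))


def qgrams(value: str, q: int = 3) -> set[str]:
    """Return the set of character q-grams of the value."""
    if len(value) < q:
        # Values shorter than q are treated as a single gram.
        return {value} if value else set()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def dice_similarity(a: str, b: str, q: int = 3) -> float:
    """Dice coefficient over q-gram sets: 2 * |A & B| / (|A| + |B|)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a and not grams_b:
        return 1.0  # two empty values are treated as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


# Hypothetical business partner names.
left = clean("ACME Corp.")
right = clean("Acme Corp")
score = dice_similarity(left, right)
print(left, right, score)
```

After cleaning, both hypothetical names reduce to the same string, so differences in casing and punctuation no longer lower the similarity score.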
Key components
The JSON example further below is the default configuration for duplicate matching, which a Duplicate Monitor can use to identify and handle duplicate records. Its key components are:
- "name": "Default Matching Configuration": the name of the configuration
- "configuration": the structure of the configuration, which contains at least two objects:
  - candidateSearchConfiguration
  - generalMatchingConfiguration
- "candidateSearchConfiguration": defines how candidate records are searched for potential duplicates and includes:
  - "maxCandidates": 10: specifies the maximum number of candidate records to consider (in this case, 10)
  - "searchAttributes": an array of attributes used to search for candidates, each defined by a JSON path:
    - {"jsonPath": "$.names[0].value"}: searches using the value of the first name
    - {"jsonPath": "$.addresses[0].country.shortName"}: searches using the country ISO code in the first address
    - {"jsonPath": "$.addresses[0].thoroughfares[0].value"}: searches using the thoroughfare value in the first address
- "generalMatchingConfiguration": defines the criteria for matching records and includes:
  - "threshold": 0.85: the score above which records are considered a definite match
  - "thresholdMaybe": 0.65: the score above which records are considered a possible match; the score must be below the threshold value (in this case, between 0.65 and 0.85)
  - "matchingAttributes": an array of attributes used for matching, each with:
    - "name": descriptive name of the attribute being matched; this name will be used in reports
    - "jsonPath": path to the attribute in the JSON structure
    - "cleaners": list of cleaning operations applied to the attribute before matching (e.g., LowerCaseNormalizer, PunctuationsCleaner)
    - "high" and "low": thresholds for matching scores specific to the attribute
    - "comparator": method used to compare the attribute values (e.g., QGramComparator, ExactComparator, Levenshtein)
Example default configuration:
{
  "name": "Default Matching Configuration",
  "configuration": {
    "candidateSearchConfiguration": {
      "maxCandidates": 10,
      "searchAttributes": [
        { "jsonPath": "$.names[0].value" },
        { "jsonPath": "$.addresses[0].country.shortName" },
        { "jsonPath": "$.addresses[0].thoroughfares[0].value" }
      ]
    },
    "generalMatchingConfiguration": {
      "threshold": 0.85,
      "thresholdMaybe": 0.65,
      "matchingAttributes": [
        {
          "name": "Business Partner Name",
          "jsonPath": "businessPartner.names[0].value",
          "cleaners": [
            { "name": "LowerCaseNormalizer" },
            { "name": "PunctuationsCleaner" },
            {
              "name": "LegalFormCleaner",
              "parameters": [
                { "name": "configProperty", "value": "$.addresses[0].country.shortName" }
              ]
            }
          ],
          "high": 0.8,
          "low": 0.3,
          "comparator": {
            "name": "QGramComparator",
            "parameters": [
              { "name": "tokenizer", "value": "BASIC" },
              { "name": "formula", "value": "DICE" },
              { "name": "q", "value": "3" }
            ]
          }
        },
        {
          "name": "Business Partner Country",
          "jsonPath": "businessPartner.addresses[0].country.shortName",
          "cleaners": [ { "name": "LowerCaseNormalizer" } ],
          "high": 0.5,
          "low": 0,
          "comparator": { "name": "ExactComparator" }
        },
        {
          "name": "Business Partner City",
          "jsonPath": "businessPartner.addresses[0].localities[0].value",
          "cleaners": [
            { "name": "LowerCaseNormalizer" },
            { "name": "PunctuationsCleaner" }
          ],
          "high": 0.6,
          "low": 0.2,
          "comparator": { "name": "Levenshtein" }
        },
        {
          "name": "Business Partner Street",
          "jsonPath": "businessPartner.addresses[0].thoroughfares[0].value",
          "cleaners": [
            { "name": "LowerCaseNormalizer" },
            { "name": "PunctuationsCleaner" }
          ],
          "high": 0.7,
          "low": 0.3,
          "comparator": { "name": "Levenshtein" }
        },
        {
          "name": "Business Partner Postal Code",
          "jsonPath": "businessPartner.addresses[0].postCodes[0].value",
          "cleaners": [ { "name": "DigitsOnlyCleaner" } ],
          "high": 0.6,
          "low": 0.3,
          "comparator": { "name": "ExactComparator" }
        }
      ]
    },
    "scopedMatchingConfiguration": {}
  }
}
Check your CDQ apps account.
- Log into the CDQ Cloud Apps

No account?
- If your organization is already a CDQ customer, ask your internal point of contact to create a dedicated CDQ account for you, or request an account. Account details will be sent by email.
- If your organization is not yet a CDQ customer, please contact us to get started.
To create the default configuration for the Duplicate Guard Monitor, do the following:
- Navigate to the Duplicate Guard Configurator app.

- Click the Create New Configuration button.
- Provide a Configuration Name and click Create.

The default configuration is created. Browse it and adjust it if necessary (for this example, leave it as is).
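If you do adjust the configuration, a quick consistency check can catch obvious mistakes before the monitor runs. The sketch below is a hypothetical helper, not part of any CDQ tool. Only the first rule (thresholdMaybe below threshold) comes from this tutorial; the other two rules are assumptions that seem reasonable for this structure.

```python
def check_matching_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    inner = config["configuration"]
    general = inner["generalMatchingConfiguration"]

    # Stated in the tutorial: thresholdMaybe must be below threshold.
    if not general["thresholdMaybe"] < general["threshold"]:
        problems.append("thresholdMaybe must be below threshold")

    # Assumption: an attribute's low score should not exceed its high score.
    for attribute in general["matchingAttributes"]:
        if attribute["low"] > attribute["high"]:
            problems.append(f"{attribute['name']}: low is greater than high")

    # Assumption: candidate search needs at least one search attribute.
    if not inner["candidateSearchConfiguration"]["searchAttributes"]:
        problems.append("no search attributes defined")

    return problems


# Trimmed copy of the default configuration, reduced to the checked fields.
example_config = {
    "name": "Default Matching Configuration",
    "configuration": {
        "candidateSearchConfiguration": {
            "maxCandidates": 10,
            "searchAttributes": [{"jsonPath": "$.names[0].value"}],
        },
        "generalMatchingConfiguration": {
            "threshold": 0.85,
            "thresholdMaybe": 0.65,
            "matchingAttributes": [
                {"name": "Business Partner Name", "high": 0.8, "low": 0.3},
                {"name": "Business Partner Country", "high": 0.5, "low": 0},
            ],
        },
    },
}

print(check_matching_config(example_config))  # prints []
```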
A Duplicate Monitor asynchronously processes data stored in a data mirror (which contains data sources, previously known as storage) to find duplicated business partners. A Duplicate Monitor can scan multiple data sources to detect duplicates. However, each data source can have only one dedicated Duplicate Monitor.
Required attributes:
- Duplicate Matching Configuration (see above)
- Selection of data sources within one data mirror
To simplify understanding, a sample data set is provided with 100 Business Partners:
- 40 are different (according to the default configuration)
- 60 are duplicates
The basis for the set was the CDQ sample data file, which contains 10 different Business Partners. We extended it by adding 9 additional entries for each existing Business Partner. As a result, each original Business Partner has the following set of relationships with those additional 9:
- Three NO_MATCH results (2nd, 6th, and 7th after the original Business Partner)
- Three MAYBE_MATCH links (1st, 8th, and 9th)
- Three MATCH links (3rd, 4th, and 5th)
Such results occur only with the default configuration. The applied matching configuration always affects the results; different settings can produce completely different results.
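The effect of the settings can be illustrated with a small sketch that maps an overall score to a result using the two thresholds from the default configuration. The function `classify` is a hypothetical name, the score 0.7 is made up, and treating a score exactly equal to a threshold as reaching it is an assumption (the tutorial only says "above").

```python
def classify(score: float, threshold: float = 0.85,
             threshold_maybe: float = 0.65) -> str:
    """Map an overall similarity score to a matching result.

    Assumption: boundaries are inclusive (score >= threshold).
    """
    if score >= threshold:
        return "MATCH"
    if score >= threshold_maybe:
        return "MAYBE_MATCH"
    return "NO_MATCH"


score = 0.7  # hypothetical overall score for one pair of records

print(classify(score))                        # default thresholds
print(classify(score, threshold=0.7))         # lower definite-match threshold
print(classify(score, threshold_maybe=0.75))  # higher possible-match threshold
```

With the default thresholds the pair is a MAYBE_MATCH; lowering the threshold to 0.7 turns it into a MATCH, and raising thresholdMaybe to 0.75 turns it into a NO_MATCH.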
- Navigate to the Data Mirror Management app.
- Use the Create Data Source button to create a new data source:
- Name it Example Deduplication Data.
- Set the Default mapping.
- Upload the example data using the Import Data button.
Click SampleDataForDeduplication.xlsx to download the example data set.
- Navigate to the Data Clinic app.
- Add a new Data Monitor using the Add New Data Monitor button:
- Set the data source to Example Deduplication Data.
- Choose monitor type: Duplicate.
- Choose configuration: Example Matching Configuration.
- Click Create New Data Monitor.

The example data set contains 60 duplicates: 30 MATCH links and 30 MAYBE_MATCH links. Each original business partner has both kinds of links (3 MATCH and 3 MAYBE_MATCH).
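These counts follow directly from the layout of the sample data set described earlier. The short sketch below redoes the arithmetic; the variable names are illustrative only.

```python
from collections import Counter

ORIGINALS = 10  # business partners in the CDQ sample data file

# Result of each of the 9 added entries relative to its original,
# keyed by position after the original (as listed in the tutorial).
OUTCOME_BY_POSITION = {
    1: "MAYBE_MATCH", 2: "NO_MATCH", 3: "MATCH",
    4: "MATCH", 5: "MATCH", 6: "NO_MATCH",
    7: "NO_MATCH", 8: "MAYBE_MATCH", 9: "MAYBE_MATCH",
}

per_original = Counter(OUTCOME_BY_POSITION.values())

total_records = ORIGINALS * (1 + len(OUTCOME_BY_POSITION))
match_links = ORIGINALS * per_original["MATCH"]
maybe_links = ORIGINALS * per_original["MAYBE_MATCH"]
duplicates = match_links + maybe_links
different = total_records - duplicates

print(total_records, match_links, maybe_links, duplicates, different)
# prints: 100 30 30 60 40
```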
To analyze Duplicate Guard Monitor results, generate a report. Report generation follows the usual procedure in CDQ apps. Duplicate Guard reports contain the data columns defined in the Duplicate Guard configuration. For analysis purposes, you can also add up to 10 custom columns by providing, for each one, a name and the path to a data field in the original data.
To generate a report:
- In the Data Clinic application, click the Reports tab.
- Click Generate New Report, then do the following:
- Provide the report title (for example, Example Report).
- Choose report type: Duplicate Matching Report.
- Choose the appropriate data monitor: Example Deduplication Data (only Duplicate monitors are listed).
- Select file format: Excel.
- Leave other options as they are.
- Click Generate.

After a couple of minutes (depending on the size of the Duplicate Guard Monitor results), the report will be available for download.
The Duplicate Guard report structure relies strictly on the configuration used by the Monitor itself. It contains all data fields indicated in the configuration. With the default Duplicate Guard configuration, the report provides fields such as the following:
- Data Source ID: the data source from which a record comes
- External ID: external identifier of a record
- Matching Status: the status of the match between the pattern record and a duplicate
- Linkage Status: the status of the link which connects the pattern record with a duplicate
- Overall Score: the similarity score between the pattern record and a duplicate
- Data Monitor Id: the identifier of a Duplicate Guard Monitor
- The list of matching attributes with their original values and similarity scores
- If the SHOW_MATCHING_EXPLANATION feature is selected, the report contains a Matching Explanation column that explains how the score was calculated. The feature is set to OFF by default.
Each group of a pattern and its duplicates is separated from the next by a blank (empty) row. Moreover, patterns have a light-green background, while duplicates have a white background.
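When processing a downloaded report programmatically, the blank separator rows can be used to recover the groups. The sketch below assumes the report rows have already been read into a list of lists by a spreadsheet reader of your choice; the function `split_groups` and the sample values are hypothetical. It does not try to tell patterns from duplicates, since the tutorial identifies patterns only by their background color.

```python
def is_blank(row: list) -> bool:
    """A row is blank when every cell is empty or missing."""
    return all(cell is None or str(cell).strip() == "" for cell in row)


def split_groups(rows: list[list]) -> list[list[list]]:
    """Split report rows into groups separated by blank rows."""
    groups, current = [], []
    for row in rows:
        if is_blank(row):
            if current:  # close the group collected so far
                groups.append(current)
                current = []
        else:
            current.append(row)
    if current:  # last group has no trailing blank row
        groups.append(current)
    return groups


# Hypothetical rows: [External ID, Overall Score]
rows = [
    ["BP-001", None],
    ["BP-002", 0.91],
    [None, None],
    ["BP-010", None],
    ["BP-011", 0.72],
    ["BP-012", 0.88],
]

groups = split_groups(rows)
print(len(groups), [len(group) for group in groups])
```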