Deduplication

Contents

  1. What is Deduplication?

  2. Deduplication Methods

  3. Getting Started

  4. Deduplication & Datasets

What is Deduplication?

Deduplication is the act of identifying and removing duplicate records as data is ingested into Osmos. For Pipelines, Osmos supports four deduplication methods: you can deduplicate at the file level, at the record level (either across all historical data or within a single file), or not at all. Deduplication is not time-bound; it is based exclusively on data content. Osmos identifies duplicates by hashing. A hash is similar to a "fingerprint" or "checksum": identical content always produces an identical hash, so matching hashes identify duplicate files. For record-level deduplication, the hash is computed for each individual row.
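As a minimal sketch of the fingerprint idea (SHA-256 is used purely for illustration; the specific hash function Osmos uses is not documented here):

```python
import hashlib

# A record hash acts like a fingerprint: identical content always yields
# the same digest, so short digests can be compared instead of full rows.
# The row content below is illustrative, not an Osmos schema.
row = "cust_001,Jane Doe,jane@example.com"
fingerprint = hashlib.sha256(row.encode("utf-8")).hexdigest()
print(fingerprint)  # the same row content produces the same digest every time
```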

Note: Uploaders perform no programmatic deduplication, since they are designed as a “human in the loop” operation. Uploaders can be configured to display upload history with basic information such as filename and upload timestamp, along with the ability to download the source file, to avoid confusion.

Deduplication Methods

File-level Deduplication

Osmos creates a hash of a given file, covering both file contents and metadata, and compares it to an index of previously imported hashes; if any is an exact match, the new file is skipped. For most file types, changing the filename alone is enough to change the metadata, and therefore the hash.
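A minimal sketch of this comparison, assuming SHA-256 and a simple in-memory index (the actual hash function, index storage, and the exact metadata fields Osmos includes are not specified here):

```python
import hashlib
from pathlib import Path

seen_file_hashes: set[str] = set()  # stands in for the index of prior imports

def file_fingerprint(path: Path) -> str:
    """Hash the file's metadata (here, just its name) together with its contents."""
    h = hashlib.sha256()
    h.update(path.name.encode("utf-8"))  # metadata: a new filename changes the hash
    h.update(path.read_bytes())          # contents
    return h.hexdigest()

def should_skip(path: Path) -> bool:
    """Skip the file when its fingerprint exactly matches a previous import."""
    fp = file_fingerprint(path)
    if fp in seen_file_hashes:
        return True
    seen_file_hashes.add(fp)
    return False
```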

Record-level Deduplication across all historical data

Osmos creates a hash of each row within a file and compares it to an index of previously imported row hashes; if any is an exact match, the new row is skipped as a duplicate. In other words, an identical record that was already processed in a previous Pipeline run will not be processed again from the current file, and neither will duplicate records within the same file.
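Sketched the same way, with the caveat that the index here persists across files and runs (in practice it would live in durable storage, not a local set):

```python
import hashlib

historical_row_hashes: set[str] = set()  # persists across files and Pipeline runs

def dedupe_rows(rows):
    """Yield only rows whose hash has never been seen in any prior file or run."""
    for row in rows:
        fp = hashlib.sha256(row.encode("utf-8")).hexdigest()
        if fp in historical_row_hashes:
            continue  # exact match of a previously processed row: skip it
        historical_row_hashes.add(fp)
        yield row
```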

Record-level Deduplication within individual files

Osmos creates a hash of each row within a file and compares it to the hashes of the other rows within the same file; if any is an exact match, the new row is skipped as a duplicate.
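The within-file variant is the same comparison with a narrower index: the set of seen hashes is reset for every file, so no history from other files or runs is consulted (again a sketch, assuming SHA-256):

```python
import hashlib

def dedupe_within_file(rows):
    """Yield rows that are unique within this one file; history is ignored."""
    seen: set[str] = set()  # fresh for every file, unlike the historical index
    for row in rows:
        fp = hashlib.sha256(row.encode("utf-8")).hexdigest()
        if fp in seen:
            continue  # duplicates an earlier row in the same file: skip it
        seen.add(fp)
        yield row
```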

No Deduplication

No deduplication is performed, for either files or records; all rows of all files are processed. This setting is typically used during the initial configuration and testing of a new Pipeline, and is rarely used in a production setting.

Getting Started

The deduplication method is set on the Source connector and defaults to file-level deduplication. Under Source Connector > Show Advanced Options you will find the available deduplication methods.

Scenario 1

  1. Customer ABC uploads a customer file of 100k records.

  2. Osmos runs the job and the records are processed.

  3. Customer ABC then uploads the same file a second time. In other words, the identical file is loaded twice.

For Pipelines

File-level deduplication

The second file is recognized as a duplicate file and is skipped.

Record-level Deduplication across all historical data

Each record from the second file is recognized as a duplicate of a previously processed record and is skipped.

Record-level Deduplication within individual files

Rows within the second file that duplicate other rows in the same file are skipped, but the file itself is treated and processed as a new file.

No Deduplication

No deduplication actions are taken, and all records are processed.
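To make the record-level outcome concrete, here is a toy replay of this scenario using the hashing sketch from earlier (the 100k stand-in rows and SHA-256 are assumptions):

```python
import hashlib

historical: set[str] = set()

def process(rows):
    """Return the rows that survive record-level deduplication across history."""
    kept = []
    for row in rows:
        fp = hashlib.sha256(row.encode("utf-8")).hexdigest()
        if fp not in historical:
            historical.add(fp)
            kept.append(row)
    return kept

file_contents = [f"cust_{i},record {i}" for i in range(100_000)]  # stand-in data
print(len(process(file_contents)))  # first upload: 100000 rows processed
print(len(process(file_contents)))  # same file again: 0 rows, all duplicates
```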

For Uploaders

Osmos assumes that the uploading user has the most up-to-date information and ingests any submitted files.

Scenario 2

  1. Customer ABC uploads a customer file of 100k records.

  2. The next week, Customer ABC uploads a new file that contains the same records. In other words, it is a new file with the same name and the same data.

  3. Osmos runs the job and the records are processed.

For Pipelines

File-level deduplication

The second file has new metadata and so is processed as a new file (even though its data content is identical to content that has previously been processed).

Record-level Deduplication across all historical data

Each record from the second file is recognized as a duplicate of a previously processed record and is skipped.

Record-level Deduplication within individual files

Rows within the second file that duplicate other rows in the same file are skipped, but the file itself is treated and processed as a new file.

No Deduplication

No deduplication actions are taken, and all records are processed.
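The contrast between the file-level and record-level outcomes comes down to what each hash covers. A sketch, assuming the file hash folds in metadata such as an upload timestamp (the exact metadata fields are not specified here) while row hashes cover content only:

```python
import hashlib

def file_fingerprint(name: str, uploaded_at: str, contents: bytes) -> str:
    # Assumption for illustration: filename and upload time are part of the hash.
    h = hashlib.sha256()
    h.update(name.encode("utf-8"))
    h.update(uploaded_at.encode("utf-8"))
    h.update(contents)
    return h.hexdigest()

data = b"cust_001,Jane Doe\ncust_002,John Roe\n"
week1 = file_fingerprint("customers.csv", "2024-01-01T09:00", data)
week2 = file_fingerprint("customers.csv", "2024-01-08T09:00", data)
print(week1 == week2)  # False: new metadata, so file-level dedup sees a new file

row = b"cust_001,Jane Doe"
print(hashlib.sha256(row).hexdigest() == hashlib.sha256(row).hexdigest())
# True: row hashes cover content only, so record-level dedup still skips the row
```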

For Uploaders

Osmos assumes that the uploading user has the most up-to-date information and ingests any submitted files.

Deduplication & Datasets

In a multi-stage workflow, it is possible that a record is not recognized as a duplicate on ingestion but is recognized as one after transformation. It is therefore important to review the deduplication configuration at every step.

Dataset metadata is assigned to a record after processing, so it is not part of deduplication on initial ingestion. If metadata is needed to make a record unique, it should be added as its own field at the time of ingestion. In a multi-stage workflow, dataset metadata automatically assigned on ingest can be used in later steps as a data component and manipulated just like any other field.

One important thing to remember about deduplication is that there can be relevant information in the very fact that a record contains the same content as an earlier one. For example, a table may have a primary key and a Dataset configured to upsert; when the data updates, it is important to capture the “date updated” metadata, even if none of the data content for that record changes.
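As a sketch of why an explicit field matters (the date_updated field name is hypothetical):

```python
import hashlib

def row_fingerprint(row: str) -> str:
    return hashlib.sha256(row.encode("utf-8")).hexdigest()

record = "cust_001,Jane Doe"
# Without a distinguishing field, a re-sent record hashes identically and
# would be dropped by record-level deduplication:
print(row_fingerprint(record) == row_fingerprint(record))  # True: a duplicate
# Folding a date_updated value into the record at ingestion changes the hash,
# so the newer version survives deduplication and can drive an upsert later:
print(row_fingerprint(record + ",2024-01-01")
      == row_fingerprint(record + ",2024-01-08"))  # False: treated as unique
```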

Note: In the Datasets Query Builder, "Group By" can also be used to combine multiple rows into a single row. This is an aggregation rather than a deduplication (which is why it was not covered above), but it often comes up as an alternative to deduplication, so it is worth noting.
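A small illustration of the difference, in plain Python rather than the Query Builder itself:

```python
from collections import defaultdict

rows = [("cust_001", 10), ("cust_001", 10), ("cust_001", 5)]

# Deduplication drops exact copies but keeps every distinct row:
deduped = list(dict.fromkeys(rows))
print(deduped)  # [('cust_001', 10), ('cust_001', 5)]

# A Group By aggregation combines many rows into one, merging their values:
totals: dict[str, int] = defaultdict(int)
for customer, amount in rows:
    totals[customer] += amount
print(dict(totals))  # {'cust_001': 25}
```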
