The volume of data involved in the e-discovery process is imposing. Removing exact duplicates prior to review is the most powerful and sensible way to eliminate obviously redundant data. This not only reduces the overall data volume but also ensures that each document needs to be reviewed only once.
In the latest enhancement to the Nextpoint product lineup, including Cloud Preservation, Discovery Cloud, and Trial Cloud, automatic “deduplication” is built in, allowing users to eliminate duplicative data simply and easily via a user-friendly interface. This is a huge efficiency improvement that reduces the time and cost of e-discovery, and we are excited to bring this innovation to our customers.
The two-phase deduplication process.
1. Preventing duplicate uploads.
As an initial step, uploaded files (zip, pst, loose file, etc.) are compared to all previous uploads. If the upload is an exact duplicate of a previous upload, the application will ask you to confirm that you want the duplicative files to be loaded. Frequently, this step alone can prevent large numbers of duplicate documents.
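As a rough illustration of this first phase, the sketch below fingerprints each uploaded container and looks it up among earlier uploads. The use of SHA-256 and the names `container_digest` and `UploadRegistry` are assumptions made for this example; the post does not describe how Nextpoint actually compares uploads.

```python
import hashlib


def container_digest(data: bytes) -> str:
    """Fingerprint an uploaded container (zip, pst, loose file) by its bytes."""
    # Assumption: SHA-256 stands in for whatever comparison the product uses.
    return hashlib.sha256(data).hexdigest()


class UploadRegistry:
    """Hypothetical record of every container uploaded so far."""

    def __init__(self):
        self._seen = {}  # digest -> name of the first upload with that digest

    def check(self, name: str, data: bytes):
        """Return the name of an earlier identical upload, or None if new."""
        digest = container_digest(data)
        previous = self._seen.get(digest)
        if previous is None:
            # First time this exact container has been seen: remember it.
            self._seen[digest] = name
        return previous
```

A caller would prompt for confirmation whenever `check` returns a name rather than `None`, which mirrors the confirmation step described above.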
2. Preventing duplicate documents/files contained in different archives.
Often the same file (an email, document, etc.) has been collected from multiple sources. When this occurs, the upload will slip by Phase 1 because the container (zip/pst/etc.) was physically different from anything previously uploaded. This is by far the most common cause of duplicate documents.
Individual checks occur on the files contained inside the high-level container to search for an exact match*. When an exact match is caught, introduction of the duplicative data is prevented, and the new load is instead linked to the pre-existing copy of the document. The pre-existing document will now indicate that it was loaded in both Batch 1 and Batch 2. Additionally, fields such as location URI and custodian will be merged.
* Metadata from a load file and/or changes made via the web application after the previous load completed (designating, reviewing, changing shortcuts, etc.) will be taken into consideration when making the “exact match” determination. By default, a file hash is employed to identify candidates for “exact matches”. Optionally, this can be expanded to include documents that may have a different file hash but share an Email-Message-ID.
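The second phase can be pictured with a small in-memory model: a file hash identifies candidates, metadata decides whether the candidate is really an exact match, and a match is linked to the existing document rather than stored again. Everything here (`Document`, `DocumentStore`, SHA-256, and the field names) is hypothetical and only meant to make the behavior concrete. The optional Email-Message-ID matching is not modeled.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Document:
    content_hash: str
    metadata: tuple                      # hashable snapshot of the metadata
    batches: list = field(default_factory=list)
    custodians: set = field(default_factory=set)
    locations: set = field(default_factory=set)


class DocumentStore:
    """Hypothetical store that links exact matches instead of duplicating."""

    def __init__(self):
        self._docs = {}  # (content hash, metadata snapshot) -> Document

    def load(self, data: bytes, metadata: dict, batch: str,
             custodian: str, location: str):
        """Return (document, is_new) for one file inside a container."""
        # Assumption: metadata values are hashable (e.g. strings).
        key = (hashlib.sha256(data).hexdigest(),
               tuple(sorted(metadata.items())))
        doc = self._docs.get(key)
        is_new = doc is None
        if is_new:
            doc = Document(content_hash=key[0], metadata=key[1])
            self._docs[key] = doc
        # Whether new or matched, record where this copy came from.
        if batch not in doc.batches:
            doc.batches.append(batch)
        doc.custodians.add(custodian)
        doc.locations.add(location)
        return doc, is_new
```

Loading the same bytes with the same metadata in "Batch 1" and then "Batch 2" yields a single document that lists both batches and both custodians, while the same bytes with different metadata produce a second document.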
Understanding what has been deduplicated.
The batch details screen (choose “Import/Export” from the menu bar in Discovery Cloud or “More” -> “Imports” in Trial Cloud) has been enhanced to provide information on how much and what has been deduplicated.
The main section provides verbose lists of Actions, Errors, and Skipped Rows that occurred during processing. It also provides abbreviated lists of new and duplicate documents encountered during processing, with links to quickly view the full set(s) via our normal search/filter interface.

The Batch Summary section (located in the sidebar) has been enhanced to provide information about the uploaded file, the results and status of processing, links to resulting documents, and the ability to reprocess the originally uploaded file completely.

Handling the duplicates that make it through.
The “same file” can make it into the system through a few different channels.
- A user deliberately disabled deduplication during an upload or selected the “reprocess file” option on an entire batch.
- The same document was attached to multiple emails. The meaning of a document can vary wildly based on its context, so we consider an email and its attachment(s) as a single unit during the deduplication process.
- The “same file” by content hash will be allowed into the system if the associated metadata is different. This could happen due to differences specified in load files, changes made to metadata in the system following uploads, etc.
In all of these situations, we display related files in the sidebar (of Discovery Cloud) to allow the human reviewer to determine whether the associated metadata does or does not warrant further deduplication.
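To see why the same attachment on two different emails is not treated as a duplicate, consider fingerprinting the email and its attachments together as one unit. The function below is only a sketch of that idea: `family_digest` is a hypothetical name, and SHA-256 and the way the parts are combined are assumptions, not a description of the product's internals.

```python
import hashlib
from typing import Sequence


def family_digest(email_body: bytes, attachments: Sequence[bytes]) -> str:
    """Fingerprint an email together with its attachments as a single unit."""
    combined = hashlib.sha256()
    # Hash each part separately first so that part boundaries are unambiguous.
    combined.update(hashlib.sha256(email_body).digest())
    for attachment in attachments:
        combined.update(hashlib.sha256(attachment).digest())
    return combined.hexdigest()


report = b"quarterly report contents"
sent_to_legal = family_digest(b"Please review before filing.", [report])
sent_to_sales = family_digest(b"FYI, numbers attached.", [report])
resent_to_legal = family_digest(b"Please review before filing.", [report])
```

Here `sent_to_legal` and `sent_to_sales` differ even though the attachment is byte-for-byte identical, while `resent_to_legal` equals `sent_to_legal` because the whole unit matches.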
Customization & Disabling Deduplication
Deduplication can be disabled on any individual upload/reprocessing request. It can also be disabled at the instance level via “Settings” => “General Settings” => “Deduplication”. The settings section also provides the ability to use Email-Message-ID in the “exact match” determination.

When will it be available?
This functionality is available immediately in both Discovery and Trial Clouds.
As always, custom support options are available on request to address unique deduplication needs. We’re excited for these new improvements and the positive impact they’ll have for our customers going forward.