Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees

Lillis, David; Breitinger, Frank; Scanlon, Mark

Publication Date:  October 2017

Publication Name:  9th EAI International Conference on Digital Forensics and Cybercrime (ICDF2C)

Abstract:   Perhaps the most common task encountered by digital forensic investigators consists of searching through a seized device for pertinent data. Frequently, an investigator will be in possession of a collection of known-illegal files (e.g. a collection of child pornographic images) and will seek to find whether copies of these are stored on the seized drive. Traditional hash matching techniques can efficiently find files that precisely match. However, this will fail in the case of merged files, embedded files, partial files, or if a file has been changed in any way. In recent years, approximate matching algorithms have shown significant promise in the detection of files that have a high bytewise similarity. This paper focuses on MRSH-v2. A number of experiments were conducted using realistic hardware and document collections of realistic scale. Hierarchical Bloom filter trees were used to dramatically reduce the quantity of pairwise comparisons that must be made between known-illegal files and files on the seized disk. The experiments demonstrate substantial speed gains over using the original MRSH-v2, while maintaining effectiveness.

Download:

Download Paper as PDF

BibTeX Entry:


      @inproceedings{lillis2017bloomfiltertrees,
author={Lillis, David and Breitinger, Frank and Scanlon, Mark},
title="{Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees}",
booktitle="{9th EAI International Conference on Digital Forensics and Cybercrime (ICDF2C)}",
year="2017",
month="10",
publisher={Springer},
abstract="Perhaps the most common task encountered by digital forensic investigators consists of searching through a seized device for pertinent data. Frequently, an investigator will be in possession of a collection of "known-illegal" files (e.g. a collection of child pornographic images) and will seek to find whether copies of these are stored on the seized drive. Traditional hash matching techniques can efficiently find files that precisely match. However, this will fail in the case of merged files, embedded files, partial files, or if a file has been changed in any way. In recent years, approximate matching algorithms have shown significant promise in the detection of files that have a high bytewise similarity. This paper focuses on MRSH-v2. A number of experiments were conducted using realistic hardware and document collections of realistic scale. Hierarchical Bloom filter trees were used to dramatically reduce the quantity of pairwise comparisons that must be made between known-illegal files and files on the seized disk. The experiments demonstrate substantial speed gains over using the original MRSH-v2, while maintaining effectiveness."
}