RAW file recovery and duplicates

By | April 11, 2021

Recovery is resulting in duplicate photos

One of the drawbacks of ‘raw recovery’ is that files are generally recovered with auto-generated, meaningless filenames. To conclusively identify a file you have to open it, although initial culling can be done by sorting by size, or with the help of a tool that uses EXIF data to sort files by creation date, the location where an image was taken, and so on.

It is quite possible for a raw scanner to detect multiple copies of the same file. An often-heard complaint from people who have used a raw file recovery tool such as PhotoRec or JpegDigger is that the same photo turns up several times. This makes the job of going through the recovered files even more frustrating.

I am adding a feature that helps JpegDigger detect duplicates and skip them.

Duplicate file detection is enabled by default.

JpegDigger preventing duplicates

Identical file sizes are not sufficient to detect duplicates, and JpegDigger cannot visually compare photos the way we humans can. A byte-by-byte file comparison takes too much time, and it is simply impractical to compare each file to all those found previously.

So my idea is to compute a CRC32 for each file detected and keep these values in a table (an array). When a file is detected, its CRC32 is looked up in the table. If it is not present, we recover the file (or add it to the list) and add the CRC32 value to the table. If we later run into a duplicate of this file, its CRC32 will already be in the table, so we skip it.
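The idea can be sketched in a few lines of Python. This is only an illustration, not JpegDigger's actual code: the names `crc32_of` and `filter_duplicates` are made up for the example, the files are assumed to be available as in-memory byte strings, and a Python `set` stands in for the array so that lookups stay fast.

```python
import zlib


def crc32_of(data: bytes) -> int:
    """Return the CRC32 of a file's contents as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


def filter_duplicates(candidates):
    """Keep the first file seen for each CRC32 value and skip the rest.

    `candidates` is an iterable of (name, data) pairs, in the order the
    scanner found them. Returns the names of the files that would be kept.
    """
    seen = set()   # CRC32 values of files already accepted
    unique = []    # names of files we would recover
    for name, data in candidates:
        checksum = crc32_of(data)
        if checksum in seen:
            continue  # same CRC32 as an earlier file: treat as duplicate
        seen.add(checksum)
        unique.append(name)
    return unique


found = [
    ("f0001.jpg", b"photo A bytes"),
    ("f0002.jpg", b"photo B bytes"),
    ("f0003.jpg", b"photo A bytes"),  # second copy of the first file
]
kept = filter_duplicates(found)
```

Here `kept` contains only the first two names; the third file has the same CRC32 as the first and is skipped.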

Collisions are possible with CRC32, but I think the chance is sufficiently small to go with it anyway. Computing a hash takes CPU time, and in general safer hash methods require more of it. Duplicate file detection is enabled by default but can be disabled in the advanced options if desired.
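To get a feel for how small that chance is, the usual birthday approximation can be used. The sketch below assumes CRC32 values behave like uniformly random 32-bit numbers, which is an idealization, and the function name `collision_probability` is made up for this example. It estimates the probability that at least two different files among `n_files` share a hash.

```python
import math


def collision_probability(n_files: int, hash_bits: int = 32) -> float:
    """Birthday approximation: p = 1 - exp(-n(n-1) / (2 * 2**bits)).

    Assumes hash values are uniformly distributed, which is only an
    approximation for CRC32 on real files.
    """
    space = 2 ** hash_bits
    exponent = -n_files * (n_files - 1) / (2 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate for tiny x
    return -math.expm1(exponent)


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} files: {collision_probability(n):.4%}")
```

Under this assumption the chance of any collision is roughly 1% for 10,000 unique files and grows quickly beyond that, so the trade-off depends on how many files a scan is expected to turn up.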

Probability of hash collisions

Limitations

JpegDigger can only detect duplicates that are binary identical. If there’s even a one-bit difference between two files, both will pass as unique files. I have seen, for example, both an intact file and a copy with minor corruption survive a format. If JpegDigger is scanning with the skip corrupt files option, it may filter out the corrupt file. If, however, it is run with the repair option enabled, both files will pass.
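A small Python experiment illustrates this limitation. The data is synthetic (a JPEG-like header followed by filler bytes, not a real photo) and the helper `flip_bit` is made up for the example. Flipping a single bit leaves the size unchanged but changes the CRC32, so the two versions would not be recognized as duplicates.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Synthetic stand-in for a photo: JPEG-style marker bytes plus filler
original = b"\xff\xd8\xff\xe0" + bytes(range(256)) * 4
damaged = flip_bit(original, 1000)

same_size = len(original) == len(damaged)
same_crc = zlib.crc32(original) == zlib.crc32(damaged)
print("same size:", same_size)
print("same CRC32:", same_crc)
```

The sizes match while the checksums differ, which is also why comparing file sizes alone is not enough to tell duplicates from near-duplicates.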

I have now also created a (too long) video on the topic:
