Some thoughts on signature based file recovery

By | March 23, 2021

But first, file system based recovery

I am a ‘Windows person’ and deal with data recovery on file systems that were for the most part designed by Microsoft. As a consequence I can predict with reasonable certainty the chances of successful data recovery after you formatted your NTFS drive, because I understand what such a format does and which file system metadata survives. Today however I met someone online who reformatted his HFS+ file system to HFS+. And I know very little about HFS+.

So I did some digging and came across articles like this, and learned that the prospects of file system based recovery aren’t very good. File system based recovery scans the drive for metadata about files: their names, creation dates, sizes, locations, etc. Using all that we can recover a complete file with all its original attributes, such as its filename and the folder it was stored in. This is the ideal situation: recovery complete with filenames and folder structure.

If this file system data was lost, we need to resort to what’s called raw file recovery.

Raw or signature based recovery.

Most file types start with a header containing some ‘magic bytes’. Magic bytes are a sequence of bytes that you’ll find in every file of that file type. For example, MP3 files (at least those that begin with an ID3v2 tag) start with the bytes 0x49 0x44 0x33, which is “ID3” in ASCII. JPEG files start with 0xFF 0xD8 0xFF. One difference between those two file types that is relevant to raw scans is that JPEG files also have such a signature at the end of the file, while MP3 files don’t. In fact many file types don’t.
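As a small illustration of what ‘checking for magic bytes’ means, here is a minimal Python sketch. The table only holds the two signatures mentioned above, and the names SIGNATURES and identify are made up for this example; real recovery tools ship far larger tables.

```python
# Illustrative signature table (hypothetical name); only the two
# file types discussed in this post are listed.
SIGNATURES = {
    "jpeg": bytes([0xFF, 0xD8, 0xFF]),
    "mp3": bytes([0x49, 0x44, 0x33]),  # ASCII "ID3"
}

def identify(data: bytes):
    """Return the name of the first signature matching the start of data, else None."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None

print(identify(b"\xff\xd8\xff\xe0 rest of a JPEG"))  # jpeg
print(identify(b"ID3\x03 rest of an MP3"))           # mp3
print(identify(b"just some data"))                   # None
```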

JPEG end of file marker observed directly on disk with hex editor

If we snoop around in the raw disk data we can observe the JPEG end of file marker, 0xFF 0xD9.

Which brings us to an interesting question: how can we determine the end of a file? With the file system based method we know the file size, but when recovering data with a signature scan or raw scan we need an alternative.

Detecting the end of a file when raw scanning

In general there are a few methods:

  1. The raw scan tool is file type aware; it knows the structure of a file type so it can ‘parse’ it. For example, the header of the file type may include information on the size of different components of the file and maybe even the file size. In this case we can determine the exact file size.
  2. The file type has a start and an end magic byte signature. Looking at JPEG again, we can detect its start and its end. This may seem simple enough, but it already becomes more complicated because a JPEG can embed a thumbnail, which is itself a complete JPEG with its own start and end signature.
    This method also does not take file fragmentation into account. Assume, for example, that the file is in two fragments. While 0xFF 0xD9 has special meaning in a JPEG, it does not in other file types. So if a file located after the first fragment of the JPEG contains 0xFF 0xD9, the recovery tool will truncate the JPEG it is recovering at that point.
  3. Assume one file ends where the next starts. This is a commonly used strategy. In case we’re recovering an MP3, which has a start but no end signature, we assume we reached the end of the file as soon as we detect the next file. The problem is this: although MP3 files start with the bytes 0x49 0x44 0x33, this is not a unique sequence. As a consequence the scanner may come to the conclusion that it found the start of an MP3 file even though it is ‘just some data’ in another file. So suppose we’re recovering an MP4 video file: 0x49 0x44 0x33 does not have to signify the start of an MP3, it might just as well be encoded video data.
  4. I guess most scanners combine methods 2 and 3. If both methods fail, the scanner will just keep adding data to the file we’re recovering, which causes the often observed effect of huge files as a result of raw recovery. Many scanners therefore allow you to set a maximum file size; once this threshold is reached the file is closed and saved.
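The end-of-file strategies from this list can be sketched in a few lines of Python. This is only a minimal illustration, not how R-Studio or any other tool actually works: the disk is assumed to be an in-memory bytes object, the tables hold just the two file types used as examples in this post, and the names HEADERS, FOOTERS, find_next_header and carve_one are made up for the sketch.

```python
JPEG_SOI = b"\xff\xd8\xff"   # JPEG start signature
JPEG_EOI = b"\xff\xd9"       # JPEG end signature
MP3_ID3 = b"ID3"             # 0x49 0x44 0x33

HEADERS = {"jpg": JPEG_SOI, "mp3": MP3_ID3}
FOOTERS = {"jpg": JPEG_EOI}  # MP3 has no end signature

def find_next_header(image: bytes, start: int) -> int:
    """Return the offset of the earliest known header at or after start, or -1."""
    hits = [image.find(sig, start) for sig in HEADERS.values()]
    hits = [h for h in hits if h != -1]
    return min(hits) if hits else -1

def carve_one(image: bytes, start: int, kind: str, max_size: int) -> bytes:
    """Carve one file of type kind whose header sits at offset start."""
    body = start + len(HEADERS[kind])
    limit = min(len(image), start + max_size)  # maximum file size cap
    # Method 2: stop at the end signature, if the type has one.
    footer = FOOTERS.get(kind)
    if footer is not None:
        pos = image.find(footer, body, limit)
        if pos != -1:
            return image[start:pos + len(footer)]
    # Method 3: assume the file ends where the next header starts.
    nxt = find_next_header(image, body)
    if nxt != -1 and nxt < limit:
        return image[start:nxt]
    # Neither worked: keep everything up to the size cap.
    return image[start:limit]

image = (b"\x00" * 4 + JPEG_SOI + b"pixels" + JPEG_EOI
         + b"pad" + b"ID3" + b"audio" + JPEG_SOI + b"x")
print(carve_one(image, 4, "jpg", 1000))   # ends at 0xFF 0xD9
print(carve_one(image, 18, "mp3", 1000))  # b'ID3audio', cut at the next JPEG header
print(carve_one(image, 18, "mp3", 5))     # b'ID3au', cut by the size cap
```

Note that this naive version has exactly the weaknesses described in the list: it stops at the first 0xFF 0xD9 it meets, even if that one belongs to an embedded thumbnail or to an unrelated file, and any chance occurrence of “ID3” ends the file being recovered.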

Some observations, thoughts and tips when raw scanning

Swinging back to today and the person trying to recover data from a reformatted HFS+ volume: he was mainly interested in videos, and as we learned, reformatting an HFS+ volume isn’t very forgiving to file system metadata. So it comes down to detecting files using the raw scan method.

To perform the recovery he was using R-Studio. R-Studio is one of the more advanced and professional file recovery tools available and is used in many data recovery labs for logical data recoveries. Out of the box R-Studio detects many file types using the raw scanner and you can even add your own.

Add file signatures for raw scan R-Studio

The problem was that a substantial number of the recovered videos were corrupt. It got me thinking, as I happen to know MP4 video files do not have an end of file signature. I also suspect R-Studio does not parse MP4 files and just detects the start signature. That means two methods for end of file detection remain:

  1. Scan until we find signature for next file.
  2. Truncate file at preset max file size.

If you find prematurely cut-off videos and the tool you’re using offers a max file size option, you can try increasing that limit. Another possibility is that the file is closed and saved prematurely because the tool assumes it has found the start of the next file. Let’s look at the MP3 example.
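For contrast, this is roughly what a file type aware approach (method 1 in the earlier list) could look like for MP4. An MP4 file is a sequence of ‘boxes’, each starting with a 4-byte big-endian size followed by a 4-byte type, so a parser can hop from box to box and add up the sizes. The sketch below is an assumption-laden illustration, not what R-Studio does: the function name mp4_length is made up, the check that box types are printable ASCII is my own plausibility test, and the whole idea only works when the file is not fragmented.

```python
import struct

def mp4_length(buf: bytes, start: int = 0) -> int:
    """Walk top-level MP4 boxes from start and return the bytes they cover."""
    pos = start
    while pos + 8 <= len(buf):
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if pos + 16 > len(buf):
                break
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header = 16
        elif size == 0:
            # A size of 0 means "box runs to end of file", which tells
            # us nothing on a raw disk, so stop here.
            break
        # Plausibility checks (assumptions of this sketch).
        if size < header or not all(0x20 <= b < 0x7F for b in box_type):
            break
        if pos + size > len(buf):
            break  # box would run past the available data
        pos += size
    return pos - start

ftyp = struct.pack(">I4s", 16, b"ftyp") + b"isom" + b"\x00\x00\x02\x00"
mdat = struct.pack(">I4s", 12, b"mdat") + b"ID3!"   # payload contains "ID3"
moov = struct.pack(">I4s", 8, b"moov")
disk = ftyp + mdat + moov + b"\x00" * 32            # followed by unrelated data
print(mp4_length(disk))  # 36
```

Because the parser skips over the contents of each box, the “ID3” inside the video data is never looked at and cannot cut the file short.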

The MP3 file signature is 0x49 0x44 0x33, but we already discussed how this sequence of bytes isn’t unique. In other words, it may simply be part of a video data stream. The makers of R-Studio more or less acknowledge this (https://www.r-studio.com/file-recovery-basics.html):

For example, the string “ID3” may occur in a file without being a file signature. For example, the text you are currently reading includes the string “ID3” but is not an MP3 file. As such, a raw file search may incorrectly identify a portion of this text as the beginning of an MP3 file.
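To get a feel for how often a three-byte signature such as “ID3” turns up by chance, here is a back-of-the-envelope estimate in Python. It rests on assumptions that should be stated loudly: the data is treated as uniformly random bytes (compressed video only approximates this), and the step parameter models whether a scanner tests every byte offset or only, say, the start of each 512-byte sector. I don’t know which of the two any particular tool does. The function name expected_false_hits is made up for this sketch.

```python
def expected_false_hits(data_bytes: int, sig_len: int, step: int = 1) -> float:
    """Expected chance matches of one sig_len-byte signature in random data,
    when the signature is tested once every step bytes."""
    positions = data_bytes // step       # number of offsets that get tested
    return positions / 256 ** sig_len    # each offset matches with p = 256**-sig_len

one_gib = 2 ** 30
print(expected_false_hits(one_gib, 3))            # 64.0  (every byte offset)
print(expected_false_hits(one_gib, 3, step=512))  # 0.125 (sector starts only)
```

Under these assumptions the numbers are for a single signature; every extra signature the scanner looks for adds its own share of chance matches.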

We can also see how many file types the raw scanner looks for by default:

Default R-Studio raw scans for many file types

Each of the associated file signatures may potentially interrupt an MP4 video data stream. So if video files are your primary concern and you get a lot of truncated videos, re-scan after disabling all but the MP4 file types in the raw scan settings window. Most other file recovery software, like R-Studio, offers this option too.
