Decoding Status from NVMe SMART logs

By | August 12, 2023

The man page provides no clue how to interpret the NVMe error log

I've seen this question come up a lot (on SuperUser, for example), and I've wondered about it myself. Questions like:

Decyphering NVMe SMART error

root@truenas[~]# smartctl -l error /dev/nvme4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          1     6  0x007b  0xc502  0x000    130089984     1     -

It turns out the answers are out there, but you have to combine bits and bobs of information. The value we’re interested in here is the one in the Status column, 0xc502. I found the answer here. In the answer given by birkelund, who identifies as an NVMe software developer, we read:

If you are asking how that error code is encoded in 0xC502, then its 0xC502 >> 1 to get rid of the Phase Tag. That leave us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bytes [sic, should be bits] (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are “Media and Data Integrity Errors” and the 0x81 status code is “Unrecovered Read Error”.
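The arithmetic in that answer is easy to check. A minimal Python sketch (the variable names are mine, not from the answer):

```python
# Reproduce the shift-and-mask from the quoted answer.
raw = 0xC502                 # Status column as printed by smartctl

shifted = raw >> 1           # drop bit 0, the Phase Tag
masked = shifted & 0x7FF     # keep the low 11 bits: 3 for SCT, 8 for SC

print(hex(shifted))          # 0x6281
print(hex(masked))           # 0x281
print((0x7FF).bit_length())  # 11: the mask covers 11 bits, not bytes
```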

So that gives us one piece of the puzzle, but leaves me with some questions too:

  1. Where did he get Status Code Type and Status Code definitions from?
  2. What’s a phase tag?

I did some more digging to find out where the Status Code Type and Status Code values come from, and found source code that matches what he explains (a more complete list is here):

NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,

// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND

NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,

// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC

NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,

// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR

NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,

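That leaves the other loose end: what the Phase Tag is and why we have to get rid of it. The Phase Tag is a single bit that the controller flips on each pass through the completion queue, so the host can tell new entries from stale ones; it says nothing about the error itself. I found the following description of the Status field: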

Status Field: This field indicates the Status Field for the command that completed. The Status
Field is located in bits 15:01, bit 00 corresponds to the Phase Tag posted for the command. If
the error is not specific to a particular command then this field reports the most applicable
status value.
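To see how the Phase Tag sits next to the rest of the Status value, here is a sketch that splits the 16-bit number into its parts. Only the Phase Tag, Status Code Type and Status Code come from the material quoted in this post. The remaining bit positions (Command Retry Delay, More, Do Not Retry) are my reading of the completion queue entry layout in the NVMe base specification, and older spec revisions treat some of them as reserved. The function and key names are made up for illustration.

```python
def split_status(raw: int) -> dict:
    """Split the 16-bit Status value from the error log into bit fields."""
    return {
        "phase_tag": raw & 0x1,     # bit 0
        "sc": (raw >> 1) & 0xFF,    # bits 8:1, Status Code
        "sct": (raw >> 9) & 0x7,    # bits 11:9, Status Code Type
        "crd": (raw >> 12) & 0x3,   # bits 13:12 (assumed: Command Retry Delay)
        "more": (raw >> 14) & 0x1,  # bit 14 (assumed: More)
        "dnr": (raw >> 15) & 0x1,   # bit 15 (assumed: Do Not Retry)
    }


fields = split_status(0xC502)
print(fields)
```

For 0xc502 this gives a Status Code Type of 2 and a Status Code of 0x81, with the Phase Tag at 0. It also accounts for the leftover 0x6 in 0x6281: those are the two top bits, which the 0x7ff mask throws away.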

This gives us all we need to decode NVMe Status values. Let’s apply it to the Status value from the start of the post: 0xc502.

Step 1: Get rid of the Phase Tag. A right shift by one (“>> 1”) is the same as an integer divide by 2, so 0xc502 / 2 = 0x6281.

Step 2: Apply a mask of 0x7ff to extract the lower 11 bits (the quoted answer says bytes, but bits is what is meant). For this value that is the same as keeping the three rightmost nibbles of 0x6281, which gives 0x281. Three nibbles are 12 bits, so this shortcut only works when the extra bit is zero, as it is in every example in this post.

Step 3: The left nibble of 0x281 gives us the Status Code Type, 2, which in the table above is NVME_STATUS_TYPE_MEDIA_ERROR = 2. If we look up Status Code 0x81 (the two right nibbles) under “Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR” we get NVME_STATUS_NVM_UNRECOVERED_READ_ERROR.
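The three steps can be sketched as a small Python function. The lookup tables hold only a handful of entries copied from the list above, so anything missing falls back to a placeholder string. `decode_status` and the table names are hypothetical helpers, not part of smartctl.

```python
# Status Code Type names, copied from the list above.
SCT_NAMES = {
    0: "NVME_STATUS_TYPE_GENERIC_COMMAND",
    1: "NVME_STATUS_TYPE_COMMAND_SPECIFIC",
    2: "NVME_STATUS_TYPE_MEDIA_ERROR",
    7: "NVME_STATUS_TYPE_VENDOR_SPECIFIC",
}

# A small subset of Status Codes, keyed by (type, code).
SC_NAMES = {
    (0, 0x00): "NVME_STATUS_SUCCESS_COMPLETION",
    (1, 0x09): "NVME_STATUS_INVALID_LOG_PAGE",
    (1, 0x0D): "NVME_STATUS_FEATURE_ID_NOT_SAVEABLE",
    (2, 0x80): "NVME_STATUS_NVM_WRITE_FAULT",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}


def decode_status(raw):
    """Return (type name, code name) for a Status value from the error log."""
    value = (raw >> 1) & 0x7FF  # steps 1 and 2: drop Phase Tag, keep 11 bits
    sct = value >> 8            # step 3: top 3 bits are the Status Code Type
    sc = value & 0xFF           # low 8 bits are the Status Code
    type_name = SCT_NAMES.get(sct, f"unknown type {sct}")
    code_name = SC_NAMES.get((sct, sc), f"unknown code {sc:#04x}")
    return type_name, code_name


print(decode_status(0xC502))
```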

Let’s decode a few more for practice.

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1181     0  0x0015  0x421a  0x028            0     0     -
  1       1180     0  0x0013  0x4212  0x028            0     0     -
  2       1179     0  0x0015  0x421a  0x028            0     0     -
  3       1178     0  0x0013  0x4212  0x028            0     0     -

0x421a -> 0x421a / 2 = 0x210d -> 0x210d & 0x7ff = 0x10d -> Status Code Type 1, Status Code 0x0D -> NVME_STATUS_FEATURE_ID_NOT_SAVEABLE

0x4212 -> 0x4212 / 2 = 0x2109 -> 0x2109 & 0x7ff = 0x109 -> Status Code Type 1, Status Code 0x09 -> NVME_STATUS_INVALID_LOG_PAGE
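With more than a couple of entries it gets tedious to do this by hand. Here is a rough sketch that pulls the Status column out of pasted smartctl text and decodes each row. It assumes the column layout shown in this post (Status is the fifth column); other smartctl versions may print something different, and the helper names are my own.

```python
SAMPLE = """\
Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1181     0  0x0015  0x421a  0x028            0     0     -
  1       1180     0  0x0013  0x4212  0x028            0     0     -
"""


def status_column(text):
    """Collect the Status value from every data row of the error table."""
    values = []
    for line in text.splitlines():
        cols = line.split()
        # Data rows start with the entry number; banner and header do not.
        if len(cols) >= 5 and cols[0].isdigit():
            values.append(int(cols[4], 16))
    return values


def sct_sc(raw):
    """Drop the Phase Tag, keep 11 bits, split into (type, code)."""
    value = (raw >> 1) & 0x7FF
    return value >> 8, value & 0xFF


for raw in status_column(SAMPLE):
    sct, sc = sct_sc(raw)
    print(f"{raw:#06x} -> Status Code Type {sct}, Status Code {sc:#04x}")
```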
