I've seen this question come up a lot (on SuperUser, for example), and I've wondered about it myself. Questions like:
Decyphering NVMe SMART error
root@truenas[~]# smartctl -l error /dev/nvme4
smartctl 7.2 2020-12-30 r5155 [FreeBSD 12.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1 6 0x007b 0xc502 0x000 130089984 1 -
It turns out the answers are out there, but you have to combine bits and bobs of information. The value we're interested in here is the one in the Status column, 0xc502. I found the answer here. In the answer given by birkelund, who identifies as an NVMe software developer, we read:
If you are asking how that error code is encoded in 0xC502, then its 0xC502 >> 1 to get rid of the Phase Tag. That leave us with 0x6281. Then apply a mask of 0x7ff to extract the lower 11 bytes [sic: bits] (3 for the Status Code Type and 8 for the Status Code), ending up with 0x281. 0x2xx are “Media and Data Integrity Errors” and the 0x81 status code is “Unrecovered Read Error”.
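The recipe in the quote can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `split_status` is my own and does not come from smartmontools or any NVMe tooling.

```python
def split_status(status):
    """Return (status_code_type, status_code) for a 16-bit Status value
    as printed by `smartctl -l error`. Hypothetical helper name."""
    field = status >> 1   # drop bit 0, the Phase Tag
    field &= 0x7FF        # keep the low 11 bits: 3 (type) + 8 (code)
    sct = field >> 8      # upper 3 of those 11 bits
    sc = field & 0xFF     # lower 8 bits
    return sct, sc


sct, sc = split_status(0xC502)
print(f"SCT={sct:#x} SC={sc:#04x}")  # prints "SCT=0x2 SC=0x81"
```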
So that gives us one piece of the puzzle, but leaves me with some questions too:
- Where did he get Status Code Type and Status Code definitions from?
- What’s a phase tag?
I did some more research to find out where he got the Status Code Type and Status Code definitions from, and found some source code that matches what he explains (a more complete list is here):
NVME_STATUS_TYPE_GENERIC_COMMAND = 0,
NVME_STATUS_TYPE_COMMAND_SPECIFIC = 1,
NVME_STATUS_TYPE_MEDIA_ERROR = 2,
NVME_STATUS_TYPE_VENDOR_SPECIFIC = 7,
// Status Code (SC) of NVME_STATUS_TYPE_GENERIC_COMMAND
NVME_STATUS_SUCCESS_COMPLETION = 0x00,
NVME_STATUS_INVALID_COMMAND_OPCODE = 0x01,
NVME_STATUS_INVALID_FIELD_IN_COMMAND = 0x02,
NVME_STATUS_COMMAND_ID_CONFLICT = 0x03,
NVME_STATUS_DATA_TRANSFER_ERROR = 0x04,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_POWER_LOSS_NOTIFICATION = 0x05,
NVME_STATUS_INTERNAL_DEVICE_ERROR = 0x06,
NVME_STATUS_COMMAND_ABORT_REQUESTED = 0x07,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_SQ_DELETION = 0x08,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_FUSED_COMMAND = 0x09,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_FAILED_MISSING_COMMAND = 0x0A,
NVME_STATUS_INVALID_NAMESPACE_OR_FORMAT = 0x0B,
NVME_STATUS_COMMAND_SEQUENCE_ERROR = 0x0C,
NVME_STATUS_INVALID_SGL_LAST_SEGMENT_DESCR = 0x0D,
NVME_STATUS_INVALID_NUMBER_OF_SGL_DESCR = 0x0E,
NVME_STATUS_DATA_SGL_LENGTH_INVALID = 0x0F,
NVME_STATUS_METADATA_SGL_LENGTH_INVALID = 0x10,
NVME_STATUS_SGL_DESCR_TYPE_INVALID = 0x11,
NVME_STATUS_INVALID_USE_OF_CONTROLLER_MEMORY_BUFFER = 0x12,
NVME_STATUS_PRP_OFFSET_INVALID = 0x13,
NVME_STATUS_ATOMIC_WRITE_UNIT_EXCEEDED = 0x14,
NVME_STATUS_OPERATION_DENIED = 0x15,
NVME_STATUS_SGL_OFFSET_INVALID = 0x16,
NVME_STATUS_RESERVED = 0x17,
NVME_STATUS_HOST_IDENTIFIER_INCONSISTENT_FORMAT = 0x18,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_EXPIRED = 0x19,
NVME_STATUS_KEEP_ALIVE_TIMEOUT_INVALID = 0x1A,
NVME_STATUS_COMMAND_ABORTED_DUE_TO_PREEMPT_ABORT = 0x1B,
NVME_STATUS_SANITIZE_FAILED = 0x1C,
NVME_STATUS_SANITIZE_IN_PROGRESS = 0x1D,
NVME_STATUS_SGL_DATA_BLOCK_GRANULARITY_INVALID = 0x1E,
NVME_STATUS_DIRECTIVE_TYPE_INVALID = 0x70,
NVME_STATUS_DIRECTIVE_ID_INVALID = 0x71,
NVME_STATUS_NVM_LBA_OUT_OF_RANGE = 0x80,
NVME_STATUS_NVM_CAPACITY_EXCEEDED = 0x81,
NVME_STATUS_NVM_NAMESPACE_NOT_READY = 0x82,
NVME_STATUS_NVM_RESERVATION_CONFLICT = 0x83,
NVME_STATUS_FORMAT_IN_PROGRESS = 0x84,
// Status Code (SC) of NVME_STATUS_TYPE_COMMAND_SPECIFIC
NVME_STATUS_COMPLETION_QUEUE_INVALID = 0x00,
NVME_STATUS_INVALID_QUEUE_IDENTIFIER = 0x01,
NVME_STATUS_MAX_QUEUE_SIZE_EXCEEDED = 0x02,
NVME_STATUS_ABORT_COMMAND_LIMIT_EXCEEDED = 0x03,
NVME_STATUS_ASYNC_EVENT_REQUEST_LIMIT_EXCEEDED = 0x05,
NVME_STATUS_INVALID_FIRMWARE_SLOT = 0x06,
NVME_STATUS_INVALID_FIRMWARE_IMAGE = 0x07,
NVME_STATUS_INVALID_INTERRUPT_VECTOR = 0x08,
NVME_STATUS_INVALID_LOG_PAGE = 0x09,
NVME_STATUS_INVALID_FORMAT = 0x0A,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_CONVENTIONAL_RESET = 0x0B,
NVME_STATUS_INVALID_QUEUE_DELETION = 0x0C,
NVME_STATUS_FEATURE_ID_NOT_SAVEABLE = 0x0D,
NVME_STATUS_FEATURE_NOT_CHANGEABLE = 0x0E,
NVME_STATUS_FEATURE_NOT_NAMESPACE_SPECIFIC = 0x0F,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_NVM_SUBSYSTEM_RESET = 0x10,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_RESET = 0x11,
NVME_STATUS_FIRMWARE_ACTIVATION_REQUIRES_MAX_TIME_VIOLATION = 0x12,
NVME_STATUS_FIRMWARE_ACTIVATION_PROHIBITED = 0x13,
NVME_STATUS_OVERLAPPING_RANGE = 0x14,
NVME_STATUS_NAMESPACE_INSUFFICIENT_CAPACITY = 0x15,
NVME_STATUS_NAMESPACE_IDENTIFIER_UNAVAILABLE = 0x16,
NVME_STATUS_NAMESPACE_ALREADY_ATTACHED = 0x18,
NVME_STATUS_NAMESPACE_IS_PRIVATE = 0x19,
NVME_STATUS_NAMESPACE_NOT_ATTACHED = 0x1A,
NVME_STATUS_NAMESPACE_THIN_PROVISIONING_NOT_SUPPORTED = 0x1B,
NVME_STATUS_CONTROLLER_LIST_INVALID = 0x1C,
NVME_STATUS_DEVICE_SELF_TEST_IN_PROGRESS = 0x1D,
NVME_STATUS_BOOT_PARTITION_WRITE_PROHIBITED = 0x1E,
NVME_STATUS_INVALID_CONTROLLER_IDENTIFIER = 0x1F,
NVME_STATUS_INVALID_SECONDARY_CONTROLLER_STATE = 0x20,
NVME_STATUS_INVALID_NUMBER_OF_CONTROLLER_RESOURCES = 0x21,
NVME_STATUS_INVALID_RESOURCE_IDENTIFIER = 0x22,
NVME_STATUS_STREAM_RESOURCE_ALLOCATION_FAILED = 0x7F,
NVME_STATUS_NVM_CONFLICTING_ATTRIBUTES = 0x80,
NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION = 0x81,
NVME_STATUS_NVM_ATTEMPTED_WRITE_TO_READ_ONLY_RANGE = 0x82,
// Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR
NVME_STATUS_NVM_WRITE_FAULT = 0x80,
NVME_STATUS_NVM_UNRECOVERED_READ_ERROR = 0x81,
NVME_STATUS_NVM_END_TO_END_GUARD_CHECK_ERROR = 0x82,
NVME_STATUS_NVM_END_TO_END_APPLICATION_TAG_CHECK_ERROR = 0x83,
NVME_STATUS_NVM_END_TO_END_REFERENCE_TAG_CHECK_ERROR = 0x84,
NVME_STATUS_NVM_COMPARE_FAILURE = 0x85,
NVME_STATUS_NVM_ACCESS_DENIED = 0x86,
NVME_STATUS_NVM_DEALLOCATED_OR_UNWRITTEN_LOGICAL_BLOCK = 0x87,
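To look codes up programmatically, the listing above can be turned into a table keyed by the pair (Status Code Type, Status Code). The sketch below copies only a handful of entries from the listing, so treat it as a partial, illustrative subset; `STATUS_NAMES` and `status_name` are names I made up for this example.

```python
# Partial table copied from the listing above: (SCT, SC) -> name.
# Only a few entries are included; extend it with the rest as needed.
STATUS_NAMES = {
    (0, 0x00): "NVME_STATUS_SUCCESS_COMPLETION",
    (0, 0x80): "NVME_STATUS_NVM_LBA_OUT_OF_RANGE",
    (0, 0x81): "NVME_STATUS_NVM_CAPACITY_EXCEEDED",
    (1, 0x09): "NVME_STATUS_INVALID_LOG_PAGE",
    (1, 0x0D): "NVME_STATUS_FEATURE_ID_NOT_SAVEABLE",
    (1, 0x81): "NVME_STATUS_NVM_INVALID_PROTECTION_INFORMATION",
    (2, 0x80): "NVME_STATUS_NVM_WRITE_FAULT",
    (2, 0x81): "NVME_STATUS_NVM_UNRECOVERED_READ_ERROR",
}


def status_name(sct, sc):
    """Look up a name, falling back to a readable placeholder."""
    return STATUS_NAMES.get((sct, sc), f"unknown (SCT={sct:#x}, SC={sc:#04x})")


# The same Status Code means different things under different types.
for sct in (0, 1, 2):
    print(sct, status_name(sct, 0x81))
```

Keying on the pair rather than on the Status Code alone matters because the code values overlap: 0x81 appears in all three sections of the listing with three different meanings.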
For the other loose end, the Phase Tag and why we need to take it into account, I found the following description:
Status Field: This field indicates the Status Field for the command that completed. The Status
Field is located in bits 15:01, bit 00 corresponds to the Phase Tag posted for the command. If
the error is not specific to a particular command then this field reports the most applicable
status value.
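The description above only says that bit 0 is the Phase Tag and bits 15:01 are the Status Field. As a sketch, here is how the whole 16-bit value could be unpacked. The positions of the Status Code and Status Code Type follow from the quoted answer; the Do Not Retry, More and Command Retry Delay bits are my reading of the completion queue entry layout in the NVMe base specification, not something stated in the excerpt, and Command Retry Delay only exists in newer revisions. `NvmeStatus` and `unpack_status` are made-up names.

```python
from typing import NamedTuple


class NvmeStatus(NamedTuple):
    phase: int  # bit 0: Phase Tag
    sc: int     # bits 8:1: Status Code
    sct: int    # bits 11:9: Status Code Type
    crd: int    # bits 13:12: Command Retry Delay (assumed from the spec)
    more: int   # bit 14: More (assumed from the spec)
    dnr: int    # bit 15: Do Not Retry (assumed from the spec)


def unpack_status(status):
    """Split the 16-bit Status value from the error log into its fields."""
    return NvmeStatus(
        phase=status & 0x1,
        sc=(status >> 1) & 0xFF,
        sct=(status >> 9) & 0x7,
        crd=(status >> 12) & 0x3,
        more=(status >> 14) & 0x1,
        dnr=(status >> 15) & 0x1,
    )


print(unpack_status(0xC502))
# NvmeStatus(phase=0, sc=129, sct=2, crd=0, more=1, dnr=1)
```

This also explains the leading 6 in 0x6281: under that reading it is the More and Do Not Retry bits, which the 0x7ff mask throws away.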
This gives us all we need to decode NVMe status values. Let's apply it to the Status value from the start of the post: 0xc502
Step 1: Get rid of the Phase Tag. A right shift by one (“>> 1”) is the same as an integer divide by 2, so 0xc502 / 2 = 0x6281
Step 2: Apply a mask of 0x7ff to extract the lower 11 bits (bits, not bytes). For this value that is the same as taking the three rightmost nibbles of 0x6281, which gives us 0x281
Step 3: The left nibble gives us the Status Code Type, 2, which in the table above is NVME_STATUS_TYPE_MEDIA_ERROR = 2. If we look up status code 0x81 (the two right nibbles) under “Status Code (SC) of NVME_STATUS_TYPE_MEDIA_ERROR” we get NVME_STATUS_NVM_UNRECOVERED_READ_ERROR
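One caveat on Step 2: reading off the three rightmost hex digits only matches the 0x7ff mask when bit 11 of the shifted value is clear, because 0x7ff keeps 11 bits while three nibbles are 12. The small comparison below shows both cases. The value 0x1502 is made up for illustration and does not come from a real drive; the function names are hypothetical too.

```python
def low_11_bits(status):
    """Steps 1 and 2 as described: drop the Phase Tag, keep 11 bits."""
    return (status >> 1) & 0x7FF


def three_nibbles(status):
    """The shortcut: shift, then keep the three rightmost hex digits."""
    return (status >> 1) & 0xFFF


# The value from this post: both approaches agree.
print(hex(low_11_bits(0xC502)), hex(three_nibbles(0xC502)))  # 0x281 0x281

# A made-up value where bit 11 is set after the shift: they differ.
print(hex(low_11_bits(0x1502)), hex(three_nibbles(0x1502)))  # 0x281 0xa81
```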
Let’s do a few more for practice.
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1181 0 0x0015 0x421a 0x028 0 0 -
1 1180 0 0x0013 0x4212 0x028 0 0 -
2 1179 0 0x0015 0x421a 0x028 0 0 -
3 1178 0 0x0013 0x4212 0x028 0 0 -
0x421a -> 0x421a / 2 = 0x210D -> mask 0x7ff = 0x10D -> Status Code Type 1, Status Code 0x0D -> NVME_STATUS_FEATURE_ID_NOT_SAVEABLE
0x4212 -> 0x4212 / 2 = 0x2109 -> mask 0x7ff = 0x109 -> Status Code Type 1, Status Code 0x09 -> NVME_STATUS_INVALID_LOG_PAGE
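Doing this by hand gets tedious with 64 entries, so here is a rough sketch that decodes every row of the table at once. It assumes the rows are whitespace-separated and that the header names the columns as in the output above; real smartctl output may differ between versions, so take it as a starting point. The sample text is copied from the log above, and `decode_rows` is a made-up name.

```python
SAMPLE = """\
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1181 0 0x0015 0x421a 0x028 0 0 -
1 1180 0 0x0013 0x4212 0x028 0 0 -
"""


def decode_rows(text):
    """Return a list of (entry number, SCT, SC) for each table row."""
    lines = text.strip().splitlines()
    # Find the Status column from the header instead of hard-coding it.
    col = lines[0].split().index("Status")
    decoded = []
    for line in lines[1:]:
        fields = line.split()
        status = int(fields[col], 16)
        field = (status >> 1) & 0x7FF  # drop Phase Tag, keep SCT + SC
        decoded.append((int(fields[0]), field >> 8, field & 0xFF))
    return decoded


for num, sct, sc in decode_rows(SAMPLE):
    print(f"entry {num}: SCT={sct} SC={sc:#04x}")
```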