Big scary Raw SMART value, is my drive dying?!
The other week I noticed this post on superuser.com:
4295032833, now that is a big scary number! That’s a lot of timeout errors!
Does this mean this drive is bad or even dying? Then how come the normalized values seem to indicate that nothing’s wrong? Some will answer that last question by telling you, you shouldn’t rely on normalized values but on RAW values instead. So then we’d go with big scary number?
No. Usually when we see a huge number like this, it is a vendor-specific RAW value: rather than one number, it packs several values that we need to ‘break up’. Sometimes hard drive manufacturers provide documentation that helps us do this. In this case we’re dealing with a Seagate drive and this document provides further info: http://t1.daumcdn.net/brunch/service/user/axm/file/zRYOdwPu3OMoKYmBOby1fEEQEbU.pdf.
What does this big scary number mean?
From the document we learn:
3.11 Attribute ID 188: Command Timeout Count
Normalized Command Timeout Count = 100 – Command Timeout Count. This attribute tracks the number of command time outs as defined by an active command being interrupted by a HRESET and COMRESET or SRST or another command
The normalized value is only computed when the number of commands is in the range 10^3 to 10^4. The CommandCount and ErroCount are cleared when Number Of Commands reaches 10^4. The error count used to compute normalized value is not reported in attribute Raw value. It is reported in vendor info area of Attribute sector, bytes 474:475. If Command Timeout Count is > 99, normalize value of 1 is reported. The initial Worst Value is set to 0xFD as a special case.
Raw Usage
Raw [1 – 0] = Total # of command timeouts, with Max hold of FFFFh
Raw [3 – 2] = Total # of commands with > 5 second completion, including those > 7.5 seconds
Raw [5 – 4] = Total # of commands with > 7.5 second completion
Decoding the huge Raw SMART value
Okay, we need that last section “Raw usage” to interpret our big scary number. We see it is not one value.
Our big scary number is 4295032833. We need to break that into 3 parts, and for that we first convert it to HEX. So we get HEX = 0x100010001.
It may just be me, but that looks less scary already!
Padded to the full 48 bits this is 0x000100010001, which we break into 3 separate 16-bit word values: 0x0001, 0x0001 and 0x0001. Using the Seagate document I then decode this as: one command timeout, one command that took > 5 seconds to complete and one command that took longer than 7.5 seconds to complete. Since the > 5 second count includes commands that took > 7.5 seconds, the most likely reading is that all three counters refer to the same event. Or in other words, we had one error that took longer than 7.5 seconds to complete.
Summarizing, big scary decimal Raw SMART value 4295032833 tells us that one command timeout error that took longer than 7.5 seconds to complete occurred. I would not worry about that too much; it could be due to someone bumping into the desk the computer was on.
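The decoding steps above can be sketched in a few lines of Python. This is only an illustration, assuming that byte 0 in Seagate's "Raw Usage" numbering is the least significant byte (which matches the manual decode above); the function name `decode_command_timeout` and the dictionary keys are made up for this example.

```python
def decode_command_timeout(raw):
    """Split the 48-bit raw value of attribute 188 into its three 16-bit words.

    Assumes Raw[0] is the least significant byte, as in the manual decode.
    """
    return {
        "timeouts": raw & 0xFFFF,            # Raw[1-0]: total command timeouts
        "over_5s": (raw >> 16) & 0xFFFF,     # Raw[3-2]: commands taking > 5 s
        "over_7_5s": (raw >> 32) & 0xFFFF,   # Raw[5-4]: commands taking > 7.5 s
    }


raw = 4295032833
print(hex(raw))                     # 0x100010001
print(decode_command_timeout(raw))  # each of the three counters is 1
```

The same masks and shifts work for any raw value of this attribute; only the 16-bit word boundaries matter.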
Another example:
Attribute ID 7: Seek Error Rate
Monitor seeks requiring one or more retries. Exclude calibration seeks and seeks in system area. Normalized Seek Error Rate = 10 * log10(SeekCount / SeekErrors) which is only updated when SeekCount is in the range 10^6 to 10^9. The counts are cleared when SeekCount = 10^9. (Evaluates to a value from 1 to 100).
Raw Usage
Raw [3 – 0] = Number of seeks
Raw [5 – 4] = Number of seek errors
So the raw value of this SMART attribute occupies 48 bits. Seagate’s Seek Error Rate attribute consists of two parts: a 16-bit count of seek errors in the uppermost 4 nibbles, and a 32-bit count of seeks in the lowermost 8 nibbles.
Now assume we see:
ID Current Worst Threshold Data RAW
(07) Seek Error Rate 85 60 30 359872048 Ok
We get: 359872048 = 0x000015733630, which splits into 0x0000 (0 seek errors) and 0x15733630 (359872048 seeks). So no errors.
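The same split can be written as a short Python sketch. As before, this assumes Raw[0] is the least significant byte; `decode_seek_error_rate` is just a name chosen for this example.

```python
def decode_seek_error_rate(raw):
    """Split the 48-bit raw value of attribute 7 into (seek_errors, seeks)."""
    seeks = raw & 0xFFFFFFFF          # Raw[3-0]: lower 32 bits, number of seeks
    seek_errors = (raw >> 32) & 0xFFFF  # Raw[5-4]: upper 16 bits, seek errors
    return seek_errors, seeks


raw = 359872048
print(f"0x{raw:012X}")              # 0x000015733630
seek_errors, seeks = decode_seek_error_rate(raw)
print(seek_errors, seeks)           # 0 359872048
```

A raw value only starts to signal trouble here once the upper 16 bits become non-zero, which is why a large decimal number on its own tells you very little.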