SMART

Self-Monitoring, Analysis and Reporting Technology (SMART) is a system for monitoring storage media like Hard Disk Drives (HDDs) and Solid State Drives (SSDs) and for detecting their errors.

Practically all modern HDDs and SSDs (referred to as HDDs in the following) have SMART functionality, but some external drive cases do NOT support reading SMART values from the HDD directly.

Since SMART is not a fully standardised procedure or a well-defined collection of values, the attributes differ greatly between manufacturers: a value that indicates imminent death for one hard drive might not even be shown or updated by another.

Have a look at the specifications, e.g. at Wikipedia, for further details on what has been standardised.

But how do I actually read SMART values?

I am personally using smartctl (on my Linux servers), which can be installed via apt:

apt-get install -y smartmontools

or on macOS you can use brew

brew install smartmontools

and used inside the terminal by typing the following, replacing /dev/sdX with the disk you want to check (on macOS the disks are named /dev/diskN instead)

sudo smartctl -a /dev/sdX

Attributes to look out for could be the following; I added a short description of what each one might mean for your drive:

Raw_Read_Error_Rate
- rate of hardware errors that occurred while reading data from the disk
Spin_Up_Time
- the time the platters need to spin up to full speed (e.g. after leaving power save mode); this is a duration, not a count
Start_Stop_Count
- how often the platters were spun up and stopped again, including spin-downs triggered by power save mode
Seek_Error_Rate
- the rate of errors occurring while the heads seek to a position on the platters
Power_On_Hours
- the total number of hours the disk has been powered on

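Each attribute is assigned a TYPE, which is either Pre-fail or Old_age. Pre-fail means that a VALUE at or below its threshold indicates an imminent failure, while Old_age means that it merely indicates that the drive has reached the end of its designed lifespan through normal wear. Most attributes are usually of TYPE Old_age, which can be confusing: the TYPE alone says nothing about the current state of the drive, so it highly depends on which ATTRIBUTE has this TYPE and what its values are.

An Old_age attribute with a bad value might, depending on your HDD as said, indicate that the drive has a reduced lifespan.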

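You might want to have a look at the actual VALUE and WORST columns to see if there could be a problem. VALUE is the current normalised value (usually higher is better) and WORST is the lowest VALUE recorded so far. I have also seen the THRESH(old) being useful, for example for Reallocated_Event_Count. The threshold might be ‘3’, which means the attribute counts as failed once its VALUE drops to 3 or below; it is not the number of reallocations that are allowed to fail. The actual counts, in a vendor-specific format, are found in RAW_VALUE.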
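To make the VALUE/THRESH comparison less error-prone, here is a minimal sketch that filters the attribute table for you. It assumes the column layout shown in this post (VALUE in column 4, THRESH in column 6, TYPE in column 7) and follows the smartctl convention that an attribute has failed when VALUE is less than or equal to THRESH. The function name check_thresholds is made up for this example and is not part of smartmontools.

```shell
# Reads "smartctl -A" style output on stdin and prints every attribute whose
# normalised VALUE has reached its THRESH. Attributes with a threshold of 000
# are skipped, since they can never fail by this rule.
# check_thresholds is a hypothetical helper name, not a smartmontools command.
check_thresholds() {
    awk '
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the headers and the self-test log further down.
        $1 ~ /^[0-9]+$/ && NF >= 10 {
            value = $4 + 0
            thresh = $6 + 0
            if (thresh > 0 && value <= thresh)
                printf "%s %s: VALUE %d <= THRESH %d\n", $7, $2, value, thresh
        }
    '
}
```

It could be used like this: sudo smartctl -A /dev/sdX | check_thresholds. No output means that no attribute has reached its threshold, which is not the same as a guarantee that the drive is healthy.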

Have a look at this smartctl output:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0033   095   095   050    Pre-fail  Always       -       0/2447236
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       711h+10m+35.860s
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2976
 13 Soft_Read_Error_Rate    0x0032   095   095   000    Old_age   Always       -       0/2447236
100 Gigabytes_Erased        0x0032   000   000   000    Old_age   Always       -       23256
170 Reserve_Block_Count     0x00ef   000   000   000    Pre-fail  Always       -       3424
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       86
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       94
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
184 IO_Error_Detect_Code_Ct 0x00ef   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Airflow_Temperature_Cel 0x0000   028   041   000    Old_age   Offline      -       28 (Min/Max 9/41)
194 Temperature_Celsius     0x0022   028   041   000    Old_age   Always       -       28 (Min/Max 9/41)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/2447236
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
198 Uncorrectable_Sector_Ct 0x0010   120   120   000    Old_age   Offline      -       0/2447236
199 SATA_CRC_Error_Count    0x0024   200   200   000    Old_age   Offline      -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/2447236
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/2447236
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   093   093   010    Pre-fail  Always       -       1
232 Available_Reservd_Space 0x0032   000   000   000    Old_age   Always       -       13
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       21231
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       4848
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       4848
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       4914

SMART Error Log not supported

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       707         -
# 2  Short offline       Completed without error       00%       699         -
# 3  Short offline       Completed without error       00%       696         -
# 4  Extended offline    Completed without error       00%       696         -
# 5  Short offline       Completed without error       00%       694         -
# 6  Extended offline    Completed without error       00%       694         -
# 7  Extended offline    Completed without error       00%       693         -
# 8  Short offline       Completed without error       00%       692         -
# 9  Extended offline    Completed without error       00%       692         -
#10  Short offline       Completed without error       00%       689         -
#11  Extended offline    Completed without error       00%       687         -
#12  Short offline       Completed without error       00%       686         -
#13  Extended offline    Completed without error       00%       686         -
#14  Short offline       Completed without error       00%       683         -
#15  Short offline       Completed without error       00%       680         -
#16  Short offline       Completed without error       00%       680         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing

In the upper block you can find all reported values as mentioned above. In the lower block you can see the self-test log: SMART self-tests have been run on this drive some time before, and their results are listed here.
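With a long self-test log it is easy to overlook a single bad entry, so here is a small sketch that prints only the entries that did not finish cleanly. It assumes the log format shown above, where each entry starts with "#" followed by its number; failed_selftests is a hypothetical helper name. Note that it also prints aborted or still running tests, since those are not "Completed without error" either.

```shell
# Reads "smartctl -l selftest" style output on stdin and prints every
# self-test entry whose status is not "Completed without error".
# failed_selftests is a hypothetical helper name, not a smartmontools command.
failed_selftests() {
    # Log entries look like "# 1  Extended offline ..." or "#10  Short offline ..."
    awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}
```

It could be used like this: sudo smartctl -l selftest /dev/sdX | failed_selftests. For the log in this post it prints nothing, since all 16 tests completed without error.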

This specific HDD does not show any problems or values to be concerned about, and it is relatively new: (line 6:) roughly 700h of runtime is merely 29 days, which is technically “new” for HDDs.

Interestingly enough, we can see (line 7:) that Power_Cycle_Count is relatively high for 700 hours: 2976 cycles means that this HDD has been turned on and off about 4 times per hour on average.

This suggests that this HDD has been used in a desktop PC rather than in a server.
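The back-of-the-envelope numbers above can be reproduced with a few lines of shell. This is only a sketch: usage_stats is a hypothetical helper name, and the two arguments are typed in by hand from the RAW_VALUE column (the hours part of attribute 9 and the count of attribute 12) rather than parsed, because the raw format differs between vendors.

```shell
# Prints how many days a drive was powered on and how many power cycles it
# went through per powered-on hour.
# usage_stats is a hypothetical helper name; $1 = hours, $2 = power cycles.
usage_stats() {
    # LC_ALL=C keeps the decimal point a "." regardless of the system locale.
    LC_ALL=C awk -v hours="$1" -v cycles="$2" 'BEGIN {
        if (hours <= 0) {
            print "hours must be greater than 0"
            exit 1
        }
        printf "days powered on: %.1f\n", hours / 24
        printf "power cycles per hour: %.1f\n", cycles / hours
    }'
}

# Values from the output above: 711h powered on, 2976 power cycles.
usage_stats 711 2976
```

For this drive it prints 29.6 days and 4.2 power cycles per hour, which matches the rough estimate above.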

I hope this gave some insight into how to read and interpret SMART values for your HDDs. Let me know if you have anything to add, and thanks for being here.