SMART
Self-Monitoring, Analysis and Reporting Technology (SMART) is a system for monitoring storage media such as hard disk drives (HDDs) and solid-state drives (SSDs) and for detecting errors on them.
Virtually all modern HDDs and SSDs (referred to as HDDs in the following) have SMART functionality, but some external drive enclosures do NOT support reading SMART values from the drive directly.
Since the set of SMART attributes is not fully standardised, the values differ greatly between manufacturers: a value that signals imminent failure on one hard drive may not even be shown or updated on another.
Have a look at the specifications, e.g. at Wikipedia, for further details on what has been standardised.
But how do I actually read SMART values?
I personally use smartctl (on my Linux servers), which can be installed via apt:
apt-get install -y smartmontools
or, on macOS, via brew:
brew install smartmontools
It is then used in the terminal by running the following, replacing /dev/sdX with the disk you want to check:
sudo smartctl -a /dev/sdX
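If the full -a report is more than you need, smartctl can also print narrower views. The following is only a small sketch: the function name smart_overview and the SMARTCTL override variable are my own illustrative choices and not part of smartmontools, while -i, -H, -A and -d sat are regular smartctl options. Run it as root, or set SMARTCTL="sudo smartctl".

```shell
#!/bin/sh
# SMARTCTL can be overridden, e.g. SMARTCTL="sudo smartctl" (illustrative variable).
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: print identity, overall health and the attribute table.
smart_overview() {
    dev="$1"
    $SMARTCTL -i "$dev"   # model, serial number, firmware, SMART support
    $SMARTCTL -H "$dev"   # overall health self-assessment
    $SMARTCTL -A "$dev"   # vendor-specific attribute table only
}

# Usage (replace /dev/sdX with your disk):
#   smart_overview /dev/sdX
# Some USB enclosures only answer when the SAT pass-through type is named:
#   smartctl -d sat -a /dev/sdX
```

Whether -d sat helps depends on the bridge chip in the enclosure; some cases simply do not pass SMART commands through at all.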
Attributes to look out for include the following; I have added a short description of what each might mean for your drive:
- Raw_Read_Error_Rate
  - rate of errors that occurred while reading sectors (the raw value is vendor-specific)
- Spin_Up_Time
  - how long the spindle takes to spin up to full speed (e.g. after leaving power-save mode)
- Start_Stop_Count
  - number of spindle starts and stops, which might be counted from the start and stop signals sent by the controller (e.g. when entering and leaving power-save mode)
- Seek_Error_Rate
  - rate of errors occurring while the heads seek to the requested position
- Power_On_Hours
  - total number of hours the disk has been powered on
Each attribute is assigned a TYPE, which tells you what kind of problem a failing value would indicate: Pre-fail means that failure may be imminent, while Old_age says “this value shows that the drive is old”. Most attributes have the TYPE Old_age, which can be confusing, because how much that matters depends heavily on which ATTRIBUTE has this TYPE.
Depending on your HDD, as said, this might indicate that the drive has a reduced lifespan.
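You might want to have a look at the actual VALUE and WORST columns to see if there could be a problem: VALUE is the current normalized value (usually higher is better) and WORST is the lowest value recorded so far.
I have also seen THRESH (the threshold) being useful, for example for Reallocated_Event_Count.
The threshold might be 003: once the normalized VALUE drops to the threshold or below, smartctl marks the attribute as failed in the WHEN_FAILED column. Note that the threshold applies to the normalized VALUE, not to the raw count of events.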
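To make the comparison of VALUE against THRESH less tedious, the attribute table can be filtered with awk. This is a minimal sketch under a few assumptions: the helper name failing_attributes is made up for illustration, the input is the attribute table as printed by smartctl -A (VALUE in column 4, THRESH in column 6, TYPE in column 7), and a threshold of 000 is treated as "never fails".

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on stdin and print
# every attribute whose normalized VALUE is at or below a non-zero THRESH.
# Assumption: rows with THRESH 000 are skipped, since they cannot "fail".
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && $6 + 0 > 0 && $4 + 0 <= $6 + 0 {
        printf "%s VALUE=%d THRESH=%d TYPE=%s\n", $2, $4, $6, $7
    }'
}

# Usage (replace /dev/sdX with your disk):
#   sudo smartctl -A /dev/sdX | failing_attributes
```

No output means that no attribute has reached its threshold. That is good news, but not a guarantee that the drive is healthy.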
Have a look at this smartctl output:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0033   095   095   050    Pre-fail  Always       -       0/2447236
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       711h+10m+35.860s
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2976
 13 Soft_Read_Error_Rate    0x0032   095   095   000    Old_age   Always       -       0/2447236
100 Gigabytes_Erased        0x0032   000   000   000    Old_age   Always       -       23256
170 Reserve_Block_Count     0x00ef   000   000   000    Pre-fail  Always       -       3424
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       86
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       94
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
184 IO_Error_Detect_Code_Ct 0x00ef   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Airflow_Temperature_Cel 0x0000   028   041   000    Old_age   Offline      -       28 (Min/Max 9/41)
194 Temperature_Celsius     0x0022   028   041   000    Old_age   Always       -       28 (Min/Max 9/41)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/2447236
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
198 Uncorrectable_Sector_Ct 0x0010   120   120   000    Old_age   Offline      -       0/2447236
199 SATA_CRC_Error_Count    0x0024   200   200   000    Old_age   Offline      -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/2447236
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/2447236
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0013   093   093   010    Pre-fail  Always       -       1
232 Available_Reservd_Space 0x0032   000   000   000    Old_age   Always       -       13
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       21231
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       4848
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       4848
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       4914
SMART Error Log not supported
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       707         -
# 2  Short offline       Completed without error       00%       699         -
# 3  Short offline       Completed without error       00%       696         -
# 4  Extended offline    Completed without error       00%       696         -
# 5  Short offline       Completed without error       00%       694         -
# 6  Extended offline    Completed without error       00%       694         -
# 7  Extended offline    Completed without error       00%       693         -
# 8  Short offline       Completed without error       00%       692         -
# 9  Extended offline    Completed without error       00%       692         -
#10  Short offline       Completed without error       00%       689         -
#11  Extended offline    Completed without error       00%       687         -
#12  Short offline       Completed without error       00%       686         -
#13  Extended offline    Completed without error       00%       686         -
#14  Short offline       Completed without error       00%       683         -
#15  Short offline       Completed without error       00%       680         -
#16  Short offline       Completed without error       00%       680         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
In the upper block you can find all reported values as mentioned above. In the lower block you can see the self-test results, which show that SMART self-tests have been run on this drive before; their results are listed there.
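Self-tests like the ones in that log can be started by hand. The sketch below assumes you run it as root; the helper names start_selftest and show_selftest_log and the SMARTCTL override variable are hypothetical, whereas -t short, -t long and -l selftest are regular smartctl options.

```shell
#!/bin/sh
# SMARTCTL can be overridden, e.g. SMARTCTL="sudo smartctl" (illustrative variable).
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: start a self-test; the drive runs it in the background.
start_selftest() {
    dev="$1"
    kind="${2:-short}"    # "short" or "long" (long = extended)
    $SMARTCTL -t "$kind" "$dev"
}

# Hypothetical helper: print the self-test log once the test has finished.
show_selftest_log() {
    $SMARTCTL -l selftest "$1"
}

# Usage (replace /dev/sdX with your disk):
#   start_selftest /dev/sdX long
#   show_selftest_log /dev/sdX
```

smartctl prints an estimated completion time when the test is started. The drive stays usable in the meantime, although a long test can take hours on large disks.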
This specific HDD does not show any problems or values to be concerned about, since it is relatively new: (line 6:) roughly 700 h of runtime is merely 29 days, which is technically “new” for HDDs.
Interestingly enough, we can see (line 7:) that Power_Cycle_Count is relatively high for 700 hours: 2976 cycles means that this HDD has been turned on and off about 4 times per hour on average.
This might indicate that this HDD was used in a desktop PC rather than in a server.
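The arithmetic behind these two observations is easy to reproduce. This is just an illustrative sketch; usage_summary is a made-up helper name, and the two numbers are the raw values of Power_On_Hours_and_Msec (rounded down to full hours) and Power_Cycle_Count from the output above.

```shell
#!/bin/sh
# Hypothetical helper: turn power-on hours and power cycles into
# "days powered on" and "power cycles per hour".
usage_summary() {
    awk -v hours="$1" -v cycles="$2" 'BEGIN {
        if (hours <= 0) { print "hours must be greater than 0"; exit 1 }
        printf "days powered on: %.1f\n", hours / 24
        printf "power cycles per hour: %.1f\n", cycles / hours
    }'
}

# Figures from the smartctl output above: 711 hours, 2976 power cycles.
usage_summary 711 2976
# prints:
#   days powered on: 29.6
#   power cycles per hour: 4.2
```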
I hope this gave some insight into how to read and interpret SMART values for your HDDs. Let me know if you have anything to add, and thanks for being here.