How to Monitor and Test SSD Health in Linux
A detailed look at S.M.A.R.T. technology for solid-state drives, and at how to use the smartctl tool to monitor and check SSD health in Linux.
What is S.M.A.R.T.?
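S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) is a technology embedded in storage devices such as hard disk drives and SSDs. Its goal is to monitor their health status.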
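In practice, S.M.A.R.T. monitors several disk parameters during normal drive operation, such as the number of read errors, the drive start-up time, or even environmental conditions. In addition, S.M.A.R.T. can run on-demand tests on the drive.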
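Ideally, S.M.A.R.T. would make it possible to anticipate predictable failures, such as those caused by mechanical wear or degradation of the disk surface, as well as unpredictable failures caused by an unexpected defect. Since drives usually do not fail abruptly, S.M.A.R.T. gives the operating system or the system administrator a way to identify drives that are about to fail, so they can be replaced before any data loss occurs.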
What S.M.A.R.T. is not
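This all sounds wonderful. However, S.M.A.R.T. is not a crystal ball. It cannot predict a failure with 100% accuracy, nor can it guarantee that a drive will not fail without any early warning. At best, S.M.A.R.T. should be used to estimate the likelihood of a failure.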
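Given the statistical nature of failure prediction, S.M.A.R.T. technology is of particular interest to companies that use large numbers of storage units. Field studies have been conducted to estimate how accurately S.M.A.R.T.-reported issues anticipate disk replacement needs in data centers or server farms.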
In 2016, Microsoft and Pennsylvania State University conducted a study focused on SSDs.
According to that study, some S.M.A.R.T. attributes appear to be good indicators of an imminent failure. The paper specifically mentions:
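Reallocated sector (Realloc) count:
Although the underlying technologies are radically different, this indicator appears to be as significant for SSDs as it is for hard drives. It is worth mentioning that, because of the wear-leveling algorithms used in SSDs, once several blocks start failing, many more are likely to fail soon.
Program/erase (P/E) fail count:
This is a symptom of a problem with the underlying flash hardware, where the drive was unable to erase or store data in a block. Because of imperfections in the manufacturing process, a small number of such errors is to be expected. However, flash memory supports only a limited number of erase/write cycles. A sudden increase in the number of events may therefore indicate that the drive has reached its end-of-life limit, and that many more memory cells can be expected to fail soon.
CRC and uncorrectable errors ("data errors"):
These events can be caused by storage errors or by problems with the drive's internal communication link. This indicator takes into account both corrected errors (for which nothing is reported to the host system) and uncorrected errors (for which the drive reports to the host system that it was unable to read the data). In other words, correctable errors are invisible to the host operating system, but they still affect drive performance, since the data must be corrected by the drive firmware and a sector relocation may occur.
SATA downshift count:
Because of temporary disturbances, problems with the communication link between the drive and the host, or internal drive issues, the SATA interface may switch to a lower signaling rate. Downgrading the link below its nominal rate has an obvious impact on observed drive performance. Selecting a lower signaling rate is not uncommon, especially on older drives. This indicator is therefore most significant when it correlates with one or more of the preceding ones.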
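According to the study, 62% of failed SSDs showed at least one of the symptoms above. Turn that statement around, however, and it also means that 38% of the studied SSDs failed without showing any of them. The study does not say whether the failed drives exhibited any other S.M.A.R.T.-reported failure. So this figure cannot be compared directly with the 36% rate of failures without prior notice that the Google paper reports for hard drives.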
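The Microsoft/Penn State paper does not disclose the exact drive models studied, but according to the authors, most of the drives came from the same vendor and spanned several generations.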
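The study noticed significant differences in reliability between models. For example, the "worst" model studied showed a 20% failure rate nine months after the first reallocation error, and a failure rate as high as 36% nine months after the first data error. The "worst" model also happens to be the older drive generation studied in the paper.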
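On the other hand, for the same symptoms, drives belonging to the youngest generation showed failure rates of only 3% and 20% respectively. It is hard to tell whether these figures can be explained by improvements in drive design and manufacturing, or whether they are simply an effect of drive aging.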
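Most interestingly, and for reasons I hinted at earlier, the paper points out that it is not the raw value but a sudden increase in the number of reported errors that should be treated as an alarming indicator: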
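"There is a higher likelihood of the symptoms preceding SSD failures, with an intense manifestation and rapid progression preventing their survivability beyond a few months."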
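In other words, an occasional S.M.A.R.T.-reported error should probably not be taken as a sign of imminent failure. However, when a healthy SSD starts reporting more and more errors, a short- to mid-term failure has to be anticipated.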
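To make the "sudden increase" idea concrete, here is a minimal sketch in shell. It assumes a hypothetical history of raw error counts, one sample per line and oldest first, collected by some periodic job. The function name `error_trend` and the threshold of 10 are illustrative assumptions, not values taken from the study.

```shell
# Hypothetical helper: read raw error counts on stdin (one per line, oldest
# first) and report whether the latest sample jumped by more than the limit
# given as first argument. The limit is an arbitrary illustration.
error_trend() {
    awk -v limit="$1" '
        NR > 1 { delta = $1 - prev }
        { prev = $1 }
        END {
            if (NR < 2)             print "not enough samples"
            else if (delta > limit) print "ALERT: +" delta " since last sample"
            else                    print "stable: +" delta " since last sample"
        }'
}

# Made-up history: a few occasional errors, then a sharp rise.
printf '%s\n' 3 3 4 4 57 | error_trend 10
```

With this made-up history the last sample jumps from 4 to 57, so the function raises an alert. A slowly growing counter would be reported as stable.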
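But how do you know whether your hard drive or SSD is healthy? Whether to satisfy your curiosity or because you want to start monitoring your drives closely, it is time to introduce the smartctl monitoring tool: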
Using smartctl to monitor SSD status in Linux
There are several ways to list disks in Linux, but to monitor their S.M.A.R.T. status I suggest the smartctl tool, which is part of the smartmontools package (at least on Debian/Ubuntu).
sudo apt install smartmontools
smartctl is a command-line tool, which is perfect, especially if you want to automate data collection on your servers.
The first step in using smartctl is to check whether your disk has S.M.A.R.T. enabled and is supported by the tool:
sh$ sudo smartctl -i /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Momentus 7200.4
Device Model: ST9500420AS
Serial Number: 5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Mar 12 15:54:43 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
As you can see, my laptop's internal hard drive is indeed S.M.A.R.T.-capable, and S.M.A.R.T. support is enabled. So, what about its S.M.A.R.T. status? Have any errors been recorded?
Reporting "all SMART information about the disk" is the job of the -a option:
sh$ sudo smartctl -i -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Momentus 7200.4
Device Model: ST9500420AS
Serial Number: 5VJAS7FL
LU WWN Device Id: 5 000c50 02fa0b800
Firmware Version: D005SDM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Mon Mar 12 15:56:58 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 110) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 099 006 Pre-fail Always - 29694249
3 Spin_Up_Time 0x0003 100 098 085 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 095 095 020 Old_age Always - 5413
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 51710773327
9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 26423
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 096 037 020 Old_age Always - 4836
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 072 072 000 Old_age Always - 28
188 Command_Timeout 0x0032 100 096 000 Old_age Always - 4295033738
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 056 042 045 Old_age Always In_the_past 44 (Min/Max 21/44 #22)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 184
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 104
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 395415
194 Temperature_Celsius 0x0022 044 058 000 Old_age Always - 44 (0 13 0 0 0)
195 Hardware_ECC_Recovered 0x001a 050 045 000 Old_age Always - 29694249
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 25131 (246 202 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 3028413736
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1613088055
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 08 ff ff ff 4f 00 00:45:12.580 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.580 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.579 READ FPDMA QUEUED
60 00 08 ff ff ff 4f 00 00:45:12.571 READ FPDMA QUEUED
60 00 20 ff ff ff 4f 00 00:45:12.543 READ FPDMA QUEUED
Error 2 occurred at disk power-on lifetime: 21171 hours (882 days + 3 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 00:45:09.456 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:09.451 READ FPDMA QUEUED
61 00 08 ff ff ff 4f 00 00:45:09.450 WRITE FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:08.878 READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00 00:45:08.856 READ FPDMA QUEUED
Error 1 occurred at disk power-on lifetime: 21131 hours (880 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 00 ff ff ff 4f 00 05:52:18.809 READ FPDMA QUEUED
61 00 00 7e fb 31 45 00 05:52:18.806 WRITE FPDMA QUEUED
60 00 00 ff ff ff 4f 00 05:52:18.571 READ FPDMA QUEUED
ea 00 00 00 00 00 a0 00 05:52:18.529 FLUSH CACHE EXT
61 00 08 ff ff ff 4f 00 05:52:18.527 WRITE FPDMA QUEUED
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 10904 -
# 2 Short offline Completed without error 00% 12 -
# 3 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Understanding the output of the smartctl command
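That is a lot of information, and it is not always easy to interpret. The most interesting part is probably the one labeled "Vendor Specific SMART Attributes with Thresholds". It reports the various statistics gathered by the S.M.A.R.T. device and lets you compare those values (current or all-time worst) with a vendor-defined threshold.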
For example, here is how my disk reports reallocated sectors:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
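You can see this is a "Pre-fail" attribute. That simply means the attribute corresponds to an anomaly, so if it crosses its threshold, a failure may be imminent. The other category is "Old_age", for attributes that correspond to normal wear.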
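The last field (here "3") is the raw value the drive reports for that attribute. Usually, this number has a physical meaning. Here, it is the actual number of reallocated sectors. For other attributes, however, it could be a temperature in degrees Celsius, a time in hours or minutes, or the number of times the drive has encountered a specific condition.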
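Besides the raw value, a S.M.A.R.T.-enabled drive must report "normalized" values (the VALUE, WORST, and THRESH fields). These values are normalized in the range 1-254 (0-255 for the threshold). The disk firmware performs that normalization using some internal algorithm, and different manufacturers may normalize the same attribute differently. Most values are reported as a percentage, higher being better, but this is not mandatory. When a parameter is lower than or equal to the manufacturer-provided threshold, the disk is said to have failed for that attribute. With all the reservations mentioned in the first part of this article, when a "Pre-fail" attribute has failed, a disk failure is presumably imminent.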
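The "value lower than or equal to threshold" rule can be sketched as a small awk filter. This is only an illustration: the function name `failing_attributes` is made up, it assumes the column layout shown in the attribute table above, and the third sample row is edited by hand to show a failing attribute (it does not come from my disk).

```shell
# Read a smartctl attribute table on stdin and list the attributes whose
# normalized VALUE is at or below the vendor THRESH.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
failing_attributes() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && $4 + 0 <= $6 + 0 {
        print $1, $2, $7, "value=" $4, "thresh=" $6
    }'
}

# Two rows copied from the output above, plus one made-up failing row (ID 10).
failing_attributes <<'EOF'
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 3
  9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 26423
 10 Spin_Retry_Count 0x0013 095 095 097 Pre-fail Always FAILING_NOW 12
EOF
```

On a real system the same filter could be fed with `sudo smartctl -A /dev/sdb | failing_attributes`, where -A prints only the attribute table.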
As a second example, let's examine the "Seek Error Rate":
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
7 Seek_Error_Rate 0x000f 071 060 030 Pre-fail Always - 51710773327
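This, in fact, is the problem with S.M.A.R.T. reporting: the exact meaning of each value is vendor-specific. In my case, Seagate uses a logarithmic scale to normalize the value. So "71" means roughly one error per 10 million seeks (10 to the power 7.1). Interestingly, the all-time worst was one error per 1 million seeks (10 to the power 6.0). If I interpret this correctly, my disk heads are positioned more accurately now than they were in the past. I have not followed this disk closely, so take this analysis with caution. Maybe the drive just needed a break-in period when it was first put into service? Or is this a consequence of mechanical parts wearing down and producing less friction today? Whatever the reason, this value is more a performance indicator than an early failure warning, so it does not bother me much.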
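As a quick check of that arithmetic, and assuming the logarithmic interpretation (normalized value = 10 × log10 of seeks per error) really applies to this drive, the conversion can be sketched with awk. The function name `seeks_per_error` is made up for the example.

```shell
# Convert a normalized Seek_Error_Rate into "seeks per error", under the
# assumption stated above: value = 10 * log10(seeks per error).
seeks_per_error() {
    awk -v v="$1" 'BEGIN { printf "%.0f\n", 10 ^ (v / 10) }'
}

seeks_per_error 71   # current VALUE from the table above
seeks_per_error 60   # WORST from the table above
```

Under that assumption, 71 works out to roughly 12.6 million seeks per error and 60 to exactly 1 million, which matches the orders of magnitude quoted above.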
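Apart from that, and from three suspicious errors recorded about six months ago, the drive appears to be in surprisingly good condition (according to S.M.A.R.T.) for a stock laptop drive that has been powered on for more than 1100 days (26423 hours):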
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 070 070 000 Old_age Always - 26423
Out of curiosity, I ran the same test on a more recent laptop equipped with an SSD:
sh$ sudo smartctl -i /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA THNSNK256GVN8
Serial Number: 17FS131LTNLV
LU WWN Device Id: 5 00080d 9109b2ceb
Firmware Version: K8XA4103
User Capacity: 256 060 514 304 bytes [256 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: M.2
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Mar 13 01:03:23 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
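The first thing to notice is that, even though the device is S.M.A.R.T.-enabled, it is not in the smartctl database. That does not prevent the tool from gathering data from the SSD, but it will not be able to report the exact meaning of the various vendor-specific attributes: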
sh$ sudo smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 11) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000a 100 100 000 Old_age Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 050 Pre-fail Always - 0
5 Reallocated_Sector_Ct 0x0013 100 100 050 Pre-fail Always - 0
7 Unknown_SSD_Attribute 0x000b 100 100 050 Pre-fail Always - 0
8 Unknown_SSD_Attribute 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 171
10 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 105
166 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
167 Unknown_Attribute 0x0022 100 100 000 Old_age Always - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 100
170 Unknown_Attribute 0x0013 100 100 010 Pre-fail Always - 0
173 Unknown_Attribute 0x0012 200 200 000 Old_age Always - 0
175 Program_Fail_Count_Chip 0x0013 100 100 010 Pre-fail Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 18
194 Temperature_Celsius 0x0023 063 032 020 Pre-fail Always - 37 (Min/Max 11/68)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
240 Unknown_SSD_Attribute 0x0013 100 100 050 Pre-fail Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
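This is the output you would typically expect for a brand-new SSD, even though, for lack of normalization or meta-information for the vendor-specific data, many attributes are reported as "Unknown_SSD_Attribute". I can only hope that a future version of smartctl will incorporate data for this particular drive model into the tool's database, so that I can identify possible issues more accurately.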
Testing your SSD in Linux with smartctl
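So far, we have examined the data the drive collects during normal operation. However, the S.M.A.R.T. protocol also supports several "self-test" commands to launch diagnostics on demand.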
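Unless explicitly requested otherwise, self-tests can run during normal disk operation. Since the test and the host I/O requests compete for the drive, disk performance degrades while a test is running. The S.M.A.R.T. specification defines several kinds of self-tests. The most important are: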
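Short self-test (-t short)
This test checks the electrical and mechanical performance of the drive as well as its read performance. The short self-test typically takes only a few minutes to complete (usually 2 to 10).
Extended self-test (-t long)
This test takes one or two orders of magnitude longer to complete. Usually, it is a more in-depth version of the short self-test. In addition, it scans the entire disk surface for data errors, with no time limit. The test duration is proportional to the disk size.
Conveyance self-test (-t conveyance)
This test suite is intended as a relatively quick way to check for damage that may have occurred while the device was being transported.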
Here are examples taken from the same disks as above. I'll let you guess which is which:
sh$ sudo smartctl -t short /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 18:06:17 2018
Use smartctl -X to abort test.
The test has now been started. Let's wait until it completes, then show the results:
sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.10.0-32-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 171 -
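Rather than hard-coding the two-minute wait, the recommended polling time can be read from the capabilities section that smartctl prints. The sketch below assumes the two-line layout shown in the outputs above ("Short self-test routine" followed by "recommended polling time"). The function name `polling_minutes` is made up for the example.

```shell
# Print the recommended polling time (in minutes) for a given self-test kind
# ("Short", "Extended" or "Conveyance"), reading smartctl output on stdin.
polling_minutes() {
    awk -v kind="$1" 'index($0, kind " self-test routine") {
        getline                  # the duration is on the following line
        gsub(/[^0-9]/, "")       # keep only the digits
        print
        exit
    }'
}

# Sample lines copied from the SSD output above.
polling_minutes Short <<'EOF'
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 11) minutes.
EOF
```

On a real system this could be combined with the commands used in this article, for instance `minutes=$(sudo smartctl -c /dev/sdb | polling_minutes Short)` followed by `sleep $((minutes * 60))` before reading the self-test log.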
Now let's run the same test on my other disk:
sh$ sudo smartctl -t short /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Mon Mar 12 21:59:39 2018
Use smartctl -X to abort test.
Once again, sleep for two minutes, then display the test results:
sh$ sudo sh -c 'sleep 120 && smartctl -l selftest /dev/sdb'
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 26429 -
# 2 Short offline Completed without error 00% 10904 -
# 3 Short offline Completed without error 00% 12 -
# 4 Short offline Completed without error 00% 0 -
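Interestingly, in this case, it seems that both the drive and the computer manufacturers ran some quick tests on the disk (at lifetime 0h and 12h). I was definitely much less concerned with monitoring the drive's health myself. So, since I am running some self-tests for this article, let's start an extended test and see how it goes: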
sh$ sudo smartctl -t long /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 110 minutes for test to complete.
Test will complete after Tue Mar 13 00:09:08 2018
Use smartctl -X to abort test.
Obviously, this time we have to wait much longer than for the short test. So let's do it:
sh$ sudo bash -c 'sleep $((110*60)) && smartctl -l selftest /dev/sdb'
[sudo] password for sylvain:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 20% 26430 810665229
# 2 Short offline Completed without error 00% 26429 -
# 3 Short offline Completed without error 00% 10904 -
# 4 Short offline Completed without error 00% 12 -
# 5 Short offline Completed without error 00% 0 -
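In this latter case, pay special attention to the different outcomes of the short and extended tests, even though they were run one right after the other. Well, maybe that disk is not so healthy after all! An important thing to notice is that the test stops after the first read error. So if you want an exhaustive diagnosis of all read errors, you have to continue the test after each error. I encourage you to take a look at the very well written smartctl(8) manual page for more information about the -t select,N-max and -t select,cont options for that: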
sh$ sudo smartctl -t select,810665230-max /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN STARTING_LBA ENDING_LBA
0 810665230 976773167
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.
sh$ sudo smartctl -l selftest /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-6-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Selective offline Completed without error 00% 26432 -
# 2 Extended offline Completed: read failure 20% 26430 810665229
# 3 Short offline Completed without error 00% 26429 -
# 4 Short offline Completed without error 00% 10904 -
# 5 Short offline Completed without error 00% 12 -
# 6 Short offline Completed without error 00% 0 -
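Continuing after each read error means restarting the selective test one sector past the LBA reported in the self-test log. That step can be sketched as follows. The function name `next_lba_after_error` is made up, and the parsing assumes the log layout shown above, where the last column is either an LBA or "-".

```shell
# Read a smartctl self-test log on stdin and print the LBA that follows the
# first error of the most recent failed test (entries are listed newest first).
next_lba_after_error() {
    awk '/^#/ && $NF ~ /^[0-9]+$/ { printf "%.0f\n", $NF + 1; exit }'
}

# Two log entries copied from the output above.
log='# 1 Extended offline Completed: read failure 20% 26430 810665229
# 2 Short offline Completed without error 00% 26429 -'

lba=$(printf '%s\n' "$log" | next_lba_after_error)
# Only print the command here; running it needs root and a real device.
echo "sudo smartctl -t select,${lba}-max /dev/sdb"
```

This reproduces the starting LBA used in the selective test above (810665230). If the log contains no failed test, the function prints nothing.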
Conclusion
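Definitely, S.M.A.R.T. reporting is a technology you can add to your toolbox to monitor the health of your server disks. In that case, you should also take a look at the S.M.A.R.T. disk monitoring daemon, smartd(8), which can help you automate monitoring through syslog reporting.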
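For lighter automation, a script can also act on the exit status of smartctl, which is a bit mask. The sketch below decodes the health-related bits as I understand them from the EXIT STATUS section of smartctl(8); check the manual page of your version before relying on it. The function name `explain_smartctl_status` and the wording of the messages are made up for the example.

```shell
# Decode the health-related bits of a smartctl exit status (bits 3 to 7).
# Bits 0 to 2 (command line, device open and command failures) are ignored here.
explain_smartctl_status() {
    status=$1
    [ $((status & 8)) -ne 0 ]   && echo "SMART status check returned DISK FAILING"
    [ $((status & 16)) -ne 0 ]  && echo "prefail attributes at or below threshold"
    [ $((status & 32)) -ne 0 ]  && echo "attributes were at or below threshold in the past"
    [ $((status & 64)) -ne 0 ]  && echo "error log contains errors"
    [ $((status & 128)) -ne 0 ] && echo "self-test log contains errors"
    return 0
}

# 192 is a made-up status for illustration (64 + 128).
explain_smartctl_status 192
```

On a real system the status would come from something like `sudo smartctl -q silent -a /dev/sdb; explain_smartctl_status $?`.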
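Given the statistical nature of failure prediction, I am less convinced that aggressive S.M.A.R.T. monitoring is of much benefit on a personal computer. Finally, do not forget that, whatever the technology, a drive will fail, and we have seen earlier that in about one third of cases it will fail without prior notice. So nothing replaces RAID and offline backups to ensure your data integrity!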