Discussion:
Is this a bug or my disk is going to fail?
Marguerite Su
2014-07-21 14:18:53 UTC
Hi,

A few days ago I suddenly could not mount /root because of:

2014-07-17T03:01:17.500184+08:00 linux kernel: [16528.736353] ata1.00: failed command: WRITE FPDMA QUEUED

2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363] end_request: I/O error, dev sda, sector 59764644

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2847

But:

196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0

Google says it is a hardware problem, but after a reinstall my
computer still works, and the disk is less than 2 years old.

Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
When the command that caused the error occurred, the device was
active or idle.


Can anyone help judge this? Do I have to replace the disk?

Marguerite
Basil Chupin
2014-07-21 15:00:38 UTC
I have no "here is the answer" answer to your question, but let me say
that I have had HDDs fail on me in less than 24 hours, others fail
anywhere between 24 hours and 3 years, and some still work
fine after 15 years.

"Youse buys your ticket and youse take your chances".

You don't mention what brand of HDD you have, but the manufacturer will
most likely offer a diagnostic app that checks the drive for
possible failures.

And, of course, if you have 'smartmontools' installed, it will do
the same.

BC
--
Using openSUSE 13.1, KDE 4.13.3 & kernel 3.15.6-1 on a system with-
AMD FX 8-core 3.6/4.2GHz processor
16GB PC14900/1866MHz Quad Channel RAM
Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU
Larry Finger
2014-07-21 16:02:50 UTC
From the printouts above, smartctl is available, and a long disk surface test
should be done soon. That will test the disk surfaces, read/write heads, and the
circuitry that handles reading and writing. The parts that get minimal testing
are the interface with the disk controller, the cable to the controller, and the
adapter in your computer. My sense is that the error message arises from
communication between the adapter and the computer.

The manufacturer's app will likely test the entire system.

If this system is a desktop, an easy thing to do is to replace the cable. On a
laptop, that is generally not possible. In any case, I would be sure to have
backups and a spare drive.

Larry
Greg Freemyer
2014-07-21 17:48:51 UTC
On Mon, Jul 21, 2014 at 12:02 PM, Larry Finger
Post by Larry Finger
If this system is a desktop, an easy thing to do is to replace the cable. On a
laptop, that is generally not possible. In any case, I would be sure to have
backups and a spare drive.
I agree with Larry: my first diagnostic step when I see errors
being reported is to replace the SATA cable. If that fixes the
problem, just toss the old cable.

If it doesn't, that is when I start diving into the "smartctl -a" data.

Greg
--
Greg Freemyer
Marguerite Su
2014-07-21 18:15:23 UTC
Hi, Greg,

Unfortunately it's my laptop...

And here's smartctl information:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-11-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint M8 (AF)
Device Model: SAMSUNG HN-M101MBB
Serial Number: S2R8J9BB808817
LU WWN Device Id: 5 0024e9 205e8f61c
Firmware Version: 2AR10001
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 3.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Tue Jul 22 02:09:58 2014 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (13320) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection
on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 222) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2850
2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
3 Spin_Up_Time 0x0023 089 089 025 Pre-fail Always - 3466
4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2161
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12867
10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 135
12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 2242
181 Program_Fail_Cnt_Total 0x0022 080 080 000 Old_age Always - 452013413
191 G-Sense_Error_Rate 0x0022 100 100 000 Old_age Always - 820
192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
194 Temperature_Celsius 0x0002 064 047 000 Old_age Always - 36 (Min/Max 17/53)
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 463
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 135
225 Load_Cycle_Count 0x0032 032 032 000 Old_age Always - 691031

SMART Error Log Version: 1
Warning: ATA error count 19579 inconsistent with error log pointer 5

ATA Error Count: 19579 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:00.203 READ DMA
ef 10 02 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00 00:00:00.203 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00 00:00:00.203 IDENTIFY DEVICE
ef 02 00 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable write cache]

Error 19578 occurred at disk power-on lifetime: 12725 hours (530 days + 5 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:00.233 READ DMA
c8 00 08 84 ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 ac ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 cc ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 44 fa 93 e0 00 00:00:00.233 READ DMA

Error 19577 occurred at disk power-on lifetime: 12680 hours (528 days + 8 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:35.602 READ DMA
25 00 10 15 ed 7a e0 00 00:00:35.602 READ DMA EXT
25 00 10 f5 9c 76 e0 00 00:00:35.602 READ DMA EXT
25 00 10 05 c7 7a e0 00 00:00:35.602 READ DMA EXT
25 00 10 f5 99 76 e0 00 00:00:35.602 READ DMA EXT

Error 19576 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:29.969 READ DMA
c8 00 50 d4 d9 6f e0 00 00:00:29.969 READ DMA
c8 00 30 3c e0 6f e0 00 00:00:29.969 READ DMA
c8 00 88 b4 df 6f e0 00 00:00:29.969 READ DMA
c8 00 e8 14 d7 6f e0 00 00:00:29.969 READ DMA

Error 19575 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 c8 84 ef 8f e3 Error: UNC 200 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 f8 54 ef 8f e3 00 00:00:29.968 READ DMA
c8 00 08 4c ef 8f e3 00 00:00:29.968 READ DMA
c8 00 08 44 ef 8f e3 00 00:00:29.968 READ DMA
c8 00 20 44 ca 6f e0 00 00:00:29.968 READ DMA
c8 00 b0 b4 be ac e3 00 00:00:29.968 READ DMA

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has
ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


At first I didn't treat it as a hardware problem. I treated it as a
software problem, because I couldn't believe that all of a sudden my disk was
producing "end_request" errors for about 10 sectors in a very narrow time
window (just a few minutes) while I can still use my openSUSE
without any major problem (a hard shutdown results in a slow boot,
but suspend to RAM/disk still looks fine, and daily
operation is good too).


Marguerite
Greg Freemyer
2014-07-21 19:10:42 UTC
I'm guessing you have one or more bad sectors in the 8 sectors between
59764612 and 59764619. Since Linux normally issues 4KB reads, you
can't tell from the normal diagnostics / errors which specific sector
is bad.

You can use hdparm to read one sector at a time.

"hdparm --read-sector 59764612 /dev/sda" will test that one sector.
Test all 8, one at a time.
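
A minimal sketch of the "test all 8, one at a time" step, assuming the failing 4 KB page starts at LBA 59764612 on /dev/sda as reported in the error log. The helper name `read_sector_commands` is made up for illustration; it only builds the hdparm command lines and does not run anything.

```python
def read_sector_commands(first_lba, count=8, device="/dev/sda"):
    """Build one `hdparm --read-sector` command per sector in the range."""
    return [
        ["hdparm", "--read-sector", str(lba), device]
        for lba in range(first_lba, first_lba + count)
    ]


if __name__ == "__main__":
    for cmd in read_sector_commands(59764612):
        # Print only; run each line by hand as root and note which ones fail.
        print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch harmless on a machine that has no such disk.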

When you know which one(s) it is, you can use hdparm to force a write
to the sector:

"hdparm --repair-sector 59764612 /dev/sda" will trash the existing
content of the sector, but it should trigger a reallocation of the
sector to a good one.

After you use --repair-sector, try to read it again with
--read-sector. It should now read fine. If not, something is wrong
with the drive.

Be sure not to run --repair-sector on sectors you don't know have failed.
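
To make that warning concrete, here is a hedged sketch that emits a repair command only for sectors already confirmed unreadable and rejects anything outside the page under test. `repair_commands` and its parameters are hypothetical names; the `--yes-i-know-what-i-am-doing` flag is the confirmation hdparm asks for before it will overwrite a sector. Nothing is executed here.

```python
def repair_commands(confirmed_bad, page_start, page_len=8, device="/dev/sda"):
    """Return hdparm repair commands for confirmed-bad sectors only."""
    page = range(page_start, page_start + page_len)
    cmds = []
    for lba in sorted(set(confirmed_bad)):
        if lba not in page:
            # Refuse to touch sectors that were never part of the failing page.
            raise ValueError(f"sector {lba} is outside the page under test")
        cmds.append([
            "hdparm", "--yes-i-know-what-i-am-doing",
            "--repair-sector", str(lba), device,
        ])
    return cmds
```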

The trouble is you have permanently lost whatever that sector used to
hold. I know of no general way to get it back. After you repair
the sector, you should try to figure out what file / inode / metadata the
sector was associated with and try to fix things back up. If the
sector is part of a file's data, the simplest thing is just to
delete the file and get a replacement copy from backup.

For the kernel team, this brings up the question if XFS CRC metadata
correction is going to be in openSUSE 13.2? If so, does the new XFS
code have a way to force a write of the corrected data back to disk?

== details of my analysis ==
Post by Marguerite Su
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 12867

Relatively young at 12,867 power-on hours.
Post by Marguerite Su
5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0

Zero reallocated sectors.
Post by Marguerite Su
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 135

No idea what that is.
Post by Marguerite Su
181 Program_Fail_Cnt_Total 0x0022 080 080 000 Old_age Always - 452013413

No idea what that is.
Post by Marguerite Su
200 Multi_Zone_Error_Rate 0x002a 100 100 000 Old_age Always - 463

No idea what that is.
Post by Marguerite Su
195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 252 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0

Strange that all of those report zero if the drive is actually having problems.

If the above is all you had to go by, I would say the drive is fine
and it's a problem elsewhere, but it is NOT all we have.
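
The "all of those report zero" check can be sketched in a few lines. This assumes each attribute row has been joined onto a single line in the ten-column `smartctl -A` layout; `parse_row`, `nonzero_watched` and the `WATCHED` set are illustrative names, not part of smartmontools.

```python
# Attributes discussed above: reallocation, pending, uncorrectable, CRC errors.
WATCHED = {5, 196, 197, 198, 199}


def parse_row(line):
    """Split one attribute row into the fields used below."""
    parts = line.split(None, 9)
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": parts[9],
    }


def nonzero_watched(lines):
    """Names of watched attributes whose raw value is not zero."""
    flagged = []
    for line in lines:
        row = parse_row(line)
        raw = int(row["raw"].split()[0])  # ignore trailing notes like "(Min/Max ...)"
        if row["id"] in WATCHED and raw != 0:
            flagged.append(row["name"])
    return flagged


rows = [
    "1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 2850",
    "5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0",
    "196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0",
    "197 Current_Pending_Sector 0x0032 252 100 000 Old_age Always - 0",
    "198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0",
    "199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0",
]
print(nonzero_watched(rows))
```

With the rows from this drive the list comes back empty, which is why the attribute table alone does not point at the disk.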

We have the list of recent discrete errors. Look for my comments in
<= comment => blocks

Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days + 17 hours)

<= That is 10 power-on hours ago (Power_On_Hours is now 12867) =>

When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:00.203 READ DMA
ef 10 02 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable SATA feature]
27 00 00 00 00 00 e0 00 00:00:00.203 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
ec 00 00 00 00 00 a0 00 00:00:00.203 IDENTIFY DEVICE
ef 02 00 00 00 00 a0 00 00:00:00.203 SET FEATURES [Enable write cache]

<= so an uncorrectable read error occurred at sector 59764612, that is
the drive itself reporting the bad sector. 8 sectors is one 4KB page.
I don't think you can tell which specific sector failed. =>
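
As a cross-check on the numbers in that log entry, the register dump can be decoded by hand. The sketch below assumes the usual 28-bit LBA layout (SN, CL, CH as the low three bytes and the low nibble of DH as the top four bits) and the 512-byte logical sector size shown in the smartctl header; `decode_lba28` is an illustrative name.

```python
def decode_lba28(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from the "40 51 08 84 ef 8f e3" line (ER ST SC SN CL CH DH).
sc, sn, cl, ch, dh = 0x08, 0x84, 0xEF, 0x8F, 0xE3

lba = decode_lba28(sn, cl, ch, dh)
span_bytes = sc * 512  # sector count times logical sector size

print(lba, span_bytes)
```

The decoded LBA matches the 59764612 that smartctl prints, and 8 sectors of 512 bytes is the 4096-byte page mentioned above.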

Error 19578 occurred at disk power-on lifetime: 12725 hours (530 days + 5 hours)
<= 132 hours before the last sector error =>

When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:00.233 READ DMA
c8 00 08 84 ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 ac ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 cc ef 93 e0 00 00:00:00.233 READ DMA
c8 00 08 44 fa 93 e0 00 00:00:00.233 READ DMA

<= Again a read of that same 4 KB page failed. Not sure which sector. =>

Error 19577 occurred at disk power-on lifetime: 12680 hours (528 days + 8 hours)
<= 45 hours before the last one =>
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:35.602 READ DMA
25 00 10 15 ed 7a e0 00 00:00:35.602 READ DMA EXT
25 00 10 f5 9c 76 e0 00 00:00:35.602 READ DMA EXT
25 00 10 05 c7 7a e0 00 00:00:35.602 READ DMA EXT
25 00 10 f5 99 76 e0 00 00:00:35.602 READ DMA EXT

<= It's that same 4 KB page again. =>

Error 19576 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 84 ef 8f e3 00 00:00:29.969 READ DMA
c8 00 50 d4 d9 6f e0 00 00:00:29.969 READ DMA
c8 00 30 3c e0 6f e0 00 00:00:29.969 READ DMA
c8 00 88 b4 df 6f e0 00 00:00:29.969 READ DMA
c8 00 e8 14 d7 6f e0 00 00:00:29.969 READ DMA
<= and again, that same bad page 2 hours before the last one =>

Error 19575 occurred at disk power-on lifetime: 12678 hours (528 days + 6 hours)
When the command that caused the error occurred, the device was
active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 c8 84 ef 8f e3 Error: UNC 200 sectors at LBA = 0x038fef84 = 59764612

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 f8 54 ef 8f e3 00 00:00:29.968 READ DMA
c8 00 08 4c ef 8f e3 00 00:00:29.968 READ DMA
c8 00 08 44 ef 8f e3 00 00:00:29.968 READ DMA
c8 00 20 44 ca 6f e0 00 00:00:29.968 READ DMA
c8 00 b0 b4 be ac e3 00 00:00:29.968 READ DMA
<= slightly different, just before the single 4KB page read failed, a
200 sector read failed (it was the same hour) =>
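
The spacing between the five logged errors can be recomputed directly from the power-on lifetimes in the log. A small sketch, with `gaps` as an illustrative helper name and the current Power_On_Hours value taken from attribute 9:

```python
power_on_hours_now = 12867  # attribute 9 in the smartctl output

# Error number -> power-on lifetime in hours, copied from the error log.
errors = {19579: 12857, 19578: 12725, 19577: 12680, 19576: 12678, 19575: 12678}


def gaps(now, lifetimes):
    """Hours between each logged error and the next newer event (or now)."""
    result = {}
    newer = now
    for num in sorted(lifetimes, reverse=True):
        result[num] = newer - lifetimes[num]
        newer = lifetimes[num]
    return result


for num, hours in gaps(power_on_hours_now, errors).items():
    print(f"error {num}: {hours} h before the next newer event")
```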

===============================================
--
Greg Freemyer
Jan Kara
2014-07-22 10:24:05 UTC
On Mon 21-07-14 15:10:42, Greg Freemyer wrote:
...
Post by Greg Freemyer
The trouble is you have permanently lost whatever that sector used to
hold. I know of no general way to get it back. After you repair
the sector, you should try to figure out what file / inode / metadata the
sector was associated with and try to fix things back up. If the
sector is part of a file's data page, the simplest thing is just to
delete the file and get a replacement copy from backup.
For the kernel team, this brings up the question of whether XFS CRC metadata
correction is going to be in openSUSE 13.2. If so, does the new XFS
code have a way to force a write of the corrected data back to disk?
XFS has a CRC feature and it will be part of openSUSE 13.2 (at least
kernel and xfsprogs will support it - but note that the on-disk format
changes, so you'll have to create the filesystem from scratch). But XFS uses
that feature only to detect issues - to find out that a sector has garbage
instead of real data. There is no plan (AFAIK) to use CRCs to recover
anything, and in fact the CRCs used are too weak for anything like that. IMHO
when you want to restore anything after a sector failure, you should be using
RAID...

Honza
--
Jan Kara <jack-***@public.gmane.org>
SUSE Labs, CR
Jean Delvare
2014-07-22 14:40:06 UTC
Post by Greg Freemyer
The trouble is you have permanently lost whatever that sector used to
hold. I know of no general way to get it back. After you repair
the sector, you should try to figure out what file / inode / metadata the
sector was associated with and try to fix things back up. If the
sector is part of a file's data page, the simplest thing is just to
delete the file and get a replacement copy from backup.
If the file in question is on the system partition, you can try "rpm
-Va" to find out which file that was, and then reinstall the package in
question.

You should be able to find out which partition the file was on by
comparing the sector number with the disk geometry as reported by fdisk
or cfdisk.
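
As a rough illustration of that comparison, here is a minimal Python sketch. The partition table below is made up (it is not Marguerite's real layout); only the sector number comes from the kernel log in this thread. Substitute the start/end sectors that fdisk prints for your own disk.

```python
from typing import List, NamedTuple, Optional, Tuple


class Partition(NamedTuple):
    name: str
    start: int  # first sector of the partition (inclusive)
    end: int    # last sector of the partition (inclusive)


def find_partition(lba: int, table: List[Partition]) -> Optional[Tuple[Partition, int]]:
    """Return the partition containing `lba` and the sector offset into it."""
    for part in table:
        if part.start <= lba <= part.end:
            return part, lba - part.start
    return None  # sector is outside every listed partition


# HYPOTHETICAL layout in 512-byte sectors; replace with your fdisk output.
TABLE = [
    Partition("/dev/sda1", 2048, 4196351),
    Partition("/dev/sda2", 4196352, 88082431),
    Partition("/dev/sda3", 88082432, 976773119),
]

hit = find_partition(59764644, TABLE)  # sector from the kernel I/O error
if hit is not None:
    part, offset = hit
    print(f"{part.name}: sector offset {offset}")
```

The offset it reports is counted in sectors from the start of the partition, which is the number needed for any later file system lookup.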
--
Jean Delvare
SUSE L3 Support
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe-***@public.gmane.org
To contact the owner, e-mail: opensuse-kernel+owner-***@public.gmane.org
Stefan Seyfried
2014-07-22 15:12:13 UTC
Post by Jean Delvare
Post by Greg Freemyer
The trouble is you have permanently lost whatever that sector used to
hold. I know of no general way to get it back. After you repair
the sector, you should try to figure out what file / inode / metadata the
sector was associated with and try to fix things back up. If the
sector is part of a file's data page, the simplest thing is just to
delete the file and get a replacement copy from backup.
If the file in question is on the system partition, you can try "rpm
-Va" to find out which file that was, and then reinstall the package in
question.
There is a nice HOWTO on the smartmontools site:

http://smartmontools.sourceforge.net/badblockhowto.html

I used this in the past to find out which file (on an ext3 FS) a given
block number belonged to, so that I could restore it from backup when a
disk started to die (the disk got replaced before it died, but I wanted
to make sure to avoid overwriting the good backup with a bad file).
Post by Jean Delvare
You should be able to find out which partition the file was on by
comparing the sector number with the disk geometry as reported by fdisk
or cfdisk.
That's easy, but then you need the offset into that partition and with
that offset, you can use debugfs (with the block size reported by tune2fs)
to find out the inode number / file name this belongs to. I forgot all the
details, but the above linked howto explains it all.
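
From memory, the arithmetic in that howto boils down to the following. This is only a sketch for ext2/ext3-style file systems; the partition start and block size below are placeholder values, not taken from Marguerite's disk.

```python
SECTOR_SIZE = 512  # bytes per logical sector


def fs_block(lba: int, part_start: int, block_size: int) -> int:
    """File system block (counted from the partition start) holding sector `lba`."""
    if lba < part_start:
        raise ValueError("sector lies before the start of the partition")
    return (lba - part_start) * SECTOR_SIZE // block_size


# HYPOTHETICAL values: take the partition start from fdisk and the
# block size from "tune2fs -l <partition>".
PART_START = 4196352
BLOCK_SIZE = 4096

block = fs_block(59764644, PART_START, BLOCK_SIZE)  # sector from the kernel log
print(f"file system block: {block}")
```

With the block number in hand, debugfs can map it to an inode ("icheck <block>") and the inode to a path ("ncheck <inode>").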

Good Luck Marguerite!

seife
--
Stefan Seyfried
Linux Consultant & Developer -- GPG Key: 0x731B665B

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
Greg Freemyer
2014-07-21 17:45:51 UTC
Post by Marguerite Su
Hi,
failed command: WRITE FPDMA QUEUED
2014-07-17T03:01:17.841313+08:00 linux kernel: [16529.077363]
end_request: I/O error, dev sda, sector 59764644
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 84 ef 8f e3 Error: UNC 8 sectors at LBA = 0x038fef84 = 59764612
1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail
Always - 2847
196 Reallocated_Event_Count 0x0032 252 252 000 Old_age
Always - 0
Google says it is the hardware problem, but after a reinstall, my
computer still works...and actually the disk is less than 2 years
old..
Error 19579 occurred at disk power-on lifetime: 12857 hours (535 days
+ 17 hours)
When the command that caused the error occurred, the device was
active or idle.
Can anyone help judge this? Do I have to replace the disk?
Marguerite
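
For reference, the LBA in the quoted error line can be rebuilt from the register dump. The sketch below assumes 28-bit LBA addressing, which is how the summary error log line above appears to be laid out (SN, CL, CH, plus the low nibble of DH); the function name is just for illustration.

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA."""
    # SN holds bits 0-7, CL bits 8-15, CH bits 16-23, low nibble of DH bits 24-27.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register line from the smartctl error log: ER ST SC SN CL CH DH
regs = "40 51 08 84 ef 8f e3"
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in regs.split())

lba = lba28(sn, cl, ch, dh)
print(f"LBA = 0x{lba:08x} = {lba}, sector count = {sc}")
```

This gives the same LBA that smartctl prints (0x038fef84 = 59764612), a few sectors below the sector 59764644 named in the kernel message.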
It is certainly not a bug, but it may also not be time to replace the drive.

You gave us too little data to even make an educated guess.

Use "smartctl -a /dev/sda" to see all the smart output and post it here.

A couple comments:

- disks are cheap, businesses often replace a disk the first time they
have a read error that is reported all the way up the stack to
userspace. You've just had one of those.

The "smart" data you are reporting is very manufacturer and even disk
model specific. In general you have to record it periodically and
look for changes to see if anything bad is going on. Despite that, I
don't do that myself and neither do many other people.

- a raw read error may just mean that ECC correction had to be
invoked, so who cares. My current laptop is reporting 226 million raw
read errors. It also says the exact same 226 million attempts at ECC
correction worked.

- It is also reporting zero reallocations and zero
pending reallocations. I'm not planning to replace it. You've had
only 2847 raw read errors so that by itself looks like a small,
insignificant number.

- My "reported Uncorrectable" raw value is zero, so not even a single
retry has been required yet. That is surprising because a physical
bump of the laptop while it is reading data can cause an uncorrectable
error that a retry would likely solve.

- In case you don't know, many / most (but not all) disks only re-allocate
sectors on write. Thus if you read a sector and the data is not
readable and the ECC correction also fails, many drives will mark that
sector for re-allocation. It will stay in that state until that
specific sector is written with replacement data. After all, why should
the drive re-allocate the sector if it doesn't know what to put in the
new sector?

- Thus if you want to know how many current
bad media sectors you have, it should be reflected in the "pending
reallocation" data.

- If you want to know how many bad sectors you used to have,
but were fixed by reallocation, look at the Re-allocated data.
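
A small sketch of pulling those counters out of the attribute table that "smartctl -A" prints. The first and third sample rows are the ones posted in this thread; the rows for attributes 5 and 197 are invented for illustration, and attribute names do vary between vendors, so treat the WATCHED list as an assumption.

```python
# Attribute names as commonly shown by smartmontools; vendor-specific in practice.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Reported_Uncorrect")

SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       2847
  5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       8
"""  # rows 5 and 197 are MADE-UP sample data


def parse_attributes(text):
    """Map attribute name -> raw value for each row of the attribute table."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Row layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            result[fields[1]] = int(fields[9])
    return result


def suspicious(attrs):
    """Return the watched attributes whose raw value is non-zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


attrs = parse_attributes(SAMPLE)
print(suspicious(attrs))
```

Recording this output periodically and comparing it over time is more telling than any single reading.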

As I understand it, some drives will monitor those "pending
re-allocation" sectors and if you ever successfully read data from
them, they will go ahead and allocate a new sector and write the good
data there. I know for certain that there have been reports of read-only
actions triggering re-allocations. I don't know on which makes/models
of drives this has been observed.

Note that the drive itself will NOT proactively monitor the sectors
for issues, nor will it periodically check pending bad sectors to see
if it can get valid data.

Thus some raid setups have the ability to perform background media
scans and if a media error is reported, get the valid data from an
alternate drive and write it back to the drive reporting the bad
sector. NetApp (among others) calls this scrubbing the array.
(https://library.netapp.com/ecmdocs/ECMP1196912/html/GUID-81F8BEA3-ADC1-4790-81F1-3E376BC98B27.html)
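
To make the scrubbing idea concrete, here is a toy in-memory simulation of a two-way mirror. It is not how NetApp or any real RAID layer implements it; the Disk class and its "bad" set are purely hypothetical stand-ins for a drive with unreadable sectors.

```python
class MediaError(Exception):
    """Raised when a sector cannot be read."""


class Disk:
    def __init__(self, sectors):
        self.sectors = list(sectors)
        self.bad = set()  # sector numbers that currently fail on read

    def read(self, n):
        if n in self.bad:
            raise MediaError(n)
        return self.sectors[n]

    def write(self, n, data):
        # A write supplies fresh data, so the drive can re-allocate the sector.
        self.sectors[n] = data
        self.bad.discard(n)


def scrub(primary, mirror):
    """Read every sector of `primary`; rewrite unreadable ones from `mirror`."""
    repaired = []
    for n in range(len(primary.sectors)):
        try:
            primary.read(n)
        except MediaError:
            primary.write(n, mirror.read(n))
            repaired.append(n)
    return repaired


a = Disk([b"aa", b"bb", b"cc"])
b = Disk([b"aa", b"bb", b"cc"])
a.bad.add(1)  # simulate a pending (unreadable) sector on the first disk

repaired = scrub(a, b)
print("repaired sectors:", repaired)
```

The point of the write-back is the same as in the bullets above: the drive only gets a chance to re-allocate the sector once it is handed good data to put there.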

Greg






