hard disk dying or kernel bug?

Discussion:

Ludwig Nussel

2014-10-07 06:34:54 UTC

Hi,

Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have the following messages in dmesg:

[20727.025399] sas: Enter sas_scsi_recover_host busy: 6 failed: 6
[20727.025407] sas: trying to find task 0xffff8800375776c0
[20727.025410] sas: sas_scsi_find_task: aborting task 0xffff8800375776c0
[20727.025418] isci 0000:05:00.0: isci_task_abort_task: dev = (null) (STP/SATA <NULL>), task = ffff8800375776c0, old_request == (null)
[20727.025421] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff8800375776c0
[20727.025425] isci 0000:05:00.0: isci_task_abort_task: Done; dev = (null), task = ffff8800375776c0 , old_request == (null)
[20727.025428] sas: sas_scsi_find_task: task 0xffff8800375776c0 is done
[20727.025430] sas: sas_eh_handle_sas_errors: task 0xffff8800375776c0 is done
[20727.025433] sas: trying to find task 0xffff880037577440
[20727.025435] sas: sas_scsi_find_task: aborting task 0xffff880037577440
[20727.025439] isci 0000:05:00.0: isci_task_abort_task: dev = (null) (STP/SATA <NULL>), task = ffff880037577440, old_request == (null)
[20727.025442] isci 0000:05:00.0: isci_task_abort_task: abort task not needed for ffff880037577440
[20727.025446] isci 0000:05:00.0: isci_task_abort_task: Done; dev = (null), task = ffff880037577440 , old_request == (null)
[20727.025448] sas: sas_scsi_find_task: task 0xffff880037577440 is done
[20727.025450] sas: sas_eh_handle_sas_errors: task 0xffff880037577440 is done
[20727.025452] sas: trying to find task 0xffff880037577940
[20727.025454] sas: sas_scsi_find_task: aborting task 0xffff880037577940
...
[20727.025528] sas: ata7: end_device-6:0: cmd error handler
[20727.025602] sas: ata7: end_device-6:0: dev error handler
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }
...
[20727.025688] ata7.00: status: { DRDY }
[20727.025694] ata7: hard resetting link
[20727.219155] ata7.00: configured for UDMA/133
[20727.219164] ata7.00: device reported invalid CHS sector 0
[20727.219167] ata7.00: device reported invalid CHS sector 0
[20727.219170] ata7.00: device reported invalid CHS sector 0
[20727.219173] ata7.00: device reported invalid CHS sector 0
[20727.219176] ata7.00: device reported invalid CHS sector 0
[20727.219178] ata7.00: device reported invalid CHS sector 0
[20727.219212] ata7: EH complete
[20727.219262] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[40650.589614] perf interrupt took too long (2520 > 2500), lowering kernel.perf_event_max_sample_rate to 50000

Is the disk dying (smartctl output attached) or is it a kernel bug?

cu
Ludwig

--
(o_ Ludwig Nussel
//\
V_/_ http://www.suse.de/
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)

Hannes Reinecke

2014-10-07 06:40:38 UTC

Post by Ludwig Nussel
Hi,
Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have
[20727.025399] sas: Enter sas_scsi_recover_host busy: 6 failed: 6
[20727.025407] sas: trying to find task 0xffff8800375776c0
[20727.025410] sas: sas_scsi_find_task: aborting task
0xffff8800375776c0
[20727.025418] isci 0000:05:00.0: isci_task_abort_task: dev
= (null) (STP/SATA <NULL>), task = ffff8800375776c0,
old_request == (null)
[20727.025421] isci 0000:05:00.0: isci_task_abort_task: abort task
not needed for ffff8800375776c0
[20727.025425] isci 0000:05:00.0: isci_task_abort_task: Done; dev
= (null), task = ffff8800375776c0 , old_request
== (null)
[20727.025428] sas: sas_scsi_find_task: task 0xffff8800375776c0 is done
[20727.025430] sas: sas_eh_handle_sas_errors: task
0xffff8800375776c0 is done
[20727.025433] sas: trying to find task 0xffff880037577440
[20727.025435] sas: sas_scsi_find_task: aborting task
0xffff880037577440
[20727.025439] isci 0000:05:00.0: isci_task_abort_task: dev
= (null) (STP/SATA <NULL>), task = ffff880037577440,
old_request == (null)
[20727.025442] isci 0000:05:00.0: isci_task_abort_task: abort task
not needed for ffff880037577440
[20727.025446] isci 0000:05:00.0: isci_task_abort_task: Done; dev
= (null), task = ffff880037577440 , old_request
== (null)
[20727.025448] sas: sas_scsi_find_task: task 0xffff880037577440 is done
[20727.025450] sas: sas_eh_handle_sas_errors: task
0xffff880037577440 is done
[20727.025452] sas: trying to find task 0xffff880037577940
[20727.025454] sas: sas_scsi_find_task: aborting task
0xffff880037577940
...
[20727.025528] sas: ata7: end_device-6:0: cmd error handler
[20727.025602] sas: ata7: end_device-6:0: dev error handler
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }

That's an NCQ failure, most likely a TLER issue (time-limited error
recovery), i.e. the device encountered an error which its internal
error recovery couldn't fix up.

And yes, the ATA stack doesn't handle that one well ...
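
For anyone decoding the exception line above: as far as I understand libata's error reporting, SAct is a bitmask of the NCQ tags that were outstanding when the port was frozen. A minimal sketch of unpacking it (the function name sact_tags is made up for illustration):

```shell
# List the NCQ tags whose bits are set in an SAct bitmask.
# Assumption: one bit per outstanding tag, bit 0 = tag 0.
sact_tags() {
    mask=$(( $1 ))          # accepts hex such as 0x7e0
    tag=0
    tags=""
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            tags="$tags $tag"
        fi
        mask=$(( mask >> 1 ))
        tag=$(( tag + 1 ))
    done
    echo "${tags# }"        # strip the leading space
}

sact_tags 0x7e0   # prints: 5 6 7 8 9 10
```

That is six outstanding tags, which lines up with "busy: 6 failed: 6" and with the six "invalid CHS sector 0" messages after the link reset.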

Post by Ludwig Nussel
...
[20727.025688] ata7.00: status: { DRDY }
[20727.025694] ata7: hard resetting link
[20727.219155] ata7.00: configured for UDMA/133
[20727.219164] ata7.00: device reported invalid CHS sector 0
[20727.219167] ata7.00: device reported invalid CHS sector 0
[20727.219170] ata7.00: device reported invalid CHS sector 0
[20727.219173] ata7.00: device reported invalid CHS sector 0
[20727.219176] ata7.00: device reported invalid CHS sector 0
[20727.219178] ata7.00: device reported invalid CHS sector 0
[20727.219212] ata7: EH complete
[20727.219262] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
[40650.589614] perf interrupt took too long (2520 > 2500), lowering
kernel.perf_event_max_sample_rate to 50000
Is the disk dying (smartctl output attached) or is it a kernel bug?

It's on its way out:

1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016

A high raw read error rate _is_ worrying.

7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910

And a high seek error rate even more so.
Get a new disk.

Cheers,

Hannes

--
Dr. Hannes Reinecke zSeries & Storage
hare-***@public.gmane.org +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe, e-mail: opensuse-kernel+unsubscribe-***@public.gmane.org
To contact the owner, e-mail: opensuse-kernel+owner-***@public.gmane.org

Ludwig Nussel

2014-10-07 07:47:56 UTC

Post by Hannes Reinecke

Post by Ludwig Nussel
[...]
Is the disk dying (smartctl output attached) or is it a kernel bug?

1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail
Always - 46434016
A high raw read error rate _is_ worrying.
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail
Always - 17265656910
And a high seek error rate even more so.
Get a new disk.

Will do. Thanks! :-)

cu
Ludwig

--
(o_ Ludwig Nussel
//\
V_/_ http://www.suse.de/
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 16746 (AG Nürnberg)

Bruno Friedmann

2014-10-07 07:48:33 UTC

Post by Hannes Reinecke

That's an NCQ failure, most likely TLER issue (time-limited error
recovery). IE the device encountered an error which the internal
error recovery couldn't fix up.
And yes, the ATA stack doesn't handle that one well ...

Ludwig, I don't know if your current disk can be saved, but if you have others,
there's a Seagate firmware update (the ISO is PXE- and USB-bootable too).

Device Model: ST2000DM001-1CH164
Serial Number: Z1F3H9EJ
LU WWN Device Id: 5 000c50 0643f2559
Firmware Version: CC27
User Capacity: 2'000'398'934'016 bytes [2.00 TB]

I've seen Barracuda-series drives die several times just because their firmware was not up to date :-)

--
Bruno Friedmann
Ioda-Net Sàrl www.ioda-net.ch

openSUSE Member & Board, fsfe fellowship
GPG KEY : D5C9B751C4653227
irc: tigerfoot

Carlos E. R.

2014-10-07 09:50:49 UTC

1 Raw_Read_Error_Rate 0x000f 112 099 006 Pre-fail Always - 46434016
A high raw read error rate _is_ worrying.
7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 17265656910

Not necessarily, as it is a Seagate. Look at one of mine, still young (exact same model, newer firmware (CC27)):

Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1CH164
Firmware Version: CC27
User Capacity: 2,000,398,934,016 bytes [2.00 TB]

1 Raw_Read_Error_Rate 0x000f 119 099 006 Pre-fail Always - 227236888
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 394
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2434
7 Seek_Error_Rate 0x000f 074 060 030 Pre-fail Always - 8646212908

Another disk of the same model family:

Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-1CH166
Firmware Version: CC27
User Capacity: 3,000,592,982,016 bytes [3.00 TB]

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 175510856
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 384
7 Seek_Error_Rate 0x000f 064 060 030 Pre-fail Always - 21489215161
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 2236

An older disk:

Model Family: Seagate Barracuda 7200.12
Device Model: ST3500418AS
Firmware Version: CC37
User Capacity: 500,107,862,016 bytes [500 GB]

1 Raw_Read_Error_Rate 0x000f 118 099 006 Pre-fail Always - 199173627
3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 094 094 020 Old_age Always - 6294
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 030 Pre-fail Always - 103648914
9 Power_On_Hours 0x0032 081 081 000 Old_age Always - 17190

Modern Seagates report absurdly high raw values for these error-rate attributes; that is normal for them. The disk may be bad, but not based on those numbers.
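
To put numbers on this: it is widely reported, though as far as I know not officially documented by Seagate, that on these drives the 48-bit raw value of Raw_Read_Error_Rate and Seek_Error_Rate packs an error count into the upper 16 bits and a running operation count into the lower 32 bits. Treat that layout as an assumption. Under it, a rough sketch of the split (decode_seagate_raw is a name invented here, and the shell needs 64-bit arithmetic):

```shell
# Split a Seagate raw attribute value into two fields.
# ASSUMPTION (community lore, not confirmed in this thread):
#   bits 47..32 = number of errors
#   bits 31..0  = number of operations (reads or seeks)
decode_seagate_raw() {
    raw=$1
    errors=$(( raw >> 32 ))
    ops=$(( raw & 0xFFFFFFFF ))
    printf 'errors=%d operations=%d\n' "$errors" "$ops"
}

decode_seagate_raw 17265656910   # Seek_Error_Rate quoted above
decode_seagate_raw 46434016      # Raw_Read_Error_Rate quoted above
```

Under that reading, the first value works out to 4 seek errors in roughly 86 million seeks and the second to zero raw read errors, which is unremarkable.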

However, his disk has 15838 hours of use.
The figures from smartctl are not conclusive, as no test has been run recently:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 0 -

So what I would do is first run the short test, then the long one, and then check the figures again.
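
Spelled out as commands, that sequence might look like the sketch below. It only prints the commands rather than running them, because each test executes inside the drive and has to finish before the next one is started. selftest_plan is an illustrative name, and the device path is whatever you pass in.

```shell
# Print, in order, the smartctl commands for
# "short test, long test, then re-check the figures".
# Nothing is executed; run each line by hand as root and wait for the
# estimated completion time smartctl reports before moving on.
selftest_plan() {
    dev=$1
    printf 'smartctl --test=short %s\n'   "$dev"   # usually a few minutes
    printf 'smartctl --log=selftest %s\n' "$dev"   # result of the short test
    printf 'smartctl --test=long %s\n'    "$dev"   # can take hours on a 2 TB disk
    printf 'smartctl --log=selftest %s\n' "$dev"   # result of the long test
    printf 'smartctl --attributes %s\n'   "$dev"   # compare with the earlier table
}

selftest_plan /dev/sda
```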

After that, I would have the smartd daemon run the short test automatically and periodically on all disks.
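
A possible smartd.conf entry for that, as a sketch: the -s directive takes a regular expression matched against T/MM/DD/d/HH, so S/../.././02 requests a short test every day in the hour starting at 02:00. DEVICESCAN with -a applies the default checks to every disk smartd can find. The helper name and the default hour are assumptions made for illustration.

```shell
# Emit a smartd.conf line that schedules a daily short self-test on all disks.
# $1: hour of day as a plain integer 0..23 (no leading zero); defaults to 2.
smartd_schedule_line() {
    hour=${1:-2}
    printf 'DEVICESCAN -a -s S/../.././%02d\n' "$hour"
}

smartd_schedule_line      # prints: DEVICESCAN -a -s S/../.././02
```

The printed line would go into smartd's configuration file (usually /etc/smartd.conf), followed by a restart of the smartd service.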

--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)

Greg Freemyer

2014-10-07 11:12:09 UTC

Post by Hannes Reinecke

Post by Ludwig Nussel
[20727.025430] sas: sas_eh_handle_sas_errors: task
0xffff8800375776c0 is done
[20727.025433] sas: trying to find task 0xffff880037577440
[20727.025435] sas: sas_scsi_find_task: aborting task
0xffff880037577440
[20727.025439] isci 0000:05:00.0: isci_task_abort_task: dev
= (null) (STP/SATA <NULL>), task = ffff880037577440,
old_request == (null)
[20727.025442] isci 0000:05:00.0: isci_task_abort_task: abort task
not needed for ffff880037577440
[20727.025446] isci 0000:05:00.0: isci_task_abort_task: Done; dev
= (null), task = ffff880037577440 , old_request
== (null)
[20727.025448] sas: sas_scsi_find_task: task 0xffff880037577440 is done
[20727.025450] sas: sas_eh_handle_sas_errors: task
0xffff880037577440 is done
[20727.025452] sas: trying to find task 0xffff880037577940
[20727.025454] sas: sas_scsi_find_task: aborting task
0xffff880037577940
...
[20727.025528] sas: ata7: end_device-6:0: cmd error handler
[20727.025602] sas: ata7: end_device-6:0: dev error handler
[20727.025615] ata7.00: exception Emask 0x0 SAct 0x7e0 SErr 0x0 action 0x6 frozen
[20727.025620] ata7.00: failed command: WRITE FPDMA QUEUED
[20727.025628] ata7.00: cmd 61/40:00:d8:03:1b/00:00:45:00:00/40 tag 5 ncq 32768 out
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[20727.025631] ata7.00: status: { DRDY }

That's an NCQ failure, most likely TLER issue (time-limited error
recovery). IE the device encountered an error which the internal
error recovery couldn't fix up.
And yes, the ATA stack doesn't handle that one well ...

I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.

What actual errors does the drive report (smartctl --log)?

If the drive was the source of the problem it will have logged it internally.
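
One way to look for evidence of a bad cable from the drive's side: SMART attribute 199 (UDMA_CRC_Error_Count), if the drive reports it, counts CRC errors seen on the interface, and a rising raw value there usually points at the cable or connector rather than the platters. A small sketch that pulls the raw value out of smartctl's attribute table; crc_error_count is a made-up helper, and it assumes the usual ten-column layout of that table.

```shell
# Read "smartctl --attributes" output on stdin and print the raw value
# of attribute 199 (UDMA_CRC_Error_Count). Prints nothing if the drive
# does not report that attribute.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Intended use (needs root and a real device):
#   smartctl --attributes /dev/sda | crc_error_count
```

A value that stays at 0 does not prove the cable is good, but a value that keeps climbing after each link reset is a strong hint.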

Greg

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Carlos E. R.

2014-10-07 13:44:23 UTC

Post by Greg Freemyer
I'd give odds the drive is fine and the sata cable is bad. Whenever I see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?

Notice that the syntax is not that simple:

Telcontar:~ # smartctl --health /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Telcontar:~ # smartctl --log /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=======> INVALID ARGUMENT TO -l: /dev/sda
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory[,g|s], xerror[,N][,error], xselftest[,N][,selftest], background, sasphy[,reset], sataphy[,reset], scttemp[sts,hist], scttempint,N[,p], scterc[,N,M], devstat[,N], ssd, gplog,N[,RANGE], smartlog,N[,RANGE] <=======

Use smartctl -h to get a usage summary

Telcontar:~ #

--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)

Greg Freemyer

2014-10-07 13:58:04 UTC

Post by Carlos E. R.
[...]

sudo /usr/sbin/smartctl --log=error /dev/sda

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.11.10-21-desktop] (SUSE RPM)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
No Errors Logged

Ludwig Nussel

2014-10-08 08:23:02 UTC

Post by Greg Freemyer
I'd give odds the drive is fine and the sata cable is bad. Whenever I
see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.

After backing up my data I ran the long self-test. No errors were
logged, so I'll give the cable a try. Thanks for the hint!
Also, we don't have smartd enabled by default. I guess we should, though.

cu
Ludwig

Greg Freemyer

2014-10-08 10:57:10 UTC

Post by Greg Freemyer
I'd give odds the drive is fine and the sata cable is bad. Whenever I
see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.

Your comment about smartd seems like a non-sequitur.

Smartd has the job of copying the disk's internal logs to syslog so that admins will notice them. The internal disk log exists independently of what smartd is doing.

Greg

Carlos E. R.

2014-10-08 11:17:26 UTC

Post by Greg Freemyer
Your comment about smartd seems like a non-sequitur.
Smartd has the job of copying internal disk logs to syslog so
admins will notice them. The internal disk log exists independent
of what smartd is doing.

It does more.

smartd also can run short/long tests on the disks, periodically, and
send you emails with reports or problems. You can define your own
script and send SMSs to your phone, for instance.
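
As a rough sketch of the "own script" part: smartd can call a program through the -M exec directive (used together with -m) and hands it details in SMARTD_* environment variables, as described in smartd.conf(5). The function name below and the send_sms command in the comment are hypothetical; the SMS gateway is whatever you have available.

```shell
#!/bin/sh
# Hypothetical hook for smartd's "-M exec" directive.
# smartd exports SMARTD_FAILTYPE, SMARTD_DEVICE and SMARTD_MESSAGE
# (among others) before running the script.
format_smartd_alert() {
    printf 'SMART %s on %s: %s\n' \
        "${SMARTD_FAILTYPE:-unknown}" \
        "${SMARTD_DEVICE:-unknown device}" \
        "${SMARTD_MESSAGE:-no message}"
}

# send_sms is a placeholder, NOT a real tool; substitute your own gateway:
#   format_smartd_alert | send_sms "$MY_PHONE_NUMBER"
format_smartd_alert
```

A matching smartd.conf line might read "DEVICESCAN -a -m root -M exec /path/to/this/script", with the path adjusted to wherever the script is installed.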

--
Cheers / Saludos,

Carlos E. R.
(from 13.1 x86_64 "Bottle" at Telcontar)

Basil Chupin

2014-10-09 05:47:04 UTC

Post by Ludwig Nussel

Post by Greg Freemyer
I'd give odds the drive is fine and the sata cable is bad. Whenever I
see that "hard resetting link" message, the cable is my first suspect.
What actual errors does the drive report (smartctl --log)?
If the drive was the source of the problem it will have logged it internally.

After having backed up my data I ran the long self test. No errors
logged. So I'll give the cable a try. Thanks for hint!

I meant to respond to this yesterday, re the cable, but other things got in
the way...

Have a look at the cable to see if it is a red one, in which case the
probability of it having to be replaced is high.

At one point I was running another distro, and while on its mailing list I
read a comment by a tech who was an "old hand" at the game.

His comment was that cables with red plastic coatings were causing
problems, but he couldn't figure out why until one of his own computers
"went down" and he decided to pull one of these cables apart.
What he found was that the copper wires inside the cable had "rotted"
away, caused by some chemicals used in the production of the cables.
Not an urban myth, BTW. I still see such cables at stores which
sell components at the "lower end of the market", and I always avoid them.
(This advice is along the lines of what occurred some years ago, when
cables became available which made it *so* much easier to connect
HDDs to the m/board because the cables were longer. The only hassle was
that these cables were too long and caused signal bounce, which
corrupted data written to/read from the HDDs. The designed max
length of cables was (?)18 inches, but the new ones were longer, thus
causing the data corruption.)

[pruned]

BC

--
Using openSUSE 13.2, KDE 4.14.1 & kernel 3.16.3-1 on a system with-
AMD FX 8-core 3.6/4.2GHz processor
16GB PC14900/1866MHz Quad Channel RAM
Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU

Basil Chupin

2014-10-07 09:01:12 UTC

Post by Ludwig Nussel
Hi,
Running current Factory kernel (3.16.3-1.gd2bbe7f-desktop) I have the

[pruned]

See the thread "Interpretation please" in 'opensuse help' which I
started on 1 Sept 2014.

If you are, as I was, concerned about the results produced by smartctl
then send them to Seagate and ask them for their comment. The reply I
got was that I had no problems, but to confirm I could run Seatools
against the HDDs I have.

BTW, the responses I got by asking that question in HELP ranged from
"The sky is falling!" to "Nothing to worry about - typical Seagate results".

BC

--
Using openSUSE 13.1, KDE 4.14.1 & kernel 3.16.3-1 on a system with-
AMD FX 8-core 3.6/4.2GHz processor
16GB PC14900/1866MHz Quad Channel RAM
Gigabyte AMD3+ m/board; Gigabyte nVidia GTX660 GPU