101357 – (IDE PDC202XX) ata failure with Severn

Summary: (IDE PDC202XX) ata failure with Severn

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux Beta
Classification:	Retired
Component:	kernel
Sub Component:
Version:	beta2
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Dave Jones
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	CambridgeBlocker

Reported:	2003-07-31 04:47 UTC by djh
Modified:	2015-01-04 22:02 UTC (History)
CC List:	4 users (show)
Fixed In Version:	2.4.22-1.2086.nptl
Clone Of:
Environment:
Last Closed:	2003-10-10 01:04:08 UTC
Embargoed:

Attachments	(Terms of Use)
lspci (deleted) 2003-07-31 04:51 UTC, djh	no flags	Details

Description djh 2003-07-31 04:47:36 UTC

Description of problem:
An old Athlon system of mine is not liking Severn.  The ata code reports errors 
followed by severe fs corruption.  This occurs with and without acpi.

This particular system has run lots of 2.4 kernels with no major problems, and 
was successfully running RHL9 up until a few weeks ago.  The drive is healthy.

Here's an example of what happens when it fails:
hde: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hde: task_no_data_intr: error=0x04 { DriveStatusError }
...
EXT3-fs error (...) in start_transaction: Journal has aborted
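
For readers unfamiliar with these messages, the two hex values are bitmasks from the ATA status and error registers. A minimal sketch of how they decode, assuming the standard ATA bit layout and the names the 2.4 IDE driver prints (the `decode` helper and both tables are illustrative, not kernel code; names for bits that do not appear in the log above are labels chosen here):

```python
# Illustrative tables: standard ATA register bits, labelled with the names
# that appear in the log (bits not seen in the log carry assumed labels).
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

ERROR_BITS = {
    0x80: "BadSector",
    0x40: "UncorrectableError",
    0x20: "MediaChanged",
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",
    0x04: "DriveStatusError",  # ABRT: the drive aborted the command
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    print("status=0x51", decode(0x51, STATUS_BITS))
    print("error=0x04", decode(0x04, ERROR_BITS))
```

Read this way, status=0x51 means the drive is ready but flagged an error, and error=0x04 means it aborted the command it was given rather than reporting a media fault.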


Version-Release number of selected component:
2.4.21-20.1.2024.2.1.nptl  (athlon rpm)


Hardware info:
A7V mobo (KT133 chipset, onboard VIA ata66 and Promise ata100 controllers)
hdc is an old Sony burner
hde: QUANTUM FIREBALLP AS40.0, ATA DISK drive
(see attachments for more details)


How reproducible:
~10 mins of idling seems to reliably reproduce it.  I've also seen it occur 
during bootup, and occasionally during normal use.

I cannot reproduce it with RHL9 errata kernel (2.4.20-19.9) + Severn userland.


Additional info:
I also had a similar problem with the installer kernel, but it's much harder to 
reproduce.  It took a few reboots for anaconda to successfully mount the 
existing ext3 partition.  (IIRC it reported lost irq, disabled DMA, and PIO mode 
didn't work)

Also, I see similar 0x51/0x04 errors with hdc after smartd starts.

Comment 1 djh 2003-07-31 04:51:21 UTC

Created attachment 93286 [details]
lspci

Comment 2 Bill Nottingham 2003-07-31 05:21:56 UTC

What happens if you use the i686 kernel instead of the athlon kernel?

Comment 3 djh 2003-07-31 06:28:37 UTC

Same result with the i686 version. (acpi=off)

I'll try some recent vanilla and -ac kernels later.

Comment 4 djh 2003-08-01 05:54:43 UTC

I could not reproduce it with 2.4.21, 2.4.22-pre6-ac1, or with Arjan's 2.6 RPMs
(2.6.0-0.test1.1.26 and 2.6.0-0.test2.1.28).

Comment 5 Peter van Egdom 2003-08-01 18:55:45 UTC

What happens if you turn off the loading of smartd ("chkconfig --level 35 smartd
off") and reboot the machine? On one of my machines smartd does a devicescan and
due to this my ide-tape drive does funny things.

Comment 6 djh 2003-08-06 13:49:07 UTC

It stops the errors about hdc.  No change with hde.

Comment 7 Alan Cox 2003-08-08 20:29:29 UTC

It's something in the Severn stuff - I've seen multiple reports and even with
ACPI and "all the usual suspects" enabled it only happens with the RH tree. It's
really quite weird and I really don't know what Severn is doing here.

Comment 8 djh 2003-08-12 12:57:34 UTC

I've just tried moving the drive from the Promise to the VIA controller - same
result.

BTW here's another report - (the only hardware in common is the harddrive)
http://www.redhat.com/archives/rhl-beta-list/2003-July/msg00962.html

Comment 9 djh 2003-09-30 04:32:57 UTC

I tried the Severn2 kernel (2.4.22-1.2061.nptl) with the Severn1 installation
and the same errors occur.

I have just noticed what has changed - when using the Severn kernels the hard
drive spins down after 5-10 mins.  (I'd really like to know why)

Comment 10 djh 2003-10-02 04:21:41 UTC

Bugzilla has lost the last few comments, so here is a summary.

laptop_mode is disabled.  "HDD power down" is disabled in BIOS.

After a fresh install of Fedora 0.94 it still occurs.
(0x51/04 errors, ide and ext3 failures, reset, manually fsck if required)

Comment 11 Dave Jones 2003-10-02 15:19:59 UTC

We're starting to suspect DMA problems with Fireball drives. This is the
third report I've been able to find, and the drive is the only common factor
(different chipsets each time).

If you feel motivated to investigate this, can you paste the boot messages
of both a RHL9 and a cambridge kernel so we can see how they differ ?

Additionally, booting with ide=nodma may work around the corruption if our
guesses are correct.

Comment 12 Alan Cox 2003-10-02 22:41:30 UTC

You might want to add that Quantum drive to the local blacklist for the PDC202xx
- not sure why it should bite just the Quantum though

Comment 13 Dave Jones 2003-10-03 10:38:34 UTC

It'll need adding in multiple places if that's the case, as this has been seen on
at least 3 different controllers now.

Comment 14 Dave Jones 2003-10-03 10:46:17 UTC

Also #91932 looks very similar (same hardware, also seeing corruption).
Disabling DMA didn't help in that case, so it's back to the drawing board.

Comment 15 Dave Jones 2003-10-03 14:08:26 UTC

Are you using LVM ?

Comment 16 Dave Jones 2003-10-03 14:52:36 UTC

I'm interested to hear if this fares any better...
http://people.redhat.com/davej/2.4.22-1.2086.nptl/

Comment 17 djh 2003-10-03 15:57:07 UTC

No LVM, and ide=nodma didn't help much.

(btw I can't reproduce it with the Taroon kernel - 2.4.21-3.EL)

Comment 18 djh 2003-10-04 02:11:31 UTC

2.4.22-1.2086.nptl is looking good so far.

Comment 19 Dave Jones 2003-10-09 15:02:12 UTC

Any update on this ? Is it behaving now ?

Comment 20 djh 2003-10-10 01:04:08 UTC

With the limited amount of testing I've been able to do, 2.4.22-1.2086.nptl
seems to fix the problem.  2086 lasts for over 6 hours - previous Severn kernels
would fail within 20 mins.

I'll do some further tests, but I believe the problem is fixed.

Comment 21 Dave Jones 2003-10-11 12:51:00 UTC

Sounds promising. Looks like the acoustic management patch doesn't play well
with these drives.
Thanks for chasing this.

Comment 22 Dave Jones 2003-10-16 15:17:12 UTC

Can you paste the output of hdparm -I /dev/hd? from that Quantum Fireball please ?

Comment 23 djh 2003-10-17 03:44:52 UTC

/dev/hde:

ATA device, with non-removable media
        Model Number:       QUANTUM FIREBALLP AS40.0
        Serial Number:      194034230190
        Firmware Revision:  A1Y.1300
Standards:
        Used: ATA/ATAPI-5 T13 1321D revision 1
        Supported: 5 4 3 2 & some of 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:   78177792
        device size with M = 1024*1024:       38172 MBytes
        device size with M = 1000*1000:       40027 MBytes (40 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 4      Queue depth: 1
        Standby timer values: spec'd by Vendor, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 254, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
           *    Automatic Acoustic Management feature set
           *    DOWNLOAD MICROCODE cmd
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
        24min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct
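
As a sanity check on the geometry figures in the hdparm output above, the reported sizes can be re-derived from the sector counts. A minimal sketch, assuming 512-byte sectors (the output does not state the sector size; the variable names are illustrative):

```python
# Values copied from the hdparm -I output; SECTOR_BYTES is an assumption.
SECTOR_BYTES = 512

cylinders, heads, sectors_per_track = 16383, 16, 63
lba_sectors = 78177792

# CHS addressing tops out well below the drive's real capacity.
chs_sectors = cylinders * heads * sectors_per_track

size_bytes = lba_sectors * SECTOR_BYTES
size_mib = size_bytes // (1024 * 1024)  # "M = 1024*1024"
size_mb = size_bytes // (1000 * 1000)   # "M = 1000*1000"

if __name__ == "__main__":
    print("CHS addressable sectors:", chs_sectors)
    print("size with M = 1024*1024:", size_mib, "MBytes")
    print("size with M = 1000*1000:", size_mb, "MBytes")
```

The three results match the CHS sector count and both device sizes that hdparm reports, so the identify data from this drive is at least internally consistent.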
