Bug 91932
| Summary: | IDE Errors | | |
|---|---|---|---|
| Product: | [Retired] Red Hat Linux Beta | Reporter: | Thornton Prime <thornton> |
| Component: | kernel | Assignee: | Dave Jones <davej> |
| Status: | CLOSED NEXTRELEASE | QA Contact: | Brian Brock <bbrock> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | alpha 2 | CC: | djh, pfrields |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2003-10-16 01:25:34 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 100643 | | |
| Attachments: | | | |
Description
Thornton Prime
2003-05-29 23:07:12 UTC
Still problems in Cambridge alpha3, but I think problems are isolated to LVM.

Spoke too soon. Happened on a plain ext3 partition (no LVM), though it took a lot longer ... after about 50 passes of bonnie++, the filesystem became unreadable, with the same errors ...

hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete}
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }

Have you gone back to a stable release and these problems have gone away? Do these problems persist with the latest kernel in rawhide?

I did go back to a stable release for this machine because the machine was worthless as a test machine since the IDE problems would crop up within minutes. I will try the latest rawhide, though.

Problems look solved with Fedora Severn2/2.4.22-1.2061.nptl.

Spoke too soon ... my 5th pass of bonnie++ gave this:

# bonnie++ -u root -d .
-bash: /usr/sbin/bonnie++: /lib/ld-linux.so.2: bad ELF interpreter: No such fily
Segmentation fault
journal_bmap_R16ad4e4d: journal block not found at offset 116)
Aborting journal on device lvm(58,0).
journal_bmap_R16ad4e4d: journal block not found at offset 269 on lvm(58,1)
Aborting journal on device lvm(58,1).
ext3_abort called.
EXT3-fs abort (device lvm(58,0)): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete}
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }

(updating with lost bug reports from bugzilla crash).

==============================================================================
------- Additional Comments From christoph.wickert 2003-09-30 17:58 -------

Depends on your kernel config! Quote (from the help text):

>CONFIG_IDEDISK_MULTI_MODE:
>
> If you get this error, try to say Y here:
>
> hda: set_multmode: status=0x51 { DriveReady SeekComplete Error }
> hda: set_multmode: error=0x04 { DriveStatusError }
>
> If in doubt, say N.

================================================================================

Current Fedora kernels already set this option. The help text is out of date, and those warnings can occur from other parts of the IDE code. It's the drive saying it doesn't understand a command it was passed, which is quite easy to hit if you use an old drive. The triggers for these commands need to be found so that some of these messages can be silenced. They are, however, very likely to be unrelated to the corruption problem reported here before bugzilla ate the original reporter's posting.

We're starting to suspect DMA problems with Fireball drives, as this is the third report I've been able to find, and the drive is the only common factor (different chipsets each time).

If you feel motivated to investigate this, can you paste the boot messages of both a RHL9 and a Cambridge kernel so we can see how they differ?
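To make the "drive doesn't understand a command" reading concrete, here is a minimal sketch that decodes the status=0x51 and error=0x04 values from the log lines above. The names DriveReady, SeekComplete, Error and DriveStatusError are taken from those log lines; the remaining bit names are assumptions based on the standard ATA status and error register layout, and the error table is deliberately partial.

```python
# Bit names for 0x40, 0x10, 0x01 (status) and 0x04 (error) come from the
# log lines in this report; the other names are assumptions that follow
# the ATA register layout and may not match the driver's exact wording.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

ERROR_BITS = [
    (0x80, "BadSector"),
    (0x40, "UncorrectableError"),
    (0x10, "SectorIdNotFound"),
    (0x04, "DriveStatusError"),  # ABRT bit: the drive aborted the command
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def format_line(kind, value, table):
    """Render a register value in the same style as the kernel log lines."""
    names = " ".join(decode(value, table))
    return f"{kind}=0x{value:02x} {{ {names} }}"


if __name__ == "__main__":
    print(format_line("status", 0x51, STATUS_BITS))
    print(format_line("error", 0x04, ERROR_BITS))
```

The only error bit set in 0x04 is the abort bit, which fits the explanation that the drive rejected a command rather than reporting a media fault.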
Additionally, booting with ide=nodma may work around the corruption if our guesses are correct.

I am testing now with ide=nodma. Here are boot messages from a Severn2 (I'll post RH9 once I'm done testing):

Linux version 2.4.22-1.2061.nptl (bhcompile.redhat.com) (gcc version3
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 0000000007ec0000 (usable)
 BIOS-e820: 0000000007ec0000 - 0000000007ef8000 (ACPI data)
 BIOS-e820: 0000000007ef8000 - 0000000007f00000 (ACPI NVS)
 BIOS-e820: 00000000ffb80000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
0MB HIGHMEM available.
126MB LOWMEM available.
On node 0 totalpages: 32448
zone(0): 4096 pages.
zone(1): 28352 pages.
zone(2): 0 pages.
ACPI disabled because your bios is from 2000 and too old
You can enable it with acpi=force
ACPI: RSDP (v000 AMI ) @ 0x000ff980
ACPI: RSDT (v001 CAYMAN 8C1A100A 0x20000210 MSFT 0x00000097) @ 0x07ef0000
ACPI: FADT (v001 CAYMAN 8C1A100A 0x20000210 MSFT 0x00000097) @ 0x07ef1000
ACPI: DSDT (v001 CAYMAN CA81020A 0x00000012 MSFT 0x0100000b) @ 0x00000000
Kernel command line: ro root=/dev/vg00/lv00 console=tty0 console=ttyS0,9600n81
eide_setup: ide0=nodma,notune -- BAD OPTION
Initializing CPU#0
Detected 697.900 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1392.64 BogoMIPS
Memory: 124212k/129792k available (1509k kernel code, 5192k reserved, 1114k dat)
Dentry cache hash table entries: 16384 (order: 5, 131072 bytes)
Inode cache hash table entries: 8192 (order: 4, 65536 bytes)
Mount cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 32768 (order: 5, 131072 bytes)
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium III (Coppermine) stepping 03
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: Intel
ACPI: Subsystem revision 20030916
ACPI: Interpreter disabled.
PCI: PCI BIOS revision 2.10 entry at 0xfda95, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
Transparent bridge - Intel Corp. 82801AA PCI Bridge
PCI: Using IRQ router PIIX/ICH [8086/2410] at 00:1f.0
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
apm: BIOS version 1.2 Flags 0x0b (Driver version 1.16)
apm: disabled on user request.
Starting kswapd
VFS: Disk quotas vdquot_6.5.1
Asus Laptop ACPI Extras version 0.24a
Couldn't get the DSDT table header
Error registering Asus Laptop ACPI Extras Driver
-0420: *** Error: Could not allocate an object descriptor
Detected PS/2 Mouse Port.
pty: 2048 Unix98 ptys configured
Serial driver version 5.05c (2001-07-08) with MANY_PORTS MULTIPORT SHARE_IRQ SEd
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Real Time Clock Driver v1.10e
NET4: Frame Diverter 0.46
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ICH: IDE controller at PCI slot 00:1f.1
ICH: chipset revision 2
ICH: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
hda: QUANTUM FIREBALLP AS30.0, ATA DISK drive
blk: queue c040f3a0, I/O limit 4095Mb (mask 0xffffffff)
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-disk driver.
hda: host protected area => 1
hda: 58633344 sectors (30020 MB) w/1902KiB Cache, CHS=3649/255/63, UDMA(66)
Partition check:
 hda: hda1 hda2
ide: late registration of driver.
md: md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
Initializing Cryptographic API
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 8192 bind 16384)
Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: Compressed image found at block 0

ide=nodma resulted in new errors when running bonnie++ tests (these are repeating endlessly). So far no corruption (fingers crossed). I can try again without LVM, but in the past I've seen corruption regardless of LVM ...

Writing with putc()...EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocat4
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo5
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo7
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo9
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo3
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo5
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo7
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in system zo8
EXT3-fs error (device lvm(58,2)): ext3_new_block: Allocating block in ...

OK ... even with ide=nodma, it still looks like I have problems. I ran a few bonnie++ runs. After rebooting, the system couldn't find init. This time I am able to boot with init=/bin/sh and I was able to repair.

Created attachment 94910 [details]
Severn Boot Messages
My previous boot log was captured with a bad kernel parameter, so
ide=nodma wasn't taking effect.
I rebuilt and rebooted (with the correct parameters -- boot messages attached)
and started over ... I am still getting filesystem corruption. After a few
dozen passes of bonnie++, I got the errors below. The interesting thing is that
bonnie++ was writing to /var on one logical volume, and only reading /usr from
another logical volume, but it was /usr that got corrupt ... this certainly
points to something beneath the filesystem as the source of the corruption.
Rebooting, the /usr volume was pretty hosed. Most of my shared libraries were
unrecoverable.
I'll re-run the same test without LVM, but with ide=nodma.
----------
# EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in0
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0
EXT3-fs error (device lvm(58,1)): ext3_add_entry: bad entry in directory #800030
INIT: version 2.85 reloading
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0
[root@vajra root]# bonnie++
bonnie++: error while loading shared libraries: libstdc++.so.5: cannot open shay
[root@vajra root]# ldconfig
EXT3-fs error (device lvm(58,1)): ext3_readdir: bad entry in directory #80003: 0
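Since the earlier test run was invalidated by a parameter the kernel silently rejected, it can help to check a captured boot log for rejected options before trusting a test. The following is a minimal sketch that assumes rejections always look like the `eide_setup: ide0=nodma,notune -- BAD OPTION` line in the Severn2 boot messages; the function name and the sample text are illustrative only.

```python
import re

# Assumed format, modelled on the single line seen in this report:
#   "<source>: <option> -- BAD OPTION"
BAD_OPTION = re.compile(r"^(?P<source>\w+): (?P<option>\S+) -- BAD OPTION$")


def rejected_options(boot_log):
    """Return (source, option) pairs for every line flagged BAD OPTION."""
    found = []
    for line in boot_log.splitlines():
        match = BAD_OPTION.match(line.strip())
        if match:
            found.append((match.group("source"), match.group("option")))
    return found


# Excerpt copied from the boot messages posted earlier in this report.
SAMPLE = """\
Kernel command line: ro root=/dev/vg00/lv00 console=tty0 console=ttyS0,9600n81
eide_setup: ide0=nodma,notune -- BAD OPTION
Initializing CPU#0
"""

if __name__ == "__main__":
    for source, option in rejected_options(SAMPLE):
        print(f"{source} rejected {option}")
```

An empty result only means no line matched this one pattern; it does not prove that every parameter was honoured.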
Created attachment 94911 [details]
RedHat 9 Boot Messages
Here are the boot messages from a RH9 install.
I'm interested to hear if this fares any better... http://people.redhat.com/davej/2.4.22-1.2086.nptl/

Any news on this?

Sorry, my testing windows on this machine are rather limited ... but I will hopefully get another one very soon.

Sorry, I never got a chance to load 2086, but I am now running 2088 (Severn3). So far, my same tests (bonnie++ plus some large finds) have been working great. It has been running almost 12 hours straight with only one error in bonnie and no kernel errors to speak of. No file system corruption. This works for me! Thanks.

Looks like it was due to the AAM patch. Can you paste the output of hdparm -I /dev/hda (or whatever drive that Quantum Fireball is)?

# hdparm -I /dev/hda

/dev/hda:

ATA device, with non-removable media
    Model Number:       QUANTUM FIREBALLP AS30.0
    Serial Number:      193036239076
    Firmware Revision:  A1Y.1300
Standards:
    Used: ATA/ATAPI-5 T13 1321D revision 1
    Supported: 5 4 3 2 & some of 6
Configuration:
    Logical         max     current
    cylinders       16383   16383
    heads           16      16
    sectors/track   63      63
    --
    CHS current addressable sectors:   16514064
    LBA    user addressable sectors:   58633344
    device size with M = 1024*1024:       28629 MBytes
    device size with M = 1000*1000:       30020 MBytes (30 GB)
Capabilities:
    LBA, IORDY(can be disabled)
    bytes avail on r/w long: 4      Queue depth: 1
    Standby timer values: spec'd by Vendor, no device specific minimum
    R/W multiple sector transfer: Max = 16  Current = 16
    Recommended acoustic management value: 254, current value: 128
    DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 *udma4 udma5
         Cycle time: min=120ns recommended=120ns
    PIO: pio0 pio1 pio2 pio3 pio4
         Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
    Enabled Supported:
       *    READ BUFFER cmd
       *    WRITE BUFFER cmd
       *    Host Protected Area feature set
       *    Look-ahead
       *    Write cache
       *    Power Management feature set
            Security Mode feature set
       *    SMART feature set
       *    Automatic Acoustic Management feature set
       *    DOWNLOAD MICROCODE cmd
Security:
    Master password revision code = 65534
            supported
    not     enabled
    not     locked
    not     frozen
    not     expired: security count
    not     supported: enhanced erase
    18min for SECURITY ERASE UNIT. 8min for ENHANCED SECURITY ERASE UNIT.
HW reset results:
    CBLID- above Vih
    Device num = 0 determined by CSEL
Checksum: correct

I've gotten similar errors and system lockups from an NEC-6500A DVD burner in two different Thinkpad laptops under RH9 (not sure what kernel) and FC4 (the stock kernel, 2.6.9, I believe).
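For anyone comparing drives against the Fireball output earlier in this report, here is a minimal sketch that pulls the two values most relevant to the AAM and DMA theories out of `hdparm -I` text. It assumes the line formats shown in that output, where an asterisk marks the currently selected DMA mode; the function names are illustrative and nothing here invokes hdparm itself.

```python
import re

# Pattern modelled on the "Recommended acoustic management value" line
# in the hdparm -I output quoted in this report.
AAM = re.compile(
    r"Recommended acoustic management value:\s*(\d+),\s*current value:\s*(\d+)"
)


def acoustic_management(hdparm_output):
    """Return (recommended, current) AAM values, or None if not reported."""
    match = AAM.search(hdparm_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def active_dma_mode(hdparm_output):
    """Return the DMA mode marked with '*', or None if no mode is marked."""
    for line in hdparm_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DMA:"):
            for mode in stripped[len("DMA:"):].split():
                if mode.startswith("*"):
                    return mode[1:]
    return None


# Two lines copied from the hdparm -I output in this report.
SAMPLE = """\
Recommended acoustic management value: 254, current value: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 *udma4 udma5
"""

if __name__ == "__main__":
    print(acoustic_management(SAMPLE))
    print(active_dma_mode(SAMPLE))
```

On the sample, the current acoustic value (128) differs from the recommended one (254), and the active mode is udma4, which matches the UDMA(66) reported in the boot messages.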