Bug 63296
Summary: | Kernel oops 0000 during SMP boot on Dell dual Pentium/90 server | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Robert G. 'Doc' Savage <dsavage> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 7.3 | CC: | jimrh |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i586 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2004-09-30 15:39:30 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 61901, 67218, 79579, 100644 |
Description
Robert G. 'Doc' Savage
2002-04-12 04:51:29 UTC
Can you try the 2.4.18-0.21 (or later) kernel from rawhide ftp://ftp.redhat.com/pub/redhat/linux/rawhide/i386/RedHat/RPMS and see if it fixes the problem?

FYI, Dell doesn't test machines built circa 1994 with new Red Hat releases.

Any idea approximately where this happens during the boot? Also, a bit more of the backtrace text (with 2.4.18-0.21 or so; -0.12 had a bug there) would be very welcome.

Just downloaded kernel-smp-2.4.18-0.21 and installed it. No joy. This time I get a kernel panic rather than an oops. Here is a longer trace listing for 0.13, followed by the one for 0.21. Please forgive any transcription typos. Note that each trace differs a bit from the other:

kernel-smp-2.4.18-0.13
======================
Calibrating delay_loop... 178.99 BogoMIPS
Memory: 60912k/65536k available (1932k kernel code, 4240k reserved, 352k data, 304k init, 0k himem)
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 1024 (order: 1, 8192 bytes)
Buffer cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
Intel Pentium with F0 0F bug - workaround enabled.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: none
CPU0: Intel Pentium 75 - 200 stepping 05
per-CPU timeslice cutoff: 158.37 usecs.
task migration cache decay timeout: 10 msecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 179.81 BogoMIPS
CPU1: Intel Pentium 75 - 200 stepping 05
Total of 2 processors activated (358.80 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
..TIMER: vector=0x31 pin1=-1 pin2=-1
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... works.
testing the IO APIC.......................
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 89.9966 MHz.
..... host bus clock speed is 59.9974 MHz.
cpu: 0, clocks: 599974, slice: 199991
CPU0<T0:599968,T1:399952,D:12,S:199991,C:599974>
cpu: 1, clocks: 599974, slice: 199991
CPU1<T0:599968,T1:199984,D:2,S:199991,C:599974>
checking TSC synchronization across CPUs:
BIOS BUG: CPU#0 improperly initialized, has -1678593 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has 1678593 usecs TSC skew! FIXED.
PCI: PCI BIOS revision 2.00 entry at 0xfcad0, last bus=0
PCI: Using configuration type 2
PCI: Probing PCI hardware
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0118607
*pde = 00000000
Oops: 0000
Unable to handle kernel paging request at virtual address c6578237
printing eip:
c0130d36
*pde = 00000000

kernel-smp-2.4.18-0.21
======================
Booting processor 1/1 eip 2000
Initializing CPU #1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop ... 179.81 BogoMIPS
CPU1: Intel Pentium 75 - 200 stepping 05
Total of two processors activated (359.21 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
..TIMER: vector=0x31 pin1=-1 pin2 = -1
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... works.
testing the IO APIC.......................
.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 90.0021 MHz.
..... host bus clock speed is 60.0012 MHz.
cpu: 0, clocks: 600012, slice: 200004
CPU0<T0:600000,T1:399984,D:12,S:200004,C:600012>
cpu: 1, clocks: 600012, slice: 200004
CPU1<T0:600000,T1:199984,D:8,S:200004,C:600012>
checking TSC synchronization across CPUs:
BIOS BUG: CPU#0 improperly initialized, has 1689876326 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -1689876326 usecs TSC skew! FIXED.
PCI: PCI BIOS revision 2.00 entry at 0xfcad0, last bus=0
PCI: Using configuration type 2
PCI: Probing PCI hardware
Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0116d67
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0116d67>] Not tainted
EFLAGS: 00010246
EIP is at IO_APIC_get_PCI_irq_vector [kernel] 0x17 (2.4.18-0.21smp)
eax: 00000000 ebx: c3f86000 ecx: 00000000 edx: 00000000
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 1, stackpage=c3f8d000)
Stack: 00000001 ffffffff c3f86000 00000010 00000001 c3f8dfbb c031d89d 00000000
00000004 00000000 0008e000 c3f8c000 c0317fbc c0105000 0008e000 c031d514
c01c20b6 c3f8c000 c031876b c0105078 00010f00 c0317fbc c0105000 0008e000
Call Trace:
[<c0105000>] stext [kernel] 0x6
[<c01c20b6>] pci_init [kernel] 0x6
[<c0105078>] init [kernel] 0x28
[<c0105000>] stext [kernel] 0x0
[<c01072a6>] kernel_thread [kernel] 0x26
[<c0105050>] init [kernel] 0x0
Code: 83 3c 90 ff 75 23 52 68 60 e6 24 c0 e8 78 5f 00 00 8b 44 24
<0>Kernel panic: Attempted to kill init!

I think I found the cause of this; a fix based on my assumption is in version 2.4.18-0.23, which ought to appear in rawhide soon.

I waited for the -23 kernel in Rawhide, but Valhalla arrived first. I just finished a fresh 7.3 installation and got a similar kernel panic with the 2.4.18-3smp kernel. It was having a lot of trouble with the ncr53c8xx driver before finally bombing out with the following lines:

.../scrolled off top of 80x25 screen/...
EIP is at mega_busyWaitMbox [megaraid] 0x10 (2.4.18-3smp)
eax: c3c60084 ebx: 00000000 ecx: 00000007 edx: c3fa5f00
esi: 00000000 edi: c3c60084 ebp: 0000000f esp: c3fa5e0c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c3fa5000)
Stack: c3c7ef60 00000000 c4844010 c3c60084 00000000 c3c60084 c3fa5e40 00000000
00000001 00000001 c3fa4000 c3fa4000 c035a640 c3fa5e5c c0118d66 c035a640
00000001 00000000 00000001 00000000 0000000b c0125165 00000000 00000001
Call Trace:
[<c4844010>] megaraid_isr [megaraid] 0x50
[<c0118d66>] scheduler_tick [kernel] 0x96
[<c0125165>] update_process_times [kernel] 0x25
[<c0125165>] ncr53c8xx_intr [ncr53c8xx] 0x2e
[<c4838afe>] handle_IRQ_event [kernel] 0x5e
[<c010a5ee>] do_IRQ [kernel] 0xb5
[<c010a805>] handle_IRQ_event [kernel] 0x50
[<c010a5e0>] do_IRQ [kernel] 0xb5
[<c010a805>] schedule [kernel] 0x371
[<c0119331>] cpu_idle [kernel] 0x25
[<c0106f05>] call_console_drivers [kernel] 0xea
Code: 80 7e 0f 00 75 0a 31 c0 eb 1d 8d b6 00 00 00 00 68 58 8d 06
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
_ [<c011caba>]

I get a similar error when booting Valhalla Errata kernel 2.4.18-4smp. Need to update the Product / Version from Red Hat Public Beta / Skipjack-beta2 to Red Hat Linux / Valhalla.

Does adding "noapic" to the kernel command line help?

Yes. All the confusion and 53C8xx SCSI resets disappear when 'noapic' is appended to the kernel command line. It now boots into 2.4.18-4smp as though nothing were ever wrong. As a dual Pentium system, I presume it has an APIC. Can you shed any light on what 'noapic' does?

Update: This looks very much like bug 53946 (7.1), to which I appended a stack frame that is very similar to the one above. In my case (a Digital Celebris with dual classic 166 MHz Pentiums), even the UP kernel was painfully slow and prone to spontaneous failures. The SMP kernel would load up at what appeared to be 300 baud, and then hang just before "entering runlevel...." I will try the fixes recommended here and report results.

Um... let me clarify: I am talking about my experiences with the kernel in 7.3 here. It was awful right out of the box, so I managed a full soup-to-nuts up2date, which made it even worse. I then tried 7.1 (just for grins and giggles) and it was lovely. When I updated to the latest 7.1 kernel build, I got a failure similar to the one shown here.

Just updated to 2.4.18-5smp, and the problem persists if I do not append 'noapic' to the GRUB command line. With 'noapic' it boots up perfectly without getting bogged down with failure reports about resetting the on-board SCSI controller. It's just a hunch, but I'm guessing the problem lies in the on-board SCSI driver code.

"noapic" seems to do the trick. My Digital Celebris (SMP) box with two classic Pentiums in it boots like a champ with "noapic" as a kernel param from a clean install. Can anyone tell me what the parameter "noapic" actually does? Is there a fix for this in the works?

Jim

When the APIC code is turned on, the kernel needs valid interrupt routing information from the BIOS. MP tables on older hardware with problems are frequently incorrect. BIOS updates are necessary to fix these problems. Specifying NOAPIC tells the kernel to leave the I/O APICs as programmed by the BIOS. You end up with more shared interrupts, but a working system...

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/
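
For anyone applying the 'noapic' workaround discussed in this report, here is a minimal sketch of making it permanent rather than typing it at every boot. It assumes a GRUB legacy style configuration in which each boot entry has a line beginning with the keyword "kernel". The function name append_kernel_param is hypothetical and not part of any Red Hat tool, and the function only transforms text; writing the result back to the real configuration file is left to the reader.

```python
def append_kernel_param(grub_conf_text: str, param: str = "noapic") -> str:
    """Return config text with `param` appended to each kernel line lacking it.

    Assumption: GRUB legacy syntax, where a boot entry's kernel command line
    is a (possibly indented) line whose first word is "kernel".
    """
    out = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        # Only touch kernel lines; skip comments, title, root, initrd, etc.
        if words[:1] == ["kernel"] and param not in words:
            line = line.rstrip() + " " + param
        out.append(line)
    result = "\n".join(out)
    # Preserve the trailing newline if the input had one.
    if grub_conf_text.endswith("\n"):
        result += "\n"
    return result
```

Running the function twice on the same text leaves it unchanged, so it is safe to reapply after a kernel update adds a new boot entry. Keep a backup of the original file and an entry without the change until the system is confirmed to boot.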