
Bug 63296

Summary:	Kernel oops 0000 during SMP boot on Dell dual Pentium/90 server
Product:	[Retired] Red Hat Linux	Reporter:	Robert G. 'Doc' Savage <dsavage>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.3	CC:	jimrh
Target Milestone:	---
Target Release:	---
Hardware:	i586
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-09-30 15:39:30 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	61901, 67218, 79579, 100644

Description Robert G. 'Doc' Savage 2002-04-12 04:51:29 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020328

Description of problem:
During the initial reboot after a Skipjack2 installation, the SMP kernel oopses
with (a) a kernel NULL pointer dereference and (b) an unhandled kernel paging request.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Boot to default SMP kernel after initial installation.
2. Kernel oops during initial text portion of boot-up.
3. Hardware reset required.
	

Actual Results:  Single CPU kernel boots correctly to logon prompt.

Additional info:

Text messages during SMP boot:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0118607
*pde = 00000000
Oops: 0000
Unable to handle kernel paging request at virtual address c6578237
 printing eip:
c0130d36
*pde = 00000000

I have a detailed installation log available upon request; nothing remarkable.

Comment 1 Michael K. Johnson 2002-04-12 19:57:51 UTC

Can you try the 2.4.18-0.21 (or later) kernel from rawhide
ftp://ftp.redhat.com/pub/redhat/linux/rawhide/i386/RedHat/RPMS
and see if it fixes the problem?

Comment 2 Matt Domsch 2002-04-12 20:03:24 UTC

FYI, Dell doesn't test machines built circa 1994 with new Red Hat releases.

Comment 3 Arjan van de Ven 2002-04-12 21:18:10 UTC

Any idea approximately where this happens during boot?

Comment 4 Arjan van de Ven 2002-04-12 21:19:10 UTC

Also, a bit more of the backtrace text (with 2.4.18-0.21 or so; -0.12 had a
bug there) would be very welcome.

Comment 5 Robert G. 'Doc' Savage 2002-04-12 22:37:16 UTC

Just downloaded kernel-smp-2.4.18-0.21 and installed it. No joy. This time I get
a kernel panic rather than an oops.

Here is a longer trace listing for 0.13, followed by the one for 0.21. Please
forgive any transcription typos. Note that each trace differs a bit from the other:

kernel-smp-2.4.18-0.13
======================
Calibrating delay loop... 178.99 BogoMIPS
Memory: 60912k/65536k available (1932k kernel code, 4240k reserved, 352k data,
304k init, 0k himem)
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 1024 (order: 1, 8192 bytes)
Buffer cache hash table entries: 4096 (order: 2, 16384 bytes)
Page-cache hash table entries: 16384 (order: 4, 65536 bytes)
Intel Pentium with F0 0F bug - workaround enabled.
POSIX conformance testing by UNIFIX
mtrr: v1.40 (20010327) Richard Gooch (rgooch.au)
mtrr: detected mtrr type: none
CPU0: Intel Pentium 75 - 200 stepping 05
per-CPU timeslice cutoff: 158.37 usecs.
task migration cache decay timeout: 10 msecs.
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop... 179.81 BogoMIPS
CPU1: Intel Pentium 75 - 200 stepping 05
Total of 2 processors activated (358.80 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
..TIMER: vector=0x31 pin1=-1 pin2=-1
...trying to set up timer (IRQ0) through the 8259A ...  failed.
...trying to set up timer as Virtual Wire IRQ... works.
testing the IO APIC.......................

.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 89.9966 MHz.
..... host bus clock speed is 59.9974 MHz.
cpu: 0, clocks: 599974, slice: 199991
CPU0<T0:599968,T1:399952,D:12,S:199991,C:599974>
cpu: 1, clocks: 599974, slice: 199991
CPU1<T0:599968,T1:199984,D:2,S:199991,C:599974>
checking TSC synchronization across CPUs:
BIOS BUG: CPU#0 improperly initialized, has -1678593 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has 1678593 usecs TSC skew! FIXED.
PCI: PCI BIOS revision 2.00 entry at 0xfcad0, last bus=0
PCI: Using configuration type 2
PCI: Probing PCI hardware
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0118607
*pde = 00000000
Oops: 0000
Unable to handle kernel paging request at virtual address c6578237
 printing eip:
c0130d36
*pde = 00000000

kernel-smp-2.4.18-0.21
======================
Booting processor 1/1 eip 2000
Initializing CPU #1
masked ExtINT on CPU#1
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Calibrating delay loop ... 179.81 BogoMIPS
CPU1: Intel Pentium 75 - 200 stepping 05
Total of two processors activated (359.21 BogoMIPS).
ENABLING IO-APIC IRQs
Setting 2 in the phys_id_present_map
...changing IO-APIC physical APIC ID to 2 ... ok.
..TIMER: vector=0x31 pin1=-1 pin2=-1
...trying to set up timer (IRQ0) through the 8259A ...  failed.
...trying to set up timer as Virtual Wire IRQ... works.
testing the IO APIC.......................

.................................... done.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 90.0021 MHz.
..... host bus clock speed is 60.0012 MHz.
cpu: 0, clocks: 600012, slice: 200004
CPU0<T0:600000,T1:399984,D:12,S:200004,C:600012>
cpu: 1, clocks: 600012, slice: 200004
CPU1<T0:600000,T1:199984,D:8,S:200004,C:600012>
checking TSC synchronization across CPUs:
BIOS BUG: CPU#0 improperly initialized, has 1689876326 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has -1689876326 usecs TSC skew! FIXED.
PCI: PCI BIOS revision 2.00 entry at 0xfcad0, last bus=0
PCI: Using configuration type 2
PCI: Probing PCI hardware
Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c0116d67
*pde = 00000000
Oops: 0000

CPU:    0
EIP:    0010:[<c0116d67>]    Not tainted
EFLAGS: 00010246

EIP is at IO_APIC_get_PCI_irq_vector [kernel] 0x17 (2.4.18-0.21smp)
eax: 00000000   ebx: c3f86000   ecx: 00000000   edx: 00000000
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 1, stackpage=c3f8d000)
Stack: 00000001 ffffffff c3f86000 00000010 00000001 c3f8dfbb c031d89d 00000000
       00000004 00000000 0008e000 c3f8c000 c0317fbc c0105000 0008e000 c031d514
       c01c20b6 c3f8c000 c031876b c0105078 00010f00 c0317fbc c0105000 0008e000
Call Trace: [<c0105000>] stext [kernel] 0x6
[<c01c20b6>] pci_init [kernel] 0x6
[<c0105078>] init [kernel] 0x28
[<c0105000>] stext [kernel] 0x0
[<c01072a6>] kernel_thread [kernel] 0x26
[<c0105050>] init [kernel] 0x0

Code: 83 3c 90 ff 75 23 52 68 60 e6 24 c0 e8 78 5f 00 00 8b 44 24
 <0>Kernel panic: Attempted to kill init!
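
The register dump and the Code: bytes in the 0.21 trace are enough to see why the fault address is 00000000. The following is a rough Python sketch, not part of the original report. It assumes the Code: bytes start at the faulting EIP, as 2.4-era oopses print them. It decodes only the first instruction (opcode 0x83 with a ModRM and SIB byte), and the helper name decode_group1_sib is made up for illustration.

```python
# Decode the first instruction of the "Code:" line from the 2.4.18-0.21smp oops.
# Assumption: the byte dump begins at the faulting EIP (2.4-style oops output).
code = bytes.fromhex("83 3c 90 ff")  # opcode, ModRM, SIB, imm8

# Register values copied from the oops above.
regs = {"eax": 0x00000000, "ebx": 0xC3F86000, "ecx": 0x00000000, "edx": 0x00000000}

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_group1_sib(insn):
    """Decode '83 /r ib' with a SIB memory operand and no displacement."""
    opcode, modrm, sib, imm8 = insn[0], insn[1], insn[2], insn[3]
    assert opcode == 0x83, "only the 0x83 group-1 opcode is handled"
    mod, ext, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0 and rm == 4, "expects a SIB operand without displacement"
    scale = 1 << (sib >> 6)
    index_bits, base_bits = (sib >> 3) & 7, sib & 7
    assert index_bits != 4 and base_bits != 5, "special SIB encodings not handled"
    imm = imm8 - 256 if imm8 >= 128 else imm8  # sign-extend the 8-bit immediate
    return GROUP1[ext], REG32[base_bits], REG32[index_bits], scale, imm


op, base, index, scale, imm = decode_group1_sib(code)
addr = (regs[base] + regs[index] * scale) & 0xFFFFFFFF
print(f"{op}l ${imm}, (%{base},%{index},{scale})  -> reads address {addr:#010x}")
```

With eax and edx both zero, the compare reads from address 0, which matches the reported NULL pointer dereference. That would fit a table lookup through a base pointer that was never set up, but this reading is an inference from the dump, not something confirmed in this thread.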

Comment 6 Arjan van de Ven 2002-04-13 08:48:26 UTC

I think I found the cause of this; a fix based on my assumption is in version
2.4.18-0.23, which ought to appear in rawhide soon.

Comment 7 Robert G. 'Doc' Savage 2002-05-07 04:12:27 UTC

I waited for -23 kernel in Rawhide, but Valhalla arrived first. I just finished
a fresh 7.3 installation and got a similar kernel panic with the 2.4.18-3smp
kernel. It was having a lot of trouble with the ncr53c8xx driver before finally
bombing out with the following lines:

.../scrolled off top of 80x25 screen/...
EIP is at mega_busyWaitMbox [megaraid] 0x10 (2.4.18-3smp)
eax: c3c60084   ebx: 00000000   ecx: 00000007   edx: c3fa5f00
esi: 00000000   edi: c3c60084   ebp: 0000000f   esp: c3fa5e0c
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, stackpage=c3fa5000)
Stack: c3c7ef60 00000000 c4844010 c3c60084 00000000 c3c60084 c3fa5e40 00000000
       00000001 00000001 c3fa4000 c3fa4000 c035a640 c3fa5e5c c0118d66 c035a640
       00000001 00000000 00000001 00000000 0000000b c0125165 00000000 00000001
Call Trace: [<c4844010>] megaraid_isr [megaraid] 0x50
[<c0118d66>] scheduler_tick [kernel] 0x96
[<c0125165>] update_process_times [kernel] 0x25
[<c0125165>] ncr53c8xx_intr [ncr53c8xx] 0x2e
[<c4838afe>] handle_IRQ_event [kernel] 0x5e
[<c010a5ee>] do_IRQ [kernel] 0xb5
[<c010a805>] handle_IRQ_event [kernel] 0x50
[<c010a5e0>] do_IRQ [kernel] 0xb5
[<c010a805>] schedule [kernel] 0x371
[<c0119331>] cpu_idle [kernel] 0x25
[<c0106f05>] call_console_drivers [kernel] 0xea


Code: 80 7e 0f 00 75 0a 31 c0 eb 1d 8d b6 00 00 00 00 68 58 8d 06
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
 _


[<c011caba>]

Comment 8 Robert G. 'Doc' Savage 2002-05-20 01:43:25 UTC

I get a similar error when booting Valhalla Errata kernel 2.4.18-4smp.

Need to update the Product / Version from Red Hat Public Beta / Skipjack-beta2
to Red Hat Linux / Valhalla.

Comment 9 Robert G. 'Doc' Savage 2002-05-20 01:44:27 UTC

I get a similar error when booting Valhalla Errata kernel 2.4.18-4smp.

Need to update the Product / Version from Red Hat Public Beta / Skipjack-beta2
to Red Hat Linux / Valhalla.

Comment 10 Arjan van de Ven 2002-05-28 14:34:19 UTC

Does adding "noapic" to the kernel command line help?

Comment 11 Robert G. 'Doc' Savage 2002-05-28 18:01:28 UTC

Yes. All the confusion and 53C8xx SCSI resets disappear when 'noapic' is
appended to the kernel command line. It now boots into 2.4.18-4smp as though
nothing were ever wrong.

As a dual Pentium system, I presume it has an APIC. Can you shed any light on
what 'noapic' does?

Comment 12 Jim Harris 2002-06-19 04:28:01 UTC

Update:

This looks very much like bug 53946 (7.1), to which I appended a stack frame
that is very similar to the one above.

In my case (a Digital Celebris with dual classic 166 MHz Pentiums), even the UP
kernel was painfully slow and prone to spontaneous failures.  The SMP kernel
would load up at what appeared to be 300 baud, and then hang just
before "entering runlevel...."

I will try the fixes recommended here and report results.

Comment 13 Jim Harris 2002-06-22 01:45:06 UTC

Um... let me clarify:

I am talking about my experiences with the kernel in 7.3 here.  It was awful
right out of the box, so I managed a full soup-to-nuts up2date, which made it
even worse.  I then tried 7.1 (just for grins and giggles) and it was lovely.
When I updated to the latest 7.1 kernel build, I got a failure similar to the
one shown here.

Comment 14 Robert G. 'Doc' Savage 2002-06-23 18:26:35 UTC

Just updated to 2.4.18-5smp, and the problem persists if I do not append
'noapic' to the GRUB command line. With 'noapic' it boots up perfectly without
getting bogged down with failure reports about resetting the on-board SCSI
controller. It's just a hunch, but I'm guessing the problem lies in the on-board
SCSI driver code.

Comment 15 Jim Harris 2002-07-20 16:11:23 UTC

"noapic" seems to do the trick.

My Digital Celebris (SMP) box with two classic Pentiums in it boots like a champ
from a clean install with "noapic" as a kernel param.

Can anyone tell me what the parameter "noapic" actually does?

Is there a fix for this in the works?

Jim

Comment 16 Philip Pokorny 2002-09-27 01:54:23 UTC

When the APIC code is turned on, the kernel needs valid interrupt routing
information from the BIOS.  The MP tables on older hardware are frequently
incorrect, and a BIOS update is necessary to fix them.

Specifying 'noapic' tells the kernel to leave the I/O APICs as programmed by the
BIOS.  You end up with more shared interrupts, but a working system...
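
For anyone applying the workaround discussed in this bug, the change is a single extra word on the kernel line of the boot loader configuration. The following is a minimal Python sketch, assuming a GRUB legacy grub.conf. The sample stanza, the (hd0,0) and /dev/sda1 device names, and the helper add_kernel_param are hypothetical and are not taken from the reporter's system.

```python
# Hypothetical grub.conf stanza; adjust the devices and paths for the real system.
SAMPLE_GRUB_CONF = """\
title Red Hat Linux (2.4.18-5smp)
        root (hd0,0)
        kernel /vmlinuz-2.4.18-5smp ro root=/dev/sda1
        initrd /initrd-2.4.18-5smp.img
"""


def add_kernel_param(conf_text, param="noapic"):
    """Append param to every 'kernel' line that does not already carry it."""
    out = []
    for line in conf_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel" and param not in words[1:]:
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


print(add_kernel_param(SAMPLE_GRUB_CONF))
```

Running the helper a second time leaves the text unchanged, so it is safe to re-apply after a kernel update adds a new stanza. On a real machine, review the edited file by hand before rebooting.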

Comment 17 Bugzilla owner 2004-09-30 15:39:30 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/