
Bug 182618

Summary: irqbalance makes K8T800Pro system with Athlon64X2 unstable
Product: [Fedora] Fedora
Component: irqbalance
Version: 5
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: Alexandre Oliva <oliva>
Assignee: Neil Horman <nhorman>
CC: davej, marko.macek, peterd, redhat, rhbz, wtogami
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2007-08-08 18:45:32 UTC
Bug Depends On: 181310, 181920    
Bug Blocks: 182617    

Description Alexandre Oliva 2006-02-23 18:08:27 UTC
+++ This bug was initially created as a clone of Bug #182617 +++

Description of problem:
Evidence is mounting that irqbalance is what is causing me headaches, leading
to numerous different kinds of failures (bug 181347, bug 181920, bug 181310).
The box would display any of the symptoms of these bugs within hours of booting
up.  Ever since I ran `service irqbalance stop`, the box has been rock solid.  I
didn't find it frozen, as it always had been, when I got up this, erhm, morning
:-), which is a good sign, and it has been under heavy load since then, without
any casualties so far.

This is unlikely to be a bug in irqbalance per se, but rather a kernel bug, so
this kernel bug report blocks the irqbalance one.

Version-Release number of selected component (if applicable):
kernel-2.6.15-1.1975_FC5.x86_64
irqbalance-1.12-1.24

How reproducible:
Every time: it has never failed to show up after leaving the several boxes with
similar configuration on overnight.

Steps to Reproduce:
1. Boot the system up
2. Leave it up overnight
  
Actual results:
You'll find that networking died, or that the SATA subsystem is dead, or that
the mouse is jerky, or God knows what else.

Expected results:
No such undesirable surprises.

Additional info:
Hardware is Athlon64X2 3800+, Asus A8V Deluxe, A4Tech USB mouse, 2 SATA disks
connected to the sata_promise controller built into the MoBo.

$ cat /proc/interrupts
           CPU0       CPU1
  0:      87445    9375771    IO-APIC-edge  timer
  1:      24239          0    IO-APIC-edge  i8042
  7:          0          0    IO-APIC-edge  parport0
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 15:     112601        552    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  libata
 17:      25254    2148074   IO-APIC-level  libata
 18:    3274119      36277   IO-APIC-level  skge
 19:          0          0   IO-APIC-level  VIA8237
 20:          7    1660312   IO-APIC-level  ohci1394
 21:         48      67448   IO-APIC-level  ehci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, uhci_hcd:usb5
NMI:       2744       4102
LOC:    9463891    9463534
ERR:          0
MIS:          0

I still haven't determined what happens if I never run irqbalance after boot up;
so far all I've tested is irqbalance running for some time, and then stopped, so
that every IRQ is assigned to a single CPU, and that appears to make the system
stable.
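
To compare how interrupts are spread across CPUs with and without irqbalance running, a listing like the one above can be summarized per IRQ. The following is only a minimal sketch: the helper names (parse_interrupts, dominant_cpu) are hypothetical, the sample is a subset of the rows quoted in this report, and it assumes the simple two-CPU layout shown here.

```python
# Sample copied from the /proc/interrupts listing in this report (subset of rows).
SAMPLE = """\
           CPU0       CPU1
  0:      87445    9375771    IO-APIC-edge  timer
 16:          0          0   IO-APIC-level  libata
 17:      25254    2148074   IO-APIC-level  libata
 18:    3274119      36277   IO-APIC-level  skge
 20:          7    1660312   IO-APIC-level  ohci1394
ERR:          0
"""


def parse_interrupts(text):
    """Return (cpu_names, {irq_label: (per_cpu_counts, description)})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    cpus = lines[0].split()  # header row, e.g. ["CPU0", "CPU1"]
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        head = fields[:len(cpus)]
        # Rows such as ERR/MIS carry a single total rather than one
        # count per CPU, so they are skipped.
        if len(head) != len(cpus) or not all(f.isdigit() for f in head):
            continue
        counts = [int(f) for f in head]
        table[label.strip()] = (counts, " ".join(fields[len(cpus):]))
    return cpus, table


def dominant_cpu(counts):
    """Return (cpu_index, share_of_total), or None if no interrupts fired."""
    total = sum(counts)
    if total == 0:
        return None
    idx = max(range(len(counts)), key=counts.__getitem__)
    return idx, counts[idx] / total


cpus, table = parse_interrupts(SAMPLE)
for irq, (counts, desc) in table.items():
    top = dominant_cpu(counts)
    if top is None:
        print(f"IRQ {irq} ({desc}): no interrupts yet")
    else:
        print(f"IRQ {irq} ({desc}): {top[1]:.1%} handled by {cpus[top[0]]}")
```

To run it against a live system, replace SAMPLE with the contents of /proc/interrupts. The counters are cumulative since boot, so two snapshots taken some time apart say more about the current distribution than a single one.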

Comment 1 John W. Linville 2006-03-01 15:46:44 UTC
*** Bug 181347 has been marked as a duplicate of this bug. ***

Comment 2 Marko Macek 2006-04-30 15:27:02 UTC
Same problem here, on 32-bit kernel.

I built myself a stock kernel after having problems; my boot entry is currently:

title Fedora Core (2.6.16.11)
        root (hd0,0)
        kernel /vmlinuz-2.6.16.11 ro root=LABEL=/ rhgb quiet report_lost_ticks=1 notsc clock=pmtmr console=ttyS0,115200n8 noapic
        initrd /initrd-2.6.16.11.img

I added 'noapic' today and disabled 'irqbalance'. We'll see how things go.
If it's ok after a few days, I'll remove the 'noapic'.
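
When toggling boot parameters like this between test runs, it can help to confirm what the running kernel was actually booted with. The following is a minimal sketch, not a definitive tool: parse_cmdline is a hypothetical helper, the sample string is the parameter list from the boot entry above, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        # Split at the first "=" only, so "root=LABEL=/" keeps "LABEL=/"
        # as its value. Bare flags such as "noapic" map to None.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Parameters taken from the boot entry quoted in this comment.
SAMPLE_CMDLINE = (
    "ro root=LABEL=/ rhgb quiet report_lost_ticks=1 "
    "notsc clock=pmtmr console=ttyS0,115200n8 noapic"
)

params = parse_cmdline(SAMPLE_CMDLINE)
print("noapic set:", "noapic" in params)
print("clock source:", params.get("clock"))
```

On a running system the same function can be fed the contents of /proc/cmdline instead of the sample string.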

It usually fails during heavy network activity, or randomly while I'm away.
I use an offboard 3c59x NIC, because my onboard one died.

Comment 3 Th0ma7 2006-05-16 11:43:38 UTC
I have the exact same problem...

See http://lkml.org/lkml/2006/5/16/67 for more info!

- vin

Comment 4 Richard Ziegler 2006-09-13 05:25:16 UTC
Same(?) problem, different results.  Disabling irqbalance did not work for me.

Asus P5N32-SLI SE Deluxe motherboard, with Core 2 Duo processor, running Kernel
2.6.7.1-2187_FC5
notable drivers:
sky2
sata_sil24
sata_nv

I am being bitten regularly by the SATA problems described in bug 181310.
After disabling irqbalance and running bittorrent for many hours, I got my first
occurrence of the network problem described in bug 181347.  That was with kernel
2.6.17.-1_2174_FC5

I have also experienced the jerky mouse movement, but that was with FC6T2, and
only when my mouse was connected through a hub (the one in a Dell 2407WFP).  The
mouse would start smooth, but after a while become jerky.  Motion would be smooth
again if I plugged it directly into a USB port on the computer.

I have since removed FC6T2, because I hadn't yet found all this other bug
history.  Using a non-beta OS was also important to me because all the hardware
was (is) brand new.  I don't even know if I have a bad motherboard or not.  My
symptoms are almost exactly like what is described by Alexandre, so I'm assuming
the motherboard is good, and the kernel is bad.  But due to the inactivity on
this and the other bugs, I wish it were the other way round!



Comment 5 Richard Ziegler 2006-09-13 05:34:34 UTC
Small correction -

- 2.6.7.1-2187_FC5
+ 2.6.17.1-2187_FC5

And I'm running the 64 bit kernels.

Let me know if I can be of any assistance in testing fixes for these issues.

Comment 6 Dave Jones 2006-10-16 19:27:56 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 7 Matt Olson 2006-10-16 20:37:37 UTC
For me, this is the same problem I was having on this bug:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=166437

The important info is that this problem surfaced after
kernel-smp-2.6.13-1.1532_FC4 (i.e. with kernels >= 2.6.14).  I tried disabling
irqbalance and am running 2.6.17-1.2187_FC5 (x86_64).  It locked up after 1 hour
of heavy CPU load.  The MB is an ABIT AV8 K8T800 Pro (VIA); CPU Athlon64X2 4400+;
4GB mem.

Maybe I'm not seeing the same problem.  All I get are lockups, after anywhere
from 1 hour to as long as 2 days.  High load seems to aggravate the problem.

I'll re-test with 2.6.18, when the 64 bit package is released (not seeing it in
updates yet).  

Comment 8 Matt Olson 2006-10-21 03:11:10 UTC
I found a way to get the machine to lock up on cue, so I am able to do more
rapid testing . . .

My problem turned out to be an old ('95) Intel EE Pro 100 PCI NIC.  I
pulled it and the machine has been running like a champ for almost a day.  I
stand by my assertion that 2.6.13 was stable even with this old NIC installed.
I wasn't able to get any info from the NMI watchdog.  IOMMU maybe?

2.6.18 (2200) running fine.

Comment 9 Richard Ziegler 2006-10-23 13:38:54 UTC
2.6.18 (2200) NOT running fine for me.

I have reproduced the error twice.  The message seems to have changed since the
last kernel though.  But it still is a timeout.

ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)

ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata5.00: revalidation failed (errno=-5)
ata5: failed to recover some devices, retrying in 5 secs
ata5.00: qc timeout (cmd 0xec)

...

Both times were after I had closed a tvtime window (hardware is a bt848 based
wintv card circa 1996).  I think this may point to irq mismanagement, as another
person commented in this collection of related bugs - bug seems to crop up after
a change to the load on the system.

One big difference this time is that the timeouts did not repeat forever.  The
system seemed to recover after a few timeout errors.  However my raid array was
degraded in the process.  sdd was dropped from the two drive raid-1 array.

Comment 10 Dave Jones 2006-11-20 19:17:40 UTC
If you added a comment above of the form "I disabled irqbalance and my problem
still happened" then it's unlikely to be related to this bug, and you should
open a separate one.

I'm reassigning this to irqbalance in the hope that Neil has some ideas what
could be going wrong in Alexandre's case.


Comment 11 Peter Dawes 2006-11-20 19:29:41 UTC
Confirming this is still a problem with FC6 (uname -r gives 2.6.18-1.2849.fc6).
K8T800Pro, Athlon64 4400+.  I get the problem where the network interface (a
Marvell 88e8001 controller) stops responding until I unload and reload the
module.  Disabling the irqbalance service resolves the problem.

Comment 12 Neil Horman 2006-11-20 20:03:13 UTC
Alexandre and I have been down this road before.  I am completely unable to
reproduce this error here on any of my systems, and thus far, the only
similarity I can find between the systems that report what appears to be
the same problem is that they all contain a variant of the Asus A8 motherboard.
I'm not really sure what to do with this.  My reading has indicated that people
with this motherboard have had more success by disabling on-board video and
using a separate video card.

Comment 13 Peter Dawes 2006-11-22 15:28:43 UTC
I am using an ASUS A8V Deluxe board, so that part sort of jibes.  I'm using a
separate video card though (an Nvidia 6800GT); the A8V Deluxe doesn't have
on-board video.