Bug 182617
| Summary: | irqbalance makes SMP system unstable | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Alexandre Oliva <oliva> |
| Component: | kernel | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | rawhide | CC: | rhbz |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-08-08 18:44:17 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 182618 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description (Alexandre Oliva, 2006-02-23 18:05:17 UTC)
Hmmm, I'd certainly agree with you that this likely isn't a bug in irqbalance specifically. It is certainly possible, however, that migration of IRQs might be causing a problem, although ideally, once irqbalance distributes IRQs, it tries rather hard not to move them around again. I'd be interested to see a copy of /proc/interrupts with irqbalance running on your system for a few hours, and /proc/interrupts without irqbalance running for a few hours, to compare and see whether irqbalance is actually migrating any interrupts more than it should. Also, since I've not heard of this happening on many systems (taking all your referenced BZs in aggregate), I wonder whether you have a system-specific problem (perhaps a quirk with the VIA chipset in your box, or an ACPI error of some sort). Can you try booting with acpi=off (I think that's the right syntax) and see if you get the same effects? Thanks!

It's 4 different boxes experiencing the problem, so it's unlikely to be something specific to this one I have at home (the other 3 are at the uni). I've also found reports of skge problems on the net, so there is something to it. As for ACPI, I didn't think acpi=off was supported at all on x86_64, but I can try that on my next reboot. I'll also try to get you /proc/interrupts with irqbalance running, although the sort of workload the box experiences varies widely depending on the time of day. I'm also thinking of trying a 32-bit OS on it just to determine whether the problem is 64-bit specific.

Anyhow, the more I think about it, the more it makes sense: the box would often freeze when I switched from one major activity to another. E.g., it wouldn't crash half-way through a big build, but it would often crash at the beginning or at the end, generally logging a CRC error. I'd often have network problems right after connecting to the box over VNC from another box, or right after disconnecting.
Putting this all together made me wonder if CPU affinity could solve the problem, and so I got to irqbalance.

Again, I don't disagree that migrating IRQs may have a problem, but it's not going to be the irqbalance daemon that's causing it. There may be a problem with migrating IRQs between CPUs which is causing panics/deadlocks/etc., but that's going to be a kernel problem. I can certainly help you fix that, but I'm going to need more to go on. If you can provide some of the panic backtraces (I checked the other bugs you reported and there doesn't seem to be any panic/backtrace info in any of them), that would be helpful. It would also be helpful (for the purposes of my debugging any potential problem in irqbalance) to see those /proc/interrupts before/after snapshots.

Created attachment 125153 [details]
/proc/interrupts snapshots
Here are some /proc/interrupts dumps. As soon as I got the gdm login prompt, I
switched to VT1 and, as root, dumped /proc/interrupts to a file, and then
scripted an automated
sleep-for-one-minute-then-append-the-date-and-the-contents-of-/proc/interrupts
to the same file, and left it running for a few minutes. Clearly, interrupts
are dancing back and forth between the two processors...
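The one-minute sampling script described above can be sketched as a small shell function. This is an illustrative reconstruction, not the reporter's actual script; the function name and parameters are mine:

```shell
# snapshot_interrupts SRC LOG COUNT INTERVAL
# Append a timestamp and the current contents of SRC to LOG,
# COUNT times, sleeping INTERVAL seconds between samples.
snapshot_interrupts() {
  src=$1 log=$2 count=$3 interval=$4
  i=0
  while [ "$i" -lt "$count" ]; do
    { date; cat "$src"; echo; } >> "$log"
    i=$(( i + 1 ))
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `snapshot_interrupts /proc/interrupts /root/irq-samples.log 60 60` would sample /proc/interrupts once a minute for an hour, producing the kind of log attached to this bug.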
I tried disabling acpi (you meant acpi, not apic, right?), and that had the
unfortunate side effect of disabling Cool&Quiet, so cpuspeed wouldn't work and
I figured I didn't want to leave the machine running like that for very long.
As for panics, I don't ever get any, which is why the bug reports do not
contain them :-) This is what makes this bug particularly tricky to debug, I
guess. irqbalance was a shot in the dark, and I'm happy it hit something. In
case you suspect cpuspeed, that's not it. Before I updated the BIOS to enable
Cool&Quiet with a dual-core processor, I'd already got the very same kind of
problem.
Ok, that's the /proc/interrupts file with irqbalance turned on, I assume. What about with irqbalance off? The fact that you are getting any given interrupt on multiple CPUs means either that IRQs are being migrated at the same time that a steady stream of interrupts is arriving, or that irqbalance has decided that a given interrupt occurs at a low enough frequency that it can be masked to a subset of, or all of, the CPUs in the system. Getting /proc/interrupts with irqbalance off will help me compare that. In fact, if you could provide a sysreport as well, so I could check on the state of the rest of your system without having to ask you for things bit by bit, that would be very helpful. Regarding the lack of panics, I assume that you have tried to establish a serial console, or tried to capture a vmcore via netdump or diskdump? If not, that's a road we should explore. How about sysrqs? Is the system responsive to a sysrq key sequence after an error occurs? If so, gathering a sysrq-t and sysrq-m would be helpful.

Without irqbalance having ever run since boot-up:

$ cat /proc/interrupts
           CPU0       CPU1
  0:   14167756          0    IO-APIC-edge  timer
  1:      42507          0    IO-APIC-edge  i8042
  7:          0          0    IO-APIC-edge  parport0
  8:          0          0    IO-APIC-edge  rtc
  9:          0          0   IO-APIC-level  acpi
 15:     169347          0    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  libata
 17:    1862640          0   IO-APIC-level  libata
 18:    7518595          0   IO-APIC-level  skge
 19:          6          0   IO-APIC-level  ohci1394
 20:     143902          0   IO-APIC-level  uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, uhci_hcd:usb4, ehci_hcd:usb5
 21:          0          0   IO-APIC-level  VIA8237
NMI:       5951       5226
LOC:   14168372   14168692
ERR:          0
MIS:          0

No serial console here, and not really necessary, since the console is still usable. In the case of the disk subsystem failure it's trickier, because I have to have everything I need already in memory, and I don't get a permanent record unless I set up some external disk to collect a copy of /var/log/messages; but for networking or mouse failures the system is still usable and perfectly recoverable as long as I'm physically in front of it, which is not that uncommon given that this is my primary desktop (which is what makes this box not a very good choice of a system on which to run random testing configurations ;-)

I haven't set up netdump or diskdump, mainly because I don't know how to do that, and considering that I can't tell in advance whether it's the disk or the network that is going to fail, and they fail just as often as each other, it's hard to decide which one to set up. The disk failure is more serious, since I generally can't bring the system back up without a reset after it hits. Even in this case, however, the system keeps running (for some arguable definition of running :-), to the point that I can often switch to VT1 (as long as I don't need to page code in to accomplish that; most often I don't) and see SysRq output. I've already looked for interesting stuff when the network failed, and found nothing: no held locks were shown by SysRq-D. For disk failures, there are generally lots of held locks, all of them related to ext3 inodes. Memory is not a problem in either case. I'm collecting the sysreport and will attach it as soon as it is done. It's taking forever to collect the list of packages (everything in today's rawhide).

Ok, so I've been going over this, and as far as I can see, unless we can capture an oops when this happens (or get a sysrq-t when the system deadlocks), we're not going to make much progress. I suggest the following plan:

1) Since FC5 has been released, make sure the problem still occurs on the latest kernel. If the problem is gone, it will be easier to track down what we fixed than to fix the problem all over again.

2) You have a lot of modules loaded; let's play guess and check. I'd start by removing non-essential modules (your sound modules are probably a good place to start, as sound cards can cause a good deal of interrupts). For those modules that you can't remove, do whatever you can to isolate and minimize the number of interrupts the device generates. The skge driver springs to mind here: either maximize the interrupt coalescence factor on the card, or filter the segment that the crashing system is on so that it only receives essential traffic (and minimize the received traffic if you can; I noted that this system seems to be a pretty busy named server, so if you can, move DNS resolution to a backup DNS server). The idea here clearly is to isolate which interrupts' (potential) migration is triggering your deadlock.

3) Is this system ping-responsive during the deadlock? If so, I can provide you a special module to trigger an oops on the reception of a malformed ping packet. We can use that to capture a core dump.

4) Let's monitor the system more closely. In your attached sysreport, do you have a timestamp you can reference when a hang occurred that we can correlate to a point in the sar log? It would be good to know what happened on the system leading up to the hang.

1) The problem still occurs in the latest rawhide kernel (2.6.17-1.2642.fc6).

2) I can't really do much in terms of removing the loaded modules et al. The system is my primary desktop, and I won't have another to play with for a while yet. Add to that the fact that the bug doesn't hit very often (once a day or so) and you see that I can't do much, really. I'm trying to get a serial cable to get stack traces, but even that is proving to be very difficult :-( It is not an active name server at all, BTW. It only runs named locally, serving itself, forwarding requests to an internal Red Hat name server or to my main home DNS server.

3) The exact symptoms of the failure vary. When it is the network card that dies, it's no longer responsive to pings. When it's something else, it is.

4) Sorry, I dropped the ball here.
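The request above to compare /proc/interrupts snapshots with and without irqbalance can be automated. A minimal sketch assuming the two-CPU column layout shown earlier in this bug; the function name `irq_delta` is mine:

```shell
# irq_delta OLD NEW: print the per-CPU change in interrupt counts for
# each IRQ line between two /proc/interrupts snapshots (assumes the
# two-CPU column layout shown earlier in this bug).
irq_delta() {
  awk 'NR == FNR {                 # first file: remember old counts
         if ($1 ~ /:$/) { old0[$1] = $2; old1[$1] = $3 }
         next
       }
       $1 ~ /:$/ {                 # second file: print the deltas
         printf "%s cpu0 %+d cpu1 %+d\n", $1, $2 - old0[$1], $3 - old1[$1]
       }' "$1" "$2"
}
```

A given IRQ whose count rises on both CPU columns between two samples is exactly the kind of migration being asked about here.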
I don't have anything that reliably (or even unreliably) triggers the problem; it's not high load, low load, building stuff, browsing the web, watching movies: nothing in particular. It just hits all of a sudden, and then the box exposes one of the various problems. I've even tried disabling CPU frequency switching to see if it helped any, but the problem still hit.

So that still leaves us where we were before. Unless we can get a stack trace or core dump, there isn't much at all I can do here. I would focus on getting the serial cable attached to get that stack trace or vmcore. Also, I'm attaching my ping crash module code. You can build that for your system and start auto-loading it in the event that you get a lockup but your system is still ping-responsive.

Created attachment 133386 [details]
patch to build the ping crash module
Here's the module code I mentioned. Fair warning: it makes your system crashable
through the reception of ICMP echo frames with properly formatted pad data, so
don't use it if you're not comfortable with that security risk.
I've finally got a serial cable. I re-enabled irqbalance and quickly got two disk failures, both of which started with nothing but a command timeout :-( I'll attach the console log in a moment.

Created attachment 133691 [details]
All I saw in the console when disks started failing because of irqbalance being on
Nothing informative, I'm afraid... Disks became inaccessible without anything
useful sent to the serial console, and then I reset the box as it became
unusable.
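Capturing a sysrq-t dump over a serial console like the one just attached requires the magic SysRq key to be enabled. This is the standard mechanism, not something taken from this report; on Fedora it amounts to one line of configuration:

```
# /etc/sysctl.conf: enable the magic SysRq key
kernel.sysrq = 1
```

After `sysctl -p` (or a reboot), Alt-SysRq-T on the console, or `echo t > /proc/sysrq-trigger` as root, dumps the task backtraces onto the serial console.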
Did you have sysrqs enabled? Were you able to dump a sysrq-t? I'm afraid what's on here doesn't provide anything to go on, really, except to say that it appears that your disks have started to operate poorly. Given that this is what we currently have to go on, I would suggest the following:

1) Make sure that sysrqs are enabled in sysctl.conf, and capture a sysrq-t if you can when this happens again.

2) Enable smart (it should support SATA controllers as of FC5, I think). If the drive itself is actually having a problem, that may help detect it early.

3) Check with the SATA card manufacturer; see if there is a firmware update available for the controller. Perhaps check to see if there is anything repaired relating to interrupt migration or movement (which may explain why enabling irqbalance triggers this crash).

4) Try the latest FC6 kernel. There were a few problems fixed in the libata code understanding drive return codes, I think. Some may be applicable here.

5) If possible, archive the data on the raid array, switch the drive controller to legacy (PATA) mode, and rebuild the array. Perhaps there is a heretofore undiscovered bug in the libata code or the SATA driver you are using.

1) sysrq-t will do, now that I have a serial console; doh! I forgot about it.

2) smart is active and reports no problems.

3) I've got the latest non-beta BIOS for the motherboard, and the controller is built into the motherboard.

4) I'm always running the latest FC devel kernel on this box, unless (i) it hasn't finished installing yet, or (ii) it breaks badly. With irqbalance off, that is.

5) I'm using software raid and the controller is already in regular, non-raid mode. Is this what you meant by legacy (PATA) mode? If not, please clue me in ;-) Thanks,

1) Waiting on sysrq-t.

2) Good to know, although something seems awry between smart not delivering errors and those log messages that you sent in.

3) Have you looked at the beta BIOS errata list to see if anything relates to your problem?

4) Ok, what were you running on the last crash?

5) By PATA I mean parallel ATA, a.k.a. IDE mode. Many SATA controllers have a BIOS option by which they will identify themselves to the BIOS and the OS as an IDE controller. Changing this mode is not recommended normally, as it means you will need to change your hardware config in your OS and rebuild your raid array; but if there is a driver problem, this lets you use the IDE driver to get to your drives, which may alleviate the problem.

Created attachment 134195 [details]
SysRq-T after the network stopped working
With today's kernel, I've been unable to duplicate the disk failures so far,
but I got a mouse failure and a network failure, both fixed by reloading the
corresponding modules ([eu]hci_hcd and skge, respectively), although I'm seeing
some slab corruption errors after reloading skge.
I don't see anything useful in the state dump, do you?
Created attachment 134196 [details]
Oopses I got after reloading skge, after a network failure
These are the oopses I got over the several minutes after I reloaded skge.
Kernel is 2.6.17-1.2564.fc6.x86_64.
2) Why/how would smart deliver errors if the entire disk subsystem stopped working? (Actually, that might not be entirely true; I didn't try plain IDE HDs during a failure scenario.)

3) I have the beta BIOS handy, but I can't see any change list for it. Maybe as soon as I get the new box I purchased, I'll give it a try.

4) Err, sorry, I don't remember what that was any more; sorry that I didn't mention it :-(

5) I don't see any BIOS options to switch the SATA controllers to plain PATA mode :-(

2) The short answer is that smart won't deliver errors if the entire disk subsystem just stops flat out. But if the disk was starting to die, it hopefully reports that to smartd before such a catastrophic failure.

3) Where did you get the BIOS from? I can hunt for a change list if you like.

Don't worry about 4 and 5. It just would have been helpful in an analysis if you remembered, and not all SATA controllers let you do what I suggested. It just would have been a good test if you were able.

As for the oops, I'm guessing that the slab corruption is just the result of an isolated skge bug. I expect that it's not as good at cleaning up after itself as it thinks on module unload/reload, and the result is some leaked/reused buffers. It's good news, though, that you can't reproduce your previous failure. When you say that your mouse and your network driver failed, can you elaborate? Clearly your system didn't hang when these failures occurred, as it did before. What did you observe that made you reload those modules?

2) The disks are perfectly fine; it's the entire disk subsystem (or perhaps the SATA subsystem, or the Promise controller only, although I've seen such failures affect disks on the VIA SATA controller as well on similar boxes that have more disks) that becomes inoperative.

3) http://support.asus.com/download/download_item.aspx?model=A8V%20Deluxe&type=Latest&SLanguage=en-us# I'm running 1017; 1018.001 is the latest beta.
It's not clear that I can't reproduce the previous failure; it sometimes took 2-3 days to get one such failure. As for the network and mouse problems, they're described in detail in this and in other bug reports such as bug 182618, bug 181347, bug 181920, bug 181310. All of them are triggered by having irqbalance enabled, as stated at the beginning of this bug report. Do you need any other info as to the symptoms?

You may be in luck. I was trawling about for others who may have had your same set of problems, and I ran across this: http://lkml.org/lkml/2006/5/16/89 Apparently, someone else at least has had your SATA problems with your motherboard/chipset. It appears to be fixed in the patch referenced in this bug: http://bugzilla.kernel.org/show_bug.cgi?id=5533 I'd rebuild your kernel with that patch to see how you fare (or, alternatively, check to be sure that it made it into 2.6.18 and just get that kernel from kernel.org).

The URLs mentioned in comment 23 appear to refer to a significantly different problem. At least the symptoms don't match at all what I'm observing. It's not a boot-time problem; the problem only shows up at random after hours (although sometimes just minutes) of regular use. It's true that there's a chance that the patch you mention will fix the SATA problems I've got, but the other problems still remain. Maybe they are independent, after all?

It would appear so. Besides, we don't really have anything else to go on here. Please confirm that the referenced patch is in the latest 2.6.18 kernel, and try it out.
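The workaround running through this thread is simply leaving irqbalance off, which keeps every interrupt on CPU0 as in the dump above. The same effect can be had selectively by pinning individual IRQs via /proc/irq/<N>/smp_affinity; a minimal sketch (the helper name `cpu_mask` is mine, and IRQ 18 is the skge NIC from the dump above):

```shell
# cpu_mask CPU: print the hex affinity bitmask selecting a single CPU,
# in the format /proc/irq/<N>/smp_affinity expects (bit n = CPU n).
cpu_mask() {
  printf '%x\n' $(( 1 << $1 ))
}

# Example (as root): pin IRQ 18 (skge) to CPU0 so the kernel
# never migrates it between cores:
#   cpu_mask 0 > /proc/irq/18/smp_affinity
```

Writing the mask requires root, and irqbalance will override manual affinity settings unless it is stopped or told to ban that IRQ.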