Bug 123332
Description
Aleksander Adamowski
2004-05-17 09:35:59 UTC
Created attachment 100261 [details]
Differences between 3Ware driver v1.02.00.036 and v1.02.00.037
v1.02.00.036 is in the stock Fedora kernel, but we have to use the latest
v1.02.00.037.
We had to update the controller's firmware because it hard-locked, and 3Ware
strongly advises updating the OS driver to the latest version before updating
the controller's firmware.
Created attachment 100262 [details]
Photo of console with kernel stacktrace after panic
Created attachment 100263 [details]
dmesg file from the machine
Created attachment 100264 [details]
output from lspci -vv
Created attachment 100265 [details]
output from dmidecode
More detailed hardware specification:
CPU: dual P4 Xeon 2GHz with hyperthreading (4 virtual CPUs)
RAM: 1 GB (2 x 512MB DDR Kingston with parity control)
Motherboard: Intel SE7501BR2
NIC: Intel Pro/100 Server Adapter integrated on the motherboard
Storage: hardware RAID 5 array on a 3Ware 8506-4LP controller, built from 4 Seagate Serial ATA 120GB drives
Created attachment 100266 [details]
/etc/sysconfig/hwconf file
Created attachment 100572 [details]
Another kernel panic
This one occurred today; this time the system was running in a higher-resolution
text mode, so I was able to capture the full text of the kernel panic.
Possibly related is bug 121732...

Created attachment 100666 [details]
Another kernel panic that occurred today on kernel-2.4.22-1.2188.nptlsmp.

After this panic I've installed the updated kernel 2.4.22-1.2190.nptlsmp, which apparently resolves the problem (according to bug 121732).

Another panic in refile_inode occurred just today on kernel-2.4.22-1.2190.nptlsmp. The problem has not been resolved. I'll attach a screenshot tomorrow morning.

Created attachment 100732 [details]
Yesterday's panic on kernel-2.4.22-1.2190.nptlsmp
Here's the text of the latest panic with 2190, for better searchability and readability:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01691ae
*pde = 0e723067
*pte = 00000000
Oops: 0002
e100 iptable_mangle ipt_REJECT ipt_multiport ipt_state ip_conntrack iptable_filter ip_tables floppy sg microcode keybdev mousedev hid input usb-uhci usbcore e
CPU:    2
EIP:    0060:[<c01691ae>]    Not tainted
EFLAGS: 00010246
EIP is at refile_inode [kernel] 0x4e (2.4.22-1.2190.nptlsmp)
eax: 00000000   ebx: e28fb900   ecx: 00000000   edx: e28fb908
esi: c0376028   edi: c0374fd8   ebp: 0000772e   esp: c3193de0
ds: 0068   es: 0068   ss: 0068
Process spamd (pid: 31686, stackpage=c3193000)
Stack: c19187a0 e20fb9c4 c013c642 e28fb900 c19187a0 00000000 c19187a0 c01461ba
       c19187a0 000001d2 c3192000 000005c3 000001d2 00000012 0000001d 000001d2
       c0374fd8 c0374fd8 c01464aa c3193e4c 000001d2 0000003c 00000020 c0146522
Call Trace:
[<c013c642>] __remove_inode_page [kernel] 0x82 (0x3c193de8)
[<c01461ba>] shrink_cache [kernel] 0x30a (0xc3193dfc)
[<c01464aa>] shrink_caches [kernel] 0x4a (0xc3193e28)
[<c0146522>] try_to_free_pages_zone [kernel] 0x62 (0xc3193e3c)
[<c0147102>] balance_classzone [kernel] 0x52 (0xc3193e60)
[<c0147438>] __alloc_pages [kernel] 0x188 (0xc3193e7c)
[<c010e968>] call_do_IRQ [kernel] 0x5 (0xc3193e88)
[<c0139b5f>] do_wp_page [kernel] 0x6f (0xc3193ebc)
[<c013a666>] handle_mm_fault [kernel] 0x106 (0xc3193ee0)
[<c011c94c>] do_page_fault [kernel] 0x14c (0xc3193f0c)
[<c011e9c0>] scheduler_tick [kernel] 0x120 (0xc3193f28)
[<c0107b3f>] __switch_to [kernel] 0x16f (0xc3193f44)
[<c011ed8f>] schedule [kernel] 0x7f (0xc3193f68)
[<c012e42e>] update_process_times [kernel] 0x3e (0xc3193f84)
[<c011c800>] do_page_fault [kernel] 0x0 (0xc3193fb0)
[<c0109c18>] error_code [kernel] 0x34 (0xc3193fb8)
Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08

Created attachment 100862 [details]
refile_inode kernel-2.4.22-1.2190 panic from today
Possible fix: a 3Ware support engineer has pointed out that this issue may have been fixed in kernel 2.4.26:

"In the changelog for 2.4.26, there was a bug in refile_inode() that was fixed. I would recommend you try this kernel. Below is the changelog:

Marcelo Tosatti:
  Trond: Avoid refile_inode() from putting locked inodes on the dirty list
  Changed EXTRAVERSION to -rc1"

That patch was merged in the 2190 kernel. It made no difference.

I've asked the author of that patch about the new issue; here is his response:

---SNIP---
On Thu, 17/06/2004 at 11:01, Aleksander Adamowski wrote:
>> Hi!
>>
>> I've seen that you've fixed a bug in the Linux 2.4 kernel related to refile_inode() (fix applied to kernel-2.4.26).
>>
>> There's still a related nasty crasher bug in refile_inode(), see this Red Hat Bugzilla bug:
>> http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123332

I'm not really a VFS person. That said, it looks to me from the dump you sent that refile_inode is calling list_del(&inode->i_list) on an inode that has already been removed from all lists. Normally, such an inode is supposed to be marked as I_FREEING...

I couldn't find any code in the 2.4.27-pre series that appeared to be able to put the inode in this bogus state. Somebody else will have to audit the Red Hat kernels to see if they have any such bugs. 8-)

Cheers,
  Trond
---SNIP---

We just saw the same bug; note that it is with the 2179 kernel with the refile_inode patch. We can't duplicate this bug in QA yet, only in production :( I'll attach a decoded oops.

Created attachment 102818 [details]
decoded oops from this panic
Decoded oops from the panic we had, which appears to be the same issue reported by the
submitter of this bug, and different from bug 121732.
Our system experiencing this problem is an HP DL380 G3 with:
* 2x 2.8GHz Xeon processors with Hyperthreading on
* 4GB RAM
* Internal HP hardware RAID-1 (cciss driver)

So it would appear that the 3Ware driver is not the problem.

Since Trond is rarely wrong, I'm assuming that the problem here has been fixed in Linux 2.4.27. That means the problem was fixed sometime between the 2.4.22-23 series and 2.4.27, but the fix was not merged into the FC1 kernel.

Going through the kernel changelogs on kernel.org line by line, I found two changesets that appear to be significant. Since I am not a kernel hacker I cannot confirm that the errors we're experiencing are caused by the lack of the two patches referenced below, but I have a feeling that a kernel hacker with VFS knowledge could confirm this relatively quickly.

In particular, one was fixed in 2.4.25-pre7 (2.4.25 release) by Rik van Riel with the comment: "some more fixes for fs/inode.c inode reclaiming changes". This code path does exactly what Trond refers to, calling list_del(&inode->i_list); the question is whether that inode could have already been removed from all lists, and that I do not know.
Rik's original post:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.2/0962.html
David Woodhouse's followup and approval:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.2/0970.html
A diff of the fs/inode.c code that resulted from the above mailing list postings:
http://source.scl.ameslab.gov:14690/linux-2.4-for-marcelo-ppc64/diffs/fs/inode.c@1.50?nav=index.html|ChangeSet@-9M|cset@1.1330|hist/fs/inode.c

In addition, there is a second inode-cache-related bugfix that seems like it belongs in the FC1 kernel, also from 2.4.25, fixed by David Woodhouse: "Do not leave inodes with stale waitqueue on slab cache"
http://source.scl.ameslab.gov:14690/linux-2.4-for-marcelo-ppc64/diffs/fs/inode.c@1.50?nav=index.html|ChangeSet@-9M|cset@1.1330|hist/fs/inode.c

Both of the above patches apply cleanly to the 2179-2199 kernels (fs/inode.c wasn't changed between those versions).

My biggest problem right now is that I can't duplicate the oops in a controlled environment. It happens once a week across all of our dozen or so servers running this kernel. I've got a test machine running ltp, dbench, kernel compiles, and other workloads to try to reproduce the oops, but I haven't seen it in 2 straight days of testing. It's not clear that the error was seen much (if at all) in the wild; it looks like Rik fixed it before many people noticed. From an email exchange with Aleksander, he can't duplicate this problem in a controlled setting either; it happens about twice per month for him.

Ideally a VFS person could look at the above patches and just say, "yes, this patch needs to be applied to the FC1 kernel; it could cause that oops."

After further looking, the second fix, "do not leave inodes with stale waitqueue on slab cache", was already applied in the FC1 2199 kernel. I will attach the patch against the FC1 2199 kernel that implements Rik's fix, which I'm testing now. Note that we do not use quotas, so the second part of his fix is not relevant to us, I think.

Created attachment 102912 [details]
patch that implements Rik's inode reclaim fix from 2.4.25
Unfortunately I cannot test this fix, as I've switched to the RHEL kernel on that machine to remedy the panics. For the record, we're running the 2.4.21-15.ELsmp RHEL kernel to avoid the panics.

Running with the FC1 2199 kernel that implements Rik's refile_inode fix, we've had 4 weeks (28 days) of uptime without a crash. The best we were doing before was 1 week, and often less than that. If anyone is still running/updating the FC1 kernel, this would be a good patch to apply.

Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/