121732 – (IT_41260) oops in refile_inode when running high load

Summary: oops in refile_inode when running high load

Keywords:
Status:	CLOSED WONTFIX
Alias:	IT_41260
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-04-26 20:47 UTC by Andrew Ryan
Modified:	2007-11-30 22:10 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2004-09-29 20:22:29 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
ksymoops output (deleted) 2004-04-26 20:50 UTC, Andrew Ryan	no flags	Details
'vmstat 30' output for period preceding crash (deleted) 2004-04-26 20:50 UTC, Andrew Ryan	no flags	Details
SysRq+T output from oopsed state (deleted) 2004-04-26 20:51 UTC, Andrew Ryan	no flags	Details

Description Andrew Ryan 2004-04-26 20:47:11 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7b) Gecko/20040415

Description of problem:
While running load tests of Subversion with the repository on an
NFS-mounted filesystem, we're getting reliable crashes on every kernel
from Red Hat 9 through Fedora Core 1. I've included the oops below and
will attach the ksymoops output shortly. The crash does not seem to
occur when we use a repository mounted on local disk. I don't believe
that it has anything to do with Subversion itself, but whatever load
svn is generating is tickling a kernel bug.

The hardware is dual Xeon 3.0GHz, running hyperthreading, kernel
2.4.22-1.2179.nptlsmp. The mount options in use are:
rw,tcp,nfsvers=3,rsize=32768,wsize=32768,intr
The NFS server is a NetApp. Both NFS client and server are running at
100Mb switched ethernet.

In the 2.4.26 kernel's Changelog
(http://kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.26) I saw
mention of a refile_inode bug fixed by Trond, which made me think
perhaps this is what is affecting us, but I don't know.

A few minutes before the machine crashes, the virtual memory system
seems to deteriorate rapidly, with large amounts of 'si' and
especially 'so' traffic. I will also attach 'vmstat 30' output for the
30 or so minutes preceding the system crash.

The bug doesn't seem to affect us on a RH 7.2-based system running a
vanilla 2.4.21 kernel that includes Trond's NFS-ALL patch cluster.


Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01690b7
*pde = 00000000
Oops: 0002
nfs lockd sunrpc iptable_filter ip_tables autofs tg3 keybdev mousedev
hid input usb-ohci usbcore ext3 jbd cciss sd_mod scsi_mod
CPU:    3
EIP:    0060:[<c01690b7>]    Not tainted
EFLAGS: 00010246

EIP is at refile_inode [kernel] 0x47 (2.4.22-1.2179.nptlsmp)
eax: 00000000   ebx: dc141b80   ecx: 00000000   edx: dc141b88
esi: c0375ea8   edi: c0374e58   ebp: 00023354   esp: e76a5dd4
ds: 0068   es: 0068   ss: 0068
Process svnlook (pid: 2038, stackpage=e76a5000)
Stack: c17de430 dc141c44 c013c5e2 dc141b80 c17de430 00000000 c17de430 c01460ca
       c17de430 000001d2 e76a4000 00000a57 000001d2 00000019 00000020 000001d2
       c0374e58 c0374e58 c01463ba e76a5e40 000001d2 0000003c 00000020 c0146432
Call Trace:   [<c013c5e2>] __remove_inode_page [kernel] 0x82 (0xe76a5ddc)
[<c01460ca>] shrink_cache [kernel] 0x30a (0xe76a5df0)
[<c01463ba>] shrink_caches [kernel] 0x4a (0xe76a5e1c)
[<c0146432>] try_to_free_pages_zone [kernel] 0x62 (0xe76a5e30)
[<f885827b>] ext3_do_update_inode [ext3] 0x19b (0xe76a5e38)
[<c0147012>] balance_classzone [kernel] 0x52 (0xe76a5e54)
[<c0147348>] __alloc_pages [kernel] 0x188 (0xe76a5e70)
[<c013df51>] do_generic_file_read [kernel] 0x401 (0xe76a5eb0)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5ee0)
[<c013e575>] generic_file_new_read [kernel] 0xc5 (0xe76a5f00)
[<c013e3b0>] file_read_actor [kernel] 0x0 (0xe76a5f10)
[<c0163131>] do_select [kernel] 0x151 (0xe76a5f24)
[<c013e69f>] generic_file_read [kernel] 0x2f (0xe76a5f4c)
[<f89fd608>] nfs_file_read [nfs] 0x98 (0xe76a5f64)
[<c01504ba>] sys_pread [kernel] 0xca (0xe76a5f8c)
[<c0109b27>] system_call [kernel] 0x33 (0xe76a5fc0)


Code: 89 01 c7 43 08 00 00 00 00 89 48 04 8b 06 89 50 04 89 43 08


Version-Release number of selected component (if applicable):
kernel-smp-2.4.22-1.2179.nptl

How reproducible:
Always

Steps to Reproduce:
Right now we can reproduce this using our Subversion load testing with
Silk Performer. We are working on reproducing this with commonly
available command-line tools.

Actual Results:  Kernel oops.

Expected Results:  Test completes.

Additional info:

Comment 1 Andrew Ryan 2004-04-26 20:50:02 UTC

Created attachment 99698 [details]
ksymoops output

Comment 2 Andrew Ryan 2004-04-26 20:50:52 UTC

Created attachment 99699 [details]
'vmstat 30' output for period preceding crash

Comment 3 Andrew Ryan 2004-04-26 20:51:59 UTC

Created attachment 99700 [details]
SysRq+T output from oopsed state

Comment 4 Andrew Ryan 2004-04-27 18:53:45 UTC

I submitted this to the linux-nfs mailing list, and according to
Trond, this is a VM bug which should be fixed in FC1 kernels:
http://marc.theaimsgroup.com/?l=linux-nfs&m=108301692018612&w=2

That it showed up on tests where we were using an NFS-mounted
filesystem is, apparently, just coincidental.

Subject:    Re: [NFS] oops in FC1 update kernel, in refile_inode
From:       Trond Myklebust <trond.myklebust () fys ! uio ! no>
Date:       2004-04-26 21:56:32

That is indeed a fix for a generic VFS/mm race. It has pretty much
nothing to do with NFS itself but just happened to trigger on an NFS
partition for someone.
As far as I can see, that patch hasn't yet been applied to the latest
errata kernel (linux-2.4.22-1.2188.nptl). Have you tried it out to see
if it fixes your Oops?

Steve, could you make sure that patch makes it into any future errata
kernels?

Cheers,
  Trond

["linux-2.4.26-refile_inode.dif" (linux-2.4.26-refile_inode.dif)]

--- linux-2.4.26-up/fs/inode.c.orig	2004-03-19 17:12:46.000000000 -0500
+++ linux-2.4.26-up/fs/inode.c	2004-03-26 13:01:23.000000000 -0500
@@ -319,7 +319,8 @@ void refile_inode(struct inode *inode)
 	if (!inode)
 		return;
 	spin_lock(&inode_lock);
-	__refile_inode(inode);
+	if (!(inode->i_state & I_LOCK))
+		__refile_inode(inode);
 	spin_unlock(&inode_lock);
 }

Comment 5 Andrew Ryan 2004-04-29 22:57:08 UTC

With the above patch applied to the FC1.2179 kernel, we have not seen
the oops in 2 days of constant testing. For reference, we used to see
this oops after 2-8 hours of stress testing.

Comment 6 Dave Jones 2004-04-30 11:17:04 UTC

patch is in cvs, and will be in the next update.

Comment 7 Aleksander Adamowski 2004-05-28 11:39:09 UTC

Can this be the same issue as in bug 123332? I've posted two stack
traces there from kernel panics, captured with a digital camera.

Comment 8 Aleksander Adamowski 2004-05-28 11:45:52 UTC

BTW, I forgot to mention: we're having those kernel panics on Fedora
kernel 2.4.22-1.2188.nptlsmp, about once every 2 weeks. This is a
production system, so unfortunately we cannot afford to stress-test it
to reproduce this artificially.

We also cannot connect a serial console, as the machine has only 1
serial port that has to be connected to a UPS.

But the stack traces captured with the digital camera look exactly the
same as the one reported here.

We suspected a hardware issue with the 3Ware controller that runs our
RAID5 array, but in light of this bug a kernel bug seems more
probable, right?

Comment 9 Dave Jones 2004-05-28 12:00:20 UTC

there should be a 2190 kernel in updates-testing, which should have
this fixed.

Comment 10 Aleksander Adamowski 2004-05-28 13:21:42 UTC

Our system just crashed again.

I've installed the 2.4.22-1.2190.nptlsmp kernel package from
2004-05-26 - I'll let you know if it remedies the issue, but testing
period will be long since this crash occurs about twice a month on
this particular system.

Comment 11 Aleksander Adamowski 2004-05-28 14:33:55 UTC

Does this issue affect Fedora 2's 2.6 kernel?

Comment 12 Dave Jones 2004-05-28 14:45:13 UTC

no. refile_inode doesn't exist there.

Comment 13 Aleksander Adamowski 2004-05-31 19:28:37 UTC

Another panic in refile_inode occurred just today on
kernel-2.4.22-1.2190.nptlsmp.

The problem has not been resolved, or the problem is separate (in that
case, bug 123332 is not a dupe of this one).

Comment 14 Aleksander Adamowski 2004-06-01 09:12:42 UTC

BTW, looking at /usr/src/linux-2.4/fs/inode.c (from
kernel-source-2.4.22-1.2190.nptl RPM) the fix from comment #4 is
present there.

But the panics still happen.

Comment 15 David Lawrence 2004-09-29 20:22:29 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/

Comment 16 Larry Troan 2004-10-19 22:04:22 UTC

Problem was found and fixed in RHEL3 U3.
