
Bug 131251

Summary: kernel Out of Memory: Killed process
Product: Fedora
Component: kernel
Version: rawhide
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Andrea Pasquinucci <cesare>
Assignee: Rik van Riel <riel>
CC: ade.rixon, axel.thimm, barryn, hansecke, ian_ar, justdave, kidcrash, nhwuxiaojun, rhbugs, rmj, robert.toole, sandip, shishz, wtogami
Last Closed: 2005-09-28 09:41:11 UTC
Bug Blocks: 125270, 130887, 136452

Attachments:
- Snips of logs
- Some logs from my OOM experiences
- Vanilla 2.6.8.1 kernel .config based on .config from FC kernel 624
- Vanilla 2.6.9 kernel .config based on .config from FC kernel 624
- syslog just before crash after inserting a USB mass storage device (digital camera)
- syslog during the reboot after the crash; the USB mass storage device (digital camera) is still plugged in
- Some of the differences between 2.6.8.1-bk2 and 2.6.9-rc1
- /var/log/messages snip from (uname -a | cut -d ' ' -f 3-) 2.6.9-1.3_FC2 #1 Mon Nov 15 14:46:43 EST 2004 i686 i686 i386 GNU/Linux

Description Andrea Pasquinucci 2004-08-30 14:39:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.2)
Gecko/20040803

Description of problem:
With kernel-2.6.8-1.521, when the PC is running a few tasks at the
same time, it runs out of memory and kills some programs. This never
happened with previous kernels, and I have not changed anything in
the configuration since the installation of FC2. It is running
SELinux in permissive mode. The PC is old, with 64MB of RAM,
headless, and runs in runlevel 3 as a server: no X11 etc., only
services. It worked perfectly until now, and still does except for
this. I append the /var/log/messages output.


Version-Release number of selected component (if applicable):
kernel-2.6.8-1.521

How reproducible:
Sometimes

Steps to Reproduce:
1. Start a few tasks at the same time, such as printing and up2date
2.
3.
    

Actual Results:  killed some programs

Expected Results:  normal operation

Additional info:

#/var/log/messages
Aug 30 15:49:41 old kernel: DMA per-cpu:
Aug 30 15:49:41 old kernel: cpu 0 hot: low 2, high 6, batch 1
Aug 30 15:49:41 old kernel: cpu 0 cold: low 0, high 2, batch 1
Aug 30 15:49:41 old kernel: Normal per-cpu:
Aug 30 15:49:41 old kernel: cpu 0 hot: low 4, high 12, batch 2
Aug 30 15:49:41 old kernel: cpu 0 cold: low 0, high 4, batch 2
Aug 30 15:49:41 old kernel: HighMem per-cpu: empty
Aug 30 15:49:41 old kernel:
Aug 30 15:49:41 old kernel: Free pages:        1120kB (0kB HighMem)
Aug 30 15:49:41 old kernel: Active:915 inactive:8258 dirty:0 writeback:7565 unstable:0 free:280 slab:4536 mapped:1585 pagetables:487
Aug 30 15:49:41 old kernel: DMA free:496kB min:60kB low:120kB high:180kB active:1200kB inactive:7524kB present:16384kB
Aug 30 15:49:41 old kernel: protections[]: 30 124 124
Aug 30 15:49:41 old kernel: Normal free:624kB min:188kB low:376kB high:564kB active:2460kB inactive:25508kB present:49088kB
Aug 30 15:49:41 old kernel: protections[]: 0 94 94
Aug 30 15:49:41 old kernel: HighMem free:0kB min:128kB low:256kB high:384kB active:0kB inactive:0kB present:0kB
Aug 30 15:49:41 old kernel: protections[]: 0 0 0
Aug 30 15:49:41 old kernel: DMA: 0*4kB 4*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 496kB
Aug 30 15:49:41 old kernel: Normal: 66*4kB 1*8kB 0*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 624kB
Aug 30 15:49:42 old kernel: HighMem: empty
Aug 30 15:49:42 old kernel: Swap cache: add 23870, delete 15733, find 2664/3980, race 0+1
Aug 30 15:49:42 old kernel: Out of Memory: Killed process 1672 (imapd).
Aug 30 15:49:42 old kernel: oom-killer: gfp_mask=0x1d2
Aug 30 15:49:42 old kernel: DMA per-cpu:
Aug 30 15:49:42 old kernel: cpu 0 hot: low 2, high 6, batch 1
Aug 30 15:49:43 old kernel: cpu 0 cold: low 0, high 2, batch 1
Aug 30 15:49:43 old kernel: Normal per-cpu:
Aug 30 15:49:43 old kernel: cpu 0 hot: low 4, high 12, batch 2
Aug 30 15:49:43 old kernel: cpu 0 cold: low 0, high 4, batch 2

# free
             total       used       free     shared    buffers     cached
Mem:         61416      59200       2216          0       2256      16456
-/+ buffers/cache:      40488      20928
Swap:       305192       6900     298292

Comment 1 Axel Thimm 2004-09-13 09:15:39 UTC
Similar things happen on FC2/x86_64 with 1GB RAM:
http://www.redhat.com/archives/fedora-list/2004-September/msg02048.html

Here are some numbers from the posting above; they show that almost
all memory is consumed outside of userland. The system is a
dual-Opteron board with only one processor installed (Tyan S2880, no
SATA/SCSI used).

# free
             total       used       free     shared    buffers     cached
Mem:       1027016    1022600       4416          0        992       7288
-/+ buffers/cache:    1014320      12696
Swap:      2047992       4496    2043496
# vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free  inact active   si   so    bi    bo   in    cs us sy id wa
 0  0   4496   4352   4548   6556    1    1   399    80 1517   162  2  2 88  8
# cat /proc/meminfo
MemTotal:      1027016 kB
MemFree:          4352 kB
Buffers:          1008 kB
Cached:           7316 kB
SwapCached:       1148 kB
Active:           6528 kB
Inactive:         4536 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      1027016 kB
LowFree:          4352 kB
SwapTotal:     2047992 kB
SwapFree:      2043496 kB
Dirty:             236 kB
Writeback:           0 kB
Mapped:           5296 kB
Slab:            14388 kB
Committed_AS:   535496 kB
PageTables:     494900 kB
VmallocTotal: 536870911 kB
VmallocUsed:      1568 kB
VmallocChunk: 536869323 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
# ps uaxwwf
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  3472  428 ?        S    Sep12   0:01 init [3]
root         2  0.0  0.0     0    0 ?        SWN  Sep12   0:00 [ksoftirqd/0]
root         3  0.0  0.0     0    0 ?        SW<  Sep12   0:00 [events/0]
root         4  0.0  0.0     0    0 ?        SW<  Sep12   0:00  \_ [khelper]
root         5  0.0  0.0     0    0 ?        SW<  Sep12   0:00  \_ [kacpid]
root        30  0.0  0.0     0    0 ?        SW<  Sep12   0:00  \_ [kblockd/0]
root        44  0.0  0.0     0    0 ?        SW   Sep12   0:00  \_ [pdflush]
root        45  0.0  0.0     0    0 ?        SW   Sep12   0:02  \_ [pdflush]
root        47  0.0  0.0     0    0 ?        SW<  Sep12   0:00  \_ [aio/0]
root       186  0.0  0.0     0    0 ?        SW<  Sep12   0:00  \_ [ata/0]
root        31  0.0  0.0     0    0 ?        SW   Sep12   0:00 [khubd]
root        46  0.0  0.0     0    0 ?        SW   Sep12   0:01 [kswapd0]
root       151  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kseriod]
root       188  0.0  0.0     0    0 ?        SW   Sep12   0:00 [scsi_eh_0]
root       189  0.0  0.0     0    0 ?        SW   Sep12   0:00 [scsi_eh_1]
root       204  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kjournald]
root       339  0.0  0.0  2336  216 ?        S<   Sep12   0:00 udevd
root       896  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kjournald]
root       897  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kjournald]
root       898  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kjournald]
root       899  0.0  0.0     0    0 ?        SW   Sep12   0:00 [kjournald]
root      1637  0.0  0.0     0    0 ?        SW<  Sep12   0:00 [krfcommd]
root      1946  0.0  0.0 18104  748 ?        S    Sep12   0:00 /usr/sbin/sshd
root      5189  0.0  0.1 37540 1056 ?        S    02:04   0:00  \_ sshd: root pts/0
root      5195  0.0  0.0 45656 1020 pts/0    S    02:04   0:00  |   \_ -bash
root      5255  0.0  0.1 104764 1892 pts/0   S    02:04   0:00  |       \_ gkrellm
root     29075  0.0  0.0 44836  500 pts/0    S    02:38   0:00  |       \_ sleep 10
root      6119  0.0  0.0 37284 1020 ?        S    02:19   0:00  \_ sshd: root pts/1
root      6133  0.0  0.1 45656 1120 pts/1    S    02:19   0:00  |   \_ -bash
root     29079  0.0  0.0 44476  924 pts/1    S    02:38   0:00  |       \_ /bin/sh ./memory.sh
root     29083  0.0  0.0  5228  784 pts/1    R    02:38   0:00  |           \_ ps uaxwwf
root      6193  0.0  0.0 37284 1020 ?        S    02:20   0:00  \_ sshd: root pts/2
root      6212  0.0  0.1 45656 1136 pts/2    S    02:20   0:00  |   \_ -bash
root     29077  0.0  0.1 35936 1932 ?        S    02:38   0:00  \_ sshd: bin [priv]
sshd     29078  0.0  0.1 19448 1120 ?        S    02:38   0:00      \_ sshd: bin [net]
root      2542  0.0  0.0  2344  272 tty1     S    Sep12   0:00 /sbin/mingetty tty1
root      2543  0.0  0.0  2344  272 tty2     S    Sep12   0:00 /sbin/mingetty tty2
root      2544  0.0  0.0  2344  272 tty3     S    Sep12   0:00 /sbin/mingetty tty3
root      2545  0.0  0.0  2344  276 tty4     S    Sep12   0:00 /sbin/mingetty tty4
root      2546  0.0  0.0  2344  276 tty5     S    Sep12   0:00 /sbin/mingetty tty5
root      2547  0.0  0.0  2344  276 tty6     S    Sep12   0:00 /sbin/mingetty tty6

Comment 2 Axel Thimm 2004-09-13 09:30:46 UTC
As a follow-up: this is unrelated to the other memory leak concerning
SG_IO/bio_uncopy_user (bugs #132180 and #131414). The system in
question is plain old IDE and has no SATA/SCSI/CD-ROM/USB devices
attached.

The memory leak occurs, for instance, while trying to rebuild the
src.rpm of kernel-2.6.8-1.521 (without modifications), or while
trying to build kernel modules for it (the lirc build, for instance,
eats up the 1GB of memory already in the configure phase).

Comment 3 Axel Thimm 2004-09-17 02:43:40 UTC
This also happens with kernel-2.6.8-1.541.

Comment 4 Axel Thimm 2004-09-17 19:45:57 UTC
A posting on lkml suggests that the CFQ scheduler may be the cause of
the leak.

http://lkml.org/lkml/2004/8/27/102

Comment 5 Warren Togami 2004-09-18 19:38:51 UTC
If this is the case, please test with "elevator=deadline" or
"elevator=as" boot options and report back.

Comment 6 Axel Thimm 2004-09-20 10:04:08 UTC
It turns out that the bug I am seeing is x86_64 specific when running
32 bit applications on FC2/x86_64. I have therefore opened a new
bugzilla entry at #132947.

Comment 7 Petr Vita 2004-10-06 09:08:52 UTC
Created attachment 104830 [details]
Snips of logs

We are observing the same bug on one of our production machines, an
Intel Pentium 4 with 1 GB of memory. Memory is simply consumed
outside of userland: "free" shows that nearly all memory is used,
while "ps xuawf" shows that processes are using barely 15% of memory.
At the end of the show, /var/log/messages shows the oom-killer
killing all the processes, including the X server. Even that does not
help much; a restart is needed.

Comment 8 Petr Vita 2004-10-07 07:42:44 UTC
We have observed the same behaviour on kernel 2.6.7 as well. We are
now running a custom-built 2.6.8.1 kernel with the voluntary
preemption patch; the bug is still present. The kernel has been
running the whole time with the anticipatory I/O scheduler. We are
trying the kernel parameter "elevator=deadline" at the moment.

Comment 9 Petr Vita 2004-10-08 07:01:58 UTC
Parameter "elevator=deadline" has no influence on the bug presence.
Any idea what we can do?

Comment 10 Toshio Kuratomi 2004-10-09 13:45:45 UTC
Until recently I experienced the same symptoms on an old K6. I'm
using a custom compile of 2.6.8-1.521 with gcc-3.3.3-7. I was able to
resolve it by unsetting CONFIG_CC_OPTIMIZE_FOR_SIZE; this is the only
change between my broken and my functional kernel. I don't know what
other differences there might be between my kernel and the stock one,
though.
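
A minimal sketch of the experiment, for anyone who wants to repeat it
(the build commands are generic, not copied from my setup):

# in the kernel source tree, with the FC .config in place
grep CONFIG_CC_OPTIMIZE_FOR_SIZE .config          # check the current setting
sed -i 's/^CONFIG_CC_OPTIMIZE_FOR_SIZE=y/# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set/' .config
make oldconfig                                    # re-validate the config
make bzImage modules && make modules_install install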

Comment 11 Jeremy Sanders 2004-10-14 09:43:11 UTC
We've experienced something like this with a 128MB Celeron system
which backs up files over NFS to tape using tar. It has fallen over a
couple of times with the OOM killer, even though sar shows that very
little swap is consumed.

Comment 12 Warren Togami 2004-10-17 21:34:20 UTC
Is this issue solved if you try the 6XX rawhide kernels on FC2?  It
should work.

Comment 13 Anssi Johansson 2004-10-18 16:13:04 UTC
Even though this bug is filed against FC2, I'd like to note that
something related is happening with FC3T3. I can file a separate bug
about this if desired.

From my point of view, memory gets used normally, but instead of
swapping out some pages, the kernel starts killing innocent processes
even though there's plenty of swap space available. The attachment
oom-tiikeri.txt contains some additional information about the
problem.

I'm running 2.6.8-1.624 on the x86_64 architecture, with 512MB of RAM
and 4GB of swap. I loaded a 680MB SQL database dump in nano. So far
things work fine; swap usage has grown to 573MB, which was expected.
When I try finding a non-existent string in nano (Ctrl-W and some
random string), things start going bad. The kernel starts by killing
mysqld and httpd, and then eventually kills nano itself. The swap
space usage peaks at about 600MB, meaning there was always some 3.4GB
of free swap available. After nano gets killed, the used memory and
swap are freed properly, no problems there. The problem is that the
kernel starts killing processes a bit too eagerly, instead of
swapping things out to the swap area.

I tried elevator=as and elevator=deadline, but they didn't help at
all. The good(?) news is that the behaviour can be reproduced with
100% certainty (a condensed recipe follows at the end of this
comment).

My FC2 computer (2.6.8-1.521, AMD Duron) behaves similarly, except
that the first victim is nano itself, probably because there are no
other memory-hungry processes running on that box. Swap space usage
peaked at about 650MB, out of a total of 3GB.
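
Condensed, the reproduction recipe above is simply this (the file
name is illustrative; any file much larger than RAM should do):

# load a file much larger than RAM, then force a full-buffer search
nano huge-dump.sql      # a ~680MB SQL dump in my case
# inside nano: press Ctrl-W and search for a string that does not occur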

Comment 14 Anssi Johansson 2004-10-18 16:15:40 UTC
Created attachment 105388 [details]
Some logs from my OOM experiences

free, /proc/meminfo, /proc/slabinfo, vmstat, /var/log/messages entries

Comment 15 Axel Thimm 2004-10-18 16:27:48 UTC
To Anssi:

Are you running 32 bit apps on the x86_64 box? If yes, then this could
be bug #132947, another kernel memory leak bug.

Does downgrading to 2.6.7 kernel rpms help?


Comment 16 Anssi Johansson 2004-10-18 16:54:52 UTC
Axel: No, the binaries are all 64-bit, at least according to 'file'.
The installation is pretty much a basic FC3T3 (the x86_64 variant)
without X, with the addition of MySQL4 downloaded from the MySQL
website; MySQL is also the 64-bit version. Doing the same experiment
without mysql running gives similar results, i.e. processes (such as
httpd) get killed and then eventually nano, so I can't really blame
any external software for these problems.

Comment 17 Anssi Johansson 2004-10-18 20:42:35 UTC
Hmm, this is interesting. I downloaded 2.6.7-1.494.2.2 as instructed
(from the FC2 x86_64 updates) and installed it. I had to disable the
VIA Velocity Gbit ethernet module to keep it from crashing during
boot, though.

However, once the 2.6.7 kernel was running, it worked beautifully in
this respect. I loaded 4 copies of nano (each with a memory footprint
of about 1 gigabyte) and searched for strings simultaneously with all
of them. No problems whatsoever. Loading a fifth nano pushed the
memory usage beyond the available 4GB swap + 512MB RAM, after which
mysqld was killed, then httpd and finally nano. This was the expected
behaviour.

It looks like something definitely broke between 2.6.7-1.494.2.2 and
2.6.8-1.624. Do you have suggestions for specific kernel versions I
should try, to pinpoint the exact version that started causing
problems?

Comment 18 Axel Thimm 2004-10-18 21:31:34 UTC
I found the memory leak to have been introduced between
2.6.7-1.494.2.2 and 2.6.8-1.521, but only for mixed 64/32 bit use
(bug #132947). It could be the same bug, though, triggered by
different conditions.

If you find the bug to occur between those two versions, I would
suggest trying a vanilla kernel (2.6.8 with a configuration derived
from 2.6.8-1.521).


Comment 19 Andrea Pasquinucci 2004-10-19 06:39:58 UTC
After a few tests (I reported this bug at the end of August) I
switched to a vanilla kernel, since I could not run servers with this
problem. I have absolutely no problems with a vanilla kernel. As
someone has indicated, I think the bug has something to do with
CONFIG_CC_OPTIMIZE_FOR_SIZE, which seems to be set in the Fedora
kernels but not in the vanilla ones (at least the ones I run).

Comment 20 Petr Vita 2004-10-19 07:02:19 UTC
Our vanilla kernel 2.6.8.1 was broken even though
CONFIG_CC_OPTIMIZE_FOR_SIZE was not set. After we switched to the
Red Hat stock kernel 2.6.8-1.521, the memory leak disappeared.

Comment 21 Sandip Bhattacharya 2004-10-19 21:36:20 UTC
I hit a similar problem in the middle of writing an audio CD. I must
admit that, because of the recent cdrecord SCSI permission problem in
2.6.8, I was running cdrecord as root.

In response to a similar report quite some time back, Andrew Morton
had said on the lkml list[1] that it was an untraceable problem
occurring when burning audio CDs.

Debian seems to have released[2] a version of the kernel which fixes
the problem.

The problem still exists in the current FC2 kernel (2.6.8-1.521). Are
there any fixes forthcoming for the FC2 kernel?

[1] http://lkml.org/lkml/2004/8/7/6
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=267464

Comment 22 Anssi Johansson 2004-10-19 22:25:16 UTC
Created attachment 105473 [details]
Vanilla 2.6.8.1 kernel .config based on .config from FC kernel 624 

If this is any help: I downloaded 2.6.8.1, unpacked it, copied over
the .config from Fedora Core kernel build 624, and ran "make
oldconfig". The resulting .config file is attached. After the new
kernel was compiled and installed, I rebooted and ran the usual
tests. The tests passed fine. Later on I noticed that the
CONFIG_CC_OPTIMIZE_FOR_SPEED option had been turned off during "make
oldconfig". I changed the Makefile so that it would always pass -O2
instead of -Os, cleaned up the previous build, and recompiled the
kernel. The tests again passed nicely, regardless of this
configuration change. I'll try to compare the differences between the
vanilla kernels and the FC kernels later on.
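
For reference, a sketch of the procedure described above (paths and
the config file name are illustrative):

cd /usr/src/linux-2.6.8.1
cp /boot/config-2.6.8-1.624 .config   # .config from FC kernel build 624
make oldconfig                        # accept the defaults for new options
make bzImage modules && make modules_install install
# to force -O2: edit the top-level Makefile, replace -Os with -O2,
# then "make clean" and rebuild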

Comment 23 Warren Togami 2004-10-20 01:09:19 UTC
2.6.8.1 vanilla is significantly older than 624.  Please try the same
test on 2.6.9.


Comment 24 Anssi Johansson 2004-10-20 06:58:28 UTC
Created attachment 105495 [details]
Vanilla 2.6.9 kernel .config based on .config from FC kernel 624 

Typo: in comment #22 I should have written CONFIG_CC_OPTIMIZE_FOR_SIZE instead
of CONFIG_CC_OPTIMIZE_FOR_SPEED.

Vanilla 2.6.9 with the attached .config (624 + "make oldconfig") suffers from
the premature OOM kills as well.

Comment 25 Warren Togami 2004-10-20 08:00:02 UTC
Let us summarize.

In comment #10, Toshio indicates that unsetting
CONFIG_CC_OPTIMIZE_FOR_SIZE prevents this trouble. Comment #22 and
comment #24 indicate that CONFIG_CC_OPTIMIZE_FOR_SIZE makes no
difference and that the upstream vanilla kernel works fine, while
RH's patched kernel does not.

Axel Thimm indicates that he sees this on x86_64. On which archs are
the other reporters seeing this?

http://people.redhat.com/wtogami/temp/kernel-2.6.9-1.639.src.rpm
Can anybody confirm comment #10 with a rebuilt 639 kernel? I would do
it myself, but I cannot reproduce this behavior at all.
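
A sketch of the rebuild, in case it saves someone a lookup (the
target arch is an example; pick whatever matches your box):

rpmbuild --rebuild --target=i686 kernel-2.6.9-1.639.src.rpm
# the binary rpms land under /usr/src/redhat/RPMS/i686/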

Comment 26 Sandip Bhattacharya 2004-10-20 08:37:00 UTC
> Axel Thimm indicates that he sees this on x86_64.  What archs are the
> other reporters seeing this?

x86_32 for me.

Comment 27 Warren Togami 2004-10-20 08:44:55 UTC
i586 or i686?

Comment 28 Sandip Bhattacharya 2004-10-20 08:57:27 UTC
i686

[root@pluto root]# uname -a
Linux pluto.home 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686
athlon i386 GNU/Linux


Comment 29 Axel Thimm 2004-10-20 09:41:29 UTC
In reply to comment #25:
> Axel Thimm indicates that he sees this on x86_64.

The bug I see is of a different nature and is x86_64-specific (ia32
emulation within x86_64; see comment #6 and bug #132947).

It is possible that the two bugs are related and triggered by
different subsystems, but for now I would treat them as separate
entities, so I would remove my datapoint and also ignore the vmstats
I posted.

OTOH, Anssi in comment #16 has confirmed that pure x86_64 causes his
problems.

Comment 30 Anssi Johansson 2004-10-20 11:21:06 UTC
My main test rig is indeed an x86_64 computer, an AMD Athlon 64 3500+
on an Abit AV8 motherboard. This system is running FC3T3, but I've
upgraded to the latest packages found in the development repository.

In comment #13 (the last paragraph) I mentioned that I'm able to
reproduce the problem on an FC2 computer with a 32-bit AMD Duron
processor (Linux karhu.aj 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004
i686 athlon i386 GNU/Linux).

A clarification of my comment #24: I didn't test 2.6.9 (yet) with
CONFIG_CC_OPTIMIZE_FOR_SIZE enabled. I'll test that option, and the
639 version, once I get back home from work.

Comment 31 Warren Togami 2004-10-20 11:23:21 UTC
OK, great. We agree that there is a problem on multiple archs; now we
only need to isolate exactly what causes it.

Comment 32 Anssi Johansson 2004-10-20 11:56:46 UTC
In reply to comment #25:
> upstream vanilla kernel works fine, while RH's patched kernel does not.

2.6.8.1 works fine but 2.6.9 does not, see comment #24. I guess this
means I'll have to test some 2.6.9 release candidates as well.

Comment 33 Robert Toole 2004-10-20 14:55:41 UTC
All, apologies for the long post; I have a lot of info here.

I have 3 systems with this issue.

I am running FC2 on two Athlon 2200 machines with 1 GB of RAM. I see
this bug crop up during continuous use of an IDE RAID controller
(IT8212, the RAID controller integrated on Gigabyte motherboards).
One is a production system using zonemonitor to record video from
security cameras; the second is a test/devel box. The leak progresses
very slowly, and it takes about 3 weeks for the oom killer to kick in
on the production box. The test system has identical hardware, except
that the drives attached to the RAID controller are rarely used,
hence that machine did not display this behaviour.

I also have an HP LH3000r, a dual PIII 866 with 2 GB of RAM, running
FC2 2.6.8-1.521 on the integrated NetRAID controller (megaraid.ko).
On this system not all physical memory is in use, yet over the last
few weeks since installing 2.6.8-1.521 the swap usage has also been
creeping upward (currently at 25MB).

I wrote a script that simply copies a bunch of files between two
folders on drives attached to the RAID controller on the test box (a
sketch follows below). In 24 hours, swap usage went from 0 to 550 MB
using 2.6.8-1.521. In 48 hours, the swap was full and the oom killer
kicked in. I reverted to 2.6.5-1.358 and have observed an
improvement: after 24 hours the swap usage is only at 1 MB. The only
thing the test machine is doing is copying those files back and
forth; no apache, X, or other services.
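
The script itself is nothing fancy; roughly the following (the
directories are illustrative):

#!/bin/sh
# copy a file set back and forth between two directories on the RAID volume
while true; do
        cp -a /raid/data/. /raid/scratch/
        rm -rf /raid/scratch/*
done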

I have noticed that the harder the filesystem is being used, the
faster the memory fills up. If you completely stop using a filesystem
mounted from a disk that uses the SCSI generic modules, the swap
clears out. It seems that the more heavily loaded the system is, the
more prevalent the bug.

I am currently compiling a vanilla 2.6.8 kernel (without using the
FC2 config); I will also try using the FC2 config. (I only have one
machine to test with, and the results take up to 48 hours.)

I'll try anything anyone wants here, and will also provide output and 
system info as requested.

robert [dot] toole [at] keuhne [dash] nagel [dot] com

Comment 34 Anssi Johansson 2004-10-20 16:32:50 UTC
I did some further testing on 2.6.9. First of all, the -Os and -O2
options (CONFIG_CC_OPTIMIZE_FOR_SIZE) do not seem to have an effect
on this problem. I also tried the elevator=deadline option for both
kernels, the one with -Os and the one with -O2. The elevator option
didn't seem to have an effect either, so I think I'll concentrate on
finding the exact kernel version that broke memory management for me.
I know it's somewhere between 2.6.8.1 and 2.6.9.

Now that I finally got my memory sticks, I tried upgrading the memory
to 4GB. And boy, did I open another can of worms by doing that. After
booting up I ran "free" and saw that I had 4GB of usable memory in
Linux. I loaded the SQL dump file with nano and did the usual text
search trick. It worked fine, but that's probably because the system
had enough RAM this time (one text editor instance uses only about
1GB of memory). I loaded another copy in another session; it loaded
nicely as well. I had started loading a third file when the system
seemed to get stuck; nano just stopped at "Reading File". On another
console I tried running "free" to see if the system was already
swapping at this point (it shouldn't have been). Surprise, surprise:
the "free" command itself got stuck; it refused to output anything
and also refused to return to the command prompt. Running "ls" and
"vmstat 2" on other consoles gave the same result: nothing was
printed on the screen, and those commands didn't return to the
command prompt. The editors on consoles 1 and 2 were responsive and
worked until I tried to move to the next page; that move made both of
them get stuck. This happened with 2.6.9 and elevator=as. I've now
gone back to 512MB until the OOM problems get solved.

Next up: results from the 639 build and from some 2.6.9 release
candidates.

Comment 35 Anssi Johansson 2004-10-20 20:13:22 UTC
Sorry, the 639 build didn't help in my situation.

Comment 36 Sandip Bhattacharya 2004-10-20 21:05:54 UTC
Created attachment 105549 [details]
syslog just before crash after inserting a USB mass storage device (digital camera)

Comment 37 Sandip Bhattacharya 2004-10-20 21:16:21 UTC
Created attachment 105550 [details]
syslog during the reboot after the crash. Note that the USB mass storage device (digital camera) is still plugged in

This and the previous attachment (105549) are the logs from when my
computer crashed again due to that crazy oom-killer :). This time all
I did was plug my Olympus C120 into the USB slot. The computer froze
with the HDD light continuously red, and I rebooted.

The camera was still attached to the computer while rebooting.

The weird thing is that, as you can see in the post-boot logs, even
during the bootup process the kernel oom-killer kept on killing. You
can see that it is the usb.agent process that kept being killed. Of
course I couldn't see any of this happening, because it kept
happening in the background. After I logged in, I could simply mount
the camera (remember that it was still plugged in) and use it
normally. So the whole oom-killer massacre didn't actually prevent me
from using the device the second time.

To say the least, all this terribly upsets me, because this is a
workstation. I can't keep rebooting and testing. Is it advisable to
downgrade to an earlier kernel?

Comment 38 Anssi Johansson 2004-10-21 01:19:13 UTC
Quoting myself from comment #34:
> I think I'll concentrate on finding the exact kernel version which
> broke the memory management for me. I know it's somewhere between 
> 2.6.8.1 and 2.6.9. 

A couple of patches and kernel recompiles later, I was able to
determine that the breakage happened between 2.6.8.1-bk2 and
2.6.9-rc1, i.e. around August 23.

Given enough time I'll eventually figure this out by myself, but I'd
appreciate it if someone more familiar with kernel development would
take a look at what happened between those releases.
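
For anyone repeating this: the bisection was done by hand with the
incremental patches from kernel.org, roughly as follows (the snapshot
names follow the usual kernel.org convention and may not match
exactly what I used; note the -rc patches apply against the 2.6.8
base release):

cd linux-2.6.8.1
bzcat ../patch-2.6.8.1-bk2.bz2 | patch -p1   # rebuild, boot, test: still good
# step forward snapshot by snapshot until the OOM kills appear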

Comment 39 Andrea Pasquinucci 2004-10-21 06:39:24 UTC
Sorry, guys, that I cannot contribute much here even though the
original post is mine (I have no hardware I can test on as of today).
Anyway, from my experience at the time of my posting, disk activity
(i.e. filesystem activity) could have something to do with this. When
the OOM kills kicked in, the disk LED was constantly on and the disks
were spinning loudly. At the time I thought it was swap activity, but
it could very well have been filesystem activity, since there were
programs doing something on the filesystem, like printing large files
or up2date. My arch is x86_32 i686, both Intel and AMD.

Comment 40 Robert Toole 2004-10-21 14:44:42 UTC
I just tried 2.6.9 vanilla with the FC2 config. After 24 hours of
continuous heavy disk usage it was much better than 2.6.8-1.521, but
swap use still grew to 300 MB in that time.

The arch is x86_32 i686, Intel and AMD. I am going to roll my own
config for 2.6.9 and try again.

I agree that the heavier the disk usage, the worse the problem.

Comment 41 Anssi Johansson 2004-10-22 00:55:12 UTC
Created attachment 105621 [details]
Some of the differences between 2.6.8.1-bk2 and 2.6.9-rc1

Bug tracking issue: it looks like there are at least two kinds of
bugs being discussed here. One seems to be a memory leak issue, and
the one I've been writing about is the kernel's odd habit of OOM
killing processes even though there's plenty of swap space available.
I haven't seen any memory leaks (that I know of) during my testing,
so I can't comment on those. Unfortunately the original bug report is
a bit unclear about which problem the reporter has been experiencing,
so it looks like we're going to be discussing both problems here
until someone makes some administrative cleanup decisions.

On the debugging front: I've just tried 2.6.9-bk6, but it didn't help
with my problem; the kernel still happily OOM killed my text editor
even though there was more than 3GB of available swap space.
bk{3,4,5} didn't even compile, and bk2 didn't help either. I haven't
tried bk1, but I suppose it'd be useless.

As I mentioned earlier in comment #38, I was able to determine that
the breakage happened between 2.6.8.1-bk2 and 2.6.9-rc1, and I've
been comparing the differences between those versions. I started with
2.6.8.1-bk2 and integrated some of the files from the 2.6.9-rc1 mm
directory, one by one, taking care of the dependencies when the
compiler complained about them. The end result is that I started
seeing those OOM problems once I installed the new mm/memory.c (and
the related mm/thrash.c, which was introduced with -rc1). Going back
to the 2.6.8.1-bk2 version of mm/memory.c makes the problem go away
again, so I think I'm on the right track. One minor problem is that
mm/memory.c has since been modified again, at least in the mainline
kernel (I don't know about Fedora Core), so any possible fixes would
probably have to be forward-ported to the newest version.

Attached is a diff that should illustrate the relevant changes
between 2.6.8.1-bk2 and 2.6.9-rc1. I'm unable to spot the bug in
those files; I'll conveniently blame the current time of day (4am)
for my blindness. Let me know if you think something should be
changed. I'll be away for most of the weekend (stupid work trip), so
don't expect any replies from me in the next few days.

Comment 42 Anssi Johansson 2004-10-22 01:23:09 UTC
Oh yes, one more thing: commenting out the grab_swap_token() function
call in mm/memory.c also made the problems disappear, so there's
definitely something fishy going on in mm/thrash.c.
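
For clarity, the change amounts to roughly this against 2.6.9's
mm/memory.c, in the major-fault path of do_swap_page() (the
surrounding context is reconstructed from memory and may differ
slightly):

--- a/mm/memory.c
+++ b/mm/memory.c
@@ do_swap_page()
 		ret = VM_FAULT_MAJOR;
 		inc_page_state(pgmajfault);
-		grab_swap_token();
+		/* grab_swap_token(); */	/* disabled for testing */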

Comment 43 Anssi Johansson 2004-10-22 07:26:12 UTC
"Kernel build 640 should be enough for anybody"
  -- Fedora Core

.. but unfortunately it isn't enough, at least for me: the OOM
massacre continues despite upgrading to the latest available
development kernel, which is 640 at the moment.

On the other hand, plain vanilla 2.6.9-bk6 works when the
grab_swap_token() call is commented out.

Comment 44 Andrea Pasquinucci 2004-10-22 15:08:53 UTC
(I'm the original poster.) Bug tracking issue: I agree that there
seem to be two kinds of bugs here; they may or may not be related.
The one I experienced was the OOM kill with plenty of swap available
(just look at the report from 'free' in my original posting; that is
the normal situation of the machine, and 'free' gives similar numbers
on that machine almost always). I did not experience memory leaks
filling up the swap. On the other hand, when the OOM killing kicked
in there was a lot of disk activity, of which kind I cannot say.
Sorry again, I have no hardware to test on these days.

Comment 45 Anssi Johansson 2004-10-23 09:03:30 UTC
Based on my limited understanding of kernel internals, I'd say the
memory leaks won't get fixed when/if this OOM killing bug gets fixed,
so if you're suffering from memory leaks I'd suggest either finding
another memory leak bug in Bugzilla or submitting a new bug report.
The system freeze problem I mentioned in comment #34 is probably
related to bug #135312.

Comment 46 Sandip Bhattacharya 2004-10-24 18:35:12 UTC
It crashed on me again, again just after inserting the digital
camera. However, just before inserting the camera I had top running.
After inserting the camera, the hard disk light stayed red, and the
swap usage grew very fast from virtually 0 until all the swap was
exhausted. Then the machine froze. The usb.agent process remained at
the top of the display, slowly increasing in memory and CPU usage
until the end. The load average shot up, I think, to 30+.

The machine had been on for only about an hour, so this does not
depend on how long the machine has been up.

I am not sure whether this USB hotplug problem (which has the same
oom-killer murder trail) merits a separate bug report. I wonder if
others have similar problems using cameras and other kinds of USB
mass storage.

Whatever the case, this seems to be a VM problem. It is almost as if
killing a process doesn't free up VM space, and the process is
spawned again and killed again and so on (which is what happened to
the usb.agent program, as you can see in an earlier attachment I
added to this issue, and as can also be made out from the load
average).

Are there any VM patches in the Fedora kernel that have been added on
top of the vanilla source?
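
For reference, the applied patch list can be inspected from the
src.rpm; a sketch, assuming the default rpm build root (the spec file
name is from memory and may differ):

rpm -ivh kernel-2.6.8-1.521.src.rpm
grep '^Patch' /usr/src/redhat/SPECS/kernel-2.6.spec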
 
 

Comment 47 Rik van Riel 2004-10-25 16:06:28 UTC
Comment #42 is a big hint.  Thank you Anssi, I'll brew up a test patch
soon.

Comment 48 Rik van Riel 2004-10-25 23:41:49 UTC
OK, I just posted a patch to the linux kernel mailing list.  It
appears to help my (simple) test case:

http://lkml.org/lkml/2004/10/25/357

Comment 49 Anssi Johansson 2004-10-26 21:16:28 UTC
Hi, I applied your patch to 2.6.10-rc1 and the OOM kills are gone now
:) Of course it still kills when it has no other choice, but now the
kernel knows to swap as needed.

However, during my testing I managed to get a stack trace from
kswapd, something along the lines of "page allocation failure". In my
test I consumed all the memory and swap space until Linux started
killing processes. First it hit mysqld a couple of times, then I got
that kswapd error message. Unfortunately the message scrolled past
the screen so quickly that I couldn't make a screen capture or
anything, and there are no traces of it in /var/log/messages,
probably because vanilla kernels don't seem to coexist nicely with
FC3's SELinux stuff. This kswapd error message might be a separate
issue (or actually a non-issue), but I'm reporting it here just for
completeness.

Comment 50 Dave Jones 2004-10-29 20:10:43 UTC
*** Bug 137618 has been marked as a duplicate of this bug. ***

Comment 51 Ian Ashton-Reader 2004-11-23 18:07:08 UTC
Created attachment 107318 [details]
/var/log/messages snip from (uname -a | cut -d ' ' -f 3-) 2.6.9-1.3_FC2 #1 Mon Nov 15 14:46:43 EST 2004 i686 i686 i386 GNU/Linux

I think the attached simply duplicates previous evidence.

Comment 52 Iuri Gomes Diniz 2005-02-23 09:52:47 UTC
Is this bug resolved without Riel's patch in kernel 2.6.10?

I see the same bug on a Red Hat 9 system with a static 2.6.8 kernel
from kernel.org.

Comment 53 Dave Miller 2005-02-26 23:49:16 UTC
We just ran into this on an RHEL 4 box with kernel-smp-2.6.9-5.0.3.EL.

The OOM killer was firing like crazy with no swap in use and 2.8 GB
of physical RAM free.

I have no additional evidence to offer beyond what's already here (it
all looks the same). I rebooted with the kernel set to use the
deadline scheduler; I'll give it a few days to see if it does any
better.

Comment 54 Dave Miller 2005-03-01 10:07:43 UTC
It lasted 2 days and 12 hours before it started doing it again. So
the deadline scheduler didn't help us either, as with one other
report on this bug.

Comment 55 Warren Togami 2005-05-23 03:23:52 UTC
Is this still an issue with the latest RHEL4, FC3, or FC4 kernel?


Comment 56 Sandip Bhattacharya 2005-05-24 02:40:51 UTC
It hasn't happened to me for quite a while. :)


Comment 57 Sandip Bhattacharya 2005-05-24 03:01:38 UTC
Sorry, I forgot to mention that I am using FC3 with 2.6.11-1.14_FC3.

Comment 58 Juliano A. B. Gonçalves 2005-06-08 18:17:02 UTC
I'm also using 2.6.11-1.14_FC3, with 512MB, and it has happened to me!

May 19 14:54:56 asterix kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May 19 14:54:56 asterix kernel: lowmem_reserve[]: 0 0 0
May 19 14:54:57 asterix kernel: DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2068kB
May 19 14:54:57 asterix kernel: Normal: 156*4kB 3*8kB 2*16kB 3*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3400kB
May 19 14:54:58 asterix kernel: HighMem: empty
May 19 14:54:58 asterix kernel: Swap cache: add 370744, delete 370504, find 112200/125273, race 0+0
May 19 14:54:58 asterix kernel: Free swap  = 0kB
May 19 14:54:59 asterix kernel: Total swap = 1052216kB
May 19 14:54:59 asterix kernel: Free swap:            0kB
May 19 14:54:59 asterix kernel: 131068 pages of RAM
May 19 14:54:59 asterix kernel: 0 pages of HIGHMEM
May 19 14:54:59 asterix kernel: 2350 reserved pages
May 19 14:54:59 asterix kernel: 247152 pages shared
May 19 14:54:59 asterix kernel: 240 pages swap cached
May 19 14:54:59 asterix kernel: Out of Memory: Killed process 18683 (firefox-bin).

Comment 59 Jukka Lehtonen 2005-06-09 10:59:18 UTC
To add my two cents, although it may be irrelevant:

Dual Pentium 3 system, 512MB RAM, two Mylex SCSI RAID5 cards, PATA
system disk. This worked "just fine" for several months as a headless
FC2 file server: LDAP, NFSv3, CUPS, Samba.

It was running 2.6.8-1.521smp; after an update to
kernel-smp-2.6.10-1.771_FC2 the LTO tape writes slowed unacceptably,
so we went back to the 'good' kernel:
# uname -a
2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux

Since then we have hit the oom-killer twice, first on May 28-29 and a
second time yesterday (June 9). We even had one maintenance reboot
between those dates.

It used to work, so I wonder. After yesterday's reboot I got console
messages about the ip_conntrack table being full and packets being
dropped. Simultaneously, the autofs on at least one NFS client (all
clients are FC3) was attempting to mount a non-existent directory (an
autofs wildcard), of course failing, retrying from a different port,
running out of port numbers, etc. A similar thing had occurred before
both oom-killer incidents.

Apparently mount now defaults to tcp, while it used to use udp.
OpenLDAP also uses tcp, with several connections from each client.
The number of tcp connections does explain the 'out of ports'
problem.
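
If the tcp default turns out to be the trigger, forcing udp again
would be a possible workaround; a sketch against our setup (the map
entry and server name are examples, not copied from our actual maps):

# /etc/auto.home wildcard entry on the clients, with udp forced
*	-udp	fs:/NFS/home/&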

Furthermore, autofs (with a 60-second timeout) seems needlessly active:
<--snip-->
Jun  9 12:07:14 fs rpc.mountd: authenticated unmount request from client:603 for /NFS/home/foo (/NFS)
Jun  9 12:08:15 fs rpc.mountd: authenticated unmount request from client:621 for /NFS/home/bar (/NFS)
Jun  9 12:08:15 fs rpc.mountd: authenticated mount request from client:623 for /NFS/home/bar (/NFS)
Jun  9 12:08:15 fs rpc.mountd: authenticated mount request from client:625 for /NFS/home/foo (/NFS)
Jun  9 12:09:30 fs rpc.mountd: authenticated unmount request from client:651 for /NFS/home/foo (/NFS)
Jun  9 12:10:30 fs rpc.mountd: authenticated unmount request from client:670 for /NFS/home/bar (/NFS)
Jun  9 12:10:30 fs rpc.mountd: authenticated mount request from client:672 for /NFS/home/bar (/NFS)
Jun  9 12:10:30 fs rpc.mountd: authenticated mount request from client:674 for /NFS/home/foo (/NFS)
Jun  9 12:11:45 fs rpc.mountd: authenticated unmount request from client:700 for /NFS/home/foo (/NFS)
Jun  9 12:11:45 fs rpc.mountd: authenticated mount request from client:702 for /NFS/home/foo (/NFS)
Jun  9 12:12:45 fs rpc.mountd: authenticated unmount request from client:724 for /NFS/home/foo (/NFS)
Jun  9 12:12:45 fs rpc.mountd: authenticated mount request from client:726 for /NFS/home/foo (/NFS)
Jun  9 12:13:45 fs rpc.mountd: authenticated unmount request from client:748 for /NFS/home/foo (/NFS)
Jun  9 12:13:45 fs rpc.mountd: authenticated mount request from client:751 for /NFS/home/foo (/NFS)
<--/snip-->

Users 'foo' and 'bar' were not logged into the client, and 'fstat
-mv' did not list any users of those mounts. The automounter seems to
have trouble letting unused mounts go.

Now, my question is: could the frequently repeated umount/mount from
the most recent version of the FC3 NFS client utilities somehow
trigger a leak within 2.6.8-1.521smp's NFSv3?