Bug 131251
Description
Andrea Pasquinucci
2004-08-30 14:39:37 UTC
Similar things happen on FC2/x86_64 with 1GB RAM: http://www.redhat.com/archives/fedora-list/2004-September/msg02048.html

Here are some numbers from the posting above that show that almost all memory is consumed in non-userland parts. The system is a Dual Opteron with one processor only (Tyan S2880, no SATA/SCSI used).

# free
 total used free shared buffers cached
Mem: 1027016 1022600 4416 0 992 7288
-/+ buffers/cache: 1014320 12696
Swap: 2047992 4496 2043496

# vmstat -a
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r b swpd free inact active si so bi bo in cs us sy id wa
 0 0 4496 4352 4548 6556 1 1 399 80 1517 162 2 2 88 8

# cat /proc/meminfo
MemTotal: 1027016 kB
MemFree: 4352 kB
Buffers: 1008 kB
Cached: 7316 kB
SwapCached: 1148 kB
Active: 6528 kB
Inactive: 4536 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 1027016 kB
LowFree: 4352 kB
SwapTotal: 2047992 kB
SwapFree: 2043496 kB
Dirty: 236 kB
Writeback: 0 kB
Mapped: 5296 kB
Slab: 14388 kB
Committed_AS: 535496 kB
PageTables: 494900 kB
VmallocTotal: 536870911 kB
VmallocUsed: 1568 kB
VmallocChunk: 536869323 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

# ps uaxwwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 3472 428 ? S Sep12 0:01 init [3]
root 2 0.0 0.0 0 0 ? SWN Sep12 0:00 [ksoftirqd/0]
root 3 0.0 0.0 0 0 ? SW< Sep12 0:00 [events/0]
root 4 0.0 0.0 0 0 ? SW< Sep12 0:00 \_ [khelper]
root 5 0.0 0.0 0 0 ? SW< Sep12 0:00 \_ [kacpid]
root 30 0.0 0.0 0 0 ? SW< Sep12 0:00 \_ [kblockd/0]
root 44 0.0 0.0 0 0 ? SW Sep12 0:00 \_ [pdflush]
root 45 0.0 0.0 0 0 ? SW Sep12 0:02 \_ [pdflush]
root 47 0.0 0.0 0 0 ? SW< Sep12 0:00 \_ [aio/0]
root 186 0.0 0.0 0 0 ? SW< Sep12 0:00 \_ [ata/0]
root 31 0.0 0.0 0 0 ? SW Sep12 0:00 [khubd]
root 46 0.0 0.0 0 0 ? SW Sep12 0:01 [kswapd0]
root 151 0.0 0.0 0 0 ? SW Sep12 0:00 [kseriod]
root 188 0.0 0.0 0 0 ? SW Sep12 0:00 [scsi_eh_0]
root 189 0.0 0.0 0 0 ? SW Sep12 0:00 [scsi_eh_1]
root 204 0.0 0.0 0 0 ? SW Sep12 0:00 [kjournald]
root 339 0.0 0.0 2336 216 ? S< Sep12 0:00 udevd
root 896 0.0 0.0 0 0 ? SW Sep12 0:00 [kjournald]
root 897 0.0 0.0 0 0 ? SW Sep12 0:00 [kjournald]
root 898 0.0 0.0 0 0 ? SW Sep12 0:00 [kjournald]
root 899 0.0 0.0 0 0 ? SW Sep12 0:00 [kjournald]
root 1637 0.0 0.0 0 0 ? SW< Sep12 0:00 [krfcommd]
root 1946 0.0 0.0 18104 748 ? S Sep12 0:00 /usr/sbin/sshd
root 5189 0.0 0.1 37540 1056 ? S 02:04 0:00 \_ sshd: root pts/0
root 5195 0.0 0.0 45656 1020 pts/0 S 02:04 0:00 |   \_ -bash
root 5255 0.0 0.1 104764 1892 pts/0 S 02:04 0:00 |       \_ gkrellm
root 29075 0.0 0.0 44836 500 pts/0 S 02:38 0:00 |           \_ sleep 10
root 6119 0.0 0.0 37284 1020 ? S 02:19 0:00 \_ sshd: root pts/1
root 6133 0.0 0.1 45656 1120 pts/1 S 02:19 0:00 |   \_ -bash
root 29079 0.0 0.0 44476 924 pts/1 S 02:38 0:00 |       \_ /bin/sh ./memory.sh
root 29083 0.0 0.0 5228 784 pts/1 R 02:38 0:00 |           \_ ps uaxwwf
root 6193 0.0 0.0 37284 1020 ? S 02:20 0:00 \_ sshd: root pts/2
root 6212 0.0 0.1 45656 1136 pts/2 S 02:20 0:00 |   \_ -bash
root 29077 0.0 0.1 35936 1932 ? S 02:38 0:00 \_ sshd: bin [priv]
sshd 29078 0.0 0.1 19448 1120 ? S 02:38 0:00     \_ sshd: bin [net]
root 2542 0.0 0.0 2344 272 tty1 S Sep12 0:00 /sbin/mingetty tty1
root 2543 0.0 0.0 2344 272 tty2 S Sep12 0:00 /sbin/mingetty tty2
root 2544 0.0 0.0 2344 272 tty3 S Sep12 0:00 /sbin/mingetty tty3
root 2545 0.0 0.0 2344 276 tty4 S Sep12 0:00 /sbin/mingetty tty4
root 2546 0.0 0.0 2344 276 tty5 S Sep12 0:00 /sbin/mingetty tty5
root 2547 0.0 0.0 2344 276 tty6 S Sep12 0:00 /sbin/mingetty tty6

As a follow-up: this is unrelated to the other memory leak concerning SG_IO/bio_uncopy_user (bugs #132180 and #131414). The system in question is plain old IDE driven and has no SATA/SCSI/CD-ROM/USB devices attached. The memory leak occurs, for instance, while trying to rebuild the src.rpm of kernel-2.6.8-1.521 (without modifications), or while trying to build kernel modules for it (the lirc build, for instance, eats up the 1GB of memory already in the configure phase). This also happens with kernel-2.6.8-1.541.

A posting on lkml suggests that the CFQ scheduler may be the cause of the leak: http://lkml.org/lkml/2004/8/27/102 If this is the case, please test with the "elevator=deadline" or "elevator=as" boot options and report back.

It turns out that the bug I am seeing is x86_64 specific when running 32 bit applications on FC2/x86_64. I have therefore opened a new bugzilla entry at #132947.

Created attachment 104830 [details]
Snips of logs
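For anyone who wants to try the elevator= boot options suggested above, here is a minimal sketch (the kernel version and root= argument below are illustrative; adjust them to the installed FC2 kernel):

cat /proc/cmdline    # shows the options the running kernel was booted with
# To test another I/O scheduler, append elevator=deadline (or elevator=as) to
# the kernel line of the relevant entry in /boot/grub/grub.conf, e.g.:
#   kernel /vmlinuz-2.6.8-1.521 ro root=LABEL=/ elevator=deadline
# then reboot into that entry.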
We are observing the same bug on one of our production machines. It is an Intel
Pentium 4 with 1 GB of memory. Memory is simply consumed outside of userland.
"free" shows that nearly all the memory is used, while "ps xuawf" shows that
processes are using barely 15% of memory. In the end you can read in
"/var/log/messages" that the "oom-killer" is killing all the processes,
together with the X server. But that does not help much; a restart is needed.
We have observed the same behaviour on kernel 2.6.7 as well. We are now running a custom-built kernel 2.6.8.1 with a voluntary preemption patch, and the bug is still present. The kernel was running the whole time with the anticipatory I/O scheduler. We are trying the kernel parameter "elevator=deadline" at this moment.

The parameter "elevator=deadline" has no influence on the presence of the bug. Any idea what we can do?

I've experienced the same symptoms until recently on an old K6. I'm using a custom compile of 2.6.8-1.521 with gcc-3.3.3-7. I was able to resolve it by unsetting CONFIG_CC_OPTIMIZE_FOR_SIZE. This is the only change between my broken and my functional kernel. I don't know what other differences I might have between my kernel and the stock one, though.

We've experienced something like this with a 128MB Celeron system which backs up files over NFS to tape using tar. It has fallen over a couple of times with the OOM killer, even though sar shows that very little swap is consumed.

Is this issue solved if you try the 6XX rawhide kernels on FC2? It should work.

Even though this bug is filed against FC2, I'd like to note that something related is happening with FC3T3. I can file a separate bug about this if desired. From my point of view, memory gets used normally, but instead of swapping out some pages, the kernel starts killing innocent processes even though there's plenty of swap space available. The attachment oom-tiikeri.txt contains some additional information about the problem.

I'm running 2.6.8-1.624 on the x86_64 architecture, with 512MB of RAM and 4GB of swap. I loaded a 680MB SQL database dump in nano. So far things work fine; swap usage has grown to 573MB, which was expected. When I try finding a non-existent string in nano (ctrl-w and some random string) things start going bad. The kernel starts by killing mysqld and httpd, and then eventually kills nano itself. The swap space usage peaks at about 600MB, meaning that there was always some 3.4GB of free swap available. After nano gets killed the used memory and swap are freed properly, no problems there. The problem is that the kernel starts killing processes a bit too eagerly, instead of swapping things out to the swap area. I tried elevator=as and elevator=deadline, but they didn't help at all. The good(?) news is that the behaviour can be reproduced with 100% certainty.

My FC2 computer (2.6.8-1.521, AMD Duron) behaves similarly, except that the first victim is nano itself, probably because there are no other memory-hungry processes running on that box. Swap space usage peaked at about 650MB, out of a total of 3GB.

Created attachment 105388 [details]
Some logs from my OOM experiences
free, /proc/meminfo, /proc/slabinfo, vmstat, /var/log/messages entries
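A rough sketch of the nano reproduction described above, for anyone who wants to repeat it (the file name is illustrative; any file comparable in size to physical RAM should do):

# terminal 1: open a file that is large compared to RAM; once it has loaded,
# press Ctrl-W and search for a string that does not occur in the file
nano /tmp/huge-dump.sql
# terminal 2: watch memory and swap while the search runs; on the failing
# kernels the oom-killer fires even though most of the swap stays free
vmstat 2
free -m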
To Anssi: Are you running 32 bit apps on the x86_64 box? If yes, then this could be bug #132947, another kernel memory leak bug. Does downgrading to the 2.6.7 kernel rpms help?

Axel: No, the binaries are all 64-bit, at least according to 'file'. The installation is pretty much a basic FC3T3 (the x86_64 variant) without X, with the addition of MySQL4 which was downloaded from the MySQL website. MySQL is also the 64-bit version. Doing the same experiment without mysql running gives similar results, i.e. processes (such as httpd) get killed and then eventually nano, so I can't really blame any external software for these problems.

Hmm, this is interesting. I downloaded 2.6.7-1.494.2.2 as instructed (from FC2 x86_64 updates) and installed it. I had to disable the Via Velocity Gbit ethernet module to keep it from crashing during boot, though. However, once the 2.6.7 kernel was running it worked beautifully in this respect. I loaded 4 copies of nano (each had a memory footprint of about 1 gigabyte) and searched for strings simultaneously with all of them. No problems whatsoever. Loading the fifth nano pushed the memory usage beyond the available 4GB swap + 512M RAM, after which mysqld was killed, then httpd and finally nano. This was the expected behaviour. Looks like something has definitely broken between 2.6.7-1.494.2.2 and 2.6.8-1.624. Do you have suggestions for specific kernel versions that I should try, to pinpoint the exact version which started causing problems?

I have found the memory leak to appear between 2.6.7-1.494.2.2 and 2.6.8-1.521, but only for mixed 64/32 bit (bug #132947). It could be the same bug, though triggered by different conditions. If you find the bug to occur between these two versions, I would suggest trying a vanilla kernel (2.6.8 with a configuration derived from 2.6.8-1.521).

After a few tests (I reported this bug at the end of August) I shifted to a vanilla kernel, since I could not run servers with this problem. I have absolutely no problem with a vanilla kernel. As someone has indicated, I think that the bug has something to do with CONFIG_CC_OPTIMIZE_FOR_SIZE, which seems to be set in the Fedora kernels but is not in the vanilla ones (at least the ones I run).

Our vanilla kernel 2.6.8.1 was still broken even with CONFIG_CC_OPTIMIZE_FOR_SIZE not being set. After we switched to the Red Hat stock kernel 2.6.8-1.521 the memory leak disappeared.

I got a similar problem in the middle of writing an audio CD. I must admit, because of the recent cdrecord SCSI permission problem in 2.6.8, I was running cdrecord as root. Based on a similar report quite some time back, Andrew Morton on the lkml list had said [1] that it was an untraceable problem occurring when you burn audio CDs. Debian seems to have released [2] a version of the kernel which fixes the problem. The problem still exists in the current FC2 kernel (2.6.8-1.521). Are there any fixes forthcoming for the FC2 kernel?
[1] http://lkml.org/lkml/2004/8/7/6
[2] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=267464

Created attachment 105473 [details]
Vanilla 2.6.8.1 kernel .config based on .config from FC kernel 624
If this is any help: I downloaded 2.6.8.1, installed it, copied the .config
from Fedora Core kernel build 624 and ran "make oldconfig". The resulting
.config file is attached. After the new kernel was compiled and installed, I
rebooted and did the usual tests. The tests passed fine. Later on I noticed
that the CONFIG_CC_OPTIMIZE_FOR_SPEED option had been turned off during "make
oldconfig". I changed the Makefile so that it'd pass the option -O2 always
instead of -Os, cleaned up the previous build and recompiled the kernel. The
tests passed nicely again, regardless of this configuration change. I'll try to
compare the differences between vanilla kernels and FC kernels later on.
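For reference, a rough outline of the rebuild described in the previous comment (version numbers and the config file name are illustrative; they depend on the vanilla release and FC kernel being tested):

cd /usr/src
tar xjf linux-2.6.8.1.tar.bz2
cd linux-2.6.8.1
cp /boot/config-2.6.8-1.624 .config     # start from the FC kernel's config
make oldconfig                          # answer the prompts for new options
grep CC_OPTIMIZE_FOR_SIZE .config       # check whether -Os or -O2 will be used
make bzImage modules
make modules_install install            # then reboot into the new kernel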
2.6.8.1 vanilla is significantly older than 624. Please try the same test on 2.6.9.

Created attachment 105495 [details]
Vanilla 2.6.9 kernel .config based on .config from FC kernel 624

Typo: in comment #22 I should have written CONFIG_CC_OPTIMIZE_FOR_SIZE instead of CONFIG_CC_OPTIMIZE_FOR_SPEED. Vanilla 2.6.9 with the attached .config (624 + "make oldconfig") suffers from the premature OOM kills as well.

Let us summarize. In comment #10 Toshio indicates that unsetting CONFIG_CC_OPTIMIZE_FOR_SIZE prevents this trouble. Comment #22 and comment #24 indicate that CONFIG_CC_OPTIMIZE_FOR_SIZE makes no difference and that the upstream vanilla kernel works fine, while RH's patched kernel does not. Axel Thimm indicates that he sees this on x86_64. What archs are the other reporters seeing this on? http://people.redhat.com/wtogami/temp/kernel-2.6.9-1.639.src.rpm Can anybody confirm comment #10 with a rebuilt 639 kernel? I would do it myself, but I totally cannot reproduce this behavior.

> Axel Thimm indicates that he sees this on x86_64. What archs are the
> other reporters seeing this?
x86_32 for me.
i586 or i686?

i686:
[root@pluto root]# uname -a
Linux pluto.home 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686 athlon i386 GNU/Linux

In reply to comment #25:
> Axel Thimm indicates that he sees this on x86_64.
The bug I see is of a different nature and x86_64 specific (ia32 emulation within x86_64, see comment #6 and bug #132947). It is possible that the two bugs are related and triggered by different subsystems, but for now I would deal with them as separate entities, so I would remove my datapoint and also ignore the posted vmstats. OTOH Anssi in comment #16 has confirmed pure x86_64 to cause his problems.

My main test rig is indeed an x86_64 computer, an AMD Athlon 64 3500+ on an Abit AV8 motherboard. This system is running FC3T3, but I've upgraded to the latest packages found in the development repository. In comment #13 (the last paragraph) I mentioned that I'm able to reproduce the problem on an FC2 computer with a 32-bit AMD Duron processor (Linux karhu.aj 2.6.8-1.521 #1 Mon Aug 16 09:01:18 EDT 2004 i686 athlon i386 GNU/Linux). A clarification for my comment #24: I didn't test 2.6.9 (yet) with CONFIG_CC_OPTIMIZE_FOR_SIZE enabled. I'll test that option, and the 639 version, once I get back home from work.

OK great. We are in agreement that there is a problem on multiple archs; we now only need to isolate exactly what causes it.

In reply to comment #25:
> upstream vanilla kernel works fine, while RH's patched kernel does not.
2.6.8.1 works fine but 2.6.9 does not, see comment #24. I guess this means I'll have to test some 2.6.9 release candidates as well.

All, apologies for the long post, I have a lot of info here. I have 3 systems with this issue:

I am using FC2 on two Athlon 2200 machines with 1 GB RAM. I see this bug crop up during continuous use of an IDE RAID controller (IT8212 - integrated RAID on Giga-byte motherboards). One is a production system using zonemonitor to record video from security cameras; the second is a test / devel box. The leak progresses very slowly, and takes about 3 weeks for the oom killer to kick in on the production box. The test system has identical hardware, except the drives attached to the RAID controller are rarely used, hence this machine did not display this behaviour.

I also have an HP LH3000r, dual PIII 866, 2 GB RAM, FC2, 2.6.8-1.521, running on the integrated NetRAID controller (megaraid.ko). On this system not all physical memory is being used, yet over the last few weeks since installing 2.6.8-1.521 the swap usage is also creeping upward (currently at 25MB).

I wrote a script to simply copy a bunch of files between two folders on drives attached to the RAID controller on the test box. In 24 hours, my swap file usage went from 0 to 550 MB using 2.6.8-1.521. In 48 hours, the page file was full and the oom killer kicked in. I reverted to 2.6.5-1.358 and have observed an improvement, but after 24 hours the page file is at 1 MB. The only thing the test machine is doing is copying those files back and forth, no apache, X or other services. I have noticed that the harder the filesystem is being used, the faster the memory fills up. If you completely stop using the filesystem mounted from a disk using the scsi generic modules, the pagefile clears out. It seems that the more heavily loaded the system is, the more prevalent the bug. I am currently compiling a vanilla 2.6.8 kernel right now (without using the FC2 config). I will also try using the FC2 config. (I only have one machine to test with, and the results take up to 48 hours.) I'll try anything anyone wants here, and will also provide output and system info as requested. robert [dot] toole [at] keuhne [dash] nagel [dot] com

I did some further testing on 2.6.9. First of all, the -Os and -O2 parameters (CONFIG_CC_OPTIMIZE_FOR_SIZE) do not seem to have an effect on this problem. I also tried the elevator=deadline option for both kernels, one with -Os and the other with -O2. The elevator option didn't seem to have an effect either, so I think I'll concentrate on finding the exact kernel version which broke the memory management for me. I know it's somewhere between 2.6.8.1 and 2.6.9.

Now that I finally got my memory sticks I tried upgrading the memory to 4GB. And boy did I open another can of worms by doing that. After booting up I ran "free" and saw that I had 4GB of usable memory in Linux. I loaded the SQL dump file with nano and did the usual text search trick. It worked fine, but that's probably because the system had enough RAM this time (one text editor instance uses only about 1GB of memory). I loaded another copy in another session, and it loaded nicely as well. I started loading the third file when the system seemed to get stuck; nano just stopped at "Reading File". On another channel I tried running "free" to see if the system was already swapping at this point (it shouldn't be). Surprise surprise, the "free" command itself got stuck: it refused to output anything and also refused to return to the command prompt. Running "ls" and "vmstat 2" on other channels resulted in the same, nothing was printed on the screen and those commands didn't return to the command prompt. The editors on channels 1 and 2 were responsive and worked, until I tried to move to the next page. This move made both of them get stuck. This happened with 2.6.9 and elevator=as. I've now gone back to 512M until the OOM problems get solved. Next up, results from the 639 build and from some 2.6.9 release candidates.

Sorry, the 639 build didn't help in my situation.

Created attachment 105549 [details]
syslog just before the crash, after inserting a USB mass storage device (digital camera)
Created attachment 105550 [details]
syslog during the reboot after the crash. Note that the USB mass storage device (digital camera) is still plugged in
This and the last attachment (105549) are the logs after my computer crashed
again due to that crazy oom-killer :). This time all I did was plug my
Olympus-C120 into the USB slot. The computer froze with the HDD light
continuously red. I rebooted.
The camera was still attached to the computer while rebooting.
The weird thing is that, as you can see in the post-boot logs, even during the
bootup process the kernel oom-killer kept on killing. You can see that
it is the usb.agent process that kept being killed. Of course I couldn't see
all this happening, because it kept happening in the background. After I logged
in, I could simply mount the camera (remember that it was still plugged in)
and use it normally. So the whole oom-killer massacre didn't actually prevent me
from using the device the second time.
To say the least, all this terribly upsets me because this is a workstation. I
can't keep on rebooting and testing. Is it advisable to downgrade to an earlier
kernel?
Quoting myself from comment #34:
> I think I'll concentrate on finding the exact kernel version which
> broke the memory management for me. I know it's somewhere between
> 2.6.8.1 and 2.6.9.
A couple of patches and kernel recompiles later I was able to determine that the breakage happened between 2.6.8.1-bk2 and 2.6.9-rc1, on August 23 or thereabouts. Given enough time I'll eventually figure this out by myself, but I'd appreciate it if someone more familiar with kernel development would take a look at what happened between those releases.

Sorry guys that I cannot contribute more to this even though the original post is mine (I have no hardware on which I can test as of today). Anyway, from my experience at the time of my posting, disk activity (i.e. filesystem activity) could have something to do with this. When the OOM kill kicked in, the disk LED was constantly on and the disks were spinning loudly. At that time I thought it was swap activity, but it could very well have been filesystem activity, since there were programs doing something on the filesystem, like printing large files or up2date. My arch is x86_32 i686, both Intel and AMD.

Just did 2.6.9 vanilla with the FC2 config: after 24 hours of continuous heavy disk usage it was much better than 2.6.8-1.521, but swap use grew to 300 MB in that time. Arch is x86_32 i686, Intel and AMD. I am going to roll my own config for 2.6.9 and try again. I agree that the heavier the disk usage, the worse the problem.

Created attachment 105621 [details]
Some of the differences between 2.6.8.1-bk2 and 2.6.9-rc1

Bug tracking issue: looks like there are at least two kinds of bugs being discussed here. One seems to be a memory leak issue, and the one I've been writing about is the odd habit of the kernel to OOM kill processes even though there's plenty of swap space available. I haven't seen any memory leaks (that I know of) during my testing, so I can't comment on those. Unfortunately the original bug report is a bit unclear on which problem the reporter has been experiencing, so it looks like we're going to be discussing both problems here until someone makes some administrative cleanup decisions.

On the debugging front I'm going to mention that I've just tried 2.6.9-bk6, but it didn't help with my problem; the kernel still happily OOM killed my text editor even though there was more than 3GB of available swap space. bk{3,4,5} didn't even compile and bk2 didn't help either. Haven't tried bk1, but I suppose it'd be useless.

As I mentioned earlier in comment #38, I was able to determine that the breakage happened between 2.6.8.1-bk2 and 2.6.9-rc1. I've been comparing the differences between those versions. I started with 2.6.8.1-bk2 and integrated some of the files from the 2.6.9-rc1 mm directory, one by one, taking care of the dependencies when the compiler complained about them. The end result is that I started seeing those OOM problems once I installed the new mm/memory.c (and the related mm/thrash.c, which was introduced with -rc1). Going back to the 2.6.8.1-bk2 version of mm/memory.c makes the problem go away again, so I think I'm on the right track. One minor problem is that mm/memory.c has since been modified again, at least in the standard kernel (dunno about Fedora Core), so any possible fixes would probably have to be forward-ported to the newest version. Attached is a diff file that should illustrate the relevant changes between 2.6.8.1-bk2 and 2.6.9-rc1. I'm unable to spot the bug in those files; I think I'll conveniently blame the current time of day (4am) for my blindness. Let me know if you think something should be changed. I'll be away for most of the weekend (stupid work trip), so don't expect any replies from me in the next few days. Oh yes, one more thing: commenting out the grab_swap_token() function call in mm/memory.c also made the problems disappear, so there's definitely something fishy going on in mm/thrash.c.

"Kernel build 640 should be enough for anybody" -- Fedora Core. But unfortunately it isn't enough, at least for me; the OOM massacre continues despite upgrading to the latest available development kernel, which is 640 at the moment. On the other hand, plain vanilla 2.6.9-bk6 works when the grab_swap_token function call is commented out.

(I'm the original poster.) Bug tracking issue: I agree it seems that there are two kinds of bugs here; they may or may not be related. The one I experienced was the OOM kill with plenty of swap available (just look at the report from 'free' in my original posting, that is the normal situation of the machine; 'free' gives similar numbers on that machine almost always). I did not experience memory leaks filling up the swap. On the other hand, when OOM killing kicked in there was a lot of disk activity, of which kind I cannot say. Sorry again, I have no hardware to test on these days.

Based on my limited understanding of kernel internals I'd say the memory leaks won't get fixed when/if this OOM killing bug gets fixed, so if you're suffering from memory leaks I'd suggest either finding another memory leak bug in Bugzilla or submitting a new bug report. The system freeze problem I mentioned in comment 34 is probably related to bug #135312.

It crashed again on me, again just after inserting the digital camera. However, just before inserting the camera I had top running. After inserting the camera, the hard disk light stayed red and the swap usage grew very fast from virtually 0 until all the swap was exhausted. And then the machine froze. The usb.agent process remained at the top of the display, slowly increasing in memory and CPU usage till the end. The load average shot up, I think, to 30+. The machine had been on for only about an hour, so this is not dependent on how long the machine has been up. I am not sure whether this USB hotplug problem (which has the same oom-killer murder trail) merits a separate bug report. I wonder if others have similar problems using cameras and similar USB mass storage devices. Whatever the case, this seems to be a VM problem. It is almost as if killing a process doesn't free up VM space, and the process is spawned again and killed again and so on (which happened to the usb.agent program, as you can see in an earlier attachment that I added to this issue, and as can also be made out from the load average). Are there any VM patches in the Fedora kernel which have been added on top of the vanilla source?

Comment #42 is a big hint. Thank you Anssi, I'll brew up a test patch soon.

OK, I just posted a patch to the linux kernel mailing list. It appears to help my (simple) test case: http://lkml.org/lkml/2004/10/25/357

Hi, I applied your patch to 2.6.10-rc1 and the OOM kills are gone now :) Of course it still kills when it has no other choice, but now the kernel understands to swap as needed. However, during my testing I managed to get a stack trace from kswapd, something along the lines of "page allocation failure". In my test I consumed all the memory and swap space until Linux started killing processes. First it hit mysqld a couple of times, then I got that kswapd error message. Unfortunately the message scrolled past the screen so quickly that I couldn't make a screen capture or anything, and there are no traces of it in /var/log/messages, probably because vanilla kernels don't seem to coexist nicely with FC3's SELinux stuff. This kswapd error message might be a separate issue (or actually a non-issue), but I'm reporting it here just for completeness.

*** Bug 137618 has been marked as a duplicate of this bug. ***

Created attachment 107318 [details]
/var/log/messages snip from (uname -a | cut -d ' ' -f 3-) 2.6.9-1.3_FC2 #1 Mon Nov 15 14:46:43 EST 2004 i686 i686 i386 GNU/Linux
I think the attached simply duplicates previous evidence.
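As an aside, the swap-token experiment from comment #42 (commenting out the grab_swap_token() call) can be repeated on a vanilla tree along these lines; this is illustrative only and is not the patch that was later posted to lkml:

cd /usr/src/linux-2.6.9
grep -n grab_swap_token mm/memory.c mm/thrash.c   # locate the call site and the definition
# edit mm/memory.c and wrap the grab_swap_token() call in /* ... */ comments,
# then rebuild, install and reboot:
make bzImage modules
make modules_install install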
Is this bug resolved without Riel's patch in kernel 2.6.10? I have the same bug on a Red Hat 9 system with a static kernel 2.6.8 from kernel.org.

We just ran into this on an RHEL 4 box with kernel-smp-2.6.9-5.0.3.EL. The OOM kill was firing like crazy with no swap in use and 2.8 GB of physical RAM free. I have no additional evidence to offer beyond what's already here (it all looks the same). I rebooted with the kernel set to use the deadline scheduler; I'll give it a few days to see if it does any better.

It lasted 2 days 12 hours and started doing it again. So the deadline scheduler didn't help us either, as with one other report on this bug.

Is this still an issue with the latest RHEL4, FC3, or FC4 kernel?

Hasn't happened to me for quite a while. :) Sorry, I forgot to mention that I am using FC3 - 2.6.11-1.14_FC3.

I'm also using 2.6.11-1.14_FC3 with 512MB and it has happened to me!

May 19 14:54:56 asterix kernel: HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May 19 14:54:56 asterix kernel: lowmem_reserve[]: 0 0 0
May 19 14:54:57 asterix kernel: DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2068kB
May 19 14:54:57 asterix kernel: Normal: 156*4kB 3*8kB 2*16kB 3*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3400kB
May 19 14:54:58 asterix kernel: HighMem: empty
May 19 14:54:58 asterix kernel: Swap cache: add 370744, delete 370504, find 112200/125273, race 0+0
May 19 14:54:58 asterix kernel: Free swap = 0kB
May 19 14:54:59 asterix kernel: Total swap = 1052216kB
May 19 14:54:59 asterix kernel: Free swap: 0kB
May 19 14:54:59 asterix kernel: 131068 pages of RAM
May 19 14:54:59 asterix kernel: 0 pages of HIGHMEM
May 19 14:54:59 asterix kernel: 2350 reserved pages
May 19 14:54:59 asterix kernel: 247152 pages shared
May 19 14:54:59 asterix kernel: 240 pages swap cached
May 19 14:54:59 asterix kernel: Out of Memory: Killed process 18683 (firefox-bin).

To add my penny, although it may be irrelevant: dual Pentium 3 system, 512MB RAM, two Mylex SCSI RAID5 cards, PATA system hd. This used to work "just fine" for several months. A headless FC2 file server: LDAP, NFSv3, CUPS, Samba. It was running 2.6.8-1.521smp; after the update to kernel-smp-2.6.10-1.771_FC2 the LTO tape writes slowed unacceptably, and we resumed the 'good' kernel:

# uname -a
2.6.8-1.521smp #1 SMP Mon Aug 16 09:25:06 EDT 2004 i686 i686 i386 GNU/Linux

After that we have got the 'oom-killer' twice, first on 28.-29.5. and a second time yesterday (9.6.). We even had one maintenance reboot between those dates. It used to work, so I wonder. After yesterday's reboot I got console messages about the ip_conntrack table being full and dropping packets. Simultaneously, at least one NFS client's (all FC3) autofs was attempting to mount a non-existent directory (autofs wildcard), of course failing, retrying from a different port, running out of port numbers, etc. A similar thing had occurred before both oom-killer incidents. Apparently the mount now defaults to tcp, while it used to use udp. OpenLDAP also uses tcp, with several connections from each client. The number of tcp connections does explain the 'out of ports' problem.

Furthermore, autofs (with a 60 second timeout) seems needlessly active:
<--snip-->
Jun 9 12:07:14 fs rpc.mountd: authenticated unmount request from client:603 for /NFS/home/foo (/NFS)
Jun 9 12:08:15 fs rpc.mountd: authenticated unmount request from client:621 for /NFS/home/bar (/NFS)
Jun 9 12:08:15 fs rpc.mountd: authenticated mount request from client:623 for /NFS/home/bar (/NFS)
Jun 9 12:08:15 fs rpc.mountd: authenticated mount request from client:625 for /NFS/home/foo (/NFS)
Jun 9 12:09:30 fs rpc.mountd: authenticated unmount request from client:651 for /NFS/home/foo (/NFS)
Jun 9 12:10:30 fs rpc.mountd: authenticated unmount request from client:670 for /NFS/home/bar (/NFS)
Jun 9 12:10:30 fs rpc.mountd: authenticated mount request from client:672 for /NFS/home/bar (/NFS)
Jun 9 12:10:30 fs rpc.mountd: authenticated mount request from client:674 for /NFS/home/foo (/NFS)
Jun 9 12:11:45 fs rpc.mountd: authenticated unmount request from client:700 for /NFS/home/foo (/NFS)
Jun 9 12:11:45 fs rpc.mountd: authenticated mount request from client:702 for /NFS/home/foo (/NFS)
Jun 9 12:12:45 fs rpc.mountd: authenticated unmount request from client:724 for /NFS/home/foo (/NFS)
Jun 9 12:12:45 fs rpc.mountd: authenticated mount request from client:726 for /NFS/home/foo (/NFS)
Jun 9 12:13:45 fs rpc.mountd: authenticated unmount request from client:748 for /NFS/home/foo (/NFS)
Jun 9 12:13:45 fs rpc.mountd: authenticated mount request from client:751 for /NFS/home/foo (/NFS)
<--/snip-->
Users 'foo' and 'bar' were not logged into the 'client' and 'fstat -mv' did not list any users of those mounts. The automounter seems to have trouble letting unused mounts go. Now, my question is: could the frequently repeated umount/mount from the most recent version of the FC3 nfs-client utilities somehow invoke a leak within 2.6.8-1.521smp NFSv3?