Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 90116
Summary: | unusual performance degredation between RHL 8.0 and RHL 9 | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Chris Ricker <chris.ricker> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED RAWHIDE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 9 | CC: | chuck, djh, dkelson, jroyse, krmaxwell, mingo, petrosyan, pfrields, rdieter |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2003-10-21 20:08:08 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 100643 |
Description
Chris Ricker
2003-05-02 18:55:01 UTC
nosysinfo doesn't disable nptl; it just makes glibc use another thread library. Could it be that userspace just got bigger? On the "slow" machines, installing the the RHL8.0 kernel fixes the problem. In case it's not clear, all machines are running RHL 9. RHL 9 userspace + RHL 9 kernel is slow on "slow" boxes and fast on "fast" boxes. RHL 9 userspace + RHL 8.0 kernel is fast on both "slow" and "fast" boxes "slow" boxes and "fast" boxes are identical hardware and software wise (RHL9). The "slow" boxes are slow because their system calls are 4x-10x slower than the other boxes. In a room of 14 systems, about 50% are slow and %50 are fast. Installing the RHL8.0 2.4.18-14 kernel on the "slow" boxes makes their system run at a normal speed. Maybe we should have said "slow" and "normal" (instead of fast). please try the 2.4.20-9.5 kernel at http://people.redhat.com/arjanv/testkernels it has some tweaks for performance No change. # uname -a Linux station11.example.com 2.4.20-9.5 #1 Thu May 1 12:55:16 EDT 2003 i686 i686 i386 GNU/Linux # strace -c ls -al /etc > /dev/null execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 41.08 0.018893 65 290 lstat64 11.13 0.005118 205 25 read 8.93 0.004105 103 40 13 open 6.53 0.003002 131 23 munmap 5.91 0.002719 124 22 mmap2 5.38 0.002473 80 31 fcntl64 4.12 0.001894 105 18 readlink 3.63 0.001667 238 7 getdents64 3.32 0.001525 51 30 close 2.34 0.001078 98 11 old_mmap 1.95 0.000897 32 28 fstat64 1.42 0.000654 38 17 brk 1.04 0.000476 238 2 socket 1.01 0.000466 233 2 2 connect 0.90 0.000414 83 5 write 0.83 0.000380 127 3 3 ioctl 0.43 0.000196 196 1 uname 0.03 0.000014 5 3 2 rt_sigaction 0.02 0.000010 10 1 set_thread_area 0.01 0.000005 5 1 gettimeofday ------ ----------- ----------- --------- --------- ---------------- 100.00 0.045986 560 20 total Just to clarify further, the -8 and -9 kernels are OK on about 50% of the machines. On those "normal" machines, that third column doesn't have any values higher than 70. So this isn't a problem that manifests itself consistently. Quite a head scratcher since the machines are precisely identical hardware and software. BTW, the "ls -al /etc" is just a test case that quantifies the problem. Trying to use the machine for any work is nearly pointless due to the slowness. Installing the 2.4.18-14 kernel makes the "slow" machines run with normal speed. kernel-2.4.20-13.9.i686 doesn't solve the problem. system calls on "slow" boxes are still markedly slower Hmm, this is interesting kernel-BOOT-2.4.20-13.9.i386.rpm solves the problem. On both "slow" and "fast" systems, system calls operate at "fast" speeds (of course, the network cards don't work, but hey ;-) I recompiled kernel-2.4.20-13.9.src.rpm specifically for i386. The 2.4.20-13.9.i386.rpm kernel is slow on the "slow" machines and fast on the "fast" machines It appears that something that's different between kernel-BOOT and kernel (and not the i386 / i686 difference between kernel-BOOT-2.4.20-13.9.i386.rpm and kernel-2.4.20-13.9.i686.rpm) is what's breaking on these "slow" machines Try booting with "apm=off" ? and also maybe comparing the bios settings for power manglement ? Booting with apm=off resulted in no change. I believe the BIOS settings, including APM settings are identical between machines. We checked them closely when we opened the bug. I'm on the road this week, so I'm not there in person to check again. Just inspecting the differences in the strace -c output, the only system class that don't seem to be impacted are getttimeofday and rt_sigaction. To recap, RHL8.0 kernels don't exhibit the slowdown, RHL9 -boot kernels don't exhibit the slowdown. Some computers in the classroom have no slow down with RHL9 kernels. All computers are identical hardware wise. Arghhh. I meant "system call" not "system class". more data: Latest RHL9 errata kernel (still) has the slowdown Latest RHL8.0 errata kernel (kernel-2.4.20-20.8) has the slowdown Original RHL8.0 kernel (2.4.18-14) *NO SLOWDOWN* ftp.kernel.org vanilla 2.4.20 *NO SLOWDOWN* I compiled the vanilla 2.4.20 kernel using: /usr/src/linux-2.4.20-8/configs/kernel-2.4.20-i686.config stats from vanilla 2.4.20 kernel: [root@station9 root]# uname -r 2.4.20 [root@station9 root]# strace -c ls -al /etc > /dev/null execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 41.25 0.002895 10 289 lstat64 12.92 0.000907 36 25 read 12.61 0.000885 22 41 14 open 5.34 0.000375 21 18 readlink 5.27 0.000370 12 31 old_mmap 5.14 0.000361 52 7 getdents64 5.06 0.000355 15 23 munmap 2.82 0.000198 7 30 close 2.11 0.000148 5 28 fstat64 1.44 0.000101 3 31 fcntl64 1.44 0.000101 51 2 socket 1.42 0.000100 50 2 2 connect 1.27 0.000089 5 17 brk 0.80 0.000056 56 1 mmap2 0.57 0.000040 8 5 write 0.19 0.000013 4 3 2 rt_sigaction 0.17 0.000012 4 3 3 ioctl 0.13 0.000009 9 1 uname 0.06 0.000004 4 1 gettimeofday ------ ----------- ----------- --------- --------- ---------------- 100.00 0.007019 558 21 total So the slow down comes from one of the RH kernel patches? results from vanilla ftp.kernel.org 2.4.21 kernel *NO SLOWDOWN* [root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null 2.4.21 execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 39.13 0.002485 9 289 lstat64 14.52 0.000922 37 25 read 12.52 0.000795 19 41 14 open 5.84 0.000371 21 18 readlink 5.81 0.000369 53 7 getdents64 5.53 0.000351 11 31 old_mmap 4.94 0.000314 14 23 munmap 2.63 0.000167 6 30 close 1.97 0.000125 4 28 fstat64 1.75 0.000111 56 2 socket 1.56 0.000099 50 2 2 connect 1.34 0.000085 3 31 fcntl64 1.26 0.000080 5 17 brk 0.47 0.000030 6 5 write 0.20 0.000013 13 1 mmap2 0.17 0.000011 4 3 2 rt_sigaction 0.16 0.000010 3 3 3 ioctl 0.14 0.000009 9 1 uname 0.06 0.000004 4 1 gettimeofday ------ ----------- ----------- --------- --------- ---------------- 100.00 0.006351 558 21 total results from vanilla ftp.kernel.org 2.4.22 kernel *NO SLOWDOWN* [root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null 2.4.22 execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 43.11 0.003011 10 289 lstat64 12.76 0.000891 22 41 14 open 12.37 0.000864 35 25 read 5.33 0.000372 53 7 getdents64 5.18 0.000362 20 18 readlink 5.03 0.000351 11 31 old_mmap 4.42 0.000309 13 23 munmap 2.78 0.000194 6 30 close 2.12 0.000148 5 28 fstat64 1.56 0.000109 4 31 fcntl64 1.50 0.000105 53 2 2 connect 1.42 0.000099 50 2 socket 1.23 0.000086 5 17 brk 0.49 0.000034 7 5 write 0.19 0.000013 4 3 2 rt_sigaction 0.16 0.000011 4 3 3 ioctl 0.16 0.000011 11 1 mmap2 0.13 0.000009 9 1 uname 0.07 0.000005 5 1 gettimeofday ------ ----------- ----------- --------- --------- ---------------- 100.00 0.006984 558 21 total Latest RHL9 errata kernel *SLOWDOWN PRESENT* Take a look at that third column. [root@station9 root]# uname -r; strace -c ls -al /etc > /dev/null 2.4.20-20.9 execve("/bin/ls", ["ls", "-al", "/etc"], [/* 30 vars */]) = 0 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 43.57 0.019390 67 289 lstat64 10.88 0.004841 194 25 read 9.82 0.004369 109 40 13 open 5.84 0.002601 145 18 readlink 5.78 0.002574 112 23 munmap 4.51 0.002007 72 28 fstat64 4.17 0.001857 265 7 getdents64 3.12 0.001387 45 31 fcntl64 2.53 0.001124 37 30 close 2.42 0.001078 98 11 old_mmap 1.87 0.000834 49 17 brk 1.07 0.000475 238 2 2 connect 1.06 0.000473 237 2 socket 1.00 0.000446 20 22 mmap2 0.91 0.000405 81 5 write 0.52 0.000233 78 3 2 rt_sigaction 0.44 0.000197 66 3 3 ioctl 0.44 0.000197 197 1 uname 0.02 0.000011 11 1 set_thread_area 0.01 0.000003 3 1 gettimeofday ------ ----------- ----------- --------- --------- ---------------- 100.00 0.044502 559 20 total Culprit found!! While I was in the shower this morning I was pondering why the -boot kernel (which has the RH patches) had no slowdown while the normal does. I concluded that it was likely one of the kernel modules. Using trial and error I narrowed it down to the lm_sensors modules. The ftp.kernel.org kernels don't have lm_sensors (hence, no slowdown with them). We have the following in our rc.local: modprobe i2c-isa; modprobe w83781d This has worked fine ever since RH has included lm_sensors until the RH 2.4.20 based kernels. Without those loaded at boot time all the "slow" machines return to "normal" speed and the strace test confirms it. In testing I found that unloading lm_sensors via rmmod is *not sufficient* to return to the machine to normal speed. Once loaded (and even unloaded) the slowdown is in effect until a reboot. To reiterate, lm_sensors in RH 2.4.18 kernels OK, lm_sensors in RH 2.4.20 kernels NOT OK. The funny thing is that *some* of these "identical" computers are OK with the lm_sensors in RH 2.4.20 kernels. I suspect that the normal-speed-with-RH-2.4.20 kernel boxes actually have a new rev of motherboard. Since over the years we've RMA'd some of the motherboards and gotten replacements. The replacement, while same model, were probably a newer revision. I'm just guessing. I also note that the lm_sensors shipped in RH kernels is several revisions out-of-date. I observed similar behaviour on an Asus CUSL2 with an Intel Celeron 850. With once loaded i2c-i801 -> slow, without normal speed. Once i installed an Intel Pentium3 866 it went away. I never figured out what the problem was. Thanks for tracking this down. Could you grab a newer lm_sensors, and see if that has the same problem ? If it does, working with upstream to fix that would be the easiest way forward, as you have the hardware to reproduce and chase this. If this is fixed already in upstream, I'll look at getting it in for an errata update. Thanks again. Dax said:
>I also note that the lm_sensors shipped in RH kernels is several revisions
>out-of-date.
It is not appropriate for Red Hat to go randomly pulling updates of lots of
different drivers into our kernel. The right place for things to go when
they are updated is the upstream kernel, which we then pull from.
If a driver isn't good enough for the upstream kernel, why should we think
that it is good enough for a Red Hat product?
lm_sensors isn't in mainline 2.4 it's not so much a 'not good enough' situation, it was just never pushed for inclusion. It's now in 2.6 > If a driver isn't good enough for the upstream kernel, why should we think
> that it is good enough for a Red Hat product?
This statement makes no sense. You ship lots of drivers that aren't in the
upstream kernel (including the lm_sensors stuff in question here)
Sorry, mea culpa. I forgot that this was one of the demonstrations of why we have that policy of following upstream. Spoke too early in the morning, I guess... My deep apologies. I'll go crawl back into my hole now... I tested lm_sensors 2.8.0 (the most current version), the slowdown is still there. I installed the following RPMs to test: http://atrpms.physik.fu-berlin.de/dist/rh9/atrpms/atrpms-13-16.rh9.at.noarch.rpm http://atrpms.physik.fu-berlin.de/dist/rh9/lm_sensors/lm_sensors-2.8.0-1_15.rh9.at.i386.rpm http://atrpms.physik.fu-berlin.de/dist/rh9/kernel/kernel-2.4.20-20_23.1.rh9.at.i686.rpm I have narrowed down specifically which lm_sensor kernel module causes the slowdown. These modules are OK, no slowdown: eeprom i2c-proc i2c-isa i2c-piix4 i2c-core dmi_scan When I load the following kernel module is when the slowdown occurs: w83781d Unloading it with rmmod does not fix the slowdown, a reboot is required. Im feeling the same thing on my workstations. Im not using lm_sensors. Its gotten to the point where I cant run Quake3 while mozilla is open on a page that has a animated gif on it. (These are fully updated RH 9 machines) I also have performance degradation and I don't use lm_sensors. XMMS starts skipping OGG/MP3 files if I browse Internet or do any other work on the computer ( PIII 700 ). this bug has been fixed in Fedora Core release 0.94 kernel-2.4.22-1.2061.nptl |