Bug 84032 - starting profiling sometimes crashes the kernel
Summary: starting profiling sometimes crashes the kernel
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: oprofile
Version: 9
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: William Cohen
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 79578 CambridgeBlocker
 
Reported: 2003-02-11 04:17 UTC by Ulrich Drepper
Modified: 2007-04-18 16:51 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2003-10-07 02:07:59 UTC
Embargoed:


Attachments
script to crash kernel via oprofile (deleted)
2003-02-14 05:49 UTC, Ulrich Drepper
Script S revised to use "rm -rf /var/lib/oprofile/samples/*" (deleted)
2003-02-14 14:58 UTC, William Cohen

Description Ulrich Drepper 2003-02-11 04:17:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030203

Description of problem:
One combination of event counters crashes my UP P4 HT system sooner or later.
Until 2.4.20-2.40 every use of MEMORY_CANCEL seemed to be fatal.  With
2.4.20-2.41 I see the problem only after some preparation.

I'm using oprofile-0.4-40 but earlier versions had the problem, too.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
Run the following steps:

rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --init
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15
opcontrol --start

<Run some test program.  Mine was threaded and created 1,000,000 threads in sequence.>

opcontrol --stop
oprofpp -c 3 -l <SOMEDSO>

opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start
^^^^ Sometimes this opcontrol call crashes the kernel

<Run the program again>

opcontrol --stop
rm -f /root/.oprofile/daemonrc
rm -f /var/lib/oprofile/lock
rm -f /var/lib/oprofile/samples/*
opcontrol --setup --vmlinux=/boot/vmlinux-$(uname -r) --ctr3-event=INSTR_RETIRED \
    --ctr3-count=600000 --ctr3-unit-mask=15 --ctr2-event=MEMORY_CANCEL \
    --ctr2-count=240000 --ctr2-unit-mask=8
opcontrol --start

^^^^ I never managed to get past this last opcontrol call.
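
For reference, the sequence above could be collected into one script along the following lines. This is a sketch only and not the attached script: the DRY_RUN switch and the function names are assumptions made for illustration, the oprofpp step is left out because it needs a concrete DSO, and the test program has to be run by hand where indicated. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the reproduction sequence; NOT the original attachment.
# DRY_RUN, run(), clean_state(), setup_*() and reproduce() are
# illustrative names, not part of oprofile.

DRY_RUN=${DRY_RUN:-1}               # 1 = only print commands, 0 = execute them
VMLINUX="/boot/vmlinux-$(uname -r)"

# Print a command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Remove the daemon configuration, the lock file, and old samples.
clean_state() {
    run rm -f /root/.oprofile/daemonrc
    run rm -f /var/lib/oprofile/lock
    run rm -f /var/lib/oprofile/samples/*
}

# First configuration: counter 3 only.
setup_ctr3() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15
}

# Second configuration: counter 3 plus MEMORY_CANCEL on counter 2.
setup_ctr3_ctr2() {
    run opcontrol --setup --vmlinux="$VMLINUX" \
        --ctr3-event=INSTR_RETIRED --ctr3-count=600000 --ctr3-unit-mask=15 \
        --ctr2-event=MEMORY_CANCEL --ctr2-count=240000 --ctr2-unit-mask=8
}

reproduce() {
    clean_state
    run opcontrol --init
    setup_ctr3
    run opcontrol --start
    # <run the test program here>
    run opcontrol --stop
    setup_ctr3_ctr2
    run opcontrol --start           # reported to crash sometimes
    # <run the test program again>
    run opcontrol --stop
    clean_state
    setup_ctr3_ctr2
    run opcontrol --start           # reported to crash every time
}

reproduce
```

Running it with DRY_RUN=0 as root would execute the commands for real, so that should only be done on a machine that is allowed to crash.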


Actual Results:  Kernel bug:

kernel BUG at ../../../drivers/oprofile/cpu_buffer.c:95!
invalid operand: 0000
oprofile nfs lockd sunrpc parport_pc lp parport autofs e100 ipt_REJECT iptable_
CPU:    1
EIP:    0060:[<d0933554>]    Not tainted
EFLAGS: 00010046

EIP is at oprofile_add_sample [oprofile] 0xc4 (2.4.20-2.41smp)
eax: 00000000   ebx: d0938700   ecx: 00000080   edx: 00000002
esi: 0806d880   edi: 00000002   ebp: 00000000   esp: caaf1f68
ds: 0068   es: 0068   ss: 0068
Process awk (pid: 26649, stackpage=caaf1000)
Stack: 00000000 00000002 00000048 d0935d61 d09396f8 00000004 cb333620 00000001
       caaf1fc4 0808bac8 08089eb6 bffff628 d0934ebd 00000001 d0939d88 caaf1fc4
       c010a682 caaf1fc4 00000001 0808bac4 c0109a4a caaf1fc4 00000000 0808bac4
Call Trace:   [<d0935d61>] p4_check_ctrs [oprofile] 0xc1 (0xcaaf1f74))
[<d09396f8>] counter_config [oprofile] 0x38 (0xcaaf1f78))
[<d0934ebd>] nmi_callback [oprofile] 0x2d (0xcaaf1f98))
[<d0939d88>] cpu_msrs [oprofile] 0x5e8 (0xcaaf1fa0))
[<c010a682>] do_nmi [kernel] 0x22 (0xcaaf1fa8))
[<c0109a4a>] nmi [kernel] 0x1e (0xcaaf1fb8))
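
To compare traces from several crashes it can help to reduce each one to a list of "module symbol+offset" entries. The filter below is a small sketch for that; the function name oops_frames is made up here, and it assumes the 2.4-style trace format shown above, where each entry reads "[<address>] symbol [module] offset". Note that such a trace lists every stack word that resolves to a kernel or module symbol, so not every entry is necessarily a real call frame.

```shell
#!/bin/sh
# oops_frames (hypothetical helper): read an oops from stdin and print
# "module symbol+offset" for every "[<addr>] symbol [module] offset" entry.
oops_frames() {
    awk '{
        for (i = 1; i <= NF; i++) {
            # An entry starts with a field that is exactly "[<hexaddr>]".
            if ($i ~ /^\[<[0-9a-f]+>\]$/ && i + 3 <= NF) {
                mod = $(i + 2)
                gsub(/\[/, "", mod)     # strip the brackets around the module name
                gsub(/\]/, "", mod)
                printf "%s %s+%s\n", mod, $(i + 1), $(i + 3)
            }
        }
    }'
}
```

Typical use would be to save the oops text to a file and run "oops_frames < oops.txt".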


Expected Results:  Profiling works.

Additional info:

I can provide you a version of the test program.  It's in my home dir on devserv.

Comment 1 William Cohen 2003-02-11 15:23:53 UTC
Does the last opcontrol --setup/opcontrol --start work if the previous
opcontrol --setup/--start calls are not run?


Comment 2 William Cohen 2003-02-11 23:46:37 UTC
Using oprofile-0.4-41, I encounter a different problem. Removing
/var/lib/oprofile/lock and starting with a new setup yields

[root@dhcp59-189 SPECS]# opcontrol --start
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

It appears that the old daemon still accesses one of the special files, so
the new daemon cannot open it (probably /dev/oprofile/buffer).

Can you reproduce the problem with "opcontrol --shutdown" instead of "opcontrol
--stop" and without the "rm -f ..."?

-Will
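
A guard along these lines could avoid removing the lock file while the old daemon is still alive. It is a sketch under one assumption that should be checked against the installed opcontrol script: that /var/lib/oprofile/lock contains the PID of the running oprofiled. The function name lock_holder_alive is made up for this example.

```shell
#!/bin/sh
# lock_holder_alive (hypothetical helper): return 0 if the lock file names a
# process that still exists, 1 otherwise.
# ASSUMPTION: the lock file holds the daemon's PID as plain digits.
lock_holder_alive() {
    lockfile=${1:-/var/lib/oprofile/lock}
    [ -r "$lockfile" ] || return 1
    pid=$(tr -cd '0-9' < "$lockfile")   # keep digits only
    [ -n "$pid" ] || return 1
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # It also fails for a live process owned by another user, so run as root.
    kill -0 "$pid" 2>/dev/null
}
```

If lock_holder_alive succeeds, "opcontrol --shutdown" is the safer way to end the old daemon; the "rm -f /var/lib/oprofile/lock" step only makes sense when it fails.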

Comment 3 William Cohen 2003-02-13 21:36:10 UTC
Uli, do you have a watchdog timer set up on this machine?

Comment 4 Ulrich Drepper 2003-02-13 21:39:35 UTC
> Uli, do you have a watchdog timer set up on this machine?

I usually have nmi_watchdog defined, yes.
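
For anyone else checking their own machine, the watchdog setting can be read from the kernel command line. The helper below is a sketch (the name nmi_watchdog_setting is made up); it prints the value of the nmi_watchdog= boot parameter, or nothing if the parameter was not given.

```shell
#!/bin/sh
# nmi_watchdog_setting (hypothetical helper): print the value of the
# nmi_watchdog= boot parameter from a kernel command line file.
nmi_watchdog_setting() {
    cmdline_file=${1:-/proc/cmdline}
    # One parameter per line, keep only the value of nmi_watchdog=,
    # and take the last occurrence if it was given more than once.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^nmi_watchdog=//p' | tail -n 1
}
```

Calling nmi_watchdog_setting without arguments reads /proc/cmdline; an empty result means the parameter was not passed at boot.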

Comment 5 Ulrich Drepper 2003-02-14 05:49:00 UTC
Created attachment 90077 [details]
script to crash kernel via oprofile

Executing this script crashes my UP P4 HT machine reliably (100%) during the
last opcontrol --start.

Comment 6 William Cohen 2003-02-14 14:58:04 UTC
Created attachment 90087 [details]
Script S revised to use "rm -rf /var/lib/oprofile/samples/*"

I tried the attached script S (S2 merely does a better job cleaning out
/var/lib/oprofile/samples/*). The machine did not crash, but the script did
produce the following output:

[root@dhcp59-189 root]# ./S
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
Stopping profiling.
Profiler running.
Stopping profiling.
Daemon not running
Failed to open profile device: Device or resource busy
Couldn't start oprofiled.
Check the log file "/var/lib/oprofile/oprofiled.log" and /var/log/messages

What it is doing is not right, but it isn't crashing. This is a Dell
Precision 430 running a freshly installed GinGin-re2011.nightly. What
differences are there between the machine that the script crashes on and the
machine I am using to try to replicate the problem? Here are the details
about the machine and software:

rpm packages:
oprofile-0.4-41
kernel-smp-2.4.20-2.44



The processor is

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 2
model name	: Intel(R) Xeon(TM) CPU 2.66GHz
stepping	: 7
cpu MHz 	: 2657.857
cache size	: 512 KB
physical id	: 0
siblings	: 2
fdiv_bug	: no
hlt_bug 	: no
f00f_bug	: no
coma_bug	: no
fpu		: yes
fpu_exception	: yes
cpuid level	: 2
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips	: 5308.41
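
To make the two machines easier to compare, the relevant /proc/cpuinfo fields can be pulled out with something like the following sketch. The function name cpu_summary and the choice of fields (model name, siblings, presence of the ht flag) are assumptions made for this example.

```shell
#!/bin/sh
# cpu_summary (hypothetical helper): print the model name, the siblings
# count, and whether the "ht" flag is present, using the first processor
# entry of a cpuinfo-style file.
cpu_summary() {
    cpuinfo=${1:-/proc/cpuinfo}
    awk -F ':' '
        {
            key = $1; sub(/[ \t]+$/, "", key)   # trim padding after the key
            val = $2; sub(/^[ \t]+/, "", val)   # trim padding before the value
        }
        key == "model name" && !seen_model++ { print "model: " val }
        key == "siblings"   && !seen_sib++   { print "siblings: " val }
        key == "flags"      && !seen_flags++ {
            # Pad with spaces so that "ht" only matches as a whole word.
            print "ht flag: " ((" " val " ") ~ / ht / ? "yes" : "no")
        }
    ' "$cpuinfo"
}
```

Together with "rpm -q oprofile kernel-smp" and "uname -r", the output of cpu_summary from both machines should show most of the differences that matter here.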

Comment 7 William Cohen 2003-06-03 14:26:49 UTC
Uli, is this bug still occurring?

Comment 8 William Cohen 2003-08-26 16:37:39 UTC
Uli, is this bug still occurring with the current RHL9 kernels?

Comment 11 Ulrich Drepper 2003-10-07 02:07:59 UTC
Sorry for the delay.  I cannot reproduce the problem anymore.  Assumed to be
fixed.  Closing as such.

