Bug 858384 - pmcd segv during linux-pmda query
Summary: pmcd segv during linux-pmda query
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nathan Scott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-09-18 20:16 UTC by Frank Ch. Eigler
Modified: 2012-12-20 15:14 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-12-20 15:14:04 UTC
Type: Bug
Embargoed:



Description Frank Ch. Eigler 2012-09-18 20:16:42 UTC
pcp-debuginfo-3.6.8-1.fc16.x86_64
pcp-gui-1.5.5-1.fc16.x86_64
python-pcp-3.6.8-1.fc16.x86_64
pcp-libs-3.6.8-1.fc16.x86_64
pcp-3.6.8-1.fc16.x86_64
perl-PCP-PMDA-3.6.8-1.fc16.x86_64

# service pmcd start
# gdb .../pmcd $PID
(gdb) continue

---- meanwhile, from another window ----

% pmval kernel.pernode.cpu.sys

pmval: pmGetInDom(60.19): Timeout waiting for a response from PMCD

---- pmcd suffers segv ----

Program received signal SIGSEGV, Segmentation fault.
linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
80		for (t=table; t->field; t++) {
(gdb) bt
#0  linux_table_scan (fp=0x2b01f0140d00, table=0x0) at linux_table.c:80
#1  0x00002b01f0bca715 in refresh_numa_meminfo (numa_meminfo=0x2b01f0dd6fb0)
    at numa_meminfo.c:121
#2  0x00002b01f0bc1bf4 in linux_refresh (pmda=0x2b01f013f6d0, 
    need_refresh=0x7fff6b2b9a80) at pmda.c:3787
#3  0x00002b01f0bc1f6d in linux_instance (indom=251658259, inst=-1, name=0x0, 
    result=0x7fff6b2b9bb0, pmda=<optimized out>) at pmda.c:3911
#4  0x00002b01eed90bc9 in DoInstance (cp=0x2b01f0140810, pb=0x2b01f0141000)
    at dopdus.c:315
#5  0x00002b01eed89e9c in HandleClientInput (fdsPtr=0x7fff6b2b9cc0) at pmcd.c:495
#6  0x00002b01eed887f1 in ClientLoop () at pmcd.c:869
#7  main (argc=<optimized out>, argv=<optimized out>) at pmcd.c:1161

(gdb) frame 2
(gdb) p numa_meminfo->node_info[0]
$3 = {meminfo = 0x0, memstat = 0x2b8724958f80}
(gdb) p numa_meminfo->node_info[0]->memstat
$4 = (struct linux_table *) 0x2b8724958f80
(gdb) p *numa_meminfo->node_info[0]->memstat
$5 = {field = 0x0, maxval = 97, val = 47859434300912, this = 47859414894504, 
  prev = 47859434420404, field_len = 613841155, valid = 11143}

Note the NULL meminfo pointer, which is what causes the segv. The memstat table also appears to contain garbage. It's as though the struct was never initialized.
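A defensive guard at the scan site would turn this kind of crash into an error return, although it would not fix the missing initialisation itself. The sketch below is not PCP code: `struct table_entry` is a hypothetical, simplified stand-in for `struct linux_table`, and `table_scan_checked()` is a hypothetical variant of `linux_table_scan()` that parses lines of the form `Name: value` and refuses to walk a NULL table.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the PMDA's struct linux_table:
 * an entry whose field is NULL terminates the table. */
struct table_entry {
    const char *field;        /* line prefix to match, e.g. "MemFree:" */
    unsigned long long val;   /* value parsed from the matching line */
};

/* Scan fp and fill in matching entries.  Returns the number of values
 * updated, or -1 if either argument is NULL (e.g. a table that was never
 * initialised, like table=0x0 in frame #0 of the backtrace). */
static int table_scan_checked(FILE *fp, struct table_entry *table)
{
    char line[256];
    int found = 0;

    if (fp == NULL || table == NULL)
        return -1;            /* refuse to dereference a missing table */

    while (fgets(line, sizeof(line), fp) != NULL) {
        for (struct table_entry *t = table; t->field != NULL; t++) {
            size_t len = strlen(t->field);
            if (strncmp(line, t->field, len) == 0) {
                /* strtoull skips the whitespace after the prefix */
                t->val = strtoull(line + len, NULL, 10);
                found++;
            }
        }
    }
    return found;
}
```

The caller still has to decide what an error return means for the metric (for example, reporting it as unavailable); the guard only keeps pmcd alive.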


There is also a larger issue. If these .so PMDAs are not rock solid, they should be invoked by pmcd via a pipe connection rather than through .so linkage.
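The isolation argument can be illustrated with a generic POSIX sketch; it is not pmcd's code, and the names `agent_fn`, `healthy_agent`, `crashing_agent` and `run_isolated` are hypothetical. When an agent runs in its own process (loosely analogous to a pipe-connected PMDA), a segfault kills only the child and the parent can observe it through `waitpid()`. The same fault in an in-process .so agent would take pmcd down with it. The cost, as discussed later in this bug, is extra context switches and syscalls per request.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*agent_fn)(void);

/* Hypothetical agents: one behaves, one dies the way pmcd did here. */
static void healthy_agent(void)  { /* would answer requests normally */ }
static void crashing_agent(void) { raise(SIGSEGV); }  /* simulated bad dereference */

/* Run an agent in a child process and report how it ended.
 * Returns the signal number if a signal killed the child, 0 if the child
 * exited on its own, or -1 if fork/waitpid failed.  The calling process
 * survives in every case. */
static int run_isolated(agent_fn fn)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {           /* child: run the agent, then exit */
        fn();
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return 0;
}
```

For example, `run_isolated(crashing_agent)` returns `SIGSEGV` while the caller keeps running and could, in principle, restart the agent.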

Comment 1 Frank Ch. Eigler 2012-09-18 20:30:29 UTC
Interestingly, if preceded by 

% pminfo -f kernel.pernode

a subsequent

% pmval kernel.pernode.cpu.sys

will not crash.

Comment 2 Nathan Scott 2012-09-19 02:22:15 UTC
I have a fix, and pcp/qa test 286 will exercise this. mgoodwin is reviewing it now; I will commit it after that.

It's to do with the order of code execution: there are several one-trip guards in the Linux PMDA, and if metric values/instances are fetched in a particular order, some expected initialisation has not yet been performed.
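The general failure pattern, and one common way to remove it, can be sketched as follows. This is a generic illustration with hypothetical names (`node_info`, `setup_nodes`, `handle_fetch`, `handle_instance`), not the upstream fix. In the buggy shape, only the fetch path reached the one-trip setup, so an instance request arriving first (as with `pmval`'s `pmGetInDom` call, or `pmprobe -i`) walked a table whose pointers were still NULL. Routing every entry point through the same guard makes the request order irrelevant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 4

/* Hypothetical per-node state; meminfo plays the role of the pointer
 * that was NULL in the backtrace until setup had run. */
struct node_info {
    const char *meminfo;
};

static struct node_info node_info[MAX_NODES];
static bool nodes_ready;              /* the one-trip guard */

/* One-time setup: runs at most once, whichever path calls it first. */
static void setup_nodes(void)
{
    if (nodes_ready)
        return;
    for (int i = 0; i < MAX_NODES; i++)
        node_info[i].meminfo = "per-node meminfo table";
    nodes_ready = true;
}

/* Refresh code relies on setup having happened. */
static int refresh_node(int i)
{
    if (i < 0 || i >= MAX_NODES)
        return -1;
    assert(nodes_ready && node_info[i].meminfo != NULL);
    return i;
}

/* Both request paths pass through the shared guard before touching node
 * state, so it no longer matters whether a fetch or an instance request
 * arrives first. */
static int handle_fetch(int i)    { setup_nodes(); return refresh_node(i); }
static int handle_instance(int i) { setup_nodes(); return refresh_node(i); }
```

Calling `handle_instance()` before any fetch now succeeds, which mirrors why running `pminfo -f` first happened to avoid the crash in the original report.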

A simpler test case (which the qa test uses) is to run pmval in local context mode, using: pmval -s 1 @:kernel.pernode.cpu.sys

The test also uses "pmprobe -L -i", which exercises the other unusual path (instance PDU only, no fetch). These commands are run for every kernel metric.

Comment 3 Nathan Scott 2012-09-19 02:26:36 UTC
Oh, regarding the other issue...

| There is also a larger issue.  If these .so pmda's are not rock solid, they
| should be invoked by pmcd via a pipe connection rather than the .so linkage.

some discussion happened on IRC...

<fche> what about the concern that .so pmda's should be dispreferred?
<nathans> they are dispreferred in general, but not for the kernel agents (since they are used vastly more than any other)
<nathans> when its a separate process, its additional context switches, additional syscalls
<nathans> traditionally, we've leaned toward keeping kernel pmdas in-process, and most others left up to sysadmin to choose (but generally defaulting to separate process)
<fche> does this bug make you reconsider the balance of performance vs stability ?
<nathans> a little, for sure
<nathans> certainly makes me take up my axe (to grind) about shovelling more and more metrics into pmdalinux
<nathans> kernel metrics that is
<nathans> which could go out, like pmdakvm and such


Historically, the interaction between the CPU indom and the NUMA indom has been problematic in the Linux kernel PMDA. Unfortunately, these two are considerably more complex and intertwined than any of the other instance domains.

Comment 4 Nathan Scott 2012-09-21 01:15:27 UTC
Fix is committed upstream.

Comment 5 Fedora Update System 2012-10-25 22:17:54 UTC
pcp-3.6.9-1.el5 has been submitted as an update for Fedora EPEL 5.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el5

Comment 6 Fedora Update System 2012-10-25 22:18:30 UTC
pcp-3.6.9-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc18

Comment 7 Fedora Update System 2012-10-25 22:18:57 UTC
pcp-3.6.9-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc16

Comment 8 Fedora Update System 2012-10-25 22:19:25 UTC
pcp-3.6.9-1.el6 has been submitted as an update for Fedora EPEL 6.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.el6

Comment 9 Fedora Update System 2012-10-25 22:19:55 UTC
pcp-3.6.9-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/pcp-3.6.9-1.fc17

Comment 10 Fedora Update System 2012-10-26 18:34:01 UTC
Package pcp-3.6.9-1.el5:
* should fix your issue,
* was pushed to the Fedora EPEL 5 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=epel-testing pcp-3.6.9-1.el5'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-EPEL-2012-13283/pcp-3.6.9-1.el5
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2012-12-20 15:14:06 UTC
pcp-3.6.9-1.el6 has been pushed to the Fedora EPEL 6 stable repository.  If problems still persist, please make note of it in this bug report.

