Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1065803 - proc-pmda can timeout on fetch
Summary: proc-pmda can timeout on fetch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: 23
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Nathan Scott
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-02-17 01:44 UTC by Frank Ch. Eigler
Modified: 2016-04-06 14:55 UTC (History)
6 users (show)

Fixed In Version: pcp-3.11.1-1.fc24 pcp-3.11.1-1.fc23 pcp-3.11.1-1.fc22 pcp-3.11.1-1.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-26 17:56:06 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Frank Ch. Eigler 2014-02-17 01:44:15 UTC
pcp-3.8.12-1.fc19 on x86-64 server, servicing a remote pmmgr, intermittently
appears to crash the proc/linux pmda:


% pcp
Performance Co-Pilot configuration on FOO:

 platform: Linux FOO 3.10.5-201.fc19.x86_64 #1 SMP Wed Aug 7 16:25:24 UTC 2013 x86_64
 hardware: 4 cpus, 9 disks, 1 node, 19949MB RAM
 timezone: EST+5
     pmcd: Version 3.8.12-1, 5 agents, 2 clients
     pmda: pmcd proc[8] sendmail linux systemd rpm

% pminfo -f proc
...
proc.id.sgid_nm: pmLookupDesc: No PMCD agent for domain of request
...

pmcd.log:

Log for pmcd on FOO started Sun Feb  9 07:47:15 2014


active agent dom   pid  in out ver protocol parameters
============ === ===== === === === ======== ==========
pmcd           2                 2 dso i:5  lib=/var/lib/pcp/pmdas/pmcd/pmda_pmcd.so entry=pmcd_init [0x2aaaae3ab8c0]
linux         60                 2 dso i:4  lib=/var/lib/pcp/pmdas/linux/pmda_linux.so entry=linux_init [0x2aaaae7cd730]
proc           3 18900  13  14   2 bin pipe cmd=/var/lib/pcp/pmdas/proc/pmdaproc -d 3
systemd      114 18905  15  16   2 bin pipe cmd=/var/lib/pcp/pmdas/systemd/pmdasystemd -d 114
sendmail      15 18971  17  18   2 bin pipe cmd=/var/lib/pcp/pmdas/sendmail/pmdasendmail -d 15
rpm          123 18972  19  20   2 bin pipe cmd=/var/lib/pcp/pmdas/rpm/pmdarpm -d 123

Host access list:
00 01 Cur/MaxCons host-spec                               host-mask                               lvl host-name
== == =========== ======================================= ======================================= === ==============
 y  y     0     0 192.168.1.10                            255.255.255.255                           0 localhost
 y  y     0     0 /                                       /                                         1 unix:
    n     0     0 0.0.0.0                                 0.0.0.0                                   4 .*
    n     0     0 ::                                      ::                                        8 :*
User access list empty: user-based access control turned off
Group access list empty: group-based access control turned off


pmcd: PID = 18892, PDU version = 2
pmcd request port(s):
  sts fd   port  family address
  === ==== ===== ====== =======
  ok  1026       unix   /var/run/pcp/pmcd.socket
  ok  1024 44321 inet   INADDR_ANY
  ok  1025 44321 ipv6   INADDR_ANY
[Sun Feb  9 08:27:39] pmcd(18892) Warning: pduread: timeout (after 5.000 sec) while attempting to read 12 bytes out of 12 in HDR on fd=14
[Sun Feb  9 08:28:02] pmcd(18892) Info: CleanupAgent ...
Cleanup "proc" agent (dom 3): protocol failure for fd=14, exit(0)

^^^ note the pmda died about 45 minutes after it started.

proc.log extract:

UNKNOWN ) : 5713 (/usr/bin/spamd ) S 1 5713 5713 0 -1 4218944 15361 0 1 0 352 31 0 0 20 0 1 0 1549014777 291778560 16843 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 92163 18446744071580590201 0 0 17 2 0 0 87 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0
4 140733360050934 140733360054251 0

40 140734363951840 140734363955164 0
0892 0
^A
UNKNOWN c : 5715 (spamd child) S 5713 5713 5713 0 -1 4219200 13521 0 1 0 2769 103 0 0 20 0 1 0 1549016009 309153792 21145 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 10240 18446744071580590201 0 0 17 0 0 0 2472 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0
40733360050934 140733360054251 0

40 140734363951840 140734363955164 0
0892 0
^A
[...]
UNKNOWN c : 5716 (spamd child) S 5713 5713 5713 0 -1 4219200 8679 0 0 0 37 8 0 0 20 0 1 0 1549016009 296628224 17993 18446744073709551615 4194304 4199407 140734374877360 140734374876616 227976074547 0 0 4224 10240 18446744071580590201 0 0 17 2 0 0 83 0 0 6298944 6304288 29331456 140734374883070 140734374883147 140734374883147 140734374883305 0
83305 0
0733360050934 140733360054251 0

840 140734363951840 140734363955164 0
80892 0


Log finished Sun Feb  9 08:28:02 2014

^^^^ note how the pmda emitted that message before dying

Comment 1 Nathan Scott 2014-02-17 02:37:35 UTC
> 
> [Sun Feb  9 08:27:39] pmcd(18892) Warning: pduread: timeout (after 5.000 sec)
> while attempting to read 12 bytes out of 12 in HDR on fd=14
> [Sun Feb  9 08:28:02] pmcd(18892) Info: CleanupAgent ...
> Cleanup "proc" agent (dom 3): protocol failure for fd=14, exit(0)
> 
> ^^^ note the pmda died about 45 minutes after it started.

*nod* - note also the exit(0) - indicating the PMDA chose to exit.
And from proc.log...

> 840 140734363951840 140734363955164 0
> 80892 0
> 
> Log finished Sun Feb  9 08:28:02 2014
> 
> ^^^^ note how the pmda emitted that message before dying

This message is from the atexit handler - it appears pmdaproc either called exit(0) somewhere or just exited out of the main PMDA loop (have never seen that happen before, FWLIW) and returned zero from main.

Very odd, not sure ... but those diagnostics ... ohhh, wait - proc_runq.c is reporting that for process states it does not recognise.  Looks like that is because there's whitespace in the command names "(spamd )" and "(spamd child)" which must be taking something in the code by surprise.  So, thats one issue - doesn't explain where the exit call is coming from though.

Comment 2 Nathan Scott 2014-02-17 05:01:08 UTC
Failing to reproduce the problem here - specifically, trying to get that runq diagnostic to trigger with whitespace in the task_struct comm field... (in the hope that may trigger the clean exit somehow):

diff --git a/qa/src/qa_test.c b/qa/src/qa_test.c
index 8c0305d..5c033e6 100644
--- a/qa/src/qa_test.c
+++ b/qa/src/qa_test.c
@@ -1,7 +1,12 @@
 #include <stdio.h>
+#include <unistd.h>
 
+int
+main(int argc, char *argv[])
+{
+    static char *kevin_spacey = "qa test ";
 
-main(int argc, char *argv[]) {
-    printf("This program does nothing, and wastes a lot of space doing it!\n");
-    exit(0);
+    /* modify argv[0], see how pmdaproc reacts */
+    argv[0] = kevin_spacey;
+    return pause();
 }

$ ln 'qa_test' 'linked qa test '
$ ./linked\ qa\ test
(pauses)

... pmdaproc seems to report the right thing for me - any proc.runq fetches cause no logged messages to appear either.  Manual inspection of the /proc/<PID>/stat file suggests the problem *should* occur.  Argh.

Comment 4 Nathan Scott 2014-05-18 21:53:22 UTC
(Just updating the title to reflect our current understanding of the issue)

Comment 5 Fedora End Of Life 2015-01-09 21:10:03 UTC
This message is a notice that Fedora 19 is now at end of life. Fedora 
has stopped maintaining and issuing updates for Fedora 19. It is 
Fedora's policy to close all bug reports from releases that are no 
longer maintained. Approximately 4 (four) weeks from now this bug will
be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 19 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 6 Jaroslav Reznik 2015-03-03 15:29:39 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 22 development cycle.
Changing version to '22'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora22

Comment 7 Frank Ch. Eigler 2015-07-13 19:04:06 UTC
Still appears with 3.10.5.

Comment 8 Jan Kurik 2015-07-15 14:43:04 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 9 Fedora Update System 2016-03-21 22:30:54 UTC
pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-90112fb9ca

Comment 10 Fedora Update System 2016-03-22 00:17:26 UTC
pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-5b519318e0

Comment 11 Fedora Update System 2016-03-22 01:26:34 UTC
pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-4969de37e5

Comment 12 Nathan Scott 2016-03-22 05:08:29 UTC
To clarify, the fix here is to auto-restart agents that are unresponsive, which is typically due to an unexpected, very large latency during PMDA sampling (and fixing the source of that latency is outside of PCP, hence intractable).

This is achieved through a combination of pmdaroot starting PMDAs (i.e. set PMCD_ROOT_AGENT=1 in /etc/sysconfig/pmcd - which is now the default) and:

# chkconfig pmie on
# service pmie start

This enables the pmie rule which checks for agents that have exited, and automates their restart (within ~5 seconds - with a holdoff of 1 minute after any such attempt).  A message is also logged to syslog at the time a restart is attempted.

These two components to the fix first came together in pcp-3.11.1, however the pmie rule could be used in pcp-3.11.0 as well if anyone needs that.

Comment 13 Fedora Update System 2016-03-22 15:22:48 UTC
pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-e687eabad0

Comment 14 Frank Ch. Eigler 2016-03-23 13:19:19 UTC
> (and fixing the source of that latency is outside of PCP, hence
> intractable).

PCP must already tolerate high latency underlying data sources, and does in cases such as pmdarpm with background threads.  This is not intractable - it just requires a well designed program.

Comment 15 Nathan Scott 2016-03-23 20:50:47 UTC
> This is not intractable [...]

Oh, my note was unclear - the part that is not solvable in PCP is getting valid values at the time point requested.  As to how PCP responds and deals with that, yes, many different potential solutions exist there, and of varying levels of complexity.

Comment 16 Fedora Update System 2016-03-26 17:55:50 UTC
pcp-3.11.1-1.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Fedora Update System 2016-03-30 20:53:52 UTC
pcp-3.11.1-1.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 18 Fedora Update System 2016-03-30 21:20:11 UTC
pcp-3.11.1-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

Comment 19 Fedora Update System 2016-04-06 14:53:40 UTC
pcp-3.11.1-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.