Bug 1323521 - remote operation of pmie based pmda restarter interferes with local pmcd
Summary: remote operation of pmie based pmda restarter interferes with local pmcd
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: pcp
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Frank Ch. Eigler
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1334815
 
Reported: 2016-04-04 00:02 UTC by Frank Ch. Eigler
Modified: 2016-07-09 20:19 UTC
CC List: 8 users

Fixed In Version: pcp-3.11.2-2.fc24 pcp-3.11.2-1.fc22 pcp-3.11.2-2.fc23 pcp-3.11.3-1.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-09 20:19:38 UTC
Type: Bug
Embargoed:



Description Frank Ch. Eigler 2016-04-04 00:02:33 UTC
The workaround for bug #1065803 (proc-pmda hanging/timing-out) was for some reason to write a pmie rule that pmsignals pmcd if a pmda-monitoring metric indicates a damaged pmda.

This rule is ignorant of whether it's running against a local pmcd or a remote one (so it has no hope at all of signalling the remote pmcd).  pmieconf does not take a -h HOST parameter either, so it cannot express a different ruleset for local vs. remote pmie targets.

The effect of the new status quo is to have a default-configured "pmie -h REMOTE" job send HUP signals to an innocent local pmcd ... and do nothing about the suffering remote pmcd.

(I remain convinced that pmda restarting ought to be performed by logic within the local pmcd, and not require external imperfect assistance.)
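[Editorial note: to make the failure mode concrete, the moving parts look roughly like this; the host name and config path below are placeholders, not the packaged defaults.]

    # a default-configured remote monitoring job
    pmie -h REMOTE -c /some/path/config.REMOTE &

    # when the rule fires, its shell action runs on the pmie host, so a
    # command along the lines of
    pmsignal -a -s HUP pmcd
    # HUPs the *local* pmcd, while the damaged pmda on REMOTE is untouched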

Comment 1 Nathan Scott 2016-04-04 00:39:41 UTC
There's really no reason to worry about this.  Yes, we can end up signaling the local pmcd when a remote pmda fails.  We could work around this by adding knowledge to pmieconf about local pmcd vs not when generating config files - but it's not worth it just for this rule.  If we add more localhost-only rules, sure, let's look into it.

Signaling pmcd does not cause any problems, and is a very lightweight operation when no work needs to be done.  There is no reason not to run a local pmie on every host where there is concern about pmda/domain-induced timeouts.
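[Editorial note: a minimal sketch of that suggestion, assuming the standard systemd units shipped by the pcp packages (the service names are the usual ones, not taken from this report):]

    # enable and start a local pmie on the collector host itself
    systemctl enable pmie
    systemctl start pmie
    systemctl status pmie pmcd    # verify both services are running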

It only happens once in a blue moon (in the relatively unlikely case where a pmda has failed, and people are using only the default-generated rulesets with remote hosts - these can be overridden if there was a genuine concern / actual issue here).

> (I remain convinced that pmda restarting ought to be performed by logic within
> the local pmcd, and not require external imperfect assistance.)

That's nice.  Please send through the code implementing this & let's see if it can be made to work as reliably, and how much complexity it adds.

Comment 2 Frank Ch. Eigler 2016-04-04 16:41:11 UTC
(In reply to Nathan Scott from comment #1)

> [...] We could workaround this, by
> adding knowledge to pmieconf about local pmcd vs not when generating config
> files - but its not worth it just for this rule. [...]

Yes, that would be a partial workaround.


> Signaling pmcd does not cause any problems, and is a very lightweight
> operation when no work needs to be done.

That's not obvious, if you consider a high-fanout remote-pmie installation, where impotent remote-wannabe-SIGHUPs barrage the local pmcd.  Have you tested this scenario before making this assertion?


> There is no reason not to run a local pmie on every host where there
> is concern about pmda/domain-induced timeouts.

Wrong.  One simple reason not to run a local pmie is to avoid paying its performance cost (polling a variety of irrelevant metrics frequently & redundantly, producing system log entries, consuming memory).


> It only happens once in a blue moon (in the relatively unlikely case where a
> pmda has failed, and people are using only the default-generated rulesets
> with remote hosts - these can be overridden if there was a genuine concern /
> actual issue here).

Your data for "blue moon", please.  Of the moderately busy servers I oversee, 100% encounter proc-pmda timeouts/hangs after a few days of uptime.

Hand-editing default configuration files is not helpful advice, especially considering: where these changes would have to be made (multiple places); the general principle that defaults should -work- rather than have to be disabled; and the tools' tendency to regenerate configuration files periodically, overwriting said hand-editing.  Poor QoI.


> > (I remain convinced that pmda restarting ought to be performed by
> > logic within the local pmcd, and not require external imperfect assistance.)
> 
> That's nice.  Please send through the code implementing this & lets see if
> it can be made to work as reliably, and how much complexity it adds.

This is an inappropriate attitude.

Comment 3 Nathan Scott 2016-04-04 23:08:25 UTC
(In reply to Frank Ch. Eigler from comment #2)
> (In reply to Nathan Scott from comment #1)
> > [...] We could workaround this, by
> > adding knowledge to pmieconf about local pmcd vs not when generating config
> > files - but its not worth it just for this rule. [...]
> 
> Yes  [...]

OK, I'll open a separate RFE with a bit more detail.  It's unlikely we'll work on this here in Red Hat's PCP team, however, without a more compelling case (or, alternatively, a need for more local-only rules as per earlier comments, which would begin to make the case for it).

> [...]  Have you tested this scenario before making this assertion?
> [...]  avoid paying its performance cost

You seem to be asking me to prove that a hypothetical bug you've opened exists.  However, I see no evidence of a problem, nor would I expect to, so I tend to think we should spend time on more worthwhile pursuits.

> This is an inappropriate attitude.

Hmm, let me put it differently - I do welcome other folk continuing to investigate the area, of course.  There's no need to take offense at my suggestion that you do so (it's not something we're likely to take on in the PCP team here at Red Hat, that's all).  I'm sorry if you took the suggestion that you might like to do some work on this as inappropriate / offensive - but it's just being realistic; no one else seems to care as much as you do (if at all) about this perceived problem.

> 100% of them encounter proc-pmda timeout/hangs after a few days of uptime.

It would be very helpful if you could analyze the underlying kernel / pmdaproc problem there (I do not see that behaviour here) - there would seem to be some pathological root cause on these systems that could be diagnosed and the code improved.

> Hand-editing default configuration files is not helpful advice, esp. [...]

Oh, a misunderstanding perhaps - this is all pmieconf-driven, there's no hand-editing involved here.  If it's concerning you, use pmieconf in pmmgr to switch it off (pmie rules in the pcp group).  There's no reason it should concern you, however.
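[Editorial note: the pmieconf route mentioned above would look something like the sketch below; "pcp" is the group named in this comment, and exact rule names may vary between versions, so listing them first is the safer move.]

    pmieconf list pcp               # show the rules in the pcp group
    pmieconf modify pcp enabled no  # switch the group off in the generated config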

Comment 4 Frank Ch. Eigler 2016-04-08 11:27:59 UTC
> > [...]  Have you tested this scenario before making this assertion?
> > [...]  avoid paying its performance cost
> 
> You seem to be asking me to prove that that a hypothetical bug you've opened
> exists.  However, I see no evidence of a problem, nor would I expect to, so
> I tend to think we should spend time on more worthwhile pursuits.

The bug plainly exists in the current code.  A large-fanout central pmie server will flood its own local pmcd with SIGHUPs, one per minute per remote server.  Your assertion was that this is free of consequence.  Have you ever tested what a pmcd does when it's given a SIGHUP multiple times a second?  (Plus a syslog message for each?)
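[Editorial note: a crude way to answer that question empirically - on a scratch machine, not a production collector - would be something along these lines; PCP_LOG_DIR comes from /etc/pcp.conf.]

    . /etc/pcp.conf
    while true; do
        pmsignal -a -s HUP pmcd             # several HUPs per second
        sleep 0.2
    done &
    tail -f "$PCP_LOG_DIR/pmcd/pmcd.log"    # watch how pmcd reacts to each HUP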


> > Hand-editing default configuration files is not helpful advice, esp. [...]
> 
> Oh, a misunderstanding perhaps - this is all pmieconf-driven, there's no
> hand-editing involved here.

The point is that you suggested editing the pmieconf-generated files to remove the useless & possibly-harmful pmsignal clause.  That is an impractical solution.

Comment 5 Frank Ch. Eigler 2016-04-08 13:12:51 UTC
FWIW the issues were foreseen:
http://oss.sgi.com/pipermail/pcp/2016-February/009720.html

Comment 6 Nathan Scott 2016-04-10 22:35:21 UTC
(In reply to Frank Ch. Eigler from comment #4)
> > > [...]  Have you tested this scenario before making this assertion?
> > > [...]  avoid paying its performance cost
> > 
> > You seem to be asking me to prove that that a hypothetical bug you've opened
> > exists.  However, I see no evidence of a problem, nor would I expect to, so
> > I tend to think we should spend time on more worthwhile pursuits.
> 
> The bug plainly exists in the current code.  A large-fanout central pmie
> server will [...]

Why would the remote collectors exhibiting this problem not be able to run a local pmie alongside their problematic pmcd/pmdas?  They are able to, of course, so this fan-out-with-all-failing case is an unrealistic scenario.

> flood its own local pmcd with SIGHUPS, 1 per minute per remote server.

For this to be even close-to-maybe-remotely-a-problem, it assumes:
- all/many remote servers have failed agents, constantly
- all remote servers are not (able to?) run local pmie (why not?)
- or, all/many remote servers have an inability to restart agents
- it can't be solved in pmmgr/pmlogger_check (it can, as per BZ 1323851)

I have spent a lot of time in this code - the cost of a no-op SIGHUP to pmcd is not measurable (not even if multiplied by 1000s of hypothetically broken remote servers that for some bizarre reason cannot run local pmie co-processes).

> > > Hand-editing default configuration files is not helpful advice, esp. [...]
> > 
> > Oh, a misunderstanding perhaps - this is all pmieconf-driven, there's no
> > hand-editing involved here.
> 
> The point is that you suggested editing the pmieconf-generated files to
> remove the useless & possibly-harmful pmsignal clause.  That is an
> impractical solution.

At no point did I suggest editing the pmieconf-generated files via anything other than an automated process - pmmgr could certainly run pmieconf to disable this rule if it's still concerning you, as I already said.  So, very much a practical approach if you are concerned about this in pmmgr.

Also, as I said, I'm not against further work in the area and/or additional solutions ... please do hack on it if you wish.  IMO though, this problem is adequately solved by the simpler pmie solution.

Thanks for your interest!  Let me know if/when you have code for some other, additional approach, and I'll be happy to review and assess it.

Comment 7 Frank Ch. Eigler 2016-04-14 00:54:47 UTC
Near-trivial patch posted:
http://oss.sgi.com/pipermail/pcp/2016-April/010201.html

Comment 8 Fedora Update System 2016-04-29 02:53:31 UTC
pcp-3.11.2-1.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-394320f755

Comment 9 Fedora Update System 2016-04-29 17:21:56 UTC
pcp-3.11.2-2.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-bad5995fe9
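[Editorial note for testers: pulling the build from updates-testing is typically just the standard dnf invocation, e.g.:]

    dnf --enablerepo=updates-testing update pcp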

Comment 10 Fedora Update System 2016-04-30 01:50:07 UTC
pcp-3.11.2-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-394320f755

Comment 11 Fedora Update System 2016-04-30 02:23:06 UTC
pcp-3.11.2-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f8f919a355

Comment 12 Fedora Update System 2016-04-30 02:23:44 UTC
pcp-3.11.2-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-53282a0c5a

Comment 13 Fedora Update System 2016-05-09 00:04:47 UTC
pcp-3.11.2-2.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 14 Fedora Update System 2016-05-10 17:53:09 UTC
pcp-3.11.2-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2016-05-10 17:59:41 UTC
pcp-3.11.2-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

Comment 16 Fedora Update System 2016-06-18 06:17:43 UTC
pcp-3.11.3-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-4745f3e292

Comment 17 Fedora Update System 2016-07-09 20:19:12 UTC
pcp-3.11.3-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.

