Bug 1323521
| Summary: | remote operation of pmie based pmda restarter interferes with local pmcd | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Frank Ch. Eigler <fche> |
| Component: | pcp | Assignee: | Frank Ch. Eigler <fche> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rawhide | CC: | brolley, fche, lberk, mbenitez, mgoodwin, nathans, pcp, scox |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | pcp-3.11.2-2.fc24 pcp-3.11.2-1.fc22 pcp-3.11.2-2.fc23 pcp-3.11.3-1.el5 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-09 20:19:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1334815 | | |
Description (Frank Ch. Eigler, 2016-04-04 00:02:33 UTC)
(Comment #1, Nathan Scott)

There's really no reason to worry about this. Yes, we can end up signaling the local pmcd when a remote pmda fails. We could work around this by adding knowledge to pmieconf about local pmcd vs. remote when generating config files, but it's not worth it just for this rule. If we add more localhost-only rules, sure, let's look into it.

Signaling pmcd does not cause any problems, and is a very lightweight operation when no work needs to be done. There is no reason not to run a local pmie on every host where there is concern about pmda/domain-induced timeouts.

It only happens once in a blue moon (in the relatively unlikely case where a pmda has failed, and people are using only the default-generated rulesets with remote hosts; these can be overridden if there were a genuine concern or actual issue here).
> (I remain convinced that pmda restarting ought to be performed by logic within the local pmcd, and not require external imperfect assistance.)

That's nice. Please send through the code implementing this and let's see if it can be made to work as reliably, and how much complexity it adds.
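The "lightweight when no work needs to be done" claim above is testable in isolation. The sketch below is a hedged stand-in, not pmcd itself: a POSIX Python process installs a no-op SIGHUP handler and signals itself many times, so the names (`hups_seen`, `on_sighup`, the count `N`) are all invented for the illustration and say nothing about pmcd's actual handler.

```python
import os
import signal
import time

# Hypothetical stand-in for a daemon's "no work to do" SIGHUP path:
# the handler just bumps a counter, restarting nothing.
hups_seen = 0

def on_sighup(signum, frame):
    global hups_seen
    hups_seen += 1  # no-op path: nothing to restart

signal.signal(signal.SIGHUP, on_sighup)

N = 10_000
start = time.perf_counter()
for _ in range(N):
    # Simulate pmie's pmsignal action arriving at the local process.
    os.kill(os.getpid(), signal.SIGHUP)
elapsed = time.perf_counter() - start

print(f"handled {hups_seen} SIGHUPs in {elapsed:.3f}s "
      f"(~{N / elapsed:.0f} signals/s sent)")
```

This only measures the raw signal-delivery path in one toy process; it cannot settle the argument about syslog traffic or pmcd's real handler, which is exactly what the thread disputes.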
(In reply to Nathan Scott from comment #1)

> [...] We could workaround this, by adding knowledge to pmieconf about local pmcd vs not when generating config files - but its not worth it just for this rule. [...]

Yes, that would be a partial workaround.

> Signaling pmcd does not cause any problems, and is a very lightweight operation when no work needs to be done.

That's not obvious, if you consider a high-fanout remote-pmie installation, where impotent remote-wannabe-SIGHUPs barrage the local pmcd. Have you tested this scenario before making this assertion?

> There is no reason not to run a local pmie on every host where there is concern about pmda/domain-induced timeouts.

Wrong. One simple reason not to run a local pmie is to avoid paying its performance cost (polling a variety of irrelevant metrics frequently and redundantly, producing system log entries, consuming memory).

> It only happens once in a blue moon (in the relatively unlikely case where a pmda has failed, and people are using only the default-generated rulesets with remote hosts - these can be overridden if there was a genuine concern / actual issue here).

Your data for "blue moon", please. On moderately busy servers I oversee, 100% of them encounter proc-pmda timeouts/hangs after a few days of uptime. Hand-editing default configuration files is not helpful advice, especially considering where these changes would have to be made (multiple places); the general principle of having defaults work rather than have to be disabled; and the tools' tendency to regenerate configuration files periodically, overwriting said hand-editing. Poor QoI.

> > (I remain convinced that pmda restarting ought to be performed by logic within the local pmcd, and not require external imperfect assistance.)
>
> That's nice. Please send through the code implementing this & lets see if it can be made to work as reliably, and how much complexity it adds.

This is an inappropriate attitude.

(In reply to Frank Ch. Eigler from comment #2)

> > [...] We could workaround this, by adding knowledge to pmieconf about local pmcd vs not when generating config files - but its not worth it just for this rule. [...]
>
> Yes [...]

OK, I'll open a separate RFE with a bit more detail. It's unlikely we'll work on this here in Red Hat's PCP team, however, without a more compelling case (or, alternatively, a need for more local-only rules as per earlier comments, which would begin to make the case for it).

> [...] Have you tested this scenario before making this assertion?
> [...] avoid paying its performance cost

You seem to be asking me to prove that a hypothetical bug you've opened exists. However, I see no evidence of a problem, nor would I expect to, so I tend to think we should spend time on more worthwhile pursuits.

> This is an inappropriate attitude.

Hmm, let me put it differently - I do welcome other folk continuing to investigate the area, of course. There's no need to take offense at my suggestion you do so (it's not something we're likely to take on in the PCP team here at Red Hat, is all). I'm sorry if you took the suggestion that you might like to do some work on this as inappropriate or offensive - but it's just being realistic; no one else seems to care as much as you do (if at all) about this perceived problem.

> 100% of them encounter proc-pmda timeout/hangs after a few days of uptime.

It would be very helpful if you could analyze the underlying kernel / pmdaproc problem there (I do not see that behaviour here) - there would seem to be some pathological root cause on these systems that could be diagnosed and the code improved.

> Hand-editing default configuration files is not helpful advice, esp. [...]

Oh, a misunderstanding perhaps - this is all pmieconf-driven; there's no hand-editing involved here. If it's concerning you, use pmieconf in pmmgr to switch it off (pmie rules in pcp group).
There's no reason it should concern you, however.

> > [...] Have you tested this scenario before making this assertion?
> > [...] avoid paying its performance cost
>
> You seem to be asking me to prove that a hypothetical bug you've opened exists. However, I see no evidence of a problem, nor would I expect to, so I tend to think we should spend time on more worthwhile pursuits.

The bug plainly exists in the current code. A large-fanout central pmie server will flood its own local pmcd with SIGHUPs, one per minute per remote server. Your assertion was that this is free of consequence. Have you ever tested what a pmcd does when it's given a SIGHUP multiple times a second? (Plus a syslog message for each?)

> > Hand-editing default configuration files is not helpful advice, esp. [...]
>
> Oh, a misunderstanding perhaps - this is all pmieconf-driven, there's no hand-editing involved here.

The point is that you suggested editing the pmieconf-generated files to remove the useless and possibly harmful pmsignal clause. That is an impractical solution.

FWIW, the issues were foreseen: http://oss.sgi.com/pipermail/pcp/2016-February/009720.html
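The fan-out rates being argued over reduce to simple arithmetic. This sketch only makes the claimed figures concrete: the one-SIGHUP-per-minute-per-remote figure comes from the thread, while the host counts and the function name `sighup_rate` are hypothetical choices for the illustration.

```python
# Back-of-envelope SIGHUP arrival rate at the central host's local pmcd,
# assuming (per the thread) one pmsignal-driven SIGHUP per minute per
# monitored remote host whose pmda has failed.
def sighup_rate(failed_remotes, per_host_interval_s=60):
    """SIGHUPs per second hitting the central host's local pmcd."""
    return failed_remotes / per_host_interval_s

# Hypothetical fleet sizes, to show where "multiple times a second" starts.
for hosts in (10, 100, 1000):
    print(f"{hosts:5d} failing remotes -> {sighup_rate(hosts):.2f} SIGHUPs/s")
```

On these assumptions the "multiple times a second" scenario needs on the order of hundreds of simultaneously failing remotes, which is the crux of the disagreement below about how realistic that is.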
For this to be even close-to-maybe-remotely-a-problem, it assumes:

- all/many remote servers have failed agents, constantly
- all remote servers are not (able to?) run local pmie (why not?)
- or, all/many remote servers have an inability to restart agents
- it can't be solved in pmmgr/pmlogger_check (it can, as per BZ 1323851)

I have spent a lot of time in this code - the cost of a no-op SIGHUP to pmcd is not measurable (not even if multiplied by 1000s of hypothetically broken remote servers that for some bizarre reason cannot run local pmie co-processes).

> > > Hand-editing default configuration files is not helpful advice, esp. [...]
> >
> > Oh, a misunderstanding perhaps - this is all pmieconf-driven, there's no hand-editing involved here.
>
> The point is that you suggested editing the pmieconf-generated files to remove the useless & possibly-harmful pmsignal clause. That is an impractical solution.

At no point did I suggest editing the pmieconf-generated files via anything other than an automated process - pmmgr could certainly run pmieconf to disable this rule if it's still concerning you, as I already said. So, very much a practical approach if you are concerned about this in pmmgr.

Also, as I said, I'm not against further work in the area and/or additional solutions ... please do hack on it if you wish. IMO, though, this problem is adequately solved by the simpler pmie solution. Thanks for your interest! Let me know if/when you have code for some other, additional approach, and I'll be happy to review and assess it.

Near-trivial patch posted: http://oss.sgi.com/pipermail/pcp/2016-April/010201.html

pcp-3.11.2-1.el5 has been submitted as an update to Fedora EPEL 5. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-394320f755

pcp-3.11.2-2.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-bad5995fe9

pcp-3.11.2-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-394320f755

pcp-3.11.2-1.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-f8f919a355

pcp-3.11.2-2.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-53282a0c5a

pcp-3.11.2-2.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.2-1.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.2-2.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

pcp-3.11.3-1.el5 has been pushed to the Fedora EPEL 5 testing repository. If problems still persist, please make note of it in this bug report. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-4745f3e292

pcp-3.11.3-1.el5 has been pushed to the Fedora EPEL 5 stable repository. If problems still persist, please make note of it in this bug report.