Bug 1187358 - Spacewalk does not announce tasks over XML-RPC during rhn_check
Summary: Spacewalk does not announce tasks over XML-RPC during rhn_check
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Spacewalk
Classification: Community
Component: Server
Version: 2.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Stephen Herr
QA Contact: Red Hat Satellite QA List
URL:
Whiteboard:
Depends On:
Blocks: space23
 
Reported: 2015-01-29 20:51 UTC by Christopher Swingler
Modified: 2015-06-03 14:28 UTC
CC: 4 users

Fixed In Version: spacewalk-backend-2.3.46-1 osa-dispatcher-5.11.55-1
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-04-14 19:03:01 UTC
Embargoed:


Attachments (Terms of Use)
A stacktrace when attempting to reactivate an existing system (20.36 KB, text/plain)
2015-01-29 20:51 UTC, Christopher Swingler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1223473 1 None None None 2021-01-20 06:05:38 UTC

Description Christopher Swingler 2015-01-29 20:51:21 UTC
Created attachment 985764 [details]
A stacktrace when attempting to reactivate an existing system

Description of problem:

See [Spacewalk-list] thread https://www.redhat.com/archives/spacewalk-list/2015-January/msg00161.html for more discussion on this topic.

In short, systems simply do not pick up scheduled tasks. Tasks are scheduled through the UI (whether package installations/updates, configuration file publishing, or remote commands), and the API effectively says "I've got nothing for you" when the remote system runs an rhn_check.

I've validated that the system ID matches and that the rhn_check succeeds, but nothing seems to convince Spacewalk that it should pass the task along to the checking-in system. Spacewalk tools have been updated on affected systems.

A packet capture shows this HTTP response coming back from Spacewalk:
HTTP/1.1 200 OK
Date: Mon, 26 Jan 2015 22:47:55 GMT
Server: Apache
Content-Length: 126
X-RHN-Server-Capability: registration.finish_message(1)=1
X-RHN-Server-Capability: registration.remaining_subscriptions(1)=1
X-RHN-Server-Capability: abrt(1)=1
X-RHN-Server-Capability: registration.update_contact_info(1)=1
X-RHN-Server-Capability: staging_content(1)=1
X-RHN-Server-Capability: applet.has_base_channel(1)=1
X-RHN-Server-Capability: registration.smbios(1)=1
X-RHN-Server-Capability: registration.extended_update_support(1)=1
X-RHN-Server-Capability: rhncfg.filetype.directory(1)=1
X-RHN-Server-Capability: registration.update_systemid(1)=1
X-RHN-Server-Capability: registration.register_osad(1)=1
X-RHN-Server-Capability: registration.delta_packages(1)=1
X-RHN-Server-Capability: cpu_sockets(1)=1
X-RHN-Server-Capability: ipv6(1)=1
X-RHN-Server-Capability: rhncfg.content.base64_decode(1)=1
X-RHN-Server-Capability: xmlrpc.packages.extended_profile(1-2)=1
X-RHN-Server-Capability: xmlrpc.errata.patch_names(1)=1
X-RHN-Server-Capability: xmlrpc.packages.checksums(1)=1
X-RHN-Server-Capability: xmlrpc.login.extra_data(1)=1
X-RHN-Proxy-Version: 0
X-Transport-Info: Extended Capabilities Transport (C) Red Hat, Inc (version 2.5.72-1.el6)
X-RHN-Client-Version: 1
Connection: close
Content-Type: text/xml
 
<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><string></string></value>
</param>
</params>
</methodResponse>
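
For reference, the empty <string> in that methodResponse is what the client-side XML-RPC library decodes into "no action". A minimal sketch of the decoding using the Python 3 stdlib (the actual client uses rhnlib/xmlrpclib, so this is illustrative only):

```python
import xmlrpc.client

# The captured response body from above: a methodResponse whose single
# param is an empty string -- the server's way of saying "nothing queued".
body = """<?xml version='1.0'?>
<methodResponse>
<params>
<param>
<value><string></string></value>
</param>
</params>
</methodResponse>"""

# loads() returns (params, methodname); methodname is None for responses.
params, methodname = xmlrpc.client.loads(body)
action = params[0]

# rhn_check treats an empty string as "no actions" and exits successfully,
# which matches the "Queued" / "not yet picked up" symptom on the server.
print("server returned no actions" if action == '' else action)
```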


In addition, attempting to use the Spacewalk reactivation feature results in the following error from rhnreg_ks:

Error communicating with server. The message was:
Internal Server Error

on re-registration, and the attached stack trace generated by the server.

We're seeing this on an assortment of servers and releases (CentOS 5 and 6), with no particular consistent relationship as to why it's broken. 


Version-Release number of selected component (if applicable):

Spacewalk 2.2:
spacewalk-base-2.2.33-1.el6.noarch

On one affected client:
rhnlib-2.5.72-1.el6.noarch
spacewalk-backend-libs-2.2.43-1.el6.noarch
rhncfg-client-5.10.73-1.el6.noarch
rhn-setup-2.2.7-1.el6.noarch
rhn-client-tools-2.2.7-1.el6.noarch
rhnsd-5.0.14-1.el6.x86_64
rhn-check-2.2.7-1.el6.noarch
rhncfg-actions-5.10.73-1.el6.noarch
rhncfg-5.10.73-1.el6.noarch


How reproducible:
Very; but not reliably. Some systems just get stuck in this state, and I haven't figured out why or how to reproduce it.


Steps to Reproduce:
1. Schedule task in Spacewalk UI
2. Attempt to run rhn_check on remote host.


Actual results:
Receive successful completion of rhn_check, but task remains at "Queued", "This action has not yet been picked up."


Expected results:
Receive successful completion of rhn_check, and the task completed. 

Additional info:

Comment 1 Stephen Herr 2015-01-29 22:18:01 UTC
This is some kind of huge hint, but I don't know why yet:
https://www.redhat.com/archives/spacewalk-list/2015-January/msg00175.html

Setting the creator_id for the server to non-null and then reactivating it fixes the problem.

spacewalk-sql -i
alter table rhnserver disable trigger rhnserver_log_trig ;
update rhnserver set creator_id = X where id = Y;
alter table rhnserver enable trigger rhnserver_log_trig ;

Comment 2 Stephen Herr 2015-01-29 22:19:39 UTC
Traceback from email in comment 1:
Exception Handler Information
Traceback (most recent call last):
 File "/usr/lib/python2.6/site-packages/spacewalk/server/apacheRequest.py", line 122, in call_function
   response = func(*params)
 File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 511, in new_system
   architecture, data)
 File "/usr/share/rhn/server/handlers/xmlrpc/registration.py", line 300, in create_system
   newserv.user.reload(newserv.server['creator_id'])
 File "/usr/lib/python2.6/site-packages/spacewalk/server/rhnUser.py", line 246, in reload
   raise rhnException("Could not find contact record for id", user_id)
rhnException: ('Could not find contact record for id', None)

Comment 3 Stephen Herr 2015-01-30 18:36:35 UTC
Okay, so after some investigation, I believe the traceback when reactivating is a different issue from whatever is causing the action to not get picked up, and reactivating just happens to fix whatever the not-picking-up-actions problem is. I'll look at fixing the reactivation problem in a minute, but I need more information about the not-picking-up-actions problem (I can't reproduce it).

In addition to the usual causes of actions not being picked up (the client not calling home, something odd with system/database timestamps so Spacewalk does not think the action should be executed yet, or the system id on the client not matching the system profile), I have identified a couple of other things I'd like you to check for:

1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked.
System Overview page -> Lock Status field

2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot?
spacewalk-sql -i
select 1
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id = <sid>
   and sa.status = 1;

(will return 1 if Spacewalk thinks the system is rebooting)

If neither of those apply and the system is still not picking up actions, I'll need you to enable more thorough logging on the Spacewalk server and post the logs here.

Enable debug logging by setting "server.debug = 4" in /etc/rhn/rhn.conf and restarting httpd. Then (as root):
1) tail -f /var/log/rhn/rhn_server_xmlrpc.log -n 0 > out.txt
2) run rhn_check on the client
3) Ctrl-C the tail
4) attach out.txt to this bug or copy it into a comment.

Thanks!

Comment 4 Stephen Herr 2015-01-30 22:17:40 UTC
I have fixed the "traceback during re-activation" problem, but again, I think this is unrelated to the "not picking up actions" problem.

Committing to Spacewalk master:
e3400bd637f7686664ae29ff55f085b0723c1f6d

Fixed in spacewalk-backend-2.3.37-1

Comment 5 Christopher Swingler 2015-02-11 16:32:15 UTC
Okay, so I went and scheduled a remote command on all of our systems in Spacewalk with a null creator ID.  This task was scheduled at 1/30/15 10:47:00 AM CST. I finally went and checked on it today (over twelve days later).

The remote command was to basically just email me when it executed:

echo "Hello from $(hostname) at $(date)" | mail -s "Pingback from null-creator_id host $(hostname)" cswingler@

Results:

Completed: 224
Failed:      5
Pending:   125

The five failed systems failed for various, valid reasons - three for not having a "mail" command, and two for not having the permissions set to allow execution.

Of the 125 that are still pending (twelve days later), 48 are on our Inactive Systems list and should probably be cleaned out. :)

That still leaves 77 systems that seem to be affected. A spot check shows that, indeed, they're still failing to pick up tasks.

I think this might work against the hypothesis that a null creator_id is to blame here, and that it was simply coincidental.

But I (sort of) digress. 

To answer your questions:

> 1) Is the system profile locked? Some actions will not be scheduled if the system profile is locked.

Created an ssm in spacecmd, ran system_details against the 77 systems we've got here, and:

[admincswingler@spwk-app-chi-01 ~]$ grep -c 'Locked.*False' stuck_executors.txt
77
[admincswingler@spwk-app-chi-01 ~]$ grep -ic 'Locked.*True' stuck_executors.txt
0

Nope. All unlocked. 

> 2) Does Spacewalk (incorrectly) think that this system is in the middle of a reboot?
spaceschema-#  where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
spaceschema-#    and sa.status = 1;
 ?column?
----------
       1
<snip>
       1
(77 rows)

Bingo!

Comment 6 Stephen Herr 2015-02-13 19:51:28 UTC
Sorry, "rebooting" was not entirely accurate on my part. The comment I was looking for called out rebooting specifically, but actually that sql is looking for any action that is currently in progress. Which makes sense, you can't start on the next action until you're done with the one you're already working on.

You can find out what type of action is "stuck" with this query:

select sa.server_id, at.label
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
   and sa.status = 1;

The fact that the problem is actions that are stuck makes this difficult to debug, since I have no idea where to look or how they got into this state. If you're sure they're not actually in the middle of something, you can unstick them by changing all their states to failed:

update rhnServerAction sa
   set status = 3
 where sa.server_id in (1000014807, 1000014827, 1000014828, 1000014680, 1000013882, 1000014931, 1000014932, 1000015027, 1000015252, 1000014809, 1000015308, 1000015341, 1000015329, 1000015324, 1000015330, 1000015338, 1000015335, 1000014812, 1000015389, 1000015275, 1000015222, 1000015229, 1000015239, 1000013782, 1000014910, 1000014923, 1000014839, 1000014924, 1000014841, 1000014953, 1000014951, 1000014916, 1000014736, 1000014751, 1000014993, 1000014995, 1000014994, 1000015023, 1000015021, 1000015020, 1000014653, 1000014758, 1000015065, 1000014955, 1000014957, 1000014968, 1000015118, 1000014947, 1000014912, 1000014911, 1000014821, 1000014822, 1000014824, 1000014561, 1000015090, 1000015120, 1000014921, 1000015081, 1000015309, 1000014942, 1000014943, 1000014697, 1000014978, 1000014979, 1000015371, 1000014323, 1000015151, 1000014981, 1000014976, 1000015205, 1000015208, 1000015183, 1000015084, 1000014741, 1000014975, 1000014974, 1000014815)
   and sa.status = 1;

However this doesn't fix the larger problem of how they got stuck in the first place. If any other information that can shed some light becomes available (error messages in logs, reproducible situations where an action never completes, etc.) please report it here so I can fix it.

Comment 7 Michael Calmer 2015-02-14 14:05:26 UTC
(In reply to Stephen Herr from comment #6)
> Sorry, "rebooting" was not entirely accurate on my part. The comment I was
> looking for called out rebooting specifically, but actually that sql is
> looking for any action that is currently in progress. Which makes sense, you
> can't start on the next action until you're done with the one you're already
> working on.

Hmm, but maybe this is the problem.

_query_queue_get explicitly collects queued _and_ picked-up actions, and there is also a "remaining_tries" column in rhnServerAction. So it seems that already picked-up actions are meant to be executed a second time if the client asks again.

But with the current __reboot_in_progress() method, which does not actually restrict the check to reboots, all picked-up but "failed" actions will never be triggered again.
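
If that reading is right, the fix would be to make the in-progress check match its name. A hedged sketch of what a reboot-only check could look like, reusing the query from comment 3 (the 'reboot.reboot' label and the :sid bind placeholder are my assumptions for illustration, not taken from the actual patch in pull request 230):

```sql
-- Illustrative only: restrict the "blocked" check to in-progress reboots,
-- so that picked-up non-reboot actions no longer block the queue.
select 1
  from rhnServerAction sa
  join rhnAction a on sa.action_id = a.id
  join rhnActionType at on a.action_type = at.id
 where sa.server_id = :sid
   and sa.status = 1
   and at.label = 'reboot.reboot';
```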

Comment 8 Stephen Herr 2015-03-05 21:25:58 UTC
Yeah I think you're right. Accepting your pull request:
https://github.com/spacewalkproject/spacewalk/pull/230

Comment 9 Grant Gainey 2015-03-23 16:59:00 UTC
Moving bugs to ON_QA as we move to release Spacewalk 2.3

Comment 10 Grant Gainey 2015-04-14 19:03:01 UTC
Spacewalk 2.3 has been released. See

https://fedorahosted.org/spacewalk/wiki/ReleaseNotes23

