Bug 1328958
| Summary: | pbs_sched doesn't appear to work | | |
|---|---|---|---|
| Product: | [Fedora] Fedora EPEL | Reporter: | Kevin L. Esteb <kesteb> |
| Component: | torque | Assignee: | David Brown <david.brown> |
| Status: | ON_QA --- | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | epel7 | CC: | david.brown, fotis, garrick, karlthered |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Kevin L. Esteb
2016-04-20 17:53:12 UTC
This is also happening on RHEL6. Did you forget to compile in a valid scheduler for pbs_sched in the last update?

Kevin,

Sorry you are having issues trying to schedule nodes. I'm not sure whether this is related to the new NUMA support that I put in a few months ago. I've never run multiple queues before; my testing is primarily with a default queue setup, where MPI jobs can be scheduled and run just fine. It will take me some time to try to reproduce your scheduling environment, so in the meantime, have you run this issue by the mailing list yet?

Thanks,
- David Brown

No, the mailing list is not much help, just vague comments that pbs_sched is broken on 4.2 and vague statements that somebody somehow got it working again. This problem started with the last update from EPEL. Our RHEL5 boxes are working fine, but they weren't updated. We didn't change our configuration; we just updated the software and the scheduler stopped working. Luckily our RHEL6/RHEL7 boxes don't currently use Torque in production. But we are doing a push to RHEL7 this summer and a broken Torque is not good.

Two things that I noticed: pbs_sched stopped listening on the loopback device, and I had to use the '-l' switch with pbs_server to force it to communicate with pbs_sched. netstat shows the connections, but the scheduler doesn't seem to want to schedule.
Our RHEL6/RHEL7 boxes show this:

    [root@wsipc-scm-01 Resource]# netstat -tapn | grep pbs
    tcp 0 0 0.0.0.0:9501 0.0.0.0:* LISTEN 32891/pbs_server
    tcp 0 0 0.0.0.0:9502 0.0.0.0:* LISTEN 33329/pbs_mom
    tcp 0 0 0.0.0.0:9503 0.0.0.0:* LISTEN 33329/pbs_mom
    tcp 0 0 10.1.254.181:9504 0.0.0.0:* LISTEN 15237/pbs_sched
    tcp 1 0 10.1.254.181:843 10.1.254.181:9504 CLOSE_WAIT 32891/pbs_server

Our RHEL5 boxes show this:

    [root@redhat-test-03 ~]# netstat -tapn | grep pbs
    tcp 0 0 10.1.252.43:9504 0.0.0.0:* LISTEN 17168/pbs_sched
    tcp 0 0 0.0.0.0:9501 0.0.0.0:* LISTEN 17121/pbs_server
    tcp 0 0 0.0.0.0:9502 0.0.0.0:* LISTEN 17147/pbs_mom
    tcp 0 0 0.0.0.0:9503 0.0.0.0:* LISTEN 17147/pbs_mom
    [root@redhat-test-03 ~]#

Updating to v6.0.1 from the Adaptive Computing web site fixes these problems. I compiled the code using the provided spec file. After some minor configurations, it just worked.

torque-4.2.10-11.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670

torque-4.2.10-11.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670
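
For anyone testing the update, the netstat comparison reported earlier in this bug can be checked mechanically. The following is only a rough sketch, not part of Torque: the function name `check_pbs_sockets`, the returned dictionary, and the embedded sample (copied from the RHEL6/RHEL7 output in this report) are illustrative assumptions. It lists the address pbs_sched is listening on and any pbs connection left in CLOSE_WAIT. Note that the working RHEL5 output also shows pbs_sched bound to a single address, so the CLOSE_WAIT entry is the visible difference between the two captures.

```python
# Sketch only: parse `netstat -tapn | grep pbs` output and flag the symptoms
# discussed in this bug. Names and return format are assumptions.

def check_pbs_sockets(netstat_text):
    """Return pbs_sched listen addresses and pbs sockets stuck in CLOSE_WAIT."""
    sched_listen = []
    close_wait = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state, program = fields[3], fields[4], fields[5], fields[6]
        if "pbs_" not in program:
            continue
        if state == "LISTEN" and program.endswith("/pbs_sched"):
            sched_listen.append(local)
        elif state == "CLOSE_WAIT":
            close_wait.append((program, local, foreign))
    # Reachable over loopback only if bound to 0.0.0.0 or 127.0.0.1
    # (IPv4 only in this sketch).
    on_loopback = any(
        addr.rsplit(":", 1)[0] in ("0.0.0.0", "127.0.0.1") for addr in sched_listen
    )
    return {
        "sched_listen": sched_listen,
        "sched_on_loopback": on_loopback,
        "close_wait": close_wait,
    }


# Sample copied from the RHEL6/RHEL7 output in this report.
RHEL7_SAMPLE = """\
tcp 0 0 0.0.0.0:9501 0.0.0.0:* LISTEN 32891/pbs_server
tcp 0 0 0.0.0.0:9502 0.0.0.0:* LISTEN 33329/pbs_mom
tcp 0 0 0.0.0.0:9503 0.0.0.0:* LISTEN 33329/pbs_mom
tcp 0 0 10.1.254.181:9504 0.0.0.0:* LISTEN 15237/pbs_sched
tcp 1 0 10.1.254.181:843 10.1.254.181:9504 CLOSE_WAIT 32891/pbs_server
"""

if __name__ == "__main__":
    print(check_pbs_sockets(RHEL7_SAMPLE))
```

In practice the text would come from running the same netstat command on the server; a non-empty `close_wait` list after installing the test update would suggest the server-to-scheduler connection is still not being serviced.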