Bug 1321154
| Field | Value |
|---|---|
| Summary | numa enabled torque don't work |
| Product | [Fedora] Fedora EPEL |
| Component | torque |
| Status | CLOSED EOL |
| Severity | medium |
| Priority | low |
| Version | el6 |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | nucleo <alekcejk> |
| Assignee | David Brown <david.brown> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | aflyhorse, agajania, Anthony.Thyssen, a.rohou, david.brown, fotis, garrick, gmn, g.roest, j4jes, karlthered, kevin, olivier, sdainard, troels |
| Keywords | Reopened |
| Fixed In Version | torque-4.2.10-10.fc24 torque-4.2.10-10.el7 torque-4.2.10-10.fc23 torque-4.2.10-10.fc22 |
| Doc Type | Bug Fix |
| Story Points | --- |
| Last Closed | 2020-11-30 15:58:50 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
Description
nucleo
2016-03-24 19:33:31 UTC
According to the document[1] from AdaptiveComputing, the mom.layout file is either created manually or generated with the contrib Perl script "mom_gencfg". Maybe we should find a way to let the single and NUMA configurations co-exist.

[1] http://docs.adaptivecomputing.com/torque/4-2-10/help.htm#topics/1-installConfig/buildingWithNUMA.htm%3FTocPath%3D1.0%2520Installation%2520and%2520configuration|1.7%2520TORQUE%2520on%2520NUMA%2520systems|_____2

Also, if NUMA support breaks the "minor update friendly" philosophy of EPEL, I suggest removing it.

Please remove the NUMA support from this package group, or create an alternate package group. My cluster has been dead for almost 2 weeks and the scientists are getting cranky. This feature does not play well with the MAUI scheduler and, apparently, not at all with the built-in scheduler (http://www.clusterresources.com/pipermail/torqueusers/2013-September/016136.html). Requiring this feature means having to introduce a whole host of changes to the Torque environment as well as forcing a recompile of OpenMPI (last I checked, the EPEL version of openmpi does not have Torque support) and MAUI, which then means recompiling all the analysis applications, etc...

I've tried... I really have. I even tried rebuilding the package group from the src rpm, but when I remove the enable-numa switch from the torque.spec file it still builds with NUMA support (not sure what I'm missing there). My first (anguished) post here, so please excuse my noobness.

Okay, so I didn't realize you were having this much of an issue. If you have a version of the torque package that you know works, there's a plugin for yum called yum-plugin-versionlock; that package has some configs you can set up to lock torque at the version you know works for you. Also, you can yum downgrade torque* to get to a previous version that hopefully works better for you.
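The downgrade-and-lock advice above can be sketched as a small script. This is a minimal sketch, not a tested procedure for this package: it assumes yum-plugin-versionlock is installable from the enabled repos and that an older torque build is still available to downgrade to (a later comment in this thread names 4.2.10-5 as the last build without NUMA). The `run` helper and the `DRY_RUN` variable are hypothetical names added for illustration; with the default `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Sketch: downgrade torque, then pin it with yum-plugin-versionlock.
# DRY_RUN (hypothetical knob) defaults to 1, so nothing is executed
# unless the script is started with DRY_RUN=0 as root.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

pin_torque() {
    # Install the plugin that provides the "yum versionlock" subcommand.
    run yum install -y yum-plugin-versionlock
    # Step back to an older build, if the repos still carry one.
    run yum downgrade 'torque*'
    # Lock whatever torque packages are installed now at that version.
    run yum versionlock add 'torque*'
    # Show the resulting lock list for verification.
    run yum versionlock list
}

pin_torque
```

If the downgrade step reports that only upgrades are available, as one reporter below found, the lock can still be applied after installing the older RPMs by hand.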
I'm working through some tests to try and reproduce the situation you are describing, but it is taking some time as I'm volunteering most of my time for this and the virtual environments where I test didn't take this situation into account.

Thanks for working on this David! yum downgrade was the first thing I tried but all I get is "Only Upgrade available on package", etc... Not sure what I messed up there. I'll look into the plugin you mentioned. The "problem" (for me anyway) got ugly (i.e. after I worked out the various mom.layout and cpuset issues) when openmpi started barfing on the various shared memory configuration issues. I thought I had worked through most of those over the last few days, but now I just can't get MAUI to send jobs to more than one physical node; all jobs run on a single node regardless of how it's specified in the PBS script. MPI is doing the allocation correctly, but then MAUI (and/or the pbs_mom process) just ignores it...

NUMA support was enabled in 4.2.10-6, so the last working version is 4.2.10-5. It can be downloaded here: https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/5.el6/
Older packages for other EPEL and Fedora releases can be found here: https://kojipkgs.fedoraproject.org//packages/torque/4.2.10/

Thanks nucleo! Very educational.

Okay, I got some time to test things out. Just for reference for everyone involved, I think I mentioned this on another bug and on the torque users mailing list. I use chef to do some testing to build a virtual cluster and set up torque: https://github.com/dmlb2000/torque-cookbook. Check out the templates directory; there are several files that need to be rendered correctly to make things work. For the NUMA support I had to change the server's nodes file, and each mom got the mom.layout file. I've tested multiple CPUs with multiple nodes (2x2) and am able to run MPI jobs just fine. However, the RHEL/CentOS version of openmpi is built without torque support.
This means that you have to set up your hostfile and specify the `-np` option to mpirun in order to use OpenMPI in a run and make it work.

    #PBS -l nodes=2:ppn=2
    mpirun -hostfile hostfile -np 4 ./mpi_hello

As MAUI is not in EPEL, I can't really set up and support a configuration of that, and I consider it out of scope of support from EPEL's point of view. As I don't have a version of MAUI to target, I can't ensure interoperability between the two pieces of software. If you are having issues building and running torque with MAUI or MOAB you should ask the user mailing list as well to get help.

As to the status of the original bug, I could include a basic mom.layout file, the one from the chef cookbook for example. However, this would have to be changed for most installations as that just flattens the cores on the node.

torque-4.2.10-10.el6 has been submitted as an update to Fedora EPEL 6. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-7a55539098

torque-4.2.10-10.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-596ccc8373

torque-4.2.10-10.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-43b6ce44b3

torque-4.2.10-10.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2016-b21f08b188

torque-4.2.10-10.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2016-830fdb2304

torque-4.2.10-10.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-43b6ce44b3

torque-4.2.10-10.el7 has been pushed to the Fedora EPEL 7 testing repository.
If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-596ccc8373

torque-4.2.10-10.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-830fdb2304

torque-4.2.10-10.el6 has been pushed to the Fedora EPEL 6 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-7a55539098

torque-4.2.10-10.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-b21f08b188

After installing the update torque-4.2.10-10.el6.x86_64, the pbs_mom service starts but the node is shown as down. pbs_server log:

    PBS_Server.6045;Svr;PBS_Server;LOG_ERROR::get_numa_from_str, Node isn't declared to be NUMA, but mom is reporting

I solved the error by changing "/var/lib/torque/server_priv/nodes" into:

    HOSTNAME np=NP num_node_boards=1

(substitute HOSTNAME and NP with the correct values for your host). Then stop and restart pbs_server and pbs_mom. This nodes file pretends that the host is a NUMA node with only one subnode.

Using the packages in epel-testing (4.2.10-10) means that Torque/PBS works on my newly installed system.
Before, with version 4.2.10-9, pbs_mom did not start. I suggest that 4.2.10-10 be pushed to the main EPEL repos.

I found that using the epel-testing packages ver 4.2.10-10 and the NUMA configuration settings for non-NUMA nodes (<nodename> np=4 num_node_boards=1) allowed submitting jobs, but pbsnodes listed nodes as ncpus=0 and each node would only run one job concurrently. Downgrading to 4.2.10-5 solved this issue, and I can run concurrent jobs on nodes. Please fork the server builds into:
torque-server
torque-server-numa
if it is not possible to provide proper support for both configurations in one package.

So, could we use the config from comment #20 to make the default/existing non-NUMA case work, and those that want NUMA could then adjust that file for their needs? IMHO it's much better for the people wanting the new functionality to have to edit config than for people with existing working installs to have to.

(In reply to Steve D from comment #22)
> I found using the epel-testing packages ver 4.2.10-10 & the numa
> configuration settings for non-numa nodes (<nodename> np=4
> num_node_boards=1) allowed submitting jobs but pbsnodes listed nodes as
> ncpus=0 and each node would only run one job concurrently.
Hmm... Reproduced it, so my workaround is invalid.
> Please fork the server builds into:
> torque-server
> torque-server-numa
Agree. This is the best solution: make -numa and vanilla conflict with each other, and have both provide torque-server. Although I don't know how to fit both builds into a single spec file, or whether EPEL permits new packages rolling in.

(In reply to Kevin Fenzi from comment #23)
> So, could we use the config from comment #20 to make the default/existing
> non numa case work and those that want numa could then adjust that file for
> their needs?
The NUMA architecture is decided when you pass --enable-numa to ./configure. The two probably could not co-exist in the current version.
On second thought, the package which needs to be forked is torque-mom, since
> Node isn't declared to be NUMA, but mom is reporting
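The error quoted above comes from a mismatch between what the mom reports and what the server's nodes file declares. The workaround described earlier in this thread (one line per host with num_node_boards=1 in /var/lib/torque/server_priv/nodes) can be sketched as a small generator. This is a sketch under assumptions: the `nodes_entry` function and the host name `node01` are hypothetical, and the script only prints the line rather than editing the nodes file. Other comments in this thread report that the workaround leaves pbsnodes showing ncpus=0 on 4.2.10-10, and that on 4.2.10-11 the entry had to be removed again, so treat it as something to try rather than a fix.

```shell
#!/bin/sh
# Sketch: print a server_priv/nodes entry that declares a host as a
# NUMA node with a given number of node boards (default 1).
# nodes_entry and node01 are illustrative names, not part of Torque.
nodes_entry() {
    host="$1"           # hostname as pbs_server knows it
    np="$2"             # number of processors to advertise
    boards="${3:-1}"    # num_node_boards; 1 mimics a single NUMA subnode
    printf '%s np=%s num_node_boards=%s\n' "$host" "$np" "$boards"
}

# Example: a 4-core host declared with a single node board.
nodes_entry node01 4
```

After placing the printed line in the nodes file, pbs_server and pbs_mom need to be stopped and restarted, as the original workaround notes.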
Looks like more and more people have problems with torque packages built with NUMA enabled: http://www.supercluster.org/pipermail/torqueusers/2016-May/018658.html
> I recently upgraded packages and the torque packages were updated to the latest rpm versions. However, I am unable to get the nodes to active state.

Okay, after some long deliberation in my head about what to do with this and some digging into how to support both configurations in the various environments, here are my suggestions...

Forking the build for torque is bad for a couple of reasons:

A. Torque is designed to be a single build for the entire cluster of machines. Having multiple builds for sched, mom, etc. invites more confusion for users and would just result in more issues. Users would need to know that all torque-*-numa packages, and nothing else, should be installed on every machine in their cluster.

B. Torque has many different options that make builds incompatible. Forking based on NUMA just invites forking on blcr, hwloc, pam, readline, tcl/tk, etc... and the combinations just explode...

Most of the issues seem to be around EL6 and not EL7... Would supporting the NUMA build in EL7 and reverting EL6 be palatable for everyone?

The only other option I see is more management overhead for EPEL, as it involves multiple repositories with various builds that have different upgrade, configuration and change policies. However, this is higher than I can reach right now.

I'm seeing the NUMA trouble in an EL7 setting, so I'm not particularly fond of David Brown's suggestion in comment 27.

I concur with Comment 28, running EL7 here. It seems that since NUMA was not included in previous EL builds, NUMA configuration options should be removed from current builds for consistency (support for Comment 2). Also, if Comment 3 is accurate about the number of packages needing to be rebuilt and reconfigured, this seems like a deal-breaker anyway.
This way, if there is a lot of feedback from the community that NUMA support needs to be included, a separate set of packages can be built for it if the additional capacity is available on the maintainers' side. Lastly, I wonder if the long-term solution should actually be to request that upstream support a run-time configuration setting to disable NUMA support for nodes, rather than a compile-only option. But this would still require the build changes mentioned in Comment 3 for other non-torque-specific packages.

(In reply to Steve D from comment #29)
> I concur with Comment 28, running EL7 here.
Damn, I was hoping...
> It seems that since NUMA was not included in previous EL builds, that NUMA
> configuration options should be removed from current builds for consistency
> (support for Comment 2). Also if Comment 3 is accurate in the amount of
> packages needing rebuilding and re-configured this seems like a deal-breaker
> anyway.
Under the current infrastructure and support policies this seems to be the only option...
> This way if there is a lot of feedback from the community that NUMA support
> needs to be included, a separate set of packages will be built for it if the
> additional capacity is available on maintainers side.
I'd rather put my effort toward pushing a different model of support for these kinds of packages. The idea is that it would allow me to support things in EPEL more like the way packages flow through Fedora into RHEL.
> Lastly, I wonder if the long-term solution should actually be to request
> upstream to support a run time configuration setting to disable NUMA support
> for nodes rather than as a compile only option. But this would still require
> the build changes mentioned in Comment 3 for other non-torque specific
> packages.
The issue with that (for torque at least) is that the maintainers have moved on and are doing major development on torque 6 (yes, two major versions) rather than taking feature requests for this old version...
I'm currently playing around in copr to see if I can set up a system to support all workflows without forking the build into multiple packages with different names. Though this would be something to discuss with other EPEL folks on the mailing list, to see if there are other EPEL packages that could take advantage of the model.

torque-4.2.10-10.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Bug is not actually fixed, so reopening.

torque-4.2.10-10.el7 has been pushed to the Fedora EPEL 7 stable repository. If problems still persist, please make note of it in this bug report.

Not really closed as some haven't accepted the fix...

torque-4.2.10-10.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

torque-4.2.10-10.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

I can confirm comment 22 on CentOS 7: using a single-CPU compute node with "nodes=1" in mom.layout and "np=8 num_node_boards=1" in server_priv/nodes, the machine shows up in "pbsnodes -a" as having "ncpus=0". However, I can run 8 jobs (sleep 10) concurrently on this np=8 node. If you want to send the job to a specific node you have to specify the name (including -0) as shown by "pbsnodes -a".

torque-4.2.10-11.el7 has been submitted as an update to Fedora EPEL 7. https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670

torque-4.2.10-11.el7 has been pushed to the Fedora EPEL 7 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2017-6658d64670

After installing torque-4.2.10-11.el7 from the EPEL testing repo, I found that all the nodes were 'down' even though everything appeared to be running, with no errors in the error logs. After a lot of trial, error and research, I eventually (on a whim) decided to remove the "num_node_boards=1" entry from the "torque/server_priv/nodes" file and restart the server and scheduler. Suddenly the nodes were "free" and my initial test job ran. Perhaps the EPEL testing Torque 4.2.10-11 does not contain NUMA?

ALL later tests (with OpenMPI, RHEL SRPM 1.10.6-2 recompiled "--with-tm") are now responding to the Torque node allocation correctly and are no longer simply running all the jobs on the first node. That is, $PBS_NODEFILE, pbsdsh hostname and mpirun hostname are all in agreement. Phew...

I had tried installing torque using the 4.2.10-10 version and had the same issue where pbs_mom would not bring the nodes up because NUMA was detected. I found this page, enabled epel-testing and did a yum update torque* on the frontend and backend nodes, and presto, everything is working again happily with maui/torque/munge/trqauth/kickstart, as it was before my leap from CentOS 5 to CentOS 7. Now I'm going to give this a go on a RHEL7 Stacki cluster! Thanks!

Strange thing: I'm having a problem with 4.2.10-11 where ulimit -l returns 'unlimited' from the terminal, the way I set it in limits.conf on all my nodes, yet if I run a qsub job that echoes ulimit -l into a txt file, it gives me '64'. So torque ignores whatever ulimit is set via pam.d and keeps the max locked-memory value at 64! As a result, my jobs fail with:

    ipath_userinit: mmap of rcvhdrq failed: Resource temporarily unavailable
    --------------------------------------------------------------------------
    PSM was unable to open an endpoint.

Why is torque insisting on this value?
A few forums say the fix is to set ulimit -l unlimited inside the /etc/init.d/pbs_mom script, but what's the equivalent in CentOS 7? Is there a pbs_mom file in my /var/lib/torque where I can set this on all nodes?

In CentOS 7 (and Red Hat) the launcher is the systemd 'service script' /usr/lib/systemd/system/pbs_mom.service. This defines when the daemon is started and what it needs before systemd starts it, specifically that the syslog, networking, and trqauthd daemons are running. Then all it does is run /usr/sbin/pbs_mom -- nothing special. If nothing else, you could wrap pbs_mom with a script that sets the ulimit before exec'ing the real pbs_mom. However, this has NOTHING to do with the bug, and probably should have been posted on some other forum.

This message is a reminder that EPEL 6 is nearing its end of life. Fedora will stop maintaining and issuing updates for EPEL 6 on 2020-11-30. It is our policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of 'el6'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later EPEL version. Thank you for reporting this issue and we are sorry that we were not able to fix it before EPEL 6 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

EPEL el6 changed to end-of-life (EOL) status on 2020-11-30. EPEL el6 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of EPEL please feel free to reopen this bug against that version.
If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
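On the ulimit -l question raised earlier in this thread: daemons started by systemd do not pass through a PAM login session, so limits.conf is generally not applied to them, which would explain why jobs see 64 while a terminal sees 'unlimited'. The reply above suggests a wrapper script; an alternative that systemd itself provides is a drop-in file setting `LimitMEMLOCK` for the pbs_mom unit. The sketch below only prints such a drop-in. The function name `memlock_dropin` and the file name `memlock.conf` are hypothetical, and whether this resolves the PSM error on a given cluster is an assumption that has not been verified here.

```shell
#!/bin/sh
# Sketch: print a systemd drop-in that raises the locked-memory limit
# for the pbs_mom service. Nothing is written to the system here.
# memlock_dropin is an illustrative helper name.
memlock_dropin() {
    limit="${1:-infinity}"   # "infinity" corresponds to ulimit -l unlimited
    printf '[Service]\nLimitMEMLOCK=%s\n' "$limit"
}

memlock_dropin
```

One way to use it would be to save the output as /etc/systemd/system/pbs_mom.service.d/memlock.conf on each compute node, then run `systemctl daemon-reload` and `systemctl restart pbs_mom`, and resubmit the job that echoes ulimit -l to check the result.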