
Bug 1914777

Summary: bout++ fails to build with Python 3.10: test-multigrid_laplace - timeout
Product: [Fedora] Fedora
Reporter: Tomáš Hrnčiar <thrnciar>
Component: bout++
Assignee: david08741
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: david08741, mhroncok, thrnciar
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-17 14:29:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1890881

Description Tomáš Hrnčiar 2021-01-11 08:38:25 UTC
bout++ fails to build with Python 3.10.0a4.

======= FAILURES ========

----- test-multigrid_laplace -----
rm: cannot remove 'data/BOUT.dmp.*.nc': No such file or directory

(It is likely that a timeout occured)
======= 1 failed in 929.92 seconds ========
make: *** [makefile:49: check-integrated-tests] Error 1

For the build logs, see:
https://copr-be.cloud.fedoraproject.org/results/@python/python3.10/fedora-rawhide-x86_64/01868967-bout++/

For all our attempts to build bout++ with Python 3.10, see:
https://copr.fedorainfracloud.org/coprs/g/python/python3.10/package/bout++/

Testing and mass rebuild of packages is happening in copr. You can follow these instructions to test locally in mock if your package builds with Python 3.10:
https://copr.fedorainfracloud.org/coprs/g/python/python3.10/

Let us know here if you have any questions.

Python 3.10 will be included in Fedora 35. To make that update smoother, we're building Fedora packages with early pre-releases of Python 3.10.
A build failure prevents us from testing all dependent packages (transitive [Build]Requires), so if this package is required a lot, it's important for us to get it fixed soon.
We'd appreciate help from the people who know this package best, but if you don't want to work on this now, let us know so we can try to work around it on our side.

Comment 1 Miro Hrončok 2021-01-11 10:29:56 UTC
IIRC this should only happen in Copr and not Koji. A workaround is to enable network access.

See https://bugzilla.redhat.com/show_bug.cgi?id=1793612#c1 for details.

Comment 2 david08741 2021-01-11 11:06:53 UTC
I don't think it is that simple; the MPI issue is, I think, fixed, at least on rawhide.

The test should not be particularly slow either: it normally takes 20 to 30 seconds, well below the 600-second timeout.

I will try to investigate this, and thus keep the bug open.

Comment 3 david08741 2021-01-12 15:52:50 UTC
I am tempted to say the issue is that Copr does not have enough cores. Even though the test only uses 3 threads, that might be sufficient to trigger the timeout.
On an old 2-core system the test finishes in about 4 seconds if it is using 1 thread, but with 3 threads it takes over 4 minutes.
I am not sure what Copr is using, but I think it is also using old CPUs and very few cores (1?), in which case it might take well more than 10 minutes.
On a decent 64-core system the single-thread version takes 1.3 seconds, and the 3-thread version takes 1.0 seconds.

If this keeps being an issue, I can disable the test on Copr, or when only one core is available.

The underlying issue is that MPI is optimized to be fast on non-oversubscribed systems. While in the real world MPI should never be used oversubscribed, this is common for testing, in which case the "idle" threads are busy waiting on the other threads ...
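As a rough illustration of the oversubscription check described above, here is a minimal Python sketch. It is not taken from the bout++ test suite; the function names and the `nproc` parameter are hypothetical, and the only assumption is that a test knows how many MPI processes it wants to start.

```python
import os


def available_cores() -> int:
    """Return the number of CPUs this process is allowed to run on."""
    try:
        # Linux only: respects CPU affinity limits set for the process.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total CPU count.
        return os.cpu_count() or 1


def is_oversubscribed(nproc: int, cores: int) -> bool:
    """True if more MPI processes are requested than cores are available."""
    return nproc > cores


if __name__ == "__main__":
    nproc = 3  # hypothetical: the process count the test asks for
    cores = available_cores()
    if is_oversubscribed(nproc, cores):
        print(f"{nproc} processes on {cores} core(s): oversubscribed, "
              "expect busy-waiting to slow the test down")
    else:
        print(f"{nproc} processes on {cores} core(s): not oversubscribed")
```

A test runner could use such a check to skip the test or to allow a longer timeout on small builders.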

Comment 4 Miro Hrončok 2021-01-12 16:06:38 UTC
Any explanation why it works with network enabled?

Comment 5 david08741 2021-01-12 16:12:46 UTC
Pure luck - I guess ...
Timeout is 600 seconds, in the case with network enabled it took:
test-multigrid_laplace           ✓ 588.655 s

In that case increasing the timeout might be the easiest solution ...

Comment 6 david08741 2021-01-17 14:29:05 UTC
I have increased the timeout from 10m to 15m; I think that should fix the issue.
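To put the numbers from this thread in perspective, the following small Python sketch computes how much headroom the one passing Copr run (588.655 s, from comment 5) leaves under the old and the new timeout. The `headroom` helper is hypothetical and only does the arithmetic; it says nothing about how long future runs will take.

```python
def headroom(runtime_s: float, timeout_s: float) -> float:
    """Fraction of the timeout left unused by a run of the given length."""
    return (timeout_s - runtime_s) / timeout_s


observed = 588.655  # seconds, the passing run with network enabled

for timeout in (600, 900):  # old (10m) and new (15m) timeout in seconds
    print(f"timeout {timeout} s: {headroom(observed, timeout):.1%} headroom")

# prints:
# timeout 600 s: 1.9% headroom
# timeout 900 s: 34.6% headroom
```

With under 2% headroom at 600 s, small variations in builder load are enough to flip the result, which fits the "pure luck" explanation above.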