
Bug 1155291

Summary: hang in test_lock
Product: [Fedora] Fedora
Reporter: Dan Horák <dan>
Component: kernel
Assignee: Kyle McMartin <kmcmartin>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rawhide
CC: gansalmon, itamar, jcajka, jcm, jonathan, karsten, kdudka, kernel-maint, kmcmartin, madhu.chinakonda, mchehab, mjuszkie, moceap, mtoman, pbrobinson, peterm, zbyszek
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-04 09:23:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 467765, 1071880, 922257, 1051573

Description Dan Horák 2014-10-21 20:08:57 UTC
We are seeing a hung test_lock process during many builds (such as packages bundling gnulib) on various arches (arm, s390, but mostly on ppc*). The problem might or might not be in gnulib; other possible culprits are glibc and the kernel. ppc64 and ppc64le are probably the best candidates for reproducing it.

See https://fedoraproject.org/wiki/Architectures/PowerPC/WorkQueue for a minimalized test case.

Comment 1 Dan Horák 2014-10-21 20:15:39 UTC
What might help:
- use 1 CPU (disable all except one)
- load the system, e.g. by running a kernel build in parallel
- retry

Comment 2 Jon Masters 2014-10-22 07:03:54 UTC
Thanks for the reproducer. Generally, we want to run it via the GNU test-driver script rather than invoking the test binary directly:

./test-driver --test-name test-lock --log-file test-lock.log --trs-file test-lock.trs --color-tests no --enable-hard-errors yes --expect-failure no -- ./test-lock

It will lock up after an arbitrary number of attempts; that might be the first one, the third, etc. Some analysis shows that we are failing in a one-shot threading test routine: the test's main "test_once" function spawns THREAD_COUNT (10) "once_contender_thread" threads that each wait for a POSIX rwlock to be fired by the main thread, and this is repeated 50,000 times. After an arbitrary number of iterations the main thread sees that one (random) thread is not ready. That would be the case if it were still waiting for a signal to wake up after blocking in gl_rwlock_rdlock (which is actually a futex wait once translated into glibc pthreads). The threads use these rwlocks after the first iteration (repeat).
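To make the structure concrete, here is a minimal sketch of the pattern described above. It is not the actual gnulib test-lock.c; the REPEAT constant, the contender() helper and the exact rwlock choreography are illustrative assumptions:

/* Sketch only: main holds a write lock as a "gate", THREAD_COUNT
 * contender threads block in rdlock (a futex wait inside glibc),
 * main opens the gate and joins them, repeated REPEAT times. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREAD_COUNT 10
#define REPEAT 50000

static pthread_rwlock_t fire_signal;

static void *contender(void *arg)
{
    (void) arg;
    pthread_rwlock_rdlock(&fire_signal);   /* blocks until main "fires" */
    pthread_rwlock_unlock(&fire_signal);
    return NULL;
}

int main(void)
{
    for (int rep = 0; rep < REPEAT; rep++)
    {
        pthread_t threads[THREAD_COUNT];

        pthread_rwlock_init(&fire_signal, NULL);
        pthread_rwlock_wrlock(&fire_signal);        /* close the gate */

        for (int i = 0; i < THREAD_COUNT; i++)
            if (pthread_create(&threads[i], NULL, contender, NULL))
                abort();

        pthread_rwlock_unlock(&fire_signal);        /* fire: wake the readers */

        for (int i = 0; i < THREAD_COUNT; i++)
            pthread_join(threads[i], NULL);         /* the hang would show up here */

        pthread_rwlock_destroy(&fire_signal);
    }
    puts("done");
    return 0;
}

Something like "cc -O2 -pthread sketch.c" should build it; it is only meant to illustrate where the blocked reader sits, not to replace the reproducer on the wiki page.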

So the whole thing smells (sadly) like some kind of kernel futex bug. It's odd that this affects several architectures (I tried this on AArch64 Fedora 21). Has something dramatic changed in futexes upstream in glibc or the kernel very recently? Does anyone have thoughts on the best way to triage the kernel futex code here? I'm too tired tonight.

For the interim I can suggest a couple of quick *hacks*. For one, you could disable the test entirely (which you won't like). For another, you can change #define ENABLE_DEBUGGING from 0 to 1 via a small patch to test-lock.c: the timing interaction caused by the logging output invariably resulted in the test completing in the various quick runs I did here tonight. With debugging enabled, the behavior of the test is otherwise identical to not setting it. That is the ugliest and nastiest approach, I agree.
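For reference, a minimal sketch of that second workaround, assuming test-lock.c gates its debug output behind the plain preprocessor define mentioned above:

/* test-lock.c, workaround only: flip the existing debugging switch.
 * The extra logging perturbs the timing enough that the hang stopped
 * reproducing in quick runs; test semantics are otherwise unchanged. */
#define ENABLE_DEBUGGING 1   /* was 0 */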

Jon.

Comment 3 Marcin Juszkiewicz 2014-10-28 12:16:46 UTC
Reported upstream: http://savannah.gnu.org/bugs/?43487

Comment 4 Jon Masters 2014-10-29 04:29:01 UTC
Please try a kernel with 76835b0ebf8a7fe85beb03c75121419a7dec52f0 applied. I believe this is a bug in the kernel futex code due to a missing barrier. Futexes back the NPTL POSIX threads that the test case uses.

Comment 6 Dan Horák 2014-11-04 09:23:54 UTC
jwb: This will get fixed automagically today. It was included in the 3.16.7 and 3.17.2 stable releases that just happened.