Bug 1155291
Summary: | hang in test_lock | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Dan Horák <dan> |
Component: | kernel | Assignee: | Kyle McMartin <kmcmartin> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | gansalmon, itamar, jcajka, jcm, jonathan, karsten, kdudka, kernel-maint, kmcmartin, madhu.chinakonda, mchehab, mjuszkie, moceap, mtoman, pbrobinson, peterm, zbyszek |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2014-11-04 09:23:54 UTC | Type: | Bug |
Bug Blocks: | 467765, 1071880, 922257, 1051573 |
Description
Dan Horák
2014-10-21 20:08:57 UTC
what might help
- use 1 CPU (disable all except one)
- utilize the system by e.g. running a kernel build in parallel
- retry

Thanks for the reproducer. Generally, we want to run it via the GNU test-driver script rather than directly:

./test-driver --test-name test-lock --log-file test-lock.log --trs-file test-lock.trs --color-tests no --enable-hard-errors s --expect-failure no -- ./test-lock

It will lock up after an arbitrary number of attempts; that might be the first one, or the third, etc.

Some analysis shows that we are failing in a one-shot threading test routine in which the test's main "test_once" function spawns THREAD_COUNT (10) "once_contender_thread"(s) that will each wait for a POSIX rwlock to be fired by the main thread, and then repeat this 50,000 times. After an arbitrary number of iterations the main thread sees that one (random) thread is not ready. That would be the case if the thread was sitting waiting for a signal to wake up after blocking on gl_rwlock_rdlock (which is actually a futex when translated into glibc pthreads). The threads use these rwlocks after the first iteration (repeat).

So. The whole thing smells (sadly) like some kind of kernel futex bug. It's odd that this affects several architectures (I tried this on AArch64 Fedora 21). Has something dramatic changed in futexes upstream in glibc or the kernel very recently? Does anyone have some thoughts about the best way to triage the kernel futex code here perhaps? I'm too tired tonight.

For the interim I can suggest a couple of quick *hacks*. For one, you could disable the test entirely (which you won't like). For another, you can change #define ENABLE_DEBUGGING to 1 instead of 0 via a small patch to test-lock.c, since the interaction caused by the logging output invariably seems to result in the tests completing in the various quick runs I did here tonight. With debugging set, the behavior of the test would otherwise be identical to not setting it. That is the ugliest and nastiest approach, I agree.

Jon.

Reported upstream: http://savannah.gnu.org/bugs/?43487

Please try a kernel after 76835b0ebf8a7fe85beb03c75121419a7dec52f0 has been applied. I believe this is a bug in the futex code due to a missing barrier. Futexes are used to back the NPTL POSIX pthreads that are used in the test case.

jwb: This will get fixed automagically today. It was included in the 3.16.7 and 3.17.2 stable releases that just happened.