Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.
Bug 1412204 - rust network test suite stuck
Summary: rust network test suite stuck
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: rust
Version: 25
Hardware: s390x
OS: Linux
unspecified
unspecified
Target Milestone: ---
Assignee: Rust SIG
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: ZedoraTracker 1410952
TreeView+ depends on / blocked
 
Reported: 2017-01-11 14:17 UTC by Dan Horák
Modified: 2017-06-20 13:50 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-20 13:50:39 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
IBM Linux Technology Center 150481 0 None None None 2017-01-11 16:14:36 UTC

Description Dan Horák 2017-01-11 14:17:41 UTC
rust network tests are stuck in F-24 and F-25 with following in the build.log
...
test net::tcp::tests::listen_localhost ... ok
test net::tcp::tests::multiple_connect_interleaved_greedy_schedule ... ok
test net::tcp::tests::multiple_connect_interleaved_lazy_schedule ... ok
test net::tcp::tests::multiple_connect_serial ... ok
test net::tcp::tests::nodelay ... ok
test net::tcp::tests::partial_read ... test net::tcp::tests::partial_read has been running for over 60 seconds

this is from a single CPU builder

strace says:
[root@fedora2 ~]# strace -p 15604
strace: Process 15604 attached
futex(0x2aa430c3064, FUTEX_WAIT_PRIVATE, 731, NULL

^Cstrace: Process 15604 detached
 <detached ...>

gdb says:
(gdb) attach 15604
Attaching to process 15604
[New LWP 16084]
warning: .dynamic section for "target:/lib64/libnss_myhostname.so.2" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "target:/lib64/libselinux.so.1" is not at the expected address (wrong library or version mismatch?)
warning: .dynamic section for "target:/lib64/libcap.so.2" is not at the expected address (wrong library or version mismatch?)
Missing separate debuginfo for target:/lib64/libdl.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/38/593be8033a61a0499ee4e28870e60ab80bbea4.debug
Missing separate debuginfo for target:/lib64/libpthread.so.0
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/bb/99d4fe8c2865e9f48573fb449d0e1769d55bb8.debug
warning: Expected absolute pathname for libpthread in the inferior, but got target:/lib64/libpthread.so.0.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Missing separate debuginfo for target:/lib64/libgcc_s.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/10/881b8838e53886f108b2127a1413911411ed37.debug
Missing separate debuginfo for target:/lib64/libc.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/41/850117e1e0b3c4740b575c86ee5b0f73d54a12.debug
Missing separate debuginfo for target:/lib/ld64.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/5a/8703e244406d2953cd2bde426de7789dfa4a6e.debug
Missing separate debuginfo for target:/lib64/libm.so.6
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/f3/a157f9f2ad9e951184189b497633341b030a95.debug
Missing separate debuginfo for target:/lib64/libnss_files.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/52/fedee1ab9055a8e577e80c6bb793e5c4d03a09.debug
Missing separate debuginfo for target:/lib64/libnss_dns.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/79/63fe15529e79342d127cd0dfc9fc066cc12f64.debug
Missing separate debuginfo for target:/lib64/libresolv.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/88/6e62846eed661a4c4eb069d5d5bc37e2568e19.debug
Missing separate debuginfo for target:/lib64/libnss_myhostname.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/e6/00892dc4b3dffc066301b122dd57c380aff27a.debug
Missing separate debuginfo for target:/lib64/libselinux.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/b5/2444da828bbd039796fa6db6a1a36822ef092d.debug
Missing separate debuginfo for target:/lib64/libcap.so.2
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/d3/868bd1d051c8edfdff64e5312a79ad35ae040a.debug
Missing separate debuginfo for target:/lib64/librt.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/7d/baf95deeefda8c14af99dc2ec1d1aa494720f4.debug
Missing separate debuginfo for target:/lib64/libpcre.so.1
Try: dnf --enablerepo='*debug*' install /usr/lib/debug/.build-id/0d/8b4758547c001d3132199637dc873c33e9a2e9.debug
warning: Expected absolute pathname for libpthread in the inferior, but got target:/lib64/libpthread.so.0.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x000003ffba40e2ac in pthread_rwlock_tryrdlock () from target:/lib64/libpthread.so.0
(gdb) where
#0  0x000003ffba40e2ac in pthread_rwlock_tryrdlock () from target:/lib64/libpthread.so.0
#1  0x0000000000000020 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

The complains are probably due the chroot environment (F-25 chroot on F-24).


Version-Release number of selected component (if applicable):
rust-1.14.0-1.fc25
llvm-3.8.1-2.fc25
glibc-2.24-4.fc25.s390x

rust-1.14.0-1.fc24
llvm-3.8.1-2.fc24
glibc-2.23.1-11.fc24.s390x

kernel-4.8.4-200.fc24.s390x (kernel on builder)

Comment 1 Dan Horák 2017-01-11 14:43:50 UTC
potential real cause is in bug 1410052

Comment 2 Josh Stone 2017-01-12 00:31:33 UTC
I believe this is another codegen bug in LLVM 3.8.  All of the rawhide s390x builds have been fine, so I think we can rule out kernel problems in the builder.  Then I tried a test binary compiled in f25, running in a rawhide chroot, and that still hangs, so it doesn't seem to be fixed by newer glibc etc.  Then I tried in f25 with the bundled LLVM (3.9-ish), and that has been working perfectly!

As for blaming pthread_rwlock_tryrdlock, the gdb errors need to be resolved before you can trust the reported locations.  I had similar errors as you trying to attach in the chroot, and found that the buildids that gdb was looking at were from the host versions of those libraries.  (I'm not even sure how it's getting there from the chroot!)

But it does work to launch the test binary directly from gdb, then I see:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x3fffdd022a0 (LWP 52925) "stdtest-s390x-u" 0x000003fffdfb162e in __pthread_cond_timedwait (cond=0x2aa002c7060, mutex=0x2aa002c7010, abstime=0x3ffffffaeb8) at pthread_cond_timedwait.c:198
  3    Thread 0x3fffdb00910 (LWP 52929) "net::tcp::tests" __pthread_cond_wait (cond=0x3fff0000a20, mutex=0x3fff00014b0) at pthread_cond_wait.c:189
  17   Thread 0x3fffdd00910 (LWP 52943) "net::tcp::tests" __pthread_cond_wait (cond=0x3fff8000d90, mutex=0x3fff8001100) at pthread_cond_wait.c:189

Thread 1 is just the main thread spawning tests.

(gdb) thread 3
[Switching to thread 3 (Thread 0x3fffdb00910 (LWP 52929))]
#0  __pthread_cond_wait (cond=0x3fff0000a20, mutex=0x3fff00014b0) at pthread_cond_wait.c:189
189           __pthread_disable_asynccancel (cbuffer.oldtype);
(gdb) backtrace
#0  __pthread_cond_wait (cond=0x3fff0000a20, mutex=0x3fff00014b0) at pthread_cond_wait.c:189
#1  0x000002aa0016fc4a in std::sys::imp::condvar::Condvar::wait (self=<optimized out>, mutex=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sys/unix/condvar.rs:64
#2  std::sys_common::condvar::Condvar::wait (self=<optimized out>, mutex=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sys_common/condvar.rs:51
#3  std::sync::condvar::Condvar::wait<bool> (self=<optimized out>, guard=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/condvar.rs:125
#4  0x000002aa0007cfc2 in std::thread::park () at /builddir/build/BUILD/rustc-1.14.0/src/libstd/thread/mod.rs:466
#5  0x000002aa0014d212 in std::sync::mpsc::blocking::WaitToken::wait (self=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/blocking.rs:81
#6  std::sync::mpsc::shared::Packet<(std::net::tcp::TcpStream, std::net::addr::SocketAddr)>::recv<(std::net::tcp::TcpStream, std::net::addr::SocketAddr)> (self=<optimized out>, deadline=...)
    at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/shared.rs:237
#7  std::sync::mpsc::Receiver<(std::net::tcp::TcpStream, std::net::addr::SocketAddr)>::recv<(std::net::tcp::TcpStream, std::net::addr::SocketAddr)> (self=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/mod.rs:883
#8  0x000002aa001c4708 in std::net::tcp::tests::clone_accept_concurrent::{{closure}} (addr=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:996
#9  0x000002aa000eb55c in std::net::tcp::tests::each_ip (f=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:444
#10 std::net::tcp::tests::clone_accept_concurrent () at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:974
#11 0x000002aa001d6616 in test::run_test::{{closure}} () at /builddir/build/BUILD/rustc-1.14.0/src/libtest/lib.rs:1265
[...]

(gdb) thread 17
[Switching to thread 17 (Thread 0x3fffdd00910 (LWP 52943))]
#0  __pthread_cond_wait (cond=0x3fff8000d90, mutex=0x3fff8001100) at pthread_cond_wait.c:189
189           __pthread_disable_asynccancel (cbuffer.oldtype);
(gdb) backtrace
#0  __pthread_cond_wait (cond=0x3fff8000d90, mutex=0x3fff8001100) at pthread_cond_wait.c:189
#1  0x000002aa0016fc4a in std::sys::imp::condvar::Condvar::wait (self=<optimized out>, mutex=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sys/unix/condvar.rs:64
#2  std::sys_common::condvar::Condvar::wait (self=<optimized out>, mutex=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sys_common/condvar.rs:51
#3  std::sync::condvar::Condvar::wait<bool> (self=<optimized out>, guard=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/condvar.rs:125
#4  0x000002aa0007cfc2 in std::thread::park () at /builddir/build/BUILD/rustc-1.14.0/src/libstd/thread/mod.rs:466
#5  0x000002aa00135bd8 in std::sync::mpsc::blocking::WaitToken::wait (self=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/blocking.rs:81
#6  std::sync::mpsc::shared::Packet<()>::recv<()> (self=<optimized out>, deadline=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/shared.rs:237
#7  0x000002aa0014b9ae in std::sync::mpsc::Receiver<()>::recv<()> (self=<optimized out>) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/sync/mpsc/mod.rs:883
#8  0x000002aa001c280c in std::net::tcp::tests::clone_while_reading::{{closure}} (addr=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:949
#9  0x000002aa000eb36c in std::net::tcp::tests::each_ip (f=...) at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:444
#10 std::net::tcp::tests::clone_while_reading () at /builddir/build/BUILD/rustc-1.14.0/src/libstd/net/tcp.rs:916
#11 0x000002aa001d6616 in test::run_test::{{closure}} () at /builddir/build/BUILD/rustc-1.14.0/src/libtest/lib.rs:1265

In both cases, they are stuck in a mpsc channel recv[1] at the end of their respective tests, waiting for their sub-threads to exit.  But we can see in GDB that there are no other threads, so those must have already exited and we missed their mpsc send!
[1] https://doc.rust-lang.org/std/sync/mpsc/struct.Receiver.html#method.recv

If I run just the sync::mpsc tests, it also gets stuck.  I think we're only seeing it get stuck in net::tcp because those are alphabetically first and happen to also use mpsc.  Rust runs tests in parallel up to the number of CPUs, so once we get a hung test for each CPU, we're hosed.


TL;DR Rust's mpsc is misbehaving, but then it works with any LLVM 3.9, so it appears to be a codegen issue with 3.8.

Comment 3 Fedora Admin XMLRPC Client 2017-01-24 18:42:05 UTC
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.

Comment 4 Dan Horák 2017-06-20 13:50:39 UTC
resolved via bug 1410952 (llvm got updated to 3.9 in F-25)


Note You need to log in before you can comment on or make changes to this bug.