Bug 971790
Summary:          bdrv_rwv_co() loops indefinitely when using the curl driver in qemu with curl --enable-threaded-resolver
Product:          Fedora
Component:        qemu
Version:          20
Status:           CLOSED UPSTREAM
Severity:         unspecified
Priority:         unspecified
Keywords:         Reopened
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Richard W.M. Jones <rjones>
Assignee:         Fedora Virtualization Maintainers <virt-maint>
QA Contact:       Fedora Extras Quality Assurance <extras-qa>
CC:               amit.shah, berrange, cfergeau, crobinso, dwmw2, fabian.deutsch, famz, itamar, kdudka, kwolf, lagarcia, paul, pbonzini, rjones, scottt.tw, virt-maint
Doc Type:         Bug Fix
Type:             Bug
Last Closed:      2014-01-16 09:59:22 UTC
Bug Blocks:       910269
Description
Richard W.M. Jones
2013-06-07 09:41:44 UTC
By a process of elimination, the problem is this ./configure option:

    --enable-threaded-resolver

Removing just this line from the spec file fixes curl in Fedora. Similarly, adding this option to curl built from git breaks it. I will ask upstream.

Kamil Dudka

How are you confirming that this is a bug in curl? I am able to download the first 512 bytes of the remote file with the curl tool just fine:

```
[kdudka@f20 ~]$ curl -sr0-511 http://libguestfs.org/index.html | hexdump -C
00000000  3c 21 44 4f 43 54 59 50  45 20 68 74 6d 6c 20 50  |<!DOCTYPE html P|
00000010  55 42 4c 49 43 20 22 2d  2f 2f 57 33 43 2f 2f 44  |UBLIC "-//W3C//D|
00000020  54 44 20 58 48 54 4d 4c  20 31 2e 30 20 53 74 72  |TD XHTML 1.0 Str|
00000030  69 63 74 2f 2f 45 4e 22  20 22 68 74 74 70 3a 2f  |ict//EN" "http:/|
00000040  2f 77 77 77 2e 77 33 2e  6f 72 67 2f 54 52 2f 78  |/www.w3.org/TR/x|
00000050  68 74 6d 6c 31 2f 44 54  44 2f 78 68 74 6d 6c 31  |html1/DTD/xhtml1|
00000060  2d 73 74 72 69 63 74 2e  64 74 64 22 3e 0a 3c 68  |-strict.dtd">.<h|
00000070  74 6d 6c 20 78 6d 6c 6e  73 3d 22 68 74 74 70 3a  |tml xmlns="http:|
00000080  2f 2f 77 77 77 2e 77 33  2e 6f 72 67 2f 31 39 39  |//www.w3.org/199|
00000090  39 2f 78 68 74 6d 6c 22  20 78 6d 6c 3a 6c 61 6e  |9/xhtml" xml:lan|
000000a0  67 3d 22 65 6e 22 20 6c  61 6e 67 3d 22 65 6e 22  |g="en" lang="en"|
000000b0  3e 0a 3c 68 65 61 64 3e  0a 3c 74 69 74 6c 65 3e  |>.<head>.<title>|
000000c0  6c 69 62 67 75 65 73 74  66 73 2c 20 6c 69 62 72  |libguestfs, libr|
000000d0  61 72 79 20 66 6f 72 20  61 63 63 65 73 73 69 6e  |ary for accessin|
000000e0  67 20 61 6e 64 20 6d 6f  64 69 66 79 69 6e 67 20  |g and modifying |
000000f0  56 4d 20 64 69 73 6b 20  69 6d 61 67 65 73 3c 2f  |VM disk images</|
00000100  74 69 74 6c 65 3e 0a 3c  6d 65 74 61 20 6e 61 6d  |title>.<meta nam|
00000110  65 3d 22 64 65 73 63 72  69 70 74 69 6f 6e 22 20  |e="description" |
00000120  63 6f 6e 74 65 6e 74 3d  22 6c 69 62 67 75 65 73  |content="libgues|
00000130  74 66 73 2c 20 6c 69 62  72 61 72 79 20 61 6e 64  |tfs, library and|
00000140  20 74 6f 6f 6c 73 20 66  6f 72 20 61 63 63 65 73  | tools for acces|
00000150  73 69 6e 67 20 61 6e 64  20 6d 6f 64 69 66 79 69  |sing and modifyi|
00000160  6e 67 20 56 4d 20 64 69  73 6b 20 69 6d 61 67 65  |ng VM disk image|
00000170  73 22 20 2f 3e 0a 3c 6d  65 74 61 20 6e 61 6d 65  |s" />.<meta name|
00000180  3d 22 6b 65 79 77 6f 72  64 73 22 20 63 6f 6e 74  |="keywords" cont|
00000190  65 6e 74 3d 22 6c 69 62  67 75 65 73 74 66 73 2c  |ent="libguestfs,|
000001a0  20 67 75 65 73 74 66 69  73 68 2c 20 67 75 65 73  | guestfish, gues|
000001b0  74 66 73 2c 20 6c 69 62  67 75 65 73 74 2c 20 67  |tfs, libguest, g|
000001c0  75 65 73 74 20 69 6d 61  67 65 2c 20 76 69 72 74  |uest image, virt|
000001d0  75 61 6c 20 6d 61 63 68  69 6e 65 2c 20 64 69 73  |ual machine, dis|
000001e0  6b 20 69 6d 61 67 65 2c  20 73 75 70 65 72 6d 69  |k image, supermi|
000001f0  6e 2c 20 66 65 62 6f 6f  74 73 74 72 61 70 2c 20  |n, febootstrap, |
00000200
```

Richard W.M. Jones

Posted upstream. The message should appear here in a few hours: http://curl.haxx.se/mail/lib-2013-06/

Kamil Dudka

(In reply to Richard W.M. Jones from comment #1)
> By a process of elimination, the problem is this ./configure option:
>
> --enable-threaded-resolver

This option tells curl to perform name resolution in a separate thread so that it does not wait forever. We used to have problems with a setjmp/longjmp-based implementation of the timeout (now in RHEL-6). Using c-ares (a glibc-independent resolver) produced an avalanche of bug reports because of various nss plug-ins (mDNS, ...). Using pthread (the current setting) seems to be the best choice according to the bugzilla statistics.

> Removing just this line from the spec file fixes curl in Fedora.

Again, how are you confirming that it is curl that is broken?

Richard W.M. Jones

I've no idea if it's a bug in curl or qemu. See the message I posted upstream, when it appears.

Kamil Dudka

Could you please explain the following code of qemu-1.5.0/block.c?
```
2208│     if (qemu_in_coroutine()) {
2209│         /* Fast-path if already in coroutine context */
2210│         bdrv_rw_co_entry(&rwco);
2211│     } else {
2212│         co = qemu_coroutine_create(bdrv_rw_co_entry);
2213│         qemu_coroutine_enter(co, &rwco);
2214├>        while (rwco.ret == NOT_DONE) {
2215│             qemu_aio_wait();
2216│         }
2217│     }
```

It seems to heavily use sigsetjmp/siglongjmp, which makes the code hard to debug for newcomers... valgrind reports an invalid read:

```
==21877== Invalid read of size 8
==21877==    at 0x150574: qemu_in_coroutine (coroutine-ucontext.c:224)
==21877==    by 0x11EA38: bdrv_flush (block.c:4414)
==21877==    by 0x11EE1C: bdrv_close (block.c:1352)
==21877==    by 0x11F0B1: bdrv_delete (block.c:1592)
==21877==    by 0x16B90E: openfile (qemu-io.c:1778)
==21877==    by 0x1170F5: main (qemu-io.c:2019)
```

I suspect a clash on thread-private storage...

(In reply to Kamil Dudka from comment #6)
> Could you please explain the following code of qemu-1.5.0/block.c?
> [code quoted above]
> It seems to heavily use sigsetjmp/siglongjmp, which makes the code hard to
> debug for newcomers...

You probably want to ask someone like kwolf about this.

Kamil Dudka

Using gdb, I can see the following pthread keys are in use:

```
0 - coroutine_init() from qemu-1.5.0/coroutine-ucontext.c
1 - g_private_get_impl + 36 in section .text of /lib64/libglib-2.0.so.0
2 - init () from /lib64/libdl.so.2
3 - _PR_InitThreads in section .text of /lib64/libnspr4.so
```

The thread resolver in libcurl does not use thread-private storage of its own.

You can try configuring qemu with

    ./configure --with-coroutine=gthread

so that coroutines are turned into threads instead of using longjmp/setjmp to switch context.
It can be useful occasionally for debugging, but it may as well make the bug disappear.

As for the specific piece of code you're asking about: the qemu block layer functions for read/write requests can be called either from coroutine context, in which case they yield as soon as they have issued the asynchronous I/O request to the backend (i.e. block/curl.c in this case) and are reentered from the AIO callback invoked by the curl driver once the request has completed. Otherwise, if called outside a coroutine (like the initial request sent by qemu-io to block/raw.c), they spawn a new coroutine to process the request and synchronously wait until the coroutine has completed before they return control.

Kamil Dudka

./configure --with-coroutine=gthread does not fix (or hide) the bug. On the contrary, it makes your code hang even when running against a pthread-free libcurl. Please provide a minimal libcurl-based example if you still think the bug is there.

Richard W.M. Jones

Let's everyone just take a deep breath and wait to see if we get a response from upstream curl: http://curl.haxx.se/mail/lib-2013-06/0035.html

*** Bug 785594 has been marked as a duplicate of this bug. ***

Kamil Dudka

(In reply to Richard W.M. Jones from comment #12)
> Let's everyone just take a deep breath and wait to see
> if we get a response from upstream curl:

A response from upstream curl: http://thread.gmane.org/gmane.comp.web.curl.library/39574/focus=39590

Richard W.M. Jones

(In reply to Kamil Dudka from comment #18)
> A response from upstream curl:
> http://thread.gmane.org/gmane.comp.web.curl.library/39574/focus=39590

I spent the best part of today trying to fix this based on that comment. I don't think this is fixable in qemu currently (although we should leave this bug open, as the curl driver is certainly broken). See my patches and comments, to be attached next.
Created attachment 759175 [details]
curl: Use curl_multi_perform instead of deprecated curl_multi_socket_all.
This should be the least controversial patch: replace curl_multi_socket_all with curl_multi_perform. HOWEVER, it does not work, and I have no idea why. According to the docs and the comment received from the upstream curl developers, it should.
Created attachment 759176 [details]
curl: Register a timer function and handle timeout callbacks [timers version]
This was my original attempt to fix the bug: use qemu timers to implement curl timers. This should be straightforward. HOWEVER, it does not work. The reason it doesn't work is that the qemu aio layer, which is used to implement qemu block devices, doesn't implement timers (this seems to be a missing feature). There are other block drivers in qemu which use timers too; they must also be broken.
Created attachment 759188 [details]
curl: Register a timer function and handle timeout callbacks [busy wait version]
This is an attempt to work around the lack of timers in the qemu block device layer. It busy-waits using an event notifier (I also tried several other ways). HOWEVER, this does not work either. Either the AIO layer doesn't respond to event notifiers or I'm using them wrong, but this loops using 100% of CPU just like all the others.
If (as the comment says) this timer callback is called already during open (via curl_easy_perform), that is why it doesn't work. At that point there are no pending requests and curl_aio_flush returns 0, so the event notifier isn't polled.

This bug appears to have been reported against 'rawhide' during the Fedora 20 development cycle. Changing version to '20'. More information and the reason for this action can be found here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora20

This seems to be working on F20 now, but I'm not sure whether it was 'fixed' on the qemu side or the curl side. Regardless, closing.

Okay, it isn't that cut and dried. This is now working for me, where it was failing in the past (there was a previous bug open about this issue at one point):

```
qemu-kvm -cdrom http://192.168.1.149/trees/f20/images/boot.iso
```

However, booting from disk still hangs:

```
qemu-kvm -drive file=http://localhost/f16.raw,if=none,id=drive-virtio-disk0,readonly=on,format=http \
    -device virtio-blk-pci,scsi=off,drive=drive-virtio-disk0,id=virtio-disk0 -net none
```

However, with qemu.git and this patch applied:

https://lists.gnu.org/archive/html/qemu-devel/2014-01/msg01954.html

things do seem to work as expected, built against F20 curl. Unfortunately that patch doesn't apply cleanly to F20 qemu, because it depends on a bunch of timer infrastructure that isn't in 1.6.

Richard W.M. Jones

Unfortunately I cannot get this to work, with or without that patch. It's now failing in a different way, however:

- If I use the remote server hostname, curl starts a new process which hangs in poll(2). It's not exactly clear what it is waiting for.
- If I use an IP address (also remote), it says no such file or directory, although I see the HEAD + GET accesses in the logs on the remote server.

Behaviour is the same with and without the patch. Anyway, I don't think we can declare this fixed now.
Please try the following commands:

```
unset http_proxy
export LIBGUESTFS_BACKEND=direct
export LIBGUESTFS_HV=/path/to/x86_64-softmmu/qemu-system-x86_64
guestfish -v <<EOF
add-drive /cirros.img protocol:http readonly:true server:192.168.x.y
run
EOF
```

where you have to replace "/cirros.img" with the path to the disk image on the remote server, and "192.168.x.y" with the IP address of a remote server running Apache. That also prints out the full qemu command line.

Here's something even simpler which fails for me:

```
./qemu-img info http://192.168.0.249/scratch/cirros-0.3.1-x86_64-disk.img
```

It hangs, whether or not I use Peter's patch.

(In reply to Richard W.M. Jones from comment #28)
> Here's something even simpler which fails for me:
>
> ./qemu-img info http://192.168.0.249/scratch/cirros-0.3.1-x86_64-disk.img
>
> It hangs, whether or not I use Peter's patch.

OK, ignore this comment. It was hanging because I had a proxy set. Please try the guestfish command in comment 27, with a file located on a remote server (preferably one with a bit of lag). I don't think that testing against localhost is fair.
Cole Robinson

The command in comment #27 fails for me with this error from qemu:

```
qemu-system-x86_64: -drive file=http://192.168.1.149/f16.raw,snapshot=on,cache=writeback,id=hd0,if=none:
could not open disk image http://192.168.1.149/f16.raw: Could not open backing file:
Could not open 'http://192.168.1.149/f16.raw': No such file or directory
```

The qemu command line was:

```
/home/crobinso/src/qemu/x86_64-softmmu/qemu-system-x86_64 \
    -global virtio-blk-pci.scsi=off \
    -nodefconfig \
    -enable-fips \
    -nodefaults \
    -nographic \
    -machine accel=kvm:tcg \
    -cpu host,+kvmclock \
    -m 500 \
    -no-reboot \
    -no-hpet \
    -kernel /var/tmp/.guestfs-0/kernel.11807 \
    -initrd /var/tmp/.guestfs-0/initrd.11807 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=http://192.168.1.149/f16.raw,snapshot=on,cache=writeback,id=hd0,if=none \
    -device scsi-hd,drive=hd0 \
    -drive file=/var/tmp/.guestfs-0/root.11807,snapshot=on,id=appliance,cache=unsafe,if=none \
    -device scsi-hd,drive=appliance \
    -device virtio-serial-pci \
    -serial stdio \
    -device sga \
    -chardev socket,path=/tmp/libguestfscUstFg/guestfsd.sock,id=channel0 \
    -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \
    -append 'panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=xterm'
```

I assume that's not the error you hit; how does it fail for you?

Richard W.M. Jones

(In reply to Cole Robinson from comment #30)
> The command in comment #27 fails for me with this error from qemu:
> [...]
> I assume that's not the error you hit, how does it fail for you?

No, that *is* the error I hit! It makes no sense, because the file exists, and curl is even downloading the file (according to the server logs).
I also get a different problem if I use a hostname instead of an IP address: qemu (really, I assume, libcurl) forks off another process which just hangs forever. (Note: I have a remote DNS server.)

BTW, I suspect the original error (bdrv_rwv_co) is fixed and this is a different bug. I need to check whether this happens without the non-standard --enable-threaded-resolver configure option which Fedora's libcurl uses.

I want to factor out issues that are caused by Fedora's curl using ./configure --enable-threaded-resolver. Therefore I compiled curl from upstream git and added:

```
export LD_LIBRARY_PATH=/path/to/curl/lib/.libs
```

to the command in comment 27. That fixes the hostname lookup hang. So the hostname lookup hang may be caused by --enable-threaded-resolver. However, I still see the mysterious error:

```
CURL: Error opening file: Connection time-out
qemu-system-x86_64: -drive file=http://onuma/scratch/cirros-0.3.1-x86_64-disk.img,snapshot=on,cache=writeback,id=hd0,if=none:
could not open disk image http://onuma/scratch/cirros-0.3.1-x86_64-disk.img: Could not open backing file:
Could not open 'http://onuma/scratch/cirros-0.3.1-x86_64-disk.img': No such file or directory
```

As that is a different problem from the one reported in this bug, I have opened a new bug about it (https://bugs.launchpad.net/qemu/+bug/1269606). However, please leave this bug open, because we don't really know whether the original problem is fixed or just masked.

Per https://bugs.launchpad.net/qemu/+bug/1269606, the error shown in comment 33 is caused by snapshot=on. I'm now pretty sure this (the original bug) is fixed.