Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.

Bug 1665063

Summary: qemu aarch64 pl011 driver shouldn't use synchronous writes
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Richard W.M. Jones <rjones>
Component: qemu-kvmAssignee: Guowen Shan <gshan>
qemu-kvm sub component: Devices QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED WONTFIX Docs Contact:
Severity: high    
Priority: urgent CC: gshan, knoel, lcapitulino, rbalakri, virt-maint
Version: 8.0Keywords: OtherQA, Regression
Target Milestone: rc   
Target Release: ---   
Hardware: aarch64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1661940 Environment:
Last Closed: 2020-03-01 22:32:40 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 910269, 1677408    
Attachments:
Description Flags
TCP server/client program
none
qemu patch - enforce flooding on PL011 on startup none

Description Richard W.M. Jones 2019-01-10 12:40:20 UTC
+++ This bug was initially created as a clone of Bug #1661940 +++

Description:

When libvirtd starts a qemu guest, it does it as follows:

(1) Create the qemu process paused (-S)
(2) Unpause the guest using the qemu monitor
(3) Query the qemu monitor for list of devices etc

(Bug 1661940 is about changing the ordering of events to make
qemu more robust, so don't complain that libvirt is doing it
wrong :-)

Unfortunately when creating a qemu aarch64 guest using the PL011
serial driver, between (2) and (3) a lot of boot messages can
get written to the serial console.  However nothing is listening
on the other end of the serial console socket at this point (because
libvirt hasn't returned to the caller process).  Because of the
implementation of the PL011 driver, it doesn't just throw away
the output like a normal serial port would do.  Instead there is
a small race which can causes the qemu monitor to lock up, thus
causing a deadlock.

In particular commit 6ab3fc32ea64:

> diff --git a/hw/char/pl011.c b/hw/char/pl011.c
> index c0fbf8a87428..786e605fdd61 100644
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -146,7 +146,9 @@ static void pl011_write(void *opaque, hwaddr offset,
>          /* ??? Check if transmitter is enabled.  */
>          ch = value;
>          if (s->chr)
> -            qemu_chr_fe_write(s->chr, &ch, 1);
> +            /* XXX this blocks entire thread. Rewrite to use
> +             * qemu_chr_fe_write and background I/O callbacks */
> +            qemu_chr_fe_write_all(s->chr, &ch, 1);
>          s->int_level |= PL011_INT_TX;
>          pl011_update(s);
>          break;

This is only a brief summary of the bug.  The longer details can be
found in the following comments:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c43

An awkward reproducer which requires patching libvirt to
slow it down between steps (2) and (3) is:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c44

Version:

qemu-kvm-3.1.0-3.module+el8+2638+e43dad09.aarch64

Comment 1 Richard W.M. Jones 2019-01-10 12:41:21 UTC
(In reply to Richard W.M. Jones from comment #0)
> (Bug 1661940 is about changing the ordering of events to make
> qemu more robust, so don't complain that libvirt is doing it
> wrong :-)

I mean to make *libvirtd* more robust :-)

Comment 2 Luiz Capitulino 2019-01-10 19:00:49 UTC
Setting this to 8.1, unless we consider it a blocker.

Comment 3 Luiz Capitulino 2019-01-10 19:18:23 UTC
btw, I'm wondering, the pl011 driver seems very old. Is there another uart driver we could use for arm?

Comment 6 Ademar Reis 2020-02-05 22:53:50 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 7 Guowen Shan 2020-02-20 06:09:38 UTC
Created attachment 1664266 [details]
TCP server/client program

Comment 8 Guowen Shan 2020-02-20 06:11:06 UTC
Created attachment 1664267 [details]
qemu patch - enforce flooding on PL011 on startup

Comment 9 Guowen Shan 2020-02-20 06:25:43 UTC
I think Richard already provided all the details. The issue seems caused by flooding on TCP connection, which isn't accepted on server side. I don't know how to build a private libvirtd, so the attached "tcp.c" and patch are used to reproduce the issue.

First of all, run binary "tcp", which is build from "tcp.c" as below. The transmission fails and returns -EAGAIN on client side after continuous 431 transmissons. It means the client can send data even the connection isn't accepted on server side. Besides, the sending buffer can be overruned quickly.

   session1# ./tcp -s 50900
   server: connected with client localhost (127.0.0.1)
   session2# ./tcp -c 50900
   line[0]: succeed, 1024
   line[1]: succeed, 1024
   line[2]: succeed, 1024
   line[3]: succeed, 1024
      :
   line[431]: succeed, 1024
   line[432]: succeed, 673
   line[433]: ret=-1, errno=11

Another experiment is start a VM with customized qemu, which is built from upstream qemu and the the attached patch. It also proves the flooding on PL011 can prevent the system from booting:

   session1# ./tcp -s 50900
   session2# ~/sandbox/src/qemu/qemu.main.aarch64/aarch64-softmmu/qemu-system-aarch64  \
             -machine virt,gic-version=3 -cpu max -m 4096                              \
             -smp 2,sockets=2,cores=1,threads=1                                        \
             -monitor none -serial tcp:127.0.0.1:50900 -nographic -s                   \
               :
            flood_uart: 0
            flood_uart: 1
            flood_uart: 2
            flood_uart: 3
            flood_uart: 4
            flood_uart: 5
               :
            flood_uart: 2573
            flood_uart: 2574
            <system stops here>

For this, a patch has been posted for review. I will do downstream backporting once it's finalized.

   https://patchwork.kernel.org/patch/11393393/

Comment 10 Guowen Shan 2020-03-01 22:32:17 UTC
The community chooses not to fix the corner case according to the discussion as below link shows.
Instead, the user should have the TCP connection accepted and kept being polled for data on the
server side. So I'm closing this with "WONTFIX".

   https://patchwork.kernel.org/patch/11393393/