Bug 1731038 - guest on src host gets stuck after executing migrate_cancel for RDMA migration
Summary: guest on src host gets stuck after executing migrate_cancel for RDMA migration
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks: 1758964 1771318 1897025
 
Reported: 2019-07-18 07:45 UTC by Li Xiaohui
Modified: 2021-04-15 12:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 07:37:36 UTC
Type: Bug
Target Upstream Version:
Embargoed:



Description Li Xiaohui 2019-07-18 07:45:10 UTC
Description of problem:
Guest on the src host gets stuck after executing migrate_cancel for RDMA migration.


Version-Release number of selected component (if applicable):
src & dst host info: kernel-4.18.0-117.el8.x86_64 & qemu-img-4.0.0-5.module+el8.1.0+3622+5812d9bf.x86_64
guest info: kernel-4.18.0-113.el8.x86_64

Mellanox card:
# lspci
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]


How reproducible:
2/2


Steps to Reproduce:
1. Configure the Mellanox card on both hosts.
2. Boot the guest on the src host with the following command line:
/usr/libexec/qemu-kvm \
-enable-kvm \
-machine q35  \
-m 8G \
-smp 8 \
-cpu Skylake-Client \
-name debug-threads=on \
-device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my_disk,file=my_file \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
-vnc :0 \
-device VGA \
-monitor stdio \
-qmp tcp:0:1234,server,nowait
3. Run stressapptest in the guest:
# stressapptest -M 1000 -s 10000
4. Boot the guest on the dst host with the following command line:
/usr/libexec/qemu-kvm \
-enable-kvm \
-machine q35  \
-m 8G \
-smp 8 \
-cpu Skylake-Client \
-name debug-threads=on \
-device pcie-root-port,id=pcie.0-root-port-2,slot=2,chassis=2,addr=0x2,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-3,slot=3,chassis=3,addr=0x3,bus=pcie.0 \
-device pcie-root-port,id=pcie.0-root-port-4,slot=4,chassis=4,addr=0x4,bus=pcie.0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie.0-root-port-2,addr=0x0 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/nfs/rhel810-64-virtio-scsi-3.qcow2,node-name=my_file \
-blockdev driver=qcow2,node-name=my_disk,file=my_file \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown,queues=4 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=70:5a:0f:38:cd:1c,bus=pcie.0-root-port-3,vectors=10,mq=on \
-vnc :0 \
-device VGA \
-monitor stdio \
-qmp tcp:0:1234,server,nowait \
-incoming rdma:0:4444
5. Set the migration transfer speed and enable rdma-pin-all (QMP equivalents are sketched after this list):
(qemu) migrate_set_speed 10G
(qemu) migrate_set_capability rdma-pin-all on
6. Migrate through the RDMA protocol:
(qemu) migrate rdma:192.168.10.21:5555
7. Cancel the migration via QMP before it completes:
# telnet 127.0.0.1 1234
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 4}, "package": "qemu-kvm-4.0.0-5.module+el8.1.0+3622+5812d9bf"}, "capabilities": ["oob"]}}
{"execute":"qmp_capabilities"}
{"return": {}}
{"timestamp": {"seconds": 1563433312, "microseconds": 731190}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "net0", "path": "/machine/peripheral/net0/virtio-backend"}}
{"execute":"migrate_cancel"}
{"return": {}}


Actual results:
The guest on the src host gets stuck after executing migrate_cancel.
(1) The src QEMU hangs here and the HMP monitor stops responding:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
(2) On the dst QEMU, check the migration status:
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
(qemu) info status 
VM status: paused (inmigrate)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: active
total time: 0 milliseconds
(3) The guest is stuck: the mouse doesn't respond in the remote-viewer console,
and the guest can't be pinged via its IP.
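
A thread backtrace of the stuck src QEMU would help pin down where it hangs; a minimal sketch, assuming gdb is installed and qemu-kvm is the only QEMU process on the host:
# gdb -p $(pgrep -o qemu-kvm) -batch -ex 'thread apply all bt'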


Expected results:
Guest still runs normally on the src host after migrate_cancel.


Additional info:
The guest works well after RDMA migration when migrate_cancel is not executed.

Comment 2 Li Xiaohui 2019-07-19 09:30:33 UTC
Hi all,
I also tested this case on the rhel8.1.0 fast train with win10 (q35+seabios), win8-32 (pc+seabios), rhel8.1.0 (q35+seabios), rhel7.7 (pc+seabios), and rhel8.0.1 (q35+ovmf) guests:
1. rhel8.1.0 and win10 guests hit the same issue as in comment 0 above.
2. rhel8.0.1, rhel7.7, and win8-32 guests get output like the following after migrate_cancel, but I suspect the message isn't right (ibv_poll_cq wc.status=13 RNR retry counter exceeded!...). What do you think?
(1) On the src host QEMU:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Early error. Sending error.
ibv_poll_cq wc.status=13 RNR retry counter exceeded!
ibv_poll_cq wrid=CONTROL SEND!
qemu-kvm: rdma migration: send polling control error
(qemu) info status 
VM status: running
(qemu) info migr
migrate               migrate_cache_size    migrate_capabilities  
migrate_parameters    
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off x-ignore-shared: off 
Migration status: cancelled
total time: 0 milliseconds
(2) On the dst host QEMU:
(qemu) info status 
VM status: paused (inmigrate)
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: receive cm event, cm event is 10
qemu-kvm: rdma migration: send polling control error
qemu-kvm: Failed to send control buffer!
qemu-kvm: load of migration failed: Input/output error
qemu-kvm: Early error. Sending error.
qemu-kvm: rdma migration: send polling control error
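
To help judge whether the RNR retry error reflects a real link problem, the port state could be checked on both hosts; a minimal sketch, assuming libibverbs-utils is installed:
# ibv_devinfo -d mlx4_0 | grep -E 'state|active_mtu'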


What's more, I tested this case on the rhel8.1.0 slow train with win10 (q35+seabios) and rhel8.1.0 (pc+seabios) guests: the guest runs normally on the src host after migrate_cancel, and the messages are correct on both the src and dst QEMU:
(1) On the src host QEMU:
(qemu) migrate rdma:192.168.0.21:4444
source_resolve_host RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: migration_iteration_finish: Unknown ending state 2
qemu-kvm: Early error. Sending error.
(qemu) info status 
VM status: running
(qemu) info migr
migrate               migrate_cache_size    migrate_capabilities  
migrate_parameters    
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: on auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off late-block-activate: off 
Migration status: cancelled
total time: 0 milliseconds

(2) On the dst host QEMU:
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
qemu-kvm: Was expecting a QEMU FILE (3) control message, but got: ERROR (1), length: 0
qemu-kvm: load of migration failed: Input/output error

Comment 4 Ademar Reis 2020-02-05 23:00:59 UTC
QEMU has recently been split into sub-components, and as a one-time operation to avoid breaking tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks.

Comment 7 RHEL Program Management 2021-03-15 07:37:36 UTC
After evaluating this issue, we have no plans to address it further or fix it in an upcoming release; therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, the bug can be reopened.

Comment 8 Li Xiaohui 2021-04-15 12:48:39 UTC
Didn't reproduce this bz with the steps in Comment 0 on RHEL-AV 8.4.0 (kernel-4.18.0-304.el8.x86_64 & qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64).

Closing this bz as CurrentRelease.

