Bug 1014604
| Field | Value |
|---|---|
| Summary | Race condition allocating veth devices with parallel LXC container creation |
| Product | Red Hat Enterprise Linux 7 |
| Reporter | Daniel Berrangé <berrange> |
| Component | libvirt |
| Assignee | Daniel Berrangé <berrange> |
| Status | CLOSED CURRENTRELEASE |
| QA Contact | Virtualization Bugs <virt-bugs> |
| Severity | unspecified |
| Docs Contact | |
| Priority | unspecified |
| Version | 7.0 |
| CC | acathrow, ajia, berrange, dallan, dyuan, fullung, jdenemar, lsu, mprivozn |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | libvirt-1.1.1-26.el7 |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2014-06-13 09:29:36 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 910269, 992980, 1058606, 1086175 |
Description (Daniel Berrangé, 2013-10-02 11:37:08 UTC)
Reproduced this on libvirt-1.1.1-8.el7.x86_64 with libvirt-sandbox-0.5.0-5.el7.x86_64 and kernel-3.10.0-33.el7.x86_64.

# for i in {1..100}; do virt-sandbox-service -c lxc:/// create -N dhcp,source=default mylxcsh$i /bin/bash; done
# for i in {1..100}; do virsh -c lxc:/// start mylxcsh$i & done

<slice>
error: internal error: Child process (ip link add veth0 type veth peer name veth1) unexpected exit status 2: RTNETLINK answers: File exists
Domain mylxcsh5 started
Domain mylxcsh3 started
XXXXXX
Domain mylxcsh42 started
error: Failed to start domain mylxcsh96
error: Failed to start domain mylxcsh60
error: internal error: Child process (ip link add veth16 type veth peer name veth21) unexpected exit status 2: RTNETLINK answers: File exists
</slice>

# virsh -c lxc:/// -q list | wc -l
18

Tested it on

# for i in {1..100}; do virsh -c lxc:/// start mylxcsh$i & done

<slice>
error: Failed to start domain mylxcsh11
error: internal error: Failed to allocate free veth pair after 10 attempts

error: Failed to start domain mylxcsh24
error: internal error: Failed to allocate free veth pair after 10 attempts

error: Failed to start domain mylxcsh10
error: internal error: Failed to allocate free veth pair after 10 attempts

error: Failed to start domain mylxcsh43
error: internal error: Failed to allocate free veth pair after 10 attempts
</slice>

Daniel, are 10 attempts acceptable, or could users change the number of attempts? I guess the 10 attempts are hard-coded. Thanks.

# virsh -c lxc:/// -q list | wc -l
95

# tail -2 /etc/libvirt/libvirtd.conf
max_clients = 1024
max_workers = 1024

(In reply to Alex Jia from comment #4)
> Tested it on

Tested it on libvirt-1.1.1-9.el7.x86_64 with libvirt-sandbox-0.5.0-5.el7.x86_64 and kernel-3.10.0-33.el7.x86_64. Daniel, could you help confirm the issues in comment 4? Thanks.

(In reply to Alex Jia from comment #4)
> Tested it on

Retest this:
# rpm -q libvirt libvirt-sandbox kernel
libvirt-1.1.1-13.el7.x86_64
libvirt-sandbox-0.5.0-6.el7.x86_64
kernel-3.10.0-0.rc7.64.el7.x86_64

> # for i in {1..100}; do virsh -c lxc:/// start mylxcsh$i & done
>
> <slice>
> error: Failed to start domain mylxcsh11
> error: internal error: Failed to allocate free veth pair after 10 attempts
>
> error: Failed to start domain mylxcsh24
> error: internal error: Failed to allocate free veth pair after 10 attempts
>
> error: Failed to start domain mylxcsh10
> error: internal error: Failed to allocate free veth pair after 10 attempts
>
> error: Failed to start domain mylxcsh43
> error: internal error: Failed to allocate free veth pair after 10 attempts
>
> </slice>

Domain mylxcsh36 started
error: Failed to start domain mylxcsh43
error: internal error: Failed to allocate free veth pair after 10 attempts

Note: only 1 container can't be successfully started.

> Daniel, are 10 attempts acceptable, or could users change the number of
> attempts? I guess the 10 attempts are hard-coded. Thanks.
>
> # virsh -c lxc:/// -q list | wc -l
> 95

# virsh -c lxc:/// -q list | grep mylxcsh | wc -l
99

> # tail -2 /etc/libvirt/libvirtd.conf
> max_clients = 1024
> max_workers = 1024

# tail -3 /etc/libvirt/libvirtd.conf
max_clients = 20
max_workers = 20
max_queued_clients = 20

Note: it seems the above limits are not honored after restarting libvirtd.

In the libvirt log:

# grep error /var/log/libvirt/libvirtd.log
2013-12-02 10:15:05.481+0000: 21664: error : virNetlinkEventCallback:340 : nl_recv returned with error: No buffer space available
2013-12-02 10:15:06.590+0000: 21873: error : virNetDevVethCreate:179 : internal error: Failed to allocate free veth pair after 10 attempts

Notes: what does "No buffer space available" mean? Are 10 attempts enough, and can users change the number of attempts? Thanks.
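The "after 10 attempts" error reported above comes from a find-then-create pattern. As a minimal sketch of why parallel starts collide (illustrative names and an in-memory array standing in for sysfs; this is not the actual libvirt source):

```c
/* Illustrative sketch of the racy two-step veth allocation: the
 * free-index scan and the device creation are separate steps, so two
 * parallel callers can pick the same index; the loser gets a failure
 * (like EEXIST from 'ip link add') and retries, up to 10 times. */
#include <stdbool.h>

#define MAX_DEV_NUM      65536  /* matches virnetdevveth.c */
#define MAX_VETH_RETRIES 10     /* the hard-coded limit asked about above */

static bool dev_in_use[MAX_DEV_NUM];    /* stand-in for /sys/class/net/vnet%d */

/* Step 1: scan for the lowest unused index (like checking sysfs). */
static int first_free_index(void)
{
    for (int i = 0; i < MAX_DEV_NUM; i++)
        if (!dev_in_use[i])
            return i;
    return -1;
}

/* Step 2: create the device; fails if another caller grabbed the
 * index between steps 1 and 2. */
static int create_veth(int idx)
{
    if (dev_in_use[idx])
        return -1;              /* analogous to RTNETLINK: File exists */
    dev_in_use[idx] = true;
    return 0;
}

/* The retry loop: with many parallel callers, all 10 rounds of this
 * flaky scan-then-create can lose the race, which is exactly the
 * "Failed to allocate free veth pair after 10 attempts" error. */
static int alloc_veth_pair(void)
{
    for (int r = 0; r < MAX_VETH_RETRIES; r++) {
        int idx = first_free_index();
        if (idx >= 0 && create_veth(idx) == 0)
            return idx;
    }
    return -1;
}
```

In a single thread the loop always succeeds on the first round; the failures in the transcripts above happen only because many libvirtd worker threads run steps 1 and 2 interleaved.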
src/util/virnetdevveth.c:67:#define MAX_DEV_NUM 65536
src/util/virnetdevveth.c:120:#define MAX_VETH_RETRIES 10

It's very easy to hit the error "Failed to allocate free veth pair after %d attempts" when starting LXC containers in parallel.

Also reproduced with the same steps on libvirt-1.1.1-22.el7.x86_64, kernel-3.10.0-86.el7.x86_64, and libvirt-sandbox-0.5.0-9.el7.x86_64, with errors like:

error: Failed to start domain mylxcsh89
error: internal error: Failed to allocate free veth pair after 10 attempts

The issue from comment #4 should be fixed by upstream commit v1.2.2-rc2-1-gc0d162c:

commit c0d162c68c2f19af8d55a435a9e372da33857048
Author: Michal Privoznik <mprivozn>
Date:   Tue Feb 25 16:41:07 2014 +0100

    virNetDevVethCreate: Serialize callers

    Consider dozen of LXC domains, each of them having this type of
    interface:

        <interface type='network'>
          <mac address='52:54:00:a7:05:4b'/>
          <source network='default'/>
        </interface>

    When starting these domain in parallel, all workers may meet in
    virNetDevVethCreate() where a race starts. Race over allocating veth
    pairs because allocation requires two steps:

      1) find first nonexistent '/sys/class/net/vnet%d/'
      2) run 'ip link add ...' command

    Now consider two threads. Both of them find N as the first unused
    veth index but only one of them succeeds allocating it. The other
    one fails. For such cases, we are running the allocation in a loop
    with 10 rounds. However this is very flaky synchronization. It
    should be rather used when libvirt is competing with other process
    than when libvirt threads fight each other. Therefore, internally we
    should use mutex to serialize callers, and do the allocation in loop
    (just in case we are competing with a different process). By the way
    we have something similar already since 1cf97c87.

    Signed-off-by: Michal Privoznik <mprivozn>

*** Bug 1070221 has been marked as a duplicate of this bug.
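The serialization the commit above describes can be sketched as follows. This is a simplified illustration under assumed names (`veth_claim_lowest_free`, an in-memory array standing in for `/sys/class/net`), not the actual libvirt code:

```c
/* Minimal sketch of the fix: the free-index scan and the claim now
 * happen under one mutex, so libvirtd's own worker threads can no
 * longer race each other; the retry loop remains only for competition
 * with processes outside libvirtd. */
#include <pthread.h>
#include <stdbool.h>

#define MAX_DEV_NUM 65536

static pthread_mutex_t veth_create_lock = PTHREAD_MUTEX_INITIALIZER;
static bool veth_in_use[MAX_DEV_NUM];   /* stand-in for /sys/class/net/vnet%d */

/* Scan and claim atomically with respect to other threads in this
 * process. Returns the claimed index, or -1 when indexes are exhausted. */
static int veth_claim_lowest_free(void)
{
    int found = -1;
    pthread_mutex_lock(&veth_create_lock);
    for (int i = 0; i < MAX_DEV_NUM; i++) {
        if (!veth_in_use[i]) {
            veth_in_use[i] = true;  /* real code would run 'ip link add' here */
            found = i;
            break;
        }
    }
    pthread_mutex_unlock(&veth_create_lock);
    return found;
}
```

Because both steps sit inside the same critical section, two threads can no longer observe the same index as free, which is why the parallel-start errors disappear in the verification below.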
Test under libvirt-1.1.1-26.el7.x86_64 via comment 4's steps: all containers start up in parallel and no errors are found in either the libvirtd or the system log, so setting it to VERIFIED.

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.