Bug 1350772 - Memory locking is not required for non-KVM ppc64 guests
Summary: Memory locking is not required for non-KVM ppc64 guests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.3
Hardware: ppc64le
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Andrea Bolognani
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-28 11:07 UTC by Andrea Bolognani
Modified: 2016-11-03 18:47 UTC
CC List: 8 users

Fixed In Version: libvirt-2.0.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1293024
Environment:
Last Closed: 2016-11-03 18:47:43 UTC
Target Upstream Version:
Embargoed:


Attachments
libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-11.el7 (11.03 KB, text/plain)
2016-07-13 07:36 UTC, Dan Zheng
libguestfs-test-tool pass output with qemu-kvm-rhev-2.6.0-11.el7 unlimited work around (55.93 KB, text/plain)
2016-07-13 07:39 UTC, Dan Zheng
libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-13.el7 (33.79 KB, text/plain)
2016-07-18 07:53 UTC, Dan Zheng


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2016:2577 0 normal SHIPPED_LIVE Moderate: libvirt security, bug fix, and enhancement update 2016-11-03 12:07:06 UTC

Description Andrea Bolognani 2016-06-28 11:07:28 UTC
+++ This bug was initially created as a clone of Bug #1293024 +++

Description of problem:

When launching either a ppc64 or ppc64le guest (x86-64 host) I get:

ERROR    internal error: Process exited prior to exec: libvirt:  error : cannot limit locked memory to 46137344: Operation not permitted

Version-Release number of selected component (if applicable):

libvirt-1.3.0-1.fc24.x86_64
kernel 4.2.6-301.fc23.x86_64

How reproducible:

100%

Steps to Reproduce:
1. Run this virt-install command:

virt-install --name=tmp-fed0fb92 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora21 --arch ppc64le --machine pseries --initrd-inject=/tmp/tmp.sVjN8w5nyk '--extra-args=ks=file:/tmp.sVjN8w5nyk console=tty0 console=hvc0 proxy=http://cache.home.annexia.org:3128' --disk fedora-23-ppc64le,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/21/Server/ppc64le/os/ --nographics --noreboot

(The same failure happens with ppc64).

--- Additional comment from Richard W.M. Jones on 2015-12-19 04:56:29 EST ---

It's OK with an x86-64 guest.

--- Additional comment from Richard W.M. Jones on 2015-12-19 05:00:33 EST ---

I worked around it by increasing my user account's locked memory
limit (ulimit -l) to unlimited.  I wonder if the error message comes
from qemu?

--- Additional comment from Richard W.M. Jones on 2015-12-19 05:04:44 EST ---

Smallest reproducer is this command (NB: as NON-root):

$ virt-install --name=tmp-bz1293024 --ram=4096 --vcpus=1 --os-type=linux --os-variant=fedora22 --disk /var/tmp/fedora-23.img,size=6,format=raw --serial pty --location=https://download.fedoraproject.org/pub/fedora-secondary/releases/23/Server/ppc64le/os/ --nographics --noreboot --arch ppc64le

Note: If you are playing with ulimit, you have to kill libvirtd
since it could use the previous ulimit from another session.
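
A minimal sketch of that workaround, assuming a session libvirtd and standard shell tools (commands are illustrative, not taken from the report):

  $ ulimit -l unlimited            # raise the locked memory limit for this shell
  $ pkill -u "$USER" libvirtd      # stop any stale daemon still holding the old limit
  $ virt-install ...               # retry; the freshly started daemon inherits the new limit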

--- Additional comment from Jan Kurik on 2016-02-24 09:09:40 EST ---

This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle.
Changing version to '24'.

More information and reason for this action is here:
https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase

--- Additional comment from Cole Robinson on 2016-03-16 19:43:19 EDT ---

Rich do you still see this with latest rawhide?

(the mem locking error comes from libvirt... apparently ppc64 needs some explicit mem locking? that's what the code says, but I didn't dig deeper than that)

--- Additional comment from Richard W.M. Jones on 2016-03-17 12:11:30 EDT ---

There doesn't appear to be a Rawhide repo for ppc64le yet.

Unless something has changed in libvirt or virt-install to fix
this, I doubt very much that it is fixed.

--- Additional comment from Cole Robinson on 2016-03-17 12:13:51 EDT ---

Andrea, any thoughts on this? Have you seen this issue?

--- Additional comment from Richard W.M. Jones on 2016-03-24 14:46:38 EDT ---

Still happening on libvirt-1.3.2-3.fc24.x86_64 (x86-64 host, running
Ubuntu/ppc64le guest).

--- Additional comment from Andrea Bolognani on 2016-03-29 09:11:01 EDT ---

(In reply to Cole Robinson from comment #7)
> Andrea, any thoughts on this? Have you seen this issue?

I hadn't, thanks for bringing it up.

The issue Rich is seeing is caused by

  https://bugzilla.redhat.com/show_bug.cgi?id=1273480

having been fixed.

Short version is that ppc64 guests always need some amount
of memory to be locked, and that amount is guaranteed to be
more than the default 64 KiB allowance.

libvirt tries to raise the limit to prevent the allocation
from failing, but it can only do that successfully when
running as root.
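
For reference, the default allowance can be checked from an unprivileged shell (the 64 KiB value shown below is the usual distribution default, given here purely as an illustration):

  $ ulimit -l
  64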

--- Additional comment from Richard W.M. Jones on 2016-04-07 15:51:02 EDT ---

I set the architecture to ppc64le, but in fact it affects
ppc64 also.  In answer to comment 5, it affects Fedora 24 too.

--- Additional comment from Andrea Bolognani on 2016-04-08 04:55:42 EDT ---

(In reply to Richard W.M. Jones from comment #10)
> I set the architecture to ppc64le, but in fact it affects
> ppc64 also.  In answer to comment 5, it affects Fedora 24 too.

Yeah, this will affect both ppc64 variants and any version of
libvirt from 1.3.0 on.

Unfortunately I don't really see a way to fix this: the memory
locking limit really needs to be quite high on ppc64,
definitely higher than the default: the fact that this was not
enforced before was a bug and could lead to more trouble later
on.

When libvirtd is running as root we can adjust the limit
ourselves quite easily; when it's running as a regular user,
we're of course unable to do that.

At least the error message is IMHO quite clear and hints at
the solution.

--- Additional comment from Cole Robinson on 2016-04-26 17:42:04 EDT ---

bug 1273480 seems to be all about hostdev assignment, which rich isn't doing. I see this commit:

commit 16562bbc587add5a03a01c8eb8607c9e05819607
Author: Andrea Bolognani <abologna>
Date:   Fri Nov 13 10:58:07 2015 +0100

    qemu: Always set locked memory limit for ppc64 domains
    
    Unlike other architectures, ppc64 domains need to lock memory
    even when VFIO is not used.


But I don't see where the need for unconditional locked memory is explained... Can you point me to that discussion?

--- Additional comment from Andrea Bolognani on 2016-04-28 08:08:52 EDT ---

(In reply to Cole Robinson from comment #12)
> bug 1273480 seems to be all about hostdev assignment, which rich isn't
> doing. I see this commit:
> 
> commit 16562bbc587add5a03a01c8eb8607c9e05819607
> Author: Andrea Bolognani <abologna>
> Date:   Fri Nov 13 10:58:07 2015 +0100
> 
>     qemu: Always set locked memory limit for ppc64 domains
>     
>     Unlike other architectures, ppc64 domains need to lock memory
>     even when VFIO is not used.
> 
> 
> But I don't see where the need for unconditional locked memory is
> explained... Can you point me to that discussion?

See David's detailed explanation[1] from back when the patch
series was posted on libvir-list.

On a related note, there's been some progress recently toward
getting some of that memory actually accounted for.


[1] https://www.redhat.com/archives/libvir-list/2015-November/msg00769.html

--- Additional comment from Cole Robinson on 2016-04-29 08:00:32 EDT ---

Thanks for the pointer.  So if ppc64 doesn't do this memlocking, do things fail 100% of the time? Or is this a heuristic that maybe is triggering a false positive? Rich maybe you can edit libvirt and figure it out.

If this has the potential to be wrong in the non-VFIO case, I suggest at least making it a non-fatal error if the daemon is unprivileged, and logging a VIR_WARN instead.

An additional bit we could do is have qemu-system-ppc64 ship a /etc/security/limits.d file to up the memlock limit on ppc64 hosts.

--- Additional comment from Andrea Bolognani on 2016-05-05 04:51:40 EDT ---

(In reply to Cole Robinson from comment #14)
> Thanks for the pointer.  So if ppc64 doesn't do this memlocking, do things
> fail 100% of the time? Or is this a heuristic that maybe is triggering a
> false positive? Rich maybe you can edit libvirt and figure it out.
> 
> If this has the potential to be wrong in the non-VFIO case, I suggest at
> least making it a non-fatal error if the daemon is unprivileged, and logging
> a VIR_WARN instead.
> 
> An additional bit we could do is have qemu-system-ppc64 ship a
> /etc/security/limits.d file to up the memlock limit on ppc64 hosts

My understanding is that the consequences of not raising the
memory locking limit appropriately can be pretty severe.

David, can you give us more details please? What could happen
if users ran QEMU with the default memory locking limit of
64 KiB?

--- Additional comment from David Gibson on 2016-05-26 02:08:22 EDT ---

Cole,

The key thing here is that on ppc64, unlike x86, the hardware page tables are encoded as a big hash table, rather than a set of radix trees.  Each guest needs its own hashed page table (HPT).  These can get quite large - it can vary depending on a number of things, but the usual rule of thumb is that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
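
As a rough worked example using the rule of thumb above (figures are illustrative, not from this report), a 4 GiB guest needs a 32-64 MiB HPT:

  $ echo "$((4096 / 128)) $((4096 / 64))"    # MiB of HPT for 4096 MiB of guest RAM
  32 64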

For PAPR paravirtualized guests this HPT is accessed entirely via hypercall and does not exist within the guest's RAM - it needs to be allocated on the host above and beyond the guest's RAM image.  When using the "HV" KVM implementation (the only one we're targeting) the HPT has to be _host_ physically contiguous, unswappable memory (because it's read directly by hardware).

At the moment, the host kernel doesn't actually need the locked memory limit - it allows unprivileged users (with permission to create VMs) to allocate HPTs anyway, but this is really a bug.  As it stands a non-privileged user could create a whole pile of tiny VMs (it doesn't even need to actually execute any instructions in the VMs) and consume an unbounded amount of host memory with those 16MiB HPTs.

So we plan to fix that in the kernel.  In the meantime libvirt treats things as if the kernel enforced that limit even though it doesn't yet, to avoid having yet more ugly kernel version dependencies.


Andrea, would it make any sense to have failure of the setrlimit in libvirt cause only a warning, not a fatal error?  In that case it wouldn't prevent things working in situations where it can for other reasons (old kernel which doesn't enforce limits, PR KVM which doesn't require it..).

--- Additional comment from Peter Krempa on 2016-05-26 03:28:32 EDT ---

(In reply to David Gibson from comment #16)

[...]

> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error?  In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

Not really. Warnings are not presented to the user, just logged to the log file, so it's very likely to get ignored.

--- Additional comment from Andrea Bolognani on 2016-05-26 04:20:07 EDT ---

(In reply to David Gibson from comment #16)
> Cole,
> 
> The key thing here is that on ppc64, unlike x86, the hardware page tables
> are encoded as a big hash table, rather than a set of radix trees.  Each
> guest needs its own hashed page table (HPT).  These can get quite large - it
> can vary depending on a number of things, but the usual rule of thumb is
> that the HPT is 1/128th to 1/64th of RAM size, with a minimum size of 16MiB.
> 
> For PAPR paravirtualized guests this HPT is accessed entirely via hypercall
> and does not exist within the guest's RAM - it needs to be allocated on the
> host above and beyond the guest's RAM image.  When using the "HV" KVM
> implementation (the only one we're targeting) the HPT has to be _host_
> physically contiguous, unswappable memory (because it's read directly by
> hardware).
> 
> At the moment, the host kernel doesn't actually need the locked memory limit
> - it allows unprivileged users (with permission to create VMs) to allocate
> HPTs anyway, but this is really a bug.

So IIUC the bug is that, by not accounting for that memory
properly, the kernel is allowing it to be allocated as
potentially non-contiguous and swappable, which will result
in failure right away (non-contiguous) or as soon as it has
been swapped out (swappable). Is that right?

> As it stands a non-privileged user
> could create a whole pile of tiny VMs (it doesn't even need to actually
> execute any instructions in the VMs) and consume an unbounded amount of host
> memory with those 16MiB HPTs.

That's not really something QEMU specific, though, is it?
The same user could just as easily start a bunch of random
processes, each one allocating 16MiB+ and get the same result.

> So we plan to fix that in the kernel.  In the meantime libvirt treats things
> as if the kernel enforced that limit even though it doesn't yet, to avoid
> having yet more ugly kernel version dependencies.
> 
> 
> Andrea, would it make any sense to have failure of the setrlimit in libvirt
> cause only a warning, not a fatal error?  In that case it wouldn't prevent
> things working in situations where it can for other reasons (old kernel
> which doesn't enforce limits, PR KVM which doesn't require it..).

I don't think that's a good idea.

First of all, we'd have to be able to tell whether raising
the limit is actually needed or not, which would probably be
tricky - especially considering that libvirt currently doesn't
know anything about the difference between HV and PR KVM.

Most importantly, we'd be allowing users to start guests that
we know full well may run into trouble later. I'd rather error
out early than have the guest behave erratically down the line
for no apparent reason.

Peter's point about warnings having very little visibility is
also a good one.

--- Additional comment from David Gibson on 2016-05-26 18:11:08 EDT ---

> > At the moment, the host kernel doesn't actually need the locked memory limit
> > - it allows unprivileged users (with permission to create VMs) to allocate
> > HPTs anyway, but this is really a bug.

> So IIUC the bug is that, by not accounting for that memory
> properly, the kernel is allowing it to be allocated as
> potentially non-contiguous and swappable, which will result
> in failure right away (non-contiguous) or as soon as it has
> been swapped out (swappable). Is that right?

No.  The HPT *will* be allocated contiguous and non-swappable (it's allocated with CMA) - it's just not accounted against the process / user's locked memory limit.  That's why this is a security bug.

> > As it stands a non-privileged user
> > could create a whole pile of tiny VMs (it doesn't even need to actually
> > execute any instructions in the VMs) and consume an unbounded amount of host
> > memory with those 16MiB HPTs.

> That's not really something QEMU specific, though, is it?
> The same user could just as easily start a bunch of random
> processes, each one allocating 16MiB+ and get the same result.

No, because in that case the memory would be non-contiguous and swappable.

--- Additional comment from Andrea Bolognani on 2016-06-09 10:26:56 EDT ---

Got it.

So I guess our options are:

  a) Raise locked memory limit for users to something like
     64 MiB, so they can run guests of reasonable size (4 GiB)
     without running into errors. Appliances created by
     libguestfs are going to be even smaller than that, I
     assume, so they would work

  b) Teach libvirt about the difference between kvm_hv and
     kvm_pr, only try to tweak the locked memory limit when
     using HV, and have libguestfs always use PR

  c) Force libguestfs to use the direct backend on ppc64

  d) Leave things as they are, basically restricting
     libguestfs usage to the root user

a) and c) are definitely hacks, but could be implemented
fairly quickly and removed once a better solution is in
place.

b) looks like it would be the proper solution but, as with
all things libvirt, rushing an implementation without thinking
hard about the design has the potential to paint us into a corner.

d) is probably not acceptable.

--- Additional comment from David Gibson on 2016-06-14 02:01:23 EDT ---

In the short term, I think we need to go with option (a).  That's the only really feasible way we can handle this in the next RHEL release, I think.

(b).. I really dislike.  We try to avoid explicitly exposing the PR/HV distinction even to qemu as much as possible - instead using explicit capabilities for various features.  Exposing and using that distinction a layer beyond qemu is going to open several new cans of worms.  For one thing, whether the kernel picks HV or PR can depend on a number of details of both host and guest configuration, so you can't really reliably know which one it's going to be before starting it.

(c) I'm not quite sure what "direct mode" entails.

(d) is.. yeah, certainly suboptimal.


Other things we could try:

(e) Change KVM so that if it's unable to allocate the HPT due to locked memory limit, it will fall back to PR-KVM.  In a sense that's the most pedantically correct, but I dislike it, because I suspect the result will be lots of people's VMs going slow for non-obvious reasons.

(f) Put something distinctive in the error qemu reports when it hits the HPT allocation problem, and only have libvirt try to alter the limit and retry if qemu dies with that error.  Involves an extra qemu invocation, which sucks.

(g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts, that will adjust users locked memory limits based on some sort of # of VMs and max size of VMs values configured by admin.  This is basically a sophisticated version of (a).


Ugh.. none of these are great :/.

--- Additional comment from Andrea Bolognani on 2016-06-14 06:33:32 EDT ---

(In reply to David Gibson from comment #21)
> In the short term, I think we need to go with option (a).  That's the only
> really feasible way we can handle this in the next RHEL release, I think.

I guess we would have to make qemu-kvm-rhev ship a
/etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
sets the new limit. It wouldn't make sense to raise the
limit for hosts that are not going to act as hypervisors.

> (b).. I really dislike.  We try to avoid explicitly exposing the PR/HV
> distinction even to qemu as much as possible - instead using explicit
> capabilities for various features.  Exposing and using that distinction a
> layer beyond qemu is going to open several new cans of worms.  For one
> thing, whether the kernel picks HV or PR can depend on a number of details
> of both host and guest configuration, so you can't really reliably know
> which one it's going to be before starting it.

Okay then.

> (c) I'm not quite sure what "direct mode" entails.

Basically libguestfs will call QEMU itself instead of going
through libvirt. guestfish will give you this hint:

  libguestfs: error: could not create appliance through libvirt.

  Try running qemu directly without libvirt using this environment variable:
  export LIBGUESTFS_BACKEND=direct

and if you do that you'll of course be able to avoid the error
raised by libvirt.

I don't know what other implications there are to using the
direct backend, though. Rich?

> (d) is.. yeah, certainly suboptimal.
> 
> 
> Other things we could try:
> 
> (e) Change KVM so that if it's unable to allocate the HPT due to locked
> memory limit, it will fall back to PR-KVM.  In a sense that's the most
> pedantically correct, but I dislike it, because I suspect the result will be
> lots of people's VMs going slow for non-obvious reasons.

Yeah, doing this kind of stuff outside of the user's control is
never going to end well. Better to fail with a clear error
message than trying to patch things up behind the scenes.

> (f) Put something distinctive in the error qemu reports when it hits the HPT
> allocation problem, and only have libvirt try to alter the limit and retry
> if qemu dies with that error.  Involves an extra qemu invocation, which
> sucks.

libvirt is not really designed in a way that allows you to
just try calling QEMU with some arguments and, if that fails,
call it again with different arguments. So QEMU would have to
expose the information through QMP somehow, for libvirt to
probe beforehand. I'm not sure whether this approach would
even be feasible.

> (g) Introduce some new kind of "VM limits" stuff into RHEL startup scripts,
> that will adjust users locked memory limits based on some sort of # of VMs
> and max size of VMs values configured by admin.  This is basically a
> sophisticated version of (a).

The limits are per-process, though. So the only thing
that really matters is how much memory you want to allow
for an unprivileged guest. PCI passthrough is not going
to be a factor unless you're root, and in that case you
can set the limit as you please.

> Ugh.. none of these are great :/.

--- Additional comment from Daniel Berrange on 2016-06-14 06:40:48 EDT ---

(In reply to Andrea Bolognani from comment #22)
> (In reply to David Gibson from comment #21)
> > In the short term, I think we need to go with option (a).  That's the only
> > really feasible way we can handle this in the next RHEL release, I think.
> 
> I guess we would have to make qemu-kvm-rhev ship a
> /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> sets the new limit. It wouldn't make sense to raise the
> limit for hosts that are not going to act as hypervisors.

Such files will have no effect. The limits.conf files are processed by PAM, and when libvirt launches QEMU and sets its UID, PAM is not involved in any way.

IOW, if we need to set limits for QEMU, libvirt has to set them explicitly. The same would apply for other apps launching QEMU, unless they actually use 'su' to run QEMU as a different account, which I don't believe any do.
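
One quick way to inspect the limit a daemon-launched process actually ended up with, regardless of any limits.d files (output is illustrative; 65536 bytes is the usual 64 KiB default):

  $ grep "Max locked memory" /proc/$(pgrep -o libvirtd)/limits
  Max locked memory         65536                65536                bytes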

--- Additional comment from Andrea Bolognani on 2016-06-14 07:14:00 EDT ---

(In reply to Daniel Berrange from comment #23)
> > I guess we would have to make qemu-kvm-rhev ship a
> > /etc/security/limits.d/qemu-kvm-rhev-memlock.conf file that
> > sets the new limit. It wouldn't make sense to raise the
> > limit for hosts that are not going to act as hypervisors.
> 
> Such files will have no effect. The limits.conf files are processed by PAM,
> and when libvirt launches QEMU and sets its UID, PAM is not involved in any
> way.
> 
> IOW, if we need to set limits for QEMU, libvirt has to set them explicitly.
> The same would apply for other apps launching QEMU, unless they actually use
> 'su' to run QEMU as a different account, which I don't believe any do.

For user sessions, the libvirt daemon is autostarted and
will inherit the user's limits.

I tried dropping

  *       hard    memlock         64000
  *       soft    memlock         64000

in /etc/security/limits.d/qemu-kvm-rhev-memlock.conf and,
after logging out and in again, I was able to install a
guest and use guestfish from my unprivileged account.
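
As a quick sanity check (the output below is simply what one would expect given the values above, shown as an illustration):

  $ ulimit -l
  64000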

--- Additional comment from Richard W.M. Jones on 2016-06-14 07:28:43 EDT ---

(In reply to Andrea Bolognani from comment #22)
> > (c) I'm not quite sure what "direct mode" entails.
> 
> Basically libguestfs will call QEMU itself instead of going
> through libvirt. guestfish will give you this hint:
> 
>   libguestfs: error: could not create appliance through libvirt.
> 
>   Try running qemu directly without libvirt using this environment variable:
>   export LIBGUESTFS_BACKEND=direct
> 
> and if you do that you'll of course be able to avoid the error
> raised by libvirt.
> 
> I don't know what other implications there are to using the
> direct backend, though. Rich?

It's not supported, nor encouraged in RHEL.  In this case it's a DIY
workaround, but it ought to be fixed in libvirt (or qemu, or wherever,
but in any case not by end users).

--- Additional comment from Andrea Bolognani on 2016-06-28 05:01:55 EDT ---

Moving this to qemu, as the only short-term (and possibly
long-term) solution seems to be the one outlined in
Comment 20 (proposal A) and POC-ed in Comment 24, ie. ship
a /etc/security/limits.d/qemu-memlock.conf file that raises
the memory locking limit to something like 64 MiB, thus
allowing regular users to run smallish guests.
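
A hypothetical sketch of what such a file could look like (limits.conf expresses memlock values in KiB, so 64 MiB is 65536; path and numbers are illustrative, not a final proposal):

  $ cat /etc/security/limits.d/qemu-memlock.conf
  *       hard    memlock         65536
  *       soft    memlock         65536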

Comment 1 Andrea Bolognani 2016-06-28 11:13:13 UTC
I just realized that the original report was about this
failure happening on a x86_64 host.

In the case of TCG guests, regardless of the host
architecture, it's my understanding that memory locking
should not be required.

David, can you please confirm that?

Comment 2 Richard W.M. Jones 2016-06-28 13:43:34 UTC
(In reply to Andrea Bolognani from comment #1)
> I just realized that the original report was about this
> failure happening on a x86_64 host.

That's not (really) right.  It was actually launching an L2
guest inside an L1 host, where:

  L2 guest (ppc64) - failed because of memory locking
  L1 guest (ppc64) - running OK
  L0 host (x86-64)

In this case I'm only doing nested virt while waiting for
IBM to send me some POWER hardware.  Ha ha, not really.

Comment 3 Andrea Bolognani 2016-06-28 14:27:33 UTC
(In reply to Richard W.M. Jones from comment #2)
> > I just realized that the original report was about this
> > failure happening on a x86_64 host.
> 
> That's not (really) right.  It was actually launching an L2
> guest inside an L1 host, where:
> 
>   L2 guest (ppc64) - failed because of memory locking
>   L1 guest (ppc64) - running OK
>   L0 host (x86-64)
> 
> In this case I'm only doing nested virt while waiting for
> IBM to send me some POWER hardware.  Ha ha, not really.

Of course the first ppc64 guest (L1) had to be using TCG
because of the architecture mismatch. But was it started by
libvirt? And if so, was it the user daemon or the system
one?

Comment 4 Richard W.M. Jones 2016-06-28 14:43:18 UTC
(In reply to Andrea Bolognani from comment #3)
> (In reply to Richard W.M. Jones from comment #2)
> > > I just realized that the original report was about this
> > > failure happening on a x86_64 host.
> > 
> > That's not (really) right.  It was actually launching an L2
> > guest inside an L1 host, where:
> > 
> >   L2 guest (ppc64) - failed because of memory locking
> >   L1 guest (ppc64) - running OK
> >   L0 host (x86-64)
> > 
> > In this case I'm only doing nested virt while waiting for
> > IBM to send me some POWER hardware.  Ha ha, not really.
> 
> Of course the first ppc64 guest (L1) had to be using TCG
> because of the architecture mismatch.

For sure.

> But was it started by
> libvirt? And if so, was it the user daemon or the system
> one?

Yes, libvirt, and using the system connection.

However I am not clear if the *L2* guest would be using TCG,
or whether qemu emulates enough of POWER that it can emulate
KVM too (albeit really slowly, of course).  For example if you
run an L2 guest on x86-64, even without nested KVM, the
L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64
running the L1 guest can emulate AMD's virt extensions.

Comment 5 Andrea Bolognani 2016-06-28 15:14:00 UTC
(In reply to Richard W.M. Jones from comment #4)
> > > > I just realized that the original report was about this
> > > > failure happening on a x86_64 host.
> > > 
> > > That's not (really) right.  It was actually launching an L2
> > > guest inside an L1 host, where:
> > > 
> > >   L2 guest (ppc64) - failed because of memory locking
> > >   L1 guest (ppc64) - running OK
> > >   L0 host (x86-64)
> > > 
> > > In this case I'm only doing nested virt while waiting for
> > > IBM to send me some POWER hardware.  Ha ha, not really.
> > 
> > Of course the first ppc64 guest (L1) had to be using TCG
> > because of the architecture mismatch.
> 
> For sure.
> 
> > But was it started by
> > libvirt? And if so, was it the user daemon or the system
> > one?
> 
> Yes, libvirt, and using the system connection.

The system daemon, running as root, was able to raise the
locked memory limit, hence why you didn't run into any error.
So it all checks out :)

This BZ is about teaching libvirt that TCG guests don't need
to lock memory, which would make you able to run TCG guests,
either on x86_64 or ppc64, from the user daemon.

Assuming David confirms that TCG doesn't need to lock memory,
that is :)

> However I am not clear if the *L2* guest would be using TCG,
> or whether qemu emulates enough of POWER that it can emulate
> KVM too (albeit really slowly, of course).  For example if you
> run an L2 guest on x86-64, even without nested KVM, the
> L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64
> running the L1 guest can emulate AMD's virt extensions.

I think the L2 guest will not be able to use kvm_hv, but
will fall back to kvm_pr instead.

Comment 6 David Gibson 2016-06-29 04:13:07 UTC
> In the case of TCG guests, regardless of the host
> architecture, it's my understanding that memory locking
> should not be required.

> David, can you please confirm that?


That's correct, unless you're using VFIO devices with TCG, in which case they will need their own memlock quota, as usual.

[Richard]
> However I am not clear if the *L2* guest would be using TCG,
> or whether qemu emulates enough of POWER that it can emulate
> KVM too (albeit really slowly, of course).  For example if you
> run an L2 guest on x86-64, even without nested KVM, the
> L2 guest will use (emulated, very slow) KVM because qemu-system-x86_64
> running the L1 guest can emulate AMD's virt extensions.

[Andrea]
> I think the L2 guest will not be able to use kvm_hv, but
> will fall back to kvm_pr instead.

The L2 guest could be using either KVM PR or TCG, I'm not sure.

The L1 guest is a PAPR (paravirtualized) guest, which runs with the HV (hypervisor mode) bit *off*.  This has to be the case, because we don't support emulating a bare metal Power machine with full HV mode emulation in qemu.  There are patches gradually making their way upstream to add a new "powernv" machine type which will emulate bare metal, but they're not in yet.

Using KVM HV requires a host running in hypervisor mode.  Since the L1 guest is not in hypervisor mode, it won't even attempt to use KVM HV.

KVM PR could work for the L2 guest, however, RHEL by default won't load the KVM PR module.  So if L1 is RHEL, and you haven't manually loaded the module, I'd expect the L2 guest to be running under TCG instead.

All of which underscores the basic problem here: it's not easy for libvirt to tell what emulation mode a guest will run in until it's running, which is a problem if we need to conditionally adjust the locked memory limit beforehand.  I don't have any good ideas about how to deal with that.
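
One way to check after the fact which modes are actually available on such an L1 guest (an illustrative sketch, not something suggested in the report):

  $ lsmod | grep kvm                 # kvm_hv / kvm_pr show up only if the modules are loaded
  $ virsh capabilities | grep kvm    # no match means only TCG (domain type 'qemu') is available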

Comment 8 Wayne Sun 2016-07-01 09:24:34 UTC
As checked in our CI jobs, the memory lock problem occurred with:
libvirt	1.3.5-1.el7
kernel	3.10.0-327.el7
qemu-kvm-rhev	2.6.0-6.el7

Now retest with qemu-kvm-rhev updated

packages: 
libvirt-1.3.5-1.el7.ppc64le
qemu-kvm-rhev-2.6.0-10.el7.ppc64le
kernel-3.10.0-327.el7.ppc64le

steps:
# useradd new_user
# su - new_user
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     avocado-vt-vm1                 shut off

$ virsh dumpxml avocado-vt-vm1
<domain type='kvm'>
  <name>avocado-vt-vm1</name>
  <uuid>1c2363d5-90da-4f59-b1f8-25fbb4bec2d8</uuid>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='ppc64le' machine='pseries-rhel7.3.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tmp/autotest.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f4:85:91'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <input type='keyboard' bus='usb'/>
    <input type='mouse' bus='usb'/>
    <graphics type='vnc' port='-1' autoport='yes'>
      <listen type='address'/>
    </graphics>
    <video>
      <model type='vga' vram='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
</domain>

$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Failed to connect socket to '/home/new_user/.cache/libvirt/virtlogd-sock': Connection refused

Some problem with virtlogd; on x86_64 this works. Start virtlogd under this user to work around it:

$ virtlogd --daemon

$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: /usr/libexec/qemu-bridge-helper --use-vnet --br=virbr0 --fd=24: failed to communicate with bridge helper: Transport endpoint is not connected
stderr=libvirt:  error : internal error: cannot apply process capabilities -1

Another problem: qemu-bridge-helper failed; will file a separate bug for this.

The memory lock problem can't be reproduced with qemu-kvm-rhev-2.6.0-10.el7.ppc64le

Comment 9 Richard W.M. Jones 2016-07-01 15:46:50 UTC
stderr=libvirt:  error : internal error: cannot apply process capabilities -1
is likely to be bug 1351995 (ie. a completely different thing)

Comment 10 Andrea Bolognani 2016-07-04 07:57:50 UTC
(In reply to Richard W.M. Jones from comment #9)
> stderr=libvirt:  error : internal error: cannot apply process capabilities -1
> is likely to be bug 1351995 (ie. a completely different thing)

Yeah, I already checked on Friday that that was the case.
Didn't get around to updating the BZ though.

Comment 11 Wayne Sun 2016-07-04 09:02:08 UTC
As checked in the CI job for 2.0.0-1, with audit 2.6.2-1 which fixed the bug in
comment #8, the bug is now reproduced with these packages:

libvirt	2.0.0-1.el7
kernel	3.10.0-327.el7
qemu-kvm-rhev	2.6.0-11.el7

Comment 12 Andrea Bolognani 2016-07-04 09:09:30 UTC
A fix for this issue has been posted upstream.

https://www.redhat.com/archives/libvir-list/2016-July/msg00072.html

Comment 13 Andrea Bolognani 2016-07-04 13:08:42 UTC
The fix has been committed upstream.

commit cd89d3451b8efcfed05ff1f4a91d9b252dbe26bc
Author: Andrea Bolognani <abologna>
Date:   Wed Jun 29 10:22:32 2016 +0200

    qemu: Memory locking is only required for KVM guests on ppc64
    
    Due to the way the hardware works, KVM on ppc64 always requires
    memory locking; however, that is not the case for non-KVM ppc64
    guests, eg. ppc64 guests that are running on x86_64 with TCG.
    
    Only require memory locking for ppc64 guests if they are using
    KVM or, as it's the case for all architectures, they have host
    devices assigned using VFIO.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1350772

v2.0.0-37-gcd89d34

Comment 16 Dan Zheng 2016-07-13 07:33:58 UTC
*Packages used to reproduce:

qemu-kvm-rhev-2.6.0-11.el7.ppc64le
libvirt-2.0.0-1.el7.ppc64le
kernel-3.10.0-327.el7.ppc64le

*Reproduced with 2 scenarios:
1. Non-root user start guest on PPC host
Steps:
# useradd dzheng
# su - dzheng
$ virsh define /tmp/rpm/libvirt/guest.xml
Domain avocado-vt-vm1 defined from /tmp/rpm/libvirt/guest.xml
$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     avocado-vt-vm1                 shut off
$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: internal error: Process exited prior to exec: libvirt:  error : cannot limit locked memory to 20971520: Operation not permitted


2. libguestfs-test-tool
Install libvirt, qemu-kvm-rhev, and libguestfs packages.
# systemctl restart libvirtd
Log on as root again
# ulimit -l
64
# su - dzheng
$ ulimit -l
64
$ /usr/bin/libguestfs-test-tool 
...
Original error from libvirt: internal error: Process exited prior to exec: libvirt:  error : cannot limit locked memory to 18874368: Operation not permitted [code=1 int1=-1]
libguestfs-test-tool: failed to launch appliance
See attachment (ulimit_64_dzheng_qemu_11_fail.libguestfs.log)

As workaround,
Log on as root
# ulimit -l unlimited
# su - dzheng
$ ulimit -l
unlimited
$ /usr/bin/libguestfs-test-tool
...
libguestfs: appliance is up
Guest launched OK.
...
===== TEST FINISHED OK =====
See attachment (ulimit_unlimited_dzheng_qemu_11_pass.libguestfs.log)

Comment 17 Dan Zheng 2016-07-13 07:36:44 UTC
Created attachment 1179117 [details]
libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-11.el7

Comment 18 Dan Zheng 2016-07-13 07:39:12 UTC
Created attachment 1179118 [details]
libguestfs-test-tool pass output with qemu-kvm-rhev-2.6.0-11.el7 unlimited work around

Comment 19 Andrea Bolognani 2016-07-13 07:49:45 UTC
(In reply to Dan Zheng from comment #16)
> *Packages used to reproduce:
> 
> qemu-kvm-rhev-2.6.0-11.el7.ppc64le
> libvirt-2.0.0-1.el7.ppc64le
> kernel-3.10.0-327.el7.ppc64le

The fix has been included in libvirt-2.0.0-2.el7, so the
version you're using is too old... Or are you creating a
baseline for testing the fix?

Comment 20 Dan Zheng 2016-07-14 00:37:25 UTC
Andrea,
Yes, comment 16 is just to reproduce the issue. I did upgrade qemu to
qemu-kvm-rhev-2.6.0-13.el7.ppc64le, but libvirt is still libvirt-2.0.0-1.el7, without the upgrade. With that combination the scenarios in comment 16 pass even without your new code, so I am wondering which scenario would prove your code is required and takes effect.

BTW, it is hard for me to run virt-install --arch ppc64le ... on an x86_64 host for the scenario used at the beginning of the bug, because I cannot set up TCG using downstream packages, I think. Any other suggestions?

Comment 21 Andrea Bolognani 2016-07-14 12:00:22 UTC
(In reply to Dan Zheng from comment #20)
> Andrea,
> Yes. comment 16 is just to reproduce the issue. And I did upgrade qemu to 
> qemu-kvm-rhev-2.6.0-13.el7.ppc64le, but libvirt is still libvirt-2.0.0-1.el7
> without upgrade. Then the scenarios in comment 16 can pass without your new
> codes. So I am thinking about what the scenario is to prove your codes
> required and take effect.

Scenario 1 failed outright; scenario 2 failed until you used a
workaround. Those are the failures you're looking for.

You either need to keep qemu-kvm-rhev at version 2.6.0-11.el7,
or run

  $ ulimit -l 64

to make sure your memory locking limit is very low;
additionally, you need to make sure that the guest XML starts
with

  <domain type='qemu'>

which tells libvirt to use TCG instead of KVM. After you've
done this, scenario 1 should fail with libvirt-2.0.0-1.el7
and succeed with libvirt-2.0.0-2.el7.
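
A quick way to double-check the domain type before starting the guest (an illustrative command, not part of the original instructions; the first line of the dump shows the type):

  $ virsh dumpxml avocado-vt-vm1 | head -n 1
  <domain type='qemu'>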

Not sure if there's a way to force libguestfs-test-tool to
use TCG in order to test scenario 2... Rich? :)

> BTW, it is hard for me to run virt-install --arch ppc64le ... on an
> x86_64 host for the scenario used at the beginning of the bug, because I cannot
> set up TCG using downstream packages, I think. Any other suggestions?

We don't ship qemu-system-ppc64 on x86_64, so running ppc64
guests on x86_64 hosts is not a relevant use case for
downstream. No need to test it.

Comment 22 Richard W.M. Jones 2016-07-14 14:14:23 UTC
Yes: http://libguestfs.org/guestfs.3.html#force_tcg

Comment 23 Andrea Bolognani 2016-07-15 11:01:12 UTC
(In reply to Richard W.M. Jones from comment #22)
> Yes: http://libguestfs.org/guestfs.3.html#force_tcg

Sweet! Thank you :)

Comment 24 Dan Zheng 2016-07-18 07:52:02 UTC
*Packages used to reproduce:

qemu-kvm-rhev-2.6.0-13.el7.ppc64le
libvirt-2.0.0-2.el7.ppc64le
kernel-3.10.0-461.el7.ppc64le
libguestfs-1.32.6-1.el7.ppc64le


Scenario 1. Non-root user start guest on PPC host
Steps:
# useradd dzheng
# su - dzheng

Edit guest.xml to make sure <domain type='qemu'>
$ virsh define /tmp/rpm/libvirt/guest.xml
Domain guest1 defined from /tmp/rpm/libvirt/guest.xml

$ ulimit -l 64
$ ulimit -l
64

$ virsh start avocado-vt-vm1

$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     guest1                         running

User can log on the guest.

--- With libvirt-2.0.0-1.el7.ppc64le, scenario 1 can be reproduced with the same error message as in comment 16.

Pass.
*******************************************************

Scenario 2. libguestfs-test-tool
# su - dzheng
$ ulimit -l
64
$ export LIBGUESTFS_BACKEND_SETTINGS=force_tcg
$ /usr/bin/libguestfs-test-tool 

[    1.913667] Rebooting in 1 seconds..libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages
libguestfs: child_cleanup: 0x1000f300340: child process died
libguestfs: error: guestfs_launch failed, see earlier error messages
libguestfs-test-tool: failed to launch appliance

$ ulimit -l 
65536
Same error message.

-- With libvirt-2.0.0-1.el7.ppc64le, same error message. 

See attachment libguestfs-test-tool-tcg-64-fail.log

Comment 25 Dan Zheng 2016-07-18 07:53:28 UTC
Created attachment 1180943 [details]
libguestfs-test-tool fail output with qemu-kvm-rhev-2.6.0-13.el7

Comment 26 Andrea Bolognani 2016-07-19 08:50:37 UTC
Okay, there is a different bug that causes libguestfs-test-tool
to fail in this situation - I've just filed it as Bug 1350772.

So I think it's fair to ignore the libguestfs-test-tool failure
for the moment, and just test that a regular libvirt TCG guest
can be started by an unprivileged user with low memory locking
limit.

What kind of OS is installed in the avocado-vt-vm1 guest you
tested in Comment 24? According to my tests for Bug 1350772,
it would have to be something oldish in order to boot in TCG
mode...

Comment 27 Dan Zheng 2016-07-20 03:21:42 UTC
Andrea,

The OS in avocado-vt-vm1 is 
Red Hat Enterprise Linux Server release 7.2 (Maipo), kernel 3.10.0-327.3.1.el7.ppc64le

Any other information you want?

Comment 28 Andrea Bolognani 2016-07-20 08:25:53 UTC
That explains it - RHEL 7.2 guests can boot fine on the POWER 7
processor that TCG emulates by default.

No more information needed from my side, and I think the bug
can be moved to VERIFIED now.

Comment 29 Dan Zheng 2016-07-21 01:55:37 UTC
Based on comment 28 above, I am moving it to VERIFIED now.

Comment 31 errata-xmlrpc 2016-11-03 18:47:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2577.html

