Bug 1737256
Summary: Provisioning VM on 4k gluster storage fails with "Invalid argument" - qemu fail to detect block size

Product: Fedora
Component: qemu
Version: 31
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Fixed In Version: qemu-4.1.0-2.fc31
Last Closed: 2019-09-28 00:01:43 UTC
Type: Bug
Severity: unspecified
Priority: unspecified
Reporter: Nir Soffer <nsoffer>
Assignee: Fedora Virtualization Maintainers <virt-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: amit, berrange, cfergeau, crobinso, dwmw2, itamar, kwolf, ndevos, pbonzini, rjones, sgarzare, virt-maint, vjuranek
Bug Blocks: 1592916
Description (Nir Soffer, 2019-08-04 17:54:21 UTC)
Niels, do we have a way to detect the logical block size when using libgfapi?

In QEMU, block/gluster.c does not implement BlockDriver->bdrv_probe_blocksizes(). I think it should be possible by using results from fstat()/glfs_fstat() or statvfs()/glfs_statvfs(). Those functions would return the same values as the matching syscalls over FUSE gluster:// URLs. It is not clear to me why that would fail... Can you get the block-size with `stat` on both a FUSE mount point and on the XFS filesystem that functions as the brick?

I tested again the case when the first block of the device is allocated - in this case qemu detects the block size correctly. Previously I aborted this test when I found that the guest still sees a block size of 512 bytes.

I ran this script:

name=/tmp/qemu-fuse-allocated.trace
images=/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com\:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images
disk=4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39
cdrom=3119fe7e-f576-46c3-880b-9590cb4619da/46efd27f-3bbe-4183-91f7-912c163b8eac

dd if=/dev/zero bs=4096 count=1 of=$images/$disk conv=fsync
truncate -s 1g $images/$disk

strace -f -tt -o $name qemu-kvm \
    -drive file=$images/$disk,format=raw,cache=none,if=virtio \
    -cdrom $images/$cdrom \
    -netdev tap,id=net1,ifname=tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=net1,mac=56:6f:fe:64:00:01 \
    -m 1024 \
    -nographic

This is the interesting output from the guest:

localhost:~# setup-alpine
...
Available disks are:
  vda   (1.1 GB 0x1af4 )
Which disk(s) would you like to use? (or '?' for help or 'none') [none] vda
The following disk is selected:
  vda   (1.1 GB 0x1af4 )
How would you like to use it? ('sys', 'data', 'lvm' or '?' for help) [?] sys
WARNING: The following disk(s) will be erased:
  vda   (1.1 GB 0x1af4 )
WARNING: Erase the above disk(s) and continue? [y/N]: y
Creating file systems...
Installing system on /dev/vda3:
/mnt/boot is device /dev/vda1
100% ████████████████████████████████████████████
==> initramfs: creating /boot/initramfs-virt
gzip: write error: Read-only file system
cpio: write error: Broken pipe
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/cdrom': Read-only file system
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/floppy': Read-only file system
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/usb': Read-only file system
...

localhost:~# dmesg
...
[  231.776768] vda: vda1 vda2 vda3
[  232.229337] print_req_error: I/O error, dev vda, sector 100354
[  232.229460] print_req_error: I/O error, dev vda, sector 101362
[  232.229603] print_req_error: I/O error, dev vda, sector 102370
[  232.229697] print_req_error: I/O error, dev vda, sector 102402
[  232.229791] print_req_error: I/O error, dev vda, sector 103410
[  232.229883] print_req_error: I/O error, dev vda, sector 104418
[  232.229976] print_req_error: I/O error, dev vda, sector 104450
[  232.230086] print_req_error: I/O error, dev vda, sector 105458
[  232.230180] print_req_error: I/O error, dev vda, sector 106466
[  232.230272] print_req_error: I/O error, dev vda, sector 106498
[  233.832220] Adding 262140k swap on /dev/vda2.  Priority:-2 extents:1 across:262140k
[  235.754207] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
[  235.988453] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
[  238.602871] print_req_error: 2 callbacks suppressed
[  238.602909] print_req_error: I/O error, dev vda, sector 0
[  257.568921] print_req_error: I/O error, dev vda, sector 43684
[  257.568929] EXT4-fs warning (device vda1): ext4_end_bio:323: I/O error 10 writing to inode 15 (offset 0 size 2097152 starting block 22866)

Looking at the trace:

Thread 18708 opened the drive for read, and detected the block size:

18708 00:45:42.117603 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
18708 00:45:42.117674 fstat(9, {st_mode=S_IFREG|0660, st_size=1073741824, ...}) = 0
18708 00:45:42.117711 lseek(9, 0, SEEK_END) = 1073741824
18708 00:45:42.117745 ioctl(9, BLKSSZGET, 0x7ffc897c6084) = -1 ENOSYS (Function not implemented)
18708 00:45:42.117798 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.119038 pread64(9, 0x559ae5039a00, 512, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.120328 pread64(9, 0x559ae5039a00, 1024, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.121431 pread64(9, 0x559ae5039a00, 2048, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.122442 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.123028 ioctl(9, BLKSSZGET, 0x7ffc897c6184) = -1 ENOSYS (Function not implemented)
18708 00:45:42.123112 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.123534 pread64(9,  <unfinished ...>
18711 00:45:42.124234 <... nanosleep resumed>0x7fb6af9a26b0) = 0
18711 00:45:42.124268 futex(0x559ae4d9aa68, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
18708 00:45:42.124553 <... pread64 resumed>0x559ae503ac00, 512, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.124579 pread64(9, 0x559ae503ac00, 1024, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.125551 pread64(9, 0x559ae503ac00, 2048, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.126581 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096

Then it opens the drive again for read-write - but no attempt was made to detect the block size:

18708 00:45:42.185682 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDWR|O_DIRECT|O_CLOEXEC) = 15

The first failure on this file descriptor is a pwritev() whose first element is only 1024 bytes long, followed by 4096-byte elements (the repeated iov entries, all zero-filled with iov_len=4096, are elided here for readability):

18924 00:49:34.531792 pwritev(15, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, ...], 1026, 51380224) = -1 EINVAL (Invalid argument)

So it looks like QEMU fails to detect that the vector is not aligned. Based on bdrv_qiov_is_aligned(), bdrv_min_mem_align(bs) returns the wrong value at this point.

Created attachment 1600805 [details]: tests script, output and trace with allocated image

This contains the full output and trace mentioned in comment 3.

Note that GlusterFS (over FUSE and libgfapi) does not support ioctl() calls. The BLKSSZGET that gets requested comes from block/file-posix.c when running over FUSE. Maybe block/file-posix.c:probe_logical_blocksize() could have a fallback to get the sector/blocksize from fstatvfs(). Please also see comment #2 so that we can verify that Gluster supports getting the non-512 size over the network.

(In reply to Niels de Vos from comment #2)
> In QEMU, block/gluster.c does not implement BlockDriver->bdrv_probe_blocksizes().

This is the wrong one. It should set bs->bl.request_alignment somewhere during bdrv_open().

(In reply to Nir Soffer from comment #3)
> Then it opens the drive again for read-write - but no attempt was made to
> detect the block size:

This looked like an interesting observation, but after checking the code and trying it out with a simple example: the alignment isn't reset for the new file descriptor, we just keep using the old value. This should be fine because changing from O_RDONLY to O_RDWR shouldn't make any difference for the required alignment.

> So it looks like QEMU fails to detect that the vector is not aligned. Based
> on bdrv_qiov_is_aligned(), bdrv_min_mem_align(bs) returns the wrong value
> at this point.

I'm not sure what the required memory alignment actually is in the gluster case. But bdrv_co_preadv(), which uses bs->bl.request_alignment, should already have made the request aligned long before bdrv_qiov_is_aligned(), so somehow at least request_alignment, but probably both values, got lost at some point.

I'll try to find a way to get a block device that requires 4k alignment on my host, so I can hopefully reproduce the problem.
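The alignment check under discussion can be sketched with a small Python analogue. This is a simplified model of what bdrv_qiov_is_aligned() does, not QEMU's actual implementation: treat the I/O vector as (buffer address, length) pairs and check them against the probed memory and request alignments.

```python
def qiov_is_aligned(iov, mem_align, req_align):
    """Simplified analogue of QEMU's bdrv_qiov_is_aligned(): every buffer
    address must satisfy the memory alignment and every element length the
    request alignment.  Sketch only, not the real QEMU code."""
    for base, length in iov:
        if base % mem_align != 0 or length % req_align != 0:
            return False
    return True

# Shape of the failing pwritev above: a 1024-byte first element followed by
# 4096-byte elements.  With the correct 4k alignment the vector is flagged
# as unaligned (and QEMU would bounce it through an aligned buffer); with a
# wrongly probed 512-byte alignment it looks fine and is passed straight to
# the kernel, which then fails the request with EINVAL.
qiov = [(0x7f0000000000, 1024)] + [(0x7f0000001000, 4096)] * 3
print(qiov_is_aligned(qiov, 4096, 4096))  # False
print(qiov_is_aligned(qiov, 512, 512))    # True
```

This illustrates why a too-small probed alignment is silently dangerous: the check passes, so no bounce buffer is used, and the error only surfaces later as EINVAL from the O_DIRECT write.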
(In reply to Niels de Vos from comment #5)
> Note that GlusterFS (over FUSE and libgfapi) does not support ioctl() calls.
> The BLKSSZGET that gets requested comes from block/file-posix.c when running
> over FUSE. Maybe block/file-posix.c:probe_logical_blocksize() could have a
> fallback to get the sector/blocksize from fstatvfs().

I can look into using fstatvfs() as an additional source of information. However, I really think gluster should behave consistently with alignment requirements and enforce the same alignment whether or not the accessed data is allocated. If the ioctl() based detection fails, QEMU simply tries different request sizes and picks the smallest one that succeeds. So if gluster made sure that the 4k alignment is consistently required with O_DIRECT, instead of just in some cases, QEMU would already pick up the right alignment.

(In reply to Niels de Vos from comment #2)
> Can you get the block-size with `stat` on both a FUSE mount point and on
> the XFS filesystem that functions as the brick?

statvfs does not help, since it reports the file system block size, and not the underlying block device block size.

This is a result from a gluster fuse mount:

>>> os.statvfs("/tmp/gv0/fedora29.raw")
posix.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=26201600, f_bfree=25280117, f_bavail=25280117, f_files=52428800, f_ffree=52428010, f_favail=52428010, f_flag=4096, f_namemax=255)

This is a result inside the gluster node:

>>> os.statvfs("/export/vdo0/brick")
os.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=26201600, f_bfree=25308562, f_bavail=25308562, f_files=52428800, f_ffree=52427979, f_favail=52427979, f_flag=4096, f_namemax=255)

This is a result from my local file system (512-byte sector size):

>>> os.statvfs(".")
posix.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=46116700, f_bfree=28182751, f_bavail=25822739, f_files=11780096, f_ffree=11277430, f_favail=11277430, f_flag=4096, f_namemax=255)

I think gluster should either extend statvfs to return additional information, or use another API to report storage alignment requirements. It would be best if this works with both fuse and libgfapi.

I sent this patch fixing the issue for gluster fuse mounts:

diff --git a/block/file-posix.c b/block/file-posix.c
index 4479cc7ab4..d29b9e5229 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1122,7 +1122,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
     }
 
     raw_probe_alignment(bs, s->fd, errp);
-    bs->bl.min_mem_alignment = s->buf_align;
+    bs->bl.min_mem_alignment = MAX(s->buf_align, bs->bl.request_alignment);
     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize());
 }

This will not help gluster:// drives; for those we need a way to probe the block size, or maybe try the same guess as used in file-posix.c?

Created attachment 1600968 [details]: qemu trace log on centos

FYI, I reproduced Nir's results with ovirt-4.3 on CentOS. Attaching qemu trace logs, obtained by Nir's scripts attached to this BZ.
# Gluster nodes
$ rpm -qa | grep 'glusterfs\|vdo' | sort
glusterfs-api-6.4-1.el7.x86_64
glusterfs-client-xlators-6.4-1.el7.x86_64
glusterfs-cli-6.4-1.el7.x86_64
glusterfs-fuse-6.4-1.el7.x86_64
glusterfs-libs-6.4-1.el7.x86_64
glusterfs-server-6.4-1.el7.x86_64
glusterfs-6.4-1.el7.x86_64
kmod-kvdo-6.1.1.125-5.el7.x86_64
vdo-6.1.1.125-3.el7.x86_64
# Hypervisor
$ rpm -qa | grep qemu | sort
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.12.x86_64
qemu-guest-agent-2.12.0-2.el7.x86_64
qemu-img-ev-2.12.0-18.el7_6.5.1.x86_64
qemu-kvm-common-ev-2.12.0-18.el7_6.5.1.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.5.1.x86_64
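As an aside, the statvfs limitation described earlier in the thread is easy to see from Python: os.statvfs() wraps statvfs(2) and only exposes filesystem block sizes, never the device's logical sector size. A minimal sketch (field values depend on the filesystem it is run against):

```python
import os

# statvfs(2) reports filesystem-level block sizes.  f_bsize/f_frsize are
# typically 4096 whether the underlying device uses 512-byte or 4k logical
# sectors, so they cannot be used to derive O_DIRECT alignment requirements.
st = os.statvfs(".")
print("f_bsize: ", st.f_bsize)
print("f_frsize:", st.f_frsize)
# There is no sector-size field; on Linux that information comes from
# ioctl(BLKSSZGET) on the block device, which is unavailable over
# FUSE/gluster mounts.
```

This is why both the 512-byte local filesystem and the 4k gluster brick report f_bsize=4096 above: the call simply does not carry the information QEMU needs.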
Niels, I think we need more information on the requirements that gluster makes for O_DIRECT files. If the patch from comment 8 fixes things for Nir, the problem must be related to alignment somehow, but it's entirely unclear in what way, because gluster behaves seemingly unpredictably. (Note that that patch is good for getting more debugging data, but it's nothing to actually merge in QEMU, because it would involve unnecessary bounce buffers in cases where gluster isn't even involved.)

I checked the kernel code to see which requirements are made there, but in the fuse path, I couldn't actually see any on the kernel side, so I suppose these errors come from the userspace implementation.

Both Nir's and Vojtech's traces show that the first failing request has a first iov member that isn't aligned to 4k. However, there are more such requests recorded (including the initial probing done by QEMU itself) that succeeded before. To pick just one example, we have this in Vojtech's trace:

1416  pwritev(13, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2048}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024}], 3, 1339392 <unfinished ...>
1416  <... pwritev resumed> ) = 4096

So 4k alignment doesn't seem to be a requirement, at least in some cases. On the other hand, I saw failing requests that have a perfectly aligned size and buffer address:

1763  pread64(13,  <unfinished ...>
1763  <... pread64 resumed> 0x565377b04000, 4096, 2097152) = -1 EINVAL (Invalid argument)

There are even a few fdatasync() calls that return EINVAL, which must be an internal problem in gluster, because QEMU doesn't even have a chance to provide any data other than the file descriptor that could be invalid.

So from the outside, this looks like gluster behaving completely inconsistently, and I can't even manually figure out what the rules really are.
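Sifting the attached traces for this kind of inconsistency can be partly automated. The helper below is hypothetical (it is not part of QEMU or the attached scripts): it scans strace output for I/O syscalls that returned EINVAL, so the failing calls can be compared against the unaligned requests that succeeded.

```python
import re

# Match strace lines like:
#   1763 <... pread64 resumed> 0x5653..., 4096, 2097152) = -1 EINVAL (...)
# capturing the PID and the syscall name.  Hypothetical helper for log triage.
EINVAL_RE = re.compile(
    r'^(?P<pid>\d+)\s.*\b(?P<call>pread64|preadv|pwrite64|pwritev|fdatasync)\b'
    r'.*=\s-1\s+EINVAL')

def einval_calls(trace_text):
    """Return (pid, syscall) for every trace line that failed with EINVAL."""
    return [(m['pid'], m['call'])
            for line in trace_text.splitlines()
            if (m := EINVAL_RE.match(line))]

# Two lines quoted from the trace above: an aligned pread64 that failed,
# and a pwritev that succeeded.
sample = (
    '1763 <... pread64 resumed> 0x565377b04000, 4096, 2097152) '
    '= -1 EINVAL (Invalid argument)\n'
    '1416 <... pwritev resumed> ) = 4096\n'
)
print(einval_calls(sample))  # [('1763', 'pread64')]
```

Grouping the output by syscall and offset makes it easier to spot whether the EINVALs follow any allocation or alignment pattern.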
Bug 1738657 is another issue caused by the same alignment issue, and fixed by the same patch (with a workaround).

I reproduced the same issue with a local xfs file system with a 4k sector size.

First I added a log in qemu to make debugging easier:

diff --git a/block/file-posix.c b/block/file-posix.c
index 4479cc7ab4..f33b542b33 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -378,6 +378,10 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
         error_setg(errp, "Could not find working O_DIRECT alignment");
         error_append_hint(errp, "Try cache.direct=off\n");
     }
+
+    /* TODO: add trace event */
+    printf("Probed aligment for %s buf_align=%ld request_alignment=%d\n",
+           bs->filename, s->buf_align, bs->bl.request_alignment);
 }

The file system was created like this:

# truncate -s 50g /var/tmp/backing
# losetup -f /var/tmp/backing --show --sector-size=4096
# mkfs -t xfs /dev/loop0
# mkdir /tmp/loop0
# mount /dev/loop0 /tmp/loop0
# chown nsoffer:nsoffer /tmp/loop0

Then I created a new sparse raw file:

$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

And started qemu with this image and a fedora 29 iso:

$ strace -f -tt -o /tmp/qemu-xfs-4k.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso
Probed aligment for /tmp/loop0/fedora29.raw buf_align=512 request_alignment=4096
Probed aligment for /tmp/loop0/fedora29.raw buf_align=512 request_alignment=4096

We can see that buf_align was not detected correctly - this is probably because the file contains no data, so xfs optimized the request without doing any I/O or checking the buffer alignment.

In the guest I started a standard installation. It failed to create the root filesystem after a few seconds. I could not extract the error message from the installer; copy and paste from the guest did not work (bad configuration?).

In the trace we can see how qemu tries to read a request aligned to 512 bytes:

23534 20:44:53.194905 preadv(24,  <unfinished ...>
23163 20:44:53.194911 <... ioctl resumed>, 0) = 0
23534 20:44:53.194923 <... preadv resumed>[{iov_base=0x7ff724015000, iov_len=512}, {iov_base=0x7ff7241ea000, iov_len=3584}], 2, 0) = -1 EINVAL (Invalid argument)

To make the issue with buf_align clearer, I also added this patch:

diff --git a/block/file-posix.c b/block/file-posix.c
index f33b542b33..e93fd13a9a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     BDRVRawState *s = bs->opaque;
     char *buf;
     size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
+    size_t alignments[] = {1, 512, 1024, 2048, 4096};
 
     /* For SCSI generic devices the alignment is not really used.
        With buffered I/O, we don't have any restrictions. */
@@ -351,9 +352,11 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
+        int i;
         size_t align;
         buf = qemu_memalign(max_align, 2 * max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
             if (raw_is_io_aligned(fd, buf + align, max_align)) {
                 s->buf_align = align;
                 break;
@@ -363,9 +366,11 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     }
 
     if (!bs->bl.request_alignment) {
+        int i;
         size_t align;
         buf = qemu_memalign(s->buf_align, max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
             if (raw_is_io_aligned(fd, buf, align)) {
                 bs->bl.request_alignment = align;
                 break;

With this we see:

xfs on a 4k loop device:

$ ./qemu-img convert -f raw -O raw -t none -T none /tmp/loop0/src.raw /tmp/loop0/dst.raw
Probed aligment for /tmp/loop0/src.raw buf_align=1 request_alignment=4096
Probed aligment for /tmp/loop0/dst.raw buf_align=1 request_alignment=4096

request_alignment is correct, since we use xfsctl().

Gluster on xfs on a 4k vdo device:

$ ./qemu-img convert -f raw -O raw -t none -T none /tmp/gv0/src.raw /tmp/gv0/dst.raw
Probed aligment for /tmp/gv0/src.raw buf_align=1 request_alignment=1
Probed aligment for /tmp/gv0/dst.raw buf_align=1 request_alignment=1

request_alignment is incorrect, since reads from an unallocated area do not do any I/O on xfs.
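The failure mode just described can be sketched with a toy model. This is a simulation of the probing fallback's logic, not QEMU code: the fallback amounts to "try candidate request sizes in order and keep the first one that succeeds", which goes wrong when the backend accepts every size.

```python
# Toy model of the size-probing fallback in raw_probe_alignment(): try
# candidate request sizes in order, return the first that succeeds.
# read_ok stands in for raw_is_io_aligned(); this is a simulation only.
def probe_request_alignment(read_ok, candidates=(1, 512, 1024, 2048, 4096)):
    for size in candidates:
        if read_ok(size):
            return size
    return None

# Sparse file on XFS: reads from unallocated space never touch the device,
# so every size "succeeds" and the probe settles on the smallest candidate.
print(probe_request_alignment(lambda size: True))  # 1

# Allocated data on a 4k-sector device: only 4k-multiple reads succeed,
# so the probe finds the real alignment.
print(probe_request_alignment(lambda size: size % 4096 == 0))  # 4096
```

This is exactly why the first block of the image being allocated (the dd in the reproduction script) changes the outcome: it forces the probe reads to hit real storage.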
And when we try to provision a VM, we get the same error:

$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

$ strace -f -tt -o /tmp/qemu-xfs-4k-align-1.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso
Probed aligment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096
Probed aligment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096

I started an installation in the guest here; it got stuck after a few seconds. In the trace we see a similar failure:

28947 21:29:05.771029 preadv(24, [{iov_base=0x7fb26e2bba00, iov_len=512}, {iov_base=0x7fb2d4559000, iov_len=3584}], 2, 1721761792) = -1 EINVAL (Invalid argument)

This shows that this issue is not an issue with gluster, but with qemu. Checking the alignment requirement using buf_align is not effective. It used to work with storage with a sector size of 512 bytes just because this was the first value checked.
I tested again with the patch using MAX(buf_align, request_alignment) to set min_mem_alignment:

diff --git a/block/file-posix.c b/block/file-posix.c
index e93fd13a9a..08dd330fd6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1131,7 +1131,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
     }
 
     raw_probe_alignment(bs, s->fd, errp);
-    bs->bl.min_mem_alignment = s->buf_align;
+    bs->bl.min_mem_alignment = MAX(s->buf_align, bs->bl.request_alignment);
     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize());
 }

$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

$ strace -f -tt -o /tmp/qemu-xfs-4k-align-1-patched.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso
Probed aligment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096
Probed aligment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096

This time the VM was provisioned successfully.

Created attachment 1601918 [details]: Trace for failed installation on local xfs on 4k loop device

Created attachment 1601919 [details]: Trace for failed installation on local xfs on 4k loop device with buf_align=1
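The effect of the one-line MAX() change tested above can be illustrated numerically (an illustration of the arithmetic only, using the values from the probe output, not the QEMU code itself):

```python
# Sketch of the tested one-liner: even when buffer-alignment probing is
# fooled into buf_align=1 on a sparse file, the memory alignment advertised
# to the block layer is clamped up to the (correctly detected) request
# alignment, so guest buffers get bounced to properly aligned memory.
def min_mem_alignment(buf_align, request_alignment):
    return max(buf_align, request_alignment)

# Values from the probe output above: buf_align=1, request_alignment=4096.
print(min_mem_alignment(1, 4096))   # 4096
# On ordinary 512-byte storage nothing changes:
print(min_mem_alignment(512, 512))  # 512
```

With min_mem_alignment at 4096, bdrv_qiov_is_aligned() now rejects the 1024-byte-first vectors seen in the traces, and QEMU routes them through an aligned bounce buffer instead of letting the O_DIRECT write fail with EINVAL.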
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle. Changing version to '31'.

Patch posted for review: https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html

Fixed upstream in: https://github.com/qemu/qemu/commit/a6b257a08e3d72219f03e461a52152672fec0612

Cole, do you know which qemu version will provide this fix in the virt-preview repo? Same for bug 1738657.

(In reply to Nir Soffer from comment #19)
> Cole, do you know which qemu version will provide this fix in virt-preview
> repo?
>
> Same for bug 1738657.

I will backport the patches to rawhide today and kick off virt-preview builds to match.

FEDORA-2019-6a014367d5 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-6a014367d5

qemu-4.1.0-2.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-6a014367d5

qemu-4.1.0-2.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.