Description of problem:
Converting a raw image on gluster storage with 4k sector size to the same storage
fails:
$ strace -f -tt -o /tmp/convert.trace qemu-img convert -f raw -O raw -t none -T none fedora29.raw fedora29-clone.raw
qemu-img: error while reading sector 4190208: Invalid argument
For info on how to create the gluster storage, see bug 1737256.
Version-Release number of selected component (if applicable):
qemu-img-4.1.0-0.1.rc2.fc29.x86_64 (from virt-preview repo)
How reproducible:
Always - depending on the image.
Steps to Reproduce:
1. Provision a VM on gluster storage with 4k sector size
(using https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html)
2. Try to clone the image using qemu-img
The trace shows:
1. Open the source image:
18703 20:14:58.755169 openat(AT_FDCWD, "fedora29.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
18703 20:14:58.755292 fstat(9, {st_mode=S_IFREG|0644, st_size=6442450944, ...}) = 0
18703 20:14:58.755337 lseek(9, 0, SEEK_END) = 6442450944
2. Try to detect buffer alignment by reading from buf + 512
18703 20:14:58.755503 pread64(9, "\353c\220\20\216\320\274\0\260\270\0\0\216\330\216\300\373\276\0|\277\0\6\271\0\2\363\244\352!\6\0"..., 4096, 0) = 4096
This succeeds, since Gluster has no alignment requirement for buffers:
the buffer used for direct I/O is allocated on the Gluster node.
However, the actual block size on this storage is 4096 bytes.
This is basically the same issue as in bug 1737256.
3. Try to detect request alignment by reading 512, 1024, 2048, 4096 bytes
into an aligned buffer:
18703 20:14:58.756835 pread64(9, 0x5562768ef200, 512, 0) = -1 EINVAL (Invalid argument)
18703 20:14:58.758469 pread64(9, 0x5562768ef200, 1024, 0) = -1 EINVAL (Invalid argument)
18703 20:14:58.760132 pread64(9, 0x5562768ef200, 2048, 0) = -1 EINVAL (Invalid argument)
18703 20:14:58.761631 pread64(9, <unfinished ...>
18704 20:14:58.762213 <... nanosleep resumed>0x7fdde6e20a70) = 0
18704 20:14:58.762249 futex(0x5562752ea508, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
18703 20:14:58.762530 <... pread64 resumed>"\353c\220\20\216\320\274\0\260\270\0\0\216\330\216\300\373\276\0|\277\0\6\271\0\2\363\244\352!\6\0"..., 4096, 0) = 4096
As expected, request alignment of 4096 bytes was detected.
4. Open target image (creating it)
18703 20:14:58.778105 openat(AT_FDCWD, "fedora29-clone.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
5. Truncate target image to 0 and then to virtual size
18706 20:14:58.885829 ftruncate(10, 0) = 0
...
18706 20:14:58.887700 ftruncate(10, 6442450944) = 0
6. Try to detect buffer alignment on target image
18703 20:14:58.933049 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
This wrongly detects a buffer alignment of 512 bytes, as in step 2.
7. Try to detect request alignment
18703 20:14:58.936354 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
This wrongly detects 512 because with xfs (used on the gluster backend),
reading from an unallocated file succeeds even with an unaligned request
length (xfs probably optimizes reads when the block is not allocated).
So now we have:
source:
buf_align = 512
request_alignment = 4096
min_mem_alignment = 512 <<< uses buf_align
target:
buf_align = 512
request_alignment = 512
min_mem_alignment = 512
We can argue that buf_align = 512 is correct, since Gluster does not require
any alignment, but qemu uses buf_align to check iov_len.
8. qemu fails to read 512 bytes from source image
18709 20:15:09.670293 preadv(9, [{iov_base=0x7fdde5013000, iov_len=2096640}, {iov_base=0x556276902000, iov_len=512}], 2, ) = -1 EINVAL (Invalid argument)
This is expected, since the underlying storage has sector size of 4096.
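To make the effect of min_mem_alignment concrete, here is a minimal Python model of the alignment check applied to the failing request. It is a simplified illustration and not qemu's actual code: iov_is_aligned is a hypothetical helper, and the addresses and lengths are copied from the preadv() call in step 8.

```python
def iov_is_aligned(iov, alignment):
    """Return True if every (base, length) pair is a multiple of alignment."""
    return all(base % alignment == 0 and length % alignment == 0
               for base, length in iov)


# The two buffers from the failing preadv() call in step 8.
iov = [(0x7fdde5013000, 2096640), (0x556276902000, 512)]

# With min_mem_alignment = 512 the request looks aligned, so it is
# submitted unchanged and the 4k storage rejects it with EINVAL.
passes_512 = iov_is_aligned(iov, 512)

# With an alignment of 4096 the same request is flagged as unaligned,
# because neither 2096640 nor 512 is a multiple of 4096.
passes_4096 = iov_is_aligned(iov, 4096)

print(passes_512, passes_4096)  # True False
```

Both buffer addresses are 4096-byte aligned; only the lengths are the problem, which is why checking iov_len against 512 lets the request through.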
I tried the same flow using qemu-img with patch:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00133.html
$ strace -f -tt -o /tmp/convert-patched.trace /home/nsoffer/src/qemu/build/qemu-img convert -f raw -O raw -t none -T none fedora29.raw fedora29-clone.raw
qemu-img: error while writing sector 4194303: Invalid argument
This fixes the read failure, but writing to the target fails.
The new trace shows:
1. Same flow for opening the source image and detecting block size
22894 20:49:00.487962 openat(AT_FDCWD, "fedora29.raw", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
...
22894 20:49:00.488662 pread64(9, "\353c\220\20\216\320\274\0\260\270\0\0\216\330\216\300\373\276\0|\277\0\6\271\0\2\363\244\352!\6\0"..., 4096, 0) = 4096
22894 20:49:00.489776 pread64(9, 0x5615d6044400, 512, 0) = -1 EINVAL (Invalid argument)
22894 20:49:00.491139 pread64(9, 0x5615d6044400, 1024, 0) = -1 EINVAL (Invalid argument)
22894 20:49:00.492392 pread64(9, 0x5615d6044400, 2048, 0) = -1 EINVAL (Invalid argument)
22894 20:49:00.493647 pread64(9, "\353c\220\20\216\320\274\0\260\270\0\0\216\330\216\300\373\276\0|\277\0\6\271\0\2\363\244\352!\6\0"..., 4096, 0) = 4096
2. Same flow for opening the target and detecting block size
22894 20:49:00.506364 openat(AT_FDCWD, "fedora29-clone.raw", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 10
...
22898 20:49:00.518228 ftruncate(10, 0) = 0
...
22898 20:49:00.518545 ftruncate(10, 6442450944) = 0
...
22894 20:49:00.529136 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
22894 20:49:00.529664 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512
source:
buf_align = 512
request_alignment = 4096
min_mem_alignment = 4096 <<< uses max(buf_align, request_alignment)
target:
buf_align = 512
request_alignment = 512
min_mem_alignment = 512
3. The read that failed before succeeds now
22912 20:49:11.086632 pread64(9, "6\317M5i?\332Qm\231\342\253\205+\303@ht\335\341\353\300\331\201\314\nb]3R\262\377"..., 2097152, 2145386496) = 2097152
4. Writing to target fails with EINVAL
22911 20:49:11.327101 pwrite64(10, "\246\243\366\4)U(\360\334^\345\277'\313\261\\\211a\302!P\360\300\263\222\2303\365\201\377\223/"..., 2097152, 2147483136 <unfinished ...>
22911 20:49:11.368238 <... pwrite64 resumed>) = -1 EINVAL (Invalid argument)
The write fails because 2147483136 is not aligned to the block size:
>>> 2147483136 % 4096
3584
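The same arithmetic ties the sector numbers in the two error messages to the byte offsets in the traces. This is a small sketch that assumes qemu-img reports errors in 512-byte sectors (the numbers in this report are consistent with that); sector_to_offset and misalignment are hypothetical helper names.

```python
SECTOR_SIZE = 512   # assumption: qemu-img error messages count 512-byte sectors
BLOCK_SIZE = 4096   # block size of the gluster storage in this report


def sector_to_offset(sector):
    """Convert a sector number from an error message to a byte offset."""
    return sector * SECTOR_SIZE


def misalignment(offset, block_size=BLOCK_SIZE):
    """Return how many bytes past a block boundary the offset is."""
    return offset % block_size


# "error while writing sector 4194303" (patched qemu-img)
write_offset = sector_to_offset(4194303)
print(write_offset, misalignment(write_offset))  # 2147483136 3584

# "error while reading sector 4190208" (unpatched qemu-img)
read_offset = sector_to_offset(4190208)
print(read_offset, misalignment(read_offset))    # 2145386496 0
```

The read offset is block aligned, so the first failure came from the 512-byte iov length rather than from the offset, while the write failure comes from the offset itself.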
As a workaround, I tried to fix target image block size detection by using
an existing file with the first block allocated
(a management system can easily ensure this):
$ rm fedora29-clone.raw
$ truncate -s 6g fedora29-clone.raw
$ dd if=/dev/zero bs=4096 count=1 of=fedora29-clone.raw conv=notrunc
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.016563 s, 247 kB/s
$ ls -lhs fedora29-clone.raw
4.0K -rw-rw-r--. 1 nsoffer nsoffer 6.0G Aug 7 21:05 fedora29-clone.raw
And now the command succeeded:
$ strace -f -tt -o /tmp/convert-patched-nocreate.trace /home/nsoffer/src/qemu/build/qemu-img convert -n -f raw -O raw -t none -T none fedora29.raw fedora29-clone.raw
The trace shows that request alignment is now detected correctly:
25185 21:06:42.432081 openat(AT_FDCWD, "fedora29-clone.raw", O_RDWR|O_DIRECT|O_CLOEXEC) = 10
...
25185 21:06:42.432841 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
25185 21:06:42.434180 pread64(10, 0x555901256a00, 512, 0) = -1 EINVAL (Invalid argument)
25185 21:06:42.435651 pread64(10, 0x555901256a00, 1024, 0) = -1 EINVAL (Invalid argument)
25185 21:06:42.437179 pread64(10, 0x555901256a00, 2048, 0) = -1 EINVAL (Invalid argument)
25185 21:06:42.438611 pread64(10, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
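The difference between the unallocated and the prepared target can be modelled in a few lines of Python. This is only a sketch of the probing sequence visible in the traces (512, 1024, 2048, 4096), not qemu's implementation; probe_request_alignment and make_reader are hypothetical names, and the fake reader stands in for pread64() on an O_DIRECT file descriptor.

```python
import errno


def probe_request_alignment(try_read, sizes=(512, 1024, 2048, 4096)):
    """Return the smallest size for which try_read(size) succeeds, else None."""
    for size in sizes:
        try:
            try_read(size)
        except OSError as e:
            if e.errno != errno.EINVAL:
                raise
            continue
        return size
    return None


def make_reader(block_size, allocated):
    """Build a fake direct-I/O read at offset 0 for a file on 4k storage."""
    def try_read(size):
        # Behaviour observed in the traces: a read from an unallocated
        # file succeeds for any length; an allocated block enforces the
        # real block size and fails with EINVAL otherwise.
        if allocated and size % block_size != 0:
            raise OSError(errno.EINVAL, "Invalid argument")
        return b"\0" * size
    return try_read


sparse_target = probe_request_alignment(make_reader(4096, allocated=False))
prepared_target = probe_request_alignment(make_reader(4096, allocated=True))
print(sparse_target, prepared_target)  # 512 4096
```

The probe itself is unchanged between the two runs; only the state of the first block of the file decides whether it reports 512 or 4096.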
So we have 2 issues:
- Incorrect use of buf_align to verify that preadv() arguments are aligned
(same as bug 1737256)
- Incorrect detection of target request alignment with unallocated file.
The second issue can be avoided by using an existing file with the -n argument.
I think this is a good enough workaround for a management system, but users
will fall into this trap.
Since qemu controls the target file, it can allocate one block of zeros
before detecting the block size.
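For a management system, preparing the target before running qemu-img convert -n could look roughly like the following Python sketch. It mirrors the truncate and dd commands used above; allocate_first_block is a hypothetical helper, the 4096-byte default is an assumption taken from this storage, and the file is expected not to exist yet (the shell version removes it first).

```python
import os


def allocate_first_block(path, size, block_size=4096):
    """Create a sparse file of `size` bytes with its first block written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        # Equivalent of: truncate -s <size> <path>
        os.ftruncate(fd, size)
        # Equivalent of: dd if=/dev/zero bs=4096 count=1 conv=notrunc
        os.pwrite(fd, b"\0" * block_size, 0)
        # Flush so the block is actually allocated before qemu-img probes it.
        os.fsync(fd)
    finally:
        os.close(fd)
```

Calling allocate_first_block("fedora29-clone.raw", 6 * 1024**3) would produce the same 6.0G file with one allocated block as in the listing above.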
I did not try gluster:// since it failed for bug 1737256, and libgfapi is
not a high priority for oVirt now.