Bug 1737256 - Provisioning VM on 4k gluster storage fails with "Invalid argument" - qemu fails to detect block size
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: qemu
Version: 31
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Fedora Virtualization Maintainers
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 1592916
 
Reported: 2019-08-04 17:54 UTC by Nir Soffer
Modified: 2019-09-28 00:01 UTC
CC List: 13 users

Fixed In Version: qemu-4.1.0-2.fc31
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-28 00:01:43 UTC
Type: Bug
Embargoed:


Attachments
test scripts, traces, and output of guest commands (1.47 MB, application/x-xz)
2019-08-04 17:54 UTC, Nir Soffer
test script, output and trace with allocated image (3.68 MB, application/x-xz)
2019-08-05 23:42 UTC, Nir Soffer
qemu trace log on centos (17.07 MB, application/x-xz)
2019-08-06 11:44 UTC, Vojtech Juranek
Trace for failed installation on local xfs on 4k loop device (10.71 MB, application/x-xz)
2019-08-08 19:48 UTC, Nir Soffer
Trace for failed installation on local xfs on 4k loop device with buf_align=1 (8.51 MB, application/x-xz)
2019-08-08 19:50 UTC, Nir Soffer

Description Nir Soffer 2019-08-04 17:54:21 UTC
Created attachment 1600406 [details]
test scripts, traces, and output of guest commands

Description of problem:

When running qemu with a drive on gluster storage with a block size of 4096 bytes,
the guest sees a device with a block size of 512 bytes. This causes mkfs to
fail when trying to create a file system on the device.

Version-Release number of selected component (if applicable):
qemu-kvm-4.1.0-0.1.rc2.fc29.x86_64 (from virt-preview repo)

How reproducible:
Always.

Steps to Reproduce:
1. Create raw image on gluster storage with block size of 4096 bytes
2. Start qemu with the image and boot from ISO image
3. Try to install

Actual results:
- Checking the block size in the guest shows a block size of 512 bytes
- mkfs fails to create a file system with EINVAL

Expected results:
- Checking the block size in the guest shows a block size of 4096 bytes
- Installation succeeds

Here are more details on the tested environment.


## Gluster nodes

The gluster nodes are Fedora 29 server with oVirt repos.

See these scripts if you want to create the same setup:
- https://raw.githubusercontent.com/oVirt/vdsm/master/contrib/deploy-gluster.sh
- https://raw.githubusercontent.com/oVirt/vdsm/master/contrib/create-gluster-volume.sh

# rpm -qa | grep 'glusterfs\|vdo' | sort
glusterfs-6.4-1.fc29.x86_64
glusterfs-api-6.4-1.fc29.x86_64
glusterfs-cli-6.4-1.fc29.x86_64
glusterfs-client-xlators-6.4-1.fc29.x86_64
glusterfs-fuse-6.4-1.fc29.x86_64
glusterfs-libs-6.4-1.fc29.x86_64
glusterfs-server-6.4-1.fc29.x86_64
kmod-kvdo-6.2.1.102-1.fc29.x86_64
vdo-6.2.1.102-1.fc29.x86_64


## vdo volumes tested

VDO status:
  Date: '2019-08-04 20:01:25+03:00'
  Node: voodoo4.tlv.redhat.com
Kernel module:
  Loaded: true
  Name: kvdo
  Version information:
    kvdo version: 6.2.1.102
Configuration:
  File: /etc/vdoconf.yml
  Last modified: '2019-08-04 05:46:04'
VDOs:
  vdo0:
    Acknowledgement threads: 1
    Activate: enabled
    Bio rotation interval: 64
    Bio submission threads: 4
    Block map cache size: 128M
    Block map period: 16380
    Block size: 4096
    CPU-work threads: 2
    Compression: enabled
    Configured write policy: auto
    Deduplication: enabled
    Device mapper status: 0 209715200 vdo /dev/vdb normal - online online 2163742 5242880
    Emulate 512 byte: disabled
    Hash zone threads: 1
    Index checkpoint frequency: 0
    Index memory setting: 0.25
    Index parallel factor: 0
    Index sparse: disabled
    Index status: online
    Logical size: 100G
    Logical threads: 1
    Max discard size: 4K
    Physical size: 20G
    Physical threads: 1
    Slab size: 2G
    Storage device: /dev/disk/by-id/virtio-gv0
    ...
  vdo1:
    Acknowledgement threads: 1
    Activate: enabled
    Bio rotation interval: 64
    Bio submission threads: 4
    Block map cache size: 128M
    Block map period: 16380
    Block size: 4096
    CPU-work threads: 2
    Compression: enabled
    Configured write policy: auto
    Deduplication: enabled
    Device mapper status: 0 209715200 vdo /dev/vdc normal - online online 1716855 5242880
    Emulate 512 byte: disabled
    Hash zone threads: 1
    Index checkpoint frequency: 0
    Index memory setting: 0.25
    Index parallel factor: 0
    Index sparse: disabled
    Index status: online
    Logical size: 100G
    Logical threads: 1
    Max discard size: 4K
    Physical size: 20G
    Physical threads: 1
    Slab size: 2G
    Storage device: /dev/disk/by-id/virtio-gv1
    ...


## bricks info

# xfs_info /export/vdo0
meta-data=/dev/mapper/vdo0       isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# xfs_info /export/vdo1
meta-data=/dev/mapper/vdo1       isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


## gluster volumes tested
 
2 volumes tested:
gv0 - using "features.shard on" - this disables support for SEEK_DATA/SEEK_HOLE
gv1 - using "features.shard off"

Volume Name: gv0
Type: Replicate
Volume ID: cbc5a2ad-7246-42fc-a78f-70175fb7bf22
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: voodoo4.tlv.redhat.com:/export/vdo0/brick
Brick2: voodoo5.tlv.redhat.com:/export/vdo0/brick
Brick3: voodoo8.tlv.redhat.com:/export/vdo0/brick (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: disable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
 
Volume Name: gv1
Type: Replicate
Volume ID: cd0c0b65-d2ee-44f6-9e8c-24cfe518f44d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: voodoo4.tlv.redhat.com:/export/vdo1/brick
Brick2: voodoo5.tlv.redhat.com:/export/vdo1/brick
Brick3: voodoo8.tlv.redhat.com:/export/vdo1/brick (arbiter)
Options Reconfigured:
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: off
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: disable
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


## Hypervisor

The hypervisor is oVirt 4.4, installed on Fedora 29.

$ rpm -qa | grep qemu | sort
ipxe-roms-qemu-20190125-2.git36a4c85f.fc29.noarch
libvirt-daemon-driver-qemu-5.5.0-1.fc29.x86_64
qemu-audio-alsa-4.1.0-0.1.rc2.fc29.x86_64
qemu-audio-oss-4.1.0-0.1.rc2.fc29.x86_64
qemu-audio-pa-4.1.0-0.1.rc2.fc29.x86_64
qemu-audio-sdl-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-curl-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-dmg-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-gluster-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-iscsi-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-nfs-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-rbd-4.1.0-0.1.rc2.fc29.x86_64
qemu-block-ssh-4.1.0-0.1.rc2.fc29.x86_64
qemu-common-4.1.0-0.1.rc2.fc29.x86_64
qemu-img-4.1.0-0.1.rc2.fc29.x86_64
qemu-kvm-4.1.0-0.1.rc2.fc29.x86_64
qemu-system-x86-4.1.0-0.1.rc2.fc29.x86_64
qemu-system-x86-core-4.1.0-0.1.rc2.fc29.x86_64
qemu-ui-curses-4.1.0-0.1.rc2.fc29.x86_64
qemu-ui-gtk-4.1.0-0.1.rc2.fc29.x86_64
qemu-ui-sdl-4.1.0-0.1.rc2.fc29.x86_64


The gluster storage was accessed both with fuse and using a gluster:// URL.

Here are the fuse mounts:

$ mount | grep glusterfs
voodoo4.tlv.redhat.com:/gv0 on /rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
voodoo4.tlv.redhat.com:/gv1 on /rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv1 type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)


## Typical failure flow

name=qemu-fuse-empty.trace
images=/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com\:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images
disk=4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39
cdrom=3119fe7e-f576-46c3-880b-9590cb4619da/46efd27f-3bbe-4183-91f7-912c163b8eac

truncate -s 0 $images/$disk
truncate -s 1g $images/$disk

strace -f -o $name qemu-kvm \
	-drive file=$images/$disk,format=raw,cache=none,if=virtio \
	-cdrom $images/$cdrom \
	-m 1024 \
	-nographic

...
localhost:~# grep -s "" /sys/block/vda/queue/*
/sys/block/vda/queue/add_random:0
/sys/block/vda/queue/chunk_sectors:0
/sys/block/vda/queue/dax:0
/sys/block/vda/queue/discard_granularity:0
/sys/block/vda/queue/discard_max_bytes:0
/sys/block/vda/queue/discard_max_hw_bytes:0
/sys/block/vda/queue/discard_zeroes_data:0
/sys/block/vda/queue/fua:0
/sys/block/vda/queue/hw_sector_size:512
/sys/block/vda/queue/io_poll:1
/sys/block/vda/queue/io_poll_delay:-1
/sys/block/vda/queue/iostats:1
/sys/block/vda/queue/logical_block_size:512
/sys/block/vda/queue/max_discard_segments:1
/sys/block/vda/queue/max_hw_sectors_kb:2147483647
/sys/block/vda/queue/max_integrity_segments:0
/sys/block/vda/queue/max_sectors_kb:1280
/sys/block/vda/queue/max_segment_size:4294967295
/sys/block/vda/queue/max_segments:126
/sys/block/vda/queue/minimum_io_size:512
/sys/block/vda/queue/nomerges:0
/sys/block/vda/queue/nr_requests:256
/sys/block/vda/queue/optimal_io_size:0
/sys/block/vda/queue/physical_block_size:512
/sys/block/vda/queue/read_ahead_kb:128
/sys/block/vda/queue/rotational:1
/sys/block/vda/queue/rq_affinity:1
/sys/block/vda/queue/scheduler:[mq-deadline] kyber none
/sys/block/vda/queue/write_cache:write back
/sys/block/vda/queue/write_same_max_bytes:0
/sys/block/vda/queue/write_zeroes_max_bytes:0
/sys/block/vda/queue/zoned:none

localhost:~# setup-disk
...
Creating file systems...
ext2fs_write_inode_full: I/O error while writing reserved inodes
mount: mounting /dev/vda3 on /mnt failed: Invalid argument


In the trace we see that qemu fails to detect the block size of the
drive:

15454 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
15454 fstat(9, {st_mode=S_IFREG|0660, st_size=1073741824, ...}) = 0
...
15454 ioctl(9, BLKSSZGET, 0x7fff3ad7f984) = -1 ENOSYS (Function not implemented)
...
15454 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512


When reading from a sparse file on gluster storage backed by xfs (and likewise
when reading from a local xfs), a read from a file opened with direct I/O
succeeds with any block size:

$ truncate -s 4096 file

$ dd if=file bs=1 count=1 iflag=direct of=/dev/null
1+0 records in
1+0 records out 
1 byte copied, 0.000842153 s, 1.2 kB/s

And qemu detects the block size by reading from the file with various block sizes:
https://github.com/qemu/qemu/blob/02ac2f7f613b47f6a5b397b20ab0e6b2e7fb89fa/block/file-posix.c#L352
https://github.com/qemu/qemu/blob/02ac2f7f613b47f6a5b397b20ab0e6b2e7fb89fa/block/file-posix.c#L300
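
The probing approach can be illustrated with a small standalone program - a
simplified sketch of the logic in the code linked above, not the actual qemu
code. On a fully sparse file every read succeeds, so the probe reports 512
even on 4k storage:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t align;
    void *buf;
    int fd;

    if (argc < 2)
        return 1;

    fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    /* Try O_DIRECT reads with growing sizes; take the first that works. */
    for (align = 512; align <= 4096; align <<= 1) {
        if (pread(fd, buf, align, 0) >= 0) {
            printf("request_alignment: %zu\n", align);
            break;
        }
    }

    free(buf);
    close(fd);
    return 0;
}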


When reading from an allocated file, the read succeeds only when the block
size matches the storage alignment requirements:

$ echo -n "x" > file
$ truncate -s 4096 file                                                                                                                                                    
$ for n in 1 512 4096; do dd if=file bs=$n count=1 iflag=direct of=/dev/null; done

dd: error reading 'file': Invalid argument
0+0 records in
0+0 records out 
0 bytes copied, 0.00226002 s, 0.0 kB/s
dd: error reading 'file': Invalid argument
0+0 records in
0+0 records out 
0 bytes copied, 0.00162306 s, 0.0 kB/s
1+0 records in
1+0 records out 
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000811755 s, 5.0 MB/s


I checked whether allocating some data in the drive fixes the issue:

dd if=/dev/zero bs=4096 count=1 of=$images/$disk
truncate -s 1g $images/$disk

strace -f -o $name qemu-kvm \
    -drive file=$images/$disk,format=raw,cache=none,if=virtio \
    -cdrom $images/$cdrom \
    -m 1024 \
    -nographic
...
localhost:~# grep -s "" /sys/block/vda/queue/*
/sys/block/vda/queue/add_random:0
/sys/block/vda/queue/chunk_sectors:0
/sys/block/vda/queue/dax:0
/sys/block/vda/queue/discard_granularity:0
/sys/block/vda/queue/discard_max_bytes:0
/sys/block/vda/queue/discard_max_hw_bytes:0
/sys/block/vda/queue/discard_zeroes_data:0
/sys/block/vda/queue/fua:0
/sys/block/vda/queue/hw_sector_size:512
/sys/block/vda/queue/io_poll:1
/sys/block/vda/queue/io_poll_delay:-1
/sys/block/vda/queue/iostats:1
/sys/block/vda/queue/logical_block_size:512
/sys/block/vda/queue/max_discard_segments:1
/sys/block/vda/queue/max_hw_sectors_kb:2147483647
/sys/block/vda/queue/max_integrity_segments:0
/sys/block/vda/queue/max_sectors_kb:1280
/sys/block/vda/queue/max_segment_size:4294967295
/sys/block/vda/queue/max_segments:126
/sys/block/vda/queue/minimum_io_size:512
/sys/block/vda/queue/nomerges:0
/sys/block/vda/queue/nr_requests:256
/sys/block/vda/queue/optimal_io_size:0
/sys/block/vda/queue/physical_block_size:512
/sys/block/vda/queue/read_ahead_kb:128
/sys/block/vda/queue/rotational:1
/sys/block/vda/queue/rq_affinity:1
/sys/block/vda/queue/scheduler:[mq-deadline] kyber none
/sys/block/vda/queue/write_cache:write back
/sys/block/vda/queue/write_same_max_bytes:0
/sys/block/vda/queue/write_zeroes_max_bytes:0
/sys/block/vda/queue/zoned:none


Looking at the trace, qemu did detect the block size:

17620 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
...
17620 ioctl(9, BLKSSZGET, 0x7ffccc686da4) = -1 ENOSYS (Function not implemented)
...
17620 pread64(9, 0x560c60274400, 512, 0) = -1 EINVAL (Invalid argument)
17620 pread64(9, 0x560c60274400, 1024, 0) = -1 EINVAL (Invalid argument)
17620 pread64(9, 0x560c60274400, 2048, 0) = -1 EINVAL (Invalid argument)
17620 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096

But the guest still sees the wrong block size.


I tested the same flow using a gluster:// URL with the same results.

The attachment contains test scripts, traces, and output of commands
in the guests.


So we have 2 issues:
- The block size is not detected correctly when provisioning a sparse image.
- Even when the block size is detected, the guest still sees the wrong block size.


I tried to work around this issue by using

    <blockio logical_block_size="4096" physical_block_size="4096"/>

in the libvirt XML. With this, installation succeeds, but the disk is not
bootable. This will be tracked in another bug.
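
For reference, the element goes inside the disk definition; roughly like this
(sketch - the driver and source lines are illustrative, not from the tested
setup):

    <disk type="file" device="disk">
      <driver name="qemu" type="raw" cache="none"/>
      <source file="/path/to/disk.raw"/>
      <blockio logical_block_size="4096" physical_block_size="4096"/>
      <target dev="vda" bus="virtio"/>
    </disk>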

Comment 1 Nir Soffer 2019-08-04 18:02:52 UTC
Niels, do we have a way to detect the logical block size when using libgfapi?

Comment 2 Niels de Vos 2019-08-05 17:45:41 UTC
In QEMU, block/gluster.c does not implement BlockDriver->bdrv_probe_blocksizes(). I think it should be possible by using results from fstat()/glfs_fstat() or statvfs()/glfs_statvfs().

Those functions would return the same values as the matching syscalls over FUSE and gluster:// URLs. It is not clear to me why that would fail... Can you get the block-size with `stat` on both a FUSE mount point and on the XFS filesystem that functions as the brick?
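
For example (using the paths from this setup; in --file-system mode %s and %S
print the statfs block sizes):

    # on the hypervisor, against the FUSE mount
    stat --file-system --format="bsize=%s frsize=%S" /rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0
    # on the gluster node, against the brick
    stat --file-system --format="bsize=%s frsize=%S" /export/vdo0/brick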

Comment 3 Nir Soffer 2019-08-05 23:37:35 UTC
I tested again the case where the first block of the device is allocated - in this
case qemu detects the block size correctly. Previously I aborted this test when I
found that the guest still sees a block size of 512 bytes.

I ran this script:

name=/tmp/qemu-fuse-allocated.trace
images=/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com\:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images
disk=4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39
cdrom=3119fe7e-f576-46c3-880b-9590cb4619da/46efd27f-3bbe-4183-91f7-912c163b8eac

dd if=/dev/zero bs=4096 count=1 of=$images/$disk conv=fsync
truncate -s 1g $images/$disk

strace -f -tt -o $name qemu-kvm \
        -drive file=$images/$disk,format=raw,cache=none,if=virtio \
        -cdrom $images/$cdrom \
        -netdev tap,id=net1,ifname=tap0,script=no,downscript=no \
        -device virtio-net-pci,netdev=net1,mac=56:6f:fe:64:00:01 \
        -m 1024 \
        -nographic

This is the interesting output from the guest:

localhost:~# setup-alpine
...
Available disks are:
  vda   (1.1 GB 0x1af4 )
Which disk(s) would you like to use? (or '?' for help or 'none') [none] vda
The following disk is selected:
  vda   (1.1 GB 0x1af4 )
How would you like to use it? ('sys', 'data', 'lvm' or '?' for help) [?] sys
WARNING: The following disk(s) will be erased:
  vda   (1.1 GB 0x1af4 )
WARNING: Erase the above disk(s) and continue? [y/N]: y
Creating file systems...
Installing system on /dev/vda3:
/mnt/boot is device /dev/vda1
100% ████████████████████████████████████████████==> initramfs: creating /boot/initramfs-virt
gzip: write error: Read-only file system
cpio: write error: Broken pipe
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/cdrom': Read-only file system
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/floppy': Read-only file system
rm: can't remove '/tmp/mkinitfs.bPjPpm/media/usb': Read-only file system
...

Looking at the guest dmesg:

localhost:~# dmesg
...
[  231.776768]  vda: vda1 vda2 vda3
[  232.229337] print_req_error: I/O error, dev vda, sector 100354
[  232.229460] print_req_error: I/O error, dev vda, sector 101362
[  232.229603] print_req_error: I/O error, dev vda, sector 102370
[  232.229697] print_req_error: I/O error, dev vda, sector 102402
[  232.229791] print_req_error: I/O error, dev vda, sector 103410
[  232.229883] print_req_error: I/O error, dev vda, sector 104418
[  232.229976] print_req_error: I/O error, dev vda, sector 104450
[  232.230086] print_req_error: I/O error, dev vda, sector 105458
[  232.230180] print_req_error: I/O error, dev vda, sector 106466
[  232.230272] print_req_error: I/O error, dev vda, sector 106498
[  233.832220] Adding 262140k swap on /dev/vda2.  Priority:-2 extents:1 across:262140k 
[  235.754207] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
[  235.988453] EXT4-fs (vda1): mounted filesystem with ordered data mode. Opts: (null)
[  238.602871] print_req_error: 2 callbacks suppressed
[  238.602909] print_req_error: I/O error, dev vda, sector 0
[  257.568921] print_req_error: I/O error, dev vda, sector 43684
[  257.568929] EXT4-fs warning (device vda1): ext4_end_bio:323: I/O error 10 writing to inode 15 (offset 0 size 2097152 starting block 22866)


Looking at the trace:

Thread 18708 opened the drive for read, and detected the block size:

18708 00:45:42.117603 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDONLY|O_DIRECT|O_CLOEXEC) = 9
18708 00:45:42.117674 fstat(9, {st_mode=S_IFREG|0660, st_size=1073741824, ...}) = 0
18708 00:45:42.117711 lseek(9, 0, SEEK_END) = 1073741824
18708 00:45:42.117745 ioctl(9, BLKSSZGET, 0x7ffc897c6084) = -1 ENOSYS (Function not implemented)
18708 00:45:42.117798 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.119038 pread64(9, 0x559ae5039a00, 512, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.120328 pread64(9, 0x559ae5039a00, 1024, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.121431 pread64(9, 0x559ae5039a00, 2048, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.122442 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.123028 ioctl(9, BLKSSZGET, 0x7ffc897c6184) = -1 ENOSYS (Function not implemented)
18708 00:45:42.123112 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096
18708 00:45:42.123534 pread64(9,  <unfinished ...>
18711 00:45:42.124234 <... nanosleep resumed>0x7fb6af9a26b0) = 0
18711 00:45:42.124268 futex(0x559ae4d9aa68, FUTEX_WAIT, 4294967295, NULL <unfinished ...>
18708 00:45:42.124553 <... pread64 resumed>0x559ae503ac00, 512, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.124579 pread64(9, 0x559ae503ac00, 1024, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.125551 pread64(9, 0x559ae503ac00, 2048, 0) = -1 EINVAL (Invalid argument)
18708 00:45:42.126581 pread64(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 0) = 4096


Then it opens the drive again for read-write - but no attempt was made to detect
the block size:

18708 00:45:42.185682 openat(AT_FDCWD, "/rhev/data-center/mnt/glusterSD/voodoo4.tlv.redhat.com:_gv0/de566475-5b67-4987-abf3-3dc98083b44c/images/4163cc03-c5ef-4956-ac0a-6e04699ce2e5/51de3a28-5c0a-4db1-a51c-b09c437b6f39", O_RDWR|O_DIRECT|O_CLOEXEC) = 15

The first failure on this file descriptor is:

18924 00:49:34.531792 pwritev(15, [
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=1024},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096},
 ...], 1026, 51380224) = -1 EINVAL (Invalid argument)


So it looks like qemu fails to detect that the vector is not aligned. Based on
bdrv_qiov_is_aligned(), bdrv_min_mem_align(bs) returns the wrong value
at this point.
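
For reference, that check walks the vector and tests each element against the
memory alignment, roughly like this (paraphrased from qemu's block/io.c,
simplified):

bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
{
    int i;
    size_t alignment = bdrv_min_mem_align(bs);

    for (i = 0; i < qiov->niov; i++) {
        /* Both the buffer address and the length of every element
           must be multiples of the minimum memory alignment. */
        if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
            return false;
        }
        if (qiov->iov[i].iov_len % alignment) {
            return false;
        }
    }
    return true;
}

With min_mem_alignment left at 512, the iov_len=1024 element in the pwritev
above passes this check, so qemu does not bounce-buffer the request and the
4k storage rejects it with EINVAL.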

Comment 4 Nir Soffer 2019-08-05 23:42:48 UTC
Created attachment 1600805 [details]
test script, output and trace with allocated image

This contains the full output and trace mentioned in comment 3.

Comment 5 Niels de Vos 2019-08-06 06:42:24 UTC
Note that GlusterFS (over FUSE and libgfapi) does not support ioctl() calls. The BLKSSZGET that gets requested comes from block/file-posix.c when running over FUSE. Maybe block/file-posix.c:probe_logical_blocksize() could have a fallback to get the sector/blocksize from fstatvfs().
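
Such a fallback could look roughly like this (untested sketch; note that
f_bsize is the filesystem block size, which does not necessarily match the
device sector size):

#include <sys/statvfs.h>

static unsigned long probe_blocksize_fallback(int fd)
{
    struct statvfs st;

    /* Fall back to fstatvfs() when the BLKSSZGET ioctl is unavailable. */
    if (fstatvfs(fd, &st) == 0) {
        return st.f_bsize;
    }
    return 0;
}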

Please also see comment #2 so that we can verify that Gluster supports getting the non-512 size over the network.

Comment 6 Kevin Wolf 2019-08-06 09:22:58 UTC
(In reply to Niels de Vos from comment #2)
> In QEMU, block/gluster.c does not implement BlockDriver->bdrv_probe_blocksizes().

This is the wrong one. It should set bs->bl.request_alignment somewhere during bdrv_open().

(In reply to Nir Soffer from comment #3)
> Then it opens the drive again for read-write - but no attempt was made to
> detect the block size:

This looked like an interesting observation, but after checking the code and trying it out with a simple example, it turns out the alignment isn't reset for the new file descriptor; we just keep using the old value. This should be fine because changing from O_RDONLY to O_RDWR shouldn't make any difference for the required alignment.

> So it looks qemu fail to detect that the vector is not aligned. Based on
> bdrv_qiov_is_aligned(), bdrv_min_mem_align(bs) returns the wrong value
> at this point.

I'm not sure what the required memory alignment actually is in the gluster case. But bdrv_co_preadv(), which uses bs->bl.request_alignment, should already have made the request aligned long before bdrv_qiov_is_aligned(), so somehow at least request_alignment - and probably both values - got lost at some point.

I'll try to find a way to get a block device that requires 4k alignment on my host, so I can hopefully reproduce the problem.

(In reply to Niels de Vos from comment #5)
> Note that GlusterFS (over FUSE and libgfapi) does not support ioctl() calls.
> The BLKSSZGET that gets requested comes from block/file-posix.c when running
> over FUSE. Maybe block/file-posix.c:probe_logical_blocksize() could have a
> fallback to get the sector/blocksize from fstatvfs().

I can look into using fstatvfs() as an additional source of information.

However, I really think gluster should behave consistently with alignment requirements and enforce the same alignment whether or not the accessed data is allocated. If the ioctl() based detection fails, QEMU simply tries different request sizes and picks the smallest one that succeeds. So if gluster made sure that the 4k alignment is consistently required with O_DIRECT instead of just in some cases, QEMU would already pick up the right alignment.

Comment 7 Nir Soffer 2019-08-06 10:11:00 UTC
(In reply to Niels de Vos from comment #2)
> Can you
> get the block-size with `stat` on both a FUSE mount point and on the XFS
> filesystem that functions as the brick?

statvfs does not help, since it reports the file system block size, and not the underlying
block device block size:

This is a result from gluster fuse mount:

>>> os.statvfs("/tmp/gv0/fedora29.raw")
posix.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=26201600, f_bfree=25280117, f_bavail=25280117, f_files=52428800, f_ffree=52428010, f_favail=52428010, f_flag=4096, f_namemax=255)

This is a result inside the gluster node:

>>> os.statvfs("/export/vdo0/brick")
os.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=26201600, f_bfree=25308562, f_bavail=25308562, f_files=52428800, f_ffree=52427979, f_favail=52427979, f_flag=4096, f_namemax=255)

This is a result from my local file system (512 sector size):

>>> os.statvfs(".")
posix.statvfs_result(f_bsize=4096, f_frsize=4096, f_blocks=46116700, f_bfree=28182751, f_bavail=25822739, f_files=11780096, f_ffree=11277430, f_favail=11277430, f_flag=4096, f_namemax=255)

I think gluster should either extend statvfs to return additional information, or
use another API to report storage alignment requirements. It would be best if
this works with both fuse and libgfapi.

Comment 8 Nir Soffer 2019-08-06 10:48:36 UTC
I sent this patch, fixing the issue for gluster fuse mounts:

diff --git a/block/file-posix.c b/block/file-posix.c
index 4479cc7ab4..d29b9e5229 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1122,7 +1122,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
     }
 
     raw_probe_alignment(bs, s->fd, errp);
-    bs->bl.min_mem_alignment = s->buf_align;
+    bs->bl.min_mem_alignment = MAX(s->buf_align, bs->bl.request_alignment);
     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize());
 }


This will not help gluster:// drives; for those we need a way to probe the block size,
or maybe try the same guessing as used in file-posix.c?

Comment 9 Vojtech Juranek 2019-08-06 11:44:02 UTC
Created attachment 1600968 [details]
qemu trace log on centos

FYI, I reproduced Nir's results with oVirt 4.3 on CentOS. Attaching qemu trace logs, obtained with Nir's scripts attached to this BZ.

# Gluster nodes

$ rpm -qa | grep 'glusterfs\|vdo' | sort
glusterfs-api-6.4-1.el7.x86_64
glusterfs-client-xlators-6.4-1.el7.x86_64
glusterfs-cli-6.4-1.el7.x86_64
glusterfs-fuse-6.4-1.el7.x86_64
glusterfs-libs-6.4-1.el7.x86_64
glusterfs-server-6.4-1.el7.x86_64
glusterfs-6.4-1.el7.x86_64
kmod-kvdo-6.1.1.125-5.el7.x86_64
vdo-6.1.1.125-3.el7.x86_64

# Hypervisor

$ rpm -qa | grep qemu | sort
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
libvirt-daemon-driver-qemu-4.5.0-10.el7_6.12.x86_64
qemu-guest-agent-2.12.0-2.el7.x86_64
qemu-img-ev-2.12.0-18.el7_6.5.1.x86_64
qemu-kvm-common-ev-2.12.0-18.el7_6.5.1.x86_64
qemu-kvm-ev-2.12.0-18.el7_6.5.1.x86_64

Comment 10 Kevin Wolf 2019-08-06 14:30:05 UTC
Niels, I think we need more information on the requirements that gluster makes for O_DIRECT files.

If the patch from comment 8 fixes things for Nir, the problem must be related to alignment somehow, but it's entirely unclear in what way because gluster behaves seemingly unpredictably. (Note that that patch is good for getting more debugging data, but it's nothing to actually merge in QEMU because it would involve unnecessary bounce buffers in cases where gluster isn't even involved.) I checked the kernel code to see which requirements are made there, but in the fuse path I couldn't actually see any on the kernel side, so I suppose these errors come from the userspace implementation.

Both Nir's and Vojtech's traces show that the first failing request has a first iov member that isn't aligned to 4k. However, there are more such requests recorded (including the initial probing done by QEMU itself) that succeeded before. To pick just one example, we have this in Vojtech's trace:

1416  pwritev(13, [{"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2048}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024}], 3, 1339392 <unfinished ...>
1416  <... pwritev resumed> )           = 4096

So 4k alignment doesn't seem to be a requirement, at least in some cases. On the other hand, I saw failing requests that have a perfectly aligned size and buffer address:

1763  pread64(13,  <unfinished ...>
1763  <... pread64 resumed> 0x565377b04000, 4096, 2097152) = -1 EINVAL (Invalid argument)

There are even a few fdatasync() calls that return EINVAL, which must be an internal problem in gluster because QEMU doesn't even have a chance to provide any data other than the file descriptor that could be invalid.

So from the outside, this looks like gluster behaving completely inconsistently, and I can't even manually figure out what the rules really are.

Comment 11 Nir Soffer 2019-08-07 18:37:01 UTC
Bug 1738657 is another issue caused by the same alignment issue, and fixed by
the same patch (with a workaround).

Comment 12 Nir Soffer 2019-08-08 19:06:20 UTC
I reproduced the same issue with a local xfs file system with a 4k sector size.

First I added a log message in qemu to make it easier to debug:

diff --git a/block/file-posix.c b/block/file-posix.c
index 4479cc7ab4..f33b542b33 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -378,6 +378,10 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
         error_setg(errp, "Could not find working O_DIRECT alignment");
         error_append_hint(errp, "Try cache.direct=off\n");
     }
+
+    /* TODO: add trace event */
+    printf("Probed aligment for %s buf_align=%ld request_alignment=%d\n",
+           bs->filename, s->buf_align, bs->bl.request_alignment);
 }


The file system was created like this:

# truncate -s 50g /var/tmp/backing
# losetup -f /var/tmp/backing --show --sector-size=4096
# mkfs -t xfs /dev/loop0
# mkdir /tmp/loop0
# mount /dev/loop0 /tmp/loop0
# chown nsoffer:nsoffer /tmp/loop0

Then I created a new sparse raw file:

$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

And started qemu with this image and a Fedora 29 ISO:

$ strace -f -tt -o /tmp/qemu-xfs-4k.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso 
Probed alignment for /tmp/loop0/fedora29.raw buf_align=512 request_alignment=4096
Probed alignment for /tmp/loop0/fedora29.raw buf_align=512 request_alignment=4096

We can see that buf_align was not detected correctly - this is probably
because the file contains no data, so xfs optimized the request without
doing any I/O or checking the buffer alignment.

In the guest I started a standard installation. It failed to create the root
file system after a few seconds. I could not extract the error message from the
installer; copy and paste from the guest did not work (bad configuration?).

In the trace we can see qemu trying a read request aligned to 512 bytes:

23534 20:44:53.194905 preadv(24,  <unfinished ...>
23163 20:44:53.194911 <... ioctl resumed>, 0) = 0
23534 20:44:53.194923 <... preadv resumed>[{iov_base=0x7ff724015000, iov_len=512}, {iov_base=0x7ff7241ea000, iov_len=3584}], 2, 0) = -1 EINVAL (Invalid argument)


To make the issue with buf_align clearer, I also added this patch:

diff --git a/block/file-posix.c b/block/file-posix.c
index f33b542b33..e93fd13a9a 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -323,6 +323,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     BDRVRawState *s = bs->opaque;
     char *buf;
     size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
+    size_t alignments[] = {1, 512, 1024, 2048, 4096};
 
     /* For SCSI generic devices the alignment is not really used.
        With buffered I/O, we don't have any restrictions. */
@@ -351,9 +352,11 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
+        int i;
         size_t align;
         buf = qemu_memalign(max_align, 2 * max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
             if (raw_is_io_aligned(fd, buf + align, max_align)) {
                 s->buf_align = align;
                 break;
@@ -363,9 +366,11 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     }
 
     if (!bs->bl.request_alignment) {
+        int i;
         size_t align;
         buf = qemu_memalign(s->buf_align, max_align);
-        for (align = 512; align <= max_align; align <<= 1) {
+        for (i = 0; i < ARRAY_SIZE(alignments); i++) {
+            align = alignments[i];
             if (raw_is_io_aligned(fd, buf, align)) {
                 bs->bl.request_alignment = align;
                 break;


With this we see:

xfs on 4k loop device:
    
$ ./qemu-img convert -f raw -O raw -t none -T none /tmp/loop0/src.raw /tmp/loop0/dst.raw
Probed alignment for /tmp/loop0/src.raw buf_align=1 request_alignment=4096
Probed alignment for /tmp/loop0/dst.raw buf_align=1 request_alignment=4096

request_alignment is correct since we use xfsctl().
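
(For reference, the xfs-specific path in raw_probe_alignment() looks roughly
like this - simplified and quoted from memory, so details may differ:)

#ifdef CONFIG_XFS
    if (s->is_xfs) {
        struct dioattr da;
        if (xfsctl(NULL, fd, XFS_IOC_DIOINFO, &da) >= 0) {
            /* d_miniosz is the smallest direct I/O size the
               filesystem accepts, i.e. the logical sector size. */
            bs->bl.request_alignment = da.d_miniosz;
            return;
        }
    }
#endif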

Gluster on xfs on 4k vdo device:
    
$ ./qemu-img convert -f raw -O raw -t none -T none /tmp/gv0/src.raw /tmp/gv0/dst.raw
Probed alignment for /tmp/gv0/src.raw buf_align=1 request_alignment=1
Probed alignment for /tmp/gv0/dst.raw buf_align=1 request_alignment=1

request_alignment is incorrect since reading from an unallocated area does not
do any I/O on xfs.

And when we try to provision a VM we get the same error:

$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

$ strace -f -tt -o /tmp/qemu-xfs-4k-align-1.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso 
Probed alignment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096
Probed alignment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096

I started the installation in the guest here; it got stuck after a few seconds.

In the trace we see a similar failure:

28947 21:29:05.771029 preadv(24, [{iov_base=0x7fb26e2bba00, iov_len=512}, {iov_base=0x7fb2d4559000, iov_len=3584}], 2, 1721761792) = -1 EINVAL (Invalid argument)


This shows that the issue is not with gluster, but with qemu. Checking the
alignment requirement using buf_align is not effective. It used to work with
storage with a sector size of 512 bytes just because 512 was the first value
checked.


I tested again with the patch using max(buf_align, request_alignment) to set
min_mem_alignment:

diff --git a/block/file-posix.c b/block/file-posix.c
index e93fd13a9a..08dd330fd6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1131,7 +1131,7 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
     }
 
     raw_probe_alignment(bs, s->fd, errp);
-    bs->bl.min_mem_alignment = s->buf_align;
+    bs->bl.min_mem_alignment = MAX(s->buf_align, bs->bl.request_alignment);
     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize());
 }


$ qemu-img create -f raw /tmp/loop0/fedora29.raw 6g
Formatting '/tmp/loop0/fedora29.raw', fmt=raw size=6442450944

$ strace -f -tt -o /tmp/qemu-xfs-4k-align-1-patched.trace x86_64-softmmu/qemu-system-x86_64 -accel kvm -m 2048 -smp 2 -drive file=/tmp/loop0/fedora29.raw,format=raw,cache=none -cdrom /home/nsoffer/var/libvirt/iso/Fedora-Server-dvd-x86_64-29-1.2.iso 
Probed alignment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096
Probed alignment for /tmp/loop0/fedora29.raw buf_align=1 request_alignment=4096

This time the VM was provisioned successfully.

Comment 13 Nir Soffer 2019-08-08 19:48:32 UTC
Created attachment 1601918 [details]
Trace for failed installation on local xfs on 4k loop device

Comment 14 Nir Soffer 2019-08-08 19:50:44 UTC
Created attachment 1601919 [details]
Trace for failed installation on local xfs on 4k loop device with buf_align=1

Comment 15 Ben Cotton 2019-08-13 16:58:09 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 16 Ben Cotton 2019-08-13 17:19:40 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 17 Nir Soffer 2019-08-13 18:33:55 UTC
Patch posted for review:
https://lists.nongnu.org/archive/html/qemu-block/2019-08/msg00568.html

Comment 19 Nir Soffer 2019-09-02 11:38:41 UTC
Cole, do you know which qemu version will provide this fix in virt-preview repo?

Same for bug 1738657.

Comment 20 Cole Robinson 2019-09-04 14:14:42 UTC
(In reply to Nir Soffer from comment #19)
> Cole, do you know which qemu version will provide this fix in virt-preview
> repo?
> 
> Same for bug 1738657.

I will backport the patches to rawhide today and kick off virt-preview builds to match

Comment 21 Fedora Update System 2019-09-04 20:14:36 UTC
FEDORA-2019-6a014367d5 has been submitted as an update to Fedora 31. https://bodhi.fedoraproject.org/updates/FEDORA-2019-6a014367d5

Comment 22 Fedora Update System 2019-09-04 21:35:38 UTC
qemu-4.1.0-2.fc31 has been pushed to the Fedora 31 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2019-6a014367d5

Comment 23 Fedora Update System 2019-09-28 00:01:43 UTC
qemu-4.1.0-2.fc31 has been pushed to the Fedora 31 stable repository. If problems still persist, please make note of it in this bug report.

