Bug 1223332
Summary: | ext4 corruption on kernel 4.0.2 | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Kamil Páral <kparal> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 22 | CC: | bruno, bugzilla, devin, fedora-kernel-extfs, gansalmon, itamar, jonathan, kernel-maint, lczerner, madhu.chinakonda, mattdm, mcatanzaro+wrong-account-do-not-cc, mchehab, me, mike, mschmidt, nonamedotc, pbrobinson, pschindl, readdytop, robatino, sezeroz, tomastrnka, work.eric |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | AcceptedBlocker | ||
Fixed In Version: | kernel-4.0.4-201.fc21 | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2015-05-22 19:53:29 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1043130 |
Description
Kamil Páral
2015-05-20 10:59:01 UTC
The article is terrible, and actually has conflicting information in it. Looking at the Debian and Arch bug reports, we see this thread highlighted upstream:

http://www.gossamer-threads.com/lists/linux/kernel/2175646?do=post_view_threaded#2175646

which winds up saying that there is a commit that fixes data corruption, but probably isn't related to the corruption that thread was talking about. The commit referenced is:

d2dc317d564a4 "ext4: fix data corruption caused by unwritten and delayed extents"

which went into the 4.0.3 stable release as commit ce879f96b5. F22 (and F21) are already at 4.0.4 in updates-testing, so we already have the only plausible patch built in an update.

Adding the ext4 guys on CC in case they can make something of all this muck.

I found another link that might explain the other bug:
http://www.gossamer-threads.com/lists/linux/kernel/2175156

I'm not running dm-crypt. 2x256gb Sandisk SD6SP1M256G1102 (M.2 SATA drives), GPT, /boot md-RAID1, /boot/efi md-RAID1, / md-RAID0, swap md-RAID0. No LVM. Discard enabled, NCQ enabled.

I didn't stumble on this one, here is RAID1/EXT4.

However you can try this:
ext4: fix data corruption caused by unwritten and delayed extents
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317

You know that you can test with libvirt, right?

(In reply to poma from comment #4)
> I didn't stumble on this one, here is RAID1/EXT4.
>
> However you can try this:
> ext4: fix data corruption caused by unwritten and delayed extents
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317

Hah.

ext4: fix data corruption caused by unwritten and delayed extents
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=ce879f96b5

This is already in the stable 4.0.4, so the only difference, at least here, is:
https://bugzilla.redhat.com/show_bug.cgi?id=1220519

This does not mean this is the actual solution, but I didn't stumble on that Pokémon.

(In reply to Michael Cronenworth from comment #3)
> I'm not running dm-crypt. 2x256gb Sandisk SD6SP1M256G1102 (M.2 SATA drives), GPT, /boot md-RAID1, /boot/efi md-RAID1, / md-RAID0, swap md-RAID0. No LVM.
> Discard enabled, NCQ enabled.

Michael, you may actually have this problem instead:
https://bugzilla.kernel.org/show_bug.cgi?id=98501

From what I've read in several places, including this:

"Nature of ext4 corruption fixed by recent patch?"
http://www.gossamer-threads.com/lists/linux/kernel/2175646?search_string=Nature%20of%20ext4%20corruption%20fixed%20by%20recent%20patch%3F;#2175646

all roads lead to Rome:

"md raid0 w/ fstrim causing data loss"
https://bugzilla.kernel.org/show_bug.cgi?id=98501
Eric Work 2015-05-17 19:41:19 UTC
Hardware: 2 x Crucial_CT256MX100SSD1 (MU02)
Software: md raid0 w/ ext4
Kernel: 3.19.7-200.fc21.x86_64

md/raid0: fix restore to sector variable in raid0_make_request
https://bugzilla.kernel.org/attachment.cgi?id=177291
0001-md-raid0-fix-restore-to-sector-variable-in-raid0_mak.patch

It should occur here:
https://patchwork.kernel.org/project/dm-devel/list

Lukas, why is the XFS tool used to test EXT4?

"ext4: fix data corruption caused by unwritten and delayed extents"
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317
...
This problem can be easily reproduced by running the following xfs_io.

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
...
man 8 xfs_io

xfs_io(8)    System Manager's Manual    xfs_io(8)

NAME
xfs_io - debug the I/O path of an XFS filesystem
...

Because xfs_io is just a convenient tool for testing things like that. It only issues the specified file system requests, but it's written by XFS developers as a part of their xfs_progs. Additionally it might support some xfs-specific commands, however it's considered to be a generic tool.

In the reproducer you mentioned there is nothing xfs-specific.

I hope that makes sense.

Now for the bug. This particular problem, fixed with commit d2dc317d564a46dfc683978a2e5a4f91434e9711, has been around for quite some time. It requires a rather specific course of action to trigger, and it's unlikely to be hit easily by applications. It's definitely _not_ new in 4.0.2.

The information on softpedia does not seem to be pointing at this particular bug since most of the reports are showing corrupted file system which is something you would not see with this problem since it's plain data corruption. It's likely to be completely different.

-Lukas

The problem mentioned by Kamil is related to RAID configuration and it's probably fixed in https://bugzilla.kernel.org/show_bug.cgi?id=98501 ?

-Lukas

(In reply to Lukáš Czerner from comment #9)
> Because xfs_io is just a convenient tool for testing things like that. It only issues the specified file system requests, but it's written by XFS developers as a part of their xfs_progs. Additionally it might support some xfs-specific commands, however it's considered to be a generic tool.
>
> In the reproducer you mentioned there is nothing xfs-specific.
>
> I hope that makes sense.

Super cool.

> Now for the bug. This particular problem, fixed with commit d2dc317d564a46dfc683978a2e5a4f91434e9711, has been around for quite some time. It requires a rather specific course of action to trigger, and it's unlikely to be hit easily by applications. It's definitely _not_ new in 4.0.2.
> The information on softpedia does not seem to be pointing at this particular bug since most of the reports are showing corrupted file system which is something you would not see with this problem since it's plain data corruption. It's likely to be completely different.
>
> -Lukas

Thanks.

(In reply to Lukáš Czerner from comment #10)
> The problem mentioned by Kamil is related to RAID configuration and it's probably fixed in https://bugzilla.kernel.org/show_bug.cgi?id=98501 ?
>
> -Lukas

Yep. Neil has this queued up here:

http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f

The workaround is to disable fstrim/discard on RAID setups. Large IO on non-4k aligned RAID setups could still hit this, but that is a bit more rare.

I've added the patch to rawhide-F21. It will be in the next build of each.

(In reply to Eric Work from comment #6)
> Michael, you may actually have this problem instead:
> https://bugzilla.kernel.org/show_bug.cgi?id=98501

This was applied in 4.0.2 and 3.19.7, so yes, it is most likely the cause. Thanks.

*** Bug 1223760 has been marked as a duplicate of this bug. ***

kernel-4.0.4-301.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/kernel-4.0.4-301.fc22

Discussed at today's go/no-go meeting [1]. This bug was accepted as Final Blocker - This bug is a direct violation of the following Final Release Criterion: "All known bugs that can cause corruption of user data must be fixed or documented at Common F22 bugs."

[1] http://meetbot.fedoraproject.org/fedora-meeting-2/2015-05-21/f22_final_gono-go_meeting.2015-05-21-17.00.log.txt

kernel-4.0.4-201.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-4.0.4-201.fc21

Package kernel-4.0.4-301.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-4.0.4-301.fc22'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-8690/kernel-4.0.4-301.fc22
then log in and leave karma (feedback).

Tested with rc3. Works fine with our firmware raid and raid0 (and raid5). It doesn't work with 4.0.2. Thanks for fixing this.

kernel-4.0.4-301.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

The patch seems to be missing in the f20 version of the 4.0.4 kernel.

(In reply to Ozkan Sezer from comment #22)
> The patch seems to be missing in the f20 version of the 4.0.4 kernel.

Apparently it is included in the f20 version. My mistake, please ignore.

kernel-4.0.4-201.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

Updated from kernel-3.19.7-200.fc21.x86_64 to kernel-4.0.4-201.fc21.x86_64 from the stable repo on 2015-05-28 and my system would no longer boot. journalctl shows multiple RPC Pipe File System and NFS exceptions, and also multiple "failed command: WRITE FPDMA QUEUED" errors. Either the changes in 4.0.4-201.fc21 did not fix the corruption problem, or they introduced a new one. Stepping back to 3.19.7-200.fc21 boots fine.
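
For anyone who wants to replay the xfs_io sequence quoted earlier in this thread (from the commit message of d2dc317) without installing xfs_progs, here is a minimal Python sketch. It is not taken from the commit or from any comment above. The names `run_reproducer`, `expected_image` and `file_matches` are hypothetical, `os.posix_fallocate` is assumed to be a close enough stand-in for xfs_io's `falloc`, and the cache drop is opt-in because it needs root. Whether it triggers the ext4 bug on an affected kernel is untested; it only issues the same writes and then checks what reads back.

```python
import os

FILE_SIZE = 131072  # length passed to "falloc 0 131072"
CHUNK = 2048        # length of every pwrite in the reproducer


def run_reproducer(path, drop_caches=False):
    """Issue the same I/O sequence as the quoted xfs_io commands on `path`."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, b"\xaa" * CHUNK, 4096)    # pwrite -S 0xaa 4096 2048
        # Assumption: posix_fallocate approximates xfs_io's "falloc 0 131072".
        os.posix_fallocate(fd, 0, FILE_SIZE)
        os.pwrite(fd, b"\xbb" * CHUNK, 65536)   # pwrite -S 0xbb 65536 2048
        os.fsync(fd)                            # fsync
    finally:
        os.close(fd)

    if drop_caches:
        # Equivalent of: echo 3 > /proc/sys/vm/drop_caches (requires root).
        with open("/proc/sys/vm/drop_caches", "w") as ctl:
            ctl.write("3\n")

    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, b"\xdd" * CHUNK, 67584)   # pwrite -S 0xdd 67584 2048
        os.fsync(fd)
    finally:
        os.close(fd)


def expected_image():
    """Return the byte content the file should have if no data was lost."""
    image = bytearray(FILE_SIZE)
    image[4096:4096 + CHUNK] = b"\xaa" * CHUNK
    image[65536:65536 + CHUNK] = b"\xbb" * CHUNK
    image[67584:67584 + CHUNK] = b"\xdd" * CHUNK
    return bytes(image)


def file_matches(path):
    """Compare the file on disk with the expected content."""
    with open(path, "rb") as handle:
        return handle.read() == expected_image()
```

To try it, call `run_reproducer("/mnt/test/fff", drop_caches=True)` as root on a scratch ext4 mount and then check `file_matches("/mnt/test/fff")`. A `False` result would mean some of the written data did not survive, which is the plain data corruption described above, as opposed to a corrupted file system.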