Bug 1223332 - ext4 corruption on kernel 4.0.2
Summary: ext4 corruption on kernel 4.0.2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Duplicates: 1223760
Depends On:
Blocks: F22FinalBlocker
 
Reported: 2015-05-20 10:59 UTC by Kamil Páral
Modified: 2015-05-29 23:15 UTC
CC List: 24 users

Fixed In Version: kernel-4.0.4-201.fc21
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-22 19:53:29 UTC
Type: Bug
Embargoed:



Links
System        ID     Private  Priority  Status  Summary  Last Updated
Linux Kernel  98501  0        None      None    None     Never

Internal Links: 1382518

Description Kamil Páral 2015-05-20 10:59:01 UTC
Description of problem:
There are reports of a possible ext4 corruption issue on kernel 4.0.2, which Fedora 22 currently ships:

https://lists.fedoraproject.org/pipermail/test/2015-May/126579.html
http://news.softpedia.com/news/Linux-Kernel-Plagued-by-an-EXT4-Data-Corruption-Issue-481699.shtml

Since I haven't found any other bug in bugzilla, I'm filing this one to track this issue.

Version-Release number of selected component (if applicable):
kernel 4.0.2

How reproducible:
unknown at the moment

Comment 1 Josh Boyer 2015-05-20 11:43:04 UTC
The article is terrible, and actually has conflicting information in it.

Looking at the debian and arch bug reports, we see this thread highlighted upstream:

http://www.gossamer-threads.com/lists/linux/kernel/2175646?do=post_view_threaded#2175646

which winds up saying that there is a commit that fixes data corruption, but that it probably isn't related to the corruption that thread was talking about.  The commit referenced is:

d2dc317d564a4 "ext4: fix data corruption caused by unwritten and delayed extents"

which went into the 4.0.3 stable release as commit ce879f96b5.

F22 (and F21) are already at 4.0.4 in updates-testing, so we already have the only plausible patch built in an update.

Adding the ext4 guys on CC in case they can make something of all this muck.

Comment 2 Bruno Wolff III 2015-05-20 18:36:23 UTC
I found another link that might explain the other bug:
http://www.gossamer-threads.com/lists/linux/kernel/2175156

Comment 3 Michael Cronenworth 2015-05-20 18:43:03 UTC
I'm not running dm-crypt. 2x256gb Sandisk SD6SP1M256G1102 (M.2 SATA drives), GPT, /boot md-RAID1, /boot/efi md-RAID1, / md-RAID0, swap md-RAID0. No LVM. Discard enabled, NCQ enabled.

Comment 4 poma 2015-05-20 21:31:00 UTC
I haven't stumbled on this one; my setup here is RAID1/ext4.

However you can try this:
ext4: fix data corruption caused by unwritten and delayed extents
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317

You know that you can test with libvirt, right?

Comment 5 poma 2015-05-20 21:44:21 UTC
(In reply to poma from comment #4)
> I haven't stumbled on this one; my setup here is RAID1/ext4.
> 
> However you can try this:
> ext4: fix data corruption caused by unwritten and delayed extents
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317
> 

Hah.

ext4: fix data corruption caused by unwritten and delayed extents
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=ce879f96b5

This is already in the stable 4.0.4, so the only difference, at least here, is:
https://bugzilla.redhat.com/show_bug.cgi?id=1220519

This does not mean it is the actual solution,
but I haven't stumbled on that Pokémon.

Comment 6 Eric Work 2015-05-20 23:44:44 UTC
(In reply to Michael Cronenworth from comment #3)
> I'm not running dm-crypt. 2x256gb Sandisk SD6SP1M256G1102 (M.2 SATA drives),
> GPT, /boot md-RAID1, /boot/efi md-RAID1, / md-RAID0, swap md-RAID0. No LVM.
> Discard enabled, NCQ enabled.

Michael, you may actually have this problem instead:
https://bugzilla.kernel.org/show_bug.cgi?id=98501

Comment 7 poma 2015-05-21 04:01:24 UTC
From what I've read in several places, including this:

"Nature of ext4 corruption fixed by recent patch?"
http://www.gossamer-threads.com/lists/linux/kernel/2175646?search_string=Nature%20of%20ext4%20corruption%20fixed%20by%20recent%20patch%3F;#2175646

all roads lead to Rome:

"md raid0 w/ fstrim causing data loss"
https://bugzilla.kernel.org/show_bug.cgi?id=98501

 Eric Work 2015-05-17 19:41:19 UTC

Hardware: 2 x Crucial_CT256MX100SSD1 (MU02)
Software: md raid0 w/ ext4
Kernel: 3.19.7-200.fc21.x86_64

md/raid0: fix restore to sector variable in raid0_make_request
https://bugzilla.kernel.org/attachment.cgi?id=177291
0001-md-raid0-fix-restore-to-sector-variable-in-raid0_mak.patch


It should show up here:
https://patchwork.kernel.org/project/dm-devel/list

Comment 8 poma 2015-05-21 04:17:28 UTC
Lukas, why is an XFS tool used to test ext4?

"ext4: fix data corruption caused by unwritten and delayed extents"
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=d2dc317
...
This problem can be easily reproduced by running the following xfs_io.

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
...

man 8 xfs_io

xfs_io(8)                   System Manager's Manual                  xfs_io(8)

NAME
       xfs_io - debug the I/O path of an XFS filesystem
...

Comment 9 Lukáš Czerner 2015-05-21 06:48:18 UTC
Because xfs_io is just a convenient tool for testing things like that. It only issues the specified file system requests, although it is written by the XFS developers as part of xfsprogs. It may also support some XFS-specific commands, but it is considered a generic tool.

In the reproducer you mentioned there is nothing xfs specific.

I hope that makes sense.

Now for the bug. The problem fixed by commit d2dc317d564a46dfc683978a2e5a4f91434e9711 has been around for quite some time; it requires a rather specific sequence of operations to trigger and is unlikely to be hit by applications. It's definitely _not_ new in 4.0.2. The information on Softpedia does not seem to point at this particular bug: most of the reports show a corrupted file system, which you would not see with this problem, since it is plain data corruption. The reported issue is likely something completely different.

-Lukas
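
For anyone running the xfs_io reproducer from the commit message, a small helper can confirm that each written region reads back intact. This is only a sketch: region_is is a hypothetical name, it relies on the standard dd, tr, and wc tools, and the fill bytes are passed in octal (0xaa = 252, 0xbb = 273, 0xdd = 335).

```shell
# region_is FILE OFFSET LENGTH OCTAL_BYTE
# Succeeds only if LENGTH bytes starting at OFFSET exist and all equal OCTAL_BYTE.
region_is() {
    # Delete every byte with the expected value; nothing should remain.
    [ "$(dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | tr -d "\\$4" | wc -c)" -eq 0 ] &&
    # Guard against a short file, which would otherwise pass the check above.
    [ "$(dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | wc -c)" -eq "$3" ]
}

# Checks matching the three pwrite calls in the reproducer:
# region_is /mnt/test/fff 4096  2048 252 || echo "0xaa region differs"
# region_is /mnt/test/fff 65536 2048 273 || echo "0xbb region differs"
# region_is /mnt/test/fff 67584 2048 335 || echo "0xdd region differs"
```

A failing check means that region did not read back as written, which is the kind of plain data corruption described above rather than file system corruption.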

Comment 10 Lukáš Czerner 2015-05-21 06:53:06 UTC
The problem mentioned by Kamil is related to RAID configuration and it's probably fixed in https://bugzilla.kernel.org/show_bug.cgi?id=98501 ?

-Lukas

Comment 11 poma 2015-05-21 10:29:42 UTC
(In reply to Lukáš Czerner from comment #9)
> Because xfs_io is just a convenient tool for testing things like that. It
> only issues the specified file system requests, although it is written by
> the XFS developers as part of xfsprogs. It may also support some
> XFS-specific commands, but it is considered a generic tool.
> 
> In the reproducer you mentioned there is nothing xfs specific.
> 
> I hope that makes sense.
> 

Super cool.

> Now for the bug. The problem fixed by commit
> d2dc317d564a46dfc683978a2e5a4f91434e9711 has been around for quite some
> time; it requires a rather specific sequence of operations to trigger and is
> unlikely to be hit by applications. It's definitely _not_ new in 4.0.2. The
> information on Softpedia does not seem to point at this particular bug: most
> of the reports show a corrupted file system, which you would not see with
> this problem, since it is plain data corruption. The reported issue is
> likely something completely different.
> 
> -Lukas

Thanks.

Comment 12 Josh Boyer 2015-05-21 12:17:38 UTC
(In reply to Lukáš Czerner from comment #10)
> The problem mentioned by Kamil is related to RAID configuration and it's
> probably fixed in https://bugzilla.kernel.org/show_bug.cgi?id=98501 ?
> 
> -Lukas

Yep.  Neil has this queued up here:

http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f

The workaround is to disable fstrim/discard on RAID setups.  Large I/O on RAID setups that are not 4k-aligned could still hit this, but that is rarer.
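
To see which mounts the workaround applies to, one option is to list filesystems that sit directly on md devices and are mounted with online discard. A minimal sketch, assuming the /proc/mounts field layout and device names that start with /dev/md; md_discard_mounts is a hypothetical helper, and it will miss stacks such as LVM on top of md as well as periodic fstrim jobs.

```shell
# md_discard_mounts [MOUNTS_FILE]
# Print the mount point of every md-backed filesystem whose options include "discard".
# Field layout assumed: device, mount point, fs type, options, dump, pass.
md_discard_mounts() {
    # Wrap the option list in commas so "discard" matches only as a whole option
    # (and "nodiscard" does not match).
    awk '$1 ~ /^\/dev\/md/ && ("," $4 ",") ~ /,discard,/ { print $2 }' "${1:-/proc/mounts}"
}

# Example:
# md_discard_mounts
```

Each reported mount point could then be remounted without the option (for ext4, something like mount -o remount,nodiscard /) until a fixed kernel is installed.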

Comment 13 Josh Boyer 2015-05-21 12:42:27 UTC
I've added the patch to rawhide through F21.  It will be in the next build of each.

Comment 14 Michael Cronenworth 2015-05-21 13:28:46 UTC
(In reply to Eric Work from comment #6)
> Michael, you may actually have this problem instead:
> https://bugzilla.kernel.org/show_bug.cgi?id=98501

This was applied in 4.0.2 and 3.19.7, so yes, it is most likely the cause. Thanks.

Comment 15 Josh Boyer 2015-05-21 15:18:01 UTC
*** Bug 1223760 has been marked as a duplicate of this bug. ***

Comment 16 Fedora Update System 2015-05-21 17:38:49 UTC
kernel-4.0.4-301.fc22 has been submitted as an update for Fedora 22.
https://admin.fedoraproject.org/updates/kernel-4.0.4-301.fc22

Comment 17 Kamil Páral 2015-05-21 17:58:16 UTC
Discussed at today's go/no-go meeting [1].

This bug was accepted as Final Blocker - This bug is a direct violation of the following Final Release Criterion: "All known bugs that can cause corruption of user data must be fixed or documented at Common F22 bugs."

[1] http://meetbot.fedoraproject.org/fedora-meeting-2/2015-05-21/f22_final_gono-go_meeting.2015-05-21-17.00.log.txt

Comment 18 Fedora Update System 2015-05-21 19:58:45 UTC
kernel-4.0.4-201.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-4.0.4-201.fc21

Comment 19 Fedora Update System 2015-05-22 02:31:32 UTC
Package kernel-4.0.4-301.fc22:
* should fix your issue,
* was pushed to the Fedora 22 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-4.0.4-301.fc22'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-8690/kernel-4.0.4-301.fc22
then log in and leave karma (feedback).
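
After rebooting, it may be worth confirming that the running kernel is at least the fixed build. The sketch below assumes GNU sort with -V is available and that uname -r has the Fedora form 4.0.4-301.fc22.x86_64; strip_dist and ver_ge are hypothetical helper names.

```shell
# strip_dist RELEASE
# Drop the Fedora dist tag and architecture, keeping version-release only.
strip_dist() {
    printf '%s\n' "$1" | sed 's/\.fc[0-9].*$//'
}

# ver_ge A B
# Succeed if version A sorts at or after version B under version sort.
ver_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (use 4.0.4-201 as the threshold on Fedora 21):
# ver_ge "$(strip_dist "$(uname -r)")" 4.0.4-301 && echo "running a fixed kernel"
```

Comparing the full version-release, not just the upstream version, matters here because the fix was shipped in a specific Fedora build of 4.0.4.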

Comment 20 Petr Schindler 2015-05-22 12:03:14 UTC
Tested with rc3. Works fine with our firmware raid and raid0 (and raid5). It doesn't work with 4.0.2. Thanks for fixing this.

Comment 21 Fedora Update System 2015-05-22 19:53:29 UTC
kernel-4.0.4-301.fc22 has been pushed to the Fedora 22 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 22 Ozkan Sezer 2015-05-26 08:38:19 UTC
The patch seems to be missing in the f20 version of the 4.0.4 kernel.

Comment 23 Ozkan Sezer 2015-05-26 08:43:29 UTC
(In reply to Ozkan Sezer from comment #22)
> The patch seems to be missing in the f20 version of the 4.0.4 kernel.

Apparently it is included in the f20 version. My mistake, please ignore.

Comment 24 Fedora Update System 2015-05-27 16:05:17 UTC
kernel-4.0.4-201.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 25 readdytop 2015-05-29 23:15:04 UTC
Updated from kernel-3.19.7-200.fc21.x86_64 to kernel-4.0.4-201.fc21.x86_64 from the stable repo on 2015-05-28 and my system would no longer boot. journalctl shows multiple RPC Pipe File System and NFS exceptions, as well as multiple "failed command: WRITE FPDMA QUEUED" errors. Either the changes in 4.0.4-201.fc21 did not fix the corruption problem, or they introduced a new one. Stepping back to 3.19.7-200.fc21 boots fine.

