1211582 – strict aliasing breaks e2fsck on s390

Note: This is a public test instance of Red Hat Bugzilla. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback at bugzilla.redhat.com.

Bug 1211582 - strict aliasing breaks e2fsck on s390

Summary: strict aliasing breaks e2fsck on s390

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	e2fsprogs
Sub Component:
Version:	rawhide
Hardware:	s390
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	ZedoraTracker
TreeView+	depends on / blocked

Reported:	2015-04-14 11:41 UTC by Dan Horák
Modified:	2015-06-08 21:21 UTC (History)
CC List:	10 users (show)
Fixed In Version:	e2fsprogs-1.42.13-2.fc23
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2015-06-08 21:21:38 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
test log file (164.75 KB, text/plain) 2015-04-14 17:27 UTC, Dan Horák	no flags	Details
View All

Description Dan Horák 2015-04-14 11:41:50 UTC

When e2fsprogs are built with hardened build enabled in redhat-rpm-config, r_ext4_small_bg test fails on 32-bit s390 (http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1776795). After downgrading to redhat-rpm-config-27-1.fc23 I was able to build e2fsprogs.


Version-Release number of selected component (if applicable):
e2fsprogs-1.42.12-3.fc23
redhat-rpm-config-28-1.fc23
gcc-5.0.0-0.17.fc22
gcc-5.0.1-0.1.fc22

Comment 1 Eric Sandeen 2015-04-14 13:40:43 UTC

Hm:

./scripts/resize_test: line 100: dumpe2fs: command not found
r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed

but it seems to have installed & built fine ... do you have a task # for the build w/ the downgraded redhat-rpm-config?

Comment 2 Dan Horák 2015-04-14 14:00:30 UTC

http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1776816

the diff between the 2 buildroots is

[dan@eagle ~]$ diff /tmp/e-1 /tmp/e-2
11c11
< cpp-5.0.1-0.1.fc22.s390
---
> cpp-5.0.0-0.17.fc22.s390
32,34c32,34
< gcc-5.0.1-0.1.fc22.s390
< gcc-c++-5.0.1-0.1.fc22.s390
< gcc-gdb-plugin-5.0.1-0.1.fc22.s390
---
> gcc-5.0.0-0.17.fc22.s390
> gcc-c++-5.0.0-0.17.fc22.s390
> gcc-gdb-plugin-5.0.0-0.17.fc22.s390
63,64c63,64
< libblkid-2.26-1.fc22.s390
< libblkid-devel-2.26-1.fc22.s390
---
> libblkid-2.26-0.4.fc22.s390
> libblkid-devel-2.26-0.4.fc22.s390
72c72
< libfdisk-2.26-1.fc22.s390
---
> libfdisk-2.26-0.4.fc22.s390
74c74
< libgcc-5.0.1-0.1.fc22.s390
---
> libgcc-5.0.0-0.17.fc22.s390
76c76
< libgomp-5.0.1-0.1.fc22.s390
---
> libgomp-5.0.0-0.17.fc22.s390
81c81
< libmount-2.26-1.fc22.s390
---
> libmount-2.26-0.4.fc22.s390
89c89
< libsmartcols-2.26-1.fc22.s390
---
> libsmartcols-2.26-0.4.fc22.s390
91,92c91,92
< libstdc++-5.0.1-0.1.fc22.s390
< libstdc++-devel-5.0.1-0.1.fc22.s390
---
> libstdc++-5.0.0-0.17.fc22.s390
> libstdc++-devel-5.0.0-0.17.fc22.s390
98,99c98,99
< libuuid-2.26-1.fc22.s390
< libuuid-devel-2.26-1.fc22.s390
---
> libuuid-2.26-0.4.fc22.s390
> libuuid-devel-2.26-0.4.fc22.s390
167c167
< redhat-rpm-config-28-1.fc23.noarch
---
> redhat-rpm-config-27-1.fc22.noarch
184c184
< util-linux-2.26-1.fc22.s390
---
> util-linux-2.26-0.4.fc22.s390

Comment 3 Eric Sandeen 2015-04-14 14:38:02 UTC

Is e2fsprogs installed in the former build root and not the latter?  That would explain the missing dumpe2fs, and the script is trying to run it from the root, not from the build tree.  Does this fix it?



diff --git a/tests/scripts/resize_test b/tests/scripts/resize_test
index 0633e0c..dfd45ac 100755
--- a/tests/scripts/resize_test
+++ b/tests/scripts/resize_test
@@ -67,7 +67,7 @@ fi
 echo $FSCK -fy $TMPFILE >> $LOG 2>&1
 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1
 then
-	dumpe2fs $TMPFILE >> $LOG
+	$DUMPE2FS $TMPFILE >> $LOG
 	return 1
 fi
 
@@ -97,7 +97,7 @@ fi
 echo $FSCK -fy $TMPFILE >> $LOG 2>&1
 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1
 then
-	dumpe2fs $TMPFILE >> $LOG
+	$DUMPE2FS $TMPFILE >> $LOG
 	return 1
 fi
 
@@ -122,7 +122,7 @@ fi
 echo $FSCK -fy $TMPFILE >> $LOG 2>&1
 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1
 then
-	dumpe2fs $TMPFILE >> $LOG
+	$DUMPE2FS $TMPFILE >> $LOG
 	return 1
 fi
 
@@ -147,7 +147,7 @@ fi
 echo $FSCK -fy $TMPFILE >> $LOG 2>&1
 if ! $FSCK -fy $TMPFILE >> $LOG 2>&1
 then
-	dumpe2fs $TMPFILE >> $LOG
+	$DUMPE2FS $TMPFILE >> $LOG
 	return 1
 fi

Comment 4 Dan Horák 2015-04-14 17:03:35 UTC

(In reply to Eric Sandeen from comment #3)
> Is e2fsprogs installed in the former build root and not the latter?  That
> would explain the missing dumpe2fs, and the script is trying to run it from
> the root, not from the build tree.  Does this fix it?

unfortunately not -> http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1777003

If I see correctly, then in neither case e2fsprogs are installed in buildroot.

Comment 5 Eric Sandeen 2015-04-14 17:12:46 UTC

Hm, thanks.

So that changes it from:

./scripts/resize_test: line 100: dumpe2fs: command not found
r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed

to:

dumpe2fs 1.42.12 (29-Aug-2014)
r_ext4_small_bg: ext4 1024 blocksize with small block groups: failed

but it still fails.  Odd.  I need to fix that test infra to say *how* it failed.  :(

Comment 6 Dan Horák 2015-04-14 17:23:48 UTC

with added %undefine _hardened_build on top of the spec it passes - http://s390.koji.fedoraproject.org/koji/taskinfo?taskID=1777017

Comment 7 Dan Horák 2015-04-14 17:27:18 UTC

Created attachment 1014423 [details]
test log file

Comment 8 Dan Horák 2015-05-11 13:09:48 UTC

few more info
- rechecked with e2fsprogs-1.42.12-5.fc23
- reproduced with gcc-5.1.1-1.fc22.s390 and hardened build enabled
- going from -O2 to -O1 and keeping hardened build makes the tests pass

So very likely a compiler issue.

Jakub, I have a F-22 s390 chroot ready and accessible via ssh, ping me for details.

Comment 9 Jakub Jelinek 2015-05-11 15:04:17 UTC

So, it seems what matters for this test is whether e2fsck program has been linked with just -Wl,-z,relro -pie or -Wl,-z,relro -pie -Wl,-z,now (hardening adds both, which doesn't work).  But, there is no change in the generated code between that, the only changes are in the layout of the sections (the first one lays sections in the RW segment out so that the .got section is 4KB aligned, the latter has 16 byte longer .dynamic section and lays them out so that the .data section is 4KB aligned).
Replacing -Wl,-z,relro with -Wl,-z,norelro also fixes the problem (again, the section layout of the writable sections is very much different), both with -Wl,-z,now and without.

Thus, I'd say this is likely some uninitialized variable issue or something similar where things go wrong depending on the exact data section layout.

I'd suggest if somebody familiar with e2fsck sources just did a side-by-side debugging session between e2fsck linked with the default flags (-Wl,-z,relro -pie -Wl,-z,now) vs. any of the working combinations (-pie -Wl,-z,norelro,-z,now or -pie -Wl,z,relro) to find out where and why things start to look different.

Not ruling gcc issues, but but in e2fsck is much more likely in this case.

Comment 10 Eric Sandeen 2015-05-11 15:16:39 UTC

Wish this didn't require an s390 box!

I wonder if running that test through valgrind would offer any clues, if Jakub thinks it might be an uninitialized variable?

Comment 11 Dan Horák 2015-05-11 15:22:19 UTC

And there is no valgrind for s390 (32 bit). But Jakub gave me few hints, so I'll ask guys is our team to look on them first.

From calling sequence in the logs (mke2fs, e2fsck, resize, e2fsck, resize, e2fsck) it looks like the last e2fsck fails.

Comment 12 Eric Sandeen 2015-05-11 15:24:00 UTC

Ok, try to get it down to a filesystem image + e2fsck call which fails; if it is an uninit var it might be discoverable via valgrind on another platform.

I'm not sure how to arrive at your test platform so I'm not sure how to dig into this one.  If you have a box/chroot I could log into and poke around in, I'd be willing.

Maybe find me on IRC if you want me to log in and we can hash out details.  :)

Thanks,
-Eric

Comment 13 Eric Sandeen 2015-05-11 17:51:37 UTC

This seems to reproduce it, but not with a stock ./configure && make

#!/bin/bash

MKFS="misc/mke2fs -t ext4"
RESIZE2FS=resize/resize2fs
E2FSCK=e2fsck/e2fsck
DEBUGFS=debugfs/debugfs.static

rm -f fsfile.img
truncate -s 2G fsfile.img

$MKFS -O '^resize_inode' -b 1024 -g 512 -qF fsfile.img 64M

OUT_TMP=tmpfile
date > $OUT_TMP
cat $E2FSCK >> $OUT_TMP

# slightly populate the fs

$DEBUGFS -w fsfile.img  << EOF
mkdir test
cd test
write $OUT_TMP e2fsck
ls /test
stat /test/e2fsck
quit
EOF

rm -f $OUT_TMP

$E2FSCK -fn fsfile.img || echo "initial fsck failed"

$RESIZE2FS    fsfile.img 2G
$E2FSCK -fn fsfile.img || echo "post-resize fsck failed"

$RESIZE2FS -M fsfile.img 2G
$E2FSCK -fn fsfile.img || echo "post-resize -M fsck failed"

Comment 14 Eric Sandeen 2015-05-11 18:10:38 UTC

On x86_64, both resize2fs invocations yield this under valgrind:

==20334== Invalid read of size 4
==20334==    at 0x406E70: main (main.c:472)
==20334==  Address 0x4c26e08 is 40 bytes inside a block of size 296 free'd
==20334==    at 0x4A063F0: free (vg_replace_malloc.c:446)
==20334==    by 0x406679: resize_fs (resize2fs.c:211)
==20334==    by 0x406FFF: main (main.c:457)

Comment 15 Eric Sandeen 2015-05-11 19:50:00 UTC

Ok, that's not the root cause, that's just a use after free when printing the new size:

commit deae5e809b524a3cca3ecf66be28058134575a02
Author: Theodore Ts'o <tytso>
Date:   Wed Oct 8 12:09:35 2014 -0400

    resize2fs: fix fs->blocksize dereference after fs has been freed
    
    Commit 77255cf36944b introduced a use after free bug.
    
    Signed-off-by: Theodore Ts'o <tytso>

Comment 16 Eric Sandeen 2015-05-11 19:51:51 UTC

For posterity, the way this resize test fails is:

...
Pass 5: Checking group summary information
Group 12 block(s) in use but group is marked BLOCK_UNINIT
Fix? no

Block bitmap differences:  +6145
Fix? no


fsfile.img: ********** WARNING: Filesystem still has errors **********

fsfile.img: 13/2816 files (0.0% non-contiguous), 5491/10962 blocks
post-resize -M fsck failed

Comment 17 Eric Sandeen 2015-05-11 20:53:58 UTC

I'm stumped.  :/

Comment 18 Dan Horák 2015-05-11 21:02:11 UTC

Do we know if the fs image after the "resize -M" is correct and it is the e2fsck what's buggy? Or is the fs image already corrupted?

Comment 19 Eric Sandeen 2015-05-11 21:06:39 UTC

The image checks cleanly prior to the resize2fs -M, and badly after it.

(I suppose it's possible that the first e2fsck missed it but that seems unlikely)

I've tried racing through where we should get to the point where we clear the UNINIT flag, and ... things get wonky.  I'm not sure if that's my failing, or gcc's :)

Comment 20 Dan Horák 2015-05-11 21:13:29 UTC

ok, time for the hard way of "bisecting" through the sources then :-) Thanks for your time, Eric.

Comment 21 Dan Horák 2015-05-12 10:18:52 UTC

little progress - switching e2fsck/rehash.c to -O1 makes the problem go away, trying to find the function now ...

Comment 22 Dan Horák 2015-05-12 11:33:53 UTC

so there are 2 things that help - switching 3 functions in rehash.c to O1 (see below) or rebuilding e2fsck with -fno-strict-aliasing, so it still might be a coding issue

--- rehash.c	2015-05-12 07:28:36.000000000 -0400
+++ /home/sharkcz/e2fsprogs-sandeen/e2fsprogs-1.42.12/e2fsck/rehash.c	2014-08-02 16:26:22.000000000 -0400
@@ -276,7 +274,7 @@
  * expand the length of the filename beyond the padding available in
  * the directory entry.
  */
-__attribute__((optimize("O1"))) static void mutate_name(char *str, __u16 *len)
+static void mutate_name(char *str, __u16 *len)
 {
 	int	i;
 	__u16	l = *len & 0xFF, h = *len & 0xff00;
@@ -549,7 +547,7 @@
  * This function takes the leaf nodes which have been written in
  * outdir, and populates the root node and any necessary interior nodes.
  */
-__attribute__((optimize("O1"))) static errcode_t calculate_tree(ext2_filsys fs,
+static errcode_t calculate_tree(ext2_filsys fs,
 				struct out_dir *outdir,
 				ext2_ino_t ino,
 				ext2_ino_t parent)
@@ -634,7 +632,7 @@
 /*
  * Helper function which writes out a directory block.
  */
-__attribute__((optimize("O1"))) static int write_dir_block(ext2_filsys fs,
+static int write_dir_block(ext2_filsys fs,
 			   blk64_t *block_nr,
 			   e2_blkcnt_t blockcnt,
 			   blk64_t ref_block EXT2FS_ATTR((unused)),

Comment 23 Jakub Jelinek 2015-05-12 11:58:44 UTC

The code doesn't look to be aliasing safe.  Haven't spent much time on it, but e.g.
struct ext2_dx_entry {
        __u32 hash;
        __u32 block;
};

struct ext2_dx_countlimit {
        __u16 limit;
        __u16 count;
};

...
                                dx_ent = set_int_node(fs, block_start);
                                limit = (struct ext2_dx_countlimit *) dx_ent;
                                c2 = limit->limit;
                                root_offset += sizeof(struct ext2_dx_entry);
                                c1--;
                        }
                        dx_ent->block = ext2fs_cpu_to_le32(i);
                        if (c2 != limit->limit)
                                dx_ent->hash =
                                        ext2fs_cpu_to_le32(outdir->hashes[i]);
                        dx_ent++;
                        c2--;
                }
                limit->count = ext2fs_cpu_to_le16(limit->limit - c2);
                limit->limit = ext2fs_cpu_to_le16(limit->limit);

So, set_int_node writes some chunk of memory using ext2_dx_countlimit struct,
and then mixes reads and writes from it using both struct.  Not saying this is exactly the problem spot, the code seems to be full of issues like that.

Comment 24 Eric Sandeen 2015-05-13 15:39:40 UTC

Should we just add a -fno-strict-aliasing to the top-level Makefile's BUILD_CFLAGS?  (Given that the code seems to be full of issues like that)  :)

-Eric

Comment 25 Jakub Jelinek 2015-05-13 15:44:21 UTC

Supposedly yes, that is the quickest solution.  If you'd want to know why exactly it misbehaved and have proper analysis, it would require somebody to spend time on the concurrent debugging sessions of both -fno-strict-aliasing and -fstrict-aliasing built compiled e2fsck and see when the two start to diverge in their decisions and with those details let me have a look at that.
We don't (yet) have an strict aliasing sanitizer, and even if we would, I bet we'd certainly flag the above as violation, because the accesses aren't through a union and the aliasing incompatible structs overlap each other.

Comment 26 Eric Sandeen 2015-05-13 16:23:57 UTC

Sure, I'm not trying to argue that these things shouldn't be fixed, but on the other hand, it'd be good to get a non-broken build sooner rather than later.  :)  I'll certainly take a look at your example above, but like you said, I think it's one of many.

Comment 27 Jakub Jelinek 2015-05-13 16:26:56 UTC

Also a question is if this kind of programming style (reusing the same memory for different structs without using unions) is in all files, or just in e2fsck, or just in a subset of e2fsck.

Comment 28 Dan Horák 2015-05-13 16:31:09 UTC

(In reply to Eric Sandeen from comment #26)
> Sure, I'm not trying to argue that these things shouldn't be fixed, but on
> the other hand, it'd be good to get a non-broken build sooner rather than
> later.  :)  I'll certainly take a look at your example above, but like you
> said, I think it's one of many.

I would just add -fno-strict-aliasing to the spec file with a reference to this bz and left the real solution for later. And switch this bug to e2fsprogs and leave it open, so it's tracked.

Comment 29 Eric Sandeen 2015-05-13 16:49:03 UTC

Well, e2fsck has to cope with every possible bit of metadata; things like debugfs more or less do, as well, usually via libext2.  So this style is probably fairly common.  I'd have to go look.

ISTR that the directory indexing / htree code is the most frequent culprit.

There have been other fixes in the past, i.e..

c816ecb e2fsprogs: fix type-punning warnings
2694f31 Fix type punning bugs in ext2fs_get_mem() and ext2fs_free_mem()


-Eric

Comment 30 Dan Horák 2015-06-05 13:07:27 UTC

Erix, shall we do a new build with -fno-strict-aliasing set globally? I think it is the only safe option now.

Comment 31 Eric Sandeen 2015-06-05 13:45:41 UTC

Sorry, this fell off my TODO list somehow.  Yes, I can set that and rebuild.  Doing it for rawhide is enough, right?

-Eric

Comment 32 Dan Horák 2015-06-05 14:35:54 UTC

(In reply to Eric Sandeen from comment #31)
> Sorry, this fell off my TODO list somehow.  Yes, I can set that and rebuild.
> Doing it for rawhide is enough, right?
> 
> -Eric

no problem :-) Yes, rawhide for sure, not so sure about f22, because I think we are only lucky there, same gcc, potentially the same problem.

Comment 33 Eric Sandeen 2015-06-08 21:21:38 UTC

Ok built in rawhide now.  Sorry for the delay!
e2fsprogs-1.42.13-2.fc23

Note You need to log in before you can comment on or make changes to this bug.