Bug 1108219
| Field | Value |
|---|---|
| Summary | 4.9.0-6 broke build of postgresql on AArch64 |
| Product | [Fedora] Fedora |
| Component | gcc |
| Version | rawhide |
| Hardware | aarch64 |
| OS | Unspecified |
| Status | CLOSED RAWHIDE |
| Severity | unspecified |
| Priority | unspecified |
| Reporter | Marcin Juszkiewicz <mjuszkie> |
| Assignee | Jakub Jelinek <jakub> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | blc, hhorak, jakub, law, mjuszkie, pbrobinson, praiskup, vmakarov |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2014-06-24 15:55:08 UTC |
| Bug Blocks | 922257 |
Description
Marcin Juszkiewicz
2014-06-11 14:55:54 UTC
The create_index test triggers this kernel message (a bad pointer access):

[26985.027982] postgres[24006]: unhandled level 3 translation fault (11) at 0x13cc70b4, esr 0x92000007
[26985.037013] pgd = fffffe016dc20000
[26985.040399] [13cc70b4] *pgd=0000004156ea0003, *pmd=0000004156ea0003, *pte=0000000000000000
[26985.050155] CPU: 1 PID: 24006 Comm: postgres Tainted: GF 3.13.0-0.rc7.33.sa2.aarch64 #1
[26985.059072] task: fffffe03ff74cd00 ti: fffffe03a5100000 task.ti: fffffe03a5100000
[26985.066526] PC is at 0x4a5a2c
[26985.069476] LR is at 0x4a58f8
[26985.072431] pc : [<00000000004a5a2c>] lr : [<00000000004a58f8>] pstate: 80000000
[26985.079787] sp : 000003ffcd808a60
[26985.083086] x29: 000003ffcd808a60 x28: 000003ffcd809020
[26985.088390] x27: 000003ffa2b5ce48 x26: 0000000000000000
[26985.093696] x25: 000003ffa2b5ce51 x24: 000003ffcd809020
[26985.098999] x23: 0000000009e63834 x22: 000003ffcd809050
[26985.104306] x21: 0000000000000022 x20: 0000000000000001
[26985.109608] x19: 0000000000000001 x18: 0000000000002000
[26985.114915] x17: 000003ffaa067c80 x16: 0000000000911040
[26985.120218] x15: 00000000ffffffff x14: 0000000000000020
[26985.125525] x13: 2074432020202020 x12: 2020202020202020
[26985.130828] x11: 2020202020202020 x10: 2020202072656243
[26985.136135] x9 : 0000000000100001 x8 : 00000087000000e0
[26985.141442] x7 : 0000000000000008 x6 : 000000000000006d
[26985.146744] x5 : 0000000000000000 x4 : 0000000009e61280
[26985.152051] x3 : 000003ffa2b5ce52 x2 : 0000000009e63835

If you have access to an aarch64 box or chroot, can you please bisect which *.o file it is? Try mixing *.o files from a build with gcc-4.9.0-5.fc21 with *.o files from a build with gcc-4.9.0-6.fc21, and ideally narrow it down to a single file such that the build works when all *.o files come from 4.9.0-6.fc21 except that one from -5.fc21, and fails when all *.o files come from -5.fc21 except that one from -6.fc21.
If you get to this state, please attach the preprocessed source here and mention all gcc command line options used to compile it; I can then find out what changed using a cross-compiler. Thanks.

I have an AArch64 machine under my desk. My plan for tomorrow is to bisect the gcc 4.9.0-5 -> 4.9.0-6 update to find out exactly when it started failing.

If you mean bisecting redhat/gcc-4_9-branch, that is unlikely to help: there have been exactly 2 svn commits, one which added -fsanitize=float-cast-overflow (very unlikely to be related) and one which backported about 46 fixes from the upstream gcc-4_9-branch. I'd say bisecting the *.o files is faster; then one can e.g. try to reproduce with the upstream 4.9 branch, or use -fdump-tree-all -fdump-rtl-all to find out where the output starts to differ and from that guess the problematic change, etc.

It also wouldn't be a terrible idea to wait for Jakub to import the next build of gcc. We're still seeing a fair number of codegen bug reports whose fixes are being backported to the release (and presumably vendor) branch.

So by *.o you mean /usr/lib/gcc/*/*/*.o files?

No, I meant: build postgresql with gcc-4.9.0-5.fc21 and make a backup copy of the build tree, then build postgresql with gcc-4.9.0-6.fc21 and make a backup copy of that build tree. Then divide the *.o files in postgresql into approximately two halves: for the first half copy (+ touch) them from the backup tree built with 4.9.0-5.fc21, for the second half copy (+ touch) them from the backup tree built with 4.9.0-6.fc21, relink (hopefully just make would do, but please verify no *.o files are rebuilt in that step), and retest. If the result works, the problematic file is presumably in the first half; if it doesn't, the problematic file is presumably in the second half. Then divide the problematic half into two approximately equal parts and continue until you narrow it down to one file, then verify that it really is just that single file.
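The halving procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a script anyone in this bug actually ran: `bisect_objects` and `simulated_works` are hypothetical names, the `objNN.o` file names are made up, and the real copy/touch/relink/retest step is replaced by a stand-in callback. It also assumes exactly one object file is responsible, which the final check verifies in both directions.

```python
from typing import Callable, Sequence, Set


def bisect_objects(objects: Sequence[str],
                   works: Callable[[Set[str]], bool]) -> str:
    """Return the single object file that breaks the build.

    works(from_old) is expected to relink with the listed objects taken
    from the old (good) build tree and all others from the new (bad)
    build tree, run the tests, and return True if they pass.
    """
    if not objects:
        raise ValueError("no object files given")
    candidates = list(objects)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if works(set(half)):
            # Taking this half from the old tree fixed it: culprit is here.
            candidates = half
        else:
            # Still broken: culprit is in the other half.
            candidates = candidates[len(half):]
    culprit = candidates[0]
    others = set(objects) - {culprit}
    # Verify both directions: only the culprit old -> works,
    # everything but the culprit old -> still fails.
    if not works({culprit}) or works(others):
        raise RuntimeError("no single object file explains the failure")
    return culprit


# Made-up object list; the last name is the file identified later in this bug.
OBJECTS = [f"obj{i:02d}.o" for i in range(15)] + ["spgtextproc.o"]


def simulated_works(from_old: Set[str]) -> bool:
    """Stand-in for copy + touch + relink + 'make check'."""
    return "spgtextproc.o" in from_old


print(bisect_objects(OBJECTS, simulated_works))
```

In real use the callback would copy the chosen files from the two backup trees, relink without recompiling, and run the regression tests; with N object files this needs about log2(N) relink-and-test rounds plus the two verification runs.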
If nothing needs to be recompiled, each step is just editing a list file containing the names of the *.o files, copying/touching/relinking and retesting, so it shouldn't be that slow.

Note postgresql has a copy of the Spencer regex library, which we know GCC has been miscompiling on PPC & s390. I've just backported the fix for that bug into the upstream gcc-4.9 release branch, and when Jakub does the next resync & koji build Fedora will pick up that fix. Marcin, could you try just compiling the bits in the regex subdirectory with gcc-4.8 or without optimization and see if that improves the test results?

Updated to the latest gcc. Still fails:

======================================================
 38 of 136 tests failed, 1 of these failures ignored.
======================================================
The differences that caused some tests to fail can be viewed in the file "/builddir/build/BUILD/postgresql-9.3.4/src/test/regress/regression.diffs". A copy of the test summary that you see above is saved in the file "/builddir/build/BUILD/postgresql-9.3.4/src/test/regress/regression.out".
GNUmakefile:138: recipe for target 'check' failed
make: *** [check] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.0iwiQ5 (%build)
RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.0iwiQ5 (%build)

<mock-chroot>[mockbuild@pinkypie /]$ gcc --version
gcc (GCC) 4.9.0 20140612 (Red Hat 4.9.0-9)
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

As you said it regressed between 4.9.0-{5,6}.fc21, that is kind of expected; the jump threading bug Jeff fixed is older than that. Anyway, have you succeeded with the binary search for the problematic *.o file? (Of course, brute force can be replaced with a guess based on what you see in the debugger; then you can just try to replace that single file.)

OK.
I took gcc upstream git, extracted all commits between 20140518 (-5 Fedora) and 20140529 (-6 Fedora), and created a quilt patchset from them. This gave me 52 patches (I had to drop 2 of them as they were already in Fedora). I chrooted into mock with gcc 4.9.0-5 and started bisecting.

0029 = daily bump to 20140523 == fail
0021 = daily bump to 20140522 == works
0026 = PR target/61208 [1] == works

1. https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=a4fbabf41b62765a5da3a8e5394b4b3c7441315e

So the problem is in one of two patches:

7766b1d93 2014-05-22 Vladimir Makarov <vmakarov>
a790cfefd gcc/

But as 7766b1d93 is rs6000 related, I suspect a790cfefd to be the faulty one.

- https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=7766b1d931a85f4d6c887dd0164a94ee7b29be51
- https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=a790cfefdfbbc4e5aa23f11a163ecb51c84fd128

The PR60969 fix caused, I think, 2 regressions, but neither of those is wrong-code. In any case, we really need the problematic file (and to find the problematic function in it; hopefully, if the PR60969 fix doesn't cause too many changes in every function, one could do that by comparing the assembly between the two changes), otherwise Vlad can't work on a fix.

Looks like src/backend/access/spgist/spgtextproc.o is to blame.

Command to compile:

gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -DLINUX_OOM_SCORE_ADJ=0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -c -o spgtextproc.o spgtextproc.c

Can you preprocess it and attach spgtextproc.i? Just add -save-temps to the above command.

Created attachment 908547 [details]
archive with problematic file
Tarball contents:
t/spgtextproc.c
t/broken/ - built with gcc 4.9.0-5.0029
t/broken/spgtextproc.o
t/broken/spgtextproc.o.s
t/worked/ - built with gcc 4.9.0-5.0026
t/worked/spgtextproc.o
t/worked/spgtextproc.o.s
Assembly version done with "objdump -d".
Created attachment 908550 [details]
preprocessed source
Created attachment 908587 [details]
rh1108219.c
If my cross-compiler doesn't behave too differently from the native one, it seems the only differences are due to different register allocator decisions in the spg_text_choose function. Does that sound likely from what you see in the debugger? If yes, can you find out how many times that function is called before things go wrong and, ideally, in which iteration it does something wrong and, if possible, what?
I'm attaching a delta-reduced source for spg_text_choose. If the problem is indeed there, it would be best to turn this into a self-contained executable testcase: stub the palloc and pg_detoast_datum_packed functions in a different *.c file (it is enough if they set and return only whatever spg_text_choose needs), and add a stub main function in that same file which calls spg_text_choose with the parameters for which it misbehaves (of course, any pointers it dereferences must point to something, etc.).
Apart from reshuffling a few registers (x22->x24, x23->x22, x24->x23, w1->w21)
and some hopefully unimportant insn scheduling changes, I see two hunks that look different:
- ldr x3, [x1,w2,sxtw 3]
- uxtb w3, w3
+ ldrb w3, [x1,x0]
and
+ ldrb w3, [x1,x0]
add w4, w7, w5
asr w4, w4, 1
- ldr x3, [x1,w4,sxtw 3]
- uxtb w3, w3
Any help in finding out what goes wrong in that function and if it is really that function would be appreciated. E.g. try to add __attribute__((optimize (0))) on that function and see if the problems go away...
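To make the two hunks above easier to compare, here is a small Python model of the two load sequences, assuming little-endian memory. It is a sketch, not output from any tool: the function names are hypothetical, and the thread does not say what x0 holds in the new code. The "unscaled" case below is therefore purely an illustration of how the byte load only matches the old sequence if x0 carries the same sign-extended, scaled-by-8 offset.

```python
import struct


def sxtw(w: int) -> int:
    """Sign-extend a 32-bit register value (models the 'sxtw' extend)."""
    w &= 0xFFFFFFFF
    return w - (1 << 32) if w & 0x80000000 else w


def ldr_sxtw3_then_uxtb(mem: bytes, base: int, w_index: int) -> int:
    """Model 'ldr x3, [x1,w2,sxtw 3]' followed by 'uxtb w3, w3'."""
    addr = base + (sxtw(w_index) << 3)        # index is scaled by 8
    (x3,) = struct.unpack_from("<Q", mem, addr)  # 64-bit little-endian load
    return x3 & 0xFF                          # keep only the low byte


def ldrb_reg(mem: bytes, base: int, x_offset: int) -> int:
    """Model 'ldrb w3, [x1,x0]': byte load at base + offset, no scaling."""
    return mem[base + x_offset]


# Eight 64-bit slots whose low bytes are 0x10, 0x11, ..., 0x17.
mem = b"".join(struct.pack("<Q", 0x10 + i) for i in range(8))

index = 3
old = ldr_sxtw3_then_uxtb(mem, 0, index)
new_scaled = ldrb_reg(mem, 0, index << 3)   # x0 already holds index * 8
new_unscaled = ldrb_reg(mem, 0, index)      # hypothetical: scaling lost

print(hex(old), hex(new_scaled), hex(new_unscaled))
```

With the scaled offset both sequences read the same byte, so replacing the 64-bit load plus `uxtb` with a single `ldrb` is a valid simplification in that case; with any other value in x0 the load reads a different location.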
The patch in question itself should be safe, but I guess it just triggered some hidden bug. I'll investigate this further.

(In reply to Jakub Jelinek from comment #17)
> Created attachment 908587 [details]
> Apart from reshuffling a few registers (x22->x24, x23->x22, x24->x23,
> w1->w21)
> and some hopefully unimportant insn scheduling changes I see two hunks that
> look different:
>
> - ldr x3, [x1,w2,sxtw 3]
> - uxtb w3, w3
> + ldrb w3, [x1,x0]
>
> and
>
> + ldrb w3, [x1,x0]
> add w4, w7, w5
> asr w4, w4, 1
> - ldr x3, [x1,w4,sxtw 3]
> - uxtb w3, w3
>
> Any help in finding out what goes wrong in that function and if it is really
> that function would be appreciated. E.g. try to add __attribute__((optimize
> (0))) on that function and see if the problems go away...

I found this code suspicious too. It is the result of equivalent memory substitution, and I think some part of the address was lost.

The bug is in the address decomposition code in rtlanal.c. It was written by Richard Sandiford and is outside the code base I maintain. I'll try to make a patch, but I am not sure when it will be approved. I hope it will be fixed next week.

I found another solution inside the LRA code base. I committed the patch into gcc-4.9-branch.

Is the problem fixed with gcc-4.9.0-12.fc21?

We built postgresql with gcc 4.9.0-10.fc21 just fine.
http://arm.koji.fedoraproject.org/koji/buildinfo?buildID=203514