Bug 985342
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | illegal instructions with glibc-2.17.90 on armv7hl | | |
| Product | [Fedora] Fedora | Reporter | Jiri Kastner <jkastner> |
| Component | glibc | Assignee | Carlos O'Donell <codonell> |
| Status | CLOSED ERRATA | QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| Severity | high | Docs Contact | |
| Priority | unspecified | | |
| Version | rawhide | CC | awilliam, blc, codonell, cz172638, fweimer, hdegoede, jakub, jreznik, kmcmartin, kparal, law, pbrobinson, pfrankli, pschindl, pwhalen, robatino, schwab, spoyarek |
| Target Milestone | --- | Keywords | Reopened |
| Target Release | --- | | |
| Hardware | arm | | |
| OS | Unspecified | | |
| Whiteboard | AcceptedBlocker | | |
| Fixed In Version | glibc-2.17-18.fc19 | Doc Type | Bug Fix |
| Doc Text | | Story Points | --- |
| Clone Of | | Environment | |
| Last Closed | 2013-08-30 09:30:34 UTC | Type | Bug |
| Regression | --- | Mount Type | --- |
| Documentation | --- | CRM | |
| Verified Versions | | Category | --- |
| oVirt Team | --- | RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- | Target Upstream Version | |
| Embargoed | | | |
| Bug Depends On | | | |
| Bug Blocks | 245418, 980649 | | |
| Attachments | | | |
Description
Jiri Kastner
2013-07-17 10:24:05 UTC
<dgilmore> [root@wandboard01 ~]# rpm -q glibc
<dgilmore> glibc-2.17.90-1.fc20.armv7hl
<e-ndy> dgilmore, on trimslice i'm getting illegal instructions
<nchauvet_> trimslice doesn't have neon instruction ...
<dgilmore> e-ndy: im guessing something is enabled that shouldnt be
<e-ndy> probably
<dgilmore> either neon or 32 vfp registers
<dgilmore> tegra2 has only 16 registers and no neon

As of Fedora 19 the Trimslice is no longer supported. Fedora rawhide only supports the VFP ABI and your Tegra 2 based system doesn't support the instructions that are part of the VFP ABI and therefore you will get SIGILL when executing those instructions. The best that can be done here would have been to have the dynamic linker prevent you from starting any userspace applications at all by printing an error and halting. You would have immediately bricked your box, but you'd know why. Marking CLOSED/NOTABUG.

Jiri, could you please

echo 1 >/proc/sys/kernel/print-fatal-signals

and check dmesg for the oops that should be printed? I'd like to decode the actual faulting instruction to figure out where the failure is.

regards, Kyle

Sorry, Trimslice is definitely supported in F19. The Fedora ARM project uses the armv7hl target. According to "/usr/lib/rpm/redhat/rpmrc" (Provided by redhat-rpm-config), the compiler flag definition is:

optflags: armv7hl %{__global_cflags} -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard

This is what we have used since the Fedora 15 hard float bootstrap.

(In reply to Brendan Conoboy from comment #4)
> Sorry, Trimslice is definitely supported in F19. The Fedora ARM project
> uses the armv7hl target. According to "/usr/lib/rpm/redhat/rpmrc" (Provided
> by redhat-rpm-config), the compiler flag definition is:
>
> optflags: armv7hl %{__global_cflags} -march=armv7-a -mfpu=vfpv3-d16
> -mfloat-abi=hard
>
> This is what we have used since the Fedora 15 hard float bootstrap.

Brendan thanks for the clarification and the distinction between armv7hl vs. armv7hnl (which requires neon). I had misunderstood and expected armv7hl to require NEON which excludes Trimslice.

I will say that in general VFP ABI without NEON is not a configuration that upstream tests, but it's a valid configuration for systems that do not have NEON.

The next step is to find out exactly what instruction faulted and where.

Created attachment 775172 [details]
dmesg with oops
(In reply to Carlos O'Donell from comment #5)
> (In reply to Brendan Conoboy from comment #4)
> > Sorry, Trimslice is definitely supported in F19. The Fedora ARM project
> > uses the armv7hl target. According to "/usr/lib/rpm/redhat/rpmrc" (Provided
> > by redhat-rpm-config), the compiler flag definition is:
> >
> > optflags: armv7hl %{__global_cflags} -march=armv7-a -mfpu=vfpv3-d16
> > -mfloat-abi=hard
> >
> > This is what we have used since the Fedora 15 hard float bootstrap.
>
> Brendan thanks for the clarification and the distinction between armv7hl vs.
> armv7hnl (which requires neon). I had misunderstood and expected armv7hl to
> require NEON which excludes Trimslice.
>
> I will say that in general VFP ABI without NEON is not a configuration that
> upstream tests, but it's a valid configuration for systems that do not have
> NEON.
>
> The next step is to find out exactly what instruction faulted and where.

what i did:
echo 1 >/proc/sys/kernel/print-fatal-signals
yum --disablerepo=\* --enablerepo=rawhide --installroot=/mnt/target install glibc coreutils
chroot /mnt/target

in chroot:
ls
ls /root
rpm -qa
rpmdb --help

(In reply to Jiri Kastner from comment #7)
> echo 1 >/proc/sys/kernel/print-fatal-signals

This isn't enough since it doesn't appear to contain the faulting instruction.

The best would be to get a core file for any SIGILL. Take that core file to a working system and use gdb to do a backtrace and disassembly to see exactly what instruction was the one that faulted. Extra points if you identify the routine in the original libc.so.6 that had the faulting instruction.

The core files show a completely corrupt stack with no ability to determine the faulting instruction. It looks like the applications jumped into the heap and started executing data which leads to SIGILL. This looks like it could be a glibc bug.

I need the reporter to try the following:

* Try to reproduce this using a trivial application. e.g. write a hello world C application.
* Compile the trivial application against the chroot's version of glibc. e.g. gcc -Wl,--dynamic-linker=/chroot/lib/ld-linux.so.3 -Wl,-rpath=/chroot/lib:/chroot/usr/lib -g3 -O0 -o app app.c
* Debug the application e.g. gdb app, and see if it works.

If glibc is a problem then the application should crash and you should be able to try to debug why.

Notes:
- Don't use threads.

[root@dhcp-26-143 ~]# gcc -Wl,--dynamic-linker=/root/rawhide/lib/ld-linux.so.3 -Wl,-rpath=/root/rawhide/lib/:/root/rawhide/usr/lib -g3 -O0 -o app app.c
[root@dhcp-26-143 ~]# gdb app
GNU gdb (GDB) Fedora (7.6-30.fc19)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "armv7hl-redhat-linux-gnueabi".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/app...done.
(gdb) run
Starting program: /root/app
/root/app: relocation error: /root/rawhide/lib/libc.so.6: symbol _dl_find_dso_for_object, version GLIBC_PRIVATE not defined in file ld-linux-armhf.so.3 with link time reference
[Inferior 1 (process 1675) exited with code 0177]
(gdb) quit

(In reply to Jiri Kastner from comment #10)
> /root/app: relocation error: /root/rawhide/lib/libc.so.6: symbol
> _dl_find_dso_for_object, version GLIBC_PRIVATE not defined in file
> ld-linux-armhf.so.3 with link time reference

What does ldd /root/app say?

Either way I've found the problem. Your chroot is hard-float with NEON support required and that is incompatible with your Trimslice hardware which lacks NEON.

readelf -a -W /root/rawhide/lib/libc-2.17.90.so
...
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3
  Tag_Advanced_SIMD_arch: NEONv1
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_enum_size: int
  Tag_ABI_HardFP_use: SP and DP
  Tag_ABI_VFP_args: VFP registers
  Tag_CPU_unaligned_access: v6

Your installed system is simply hard-float with only vfpv3 required e.g. armv7hl

readelf -a -W /lib/libc-2.17.so
...
Attribute Section: aeabi
File Attributes
  Tag_CPU_name: "7-A"
  Tag_CPU_arch: v7
  Tag_CPU_arch_profile: Application
  Tag_ARM_ISA_use: Yes
  Tag_THUMB_ISA_use: Thumb-2
  Tag_FP_arch: VFPv3-D16
  Tag_ABI_PCS_wchar_t: 4
  Tag_ABI_FP_rounding: Needed
  Tag_ABI_FP_denormal: Needed
  Tag_ABI_FP_exceptions: Needed
  Tag_ABI_FP_number_model: IEEE 754
  Tag_ABI_align_needed: 8-byte
  Tag_ABI_align_preserved: 8-byte, except leaf SP
  Tag_ABI_enum_size: int
  Tag_ABI_HardFP_use: SP and DP
  Tag_ABI_VFP_args: VFP registers
  Tag_CPU_unaligned_access: v6

You need to install a non-NEON rawhide chroot.

Sorry for not noticing this earlier when you provided access to the box.

Does that clarify the issue?

Is everyone OK with me closing this as NOTABUG?

All rawhide is "non-NEON" as we don't build armv7hnl, just armv7hl, so either GCC is generating NEON code when it hasn't been asked for, or something else is amiss.

(In reply to Kyle McMartin from comment #12)
> All rawhide is "non-NEON" as we don't build armv7hnl, just armv7hl, so
> either GCC is generating NEON code when it hasn't been asked for, or
> something else is amiss.

OK, rawhide is new enough that it has IFUNC support and therefore multiarch support.

e.g.
00089d80 <__GI___memcpy_neon>:
__memcpy_neon():
   89d80: e1a0c000 mov ip, r0
   89d84: e3520040 cmp r2, #64 ; 0x40
   89d88: aa000019 bge 89df4 <__GI___memcpy_neon+0x74>
   89d8c: e2023038 and r3, r2, #56 ; 0x38
   89d90: e2633034 rsb r3, r3, #52 ; 0x34
   89d94: e08ff003 add pc, pc, r3
   89d98: f421070d vld1.8 {d0}, [r1]!
   89d9c: f40c070d vst1.8 {d0}, [ip]!
   89da0: f421070d vld1.8 {d0}, [r1]!
   89da4: f40c070d vst1.8 {d0}, [ip]!
   89da8: f421070d vld1.8 {d0}, [r1]!
   89dac: f40c070d vst1.8 {d0}, [ip]!
   89db0: f421070d vld1.8 {d0}, [r1]!
   89db4: f40c070d vst1.8 {d0}, [ip]!
   89db8: f421070d vld1.8 {d0}, [r1]!
   89dbc: f40c070d vst1.8 {d0}, [ip]!
   89dc0: f421070d vld1.8 {d0}, [r1]!
   89dc4: f40c070d vst1.8 {d0}, [ip]!
   89dc8: f421070d vld1.8 {d0}, [r1]!
   89dcc: f40c070d vst1.8 {d0}, [ip]!
...

The vld1.8 is a part of the NEON memcpy.

However, in 2.17 we have IFUNC support for selecting between VFP and NEON.

00084600 <memcpy>:
__GI_memcpy():
   84600: e59f1010 ldr r1, [pc, #16] ; 84618 <memcpy+0x18>
   84604: e3100a01 tst r0, #4096 ; 0x1000
   84608: 159f1004 ldrne r1, [pc, #4] ; 84614 <memcpy+0x14>
   8460c: e081000f add r0, r1, pc
   84610: e12fff1e bx lr
   84614: 0000576c .word 0x0000576c
   84618: 00005c6c .word 0x00005c6c
$d():
   8461c: e1a00000 .word 0xe1a00000
   84620: e1a00000 .word 0xe1a00000
   84624: e1a00000 .word 0xe1a00000
   84628: e1a00000 .word 0xe1a00000
   8462c: e1a00000 .word 0xe1a00000
   84630: e1a00000 .word 0xe1a00000
   84634: e1a00000 .word 0xe1a00000
   84638: e1a00000 .word 0xe1a00000
   8463c: e1a00000 .word 0xe1a00000

Which is roughly:

   ldr r1, .Lmemcpy_vfp
   tst r0, #HWCAP_ARM_NEON
   ldrne r1, .Lmemcpy_neon
   add r0, r1, pc
   DO_RET(lr)

The AT_HWCAP is passed into the resolver as r0 (arg0) of the function.

#define HWCAP_NEON (1 << 12)

So 4096 matches.

Rebuilding /root/app with a direct reference to the non-symlinked dynamic loader lets us load it correctly.

Note:
ls -alt /root/rawhide/lib/ld-linux.so.3
lrwxrwxrwx. 1 root root 24 Jul 18 11:09 /root/rawhide/lib/ld-linux.so.3 -> /lib/ld-linux-armhf.so.3

Should have been a relative symlink :-(

~~~ glibc.spec ~~~
# Leave a compatibility symlink for the dynamic loader on armhfp targets,
# at least until the world gets rebuilt
%ifarch armv7hl armv7hnl
ln -sf /lib/ld-linux-armhf.so.3 $RPM_BUILD_ROOT/lib/ld-linux.so.3
%endif
~~~

(Yes the glibc "Move to /usr" transition is still incomplete, but we've rewritten the spec file for that and it will be in rawhide and rhel-7.0 soon)

That should be something like this:

~~~
pushd $RPM_BUILD_ROOT%{libdir}
ln -sf ld-linux-armhf.so.3 ld-linux.so.3
popd
~~~

I'll make note of this for later.

./build-app.sh
ldd app
  libc.so.6 => /root/rawhide/lib/libc.so.6 (0xb6dbb000)
  libgcc_s.so.1 => /root/rawhide/lib/libgcc_s.so.1 (0xb6d95000)
  /root/rawhide/lib/ld-linux-armhf.so.3 => /lib/ld-linux-armhf.so.3 (0x4d438000)

gdb app
GNU gdb (GDB) Fedora (7.6-30.fc19)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "armv7hl-redhat-linux-gnueabi".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/app...done.
(gdb) r
Starting program: /root/app
Missing separate debuginfo for /root/rawhide/usr/lib/ld-linux-armhf.so.3
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/ec/d92013935f4d40996c9f62b64c0acd56121fff.debug
Missing separate debuginfo for /root/rawhide/lib/libc.so.6
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/9d/e30aa450690168dcf0bae2992ad434a09a67ab.debug
Missing separate debuginfo for /root/rawhide/lib/libgcc_s.so.1
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/17/a726db52ebf56cdf57c03aa483476c70a63d75.debug
hello, world[Inferior 1 (process 3474) exited with code 01]
(gdb)

No failure.

Adding memcpy to the mix:

gdb app
GNU gdb (GDB) Fedora (7.6-30.fc19)
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "armv7hl-redhat-linux-gnueabi".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/app...done.
(gdb) break main
Breakpoint 1 at 0x85b0: file app.c, line 9.
(gdb) r
Starting program: /root/app
Missing separate debuginfo for /root/rawhide/usr/lib/ld-linux-armhf.so.3
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/ec/d92013935f4d40996c9f62b64c0acd56121fff.debug
Missing separate debuginfo for /root/rawhide/lib/libc.so.6
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/9d/e30aa450690168dcf0bae2992ad434a09a67ab.debug
Missing separate debuginfo for /root/rawhide/lib/libgcc_s.so.1
Try: yum --disablerepo='*' --enablerepo='*debug*' install /usr/lib/debug/.build-id/17/a726db52ebf56cdf57c03aa483476c70a63d75.debug

Breakpoint 1, main () at app.c:9
9 printf("hello, world");
(gdb) break memcpy
Breakpoint 2 at gnu-indirect-function resolver at 0xb6f03600
(gdb) c
Continuing.

Breakpoint 2, 0xb6f09280 in __memcpy_vfp () from /root/rawhide/lib/libc.so.6
(gdb)

We get the correct __memcpy_vfp from the resolver per the AT_HWCAP bits.

I see nothing wrong nor why the binaries would be crashing.

Someone needs to come up with a relinked example that fails and examine why it fails in the new rawhide runtime.

There might be a non-ifunc function that is using NEON, but I don't know.

Again a testcase that fails would really help :-)

Discussed at 2013-08-21 blocker review meeting [1]. This is accepted as an Alpha blocker, because it violates the following F20 alpha release criterion for trimslice (and potentially others): "A system installed with a release-blocking desktop must boot to a log in screen where it is possible to log in to a working desktop using a user account created during installation or a 'first boot' utility."
[2]

[1] http://meetbot.fedoraproject.org/fedora-blocker-review/2013-08-21/
[2] https://fedoraproject.org/wiki/Fedora_20_Alpha_Release_Criteria#Expected_image_boot_behavior

Carlos, we're definitely crashing inside __memcpy_neon with the vld1.8 instruction:
[ 170.431044] bash (878): undefined instruction: pc=b6e10540
[ 170.431060] Code: f421070d f40c070d f421070d f40c070d (f421070d)

Not entirely sure how this could be happening, all the kernel and glibc instrumentation I've done shows that elf_hwcap is right.

(In reply to Kyle McMartin from comment #15)
> Carlos, we're definitely crashing inside __memcpy_neon with the vld1.8
> instruction:
> [ 170.431044] bash (878): undefined instruction: pc=b6e10540
> [ 170.431060] Code: f421070d f40c070d f421070d f40c070d (f421070d)
>
> Not entirely sure how this could be happening, all the kernel and glibc
> instrumentation I've done shows that elf_hwcap is right.

I'll debug from the ifunc resolver up to see what's happening. IIUC it shouldn't select the neon memcpy.

Finished a 2.19 master build on the test box (trimslice2) provided by Kyle.

The test results are *terrible*, I'd say a lot of the testsuite is failing with SIGILL. I don't know if it's random, but the results are terrible enough that such a build/test should never see the light of day. With glibc we are going to get to the point where such a build would have been failed if such failures didn't match baseline on a koji build system (but we aren't there yet).

Looking into the testsuite failures.

Kyle noticed that ARM isn't following the IFUNC API and is not passing in the hwcap as required to the resolver function for REL relocs.
There was a partial patch by Will Newton from Linaro which fixed this for RELA relocs but not REL [1].

AAELF for ARM requires support for both REL and RELA relocs:
~~~
4.6.1.1 Addends and PC-bias compensation
A binary file may use REL or RELA relocations or a mixture of the two (but multiple relocations for the same address must use only one type)
~~~

Kyle's patch fixes it for REL.

Testing Kyle's patch right now. If this fixes it I'll post and checkin upstream.

[1] http://patches.linaro.org/18232/

Patch posted upstream:
http://sourceware.org/ml/libc-ports/2013-08/msg00053.html

Set of testsuite failures is now down to a much more reasonable 7 failures, of which one is due to my failure to install stdc++-static for the static C++ tests (tst-cancel24-static).

Pushed into rawhide.

commit b8280fad3dff1766f9b87d38acefdfdb7ba61c09
Author: Carlos O'Donell <carlos>
Date: Wed Aug 28 00:34:43 2013 -0400

    Fix indirect function support to avoid calling optimized routines
    for the wrong hardware (#985342).

Build started:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5863325

(In reply to Carlos O'Donell from comment #22)
> Pushed into rawhide.
>
> commit b8280fad3dff1766f9b87d38acefdfdb7ba61c09
> Author: Carlos O'Donell <carlos>
> Date: Wed Aug 28 00:34:43 2013 -0400
>
> Fix indirect function support to avoid calling optimized routines
> for the wrong hardware (#985342).
>
> Build started:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5863325

Carlos, could you please build it for F20 too - as this is approved F20 Alpha blocker. Thanks.

(In reply to Jaroslav Reznik from comment #23)
> (In reply to Carlos O'Donell from comment #22)
> > Pushed into rawhide.
> >
> > commit b8280fad3dff1766f9b87d38acefdfdb7ba61c09
> > Author: Carlos O'Donell <carlos>
> > Date: Wed Aug 28 00:34:43 2013 -0400
> >
> > Fix indirect function support to avoid calling optimized routines
> > for the wrong hardware (#985342).
> >
> > Build started:
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=5863325
>
> Carlos, could you please build it for F20 too - as this is approved F20
> Alpha blocker. Thanks.

The rawhide build baseline matches my expectations so everything went well.

Pushed to f20.

Build started:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5864423

awesome, thank you carlos. i just tested it on trimslice with rawhide, no illegal instructions. same reproducer as in description.

Fixed upstream now:
http://sourceware.org/bugzilla/show_bug.cgi?id=15905

we should probably mark this fixed as soon as someone confirms an image built with the fixed glibc works on a trimslice.

We've had a number of people confirm this is now working so closing

glibc-2.17-18.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/glibc-2.17-18.fc19

glibc-2.18-9.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/glibc-2.18-9.fc20

glibc-2.18-9.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.

glibc-2.17-18.fc19 has been pushed to the Fedora 19 stable repository. If problems still persist, please make note of it in this bug report.