Bug 1615586 - segfault in ns-slapd during FreeIPA server deployment (or post-deployment operation) on Rawhide
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: 389-ds-base
Version: 29
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: mreynolds
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: AcceptedBlocker
Duplicates: 1615452
Depends On:
Blocks: F29BetaBlocker
 
Reported: 2018-08-13 18:54 UTC by Adam Williamson
Modified: 2018-09-07 23:54 UTC
CC List: 12 users

Fixed In Version: 389-ds-base-1.4.0.15-1.fc29
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-07 23:54:07 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
stacktrace.txt (31.18 KB, text/plain)
2018-08-14 19:16 UTC, Viktor Ashirov
core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4 (4.85 MB, application/octet-stream)
2018-08-14 19:17 UTC, Viktor Ashirov

Description Adam Williamson 2018-08-13 18:54:09 UTC
Most FreeIPA server deployment tests failed on both prod and staging openQA in the last two Rawhide composes, Fedora-Rawhide-20180812.n.0 and Fedora-Rawhide-20180813.n.0. In all cases, it seems like this was caused by a segfault in ns-slapd, like this (0813.n.0 prod):

======

Aug 13 09:48:27 ipa002.domain.local ns-slapd[32126]: [13/Aug/2018:12:48:27.184902543 -0400] - INFO - ldbm_back_ldbm2index - userRoot: Finished indexing.
Aug 13 09:48:27 ipa002.domain.local audit[32126]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32126 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 13 09:48:27 ipa002.domain.local kernel: ns-slapd[32156]: segfault at 1a2000 ip 00007f44055a651f sp 00007f43ebbf9360 error 4 in libjemalloc.so.2[7f44055a4000+6c000]
Aug 13 09:48:27 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 13 09:48:27 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 13 09:48:27 ipa002.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

======

or this (0813.n.0 stg):

Aug 13 09:19:46 ipa002.domain.local systemd[1]: Started The Apache HTTP Server.
Aug 13 09:19:46 ipa002.domain.local audit[32244]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32244 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 13 09:19:46 ipa002.domain.local kernel: ns-slapd[32257]: segfault at bd800 ip 00007efc81ad651f sp 00007efc6c5fb360 error 4 in libjemalloc.so.2[7efc81ad4000+6c000]
Aug 13 09:19:46 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 13 09:19:46 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 13 09:19:46 ipa002.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dirsrv@DOMAIN-LOCAL comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

======

or this (0812.n.0 stg):

Aug 12 08:22:59 ipa002.domain.local [32363]: GSSAPI client step 1
Aug 12 08:22:59 ipa002.domain.local kernel: ns-slapd[32250]: segfault at 6b18 ip 00007f10fb8be51f sp 00007f10e79fd360 error 4 in libjemalloc.so.2[7f10fb8bc000+6c000]
Aug 12 08:22:59 ipa002.domain.local kernel: Code: 28 41 81 e4 f0 00 00 00 49 01 dc 4d 8b 4c 24 28 4d 39 ce 0f 85 da 01 00 00 48 89 e8 48 c1 e8 09 25 f8 ff 1f 00 49 03 44 24 30 <4c> 8b 00 4c 8d 15 97 c8 06 00 4d 89 c1 49 c1 e9 30 44 89 ca 49 8b 
Aug 12 08:22:59 ipa002.domain.local audit[32239]: ANOM_ABEND auid=4294967295 uid=389 gid=389 ses=4294967295 subj=system_u:system_r:dirsrv_t:s0 pid=32239 comm="ns-slapd" exe="/usr/sbin/ns-slapd" sig=11 res=1
Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Main process exited, code=killed, status=11/SEGV
Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: LDAP error: Can't contact LDAP server: ldap_sync_poll() failed
Aug 12 08:22:59 ipa002.domain.local systemd[1]: dirsrv: Failed with result 'signal'.
Aug 12 08:22:59 ipa002.domain.local named-pkcs11[32658]: ldap_syncrepl will reconnect in 60 seconds

======

The segfault doesn't seem to happen at the exact same time or point of the tests each run, and in a couple of cases didn't happen at all (the non-replica server tests for 0813.n.0). It happened in 6 out of 8 total FreeIPA server deployment tests run on those two composes across prod and staging openQA.

Comment 1 Adam Williamson 2018-08-13 18:55:27 UTC
*** Bug 1615452 has been marked as a duplicate of this bug. ***

Comment 2 Adam Williamson 2018-08-13 19:05:13 UTC
Correction: crash happened in at least 7 out of 8 cases, it also happened in the 'successful' https://openqa.fedoraproject.org/tests/265344 , probably while one of the client tests was trying to do something, likely causing client test failure. It's a bit difficult trying to unpick all the different test failures ATM, but I suspect a lot if not all are down to this.

Comment 3 mreynolds 2018-08-13 19:13:12 UTC
Which version of 389-ds-base was used?

We just added changes to satisfy this bug around jemalloc (1.4.0.14-2):

https://bugzilla.redhat.com/show_bug.cgi?id=1615302

Perhaps this is part of the problem?

Comment 4 mreynolds 2018-08-13 19:25:05 UTC
Also, is there a core file or stack trace that I can look at?

Comment 5 mreynolds 2018-08-13 19:43:58 UTC
https://kojipkgs.fedoraproject.org/compose/rawhide/Fedora-Rawhide-20180813.n.0/compose/Everything/x86_64/os/Packages/3/

This is actually 389-ds-base-1.4.0.14-1. So this is not related to https://bugzilla.redhat.com/show_bug.cgi?id=1615302 (which is good).

But there are no known crashes in 1.4.0.14-1, so this is a new issue.  Really need a reproducer or corefile/stacktrace to look at.

Comment 6 Adam Williamson 2018-08-13 21:21:18 UTC
Well, it seems it quite often is reproduced just by running a FreeIPA server deployment on current Rawhide. I will try to reproduce it in a local VM and get at the core dump somehow (it seems like it's not being placed in /var/lib/systemd/coredump or /var/spool/abrt , where openQA looks for them from systemd and abrt respectively).

Comment 7 mreynolds 2018-08-13 21:49:56 UTC
(In reply to Adam Williamson from comment #6)
> Well, it seems it quite often is reproduced just by running a FreeIPA server
> deployment on current Rawhide. 

Do you mean just installing FreeIPA or running some kind of test suite?  If it's just installing FreeIPA then I can work on that.

Comment 8 Adam Williamson 2018-08-13 22:01:47 UTC
Just installing necessary packages and doing 'ipa-server-install' with appropriate options. The test that's failing the most runs it like this:

ipa-server-install -U --realm=DOMAIN.LOCAL --domain=domain.local --ds-password=monkeys123 --admin-password=monkeys123 --setup-dns --reverse-zone=2.0.10.in-addr.arpa --allow-zone-overlap --forwarder=10.5.126.21 --forwarder=10.5.126.22

obviously adjust host / realm names and IP addresses as appropriate.

Comment 9 Jan Kurik 2018-08-14 10:25:09 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 10 Viktor Ashirov 2018-08-14 19:16:40 UTC
Created attachment 1475948 [details]
stacktrace.txt

Comment 11 Viktor Ashirov 2018-08-14 19:17:59 UTC
Created attachment 1475949 [details]
core.ns-slapd.389.c2d5d91a0ca341219d1c21ee196232b1.19895.1534273439000000.lz4

Comment 13 mreynolds 2018-08-14 19:23:56 UTC
Looks like it's a double free in the connection code:

#8  0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
1585	                        slapi_sdn_free(&anon_sdn);


(gdb) p *anon_sdn
$3 = {flag = 46 '.', udn = 0x65 <error: Cannot access memory at address 0x65>, 
  dn = 0x7fb49264a04000 <error: Cannot access memory at address 0x7fb49264a04000>, ndn = 0x0, ndn_len = 10496}


This is possibly related to nunc-stans being disabled.  Needs more investigation...

Comment 14 mreynolds 2018-08-14 19:34:01 UTC
Here is the crashing thread

#0  0x00007fb498f3f51f in atomic_load_p (mo=atomic_memory_order_relaxed, a=0x125020)
    at include/jemalloc/internal/atomic_gcc_atomic.h:18
#1  0x00007fb498f3f51f in rtree_leaf_elm_bits_read (dependent=true, elm=0x125020, rtree=<optimized out>, tsdn=<optimized out>)
    at include/jemalloc/internal/rtree.h:175
#2  0x00007fb498f3f51f in rtree_szind_slab_read
    (r_slab=<synthetic pointer>, r_szind=<synthetic pointer>, dependent=true, key=35945862888701952, rtree_ctx=0x7fb482172f28, rtree=<optimized out>, tsdn=<optimized out>) at include/jemalloc/internal/rtree.h:464
#3  0x00007fb498f3f51f in ifree (slow_path=false, tcache=0x7fb4821730c0, ptr=0x7fb49264a04000, tsd=<optimized out>)
    at src/jemalloc.c:2206
#4  0x00007fb498f3f51f in free (ptr=0x7fb49264a04000) at src/jemalloc.c:2393
#5  0x00007fb498e36fb5 in slapi_ch_free () at /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007fb498e40479 in slapi_sdn_done () at /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007fb498e40afe in slapi_sdn_free () at /usr/lib64/dirsrv/libslapd.so.0
#8  0x000055e04c06d1fb in connection_threadmain () at ldap/servers/slapd/connection.c:1585
#9  0x00007fb498a00ac8 in _pt_root (arg=0x7fb493fb8740) at ../../.././nspr/pr/src/pthreads/ptthread.c:201
#10 0x00007fb49899e58e in start_thread (arg=<optimized out>) at pthread_create.c:486
#11 0x00007fb4987d0513 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So the anon_sdn is corrupted, but looking at the code, it should not be. Going to try to reproduce using valgrind...

Comment 15 Adam Williamson 2018-08-14 20:01:42 UTC
Thanks for getting the core dump, Viktor!

Comment 16 Viktor Ashirov 2018-08-15 12:36:33 UTC
Update:
I've tested with ASan build:
* with 389-ds-base-1.4.0.13-1.fc29.x86_64 install succeeds without a crash.
* with 389-ds-base-1.4.0.14-2.fc29.x86_64 install fails with the stack trace: https://pagure.io/389-ds-base/issue/49029#comment-526297

With halt_on_error=false the ASan build doesn't continue and exits anyway. I also tried to attach with gdb (with sleep_before_dying=1000), but gdb coredumps :|

Anyway, it looks like the newly added internal operations logging code caused the issue.

Comment 17 mreynolds 2018-08-16 19:08:48 UTC
Fixed upstream:

https://pagure.io/389-ds-base/issue/49029

New fedora build complete.

Comment 18 Geoffrey Marr 2018-08-20 19:45:53 UTC
Discussed during the 2018-08-20 blocker review meeting: [1]

The decision to classify this bug as an "AcceptedBlocker" was made as it violates the following criteria:

"It must be possible to configure a Fedora Server system installed according to the above criteria as a FreeIPA domain controller, using the official deployment tools provided in the distribution FreeIPA packages. Once deployed, the system must handle multiple client enrolments and unenrolments, and client authentication via Kerberos..." 

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2018-08-20/f29-blocker-review.2018-08-20-16.00.txt

Comment 19 Adam Williamson 2018-09-07 23:54:07 UTC
FreeIPA tests have been passing quite reliably of late, so this was clearly fixed.

