Bug 2064219 - 5 out of 6 OSD crashing after update to 17.1.0-0.2.rc1.fc37.x86_64
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: ceph
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kaleb KEITHLEY
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-03-15 10:51 UTC by Tomasz Torcz
Modified: 2022-03-18 11:26 UTC (History)

Fixed In Version: ceph-17.1.0-0.4.31.g1ccf6db7.fc37
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-18 04:38:21 UTC
Type: Bug
Embargoed:


Attachments
ceph-osd.0.2022-03-14.log (deleted)
2022-03-15 10:51 UTC, Tomasz Torcz


Links
System: Ceph Project Bug Tracker
ID: 54561
Private: 0
Priority: None
Status: None
Summary: None
Last Updated: 2022-03-15 11:47:58 UTC

Description Tomasz Torcz 2022-03-15 10:51:09 UTC
Created attachment 1865996
ceph-osd.0.2022-03-14.log

Description of problem:
After upgrading to 17.1.0-0.2.rc1.fc37.x86_64, 5 out of 6 of my OSDs are crashing on start.

2022-03-14T11:20:44.682+0100 7ff5a50d0180 -1 bluestore::NCB::__restore_allocator::Failed open_for_read with error-code -2
2022-03-14T11:20:44.682+0100 7ff5a50d0180  0 bluestore(/var/lib/ceph/osd/ceph-0) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
2022-03-14T11:20:54.767+0100 7ff5a50d0180 -1 /builddir/build/BUILD/ceph-17.1.0/src/os/bluestore/AvlAllocator.cc: In function 'virtual void AvlAllocator::init_add_free(uint64_t, uint64_t)' thread 7ff5a50d0180 time 2022-03-14T11:20:54.766296+0100
/builddir/build/BUILD/ceph-17.1.0/src/os/bluestore/AvlAllocator.cc: 442: FAILED ceph_assert(offset + length <= uint64_t(device_size))

 ceph version 17.1.0 (c675060073a05d40ef404d5921c81178a52af6e0) quincy (dev)


(full log attached)
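
For anyone triaging a similar OSD log, the failed assertion can be pulled out mechanically. The following is only a minimal sketch: it assumes the assertion line has the same shape as the one quoted above (source path, line number, then `FAILED ceph_assert(...)`), and the names `ASSERT_RE` and `find_failed_asserts` are made up for illustration, not part of any Ceph tooling.

```python
import re

# Assumed line shape, taken from the excerpt in this report:
#   /path/to/File.cc: 442: FAILED ceph_assert(expr)
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED ceph_assert\((?P<expr>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return a list of (source_file, line_number, expression) tuples."""
    hits = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("expr"))
            )
    return hits


if __name__ == "__main__":
    sample = (
        "/builddir/build/BUILD/ceph-17.1.0/src/os/bluestore/AvlAllocator.cc: 442: "
        "FAILED ceph_assert(offset + length <= uint64_t(device_size))"
    )
    for source_file, line_number, expression in find_failed_asserts(sample):
        print(source_file, line_number, expression)
```

In this report the extracted expression is `offset + length <= uint64_t(device_size)`, i.e. the allocator was handed a free extent whose end lies beyond the device size it was initialised with.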


Version-Release number of selected component (if applicable):
17.1.0-0.2.rc1.fc37.x86_64

How reproducible:


Steps to Reproduce:
1. Upgrade a working cluster to the quincy rc1 release.

Actual results:
OSDs crash on start.

Expected results:
OSDs start and work.

Additional info:
My cluster has 3 control nodes running rawhide (mons, mgrs, mds) and 1 physical server with 6 HDDs running 6 OSDs (also rawhide).
I'm using CephFS and RGW.

Comment 1 Kaleb KEITHLEY 2022-03-16 11:34:36 UTC
Try the latest build ceph-17.1.0-0.3.28.g1b309fef.fc37 at https://koji.fedoraproject.org/koji/buildinfo?buildID=1934049. I believe we are waiting for a compose before you can just dnf update. 

Or the scratch build of ceph-17.1.0-0.4.31.g1ccf6db7 at https://koji.fedoraproject.org/koji/taskinfo?taskID=84236387

Comment 2 Kaleb KEITHLEY 2022-03-16 13:35:23 UTC
 https://github.com/ceph/ceph/pull/45342

Comment 3 Fedora Update System 2022-03-18 04:31:36 UTC
FEDORA-2022-5ca7aa480b has been submitted as an update to Fedora 37. https://bodhi.fedoraproject.org/updates/FEDORA-2022-5ca7aa480b

Comment 4 Fedora Update System 2022-03-18 04:38:21 UTC
FEDORA-2022-5ca7aa480b has been pushed to the Fedora 37 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 5 Tomasz Torcz 2022-03-18 09:56:23 UTC
Actually, the g1ccf6db7 scratch build did not fix the problem; the OSDs are still crashing. I cannot locate this commit in https://github.com/ceph/ceph/commits/quincy , so I do not know whether the PR was merged before g1ccf6db7.
Anyway, the fix is known and is in the upstream repo, so the next release should work for me. I'm going to leave this bug closed.

Comment 6 Kaleb KEITHLEY 2022-03-18 11:26:29 UTC
FYI, the scratch build did not contain the fix. The fix was added (in Patch0020) to the koji build at https://koji.fedoraproject.org/koji/buildinfo?buildID=1935306.

The fix is commit bf57e1631607dfb8446e9a2061a855c6cab4c09b in the quincy branch.
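
As a rough aid for telling which commit a snapshot build was cut from, the abbreviated hash can be read out of the release string. This is a sketch under one assumption: that the `.g<hash>` component seen in the build names in this report (for example `g1ccf6db7`) is an abbreviated git commit hash. The helper names are hypothetical.

```python
import re

# Assumption: snapshot builds embed the commit as ".g<abbreviated-hash>",
# as in "ceph-17.1.0-0.4.31.g1ccf6db7.fc37".
SNAPSHOT_RE = re.compile(r"\.g(?P<hash>[0-9a-f]{7,40})(?:\.|$)")


def snapshot_hash(build_name):
    """Return the abbreviated commit hash in a build name, or None."""
    match = SNAPSHOT_RE.search(build_name)
    return match.group("hash") if match else None


def is_same_commit(abbrev, full_hash):
    """True only if the abbreviated hash is a prefix of the full hash."""
    return abbrev is not None and full_hash.startswith(abbrev)


if __name__ == "__main__":
    fix = "bf57e1631607dfb8446e9a2061a855c6cab4c09b"
    for name in (
        "ceph-17.1.0-0.3.28.g1b309fef.fc37",
        "ceph-17.1.0-0.4.31.g1ccf6db7.fc37",
        "17.1.0-0.2.rc1.fc37.x86_64",
    ):
        abbrev = snapshot_hash(name)
        print(name, abbrev, is_same_commit(abbrev, fix))
```

A prefix match only says the build was cut exactly at that commit. It says nothing about whether the fix is an ancestor of the snapshot, and a fix carried as a downstream patch (as with Patch0020 here) will not show up in the hash at all. Answering the ancestry question needs a git checkout, for instance with `git merge-base --is-ancestor`.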

