Bug 1762199
Summary: | RAID6 Grow causes md to crash in aarch64 AMD Seattle
---|---
Product: | [Fedora] Fedora
Component: | mdadm
Version: | 30
Hardware: | aarch64
OS: | Linux
Status: | CLOSED RAWHIDE
Severity: | unspecified
Priority: | unspecified
Reporter: | Sahaj Sarup <sahajsarup>
Assignee: | Jes Sorensen <jes.sorensen>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | agk, dledford, gary.hook, jes.sorensen, msalter, pbrobinson, thomas.lendacky, xni
Target Milestone: | ---
Doc Type: | If docs needed, set a value
Type: | Bug
Bug Blocks: | 245418
Last Closed: | 2020-01-09 17:04:36 UTC
Description  Sahaj Sarup  2019-10-16 07:58:43 UTC
Created attachment 1626328 [details]
dump
Can you have a look at this, or advise who would be best to look at it? I'm concerned there is a possible data-loss issue with an mdraid RAID-6 crash on aarch64.

I'll see if I can reproduce with the 5 puny half-TB drives I have immediate access to.

Bit of an update on the situation that might help. (Also, this is just FYI, not asking for data recovery help.) I was able to continue the reshape process in an x86 box and ended up adding the 5th drive as an active RAID disk. At first all seemed fine, until I ran fsck on the md0 device and it turned out the entire ext4 filesystem on md0 had been corrupted pretty badly. Now, whether that happened because I tried to recover the RAID numerous times instead of just plugging the drives into another system, or whether it was caused by the mdraid crash, I'm not sure. The fsck output was an endless list of:

    inode <big number in billions> seem to contain garbage. Clear?

Thanks for the update. I reproduced this on my Seattle running the 5.3.5-200.fc30 kernel. I'll keep digging at it.

So the backtrace really tells the story here. There's a NULL-pointer dereference in __list_del_entry_valid(). The caller, ccp_tx_submit(), has this:

```c
static dma_cookie_t ccp_tx_submit(struct dma_async_tx_descriptor *tx_desc)
{
	...
	list_del(&desc->entry);
	list_add_tail(&desc->entry, &chan->pending);
	...
}
```

Apparently, given the right set of circumstances, desc->entry is uninitialized, leading to the NULL-pointer dereference.
This makes the problem go away:

```diff
index a54f9367a580..0770a83bf1a5 100644
--- a/drivers/crypto/ccp/ccp-dmaengine.c
+++ b/drivers/crypto/ccp/ccp-dmaengine.c
@@ -342,6 +342,7 @@ static struct ccp_dma_desc *ccp_alloc_dma_desc(struct ccp_dma_chan *chan,
 	desc->tx_desc.flags = flags;
 	desc->tx_desc.tx_submit = ccp_tx_submit;
 	desc->ccp = chan->ccp;
+	INIT_LIST_HEAD(&desc->entry);
 	INIT_LIST_HEAD(&desc->pending);
 	INIT_LIST_HEAD(&desc->active);
 	desc->status = DMA_IN_PROGRESS;
```

Most code paths end up adding ccp_dma_desc->entry to another list, which implicitly initializes it and avoids the problem. I'm still trying to understand the bad path to make sure it is valid. Once I'm satisfied, I'll post a patch upstream.

BTW, ccp is an AMD crypto chip with a DMA offload engine. I'm surprised that this hasn't been seen before on an AMD x86 box.

We think that the precipitating condition may be that the source or destination length of a transfer is 0. If so, then the code in ccp_create_desc() would kick out of the loop without adding the new descriptor to a list, and entry would be uninitialized. We also think that initializing the list head is the "right" solution. Not having seen a failure like this, I'd be interested in a test of an alternative patch, applied without your patch above:

crypto: ccp - Verify length parameters before creating a descriptor

The source and destination lengths should be vetted early. If either is zero, the transfer is a no-op, so return.
Signed-off-by: Gary R Hook <gary.hook>

```diff
diff --git a/drivers/crypto/ccp/ccp-dmaengine.c b/drivers/crypto/ccp/ccp-dmaengine.c
index 0770a83bf1a5..4669c5ac3c81 100644
--- a/drivers/crypto/ccp/ccp-dmaengine.c
+++ b/drivers/crypto/ccp/ccp-dmaengine.c
@@ -376,16 +376,20 @@ static struct ccp_dma_desc *ccp_create_desc(struct dma_chan *dma_chan,
 	if (!dst_nents || !src_nents)
 		return NULL;
 
+	src_len = sg_dma_len(src_sg);
+	if (!src_len)
+		return NULL;
+
+	dst_len = sg_dma_len(dst_sg);
+	if (!dst_len)
+		return NULL;
+
 	desc = ccp_alloc_dma_desc(chan, flags);
 	if (!desc)
 		return NULL;
 
 	total_len = 0;
-
-	src_len = sg_dma_len(src_sg);
 	src_offset = 0;
-
-	dst_len = sg_dma_len(dst_sg);
 	dst_offset = 0;
 
 	while (true) {
```

Yes, that patch (without mine) also makes the problem go away.

Good to hear; thanks! Yours is the official fix, as I stated above, but we're glad to understand the corner case better. Would you like us to handle the backport to 8.2? We should try to get this in, I think.

Sure thing: accepted in cryptodev-2.6: 0d3c6781d8d8

I am in the process of getting this into 8.2 (as part of a larger patchset). Do you want to assign this BZ to me, or use it to track a KB item?

Gah! Ignore comment 10; wrong BZ.

The fix has landed in Linus' tree: commit 691505a803a7