Bug 118165
Summary: (NET B44) (4G/4G?) Hangs on bootup - b44 (Broadcom) network driver cannot deal with > 1Gb ram
Product: [Fedora] Fedora
Component: kernel
Version: rawhide
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Craig Cruden <cacruden>
Assignee: John W. Linville <linville>
CC: alan, bammi, barryn, bertil, bkc, davej, hugh, marc_schwartz, mingo, nesnegroj, pp, wtogami
Doc Type: Bug Fix
Bug Blocks: 114961, 123268, 136451
Last Closed: 2005-09-07 01:23:55 UTC

Description
Craig Cruden
2004-03-12 18:49:18 UTC
This was a problem in 2.6.1-1.65 as well. This is only ipv4, since ipv6 seems to work.

Craig, I have a suspicion. You normally use i686 kernels, right? Try the latest i586 rawhide kernel and see if the behavior is any different.

I tried the .i586 kernel (.315) and the b44 loads and I am able to connect to the outside world from that computer. BTW: I never did have the problem with .65 kernel. Anyways, I am guessing that this means it is a simple fix to make :p

Okay, just as I suspected, 4G/4G enabled in all kernels after .118 may be interfering with your network driver. i586 has 4G/4G disabled while i686 has it enabled. You can further confirm this by using the .315 sources and the standard i686 config file, but disabling only 4G/4G and rebuilding it. I suspect it will work then too. Did you try .315 i686 though? A major 4G/4G patch went in that may make a difference. Please report your results.

warren: there's also differences acpi wise.... not full conclusion as of yet...

Hence the question mark and suggestion of further more specific testing.

Before installing the .i586 kernel, I had installed the .i686 kernel and did a "modprobe b44" to test it before adding it back to the modprobe.conf. It hung the machine in about the time it took to hit the enter key the fourth time (so I did not try it at boot time). I did the same thing during the testing of the .i586 and "modprobe b44" did not hang the machine so I added it in for boot and it worked. I dare not add it into boot time right now since I do not have a recovery disk with me at this time :p I am working on compiling a custom kernel now....

Compiled a custom kernel with the 4G/4G turned off. The b44 driver seems to work with that option checked off. Rebooted with stock .i686 kernel and it hangs....

------------------------------------------
config differences

***************
*** 91,101 ****
  CONFIG_X86_GOOD_APIC=y
  CONFIG_X86_INTEL_USERCOPY=y
  CONFIG_X86_USE_PPRO_CHECKSUM=y
! CONFIG_X86_4G=y
! CONFIG_X86_SWITCH_PAGETABLES=y
! CONFIG_X86_4G_VM_LAYOUT=y
! CONFIG_X86_UACCESS_INDIRECT=y
! CONFIG_X86_HIGH_ENTRY=y
  CONFIG_HPET_TIMER=y
  CONFIG_HPET_EMULATE_RTC=y
  # CONFIG_SMP is not set
--- 91,101 ----
  CONFIG_X86_GOOD_APIC=y
  CONFIG_X86_INTEL_USERCOPY=y
  CONFIG_X86_USE_PPRO_CHECKSUM=y
! # CONFIG_X86_4G is not set
! # CONFIG_X86_SWITCH_PAGETABLES is not set
! # CONFIG_X86_4G_VM_LAYOUT is not set
! # CONFIG_X86_UACCESS_INDIRECT is not set
! # CONFIG_X86_HIGH_ENTRY is not set
  CONFIG_HPET_TIMER=y
  CONFIG_HPET_EMULATE_RTC=y
  # CONFIG_SMP is not set

http://people.redhat.com/wtogami/temp/
Just a FYI. If other people want to test i686 kernel-2.6.5-1.322, I have made test i686 RPMS at the above URL for your convenience. The only difference in configuration is the disabled 4G/4G memory split.

Double checked with the precompiled kernels:
kernel-2.6.5-1.322
kernel-2.6.5-1.322.disabled4G
and I have the same result of:
kernel-2.6.5-1.322 - hanging
kernel-2.6.5-1.322.disabled4G - working

Works with 4G/4G (1.305) for me (ASUS A7V8X, 1GB of memory)... One thing that would be interesting to know is whether bcm4400 (the broadcom driver) works. There's one other known problem (shown as timeout waiting for bit xxx when loading the driver), that seems to only be a "problem" for dell inspiron users. On my A7V8X it only gets triggered when I load the broadcom driver and then load b44 afterwards (and is fixed by adding a return; at a certain point in the init code of bcm4400 or a full powerdown :-) ) There are certainly some hardware quirks in the chip that need to be worked around, and how they manifest themselves might be related to how the OEM has wired the chip...

Craig, how much RAM does your system have? One side-effect of 4G/4G is that it enables much more RAM to be used for 'lowmem' - which is the place where network buffers (skbs) go. The 3:1 kernel has a lowmem range of 0-960MB, while the 4:4 one can have up to 3GB of lowmem RAM. If the card hardware has a bug in that it can only do DMA up to say 2 GB (or 1 GB) then such a bug would only trigger under the 4:4 kernel.

It is a Dell Inspiron 8500 - memory at maximum (2GB).

FYI, I am getting the same problems on an Inspiron 9100 with 1 GB memory (BTW: I am using the 2.6.5-1.327 kernel).

Random possible connection - both b44 and acenic don't use the ethtool hooks. Both b44 and acenic have a problem with 4G/4G. Don't understand why, however.

Does the problem trigger if you boot with mem=512m?

The 351 kernel in http://people.redhat.com/arjanv/2.6 has b44 using the ethtool ops infrastructure....

Little more testing. I installed .352 kernel and booted. It hung. I then added mem=512 to the 352 kernel boot parameters.... It worked.

Latest development kernel (.351) solved the problem for me (Inspiron 9100, 1GB mem) even without the mem=512 option...

So, I am assuming that the closing of this issue means that you believe there is NO solution to this problem for the Dell Inspiron 8500?

So, I am assuming that the closing of this issue means that you believe there is NO solution to this problem for the Dell Inspiron 8500 - with 2GB of memory?

More info -- tested again to make sure that I was not mistaken. Testing with kernel options of:
mem=512m - worked
mem=960m - worked
mem=1024m - worked
mem=1536m - failed - hung during bootup
<omitted> - failed - hung during bootup

Reopen - pending comment.

It was closed because it was assumed to be fixed by the 351 kernel since our internal bugs and Jochen's problem went away. Clearly that wasn't a correct assumption. Does acpi=off help, and can you attach an lspci -vxx from a working one?

If the boundary is exactly 1Gb, can we rule out someone forgetting to wire up the highest 2 address lines?

Took away the mem option, put acpi=off and acpi=off lspci -vxx, both of which hung at b44. I am not familiar with the lspci -vxx option. Do I attach it when I have mem=1024m and let it properly boot, then hunt around for a file.... or has it already generated a file which I have not found?
"if the boundary is exactly 1Gb, can we rule out someone forgetting to wire up the highest 2 address lines ?"
Don't know anything about the quote above...

lspci -vxx is a command you run at the shell prompt. You can redirect its output to a text file and attach that file here.
lspci -vxx > output.txt

Created attachment 100095 [details]
nomemnob44 - no mem, no acpi=off, no b44 loaded in modprobe
Did not know what/why contents differ but included some from several different
boots:
nomemnob44 - no mem, no acpi=off, no b44 loaded in modprobe
mem1024 - mem=1024m
mem1024acpioff - mem=1024m acpi=off
Created attachment 100096 [details]
mem=1024m option
Created attachment 100097 [details]
mem=1024m acpi=off
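The mem= test results reported above fit the hypothesis raised earlier in the thread, namely that the chip cannot DMA to buffers above some physical address. The following is a minimal user-space sketch of that reasoning in Python, not kernel code. The 1 GB limit is an assumption taken from the discussion, the lowmem sizes are the approximate figures quoted in the thread, and the function names are made up for illustration.

```python
MIB = 1 << 20
GIB = 1 << 30

# Assumption taken from the thread: the chip can only DMA below 1 GB.
ASSUMED_DMA_LIMIT = 1 * GIB

# Lowmem sizes as quoted in the thread (approximate figures).
LOWMEM_3_1 = 960 * MIB   # 3:1 split kernel
LOWMEM_4_4 = 3 * GIB     # 4G/4G split kernel


def parse_mem_option(value):
    """Convert a mem= style size such as '512m' or '2g' into bytes."""
    units = {"k": 1 << 10, "m": MIB, "g": GIB}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def buffers_can_exceed_limit(ram_bytes, lowmem_bytes, dma_limit=ASSUMED_DMA_LIMIT):
    """Return True if a lowmem network buffer could sit above the DMA limit."""
    # Network buffers live in lowmem, so only RAM that is both present
    # and inside the lowmem range matters here.
    top_of_lowmem = min(ram_bytes, lowmem_bytes)
    return top_of_lowmem > dma_limit


if __name__ == "__main__":
    for option in ("512m", "960m", "1024m", "1536m"):
        risky = buffers_can_exceed_limit(parse_mem_option(option), LOWMEM_4_4)
        print(f"mem={option}: {'may hang' if risky else 'expected to work'}")
```

Under these assumptions the model flags only mem=1536m as risky on a 4G/4G kernel, and it flags nothing on a 3:1 kernel even with 2 GB installed, which is consistent with the reports so far.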
Arjan, I see one thing suspicious there. The pci spaces seem to be hard up against the end of memory - but only for the ones Linux assigned. I wonder if we are getting memory and other resources overlapping due to a PCI resource handling bug or E820 data funnies?

I have an A7V8X motherboard with 1.5 GB memory and I have the same problem -> my system freezes with ifup eth1 (eth0 3com, eth1 bcm44).

-sigh- I'll try to borrow some memory for my A7V8X to hunt this down. Could you confirm the situation is the same with the latest kernel from updates-testing or http://people.redhat.com/arjanv/ and whether mem=1024m changes the situation.

We fixed a couple of 4/4 bugs lately; these fixes should be in Arjan's .391 kernel: http://people.redhat.com/arjanv/2.6/RPMS.kernel/kernel-2.6.6-1.391.i686.rpm Could you try this kernel? Pekka, if this kernel shows the problem too then indeed it would be nice if you could check it out - we've run out of ideas. There are no other known 4:4 related bugs or weirdnesses pending, other than this one.

Tried the "391" kernel and the same problem persists, i.e. when mem=1024m is not added to the kernel options the system freezes on b44; when mem=1024m is added it boots.

I managed to borrow some extra memory (1.25G total) and was able to reproduce the bug with 1.397. Debugging time, the night is still young... :-) Initial results show that the broadcom bcm4400 driver is affected as well and pci_set_dma_mask(pdev, (u64) 0x3fffffff); did not fix the problem as it might have. Will continue poking around.

Close your eyes and find a barf-bag... The following patch (a bit over-kill; GFP_DMA for the rx skbs, illegal_highdma() == 1 and the pci dma masks should be enough) makes the chip receive fine. Transmitting still breaks after a short while, I assume once it hits an skb that is located above 1GB. So infrastructure changes are needed if this is to ever work, if I understood correctly.

Created attachment 100723 [details]
Bring out the barf-bags
Yuck yuck 8) You might want to steal the logic from the old ISA bus drivers like lance. Those use fixed rings for RX and bounce buffer tx sk_buffs.

Here's a "works-for-me-but-I'm-still-not-sure-I-want-to-admit-writing-this-patch" patch, which seems to be the best that can be done with just driver changes (I've posted this to netdev/l-k too)

Created attachment 100897 [details]
The barf-bags strike back!
Hello, I experience the same problem with my BCM4401 on my Asus Pundit, but I only have 512 MB of RAM. Is it the same problem or is it another bug? I use kernel 2.6.6 compiled with 4kstacks off. Regards, Ludovic C.

Sounds like a different bug. If it's a self-compiled kernel, bugzilla.redhat.com is the wrong place (bugme.osdl.org would be more appropriate). If you file a bug there, please include details such as whether there is some combination that does work (2.4, the bcm4400 driver from broadcom, some earlier 2.6 version/vanilla 2.6.7 release candidates, acpi=off?) and notify me of the bug so I can start tracking it there and hopefully be able to reproduce it, in which case there is some hope that I might manage to fix it as well :-)

Created attachment 100988 [details]
Return of the barf-bags
This is the "final" version of the patch, which I've submitted to netdev.
I have a Dell Dimension 2400 with a bcm4401 (rev 1) ethernet adapter. This machine has 256 meg of ram. I just installed stock fedora core 2 from iso. On startup, modprobe b44 works. But initializing eth0 results in many lines of
BUG! Timeout waiting for bit 80000000 of register 428 to clear.
ifconfig eth0 shows no packets sent or received. This machine was running RedHat 9 (stock kernel); I did an upgrade install to Fedora Core 2 and changed "alias eth0 bcm4400" to "alias eth0 b44". The bcm4400 driver worked fine under RedHat 9. Shall I try the patch above? I guess I'll need to install kernel source from CD. If someone has a binary they'd care to share, it only has to work long enough for me to get on the 'net.

Different bug, was fixed a few weeks ago in 2.6 (should be in the fc2 kernel update errata candidate too). The workaround is to not run any drivers from broadcom (either the windows or the linux one) before b44, as they flip the magic bit that makes b44 not work (by powering down the PHY on shutdown). A full power-down (physically removing the cable is the surest) should put the chip into a sane state; just pressing the reset button isn't enough.

I see this also in the latest released FC2 kernel 2.6.6-1.435.2.3 on an Inspiron 8500 with the BCM 4401 100Base-T 10/100 ethernet card. Either using b44.ko from FC2 or the bcm4401 driver from Broadcom, my system freezes when I do ifup. The mem=1024m option solves the problem on my 2G RAM system. Now, I really would like to be able to get my full RAM back AND still be connected!

As a temporary workaround, grab http://www.ee.oulu.fi/~pp/b44-4g4g.tgz, untar, compile (there's a script, too lazy to figure out how to make "make" do the right thing vs. having to give it arguments). Then just replace your b44.ko with the one it creates and you should be a-ok. This is the same patch as is attached to this bug, this is just in a more user-friendly format. I briefly looked at creating a src.rpm but that looked _WAY_ too tricky :-)

Getting the patch included (which would happen if this got merged upstream) is a bit tricky. The workaround isn't necessary with a non-fedora kernel since it only gets triggered with a 4:4 layout (possibly something less exotic like 2:2 too, but that'd be a non-standard config too), and patches for either are unlikely to make it to 2.6. Well ok, corner-cases where this fix or something similar would be required even on a vanilla kernel (x86-64's with a bcm4401) could be imagined but in reality there's really only a few types of x86 motherboards/laptops with the chip out there. Anyway, the fix is pretty ugly, which is why I'm somewhat reluctant to push it too much, but there's not really much that can be done about that due to the nature of the (apparent hardware) bug. Option B is making this a fedora-specific patch, but that'll increase the maintenance load of the already-overworked RH kernel people, so I can see why they're reluctant to add it either.

> Getting the patch included (which would happen if this got merged
> upstream) is a bit tricky. The workaround isn't
> necessary with a non-fedora kernel since it only gets triggered with a
> 4:4 layout (possibly something less exotic like 2:2 too, but that'd
> be a non-standard config too), and patches for either are unlikely
> to make it to 2.6.
"I plan to merge the 4g split immediately after 2.7 forks." So says Andrew Morton, here: http://www.uwsg.iu.edu/hypermail/linux/kernel/0402.3/1351.html And here's a post by Andrew Morton explaining the logic behind doing it that way: http://www.uwsg.iu.edu/hypermail/linux/kernel/0308.0/0365.html So, unless Andrew Morton has changed his mind recently and I haven't noticed, merging of 4:4 into the mainline kernel is almost a certainty.

On second thought, maybe "almost a certainty" is too strong a phrase, but it's still quite likely to happen unless I've missed something recent...

Just a confirmation that Pekka's fix in #49 above works. This is on a new Dell Inspiron 5150 laptop with a 3.2 GHz P4 with 2 GB RAM running FC2 with the 2.6.6-1.435.2.3 kernel. The internal NIC is a BCM4401 100Base-T (rev 01). Yeah! I can now put away my PCMCIA NIC. Prior to this I would get a hard lock up when trying to bring up the internal NIC. What can I send you for Christmas, Pekka? :-) Thanks!

Pekka's fix (see #49 above) works just fine. How can something that works be ugly? Something that fixes this problem has to be released in the next kernel! Or do I have to switch to another distribution? When something that works is seen as ugly, I always fault the theory for being incomplete when it doesn't take reality into account!

> When something that works is seen as ugly, I always fault the theory
> from being incomplete when it doesn't take reality into account !
If ugly hardware necessitates ugly code, that doesn't magically make
the code not-ugly...

Could you try http://www.ee.oulu.fi/~pp/b44-095.tgz ? That one includes a cleaned up version and bcm47xx support which someone just submitted (you'll need to add a
#define PCI_DEVICE_ID_BCM4713 0x4713
I know, that goes in include/linux/pci_ids.h in principle :-) ). If it works in the > 1GB of ram case (had to return the memory I borrowed so can't verify easily myself :-( ) I'll go into patch-bomb mode until it goes into the mainstream kernel.

Pekka, I am getting an Access Forbidden error message when trying to get the new file. Can you verify the URL? Thanks.

yikes, fixed :-)

Pekka, it is working here with the #define change made. Same system as defined in comment #52. Same kernel as well. That's one... :-) Thanks!

One more data point. The latest FC2 kernel hit my mirror (2.6.7-1.494.2.2). I have installed it and rebuilt the b44 driver. It works.

Works fine on my Inspiron 8500 w/ 2 GByte memory and running linux-2.6.7-1.494.2.2. Thanks!

Just updated to 2.6.8-1.520 on FC2. The patch does not appear to work now. Trying to bring up the NIC now results in X locking up. I can get out of X back to a console. There were no errors during the compilation and nothing reported when using ifup eth0 in a console. Anyone else try this with the new kernel? I am temporarily back to 2.6.7-1.494.2.2 for the moment. Thanks.

Works for me (tm) but this is with 1GB of memory only (but with things modified so the workaround code gets used for every packet :-) ). Suppose I'll have to find some extra memory again to try things out...

Could you try http://www.ee.oulu.fi/~pp/b44-095-2.tgz with and without mem=1024m. Other things to try: on line 644 change if(mapping+RX_PKT_BUF_SZ > ..) -> if(1 || ...) and on line 939 change if (mapping+len > B44_DMA_MASK) -> if( 1 || ...) to make it always use memory under 16 MB. Current logic _should_ be fine though.

Pekka, the updated file works, both with and without mem=1024m. So...back to 2.6.8-1.520. Thanks!
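
The two checks mentioned above (mapping+len > B44_DMA_MASK, and the suggested "1 ||" override) can be modelled in a few lines. This is a hedged user-space sketch in Python, not the driver code. B44_DMA_MASK is given the 0x3fffffff value used earlier in the thread, and needs_bounce, pick_dma_address, low_buffer_addr and the force flag are hypothetical names introduced only for illustration.

```python
B44_DMA_MASK = 0x3FFFFFFF  # 1 GB - 1, the mask value used earlier in the thread


def needs_bounce(mapping, length, force=False):
    """Mirror the quoted check: bounce if the buffer reaches past the mask.

    force=True models the suggested 'if (1 || ...)' debugging change,
    which sends every packet through the workaround path.
    """
    return force or (mapping + length > B44_DMA_MASK)


def pick_dma_address(mapping, length, low_buffer_addr, force=False):
    """Return the bus address the chip would be handed for this packet.

    low_buffer_addr stands for a hypothetical pre-allocated buffer below
    the mask that the packet data would be copied into when bouncing.
    """
    if needs_bounce(mapping, length, force):
        return low_buffer_addr
    return mapping


if __name__ == "__main__":
    low = 0x00100000  # illustrative address well below the mask
    print(hex(pick_dma_address(0x20000000, 1500, low)))              # below 1 GB
    print(hex(pick_dma_address(0x50000000, 1500, low)))              # above 1 GB
    print(hex(pick_dma_address(0x20000000, 1500, low, force=True)))  # forced
```

Forcing the bounce path for every packet is a way to exercise the workaround on a machine that has no memory above 1 GB.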

Whew, 095-2 is what was committed upstream (drivers touching skb->data is a no-no apparently :-) ). Supposedly the fix might go in for 2.6.9, remains to be seen...

Just a status update, the fix is in 2.6.9-rc2-mm2 (bk-netdev.patch), hopefully will propagate to the Linus (and thus Fedora) tree soon.

Didn't make it into Linus's tree in time for FC3 to get it that way, but hopefully the fix can still be included (davej added to Cc: list since otherwise this would probably get missed). I've attached yet-another-patch; it's the patches related to b44.c and b44.h (and nothing else) from bk-netdev.patch in 2.6.9-rc4-mm1. Also a standalone version of the same in http://www.ee.oulu.fi/~pp/b44-bk.tgz for those who just want a quick fix. Patches go into 2.6.9-1.639 just fine. No real changes wrt. the previous one, it's just the version that will end up upstream eventually. The one known issue that this patch creates is that loading the module long after booting might fail if GFP_DMA is used by other stuff. This is due to x86 pci_alloc_consistent() limitations: it uses GFP_DMA if the mask is anything other than 4GB... We load network modules early on boot tho, so this shouldn't be a big problem.

Created attachment 105554 [details]
b44 update from -mm tree
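
The allocation limitation described in the comment above (anything other than a full 4GB mask ends up using GFP_DMA) can be illustrated with a small model. This is a rough Python sketch of the behaviour as described in this thread, not the actual x86 allocator code. The 16 MB figure for the GFP_DMA zone and all function names are assumptions made for illustration.

```python
MIB = 1 << 20

FULL_32BIT_MASK = 0xFFFFFFFF
ZONE_DMA_TOP = 16 * MIB  # assumed size of the low zone served by GFP_DMA


def allocation_ceiling(dma_mask):
    """Highest address (exclusive) the modelled allocator will hand out.

    As described in the thread: any mask below the full 4 GB mask falls
    back to the small GFP_DMA zone instead of using the mask itself.
    """
    if dma_mask < FULL_32BIT_MASK:
        return ZONE_DMA_TOP
    return FULL_32BIT_MASK + 1


def wasted_headroom(dma_mask):
    """Bytes the device could address but the modelled allocator never uses."""
    return (dma_mask + 1) - allocation_ceiling(dma_mask)


if __name__ == "__main__":
    b44_mask = 0x3FFFFFFF
    print(allocation_ceiling(b44_mask) // MIB, "MB ceiling for the b44 mask")
    print(wasted_headroom(b44_mask) // MIB, "MB addressable but unused")
```

In this model a device with a 1 GB mask competes for the same small zone as everything else that asks for GFP_DMA memory, which is why a late module load can fail even with plenty of free RAM below 1 GB.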

One additional data point: applicable to Shuttle XPC systems with FT61 motherboards. They have the Broadcom chip. Thanks to Pekka for the patch. Applied to stock Fedora FC3 kernel-sources with no problems.

fixed in cvs, will be in next build.

When using the standalone package http://www.ee.oulu.fi/~pp/b44-bk.tgz with linux-2.6.9-1.6_FC2 I get the following error:
# ifup /etc/sysconfig/network-scripts/ifcfg-eth0
Determining IP information for eth0...
SIOCSIFFLAGS: Cannot allocate memory
SIOCSIFFLAGS: Cannot allocate memory
This happens when I have loaded all my usual applications and am using up my RAM of 2 GByte. With no applications, after having rebooted, b44 works fine.

See last paragraph of comment #66. The problem is that the driver needs about 750k of memory that has to be located under 1GB physically to not trigger the hardware bug that causes crashes and other fun. The driver tries to allocate that kind of memory (pci_set_consistent_dma_mask(pdev, 0x3fffffff) ). There should be plenty, right? Unfortunately the way it's implemented right now in the generic x86 pci code is that if you ask for some memory with a dma mask of < 4GB, it falls back to giving you memory from the first 16MB. Now that's a pretty limited resource :-(. There seem to be 3 drivers that need similar workarounds (wanxl, aacraid and b44).

I just installed the latest test kernel 2.6.9-1.698_FC3 on the system referenced in comment #52 and my b44 is working fine without having to utilize the external patch. Between the now included patch and the 4g/4g fix, it looks like we might be able to put this one to bed. :-) Thanks to everyone who has spent time on the patches and fixes! Marc

I'm running FC3 w/ kernel-2.6.10-1.737 and I'm getting the "SIOCSIFFLAGS: Cannot allocate memory" error when trying to activate "eth0" (i.e. the b44 driver). Here's what I'm doing:
in the morning, boot up:
eth0 (the b44) is active on boot
eth1 (ipw2100, wireless) NOT active
everything is fine
in the evening, go home (NO poweroff/reboot):
% ifdown eth0 (to get route table default route cleared, etc)
% ifup eth1 (to use my wireless access at home)
still everything ok, UNTIL I come back to work in the morning (NO poweroff/reboot):
% ifdown eth1 (to clear wireless route stuff)
% ifup eth0 <-- HERE's where I get the "SIOCSIFFLAGS: Cannot allocate memory" error
Any suggestions? thanks, russ

What about if you do a "modprobe -r b44" in the evening:
% ifdown eth0
% modprobe -r b44
% ifup eth1
Does that change anything?

thanks for the response, John, that actually worked for a quick test. I did notice that the "mii" module has a dependency on the b44 module, perhaps that was introducing some problems? I'll try it again over a longer period on monday and see how it works and report back here. thanks again.

Well, sadly it just happened again. Here's the deal this time:
Beginning of weekend:
% ifdown eth0
% modprobe -r b44
% ifup eth1
This morning:
% ifup eth0 <-- "SIOCSIFFLAGS: Cannot allocate memory" error again..
It was working in short timeframes, but does unloading over long periods of time (workday > 7 hours?) change conditions in the kernel sufficiently that it can't get reloaded? btw, this behavior has been consistent even over the last handful of updated kernels for FC3 over the last 2 weeks or so.

See #66 and #71 :-) This is being tracked as #145109 and on netdev. There's an untested patch at http://www.ee.oulu.fi/~pp/b44hack I no longer have the hardware so testing it is a bit hard :-)

If I understand the kernel's Documentation/DMA-mapping.txt (and I may not), would it not be simpler to allocate buffers for the device using a suitable dma mask? (I'm visiting this bug entry because the Broadcom BCM94306 802.11g WiFi chip seems to have a similar problem with 1G+ physical addresses. Things are made more interesting because the driver is a 64-bit MS Windows driver + ndiswrapper64.)

I see that you answered your own question in bug 145109 comment 28...posting here so that future searchers find the same answer... :-)

Just triaging some of the bugs I've commented on. This definitely is fixed upstream and in all possible relevant errata kernels these days, even the #145109 spin-off bug is, so I'll close the bug. If there's anything new that's broken in b44, please file a separate bug.

And someone recently broke ACPI, so try acpi=off if it breaks for you even now!

And even that should be fixed now! This driver is officially bug-free (tm)!