Bug 201796
| Summary: | [x86-32/PAE] loading modular netbk causes panic | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Matt C <wago> |
| Component: | xen | Assignee: | Herbert Xu <herbert.xu> |
| Status: | CLOSED DUPLICATE | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5 | CC: | bstein, katzj, nobody+bjmason, wago |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| URL: | https://www.redhat.com/archives/fedora-xen/2006-August/msg00041.html | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2006-08-15 10:14:12 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 150224 | | |
| Attachments: | | | |
Description
Matt C
2006-08-08 21:40:58 UTC
Created attachment 133830 [details]
Output from 'xm info' and 'xm dmesg' on system
This is isolated to the module-based network driver. I've worked around it by building the kernel with both the network and loopback drivers into the monolithic kernel:

CONFIG_XEN_NETDEV_BACKEND=y
CONFIG_XEN_NETDEV_LOOPBACK=y

The non-PAE kernel has these settings already, so it could be that this is just a module/non-module difference instead of PAE. I'll try that permutation next. I'm also going to test the FC6 1.2517 kernel to see if this is still broken upstream.

Okay, so this bug _is_ limited to the combination of PAE and the modular netbk driver. I built the xen kernel without PAE support (but with a modular netbk/netloop), booted with a non-PAE xen hypervisor, and the netbk/netloop modules worked just fine. Everything else remained the same (hardware, etc), the kernel config change was just:

@@ -164,11 +163,10 @@ CONFIG_DELL_RBU=m
 CONFIG_DCDBAS=m
 # CONFIG_NOHIGHMEM is not set
-# CONFIG_HIGHMEM4G is not set
-CONFIG_HIGHMEM64G=y
+CONFIG_HIGHMEM4G=y
+# CONFIG_HIGHMEM64G is not set
 CONFIG_PAGE_OFFSET=0xC0000000
 CONFIG_HIGHMEM=y
-CONFIG_X86_PAE=y
 CONFIG_SELECT_MEMORY_MODEL=y
 CONFIG_FLATMEM_MANUAL=y
 # CONFIG_DISCONTIGMEM_MANUAL is not set

This bug is limited to PAE itself, however (as far as I can tell). I tried a couple permutations of the mem=xxx flag to the xen hypervisor at boot time. These are using the kernel-xen-2.6.17-1.2157 as-shipped xen/PAE kernel:

16GB: broken
4GB: broken
800M: works!

I assume that this is because the 800M test was under the 896M (?) lowmem boundary, and therefore the PAE paging technology was not used.

netbk requires 1M of contiguous physical memory to load. On Linux it is very difficult to get that much contiguous memory once the system has been up and running for a while. Therefore if it is built as a module it must be loaded as early as possible during the boot process, before significant memory fragmentation sets in. The best option is to load it from initrd/initramfs which should be equivalent to building it into the kernel.
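As a side note on why a 1M contiguous request is so much harder to satisfy than a small one, here is a minimal arithmetic sketch. It assumes 4 KiB pages (the usual x86-32 page size) and a buddy-style allocator that hands out power-of-two runs of pages; the function name `buddy_order` is made up for this illustration and is not taken from the kernel or from netbk.

```python
PAGE_SIZE = 4096  # assumed x86-32 page size in bytes


def buddy_order(nbytes, page_size=PAGE_SIZE):
    """Return the smallest order n such that 2**n pages cover nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


# 1M needs 256 physically contiguous pages, i.e. one free order-8 block.
print(buddy_order(1024 * 1024))
# An 8KB request needs only 2 contiguous pages (order 1).
print(buddy_order(8 * 1024))
```

Under these assumptions the module needs a single free run of 256 adjacent pages, which becomes rarer as memory fragments over uptime.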
Thanks for the explanation. I've confirmed that this works around my problem. I was able to use /etc/rc.modules as a simpler method of loading these modules early in the boot process (albeit not as early as the initrd, of course, but it worked).

So, it's still interesting that the presence of PAE tables in the xen hypervisor impacts this. In my test cases above, I found that constraining the RAM available to the xen hypervisor would make this work... even when the dom0 memory footprint remained constant. To state this another way:

"xen.gz dom0_mem=128M mem=4G" fails
"xen.gz dom0_mem=128M mem=800M" works

Also, while rc.modules is a fine workaround for me, it seems like this should be fixed before RHEL5. I can see a lot of problems with this if users have to hack the initrd or init scripts.

The presence of large amounts of high memory causes various kernel data structures to grow to occupy unreasonable amounts of low memory (the lower 896M of your RAM). Since netbk is only able to use low memory, this greatly reduces the likelihood of it finding 1M of contiguous low memory once the system has been running for a while. For RHEL5, my recommendation would be for the system bootup scripts to be organised so that blkbk as well as netbk are loaded as early as possible.

So I'm probably just being dense here, so I apologize... Since I constrained dom0_mem to 128M in both cases, the domain0 linux kernel only had 128MB of lowmem at all times. The only parameter that changed was the amount of memory made available to the hypervisor itself. The memory was never allocated to any running domain. If the netbk driver required 1MB of contig lowmem inside the xen hypervisor, I suppose this would make sense to me. It sounds like it requires 1MB of contig lowmem inside linux, though, correct? Thanks for taking the time to explain this to me :)

Sorry, I missed this interesting observation. This doesn't change the fact that allocating 1MB of contiguous low memory is *not* expected to succeed other than during early boot (people have problems allocating 8KB of memory, let alone 1MB :) It does sound as if the presence of the extra memory in Xen is somehow influencing what memory is available in dom0. Could you please get a memory dump by hitting SHIFT-SCROLLLOCK from the console in dom0? I'd like to see what it says when you boot xen with mem=800M just before you load netbk (the highmem case is already evident from your backtrace at the beginning).

Created attachment 133998 [details]
Complete boot dmesg and 'shift-scrlock' meminfo: xen.gz mem=800M
Created attachment 133999 [details]
Complete boot dmesg and 'shift-scrlock' meminfo: xen.gz mem=16G
Check out the two files that I attached. Each is the complete dmesg and the meminfo dump from two consecutive boots of the system. One is with the full 16GB, the other is with xen.gz mem=800M constraint. I found that there are some interesting differences in the dmesg output when you diff the two, such as:

--- dmesg-16G
+++ dmesg-800M
-Memory: 57356k/139264k available (2090k kernel code, 73564k reserved, 841k data, 172k init, 0k highmem)
+Memory: 121228k/139264k available (2090k kernel code, 9704k reserved, 841k data, 172k init, 0k highmem)

Thanks a lot for the dmesg attachments. This shows that the reason 4G/16G is running out of memory when loading netbk is because of the software IOTLB size. The software IOTLB is used to bounce IO buffers coming from guest domains so that they are contiguous when presented to the hardware. The size of the software IOTLB in dom0 is indeed dependent on the amount of memory in the hypervisor (although its size can be overridden by setting swiotlb=). In particular, for <2G systems it takes 2MB while everyone else gets a 64MB swiotlb. So in your case you really should be assigning at least another 64MB of memory to your dom0 to compensate for the bigger swiotlb. However, this does not change the fact that loading blkbk/netbk after early boot is like playing Russian Roulette :)

Yep, no argument that a late modprobe is dangerous here. Thanks for the further explanation. I guess I'd just stress that it's important to fix this operationally before RHEL5 lands. It could be a hacked mkinitrd script, or an /etc/sysconfig/modules/*.modules file, something like that. We can work around it for our testing, of course. Thanks.

I'm going to merge this with #202182 which is really the same issue (the boot script isn't loading the modules automatically).

*** This bug has been marked as a duplicate of 202182 ***
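For anyone cross-checking the numbers in this thread: the two "Memory:" lines from the dmesg diff can be compared against the swiotlb sizes quoted in the explanation (2MB versus 64MB). The sketch below is only a sanity check on those quoted figures; the helper name `reserved_kb` and the regular expression are assumptions made for this example, and the small remainder is not accounted for by anything stated in the thread.

```python
import re

# The two lines quoted from the attached dmesg output.
DMESG_16G = ("Memory: 57356k/139264k available (2090k kernel code, "
             "73564k reserved, 841k data, 172k init, 0k highmem)")
DMESG_800M = ("Memory: 121228k/139264k available (2090k kernel code, "
              "9704k reserved, 841k data, 172k init, 0k highmem)")


def reserved_kb(line):
    """Extract the 'reserved' figure (in KB) from a kernel 'Memory:' line."""
    match = re.search(r"(\d+)k reserved", line)
    if match is None:
        raise ValueError("no 'reserved' field found in line")
    return int(match.group(1))


# Extra memory reserved in dom0 when the hypervisor sees 16G instead of 800M.
extra_reserved = reserved_kb(DMESG_16G) - reserved_kb(DMESG_800M)

# Difference between the two swiotlb sizes mentioned in the thread, in KB.
swiotlb_delta = (64 - 2) * 1024

print(extra_reserved)                  # 63860
print(swiotlb_delta)                   # 63488
print(extra_reserved - swiotlb_delta)  # 372
```

The extra reserved memory (about 62MB) is within a few hundred KB of the 62MB gap between the two swiotlb sizes, which is consistent with the explanation given, and it is a large share of a dom0 that only has 139264k in total.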