On Sun, 2006-09-10 at 03:41 +0100, Andy Whitcroft wrote:
> keith mannthey wrote:
> > Hello,
> > In the current i386 NUMA code, the numa_kva space (the area used to
> > remap node-local data in lowmem) is acquired by adjusting the end of
> > low memory during boot.
> >
> > (from setup_memory)
> > reserve_pages = calculate_numa_remap_pages();
> > (then)
> > system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
> >
> > The problem with this is that initrds can be trampled over (the kva can
> > adjust system_max_low_pfn into the initrd area). This results in the
> > kernel throwing away the initrd and a failed boot. This is a
> > long-standing issue (it has been like this for at least the last few years).
> >
> > This patch keeps the numa kva code from adjusting the end of memory and
> > converts it to just use the reserve_bootmem call to reserve the large
> > amount of space needed for the numa_kva. It is mindful of initrds when
> > present.
> >
> > This patch was built against 2.6.17-rc1 originally but applies and boots
> > against 2.6.17 just fine. I have only tested this against the summit
> > subarch (I don't have other i386 NUMA hardware).
> >
> > all feedback welcome!
> >
> > Signed-off-by: Keith Mannthey <[email protected]>
> >
> >
> > ------------------------------------------------------------------------
> >
> > diff -urN linux-2.6.17/arch/i386/kernel/setup.c linux-2.6.17-work/arch/i386/kernel/setup.c
> > --- linux-2.6.17/arch/i386/kernel/setup.c 2006-06-17 18:49:35.000000000 -0700
> > +++ linux-2.6.17-work/arch/i386/kernel/setup.c 2006-06-20 23:04:37.000000000 -0700
> > @@ -1210,6 +1210,9 @@
> > extern void zone_sizes_init(void);
> > #endif /* !CONFIG_NEED_MULTIPLE_NODES */
> >
> > +#ifdef CONFIG_NUMA
> > +extern void numa_kva_reserve(void);
> > +#endif
> > void __init setup_bootmem_allocator(void)
> > {
> > unsigned long bootmap_size;
> > @@ -1265,7 +1268,9 @@
> > */
> > find_smp_config();
> > #endif
> > -
> > +#ifdef CONFIG_NUMA
> > + numa_kva_reserve();
> > +#endif
> > #ifdef CONFIG_BLK_DEV_INITRD
> > if (LOADER_TYPE && INITRD_START) {
> > if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
> > diff -urN linux-2.6.17/arch/i386/mm/discontig.c linux-2.6.17-work/arch/i386/mm/discontig.c
> > --- linux-2.6.17/arch/i386/mm/discontig.c 2006-06-17 18:49:35.000000000 -0700
> > +++ linux-2.6.17-work/arch/i386/mm/discontig.c 2006-06-20 23:11:49.000000000 -0700
> > @@ -118,7 +118,8 @@
> >
> > void *node_remap_end_vaddr[MAX_NUMNODES];
> > void *node_remap_alloc_vaddr[MAX_NUMNODES];
> > -
> > +static unsigned long kva_start_pfn;
> > +static unsigned long kva_pages;
> > /*
> > * FLAT - support for basic PC memory model with discontig enabled, essentially
> > * a single node with all available processors in it with a flat
> > @@ -287,7 +288,6 @@
> > {
> > int nid;
> > unsigned long system_start_pfn, system_max_low_pfn;
> > - unsigned long reserve_pages;
> >
> > /*
> > * When mapping a NUMA machine we allocate the node_mem_map arrays
> > @@ -299,14 +299,23 @@
> > find_max_pfn();
> > get_memcfg_numa();
> >
> > - reserve_pages = calculate_numa_remap_pages();
> > + kva_pages = calculate_numa_remap_pages();
> >
> > /* partially used pages are not usable - thus round upwards */
> > system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
> >
> > - system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
> > - printk("reserve_pages = %ld find_max_low_pfn() ~ %ld\n",
> > - reserve_pages, max_low_pfn + reserve_pages);
> > + kva_start_pfn = find_max_low_pfn() - kva_pages;
> > +
> > +#ifdef CONFIG_BLK_DEV_INITRD
> > + /* Numa kva area is below the initrd */
> > + if (LOADER_TYPE && INITRD_START)
> > + kva_start_pfn = PFN_DOWN(INITRD_START) - kva_pages;
> > +#endif
> > + kva_start_pfn -= kva_start_pfn & (PTRS_PER_PTE-1);
> > +
> > + system_max_low_pfn = max_low_pfn = find_max_low_pfn();
> > + printk("kva_start_pfn ~ %ld find_max_low_pfn() ~ %ld\n",
> > + kva_start_pfn, max_low_pfn);
> > printk("max_pfn = %ld\n", max_pfn);
> > #ifdef CONFIG_HIGHMEM
> > highstart_pfn = highend_pfn = max_pfn;
> > @@ -324,7 +333,7 @@
> > (ulong) pfn_to_kaddr(max_low_pfn));
> > for_each_online_node(nid) {
> > node_remap_start_vaddr[nid] = pfn_to_kaddr(
> > - highstart_pfn + node_remap_offset[nid]);
> > + kva_start_pfn + node_remap_offset[nid]);
> > /* Init the node remap allocator */
> > node_remap_end_vaddr[nid] = node_remap_start_vaddr[nid] +
> > (node_remap_size[nid] * PAGE_SIZE);
> > @@ -339,7 +348,6 @@
> > }
> > printk("High memory starts at vaddr %08lx\n",
> > (ulong) pfn_to_kaddr(highstart_pfn));
> > - vmalloc_earlyreserve = reserve_pages * PAGE_SIZE;
> > for_each_online_node(nid)
> > find_max_pfn_node(nid);
> >
> > @@ -349,6 +357,12 @@
> > return max_low_pfn;
> > }
> >
> > +void __init numa_kva_reserve (void)
> > +{
> > + reserve_bootmem(PFN_PHYS(kva_start_pfn),PFN_PHYS(kva_pages));
> > +
> > +}
> > +
> > void __init zone_sizes_init(void)
> > {
> > int nid;
>
> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
> is so that memory that would back that stolen KVA gets pushed out into
> ZONE_HIGHMEM, the boundary between them is moved down. By using
> reserve_bootmem we will mark the pages which are currently backing the
> KVA you are 'reusing' as reserved and prevent their release; we pay
> double for the mem_map.
Perhaps just freeing the reserved pages and remapping them at an
appropriate time could accomplish this? Sorry, I don't know the KVA
"freeing" path; can you describe it a little more? When are these pages
returned to the system? It was my understanding that the KVA pages
were lost (the original way shrinks ZONE_NORMAL and creates a hole
between the zones).
> If the initrd's are falling into this space, can we not allocate some
> bootmem for those and move them out of our way? As filesystem images
> they are essentially location neutral so this should be safe?
AFAIK bootloaders choose where to map initrds. Grub seems to put it around
the top of ZONE_NORMAL, but it is pretty free to map it where it wants. I
suppose INITRD_START, INITRD_END and all that could be dynamic and moved
around a bit, but it seems a little messy. I would rather see the special
case (i386 NUMA, the rare beast it is) jump through a few extra hoops
than muck with the initrd code.
Thanks,
Keith