Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

On Sun, 2006-09-10 at 03:41 +0100, Andy Whitcroft wrote:
> keith mannthey wrote:
> > Hello,
> >   In the current i386 numa code the numa_kva space (the area used to remap
> > node local data in lowmem) is acquired by adjusting the end of low
> > memory during boot. 
> > 
> > (from setup_memory)
> > reserve_pages = calculate_numa_remap_pages();
> > (then)
> > system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
> > 
> > The problem with this is that initrds can be trampled over (the kva can
> > adjust system_max_low_pfn into the initrd area). This results in the kernel
> > throwing away the initrd and a failed boot.  This is a long standing
> > issue. (It has been like this for at least the last few years). 
> > 
> > This patch keeps the numa kva code from adjusting the end of memory and
> > converts it to just use the reserve_bootmem call to reserve the large
> > amount of space needed for the numa_kva. It is mindful of initrds when
> > present. 
> > 
> > This patch was built against 2.6.17-rc1 originally but applies and boots
> > against 2.6.17 just fine.  I have only tested this against the summit
> > subarch (I don't have other i386 numa hw). 
> > 
> > all feedback welcome!
> > 
> > Signed-off-by:  Keith Mannthey <[email protected]>
> > 
> > 
> > ------------------------------------------------------------------------
> > 
> > diff -urN linux-2.6.17/arch/i386/kernel/setup.c linux-2.6.17-work/arch/i386/kernel/setup.c
> > --- linux-2.6.17/arch/i386/kernel/setup.c	2006-06-17 18:49:35.000000000 -0700
> > +++ linux-2.6.17-work/arch/i386/kernel/setup.c	2006-06-20 23:04:37.000000000 -0700
> > @@ -1210,6 +1210,9 @@
> >  extern void zone_sizes_init(void);
> >  #endif /* !CONFIG_NEED_MULTIPLE_NODES */
> >  
> > +#ifdef CONFIG_NUMA
> > +extern void numa_kva_reserve(void);
> > +#endif
> >  void __init setup_bootmem_allocator(void)
> >  {
> >  	unsigned long bootmap_size;
> > @@ -1265,7 +1268,9 @@
> >  	 */
> >  	find_smp_config();
> >  #endif
> > -
> > +#ifdef CONFIG_NUMA
> > +	numa_kva_reserve();
> > +#endif 
> >  #ifdef CONFIG_BLK_DEV_INITRD
> >  	if (LOADER_TYPE && INITRD_START) {
> >  		if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
> > diff -urN linux-2.6.17/arch/i386/mm/discontig.c linux-2.6.17-work/arch/i386/mm/discontig.c
> > --- linux-2.6.17/arch/i386/mm/discontig.c	2006-06-17 18:49:35.000000000 -0700
> > +++ linux-2.6.17-work/arch/i386/mm/discontig.c	2006-06-20 23:11:49.000000000 -0700
> > @@ -118,7 +118,8 @@
> >  
> >  void *node_remap_end_vaddr[MAX_NUMNODES];
> >  void *node_remap_alloc_vaddr[MAX_NUMNODES];
> > -
> > +static unsigned long kva_start_pfn;
> > +static unsigned long kva_pages;
> >  /*
> >   * FLAT - support for basic PC memory model with discontig enabled, essentially
> >   *        a single node with all available processors in it with a flat
> > @@ -287,7 +288,6 @@
> >  {
> >  	int nid;
> >  	unsigned long system_start_pfn, system_max_low_pfn;
> > -	unsigned long reserve_pages;
> >  
> >  	/*
> >  	 * When mapping a NUMA machine we allocate the node_mem_map arrays
> > @@ -299,14 +299,23 @@
> >  	find_max_pfn();
> >  	get_memcfg_numa();
> >  
> > -	reserve_pages = calculate_numa_remap_pages();
> > +	kva_pages = calculate_numa_remap_pages();
> >  
> >  	/* partially used pages are not usable - thus round upwards */
> >  	system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
> >  
> > -	system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
> > -	printk("reserve_pages = %ld find_max_low_pfn() ~ %ld\n",
> > -			reserve_pages, max_low_pfn + reserve_pages);
> > +	kva_start_pfn = find_max_low_pfn() - kva_pages;
> > +
> > +#ifdef CONFIG_BLK_DEV_INITRD
> > +	/* Numa kva area is below the initrd */
> > +	if (LOADER_TYPE && INITRD_START) 
> > +		kva_start_pfn = PFN_DOWN(INITRD_START)  - kva_pages;
> > +#endif 
> > +	kva_start_pfn -= kva_start_pfn & (PTRS_PER_PTE-1);
> > +
> > +	system_max_low_pfn = max_low_pfn = find_max_low_pfn();
> > +	printk("kva_start_pfn ~ %ld find_max_low_pfn() ~ %ld\n", 
> > +		kva_start_pfn, max_low_pfn);
> >  	printk("max_pfn = %ld\n", max_pfn);
> >  #ifdef CONFIG_HIGHMEM
> >  	highstart_pfn = highend_pfn = max_pfn;
> > @@ -324,7 +333,7 @@
> >  			(ulong) pfn_to_kaddr(max_low_pfn));
> >  	for_each_online_node(nid) {
> >  		node_remap_start_vaddr[nid] = pfn_to_kaddr(
> > -				highstart_pfn + node_remap_offset[nid]);
> > +				kva_start_pfn + node_remap_offset[nid]);
> >  		/* Init the node remap allocator */
> >  		node_remap_end_vaddr[nid] = node_remap_start_vaddr[nid] +
> >  			(node_remap_size[nid] * PAGE_SIZE);
> > @@ -339,7 +348,6 @@
> >  	}
> >  	printk("High memory starts at vaddr %08lx\n",
> >  			(ulong) pfn_to_kaddr(highstart_pfn));
> > -	vmalloc_earlyreserve = reserve_pages * PAGE_SIZE;
> >  	for_each_online_node(nid)
> >  		find_max_pfn_node(nid);
> >  
> > @@ -349,6 +357,12 @@
> >  	return max_low_pfn;
> >  }
> >  
> > +void __init numa_kva_reserve (void) 
> > +{
> > +	reserve_bootmem(PFN_PHYS(kva_start_pfn),PFN_PHYS(kva_pages));
> > +
> > +}
> > +
> >  void __init zone_sizes_init(void)
> >  {
> >  	int nid;
> 
> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
> is so that memory that would back that stolen KVA gets pushed out into
> ZONE_HIGHMEM, the boundary between them is moved down.  By using
> reserve_bootmem we will mark the pages which are currently backing the
> KVA you are 'reusing' as reserved and prevent their release; we pay
> double for the mem_map.

Perhaps just freeing the reserved pages and remapping them at an
appropriate time could accomplish this?  Sorry, I don't know the KVA
"freeing" path; can you describe it a little more?  When are these pages
returned to the system?  It was my understanding that the KVA pages
were lost (the original way shrinks ZONE_NORMAL and creates a hole
between the zones).
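
For reference, here is a small userspace sketch (not kernel code) of how I
read the two layouts.  The helper names and the sample numbers are made up
for illustration; the only thing taken from the kernel is the initrd check
in setup_bootmem_allocator().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, as on i386 */

/* Same test as setup_bootmem_allocator(): the initrd is kept only if it
 * ends at or below the end of low memory. */
static bool initrd_fits(unsigned long initrd_start, unsigned long initrd_size,
			unsigned long max_low_pfn)
{
	return initrd_start + initrd_size <= (max_low_pfn << PAGE_SHIFT);
}

/* Original code: the end of lowmem is pulled down by the KVA size. */
static unsigned long max_low_pfn_original(unsigned long found,
					  unsigned long kva_pages)
{
	return found - kva_pages;
}

/* With the patch: the end of lowmem is left where it was found. */
static unsigned long max_low_pfn_patched(unsigned long found,
					 unsigned long kva_pages)
{
	(void)kva_pages;	/* KVA is reserved separately via bootmem */
	return found;
}

#ifdef SKETCH_MAIN
int main(void)
{
	/* Hypothetical box: 896 MB lowmem, 16 MB KVA, initrd near the top. */
	unsigned long found = 0x38000UL, kva = 0x1000UL;
	unsigned long start = 0x37fab000UL, size = 0x50000UL;

	printf("original: initrd %s\n",
	       initrd_fits(start, size, max_low_pfn_original(found, kva)) ?
	       "kept" : "dropped");
	printf("patched:  initrd %s\n",
	       initrd_fits(start, size, max_low_pfn_patched(found, kva)) ?
	       "kept" : "dropped");
	return 0;
}
#endif
```

With those sample numbers the original layout fails the check (the initrd
ends above the lowered max_low_pfn) and the patched layout passes it.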

> If the initrd's are falling into this space, can we not allocate some
> bootmem for those and move them out of our way?  As filesystem images
> they are essentially location neutral so this should be safe?

AFAIK bootloaders choose where to map initrds.  Grub seems to put it around
the top of ZONE_NORMAL but it is pretty free to map it where it wants. I
suppose INITRD_START, INITRD_END and all that could be dynamic and moved
around a bit but it seems a little messy. I would rather see the special
case (i386 numa, the rare beast it is) jump through a few extra hoops
than muck with the initrd code. 
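
To make the "extra hoops" concrete, this is roughly the placement
arithmetic the patch does, pulled out into a userspace sketch.  pfn_down()
and kva_start() are names I made up for this example, and PTRS_PER_PTE is
assumed to be 1024 (non-PAE); a PAE kernel would use 512.

```c
#include <assert.h>
#include <stdio.h>

#define PAGE_SHIFT	12		/* 4 KiB pages, as on i386 */
#define PTRS_PER_PTE	1024UL		/* assumption: non-PAE page tables */

/* Round a byte address down to its page frame number. */
static unsigned long pfn_down(unsigned long addr)
{
	return addr >> PAGE_SHIFT;
}

/*
 * Pick the first pfn of the KVA area: below the end of lowmem, or below
 * the initrd when one is present (initrd_start != 0), then rounded down
 * to a page-table boundary as in the patch.
 */
static unsigned long kva_start(unsigned long max_low_pfn,
			       unsigned long kva_pages,
			       unsigned long initrd_start)
{
	unsigned long start = max_low_pfn - kva_pages;

	if (initrd_start)
		start = pfn_down(initrd_start) - kva_pages;

	start -= start & (PTRS_PER_PTE - 1);
	return start;
}

#ifdef SKETCH_MAIN
int main(void)
{
	/* Hypothetical box: 896 MB lowmem, 16 MB KVA. */
	printf("no initrd:   %#lx\n", kva_start(0x38000UL, 0x1000UL, 0));
	printf("with initrd: %#lx\n",
	       kva_start(0x38000UL, 0x1000UL, 0x37fab000UL));
	return 0;
}
#endif
```

With an initrd at 0x37fab000 the area starts at pfn 0x36c00 and ends at
0x37c00, which stays below the initrd's first page (0x37fab).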
  
Thanks,
  Keith 
