Re: [PATCH 0/7] [RFC] Sizing zones and holes in an architecture independent manner V2

On Thu, 13 Apr 2006, Andi Kleen wrote:

> On Thursday 13 April 2006 02:22, Mel Gorman wrote:

>> I experimented with the idea of all architectures sharing the struct
>> node_active_region rather than storing the information twice. It got very
>> messy, particularly for x86 because it needs to store more than nid,
>> start_pfn and end_pfn for a range of page frames (see node_memory_chunk_s
>> in arch/i386/kernel/srat.c). Worse, some architecture-specific code
>> remembers the ranges of active memory as addresses and others as pfns. In
>> the end, I was not too worried about having the information in two places,
>> because the active ranges are kept in __initdata and get freed.

> The problem is not memory consumption but complexity of code/data structures.

The architecture-independent code is simpler than i386's SRAT mess,
about the same complexity as ppc's dealings with the LMB (in fact, much
of the code is lifted from ppc) and comparable in complexity to what
IA64 does. For x86_64, there is less architecture-specific code that
has to be understood.

> Keeping information in two places is usually a good cue that something
> is wrong. This code is also fragile and hard to test.


At minimum, it requires a boot test - not that massive a burden - and a
look at the zone values before and after the patches.

To test architectures that register PFNs in unexpected ways that I
don't have a test machine for (like IA64), I wrote the attached test
program. It was simply a case of

1. A few #defines to pretend it's compiled in-kernel
2. Cut and paste from the architecture-independent code in mem_init.c to
   the driver program
3. Pass in sample input from main() and see what pops out
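
The test cases in main() below model each architecture's registration
pattern, so trying a new layout is just a matter of editing those
calls. For example, with invented values:

	add_active_range(0, 0, 1024);		/* node 0: pfns 0-1023 */
	add_active_range(1, 2048, 4096);	/* node 1, after a hole */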

It caught a number of simple bugs (including one this morning) without
even having to boot a machine. The same type of testing is hard with
the architecture-specific code. This is sample output of the driver
program handling PFN ranges as supplied by IA64:

mel@joshua:~/tmp$ gcc driver_test.c -o driver_test && ./driver_test | grep -v "active with no"
Stage 1: Registering active ranges
add_active_range(0, 0, 4096): New
add_active_range(0, 0, 131072): Merging forward
add_active_range(0, 0, 131072): Merging backwards
add_active_range(0, 393216, 523264): New
add_active_range(0, 393216, 523264): Merging backwards
add_active_range(0, 393216, 524288): Merging forward
add_active_range(0, 393216, 524288): Merging backwards

Stage 2: Calculating zone sizes and holes
Hole found zone 1 index 1: 131072 -> 393216

Stage 3: Dumping zone sizes and holes
zone_size[0][0] =   131020 zone_holes[0][0] =        0
zone_size[0][1] =   393268 zone_holes[0][1] =   262144

Stage 4: Printing present pages
On node 0, 262144 pages
 zone 0 present_pages = 131020
 zone 1 present_pages = 131124

So, testing is not that hard.

How is the code fragile? Even *if* it is fragile, it only has to be fixed once to benefit any architecture using the code path.

>> I'll admit that for x86_64, the entire code path for initialisation (i.e.
>> architecture specific and architecture independent paths) is now more
>> complex. The architecture independent code needed to be able to handle
>> every variety of node layout which is overkill for x86_64. Nevertheless,
>> without size_zones(), I thought the architecture-specific code for x86_64
>> memory initialisation was a bit easier to read. With
>> architecture-independent zone size and hole calculation, you only have to
>> understand the relevant code once, not once for each architecture.


> I think i386 SRAT NUMA should be just removed at some point - it never
> worked all that well and is quite complicated.

Assuming you mean the code, are these patches not a readable replacement?

> That leaves IA64, x86-64
> and ppc64.  I suspect keeping the code there near their low level
> data structures is better.


For PPC64, the architecture-independent representation *is* the only
copy (which is why 128 lines of arch-specific code were deleted for
ppc, including a struct init_node_data array of nids, start_pfns and
end_pfns). IA64's low-level representation uses addresses, not PFNs,
so having only one copy would be a very invasive patch, which is not a
good idea without a test box.

In many cases, the low-level representation between architectures is
similar. The representation I use holds the common elements all the
architectures need - nid, start_pfn and end_pfn.
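
For illustration, the conversion for an address-based architecture is
mechanical. A minimal sketch, assuming a hypothetical chunk[] table
holding a nid and start/end addresses (the PAGE_SHIFT conversion is the
only real work):

	for (i = 0; i < num_memory_chunks; i++)
		add_active_range(chunk[i].nid,
				chunk[i].start_addr >> PAGE_SHIFT,
				chunk[i].end_addr >> PAGE_SHIFT);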

>>> I have my doubts that is really an improvement over the old state.


>> For x86_64 in isolation or the entire set of patches?

> For x86-64/i386. I haven't read the other architectures.


ok.

>>> I think it would be better if you just defined some simple "library functions"
>>> that can be called from the architecture specific code instead of adding
>>> all this new high level code.


>> What sort of library functions would you recommend? x86_64 uses
>> add_active_range() and free_area_init_nodes() from this patchset which
>> seemed fairly straight-forward.

> e.g. a generic size_zones(). Possibly some others.


For a generic size_zones(), one would need an architecture-independent
way to pass in active page frame ranges and what node they are on. So
we end up with code very similar to what I've posted.
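
To make that concrete, the arch-side usage would boil down to the two
entry points this patchset already provides. A sketch, assuming a
hypothetical range[] table; the zone-limit parameters passed to
free_area_init_nodes() are illustrative as well:

	/* register every active pfn range the arch knows about */
	for (i = 0; i < nr_ranges; i++)
		add_active_range(range[i].nid,
				range[i].start_pfn, range[i].end_pfn);

	/* arch-independent code sizes the zones and holes from these */
	free_area_init_nodes(max_dma_pfn, max_dma32_pfn,
				max_low_pfn, max_high_pfn);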

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define ZONE_DMA 0
#define ZONE_NORMAL 1
#define printk(x, args...) printf(x, ## args)
#define BUG() do { printf("BUG AT %d\n", __LINE__); } while (0)
#define BUG_ON(x) do { if (x) BUG(); } while (0)
#define __init
#define __initdata
#define KERN_WARNING
#define KERN_ERR
#define MAX_NR_NODES 32
#define MAX_NR_ZONES 5

struct node_active_region {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};
#define MAX_ACTIVE_REGIONS (MAX_NR_NODES * MAX_NR_ZONES)
struct node_active_region __initdata early_node_map[MAX_ACTIVE_REGIONS];
unsigned long __initdata arch_zone_lowest_possible_pfn[MAX_NR_ZONES];
unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];

/* Return the index of the first early_node_map entry that belongs to nid */
static int __init first_active_region_index_in_nid(int nid)
{
	int i;
	for (i = 0; early_node_map[i].end_pfn; i++) {
		if (early_node_map[i].nid == nid)
			return i;
	}

	return MAX_ACTIVE_REGIONS;
}

/* Return the next early_node_map index after 'index' that belongs to nid */
static int __init next_active_region_index_in_nid(unsigned int index, int nid)
{
	for (index = index + 1; early_node_map[index].end_pfn; index++) {
		if (early_node_map[index].nid == nid)
			return index;
	}

	return MAX_ACTIVE_REGIONS;
}

#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
/* Linear search of the early node map for the nid a pfn falls within */
int __init early_pfn_to_nid(unsigned long pfn)
{
	int i;

	for (i = 0; early_node_map[i].end_pfn; i++) {
		unsigned long start_pfn = early_node_map[i].start_pfn;
		unsigned long end_pfn = early_node_map[i].end_pfn;

		if ((start_pfn <= pfn) && (pfn < end_pfn))
			return early_node_map[i].nid;
	}

	return -1;
}
#endif

/* Iterate over the early_node_map indices that belong to a given nid */
#define for_each_active_range_index_in_nid(i, nid) \
	for (i = first_active_region_index_in_nid(nid); \
				i != MAX_ACTIVE_REGIONS; \
				i = next_active_region_index_in_nid(i, nid))

/* Derive a node's overall start/end pfn span from its active regions */
void __init get_pfn_range_for_nid(unsigned int nid,
			unsigned long *start_pfn, unsigned long *end_pfn)
{
	unsigned int i;
	*start_pfn = -1UL;
	*end_pfn = 0;

	for_each_active_range_index_in_nid(i, nid) {
		if (early_node_map[i].start_pfn < *start_pfn)
			*start_pfn = early_node_map[i].start_pfn;

		if (early_node_map[i].end_pfn > *end_pfn)
			*end_pfn = early_node_map[i].end_pfn;
	}

	if (*start_pfn == -1UL) {
		printk(KERN_WARNING "Node %u active with no memory\n", nid);
		*start_pfn = 0;
	}
}

/* Pages spanned by a zone within a node, before holes are subtracted */
unsigned long __init zone_present_pages_in_node(int nid,
					unsigned long zone_type,
					unsigned long *ignored)
{
	unsigned long node_start_pfn, node_end_pfn;
	unsigned long zone_start_pfn, zone_end_pfn;

	/* Get the start and end of the node and zone */
	get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];

	/* Check that this node has pages within the zone's required range */
	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
		return 0;

	/* Move the zone boundaries inside the node if necessary */
	if (zone_end_pfn > node_end_pfn)
		zone_end_pfn = node_end_pfn;
	if (zone_start_pfn < node_start_pfn)
		zone_start_pfn = node_start_pfn;

	/* Return the spanned pages */
	return zone_end_pfn - zone_start_pfn;
}

/* Count the pfns lost to holes between a node's active ranges in a zone */
unsigned long __init zone_absent_pages_in_node(int nid,
					unsigned long zone_type,
					unsigned long *ignored)
{
	int i = 0;
	unsigned long prev_end_pfn = 0, hole_pages = 0;
	unsigned long start_pfn;

	/* Find the start of the first active range of pfns in the node */
	i = first_active_region_index_in_nid(nid);
	if (i == MAX_ACTIVE_REGIONS)
		return 0;	/* no active ranges on this node */
	prev_end_pfn = early_node_map[i].start_pfn;

	/* Find all holes for the zone within the node */
	for (; i != MAX_ACTIVE_REGIONS;
			i = next_active_region_index_in_nid(i, nid)) {

		/* No need to continue if prev_end_pfn is outside the zone */
		if (prev_end_pfn >= arch_zone_highest_possible_pfn[zone_type])
			break;

		/* Make sure the end of the zone is not within the hole */
		start_pfn = early_node_map[i].start_pfn;
		if (start_pfn > arch_zone_highest_possible_pfn[zone_type])
			start_pfn = arch_zone_highest_possible_pfn[zone_type];
		if (prev_end_pfn > start_pfn) {
			printk("prev_end > start_pfn : %lu > %lu\n",
					prev_end_pfn,
					start_pfn);
			BUG();
		}

		/* Update the hole size count and move on */
		if (start_pfn > arch_zone_lowest_possible_pfn[zone_type]) {
			hole_pages += start_pfn - prev_end_pfn;
			printk("Hole found zone %lu index %d: %lu -> %lu\n",
					zone_type, i, prev_end_pfn, start_pfn);
		}
		prev_end_pfn = early_node_map[i].end_pfn;
	}

	return hole_pages;
}

/* Register a range of active pfns, merging with an existing range if possible */
void __init add_active_range(unsigned int nid, unsigned long start_pfn,
						unsigned long end_pfn)
{
	unsigned int i;

	printk("add_active_range(%d, %lu, %lu): ",
			nid, start_pfn, end_pfn);

	/* Merge with existing active regions if possible */
	for (i = 0; early_node_map[i].end_pfn; i++) {
		if (early_node_map[i].nid != nid)
			continue;

		/* Merge forward if suitable */
		if (start_pfn <= early_node_map[i].end_pfn &&
				end_pfn > early_node_map[i].end_pfn) {
			printk("Merging forward\n");
			early_node_map[i].end_pfn = end_pfn;
			return;
		}

		/* Merge backward if suitable */
		if (start_pfn < early_node_map[i].end_pfn &&
				end_pfn >= early_node_map[i].start_pfn) {
			printk("Merging backwards\n");
			early_node_map[i].start_pfn = start_pfn;
			return;
		}
	}

	/*
	 * Leave the last entry zeroed so we don't iterate off the end (we
	 * use entry.end_pfn to terminate the walk).
	 */
	if (i >= MAX_ACTIVE_REGIONS - 1) {
		printk(KERN_ERR "WARNING: too many memory regions in "
				"numa code, truncating\n");
		return;
	}

	printk("New\n");
	early_node_map[i].nid = nid;
	early_node_map[i].start_pfn = start_pfn;
	early_node_map[i].end_pfn = end_pfn;
}

/* Find the lowest registered pfn across all nodes */
unsigned long __init find_min_pfn(void)
{
	int i;
	unsigned long min_pfn = -1UL;

	for (i = 0; early_node_map[i].end_pfn; i++) {
		if (early_node_map[i].start_pfn < min_pfn)
			min_pfn = early_node_map[i].start_pfn;
	}

	return min_pfn;
}

/* Find the lowest pfn in a node. This depends on a sorted early_node_map */
unsigned long __init find_start_pfn_for_node(unsigned long nid)
{
	int i;

	/* Assuming a sorted map, the first range found has the starting pfn */
	for_each_active_range_index_in_nid(i, nid) {
		return early_node_map[i].start_pfn;
	}

	/* nid does not exist in early_node_map */
	printk(KERN_WARNING "Could not find start_pfn for node %lu\n", nid);
	return 0;
}

int main(void)
{
	unsigned long zone_size[MAX_NR_NODES][MAX_NR_ZONES];
	unsigned long zone_holes[MAX_NR_NODES][MAX_NR_ZONES];
	unsigned long node_present[MAX_NR_NODES];
	unsigned long start_pfn, end_pfn;
	int i, j;

	/* init */
	memset(early_node_map, 0, sizeof(early_node_map));
	memset(zone_size, 0, sizeof(zone_size));
	memset(zone_holes, 0, sizeof(zone_holes));
	memset(node_present, 0, sizeof(node_present));

	printk("Stage 1: Registering active ranges\n");

	/* Test Case: IA64 flatmem with CONFIG_VIRTUAL_MEM_MAP=y */
	/*
	add_active_range(0, 1024, 130688);
	add_active_range(0, 130984, 131020);
	add_active_range(0, 393216, 524164);
	add_active_range(0, 524192, 524269);
	*/

	/* Test Case: IA64 generic kernel */
	add_active_range(0, 0, 4096);
	add_active_range(0, 0, 131072);
	add_active_range(0, 0, 131072);
	add_active_range(0, 393216, 523264);
	add_active_range(0, 393216, 523264);
	add_active_range(0, 393216, 524288);
	add_active_range(0, 393216, 524288);

	/* Test Case: Simple PPC64 LMB */
	/*
	add_active_range(0, 0, 100);
	add_active_range(0, 100, 200);
	add_active_range(0, 200, 300);
	add_active_range(1, 1000, 2000);
	add_active_range(1, 500, 1000);
	*/

	/* Test Case: PPC64 with interleaving nodes and holes */
	/*
	add_active_range(0, 0, 100);
	add_active_range(0, 200, 300);
	add_active_range(1, 100, 200);
	add_active_range(0, 500, 1000);
	*/

	/* IA64 zone pfn ranges */
	arch_zone_lowest_possible_pfn[ZONE_DMA] = 0;
	arch_zone_highest_possible_pfn[ZONE_DMA] = 131020;
	arch_zone_lowest_possible_pfn[ZONE_NORMAL] = arch_zone_highest_possible_pfn[ZONE_DMA];
	arch_zone_highest_possible_pfn[ZONE_NORMAL] = 800000;
	
	/* PPC64 zone pfn ranges */
	/*
	arch_zone_lowest_possible_pfn[ZONE_DMA] = 0;
	arch_zone_highest_possible_pfn[ZONE_DMA] = 131020;
	*/

	printf("\nStage 2: Calculating zone sizes and holes\n");
	for (i = 0; i < MAX_NR_NODES; i++) {
		for (j = 0; j < MAX_NR_ZONES; j++) {
			zone_size[i][j] = zone_present_pages_in_node(i, j, NULL);
			zone_holes[i][j] = zone_absent_pages_in_node(i, j, NULL);
			node_present[i] += zone_size[i][j];
			node_present[i] -= zone_holes[i][j];
		}
	}
	
	printf("\nStage 3: Dumping zone sizes and holes\n");
	for (i = 0; i < MAX_NR_NODES; i++) {
		if (!node_present[i])
			continue;

		for (j = 0; j < MAX_NR_ZONES; j++) {
			if (!zone_size[i][j])
				continue;

			printf("zone_size[%d][%d] = %8lu zone_holes[%d][%d] = %8lu\n",
					i, j, zone_size[i][j], i, j, zone_holes[i][j]);
		}
	}

	printf("\nStage 4: Printing present pages\n");
	for (i = 0; i < MAX_NR_NODES; i++) {
		if (!node_present[i])
			continue;

		printf("On node %d, %lu pages\n", i, node_present[i]);
		for (j = 0; j < MAX_NR_ZONES; j++) {
			if (!zone_size[i][j])
				continue;
			printf(" zone %d present_pages = %lu\n", j,
					zone_size[i][j] - zone_holes[i][j]);
		}
	}
	return 0;
}

