Re: NUMA aware slab allocator V3

On Mon, 2005-05-16 at 09:47 -0700, Christoph Lameter wrote:
> On Mon, 16 May 2005, Dave Hansen wrote:
> > There are some broken assumptions in the kernel that
> > CONFIG_DISCONTIG==CONFIG_NUMA.  These usually manifest when code assumes
> > that one pg_data_t means one NUMA node.
> > 
> > However, NUMA node ids are actually distinct from "discontigmem nodes".
> > A "discontigmem node" is just one physically contiguous area of memory,
> > thus one pg_data_t.  Some (non-NUMA) Mac G5's have a gap in their
> > address space, so they get two discontigmem nodes.
> 
> I thought the discontiguous memory in one node was handled through zones?
> I.e. ZONE_HIGHMEM in i386?

You can only have one zone of each type under each pg_data_t.  For
instance, you can't properly represent (DMA, NORMAL, HIGHMEM, <GAP>,
HIGHMEM) in a single pg_data_t without wasting node_mem_map[] space.
The "proper" discontig way of representing that is like this:

        pg_data_t[0] (DMA, NORMAL, HIGHMEM)
        <GAP>
        pg_data_t[1] (---, ------, HIGHMEM)

Where pg_data_t[1] has empty DMA and NORMAL zones.  Also, remember that
both of these could theoretically be on the same NUMA node.  But, I
don't think we ever do that in practice.
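
To make the layout above concrete, here is a small user-space sketch in C.
It is only a model, not kernel code: struct pgdat_model, struct zone_model,
the helper names and every page frame number are made up for illustration.
Both areas are placed on NUMA node 0 to show the theoretical case mentioned
above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's zone types (illustration only). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

struct zone_model {
	unsigned long start_pfn;	/* first page frame of the zone */
	unsigned long nr_pages;		/* 0 means the zone is empty */
};

/* Hypothetical model of a pg_data_t: exactly one slot per zone type. */
struct pgdat_model {
	int numa_node;			/* NUMA node this area sits on */
	struct zone_model zones[MAX_NR_ZONES];
};

/* Made-up page frame numbers, chosen only to leave a hole between areas. */
static const struct pgdat_model pgdats[2] = {
	{ .numa_node = 0, .zones = {
		[ZONE_DMA]     = { 0x00000, 0x01000 },
		[ZONE_NORMAL]  = { 0x01000, 0x37000 },
		[ZONE_HIGHMEM] = { 0x38000, 0x08000 },
	} },
	/* <GAP>: pfns 0x40000..0x7ffff belong to no pgdat at all */
	{ .numa_node = 0, .zones = {
		[ZONE_HIGHMEM] = { 0x80000, 0x10000 },	/* DMA, NORMAL empty */
	} },
};

static int zone_is_empty(const struct pgdat_model *p, enum zone_type t)
{
	assert(t < MAX_NR_ZONES);
	return p->zones[t].nr_pages == 0;
}

/* Sum the pages of all zones in one area; the gap is never counted. */
static unsigned long pgdat_pages(const struct pgdat_model *p)
{
	unsigned long total = 0;

	for (int t = 0; t < MAX_NR_ZONES; t++)
		total += p->zones[t].nr_pages;
	return total;
}

/* Count distinct NUMA node ids among n areas. */
static size_t count_numa_nodes(const struct pgdat_model *v, size_t n)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (v[j].numa_node == v[i].numa_node)
				break;
		if (j == i)	/* no earlier area had this node id */
			distinct++;
	}
	return distinct;
}
```

In this model there are two areas but count_numa_nodes() reports a single
NUMA node, which is exactly the case that code assuming "one pg_data_t per
NUMA node" gets wrong.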

> > So, that #error is bogus.  It's perfectly valid to have multiple
> > discontigmem nodes, when the number of NUMA nodes is 1.  MAX_NUMNODES
> > refers to discontigmem nodes, not NUMA nodes.
> 
> Ok. We looked through the code and saw that the check may be removed 
> without causing problems. However, there is still a feeling of uneasiness 
> about this.

I don't blame you :)

> To what node does numa_node_id() refer?

That refers to the NUMA node in the sense you're thinking of: CPUs,
memory, and I/O that are close to each other, etc...

> And is it legit to use
> numa_node_id() to index cpu maps and stuff?

Yes, those are all NUMA nodes.

> How do the concepts of numa node id relate to discontig node ids?

I believe quite a few architectures assume that, when NUMA is on, the
two are equivalent.  It appears to be pretty much assumed everywhere
that CONFIG_NUMA=y means one pg_data_t per NUMA node.

Remember, as you saw, you can't assume that MAX_NUMNODES=1 when NUMA=n
because of the DISCONTIG=y case.

So, in summary, if you want to do it right: use the
CONFIG_NEED_MULTIPLE_NODES that you see in -mm.  As plain DISCONTIG=y
gets replaced by sparsemem, any code keyed on
CONFIG_NEED_MULTIPLE_NODES is likely to stay working.
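
A rough sketch of the difference between the two guards, as a standalone C
model rather than real kernel code.  The config macros are defined by hand
here to mimic a DISCONTIG=y, NUMA=n build; NAIVE_MAX_NODES,
MODEL_MAX_NUMNODES, struct per_node_cache and cache_for_node() are
hypothetical names, and the value 4 is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical build: more than one pg_data_t, but no NUMA.  In a real
 * kernel these symbols come from Kconfig, not from #define lines.
 */
#define CONFIG_NEED_MULTIPLE_NODES 1
/* CONFIG_NUMA is deliberately left undefined. */

/* The broken assumption: only NUMA=y can mean more than one node. */
#ifdef CONFIG_NUMA
#define NAIVE_MAX_NODES 4
#else
#define NAIVE_MAX_NODES 1
#endif

/* Keyed on what actually matters: can there be several pg_data_t? */
#ifdef CONFIG_NEED_MULTIPLE_NODES
#define MODEL_MAX_NUMNODES 4	/* illustrative value only */
#else
#define MODEL_MAX_NUMNODES 1
#endif

struct per_node_cache {
	unsigned long free_objects;
};

/* Per-node state sized by the number of possible pg_data_t. */
static struct per_node_cache caches[MODEL_MAX_NUMNODES];

/* Look up the per-node slot for a given node id. */
static struct per_node_cache *cache_for_node(int nid)
{
	assert(nid >= 0 && nid < MODEL_MAX_NUMNODES);
	return &caches[nid];
}
```

With these settings NAIVE_MAX_NODES comes out as 1, so an array sized by it
would have no slot for the second pg_data_t, while the array sized by
MODEL_MAX_NUMNODES does.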

-- Dave
