This is a fairly early draft of the patch. It works on x86_64 only. It is a bit massive, so I'd like to have some feedback before proceeding (and maybe some help?). Support for other arches has not been tested yet.

The patch establishes a new set of cpu operations that exploit single instruction atomicity to allow per cpu variable modifications without disabling/enabling preempt or interrupts, and without the need for an offset calculation to determine the location of the variable on the current processor. It then implements these operations on x86_64 after consolidating per cpu access for allocpercpu, percpu and pda. All per cpu data is then accessible via a gs segment override.

This results in a reduction in the code size of the kernel and in more efficient per cpu access.

Before:

   text    data     bss     dec     hex filename
4041907  512371 1302360 5856638  595d7e vmlinux

After (this includes the code added for the cpu allocator!):

   text    data     bss     dec     hex filename
3861532  527715 1298072 5687319  56c817 vmlinux

On x86_64 the segment override results in the following change for a simple vm counter increment:

Before:

    mov  %gs:0x8,%rdx              Get smp_processor_id
    mov  tableoffset,%rax          Get table base
    incq varoffset(%rax,%rdx,1)    Perform the operation with a complex lookup
                                   adding the var offset

An interrupt or a reschedule action can move the execution thread to another processor if interrupts or preempt are not disabled. The variable of the wrong processor may then be updated in a racy way.

After:

    incq %gs:varoffset(%rip)

This is a single instruction that is safe from interrupts and from the execution thread being moved. It will reliably operate on the current processor's data area.

Other platforms can also perform address relocation plus atomic ops on a memory location. Exploiting the atomicity of instructions vs. interrupts is therefore possible there as well and will reduce the cpu op processing overhead.

E.g. on IA64 we have a per cpu virtual mapping of the per cpu area. If we add an offset to the per cpu area variable address then we can guarantee that we always hit the per cpu area local to a processor.

Other platforms (SPARC?) have registers that can be used to form addresses. If the cpu area address is in one of those then atomic per cpu modifications can be generated for those platforms in the same way.

The best-case SLUB fastpath performance goes from 47 cycles to 41 cycles through the use of the segment override.
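To make the race described above a bit more concrete, here is a minimal user-space sketch in C. It is only a model and not code from the patch set: NR_CPUS, counters, current_cpu, inc_multi_step(), inc_single_step() and demo() are all hypothetical names, and the "migration" is simulated by a parameter rather than by a real scheduler. It assumes nothing beyond standard C.

```c
#include <stdio.h>

#define NR_CPUS 4                      /* hypothetical cpu count for this sketch */

static long counters[NR_CPUS];         /* one counter slot per simulated cpu */
static int current_cpu;                /* stand-in for the cpu we are running on */

/*
 * Multi-step update, modeled on the "Before" sequence: the cpu id is
 * sampled first and the increment happens later. migrate_to >= 0
 * simulates the thread being moved to another cpu between the steps.
 */
static void inc_multi_step(int migrate_to)
{
	int cpu = current_cpu;             /* like reading smp_processor_id */

	if (migrate_to >= 0)
		current_cpu = migrate_to;      /* simulated preemption + migration */

	counters[cpu]++;                   /* may now hit the previous cpu's slot */
}

/*
 * Single-step update, modeled on the "After" instruction: resolving the
 * location and incrementing are not separable, so there is no window
 * for a migration in between. In this C model it is simply one statement.
 */
static void inc_single_step(void)
{
	counters[current_cpu]++;
}

/* Small demo: migrate from cpu 0 to cpu 1 in the middle of an update. */
static void demo(void)
{
	current_cpu = 0;
	inc_multi_step(1);                 /* counts on cpu 0 while running on cpu 1 */
	inc_single_step();                 /* counts on cpu 1, where we actually run */
	printf("cpu0=%ld cpu1=%ld running on cpu%d\n",
	       counters[0], counters[1], current_cpu);
}
```

Calling demo() from a main() shows the multi-step variant charging the increment to the cpu the thread has already left, which is exactly the case that otherwise requires disabling preempt or interrupts around the update.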