Re: [openib-general] Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

I don't think that we should jump to the conclusion that in the long term HPC users cannot benefit from support of mechanisms such as hotremoval of memory or other forms of page migration in physical memory. In an earlier exchange on the openib-general list Mike Krause sent the message quoted below on very much the same topic. On the other hand I am willing to accept that there is practical value to implementations which are not (yet) sophisticated enough to support the migration functions.


Steve Langdon

Michael Krause wrote:
At 05:35 PM 3/14/2005, Caitlin Bestler wrote:
> -----Original Message-----
> From: Troy Benjegerdes [ mailto:[email protected]]
> Sent: Monday, March 14, 2005 5:06 PM
> To: Caitlin Bestler
> Cc: [email protected]
> Subject: Re: [openib-general] Getting rid of pinned memory requirement
>
> >
> > The key is that the entire operation either has to be fast
> > enough so that no connection or application session layer
> > time-outs occur, or an end-to-end agreement to suspend the
> > connection is a requirement. The first option seems more
> > plausible to me, the second essentially
> > requires extending the CM protocol. That's a tall order even for
> > InfiniBand, and it's even worse for iWARP where the CM
> > functionality typically ends when the connection is established.
>> I'll buy the good network design argument.
I and others designed InfiniBand RNR (Receiver Not Ready) operations to allow one to adjust V-to-P mappings (not change the address that was advertised) in order to allow an OS to safely play some games with memory and not drop a connection. The time values associated with RNR allow a solution to tolerate up to an infinite amount of time to perform such operations, but the envisioned goal was to do this on the order of a handful of milliseconds in the worst case. For iWARP, there was no support for defining RNR functionality, as indeed many people claimed one could just drop in-bound segments and allow the retransmission protocol to deal with the delay (even if this has performance implications due to back-off algorithms, though some claim SACK would minimize this to a large extent). Again, the idea was to minimize the worst case to milliseconds of down time. BTW, all of this assumed that the OS would not perform these types of changes that often, so the long-term impact on an application would be minimal.
>
> I suppose if the kernel wants to revoke a card's pinned
> memory, we should be able to guarantee that it gets new
> pinned memory within a bounded time. What sort of timing do
> we need? Milliseconds?
> Microseconds?
>
> In the case of iWarp, isn't this just TCP underneath? If so,
> can't we just drop any packets in the pipe on the floor and
> let them get retransmitted? (I suppose the same argument goes
> for infiniband..
> what sort of a time window do we have for retransmission?)
>
> What are the limits on end-to-end flow control in IB and iWarp?
>
From the RDMA Provider's perspective, the short answer is "quick enough so that I don't have to do anything heroic to keep the connection alive."
It should not require anything heroic. What it does require is a local method to suspend the local QP(s) so that they cannot place or read memory in the affected area. That can take some time depending upon the implementation. There is then the time to overwrite the mappings, which again, depending upon the implementation and the number of mappings, could be milliseconds in length.
With TCP you also have to add "and healthy". If you've ever had a long download that got effectively stalled by a burst of noise and you just hit the 'reload' button on your browser then you know what I'm talking about.
But in transport neutral terms I would think that one RTT is definitely safe -- that much data could have been dropped by one switch failure or one nasty spike in inbound noise.
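
To make the sequence above concrete, here is a minimal sketch in C of the suspend / remap / resume steps, with arrivals during the suspended window answered by an RNR-style "retry later" rather than a dropped connection. Everything in it (struct model_qp, qp_suspend, qp_remap, qp_resume, qp_receive) is a hypothetical model made up for illustration. It is not the verbs API or any driver's actual code, and it ignores timing entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only: a queue pair that can be suspended while the
 * OS rewrites virtual-to-physical mappings behind a registered buffer. */
enum qp_state { QP_ACTIVE, QP_SUSPENDED };
enum rx_result { RX_PLACED, RX_RNR_NAK };

struct model_qp {
    enum qp_state state;
    unsigned long phys_page;   /* current physical backing of the buffer */
    unsigned rnr_naks_sent;    /* arrivals that were told to retry later */
};

/* Step 1: stop the QP from placing or reading data in the affected area. */
void qp_suspend(struct model_qp *qp)
{
    qp->state = QP_SUSPENDED;
}

/* Step 2: rewrite the mapping. The advertised virtual address does not
 * change; only the physical page behind it does. Refused on a live QP. */
bool qp_remap(struct model_qp *qp, unsigned long new_phys_page)
{
    if (qp->state != QP_SUSPENDED)
        return false;
    qp->phys_page = new_phys_page;
    return true;
}

/* Step 3: let traffic flow again. */
void qp_resume(struct model_qp *qp)
{
    assert(qp->state == QP_SUSPENDED);
    qp->state = QP_ACTIVE;
}

/* An inbound message is placed if the QP is active. While suspended, the
 * sender is asked to retry instead of the connection being torn down. */
enum rx_result qp_receive(struct model_qp *qp)
{
    if (qp->state == QP_SUSPENDED) {
        qp->rnr_naks_sent++;
        return RX_RNR_NAK;
    }
    return RX_PLACED;
}
```

A caller would suspend, remap, then resume. A sender that got RX_RNR_NAK would retry after its RNR timer expires, which is why the whole window is meant to stay within a few milliseconds.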

> >
> > Yes, there are limits on how much memory you can mlock, or even
> > allocate. Applications are required to register memory precisely
> > because the required guarantees are not there by default.
> Eliminating
> > those guarantees *is* effectively rewriting every RDMA application
> > without even letting them know.
>
> Some of this argument is a policy issue, which I would argue
> shouldn't be hard-coded in the code or in the network hardware.
>
> At least in my view, the guarantees are only there to make
> applications go fast. We are getting low latency and high
> performance with infiniband by making memory registration go
> really really slow. If, to make big HPC simulation
> applications work, we wind up doing memcpy() to put the data
> into a registered buffer because we can't register half of
> physical memory, the application isn't going very fast.
>

What you are looking for is a distinction between registering
memory to *enable* the RNIC to optimize local access and
registering memory to enable its being advertised to the
remote end.

Early implementations of RDMA, both IB and iWARP, have not
distinguished between the two. But theoretically *applications*
do not need memory regions that are not enabled for remote
access to be pinned. That is an RNIC requirement that could
evolve. But applications themselves *do* need remotely
accessible memory regions, portions of which they intend
to advertise with RKeys, to be truly available (i.e., pinned).

You are also making a policy assumption that an application
that actually needs half of physical memory should be using
paged memory. Memory is cheap, and if performance is critical
why should this memory be swapped out to disk?

Is the limitation on not being able to register half of
physical memory based upon some assumption that swapping
is a requirement? Or is it a limitation in the memory region
size? If it's the latter, you need to get the OS to support
larger page sizes.
For some OS, you can pin very large areas. I've seen 15/16 of memory being able to be pinned with no adverse impacts on the applications. For these OS, kernel memory is effectively pinned memory. As such, depending upon the mix of services being provided, the system may operate quite nicely with such large amounts of memory being pinned. As more services are "ported" to operate over RDMA technologies, memory management isn't necessarily any harder; it just becomes something people have to think more about. Today's VM designs have allowed people to get sloppy as they assume that swapping will occur, and since many platforms are not that loaded, they don't see any real adverse impacts. User-space RDMA applications require people to think once again about memory management and to accept that swapping isn't a get-out-of-jail card. One needs to develop resource management tools to determine who obtains specified amounts of resources and their priorities. For the most part, this is somewhat a re-invention of some thinking that went into the micro-kernel work in past years. These problems are not intractable; they are only constrained by the legacy inertia inherent in all technologies today.
Mike
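
For readers who want to see what the mlock limit mentioned earlier looks like from user space, here is a minimal sketch in C, assuming a Linux/glibc system. The helper names fits_memlock_limit and pin_buffer are made up for this example; getrlimit(), RLIMIT_MEMLOCK and mlock() are the real calls. The limit check is deliberately simplistic: it ignores memory the process has already locked and page rounding, and it would wrongly refuse a privileged process (one with CAP_IPC_LOCK) for which the limit does not apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* True if a request of len bytes fits under the given soft limit.
 * Simplification: does not account for memory that is already locked. */
bool fits_memlock_limit(size_t len, rlim_t soft_limit)
{
    return soft_limit == RLIM_INFINITY || len <= soft_limit;
}

/* Try to pin [buf, buf + len) in RAM.
 * Returns 0 on success and -1 if the limit cannot be read, the request
 * exceeds the soft RLIMIT_MEMLOCK, or mlock() itself fails. */
int pin_buffer(void *buf, size_t len)
{
    struct rlimit lim;

    assert(buf != NULL && len > 0);
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0)
        return -1;
    if (!fits_memlock_limit(len, lim.rlim_cur))
        return -1;              /* over the per-process soft limit */
    return mlock(buf, len);     /* kernel still has the final say */
}
```

A caller would pair a successful pin_buffer() with munlock() on the same range when done. Whether a large fraction of physical memory can be pinned this way is a matter of how the limit is configured on a given system, which is the policy question discussed above.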




IWAMOTO Toshihiro wrote:

At Mon, 25 Apr 2005 16:58:03 -0700,
Roland Dreier wrote:

   Andrew> It would be better to obtain this memory via a mmap() of
   Andrew> some special device node, so we can perform appropriate
   Andrew> permission checking and clean everything up on unclean
   Andrew> application exit.

This seems to interact poorly with how applications want to use RDMA,
ie typically through a library interface such as MPI.  People doing
HPC don't want to recode their apps to use a new allocator, they just
want to link to a new MPI library and have the app go fast.


Such HPC users cannot use the memory hotremoval feature, and something
needs to be implemented so that the NUMA migration can handle such
memory properly, but I see your point.

If such memory were allocated by a driver, the memory could be placed
in non-hotremovable areas to avoid the above problems.

--
IWAMOTO Toshihiro


