Re: [Patch 0/8] per-task delay accounting

Jay Lan wrote:

I sent two pieces of feedback on 3/31, only to see them bounce
back over the weekend. :(

Here was my first feedback:

Shailabh Nagar wrote:
>>
>>Following Andrew's suggestion, here's my quick overview
>>of the various other accounting packages that have been
>>proposed on lse-tech with a focus on whether they can
>>utilize the netlink-based taskstats interface being proposed
>>by the delay accounting patches.
>>
>>Please note that unification of statistics *collection* is not
>>being discussed since that kind of merger can be done as these
>>patches get accepted, if at all, into the kernel. To try and
>>unify right away would hold every patch (esp. delay accounting !)
>>hostage to the problems in every other patch unnecessarily. As
>>long as the interface can be unified, the merger of the
>>collection bits can always happen without affecting user space.
>>
>>Stakeholders of each of these patches, on cc, are requested to
>>please correct any misunderstandings of what their patches do.
>
>To me, data collection and formatting before sending down to
>userspace is a very important part.  What this taskstats netlink
>interface does is just to provide an interface to send "already
>formatted" data to userspace. In other words, it will replace the
>"writing accounting records to an accounting file" step currently
>performed in BSD accounting and in CSA.


Exactly. The writing of the accounting file can be done in userspace
through a CSA-specific daemon reading the data.

>If i understand it correctly,
>you have delayacct.c sitting on top of taskstats interface, and
>all other accounting methods should build their own layer on top
>of taskstats as well.


Yes, all the new ones that are yet to be included in the kernel.

>For example, potentially BSD acct.c can replace
>fput() (and other statements dealing with accounting file) with
>this interface. Same for CSA.


Yes. I'm not sure if changing BSD would be useful (since I don't
know how often it is used ?) but yes, it can be done and CSA is
similar.

>
>This approach sounds right to me. Actually i am very glad that you
>made effort to provide a common ground here. Yet, this is only
>one step. I will apply your patchset on top of 2.6.16-mm to see
>what i get and give more feedback later.

And, here is the second one:



This taskstats thing is much more complicated than what Guillaume
used to have when he put up a prototype of doing ELSA over netlink.
One confusing point is the struct taskstats. If it is to be used
as the big data struct to contain all accounting data everybody
needs (as Shailabh suggested on his CSA analysis section), then
if at do_exit() every accounting method is to be invoked to
handle its netlink transmission (as currently implemented in
delayed accounting), would it be a lot of overhead sending "grand
data" too many times? Maybe each layer should just format data of
its interest when invoked from do_exit, and then we do one call
to genetlink to deliver formatted struct taskstats data?


Good idea. One can already do this in the code we submitted by adding
functions similar to delayacct_add_tsk() within the fill_pid() and
fill_tgid() parts of the taskstats code. Then the delayacct_tsk_exit()
routine will serve as the "one call" to deliver formatted data.

However, using delayacct_tsk_exit (which does have delay accounting specific
bits too) as the data delivery call isn't intuitive. So I'll separate
out the taskstats_exit_pid as a separate call directly made within
do_exit(). Will require some refactoring but it can be done.
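
The "fill, then send once" idea discussed above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the submitted kernel code: struct stats_sketch, the fill_fn callback type, fill_delay_sketch(), fill_csa_sketch() and collect_once() are all hypothetical names, and the real fill_pid()/fill_tgid() routines and struct taskstats look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the shared record; not the real struct taskstats. */
struct stats_sketch {
    uint64_t cpu_count;
    uint64_t cpu_delay_total;   /* nanoseconds */
    uint64_t csa_example_field; /* made-up field owned by a second subsystem */
};

/* Each accounting subsystem fills only the fields it owns. */
typedef void (*fill_fn)(struct stats_sketch *st);

static void fill_delay_sketch(struct stats_sketch *st)
{
    /* Fixed sample values, for illustration only. */
    st->cpu_count = 3;
    st->cpu_delay_total = 1500;
}

static void fill_csa_sketch(struct stats_sketch *st)
{
    st->csa_example_field = 42;
}

/*
 * Zero one record, run every registered fill routine on it, and return
 * the number of fill routines invoked. The caller would then transmit
 * the finished record a single time.
 */
static size_t collect_once(struct stats_sketch *st,
                           const fill_fn *fillers, size_t n)
{
    assert(st != NULL && (fillers != NULL || n == 0));
    memset(st, 0, sizeof(*st));
    for (size_t i = 0; i < n; i++)
        fillers[i](st);
    return n;
}
```

With an array holding both fill routines, collect_once() leaves one record carrying the delay fields and the made-up CSA field, so a single transmission would cover both subsystems.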

Also, as you pointed out, CSA only retrieves data at end of task
but delayed accounting needs to retrieve data during the process.
So, i think we need more than one record type, not just the
struct taskstats, so that the user space delayed accounting
application can specify to get only delayed accounting record.

A separate record type isn't needed, at least for now. For delay
accounting, the data obtained during a process' lifetime is the same
as the one expected at the end. So by itself, it has no need to
distinguish records generated during the lifetime and those generated
after a process exits.

Yes, the additional fields added to the taskstats struct by CSA will be
"unnecessary" for delay accounting users but they will have to be able
to deal with that anyway (for the process exit records where CSA and
delay will share a common exit record).

So creating a separate record structure for the "during lifetime"
records trades off transmission of a larger structure (relatively
cheap) vs. the added complexity of tracking two types of records.

At this point, the tradeoff isn't worth it for us.

Honestly, this taskstats.c layer looks more like something
extracted from delayed accounting than a carefully designed common
ground to me.

If you have other specific suggestions about the interface and why it
doesn't meet CSA's needs, we can work to fix them.

Patch 8/8 is more about documentation of delayed
accounting than the common ground for various accounting methods.

True. Patch 8/8 was meant to document delay accounting alone. I'll
extract the taskstats specific parts out.

Can you please present us a documentation of design concept of
such a common layer ?

Well, the design is fairly straightforward and is probably apparent by now.
A common per-task accounting structure called taskstats exists.
Userspace can use a NETLINK_GENERIC interface to send queries for
statistics of a particular pid or tgid during the lifetime of a process.

Specifying the pid gives the stats for just that pid. Specifying the
tgid returns the sum of stats for all threads of the tgid.

Userspace can also choose to open the NETLINK_GENERIC socket in
multicast and listen for per-pid and per-tgid statistics that are
automatically sent from the kernel whenever a task exits. These stats
are sent whenever there is any listener on the genetlink socket. The
per-pid and per-tgid data are exactly the same as what you would get if
a query could be done just before a task exited.

Sending the per-tgid data at the exit of each pid/tid is necessary
since there is no well-defined "tgid exit" point in the kernel (we do
not define a thread group to cease existence when the thread group
leader exits...rather it ceases to exist when the last thread of the
thread group exits). Also, per-tgid accumulation is only done
dynamically in the kernel, not maintained as a separate statistic (to
avoid wasting time and space). So each time a tid from a tgid exits, one
needs to collect and send the whole tgid's data in case userspace is
trying to track the stats at a per-tgid level.
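
To make the "accumulated dynamically, not stored" point concrete, here is a minimal userspace sketch of building a per-tgid record on demand by summing per-thread records. It is an illustration only: struct thread_stats_sketch and sum_tgid_sketch() are hypothetical names, and the kernel code walks the thread group rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread record; the real struct taskstats has more fields. */
struct thread_stats_sketch {
    uint64_t cpu_count;
    uint64_t cpu_delay_total; /* nanoseconds */
};

/*
 * Build the per-tgid view on demand by summing the records of all
 * threads in the group. Nothing is cached between calls, so the sum is
 * recomputed every time it is needed (for example at each thread exit).
 */
static struct thread_stats_sketch
sum_tgid_sketch(const struct thread_stats_sketch *threads, size_t n)
{
    struct thread_stats_sketch total = { 0, 0 };

    assert(threads != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        total.cpu_count += threads[i].cpu_count;
        total.cpu_delay_total += threads[i].cpu_delay_total;
    }
    return total;
}
```

The cost of this choice is a walk over all threads at each exit; the benefit is that no extra per-tgid state has to be kept up to date while the threads run.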


The statistic structure contents are documented in include/linux/taskstats.h
and by the accounting subsystem which fills in the fields. Currently
delay accounting is the only user so all the fields are of the form
   XXX_count and XXX_delay_total

where the former is a count of number of values added in the latter.
Latter is the cumulative "delay", in nanoseconds, seen by a pid waiting
for the resource XXX. e.g. cpu_delay_total is the total time spent
waiting for a cpu to run on, blkio_delay_total is the time spent
waiting for sync block I/O to complete etc.
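
As a small worked example of how a userspace tool might use an XXX_count / XXX_delay_total pair, the sketch below computes the average delay per event. The struct is a hypothetical two-resource subset following the naming pattern above (not the real struct taskstats), and avg_delay_ns() is an assumed helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the record, following the XXX_count /
 * XXX_delay_total naming pattern; not the real struct taskstats. */
struct delay_pair_sketch {
    uint64_t cpu_count;
    uint64_t cpu_delay_total;   /* nanoseconds */
    uint64_t blkio_count;
    uint64_t blkio_delay_total; /* nanoseconds */
};

/* Average delay per counted event, in nanoseconds (integer division);
 * returns 0 when nothing has been counted, to avoid dividing by zero. */
static uint64_t avg_delay_ns(uint64_t count, uint64_t delay_total_ns)
{
    return count ? delay_total_ns / count : 0;
}
```

For instance, a record with cpu_count = 4 and cpu_delay_total = 2000 corresponds to an average wait of 500 ns per time the task waited for a cpu.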

As more per-task accounting packages get added to the kernel, they can
define additional fields following the instructions in
include/linux/taskstats.h and define their own userspace utilities
similar to getdelays.c.

Querying for data during a task's lifetime is done completely
independently by all the utilities (using unicast queries and replies) -
responses to queries by one are not seen by the others. The stats sent
on task exit are common and multicast to all listening utilities.



Will add this to a separate taskstats doc in Documentation/.

That would help me. I guess i also need to catch up on genetlink to
better understand taskstats code.

Please do so soon. The usage of genetlink for taskstats has gone through
a detailed review by Jamal etc. so there shouldn't be any genetlink
issues that are pertinent to the potential CSA usage of taskstats.



--Shailabh


Regards.
 - jay
