David Brownell wrote:
Interesting. I start to like this shape better; it moves more of the
logic to operating point code, where it can make the sysfs interface
talk in terms of meaningful abstractions, not cryptic numeric offsets.
But it was odd to see the first patch be platform-specific support,
rather than be a neutral framework into which platform-aware code plugs
different kinds of things...
Since it is at a low layer below a number of possible interfaces, and
since there is no generic processing performed at this low layer (it's
pretty much set or get an opaque structure), there isn't any
higher-layer framework to plug into at the moment. If something like
these abstractions of power parameters and operating points are felt to
be a good foundation for a runtime power management stack then turning
our attentions to the next layer up (perhaps cpufreq or a new
embedded-oriented stack) would create that generic structure.
Its worth noting that newer embedded SOCs are coming up with such
complicated clocking structures and rules for setting and switching
operating points that some silicon vendors are starting to provide code
at approximately the PowerOP level for their platforms, to plug into
different upper-layer power management stacks (and possibly different
open source OSes). So there may be some value to settling on common
interfaces for this.
One part I don't like is that the platform would be limited to tweaking
a predefined set of fields in registers. That seems insufficient for
subsystems that may not be present on all boards.
Yes, the code currently assumes it would be tweaked for different
variants of platforms, partly due to the difficulty of implementing a
lean and mean way of integrating the different pieces. It sounds like
registering multiple handlers for multiple sets of power parameters may
be in order, although a single opaque structure shared between upper
layers and the handlers probably won't be sufficient any more. If the
operating point data structure basically goes away and sysfs becomes the
preferred interface then it should be fairly straightforward to discover
what PM capabilities are registered and to get/set the associated power
param attributes. Otherwise in-kernel interfaces might need some
further thought to specify something that routes to the proper handler.
> Plus, to borrow some
terms from cpufreq, it only facilitates "usermode" governor models, never
"ondemand" or any other efficient quick-response adaptive algorithms.
The sysfs interface does not itself handle such schemes, but the PowerOP
layer is fine with inserting beneath in-kernel algorithms. Low-latency,
very frequent adjustments to power parameters are very much in mind for
what I'm trying to do, assuming embedded hardware will increasingly be
able to take advantage of aggressive runtime power management for
battery savings. (Much of this is driven by how embedded hardware can
most aggressively but usefully be power managed, and it would be nice to
get those folks more involved.) What DPM does with approximately the
same type of interface is setup some operating points and policies for
which operating point is appropriate in which situations, and then kick
off a kernel state machine that handles the transitions.
...
Alternatively, the "thing" could implement some adaptive algorithm
using local measurements, predictions, and feedback to adjust any
platform power parameters dynamically. Maybe it'd delegate management
of the ARM clock to "cpufreq", and focus on managing power for other
board components that might never get really reusable code. Switching
between operating points wouldn't require userspace instruction;
call it a "dynamic operating point" selection model.
Interesting, although such close coordination of changing various clocks
and voltages is required on some platforms that it would be hard to
distribute it much among kernel components. To some degree the above is
how DPM functions: some policy instructions are sent to the kernel and
the kernel switches operating points accordingly. Something more
flexible than operating points could be specified in the policy info,
possibly even something as abstract as "battery low", pushing the
interpretation of high-level power policy into kernel components instead
of a userspace app giving the kernel low-level instructions.
The DSP clock might benefit from some support though. I've never
much looked at this, beyond noting that SPUs on CELL should have
similar issues. Wouldn't it be nice to have "ondemand" style
governors for DSPs or SPUs? That's got to be easy. ;)
So far as I understand, Linux-coordinated power management of the DSP
side of dual-core general-purpose + DSP platforms is often handled by a
Linux driver that knows how to talk to whatever it is that runs on the
DSP (such as via shared memory message libs from the silicon vendor).
Soon the other core will be running Linux as well, and the two OSes will
need to coordinate the system power management, which will be an
interesting thing to tackle.
lowpwr 1 = assert ULPD LOW_PWR, voltage scale low
Could you describe the policy effect of this bit? I suspect
a good "PCs don't work like that!" example is lurking here.
That interacts with some other bits, and code ... when would
setting this be good, or bad?
This is how Dynamic Voltage Scaling is done on OMAP1 platforms.
Assuming you've setup an operating point that is validated to work at
the reduced voltage level on your hardware by TI (these are two voltage
levels available), you can optionally specify to run at reduced voltage,
possibly at an increased cost in latency of transitioning between
operating points as voltage ramps up or down. In the case of DPM
running on an OMAP 1610 H2, you could tell the system to run at 1.5V
when not idle and at 1.1V when idle, although depending on the ramp time
(I can't recall for that board, but for some non-OMAPs this can be
significant) and the realtime constraints of your app there could be
missed deadlines under such a policy. If the system isn't running
anything with a tight deadline then it may be fine to stay at 1.1V or
voltage scale between the two.
Other parameters such as DSPMMUDIV, LCDDIV, and ARM_PERDIV might also be
useful.
Again, PERDIV changes would need to involve clock.c to cascade
the changes through the clock tree. Change PERDIV and you'll
need to recalculate the peripheral clocks that derive from it...
better not do it while an I/O operation is actively using it!
On some other platforms this actually becomes necessary, but for OMAP1's
the trouble with doing so probably precludes anybody from using it.
As with TCDIV, that makes a useful example of something that is
clearly not within the "cpufreq" domain.
I'll try to cook up an XScale PXA27x example, which adds multiple memory
and system bus frequencies supported per CPU MHz, quick run vs.turbo
mode switching of CPU MHz, and some other exotic features. It has a
very specific set of "product points" validated by Intel that correspond
to the operating point abstraction. If nothing else, it may be
instructive to consider the variety of ways embedded platforms are being
designed to be power managed.
--
Todd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
|
|