Re: [RFC PATCH 25/33] Implement timekeeping for Xen

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi John,

I was hoping to catch up with you at OLS to talk about this, but, well, you know...

john stultz wrote:
Interesting. The Andi has been bugging me for a similarly designed
per-cpu TSC clocksource, but just for generic use. I'm a little
skeptical that it will be 100% without error, since anything dealing w/
the TSCs have been nothing but trouble in my mind, but this looks like a
good proving ground for the concept.

I haven't had much direct experience with it, but Ian Pratt says it has been pretty successful in practice. The big problem has been CPUs (typically AMD) which drop into a thermal throttling state without telling anyone (no interrupt, etc). Xen doesn't really support CPU speed switching yet, so that hasn't been a problem.

It was mentioned to me that the clocksource approach helped cleanup some
of the xen time changes (is that really true? :), but there were still
some outstanding issues (time inconsistencies, perhaps?). I'm just
curious if there are any details about the issues there, or if I
misunderstood?

No, the time rework did make it much easier to plug Xen in. The original Xen time patch was basically a complete copy of arch/i386/kernel/time.c with additional hacks. That patch did do more than this once, since I was primarily aiming for basically functional & clean rather than fully functional, on the assumption that the patch would generate this discussion so we could sort out the rest of the issues.

This Xen time patch deals with:

   * setting the system clock at boot
   * ticks
   * lost & stolen ticks
   * tickless idle

In addition to this, we would also like to be able to absolutely slave the guest time to the hypervisor time, so that even if the hypervisor time changes there still a very small (or ideally 0) skew between the guest time.

This does not implement setting the hypervisor clock; nor does it
implement non-independent time, so hypervisor wallclock changes will
not affect the guest.

Hmmm. I'm not sure if I understood that last line or not. I guess I need
to think a bit about CLOCK_REALTIME vs CLOCK_MONOTONIC wrt
hypervisiors.
I guess the question is "who owns time?" the guest OS (does it have its
own CLOCK_REALTIME, independent of other guests?) or does the
hypervisor? What does NTPd running on a guest actually adjust?

In Xen there are two types of guest domain: dom0 and domU. Dom0 (of which there can be only one) can perform privileged hypervisor operations, has direct access to hardware, etc; only dom0 can change hypervisor time. DomU domains are typically not privileged, only have virtual devices, etc. With respect to time, domains can either operate with "dependent time" or "independent time".

With independent time, a guest maintains its own notion of time. It may be initially set from the hypervisor, or it might use something else, like ntp. It may also choose to use the hypervisor as a clock source, or it can use any other clock. This is the easy case, since its the same as a stand-alone kernel.

With dependent time, the guest time is slaved to the hypervisor time. At any point in time, all dependent-time guests should have a very small (ideally 0) skew between each other and the hypervisor itself, regardless of how the hypervisor time changes. It would never make sense to use ntp or a non-hypervisor clocksource in a dependent-time guest. settimeofday and adjtimex would be null operations (I'm not sure if it makes sense to have a dependent-time dom0).

There are two problems in implementing this at present. The first is that the clock abstractions have nicely moved all the maintenance of the system time of day into the core, with no need for the arch-specific code to know about it. This is nice, but it isn't clear to me how we'd implement a dependent-time guest.

The second problem is that at present the only way to set the hypervisor time is with a simple stepwise settimeofday interface, rather than a time-warping adjtimex interface. This means that there's at least time potential for a other guests with slaved clocks could see non-monotonic time. We probably need to have a full time implementation of a clock within Xen itself, so that the dom0 adjtimex calls just call down into it rather than really maintaining a local clock.

The added complication is that we don't want to make a real hypercall into the hypervisor every time to fetch the time. To avoid this there is a shared memory region mapped from the hypervisor into the guests with contains periodically updated time information. This is updated by the hypervisor timer interrupt (100Hz?), and so is fairly low precision. To get more precise time, the guests extrapolate from this using the tsc. The obvious trouble with this extrapolation is that if the hypervisor time slows down, the guest could see non-monotonic time because their extrapolation over-shoots the real hypervisor tick-to-tick time step.


Also, we had quite a few discussions at OLS about introducing a general hypervisor interface layer, and how to handle time is obviously a part of that. I would imagine that most of this is a common problem to any hypervisor-based system rather than something specific to Xen (though obviously the details might vary).

+/* Permitted clock jitter, in nsecs, beyond which a warning will be printed. */
+static unsigned long permitted_clock_jitter = 10000000UL; /* 10ms */
+static int __init __permitted_clock_jitter(char *str)
+{
+	permitted_clock_jitter = simple_strtoul(str, NULL, 0);
+	return 1;
+}
+__setup("permitted_clock_jitter=", __permitted_clock_jitter);

permitted_clock_jitter is a little vague and might get confused w/ the
NTP notion of jitter. Is there a better name, or could we get a xen_
prefix there?

Sure. I copied this over without really looking into it deeply, so I need to work out what this really means. Would "permitted_backstep" might be a better name? Adding "xen_" is a no-brainer though, though it might be something which will be common to all systems with virtualized time.

+/* These are perodically updated in shared_info, and then copied here. */
+struct shadow_time_info {
+	u64 tsc_timestamp;     /* TSC at last update of time vals.  */
+	u64 system_timestamp;  /* Time, in nanosecs, since boot.    */
+	u32 tsc_to_nsec_mul;
+	u32 tsc_to_usec_mul;

Hmmm. Keeping separate cycle->usec and cycle->nsec multipliers is an
interesting optimization. I'd even consider it for the generic
clocksource code, but I suspect recalculating the independent adjustment
factors for both kills the performance benefit.  Have you actually
compaired against the cost of the /1000 going from nsec to usec?

This is also something I copied over. It wasn't obvious to me that it would be a big win either. In fact, this patch doesn't use tsc_to_usec at all, so its completely redundant.

+	int tsc_shift;
+	u32 version;

Errr.. Why is a version value necessary?
This structure is derived from the time structure Xen maintains in shared memory. If the shared memory version matches the local shadow version, we don't need to sync the two (I'm not sure if this is actually used in this patch).

+
+static DEFINE_PER_CPU(struct shadow_time_info, shadow_time);
+
+/* Keep track of last time we did processing/updating of jiffies and xtime. */
+static u64 processed_system_time;   /* System time (ns) at last processing. */
+static DEFINE_PER_CPU(u64, processed_system_time);

Errr. That would confuse me right off. Global and per-cpu values having
the same name?

Yes, I was wondering about that myself.

+/* How much CPU time was spent blocked and how much was 'stolen'? */
+static DEFINE_PER_CPU(u64, processed_stolen_time);
+static DEFINE_PER_CPU(u64, processed_blocked_time);

These seem like more generic accounting structures. Surely other
virtualized arches have something similar? Something that should be
looked into.

Yes.

+	if (shift < 0)
+		delta >>= -shift;
+	else
+		delta <<= shift;

I think there is a shift_right() macro that can avoid this.

OK.

Also I'm not sure I follow why you shift before multiply instead of
multiply before shift? Does that not hurt your precision?

It would seem so.  I'm not sure what the original thought was here.

+#ifdef __i386__
+	__asm__ (
+		"mul  %5       ; "
+		"mov  %4,%%eax ; "
+		"mov  %%edx,%4 ; "
+		"mul  %5       ; "
+		"xor  %5,%5    ; "
+		"add  %4,%%eax ; "
+		"adc  %5,%%edx ; "
+		: "=A" (product), "=r" (tmp1), "=r" (tmp2)
+		: "a" ((u32)delta), "1" ((u32)(delta >> 32)), "2" (mul_frac) );
+#elif __x86_64__
+	__asm__ (
+		"mul %%rdx ; shrd $32,%%rdx,%%rax"
+		: "=a" (product) : "0" (delta), "d" ((u64)mul_frac) );
+#else
+#error implement me!
+#endif
+
+	return product;
+}

I think we need some generic mul_llxl_ll() wrappers here.

OK.

+
+static u64 get_nsec_offset(struct shadow_time_info *shadow)
get_nsec_offset is a little generic for a name. I know xen_ prefixes
everywhere are irritating, but maybe something a little more specific
would be a good idea.

It's static, so it isn't affecting the global namespace. Do you just mean in terms of making it easy to find with tags or backtraces?

+static cycle_t xen_clocksource_read(void)
+{
+	struct shadow_time_info *shadow = &per_cpu(shadow_time, smp_processor_id());
+
+	get_time_values_from_xen();
+
+	return shadow->system_timestamp + get_nsec_offset(shadow);
+}

Does get_time_values_from_xen() really need to be called on every
clocksource_read call?

In principle it shouldn't cost much, if anything. Hm, it doesn't look like it uses the version-comparison optimisation, but even without that it just copies some values with a low likelihood of needing to iterate. The version-comparison test would eliminate the tsc_to_usec divide as well (though that's redundant anyway).

We could call it less and rely on longer extrapolations of time, but I'm not sure it's worth it when traded against the possibility of non-monotonicity.

+static void init_cpu_khz(void)
+{
+	u64 __cpu_khz = 1000000ULL << 32;
+	struct vcpu_time_info *info;
+	info = &HYPERVISOR_shared_info->vcpu_info[0].time;
+	do_div(__cpu_khz, info->tsc_to_system_mul);
+	if (info->tsc_shift < 0)
+		cpu_khz = __cpu_khz << -info->tsc_shift;
+	else
+		cpu_khz = __cpu_khz >> info->tsc_shift;
+}

Err.. That could use some comments.

Yep.

+static struct clocksource xen_clocksource = {
+	.name = "xen",
+	.rating = 400,
+	.read = xen_clocksource_read,
+	.mask = ~0,
+	.mult = 1,		/* time directly in nanoseconds */
+	.shift = 0,
+	.is_continuous = 1
+};

Hmmm. The 1/0 mul/shift pair is interesting. Is it expected that NTP
does not ever adjust this clocksource? If not the clocksource_adjust()
function won't do well with this at all, so you might consider something
like:
#define XEN_SHIFT 22
.mult = 1<<XEN_SHIFT
.shift = XEN_SHIFT

OK. I added the 1/0 pair without really thinking about the implications for ntp. It does make sense to use ntp, so I'll fix that.

+static void init_missing_ticks_accounting(int cpu)
+{
+	struct vcpu_register_runstate_memory_area area;
+	struct vcpu_runstate_info *runstate = &per_cpu(runstate, cpu);
+
+	memset(runstate, 0, sizeof(*runstate));
+
+	area.addr.v = runstate;
+	HYPERVISOR_vcpu_op(VCPUOP_register_runstate_memory_area, cpu, &area);
+
+	per_cpu(processed_blocked_time, cpu) =
+		runstate->time[RUNSTATE_blocked];
+	per_cpu(processed_stolen_time, cpu) =
+		runstate->time[RUNSTATE_runnable] +
+		runstate->time[RUNSTATE_offline];
+}

Again, this accounting seems like it could be generically useful.

OK.

+__init void time_init_hook(void)
+{
+	get_time_values_from_xen();
+
+	processed_system_time = per_cpu(shadow_time, 0).system_timestamp;
+	per_cpu(processed_system_time, 0) = processed_system_time;
+
+	init_cpu_khz();
+	printk(KERN_INFO "Xen reported: %u.%03u MHz processor.\n",
+	       cpu_khz / 1000, cpu_khz % 1000);
+
+	/* Cannot request_irq() until kmem is initialised. */
+	late_time_init = setup_cpu0_timer_irq;
+
+	init_missing_ticks_accounting(0);
+
+	clocksource_register(&xen_clocksource);
+
+	/* Set initial system time with full resolution */
+	xen_get_wallclock(&xtime);
+	set_normalized_timespec(&wall_to_monotonic,
+				-xtime.tv_sec, -xtime.tv_nsec);
+}

Some mention of which functions require to hold what on xtime_lock would
be useful as well (applies to this function as well as the previous ones
already commented on).

This is called from time_init(), which sets xtime without holding the lock. I originally took the lock here, but removed it when I noticed that time_init() didn't bother.

My only thoughts after looking at it: Using nanoseconds as a primary
unit is often easier to work with, but less efficient.  So rather then
keeping a tsc_timestamp + system_timestamp in two different units, why
not keep a calculated TSC base that includes the "cycles since boot"
which is adjusted in the same manner internally to Xen as the
system_timestamp is. Then let the timekeeping code do the conversion for
you.

It's worth considering; we'll need to consider how that changes Xen's interface, but I think we'll need to look at that anyway.

I haven't fully thought about what else it would affect in the above (I
realize stolen_time, etc is in nsecs), but it might be something to
consider.

Am I making any sense or just babbling?

Definitely makes sense. And I'd like to know your thoughts about how we can take more direct control of the system wallclock in a clean and sensible manner (or perhaps we shouldn't; maybe the right answer is that all the guests run an ntp server pointing at dom0, though that has its own downsides).

   J

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux