Re: [PATCH] [15/58] i386: Rewrite sched_clock

* Daniel Walker ([email protected]) wrote:
> On Thu, 2007-07-19 at 11:54 +0200, Andi Kleen wrote:
> > Move it into its own file for easy sharing.
> > Do everything per CPU. This avoids problems with TSCs that
> > tick at different frequencies per CPU.
> > Resync properly on cpufreq changes. The CPU frequency is unstable
> > around frequency changes, so fall back to a backing
> > clock during this period.
> > Hopefully TSC will work now on all systems except when there isn't a
> > physical TSC. 
> > 
> > And
> > 
> > +From: Jeremy Fitzhardinge <[email protected]>
> > Three cleanups there:
> >  - change "instable" -> "unstable"
> >  - it's better to use get_cpu_var for getting this cpu's variables
> >  - change cycles_2_ns to do the full computation rather than just the
> >    tsc->ns scaling.  It's a simpler interface, and it makes the function
....
> > +/*
> > + * Scheduler clock - returns current time in nanosec units.
> > + * All data is local to the CPU.
> > + * The values are approximately[1] monotonic local to a CPU, but not
> > + * between CPUs.   There might be also an occasionally random error,
> > + * but not too bad. Between CPUs the values can be non monotonic.
> > + *
> > + * [1] no attempt to stop CPU instruction reordering, which can hit
> > + * in a 100 instruction window or so.
> > + *
> > + * The clock can be in two states: stable and unstable.
> > + * When it is stable we use the TSC per CPU.
> > + * When it is unstable we use jiffies as fallback.
> > + * stable->unstable->stable transitions can happen regularly
> > + * during CPU frequency changes.
> > + * There is special code to avoid having the clock jump backwards
> > + * when we switch from TSC to jiffies, which needs to keep some state
> > + * per CPU. This state is protected against parallel state changes
> > + * with interrupts off.
> The comment still says something about interrupts off, but it looks
> like that was removed.
> 

I noticed the same thing about "interrupts off" when going through the
code. Andi, since you are already working with per-CPU variables, you
could leverage asm/local.h there by declaring last_val as local_t and
using either local_cmpxchg or local_add_return (depending on your needs).
That gives better performance than cli/sti _and_ is really atomic.

See this thread for performance tests:
http://www.ussg.iu.edu/hypermail/linux/kernel/0707.1/0832.html
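
To illustrate the idea, here is a rough user-space sketch of the
compare-and-exchange pattern applied to the "never go below last_val"
check. It is not taken from the patch or from the thread above. It uses
C11 atomics as a stand-in for local_t and local_cmpxchg, which only
exist in the kernel; clamp_monotonic() and the file-scope last_val are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU sc->last_val field. */
static _Atomic uint64_t last_val;

/*
 * Return r if it is newer than every value handed out so far,
 * otherwise the last value plus one. The update of last_val is a
 * single compare-and-exchange, so a concurrent updater (an interrupt
 * handler, in the kernel case) cannot make the result go backwards.
 */
static uint64_t clamp_monotonic(uint64_t r)
{
	uint64_t prev = atomic_load(&last_val);

	for (;;) {
		uint64_t next = r;

		if (r <= prev) {
			/* Guard against wrapping around on prev + 1. */
			assert(prev < UINT64_MAX);
			next = prev + 1;
		}

		/* On failure prev is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&last_val, &prev, next))
			return next;
	}
}
```

In the kernel the same loop would be written against a local_t, which
avoids both the interrupt disabling and the bus-locked instruction that
a full atomic_t would need.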

Mathieu

> > + */
> > +unsigned long long tsc_sched_clock(void)
> > +{
> > +	unsigned long long r;
> > +	struct sc_data *sc = &get_cpu_var(sc_data);
> > +
> > +	if (unlikely(sc->unstable)) {
> > +		r = (jiffies_64 - sc->sync_base) * (1000000000 / HZ);
> > +		r += sc->ns_base;
> 
> Looking further down you aren't using this unstable path when the tsc is
> just outright unstable (i.e. some Cyrix systems IIRC)? An improvement
> over the original code would be to catch the systems that change
> frequencies without cpufreq (like the ones that gave Thomas so much
> trouble).
> 
> > +		/*
> > +		 * last_val is used to avoid non monotonity on a
> > +		 * stable->unstable transition. Make sure the time
> > +		 * never goes to before the last value returned by the
> > +		 * TSC clock.
> > +		 */
> > +		while (r <= sc->last_val) {
> > +			rmb();
> > +			r = sc->last_val + 1;
> > +			rmb();
> > +		}
> > +		sc->last_val = r;
> > +	} else {
> > +		rdtscll(r);
> > +		r = __cycles_2_ns(sc, r);
> > +		sc->last_val = r;
> > +	}
> > +
> > +	put_cpu_var(sc_data);
> > +
> > +	return r;
> > +}
> > +
> > +/* We need to define a real function for sched_clock, to override the
> > +   weak default version */
> > +#ifdef CONFIG_PARAVIRT
> > +unsigned long long sched_clock(void)
> > +{
> > +	return paravirt_sched_clock();
> > +}
> > +#else
> > +unsigned long long sched_clock(void)
> > +	__attribute__((alias("tsc_sched_clock")));
> > +#endif
> > +
> > +static int no_sc_for_printk;
> > +
> > +/*
> > + * printk clock: when it is known the sc results are very non monotonic
> > + * fall back to jiffies for printk. Other sched_clock users are supposed
> > + * to handle this.
> > + */
> > +unsigned long long printk_clock(void)
> > +{
> > +	if (unlikely(no_sc_for_printk))
> > +		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
> > +	return tsc_sched_clock();
> > +}
> > +
> > +static void resolve_freq(struct cpufreq_freqs *freq)
> > +{
> > +	if (!freq->new) {
> > +		freq->new = cpufreq_get(freq->cpu);
> > +		if (!freq->new)
> > +			freq->new = tsc_khz;
> > +	}
> > +}
> > +
> > +/* Resync with new CPU frequency. Must run on to be synced CPU */
> > +static void resync_freq(void *arg)
> > +{
> > +	struct cpufreq_freqs *freq = (void *)arg;
> > +	struct sc_data *sc = &__get_cpu_var(sc_data);
> > +
> > +	sc->sync_base = jiffies;
> > +	if (!cpu_has_tsc) {
> > +		sc->unstable = 1;
> > +		return;
> > +	}
> > +	resolve_freq(freq);
> > +
> > +	/*
> > +	 * Handle nesting, but when we're zero multiple calls in a row
> > +	 * are ok too and not a bug. This can happen during startup
> > +	 * when the different callbacks race with each other.
> > +	 */
> > +	if (sc->unstable > 0)
> > +		sc->unstable--;
> > +	if (sc->unstable)
> > +		return;
> > +
> > +	/* Minor race window here, but should not add significant errors. */
> > +	sc->ns_base = ktime_to_ns(ktime_get());
> > +	rdtscll(sc->sync_base);
> > +	sc->cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR) / freq->new;
> > +}
> > +
> > +static void resync_freq_on_cpu(void *arg)
> > +{
> > +	struct cpufreq_freqs f = { .new = 0 };
> > +
> > +	f.cpu = get_cpu();
> > +	resync_freq(&f);
> > +	put_cpu();
> > +}
> > +
> > +static int sc_freq_event(struct notifier_block *nb, unsigned long event,
> > +			 void *data)
> > +{
> > +	struct cpufreq_freqs *freq = data;
> > +	struct sc_data *sc = &per_cpu(sc_data, freq->cpu);
> > +
> > +	if (cpu_has(&cpu_data[freq->cpu], X86_FEATURE_CONSTANT_TSC))
> > +		return NOTIFY_DONE;
> > +	if (freq->old == freq->new)
> > +		return NOTIFY_DONE;
> > +
> > +	switch (event) {
> > +	case CPUFREQ_SUSPENDCHANGE:
> > +		/* Mark TSC unstable during suspend/resume */
> > +	case CPUFREQ_PRECHANGE:
> > +		/*
> > +		 * Mark TSC as unstable until cpu frequency change is
> > +		 * done because we don't know when exactly it will
> > +		 * change.  unstable in used as a counter to guard
> > +		 * against races between the cpu frequency notifiers
> > +		 * and normal resyncs
> > +		 */
> > +		sc->unstable++;
> > +		/* FALL THROUGH */
> > +	case CPUFREQ_RESUMECHANGE:
> > +	case CPUFREQ_POSTCHANGE:
> > +		/*
> > +		 * Frequency change or resume is done -- update everything and
> > +		 * mark TSC as stable again.
> > +		 */
> > +		on_cpu_single(freq->cpu, resync_freq, freq);
> > +		break;
> > +	}
> > +	return NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block sc_freq_notifier = {
> > +	.notifier_call = sc_freq_event
> > +};
> > +
> > +static int __cpuinit
> > +sc_cpu_event(struct notifier_block *self, unsigned long event, void *hcpu)
> > +{
> > +	long cpu = (long)hcpu;
> > +	if (event == CPU_ONLINE) {
> > +		struct cpufreq_freqs f = { .cpu = cpu, .new = 0 };
> > +
> > +		on_cpu_single(cpu, resync_freq, &f);
> > +	}
> > +	return NOTIFY_DONE;
> > +}
> > +
> > +static __init int init_sched_clock(void)
> > +{
> > +	if (unsynchronized_tsc())
> > +		no_sc_for_printk = 1;
> > +
> > +	/*
> > +	 * On a race between the various events the initialization
> > +	 * might be done multiple times, but code is tolerant to
> > +	 * this .
> > +	 */
> > +	cpufreq_register_notifier(&sc_freq_notifier,
> > +				CPUFREQ_TRANSITION_NOTIFIER);
> > +	hotcpu_notifier(sc_cpu_event, 0);
> > +	on_each_cpu(resync_freq_on_cpu, NULL, 0, 0);
> > +	return 0;
> > +}
> > +core_initcall(init_sched_clock);
> > Index: linux/arch/i386/kernel/tsc.c
> > ===================================================================
> > --- linux.orig/arch/i386/kernel/tsc.c
> > +++ linux/arch/i386/kernel/tsc.c
> > @@ -63,74 +63,6 @@ static inline int check_tsc_unstable(voi
> >  	return tsc_unstable;
> >  }
> >  
> > -/* Accellerators for sched_clock()
> > - * convert from cycles(64bits) => nanoseconds (64bits)
> > - *  basic equation:
> > - *		ns = cycles / (freq / ns_per_sec)
> > - *		ns = cycles * (ns_per_sec / freq)
> > - *		ns = cycles * (10^9 / (cpu_khz * 10^3))
> > - *		ns = cycles * (10^6 / cpu_khz)
> > - *
> > - *	Then we use scaling math (suggested by [email protected]) to get:
> > - *		ns = cycles * (10^6 * SC / cpu_khz) / SC
> > - *		ns = cycles * cyc2ns_scale / SC
> > - *
> > - *	And since SC is a constant power of two, we can convert the div
> > - *  into a shift.
> > - *
> > - *  We can use khz divisor instead of mhz to keep a better percision, since
> > - *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
> > - *  ([email protected])
> > - *
> > - *			[email protected] "math is hard, lets go shopping!"
> > - */
> > -unsigned long cyc2ns_scale __read_mostly;
> > -
> > -#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > -
> > -static inline void set_cyc2ns_scale(unsigned long cpu_khz)
> > -{
> > -	cyc2ns_scale = (1000000 << CYC2NS_SCALE_FACTOR)/cpu_khz;
> > -}
> > -
> > -/*
> > - * Scheduler clock - returns current time in nanosec units.
> > - */
> > -unsigned long long native_sched_clock(void)
> > -{
> > -	unsigned long long this_offset;
> > -
> > -	/*
> > -	 * Fall back to jiffies if there's no TSC available:
> > -	 * ( But note that we still use it if the TSC is marked
> > -	 *   unstable. We do this because unlike Time Of Day,
> > -	 *   the scheduler clock tolerates small errors and it's
> > -	 *   very important for it to be as fast as the platform
> > -	 *   can achive it. )
> > -	 */
> > -	if (unlikely(!tsc_enabled && !tsc_unstable))
> > -		/* No locking but a rare wrong value is not a big deal: */
> > -		return (jiffies_64 - INITIAL_JIFFIES) * (1000000000 / HZ);
> > -
> > -	/* read the Time Stamp Counter: */
> > -	rdtscll(this_offset);
> > -
> > -	/* return the value in ns */
> > -	return cycles_2_ns(this_offset);
> > -}
> > -
> > -/* We need to define a real function for sched_clock, to override the
> > -   weak default version */
> > -#ifdef CONFIG_PARAVIRT
> > -unsigned long long sched_clock(void)
> > -{
> > -	return paravirt_sched_clock();
> > -}
> > -#else
> > -unsigned long long sched_clock(void)
> > -	__attribute__((alias("native_sched_clock")));
> > -#endif
> > -
> >  unsigned long native_calculate_cpu_khz(void)
> >  {
> >  	unsigned long long start, end;
> > @@ -238,11 +170,6 @@ time_cpufreq_notifier(struct notifier_bl
> >  						ref_freq, freq->new);
> >  			if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
> >  				tsc_khz = cpu_khz;
> > -				set_cyc2ns_scale(cpu_khz);
> > -				/*
> > -				 * TSC based sched_clock turns
> > -				 * to junk w/ cpufreq
> > -				 */
> >  				mark_tsc_unstable("cpufreq changes");
> >  			}
> >  		}
> > @@ -380,7 +307,6 @@ void __init tsc_init(void)
> >  				(unsigned long)cpu_khz / 1000,
> >  				(unsigned long)cpu_khz % 1000);
> >  
> > -	set_cyc2ns_scale(cpu_khz);
> >  	use_tsc_delay();
> >  
> >  	/* Check and install the TSC clocksource */
> > Index: linux/arch/i386/kernel/Makefile
> > ===================================================================
> > --- linux.orig/arch/i386/kernel/Makefile
> > +++ linux/arch/i386/kernel/Makefile
> > @@ -7,7 +7,8 @@ extra-y := head.o init_task.o vmlinux.ld
> >  obj-y	:= process.o signal.o entry.o traps.o irq.o \
> >  		ptrace.o time.o ioport.o ldt.o setup.o i8259.o sys_i386.o \
> >  		pci-dma.o i386_ksyms.o i387.o bootflag.o e820.o\
> > -		quirks.o i8237.o topology.o alternative.o i8253.o tsc.o
> > +		quirks.o i8237.o topology.o alternative.o i8253.o tsc.o \
> > +		sched-clock.o
> >  
> >  obj-$(CONFIG_STACKTRACE)	+= stacktrace.o
> >  obj-y				+= cpu/
> > Index: linux/include/asm-i386/timer.h
> > ===================================================================
> > --- linux.orig/include/asm-i386/timer.h
> > +++ linux/include/asm-i386/timer.h
> > @@ -6,7 +6,6 @@
> >  #define TICK_SIZE (tick_nsec / 1000)
> >  
> >  void setup_pit_timer(void);
> > -unsigned long long native_sched_clock(void);
> >  unsigned long native_calculate_cpu_khz(void);
> >  
> >  extern int timer_ack;
> > @@ -18,35 +17,6 @@ extern int recalibrate_cpu_khz(void);
> >  #define calculate_cpu_khz() native_calculate_cpu_khz()
> >  #endif
> >  
> > -/* Accellerators for sched_clock()
> > - * convert from cycles(64bits) => nanoseconds (64bits)
> > - *  basic equation:
> > - *		ns = cycles / (freq / ns_per_sec)
> > - *		ns = cycles * (ns_per_sec / freq)
> > - *		ns = cycles * (10^9 / (cpu_khz * 10^3))
> > - *		ns = cycles * (10^6 / cpu_khz)
> > - *
> > - *	Then we use scaling math (suggested by [email protected]) to get:
> > - *		ns = cycles * (10^6 * SC / cpu_khz) / SC
> > - *		ns = cycles * cyc2ns_scale / SC
> > - *
> > - *	And since SC is a constant power of two, we can convert the div
> > - *  into a shift.
> > - *
> > - *  We can use khz divisor instead of mhz to keep a better percision, since
> > - *  cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
> > - *  ([email protected])
> > - *
> > - *			[email protected] "math is hard, lets go shopping!"
> > - */
> > -extern unsigned long cyc2ns_scale __read_mostly;
> > -
> > -#define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
> > -
> > -static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> > -{
> > -	return (cyc * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
> > -}
> > -
> > +u64 cycles_2_ns(u64 cyc);
> >  
> >  #endif
> > Index: linux/include/asm-i386/tsc.h
> > ===================================================================
> > --- linux.orig/include/asm-i386/tsc.h
> > +++ linux/include/asm-i386/tsc.h
> > @@ -63,6 +63,7 @@ extern void tsc_init(void);
> >  extern void mark_tsc_unstable(char *reason);
> >  extern int unsynchronized_tsc(void);
> >  extern void init_tsc_clocksource(void);
> > +extern unsigned long long tsc_sched_clock(void);
> >  
> >  /*
> >   * Boot-time check whether the TSCs are synchronized across
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 
> 

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68