linux - Is clock_gettime() adequate for submicrosecond timing?

Question

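I need a high-resolution timer for the embedded profiler in the Linux build of our application. The profiler measures scopes as small as individual functions, so the timer needs a precision better than 25 nanoseconds.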

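Previously, our implementation used inline assembly and the rdtsc instruction to query the high-frequency timer directly from the CPU, but this is problematic and requires frequent recalibration.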

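So I tried using the clock_gettime function instead, querying CLOCK_PROCESS_CPUTIME_ID. The documentation claims this gives nanosecond timing, but I found that a single call to clock_gettime() costs more than 250ns. That makes it impossible to time events 100ns long, and such a high overhead in the timer function seriously drags down application performance, distorting the profiles until they are worthless. (We have hundreds of thousands of profiling nodes per second.)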

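Is there a way to call clock_gettime() with less than ¼μs of overhead? Or is there some other way to reliably read the timestamp counter with less than 25ns of overhead? Or am I stuck with rdtsc?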

Below is the code I used to time clock_gettime().

// calls gettimeofday() to return wall-clock time in seconds:
extern double Get_FloatTime();
enum { TESTRUNS = 1024*1024*4 };

// time the high-frequency timer against the wall clock
{
    double fa = Get_FloatTime();
    timespec spec; 
    clock_getres( CLOCK_PROCESS_CPUTIME_ID, &spec );
    printf("CLOCK_PROCESS_CPUTIME_ID resolution: %ld sec %ld nano\n", 
            spec.tv_sec, spec.tv_nsec );
    for ( int i = 0 ; i < TESTRUNS ; ++ i )
    {
        clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &spec );
    }
    double fb = Get_FloatTime();
    printf( "clock_gettime %d iterations : %.6f msec %.3f microsec / call\n",
        TESTRUNS, ( fb - fa ) * 1000.0, (( fb - fa ) * 1000000.0) / TESTRUNS );
}
// and so on for CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_THREAD_CPUTIME_ID.

Results:

CLOCK_PROCESS_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 3115.784947 msec 0.371 microsec / call
CLOCK_MONOTONIC resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2505.122119 msec 0.299 microsec / call
CLOCK_REALTIME resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2456.186031 msec 0.293 microsec / call
CLOCK_THREAD_CPUTIME_ID resolution: 0 sec 1 nano
clock_gettime 8388608 iterations : 2956.633930 msec 0.352 microsec / call

This is on a standard Ubuntu kernel. The application is a port of a Windows application (where our rdtsc inline assembly works just fine).

Addendum:

Does x86-64 GCC have an intrinsic equivalent to __rdtsc(), so that I can at least avoid inline assembly?

Best answer

No. You will have to use platform-specific code to do this. On x86 and x86-64, you can use 'rdtsc' to read the Time Stamp Counter.

Just port the rdtsc assembly you are already using.

__inline__ uint64_t rdtsc(void) {
  uint32_t lo, hi;
  __asm__ __volatile__ (      // serialize
  "xorl %%eax,%%eax \n        cpuid"
  ::: "%rax", "%rbx", "%rcx", "%rdx");
  /* We cannot use "=A", since this would use %rax on x86_64 and return only the lower 32bits of the TSC */
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return (uint64_t)hi << 32 | lo;
}
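
On the addendum's question about avoiding inline assembly: GCC and Clang expose a `__rdtsc()` intrinsic through `<x86intrin.h>`. The following is a minimal sketch, not the answerer's code. The helper names `read_tsc`, `monotonic_ns` and `tsc_ticks_per_ns` are hypothetical, the non-x86 fallback exists only so the sketch compiles everywhere, and calibrating against CLOCK_MONOTONIC is an assumption about how ticks might be converted to time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the Time Stamp Counter through the compiler intrinsic (no inline asm). */
static inline uint64_t read_tsc(void) { return __rdtsc(); }
#else
/* Assumption: on non-x86 targets fall back to CLOCK_MONOTONIC nanoseconds. */
static inline uint64_t read_tsc(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static uint64_t monotonic_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Estimate counter ticks per nanosecond by busy-waiting for at least
   spin_ns nanoseconds of CLOCK_MONOTONIC time. */
static double tsc_ticks_per_ns(uint64_t spin_ns) {
    assert(spin_ns > 0);
    uint64_t t0 = monotonic_ns();
    uint64_t c0 = read_tsc();
    uint64_t t1;
    do {
        t1 = monotonic_ns();
    } while (t1 - t0 < spin_ns);
    uint64_t c1 = read_tsc();
    return (double)(c1 - c0) / (double)(t1 - t0);
}
```

Unlike the cpuid-based version above, a bare `__rdtsc()` does not serialize the instruction stream, so very short intervals may be reordered around the read. A ratio obtained from something like `tsc_ticks_per_ns(10000000)` is only meaningful if the TSC runs at a constant rate and is synchronized across cores.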

Second-best answer

I ran some benchmarks on my system, a quad-core E5645 Xeon with constant TSC support running kernel 3.2.54. The results were:

clock_gettime(CLOCK_MONOTONIC_RAW)       100ns/call
clock_gettime(CLOCK_MONOTONIC)           25ns/call
clock_gettime(CLOCK_REALTIME)            25ns/call
clock_gettime(CLOCK_PROCESS_CPUTIME_ID)  400ns/call
rdtsc (implementation @DavidSchwarz)     600ns/call

So on a reasonably modern system, rdtsc (the accepted answer) looks like the worst route to take.
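
These figures depend on the CPU, kernel version and clock source, so they are worth re-measuring on the target machine. The following is a minimal sketch of such a measurement, not the benchmark the answerer actually ran. The helper name `ns_per_call` is hypothetical, and using CLOCK_MONOTONIC as the reference clock is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Average cost, in nanoseconds, of one clock_gettime(id) call, measured
   over `iterations` back-to-back calls with CLOCK_MONOTONIC as reference.
   Returns a negative value if the arguments or the clock are unusable. */
static double ns_per_call(clockid_t id, long iterations) {
    struct timespec start, end, scratch;

    if (iterations <= 0 || clock_gettime(id, &scratch) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; ++i)
        clock_gettime(id, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Calling it once per clock ID, for example `ns_per_call(CLOCK_MONOTONIC, 1000000)` and `ns_per_call(CLOCK_PROCESS_CPUTIME_ID, 1000000)`, gives per-call costs that can be compared directly with the table above.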

References

Is clock_gettime() adequate for submicrosecond timing?