Measuring cycles and optimized bzero() #1

Open
ghost opened this issue Mar 2, 2018 · 2 comments

@ghost

ghost commented Mar 2, 2018

The intrinsic function _rdtsc() doesn't serialize the processor, so you'll get even more 'unstable' readings from the timestamp counter... It is prudent to use your own, as in:

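#include <stdint.h>

// Note: <x86intrin.h> already defines __rdtsc(), so don't include both.
static inline uint64_t __rdtsc(void)
{
  uint32_t a, d;

  // CPUID (leaf 0) serializes the processor before RDTSC reads the counter.
  __asm__ __volatile__ ( "xorl %%eax,%%eax\n\t"
                         "cpuid\n\t"
                         "rdtsc"
                         : "=a" (a), "=d" (d) : :
#ifdef __x86_64
                         "rbx", "rcx"
#else
                         "ebx", "ecx"
#endif
  );

  return ((uint64_t)d << 32) + a;
}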
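
A minimal sketch of how a serialized read like this could be used to time a routine, assuming x86 and GCC/Clang inline asm. The names tsc_serialized, cycles_for and zero_4k are hypothetical and are not part of the original post. Here all four CPUID registers are declared as operands instead of clobbers, which avoids the #ifdef:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: CPUID (leaf 0) serializes, then RDTSC reads the counter. */
static inline uint64_t tsc_serialized(void)
{
  uint32_t a = 0, b, c = 0, d;

  __asm__ __volatile__ (
    "cpuid\n\t"
    "rdtsc"
    : "+a" (a), "=b" (b), "+c" (c), "=d" (d)
    :
    : "memory"
  );
  (void)b;
  (void)c;

  return ((uint64_t)d << 32) | a;
}

/* Hypothetical helper: elapsed TSC ticks around one call to fn(arg).
   The result includes the overhead of the CPUID/RDTSC pair itself. */
static uint64_t cycles_for(void (*fn)(void *), void *arg)
{
  uint64_t start = tsc_serialized();
  fn(arg);
  uint64_t end = tsc_serialized();

  return end - start;
}

/* Example workload (assumption): zero a 4096-byte buffer with memset(). */
static void zero_4k(void *buf)
{
  memset(buf, 0, 4096);
}
```

A single measurement is noisy, so in practice one would call cycles_for(zero_4k, buf) many times and keep the minimum or the median.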

In newer processors (those with the Enhanced REP MOVSB/STOSB feature, Ivy Bridge or later, if I'm not mistaken), a single REP STOSB is faster than the combination of REP STOSD and REP STOSB... And often even faster than using SIMD, at least for larger blocks... So, your bzero() routine can be a single macro as:

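/* Needs <stddef.h> for size_t. REP STOSB modifies EDI/ECX and writes memory. */
#define bzero(ptr, cnt) \
  do { \
    void *p_ = (ptr); \
    size_t n_ = (cnt); \
    __asm__ __volatile__ ( \
      "rep; stosb" \
      : "+D" (p_), "+c" (n_) : "a" (0) : "memory" \
    ); \
  } while (0)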

[]s
Fred

@ghost

ghost commented Mar 3, 2018

If you like, this is my implementation based on your bzero approach:

#include <stddef.h>

// This is the exported symbol for our function.
void (*_bzero)(void *, size_t);

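static void enhanced_bzero(void *ptr, size_t size)
{
  // REP STOSB stores AL at [EDI], ECX times. Both registers are modified,
  // so they are read-write operands, and memory is clobbered.
  __asm__ __volatile__ (
    "rep; stosb" : "+D" (ptr), "+c" (size) : "a" (0) : "memory"
  );
}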

static void my_bzero(void *ptr, size_t size)
{
  size_t dwords = size / 4;  // REP STOSL counts dwords, not bytes.
  size_t tail = size & 3;

  // Store as many dwords as possible.
  __asm__ __volatile__ (
    "rep; stosl" : "+D" (ptr), "+c" (dwords) : "a" (0) : "memory"
  );

  // Store the remaining (maximum 3) bytes.
  __asm__ __volatile__ (
    "rep; stosb" : "+D" (ptr), "+c" (tail) : "a" (0) : "memory"
  );
}

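// This will be called only on program initialization, nowhere else.
__attribute__((constructor))
static void bzero_init(void)
{
  // CPUID overwrites EAX, EBX, ECX and EDX, so all four are operands.
  unsigned int a = 0, b, c = 0, d;

  // Leaf 0: EAX returns the highest supported standard leaf.
  __asm__ __volatile__ (
    "cpuid" : "+a" (a), "=b" (b), "+c" (c), "=d" (d)
  );

  b = 0;
  if (a >= 7) {
    // The CPU has the REP MOVSB/STOSB enhancement? (leaf 7, EBX bit 9)
    a = 7;
    c = 0;
    __asm__ __volatile__ (
      "cpuid" : "+a" (a), "=b" (b), "+c" (c), "=d" (d)
    );
  }

  if (b & (1u << 9))
    _bzero = enhanced_bzero;
  else
    _bzero = my_bzero;
}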
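
As an alternative to hand-written CPUID asm, the feature check could be sketched with the <cpuid.h> header that ships with GCC and Clang. This assumes a compiler recent enough to provide __get_cpuid_count(). The names cpu_has_erms and ERMS_BIT are hypothetical and not part of the code above:

```c
#include <cpuid.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 9 = Enhanced REP MOVSB/STOSB.
   The macro name is an assumption made for this sketch. */
#define ERMS_BIT (1u << 9)

/* Hypothetical helper: returns 1 if the CPU reports the enhancement, else 0. */
static int cpu_has_erms(void)
{
  unsigned int a, b, c, d;

  /* __get_cpuid_count() returns 0 when the requested leaf is not supported. */
  if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
    return 0;

  return (b & ERMS_BIT) != 0;
}
```

The constructor could then pick the implementation with a single call to cpu_has_erms(), leaving register allocation around CPUID to the compiler.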
