innercoder.com

A blog for Linux programming enthusiasts

Code Performance Analysis With Assembly & C: Part 1


This blog series grew out of some old notes of mine that I thought would be useful to redo. It is always good to refresh material, and what better way than to write about it?

The series consists of presenting an algorithm in C that will be iteratively re-coded for better performance. Assembly output will be shown to reveal what's going on at the instruction level, and gprof, valgrind, and perf will be used to assess our coding efforts quantitatively. All the code presented here is in my github repo, free to use.

Before I start, I would like to mention that I find assembly language fascinating; it is one of the reasons I love programming. With assembly you can actually see what the program is doing. And it is all generated by the GNU C Compiler.

Reference machine processor: Intel Core i7-5600U

In this first part I will show some code based on the book Computer Systems: A Programmer's Perspective. It consists of an abstract data type whose members are a scalar length, an array, and a variable holding the sum of the array's contents.

Here is our abstract data type:

struct vec {
  long int len;
  long int result;
  int *data;
};

len stores the number of elements of the array, result the addition of all of them and data a pointer to the allocated array.

int main(int argc, char *argv[])
{
  long int num;
  struct vec *vec_ptr;
  srand(time(NULL));

  if (argc != 2) {
      printf("Usage: %s [ARRAY_ELEMENTS]\n", argv[0]);
      exit(EXIT_SUCCESS);
  }
  
  num = atol(argv[1]);
  if (num <= 0) {
      printf("Invalid len size. Quitting\n");
      exit(EXIT_FAILURE);
  }
  printf("User entered %ld\n", num);
  vec_ptr = new_vec(num);
  combine1(vec_ptr);

  printf("Final result %ld\n", vec_ptr->result);
  
  free(vec_ptr->data);
  free(vec_ptr);
  return(EXIT_SUCCESS);
}

The main() function calls new_vec() to initialize our structure and then combine1() to perform the addition. For our tests we will start with an array size of 500000000.

struct vec * new_vec(long int len)
{
  long int i;
  int *dat_ptr;

  struct vec *alloc = malloc(sizeof(struct vec));
  if(!alloc) {
      printf("Error allocating\n");
      exit(EXIT_FAILURE);
  }

  alloc->len = len;
  
  /* allocating space for array */
  alloc->data = malloc(len * sizeof(int));
  if (alloc->data == NULL) {
      free((void *) alloc);
      printf("Error allocating data\n");
      exit(EXIT_FAILURE);
  }
  dat_ptr = alloc->data;
  
  /* randomizing its contents */
  for (i = 0; i < len; i++)
      *(dat_ptr + i) = rand() % 10 + 1;

  return alloc;
}

This function has the simple task of allocating an array and randomizing its contents.

long int vec_length(struct vec* v)
{
  return v->len;
}

int get_vec_element(struct vec * v, long int index, int* dest)
{
  if (index < 0 || index >= v->len)
      return 0;
  *dest = v->data[index];
  return 1;
}



void combine1(struct vec* v)
{
  long int i, dest = 0;   /* dest must start at the identity value for OP */
  int val;

  for (i = 0; i < vec_length(v); i++) {
      get_vec_element(v, i, &val);
      dest = dest OP val;   /* OP is a macro, e.g. #define OP + */
  }
  v->result = dest;
}

Deficiency Analysis: Here is our first version of the function. As you can see, it has some very obvious deficiencies, such as calling vec_length() and get_vec_element() on every iteration, as well as the constant reading from memory, which you will see below. The function vec_length() simply returns the len member of struct vec, and get_vec_element() fetches the array value at a given index.

 1  combine1:
 2    pushq   %rbp
 3    movq    %rsp, %rbp
 4    subq    $40, %rsp
 5    movq    %rdi, -40(%rbp)
 6    movq    $0, -8(%rbp)
 7    jmp .L8
 8  .L9:
 9    leaq    -20(%rbp), %rdx
10    movq    -8(%rbp), %rcx
11    movq    -40(%rbp), %rax
12    movq    %rcx, %rsi
13    movq    %rax, %rdi
14    call    get_vec_element
15    movl    -20(%rbp), %eax
16    cltq
17    addq    %rax, -16(%rbp)
18    addq    $1, -8(%rbp)
19  .L8:
20    movq    -40(%rbp), %rax
21    movq    %rax, %rdi
22    call    vec_length
23    cmpq    -8(%rbp), %rax
24    jg  .L9
25    movq    -40(%rbp), %rax
26    movq    -16(%rbp), %rdx
27    movq    %rdx, 8(%rax)
28    leave
29    ret

This code was generated with gcc using the flags "-fno-asynchronous-unwind-tables -S" (no optimization). Here's a brief analysis of the listing above.

  • Lines 2-3: push the previous frame pointer onto the stack and set %rbp to the new frame position.
  • Line 4: allocates 40 bytes of stack space for locals.
  • Line 5: %rdi contains the first argument of the function, struct vec *v, which is saved on the stack.
  • Line 6: initializes i to zero.
  • Line 7: unconditional jump to label .L8.
  • Lines 20-21: load the pointer v from memory into %rax and then into %rdi for the call to vec_length().
  • Lines 22-23: vec_length() returns its result in %rax, which is compared against i.
  • Line 24: if len in %rax is greater than i at -8(%rbp), jump to label .L9; otherwise fall through, store the result, and exit.
  • Line 9: load the address of val into %rdx as the third argument for the call to get_vec_element().
  • Line 10: load i from memory again :( into register %rcx.
  • Line 11: load v from memory into register %rax.
  • Lines 12-13: move the freshly loaded values into the argument registers, %rdi 1st and %rsi 2nd (%rdx, the 3rd, was set on line 9).
  • Line 14: with the arguments ready, call get_vec_element().
  • Line 15: get_vec_element() has written to -20(%rbp); load that int value into %eax.
  • Line 16: cltq is equivalent to "movslq %eax, %rax", sign-extending 32-bit %eax into %rax.
  • Line 17: add the quad word in %rax (just obtained) to the value at -16(%rbp) (dest) and store the result back at -16(%rbp).
  • Line 18: add an immediate 1 to the value at -8(%rbp) (i).
  • The cycle continues until the comparison at line 23 fails and the jump at line 24 falls through to the exit.

Here is the gprof output, again with a test value of 500000000.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 31.04      2.23     2.23        1     2.23     2.23  new_vec
 29.35      4.33     2.10 500000000     0.00     0.00  get_vec_element
 23.45      6.01     1.68        1     1.68     4.99  combine1
 16.78      7.22     1.20 500000001     0.00     0.00  vec_length
  0.07      7.22     0.01                             frame_dummy

It shows that this code spends close to 60% of its time in calls to get_vec_element() and vec_length(). We can easily correct this.

Running perf with the command perf stat -r 10 -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses ./psum1 500000000

 Performance counter stats for './psum1 500000000' (10 runs):

      15588.601346      task-clock (msec)         #    1.001 CPUs utilized
( +-  0.77% )
    49,206,704,942      cycles                    #    3.157 GHz
( +-  0.73% ) [66.66%]
   114,537,661,031      instructions              #    2.33  insns per cycle
( +-  0.01% ) [83.33%]
        70,821,633      cache-references          #    4.543 M/sec
( +-  0.81% ) [83.34%]
        65,055,051      cache-misses              #   91.858 % of all cache refs
( +-  0.33% ) [83.34%]
    23,677,398,427      branches                  # 1518.892 M/sec
( +-  0.01% ) [83.34%]
        20,650,580      branch-misses             #    0.09% of all branches
( +-  0.55% ) [83.33%]

      15.579928644 seconds time elapsed
( +-  0.77% )

Two important measurements from this output, and the ones I am going to refer to, are the cycles per instruction (CPI) and the ratio of cache misses to instructions. The lower the CPI, the more efficiently the processor executes the program. The miss ratio tells us how well the program manages the cache: the lower the ratio, the better our memory access behavior. From my investigations, a good ratio is smaller than 5%, but I think we can do much better than we are doing right now.

Name    CPI     cache-misses / instructions (%)   instructions
psum1   0.429   0.0568                            114,537,661,031


To get a better idea of which instructions are creating the cache misses, I ran valgrind --tool=cachegrind ./psum1 500000000. This tool simulates the cache behavior of a program and can at least show us where the misses occur. After the run there will be an output file called cachegrind.out.pid, where pid is the process ID of the run. Process it with cg_annotate cachegrind.out.pid, which breaks the misses down by function.

  • psum1.c: get_vec_element
    D1mr: 31,250,002
    DLmr: 31,250,002

  • psum1.c: new_vec
    D1mw: 31,250,000
    DLmw: 31,250,000

mw stands for write misses, mr for read misses, DL for the last-level cache, and D1 for the first-level data cache. The constant reading in get_vec_element() and the writing in new_vec() are what generate the data cache misses. In our current case the misses in new_vec() cannot be reduced, because we have to touch every element to fill the newly created array. More information about cachegrind can be found here and here.

In Part 2, we will test gcc's compiler optimizations, and do another full assembly analysis with the corresponding gprof and perf profiles.
