innercoder.com

A blog for Linux programming enthusiasts

Code Performance Analysis With Assembly & C: Part 2


In this second part, the code produced at the optimization levels O1, O2, and O3 will be examined. We will be able to see what performance decisions the gcc compiler makes; for example, we will check whether the compiler hoists the call to the function that computes the vector length out of the loop.

O1 Optimization

Here is the command used to create the assembly code (the -fno-asynchronous-unwind-tables flag suppresses the .cfi unwind directives, which keeps the listing easier to read):

gcc -fno-asynchronous-unwind-tables -S -O1 -o psum1_O1.s psum1.c

Code to analyze:

struct vec {
        long int len;
        long int result;
        int *data;
};

long int vec_length(struct vec* v)
{
        return v->len;
}

void combine1(struct vec* v)
{
  long int i, dest;
  int val;

  for (i = 0; i < vec_length(v); i++) {
      get_vec_element(v, i, &val);
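      /* OP is an arithmetic macro defined elsewhere (not shown);
         the addq in the assembly below suggests it expands to + in this build */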
      dest = dest OP val;
  }
  v->result = dest;
}

Assembly output:

 1  combine1:
 2    pushq   %r12
 3    pushq   %rbp
 4    pushq   %rbx
 5    subq    $16, %rsp
 6    movq    %rdi, %rbp
 7    cmpq    $0, (%rdi)
 8    jle .L7
 9    movl    $0, %ebx
10  .L8:
11    leaq    12(%rsp), %rdx
12    movq    %rbx, %rsi
13    movq    %rbp, %rdi
14    call    get_vec_element
15    movslq  12(%rsp), %rax
16    addq    %rax, %r12
17    addq    $1, %rbx
18    cmpq    0(%rbp), %rbx
19    jl  .L8
20  .L7:
21    movq    %r12, 8(%rbp)
22    addq    $16, %rsp
23    popq    %rbx
24    popq    %rbp
25    popq    %r12
26    ret
27    .size   combine1, .-combine1
28    .section    .rodata.str1.1,"aMS",@progbits,1
29  .LC0:
30    .string "Error allocating"
31  .LC1:
32    .string "Invalid len size. Quitting"
33  .LC2:
34    .string "Error allocating data"
35    .text
36    .globl  new_vec
37    .type   new_vec, @function

Analysis of the code:

NOTE: %rdi holds the first argument of the function, which in our case is struct vec* v.

  • Lines 2-4: the callee-saved registers %r12, %rbp, and %rbx are pushed to preserve their values.
  • Line 5: 16 bytes of stack space are allocated by subtracting 16 from %rsp.
  • Line 6: the first argument, the pointer v, is copied into %rbp.
  • Lines 7-8: the value at (%rdi), which is v->len, is compared to 0; if it is less than or equal to 0, the loop is skipped.
  • Line 9: i is initialized to 0 in %ebx.
  • Lines 11-13: the address %rsp + 12, which is &val, is loaded into %rdx, i is moved into %rsi, and v is moved into %rdi, setting up the arguments for the call to get_vec_element (a sketch of that function follows this list).
  • Line 15: val, a 32-bit int, is sign-extended to a quad word and moved into %rax.
  • Line 16: val is added to dest, which is kept in %r12.
  • Line 17: 1 is added to i.
  • Lines 18-19: i is compared against v->len, the value at offset 0 from %rbp; if i is still smaller, the loop repeats. Otherwise the result is stored into v->result, %rsp is restored to its previous value, the saved registers are popped from the stack, and the function returns.
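
The register setup on lines 11-13 is just the standard x86-64 System V calling convention: %rdi, %rsi, and %rdx carry the first three integer arguments. get_vec_element itself is never listed in this post, so here is a minimal, hypothetical sketch of what it is assumed to do, namely bounds-check the index and copy element i of the data array into *val:

int get_vec_element(struct vec *v, long int index, int *dest)
{
        /* Hypothetical sketch: the real function is compiled into psum1 but not shown here. */
        if (index < 0 || index >= v->len)
                return 0;               /* index out of range */
        *dest = v->data[index];         /* copy the element through the pointer argument */
        return 1;
}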

O1 Analysis

The optimization done at this level is that the compiler recognizes that the vec_length() function simply reads a value that is already in memory, at a fixed offset from the structure pointer held in %rbp inside combine1. So, instead of calling the function on every iteration of the for loop, it inlines the access and reads the value at offset 0 from the saved structure pointer in %rbp, which is the len field.
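
To make the transformation concrete, here is a rough, hand-written C equivalent of what the -O1 assembly computes. This is only a sketch, not compiler output, and it assumes that OP is + as the addq instruction suggests; note that, just like the original source, it never initializes dest, which is why the assembly never clears %r12.

void combine1_O1_equivalent(struct vec *v)
{
        long int i, dest;       /* dest is still uninitialized, mirroring the source and %r12 */
        int val;

        if (v->len > 0) {                       /* cmpq $0, (%rdi); jle .L7 */
                for (i = 0; i < v->len; i++) {  /* v->len is re-read from memory each pass */
                        get_vec_element(v, i, &val);
                        dest = dest + val;      /* addq %rax, %r12 */
                }
        }
        v->result = dest;                       /* movq %r12, 8(%rbp) */
}

The important point is that the vec_length() call disappears, but v->len is still loaded from memory on every iteration (the cmpq 0(%rbp), %rbx on line 18); keeping it in a register is something to look for at the higher optimization levels.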

Perf output for command: perf stat -r 10 -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses ./psum1_O1 500000000

Performance counter stats for './psum1_O1 500000000' (10 runs):

       5499.112338      task-clock (msec)         #    1.001 CPUs utilized            ( +-  0.35% )
    17,367,704,415      cycles                    #    3.158 GHz                      ( +-  0.37% ) [66.63%]
    43,526,536,942      instructions              #    2.51  insns per cycle          ( +-  0.01% ) [83.32%]
        66,824,057      cache-references          #   12.152 M/sec                    ( +-  0.21% ) [83.34%]
        64,574,297      cache-misses              #   96.633 % of all cache refs      ( +-  0.03% ) [83.36%]
    10,676,703,525      branches                  # 1941.532 M/sec                    ( +-  0.01% ) [83.38%]
        20,223,608      branch-misses             #    0.19% of all branches          ( +-  0.53% ) [83.33%]

       5.496328323 seconds time elapsed                                               ( +-  0.35% )
Name         CPI     cache-misses / instructions (%)     instructions
psum1        0.429   0.56798                             114,537,661,031
psum1 - O1   0.398   1.487459                            43,526,536,942
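
As a sanity check, the CPI column follows directly from the perf counters above: CPI = cycles / instructions = 17,367,704,415 / 43,526,536,942 ≈ 0.40, the inverse of the roughly 2.5 instructions per cycle that perf reports.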


The cache-misses-per-instruction ratio went up because the number of cache misses stayed roughly the same while the instruction count dropped sharply. The CPI also improved slightly, from 0.429 to 0.398. Cachegrind reports the same cache-miss results as before, since we have not made any changes related to these two functions.

  • psum1.c: get_vec_element
    D1mr: 31,250,001
    DLmr: 31,250,001

  • psum1.c: new_vec
    D1mw: 31,250,000
    DLmw: 31,250,000

mw stands for write misses, mr for read misses, DL for the last-level data cache, and D1 for the level-1 data cache; on this machine the last level refers to L2.
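
For reference, the Cachegrind numbers above can be reproduced with a run along these lines (the exact invocation here is an assumption, since the original command is not shown in the post):

valgrind --tool=cachegrind ./psum1_O1 500000000
cg_annotate cachegrind.out.<pid>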
