In this second part, we look at the optimized code, specifically at levels O1, O2, and O3. This lets us see what performance decisions the gcc compiler makes; for example, we will check whether the compiler moves the call to the function that calculates the length (vec_length) out of the loop.
Here is the command to create the assembly code:
Code to analyze:
Analysis of the code:
NOTE: %rdi contains the first argument of the function, which in our case is struct vec *v.
- Lines 2-4: the values of registers %r12, %rbx, and %rbp are saved.
- Line 5: 16 bytes of stack space are allocated; %rsp is updated to reflect this.
- Line 6: the first argument, v, is also moved to %rbp.
- Line 7: the value that v points to is compared with 0. If it is less than or equal to 0, the function exits.
- Line 9: the initial value of i, which is 0, is held in %ebx.
- Lines 11-13: the memory address %rsp + 12, which is &val, is moved to %rdx, i is moved to %rsi, and v is moved to %rdi in preparation for the call to get_vec_element.
- Line 15: the value pointed to by &val, being an int, is converted to a quad word and moved to %rax.
- Line 16: that value is added to dest at %r12 and saved there.
- Line 17: 1 is added to i.
- Line 18: the value at %rbp + 0 is compared with i. If i is less than this value, the loop is repeated; otherwise the function restores %rsp to its previous value, pops the saved values from the stack, and exits.
The optimization done at this level is that the compiler recognizes that vec_length() only reads a value already stored in memory at an offset from the structure pointer, which combine1 keeps in %rbp. So, instead of calling this function on every iteration of the for loop, it reads the value at offset 0 from the pointer saved in %rbp, which is int len. In other words, the call is inlined; the length is still read from memory on every iteration.
Perf output for command: perf stat -r 10 -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses ./psum1_O1 500000000
| Name | CPI | cache-misses / instructions (%) | instructions |
|---|---|---|---|
| psum1 - O1 | 0.398 | 1.487459 | 43,526,536,942 |
The cache-misses-per-instruction ratio went up because fewer instructions were needed this time while the number of cache misses stayed about the same. The CPI also went down. Cachegrind returns the same cache miss results, since we have not made any changes related to these two functions.
mw stands for write misses, mr for read misses, DL for the last-level cache, and D1 for the data cache, which in our case refers to Level 2.