This blog series is based on some old notes of mine that I thought would be useful to redo. It is always good to refresh material, and what better way than to write about it.
The series will consist of presenting an algorithm in C that will be iteratively re-coded for better performance. Assembly code will be shown to illustrate what's going on at the instruction level, and gprof, valgrind, and perf will be used to quantitatively assess our coding efforts. All the code presented here is in my github repo, free to use.
Before I start, I would like to mention that I find assembly language fascinating; it is one of the reasons I love programming. With assembly you can actually see what the program is doing. And it is all generated for us by the GNU C Compiler.
Reference machine processor: Intel Core i7-5600U
In this first part I will show some code that comes from the book Computer Systems: A Programmer's Perspective. It consists of an abstract data type having as members a scalar, an array, and a variable that holds the sum of the array's contents.
Here is our abstract data type:
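A sketch of the type, using the member names from the description below (the int element type is an assumption; the canonical definition is in the repo):

```c
/* Abstract data type: an array plus its length and the
 * sum of its contents. */
struct vec {
    long len;     /* number of elements in data */
    long result;  /* sum of all the elements */
    int  *data;   /* pointer to the allocated array */
};
```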
len stores the number of elements in the array, result the sum of all of them, and data a pointer to the allocated array.
The main() function calls the function to initialize our structure and then calls the function to perform the addition. For our test we will start with an array size of 500000000.
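A sketch of the allocation routine, assuming it is named new_vec() (that name shows up later in the cachegrind discussion) and that it uses the C library's rand() for the random contents:

```c
#include <stdlib.h>
#include <time.h>

struct vec { long len; long result; int *data; };

/* Allocate a vector of len elements and fill it with
 * pseudo-random values. Returns NULL on allocation failure. */
struct vec *new_vec(long len)
{
    struct vec *v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;

    v->len = len;
    v->result = 0;
    v->data = malloc((size_t)len * sizeof *v->data);
    if (v->data == NULL) {
        free(v);
        return NULL;
    }

    srand((unsigned)time(NULL));        /* seed the generator */
    for (long i = 0; i < len; i++)
        v->data[i] = rand() % 100;      /* randomize contents */

    return v;
}
```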
This function has the simple task of allocating an array and randomizing its contents.
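A sketch of that first version, reconstructed from the deficiency analysis (the helper bodies match the behavior described in the text; the canonical listing is in the repo):

```c
struct vec { long len; long result; int *data; };

/* Return the number of elements. */
long vec_length(struct vec *v)
{
    return v->len;
}

/* Store the element at index in *val; return 0 if index
 * is out of bounds, 1 on success. */
int get_vec_element(struct vec *v, long index, int *val)
{
    if (index < 0 || index >= v->len)
        return 0;
    *val = v->data[index];
    return 1;
}

/* First version: calls vec_length() and get_vec_element()
 * on every iteration of the loop. */
void psum1(struct vec *v)
{
    long i;
    long result = 0;
    int val;

    for (i = 0; i < vec_length(v); i++) {
        get_vec_element(v, i, &val);
        result += val;
    }
    v->result = result;
}
```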
Deficiency Analysis: Here is our first version of the function. As you can see, it has some very obvious deficiencies, like calling vec_length() and get_vec_element() on every iteration, and also the constant reading from memory, which you will see below. The function vec_length() simply returns the len member of struct vec, and get_vec_element() returns the array value at a given index.
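The listing below is a reconstruction pieced together from the line-by-line annotations that follow, not a verbatim compiler dump; your gcc version may differ in details (for example, the zero-initialization of the running sum is not among the annotated lines, so it is omitted here). Line numbers are kept because the analysis refers to them.

```
 1  psum1:
 2      pushq   %rbp
 3      movq    %rsp, %rbp
 4      subq    $32, %rsp
 5      movq    %rdi, -24(%rbp)
 6      movq    $0, -8(%rbp)
 7      jmp     .L8
 8  .L9:
 9      leaq    -20(%rbp), %rdx
10      movq    -8(%rbp), %rcx
11      movq    -24(%rbp), %rax
12      movq    %rcx, %rsi
13      movq    %rax, %rdi
14      call    get_vec_element
15      movl    -20(%rbp), %eax
16      cltq
17      addq    %rax, -16(%rbp)
18      addq    $1, -8(%rbp)
19  .L8:
20      movq    -24(%rbp), %rax
21      movq    %rax, %rdi
22      call    vec_length
23      cmpq    -8(%rbp), %rax
24      jg      .L9
25      movq    -24(%rbp), %rax
26      movq    -16(%rbp), %rdx
27      movq    %rdx, 8(%rax)
28      leave
29      ret
```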
This code was generated with gcc using the flags “-fno-asynchronous-unwind-tables -S”. Here's a brief analysis of the code above.
- Line 2-3: push the previous frame pointer onto the stack and set %rbp to the current stack top as the new frame pointer.
- Line 4: reserves stack space for the locals (the new stack frame).
- Line 5: %rdi contains the first argument to the function, struct vec *v, which is saved on the stack.
- Line 6: initializes i to zero.
- Line 7: unconditional jump to label .L8.
- Line 20-21: loads pointer v from memory into %rax and then into %rdi for the call to vec_length().
- Line 23: vec_length() returns into %rax, which is compared against i.
- Line 24: if the value of len in %rax is greater than i at -8(%rbp), jump to label .L9; else, exit the function.
- Line 9: loads the address of val into register %rdx as the third argument for the call to get_vec_element().
- Line 10: loads i from memory again :( into register %rcx.
- Line 11: loads v from memory into register %rax.
- Line 12-13: move the recently loaded values into the respective argument registers: %rdi 1st and %rsi 2nd (%rdx 3rd was set up on line 9).
- Line 14: with the arguments ready, call get_vec_element().
- Line 15: get_vec_element() modified -20(%rbp); load that int value into %eax.
- Line 16: cltq is equivalent to “movslq %eax, %rax”, sign-extending the 32-bit %eax into the 64-bit %rax.
- Line 17: add the quad word in %rax (just obtained) to the value at -16(%rbp), storing the result back at -16(%rbp) (the running sum).
- Line 18: add an immediate 1 to the value at stack location -8(%rbp) (i).
- Continue the cycle until the comparison at line 23 branches to the exit.
Here is the gprof output. Again with a test value of 500000000.
[gprof flat profile output for psum1]
The output shows that this code spends close to 60% of its time in calls to get_vec_element() and vec_length(). We can easily correct this.
I also ran perf, with the command perf stat -r 1 -e task-clock,cycles,instructions,cache-references,cache-misses,branches,branch-misses ./psum1 500000000:
[perf stat counter output for psum1]
Two important measurements from this output, and the ones I am going to refer to, are cycles per instruction (CPI) and the ratio of cache misses to instructions. The lower the CPI, the more efficiently the processor executes the program. The cache-miss ratio tells us how well the program manages memory access: the lower the ratio, the better. From my investigations, a good ratio is smaller than 5%, but I think we can do much better than where it is right now.
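For reference, here is how those two ratios are computed from the raw perf counters (the helper functions and the sample numbers are mine for illustration, not values from the run above):

```c
/* Cycles per instruction: lower means the CPU retires
 * more instructions per clock cycle. */
double cpi(double cycles, double instructions)
{
    return cycles / instructions;
}

/* Cache misses as a percentage of instructions executed:
 * lower means friendlier memory access patterns. */
double miss_ratio_pct(double cache_misses, double instructions)
{
    return 100.0 * cache_misses / instructions;
}
```

Plug in the cycles, instructions, and cache-misses counts that perf stat prints to reproduce the table values.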
| Name | CPI | cache-misses / instructions (%) | instructions |
|------|-----|---------------------------------|--------------|
To get a better idea of which instructions are creating the cache misses, I ran valgrind --tool=cachegrind ./psum1 500000000. This tool simulates the cache usage of a program and can at least give us an idea of where the misses are happening. After the run there will be an output file called cachegrind.out.pid, where pid is the process ID of the run. Feed that file to cg_annotate like this: cg_annotate cachegrind.out.pid. This gives us a per-function breakdown of where the cache misses come from.
In the column names, mw stands for write misses and mr for read misses; D1 is the first-level data cache and DL the last-level cache (L3 on this machine). The constant calling of get_vec_element() and new_vec() makes the program miss in the cache. In our current case the misses in new_vec() cannot be reduced, due to the need to fill our newly created array with values. More information about cachegrind can be found here and here.
In Part 2, we will test gcc's compiler optimizations, and will also do another full assembly-code analysis with its respective gprof and perf profiles.