How to use Kcachegrind
Kcachegrind documentation: http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindDocs
Kcachegrind Handbook: http://docs.kde.org/development/en/kdesdk/kcachegrind/kcachegrind.pdf
Getting started with kcachegrind
SSH into some server and type:
To get a feeling for how the program works you can open the file:
(currently on blade-test-03, this must moved to better place)
To get information about the file
look in other files the folder
. Go to the following link for more information: Profiling#Information_about_the_run
By looking into the file "info" is this folder we see for example the file
was used as input. For more information about the raw-file goto http://lbrundb.cern.ch/rundb/fill/1185/
. Also notice that
Moore().EvtMax = 300
so only 300 events where profiled. One should always be careful when using a small amount of events, the loading time of the program might be taking a lot of the percentage in the profiler. This is easily checked by first setting Moore().EvtMax = 0.And compare the two ouputs. In the future this will be taken care of automatically by using a certain
option when producing the file to kcachegrind.
Here are some basic things explained:
- The forward and backward functions are very useful
- Incl. is the inclusive cost of a function, this means the function itself and all it's including functions. Self is only the cost of the function itself.
- Types shows the Event types, typically only Hits.
Callers are those functions who is calling the chosen function, in the column "Count" it's easy to tell what function calls the chosen function most times.
All Callers includes also the indirect functions that is calling this function, then a distance is added if distance = 1 the function also exists in Callers. Obviously all functions is called one way or the other from _libc_start_main function.
Source shows the hotspots in the code, this tab is in my opinion the most important one!
Callee Map has a graphical view where the size of the box is most of the time proportional to the time spent in there.
There are some useful commands in the menu that is explained:
- If you for some reason don't have the source-code in the same place as the compiled version of your program, you can easily fix this with kcachegrind.
- Click on "Settings" and choose "Configure KCachegrind..."
- If you want to add a folder with the source code click on "Add.." and after this choice the source-file will be found if possible. Make sure that there is not too many files in the folder you add since the kcachegrind will search through all the files until the wanted file is found.
- Another useful command is to change "Context lines in annotations", this value controls how much of the source-file you will see excluded the function that you marked in the list. This is useful to get an overview what the class is doing in total.
Analyzing a hotspot
So if we know want to analyse the HLT based on the data from
a good way to do it is the following:
- Start sorting by "Self" or "Incl." in the function-list.
- Find the function that takes most time (and one you can modify). For "Self" it is "PatFwdTool::distanceHitToTrack" and for "Incl." it is "PatFwdTool::fitXProjection".
- Choose the one with biggest "Incl." first and as we will see "PatFwdTool::distanceHitToTrack" is used inside "PatFwdTool::fitXProjection".
- This function is called 3135 times out of total 18157 times (total times can be seen in the bottom of the window). This means that this function is used a lot.
- By looking at "Callers" we can see that "PatFwdTool::fitXCandidate" has called "PatFwdTool::fitXProjection" 2778 times. And notice also that "PatFwdTool::fitXCandidate" is very high in the function list.
- Now compare "Incl." with "Self", we see that 13.96% in "Incl." compared to 1.76% in "Self" this is telling you that it is not the function itself that takes a lot of time, but it is the functions that "PatFwdTool::fitXProjection" calls that take a lot of time.
With this in mind we know what we are looking for!
So just by scrolling down a little bit we find "PatFwdTool::distanceForFit" with 1199 calls and 6,60%. Now first you have to realise a few things, in the column "#" there is sometime a number sometime not. This number is the line number in the source file. When a line number is written into the column "#" then the number in the same row at column "Hit" this is how much that contributes to "Self" of the function "PatFwdTool::fitXProjection". When the column "#" is empty at a certain row but a number is shown in "Hit" and also a nice square to the left of "Hit" this number is the "Incl." of the function that is used inside of the line.
This means on line 528: 0.04 is the contribution that adds into "Self" of "PatFwdTool::fitXProjection" and this is small since the only thing that happends is that "double dist" assigns to a value.
- But 6,60% is the "Incl." of "PatFwdTool::distanceForFit" that is used inside of "PatFwdTool::fitXProjection".This is ofcourse not counted into "Self" by definition. Notice that the total "Incl." that "PatFwdTool::distanceForFit" is 6,71% and 6,60% is used inside "PatFwdTool::fitXProjection".
To continue from here is very easy! Just double-click on "PatFwdTool::distanceForFit" and the profiler will take you there. Now again look at "Incl." and "Self" we see "Incl." is 6,71% and "Self" is 1.35%. This means that still it is not the function itself that is the hotspot, it is the functions that is called from within the function. So we find this time that it's "PatFwdTool::distanceHitToTrack" that has 3,7% in "Incl.". Double click on this function we see that 5.46% in "Incl." and 2.30% in "Self", hurray we are getting closer to the source that is taking all our cpu power!
By looking at the code we directly can see that this code is not perfectly optimized. To see how we can optimize read on to next section.
How to optimize PatFwdTool::fitXProjection
From the section above we have found our bottleneck. It is "PatFwdTool::distanceForFit" and "PatFwdTool::distanceHitToTrack", now that we know that how do we optimize the function?
By studying the source code of "PatFwdTool::distanceForFit" we first see that 0.96 of total 1.35% is used by a divison on line 72. Division is the worst operation one can do especially with doubles! It always forces a lot of stalls, this we want to avoid.
So how do we avoid the division?
First we look how the variable "dist" is defined. On line 70 we have the biggest cpu-killer in this function "PatFwdTool::distanceHitToTrack". So lets have a look at this function and on line 82 the mistake appears, here "dist" is multiplied by "track.cosAfter()". But going back to "PatFwdTool::distanceForFit" we see that "dist" is divided by "track.cosAfter()".
An easy fix to this is to:
- Calculate "dx/track.cosAfter()" every time the value of "hit->driftDistance()" is changed.
- Fetch this value on line 83 in "PatFwdTool::distanceHitToTrack" instead of the old one and use this value on all placed where dx is now.
- And then you never need to first multiply "dist" by "track.cosAfter()" and then divide by "track.cosAfter()".
- This can always be done as long as the value of "hit->driftDistance()" is not changed more than "PatFwdTool::distanceForFit" is used.
-- Main.athuress - 03 Aug 2010