The fundamentals of software performance analysis

Sahiladhav
15 min read · Jun 1, 2022


Just as we need to know the speed at which hardware modules execute to design a hardware system, we need to analyze the performance of programs to build complex software systems. Unfortunately, software performance analysis is much more difficult than hardware timing analysis.

Synchronous hardware design restricts the structure of digital logic so that we can accurately bound delay with reasonable effort. Software designers, in contrast, insist on the full power of Turing machines, which makes it much more difficult to find tight bounds on execution time.

We generally measure program execution time from program initiation at presentation of some inputs to termination at the delivery of the last outputs. Several different measures of software performance are of interest.

  • Worst-case execution time (WCET) — the longest execution time for any possible combination of inputs
  • Best-case execution time (BCET) — the shortest execution time for any possible combination of inputs
  • Average-case execution time for typical inputs
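These three measures can be illustrated with a toy example. The sketch below uses a deterministic cost model (counting abstract steps rather than measuring wall-clock time) and a hypothetical four-element input space small enough to enumerate exhaustively; the function and its costs are invented for illustration.

```python
from itertools import product

def count_nonzero(data):
    """Toy program: returns (result, cost), where cost counts abstract steps."""
    cost = 0
    n = 0
    for x in data:
        cost += 1        # one step to examine each element
        if x != 0:
            cost += 2    # extra work on the taken branch
            n += 1
    return n, cost

# Enumerate every input combination: 4 elements, each 0 or 1.
# Feasible here; hopeless for realistic input spaces.
costs = [count_nonzero(bits)[1] for bits in product((0, 1), repeat=4)]

wcet = max(costs)              # all elements nonzero: 4*1 + 4*2 = 12
bcet = min(costs)              # all elements zero: 4
avg = sum(costs) / len(costs)  # mean cost over uniform inputs: 8.0
print(wcet, bcet, avg)
```

Exhaustive enumeration is exactly what the static techniques described in the rest of this article replace: real programs have far too many input combinations to try them all.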

Average-case performance is used for very different purposes than are worst-case and best-case execution times. Average performance is used to tune software (and perhaps the underlying hardware platform); average performance data is particularly useful when it is recorded not just for the entire program but for program units. Average-case behavior can be used to find hot spots due to poor algorithms, bad coding, poor choice of instructions, or other causes.

Average performance is typically evaluated using CPU simulators. Worst-case and best-case execution times, in contrast, are used during schedulability analysis. Many scheduling algorithms and analysis methods guarantee schedulability based on knowledge of execution time.

Many scheduling algorithms assume that the execution time of a program is constant, which is not realistic given the data-dependent behavior of most interesting programs. WCET is often used as a substitute for exact execution time. However, short execution times can cause scheduling problems that do not occur when all programs run to the worst-case limits, so a more helpful strategy is to measure both worst-case and best-case execution times.

We cannot bound execution times through simulation because interesting programs have far too many input combinations to enumerate. We need to analyze the program to determine its timing behavior. When analyzing hardware to determine its maximum clock rate, we rely on the fact that combinational logic is generally acyclic to simplify analysis. When we analyze programs, we must handle arbitrary control flows. As illustrated in Figure 1, performance analysis algorithms must handle not only branches but also loops and other types of cyclic control flow.

Fig 1 : Execution paths through a program

The basic approach to worst-case execution time analysis still followed today was suggested by Shaw [Sha89] and extended by Park and Shaw [Par91]. They broke WCET analysis into two phases:

  1. Analyze the program to find the worst-case path through it. This problem is sometimes called path analysis.
  2. Measure the execution time along the worst-case path. Let us call this problem path timing.
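On an acyclic control-flow graph the two phases are straightforward; the sketch below finds the most expensive path through a tiny hypothetical CFG (invented block names and per-block worst-case cycle counts) and reports its time. Loops are what make the real problem hard.

```python
# Hypothetical CFG for an if/else: block -> successors, and each block's
# worst-case execution time in cycles.
succ = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
cost = {"entry": 5, "then": 20, "else": 8, "join": 3}

def worst_case_path(block):
    """Path analysis + path timing: longest path from `block` to exit."""
    if not succ[block]:                # exit block
        return cost[block], [block]
    # Take the most expensive continuation (phase 1: path analysis).
    best_time, best_path = max(worst_case_path(s) for s in succ[block])
    # Add this block's cost along the path (phase 2: path timing).
    return cost[block] + best_time, [block] + best_path

wcet, path = worst_case_path("entry")
print(wcet, path)   # 28 ['entry', 'then', 'join']
```

For real CFGs the recursion would be memoized over a topological order; written this way it is exponential in the number of branches.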

The initial step in this methodology was to efficiently identify the worst-case execution path through the program. Park and Shaw developed a performance model of the 68000 that gave them good results. Unfortunately, modern high-performance microprocessors are much more complex, and the timing of instruction sequences is much more difficult to determine. Research over the past 20 years has improved both the path identification process and the processor performance models.

Path analysis and path timing interact to some extent for complex machines. If the execution time of an instruction depends on the previous instructions, as is the case for both caches and pipelines, then we cannot easily find long paths without a detailed analysis of the timing of path segments. We can handle this by finding candidate long paths with abstract timing models and then refining the analysis using detailed timing models on a smaller set of paths.

Performance Models

Determining the performance of a program fragment generally requires reducing high-level language code to instructions. Compilers perform many complex, interacting transformations on high-level language programs that make it difficult to directly determine the execution time of high-level language statements.

Given a sequence of instructions and a CPU, we want to determine the WCET of that sequence on the processor. The simplest case would be if all instructions took the same amount of time to execute; we could then simply count instructions and multiply by the per-instruction execution time. However, that ideal world has never existed. Even early computers were designed to take varying amounts of time to execute different instructions; the PDP-8, for example, exhibited large variations between its fastest and slowest instructions.

The performance model Park and Shaw [Par91] developed for the 68000 was, by today's standards, relatively simple. Although the 68000 had a large instruction set, it had only a simple pipeline and no cache. The manufacturer provided a table that gave the execution time of the various opcodes. The execution time for a code fragment could be accurately determined by looking up the various instructions' execution times.

Today’s processors do not admit this simple approach because the execution time of an instruction depends on the processor state in various ways. The pipeline is one important source of variation. The execution time of an instruction depends in part on the behavior of other instructions in the pipeline. Two instructions that compete for the same resource will execute more slowly when executed together than when executed at widely separated times.

The memory system is an even larger source of variation. Caches can introduce an order of magnitude or more of variation in memory access time, depending on whether an access is a hit or a miss. Parallel memory systems can add further variation, as can DRAM refreshing, page modes, and other details of the memory components used.

Wilhelm [Wil04] used the term timing accident to describe the cause of an increase in the instruction time in a program, and timing penalty to describe the amount of the increase. Timing accidents can come from many different mechanisms at any stage in program execution. Wilhelm points out that the interference between these mechanisms makes for nonintuitive results, in which the best case for one part of the system may lead to longer total execution time.

Path Analysis

The goal of abstract program flow analysis is to bound the set of feasible paths. Because exact path analysis is equivalent to the halting problem, an analysis cannot find the exact set of paths and may include some paths that are not feasible. Puschner and Koza [Pus89; Pus93] analyzed the execution time of structured programs.

They syntactically analyzed the program and assigned an execution time for each element in the parse tree. The execution time of the program was computed as the sum of the execution times of the statements. They added a MAX COUNT construct to their programming language to allow programmers to specify the maximum number of times that a loop will be executed.

Today, many WCET methodologies use integer linear programming (ILP) to implicitly solve for paths. A set of constraints describes the structure of the program and some aspects of its behavior. An ILP solver finds the values for variables that identify the longest path through the program; this is done without enumerating all the paths.

Li and Malik [Li97c] modeled a program with a system of constraints: structural constraints describe conditionals and other control flow; finiteness and start constraints help bound loop iterations; tightening constraints come either from the user or from an analysis of infeasible paths. (Figure 3–31 in Part 2 of this series illustrates the use of structural constraints.)

Each edge in the CDFG is represented by a variable whose value is equal to the number of times the program control flow passes through that edge. Conservation of flow tells us several facts: i = o, since the conditional is exited the same number of times it is entered; i = a + b and o = r + s for similar reasons. This implies that a + b = r + s. A constraint solver can use all the constraints collected for the program to solve for a bound on the number of times each edge is executed.
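A minimal sketch of this implicit path enumeration for a single conditional, using the edge variables named above (i, a, b, r, s, o) and invented edge costs. A real tool hands the constraints to an ILP solver; here the tiny search space is simply brute-forced.

```python
from itertools import product

cost_a, cost_b = 20, 8   # invented worst-case cycles on the then/else edges

best = None
for a, b, r, s in product(range(2), repeat=4):
    i = o = 1                      # the conditional executes exactly once
    if i != a + b or o != r + s:   # conservation of flow at entry and exit
        continue
    if a != r or b != s:           # each branch is exited as often as entered
        continue
    total = cost_a * a + cost_b * b
    if best is None or total > best:
        best = total               # objective: maximize total execution time

print(best)   # 20 -- the then-edge lies on the worst-case path
```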

Figure 2 shows the program flow constraints for a while statement. In this case, we know that i + b = o + t. Since C defines for loops in terms of while loops [Ker78], we can use this construct to build for loops as well. User constraints can easily be added to an integer linear program. The user could, for example, bound the number of iterations of a while loop by writing an inequality involving the while loop’s b variable.

Fig 2 : Program flow statements in a while loop

The objective function for the integer linear program is to maximize the total execution time, summed over the flow through the network. Li and Malik [Li97c] used standard solvers to solve these systems of constraints. They evaluated their technique by analyzing several benchmark programs to find their worst-case inputs and related execution time bounds.

Figure 3 shows the results of their experiments, in which they compared their analysis method to both hand-analyzed WCET bounds and measured execution times of worst-case inputs on an Intel i960 CPU.

Fig 3 : Experimental evaluation of the accuracy of ILP-based WCET analysis

Li et al. [Li95] added instruction cache behavior to the model. They did this by breaking the program into cache-line-sized units. In the basic model, a block in the CDFG represents straight-line code. The extended model handles direct-mapped instruction caches. Each straight-line code segment is broken into one or more units known as l-blocks that correspond to cache lines.

Each l-block has two execution times, one for a cache hit and one for a cache miss. Two l-blocks conflict if the execution of one will cause the other to be displaced from the cache; two l-blocks cannot conflict if they do not map onto the same cache line. A cache conflict graph is constructed for every cache line that has two or more conflicting l-blocks to keep track of the state of the cache lines.

As shown in Figure 4, the graph has a start node s and an end node e, as well as a node for every l-block mapped into the cache line. The graph has an edge from one node to another if a path exists from the first basic block to the second, and the path does not require passing through other l-blocks mapped into the same cache line; in other words, an edge represents a direct conflict caused when one l-block knocks another out of the cache. Constraints from the cache conflict graph are added to the solution.
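The conflict test itself is simple; the sketch below maps hypothetical l-blocks into a tiny direct-mapped cache and lists the pairs that can displace each other (all addresses, sizes, and names are invented):

```python
LINE_SIZE = 16   # bytes per cache line
NUM_LINES = 4    # tiny direct-mapped instruction cache

# l-block name -> starting address of its cache-line-sized code chunk
lblocks = {"B1.0": 0x00, "B1.1": 0x10, "B2.0": 0x40, "B3.0": 0x80}

def cache_line(addr):
    """Direct-mapped placement: line index derived from the address."""
    return (addr // LINE_SIZE) % NUM_LINES

names = sorted(lblocks)
conflicts = [(x, y)
             for idx, x in enumerate(names)
             for y in names[idx + 1:]
             if cache_line(lblocks[x]) == cache_line(lblocks[y])]

print(conflicts)   # B1.0, B2.0, and B3.0 all map to line 0
```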

Fig 4 : A cache conflict graph for two cache blocks

The programmer may have knowledge of the program’s behavior that is difficult for a timing analysis tool to reproduce. Many timing analysis methodologies allow user constraints to be incorporated into their analyses. User constraints can provide tighter bounds, assuming of course that the constraints accurately reflect program behavior. Exactly how these user constraints are incorporated into the analysis depends on the analytical technique used.

Path Timing

Several techniques are used to analyze path timing at different levels of abstraction. Abstract interpretation analyzes the behavior of the program in detail to more accurately understand the execution states of the program. Data flow analysis provides a more detailed view of how the program behaves. The most concrete technique is simulation, which is used to determine the details of how the processor responds to the program.

Bounding the number of iterations taken by a loop is particularly important for WCET analysis given that much of the execution time of most programs is spent in loops. Some loops, such as the FIR filter, are easy to analyze.

But loops with conditionals create problems.

The algorithm of Healy et al. [Hea99a] bounds loop iterations in four phases:

  1. It first uses an iterative algorithm to identify branches that affect the number of iterations.
  2. It then identifies the loop iteration on which a loop-dependent branch changes direction.
  3. It determines when the branches found in step 1 are reached.
  4. Finally, it uses these results to calculate the iteration bound.

Figure 5 shows an example of loop iteration bounding from Healy et al. [Hea99a]. The loop can be exited by a complex condition.

Fig 5 : An example of loop iteration bounding

In the first step, it identifies the four control-flow boxes with jumps as the points at which loop-dependent branches can change direction. It then determines that the first conditional branch, j > 75, occurs on iteration 26 (since j is incremented by 3 on each iteration). The condition j > 300 is taken on iteration 101, as is the condition j < 100. This gives a bound on the number of iterations as a minimum of 26 and a maximum of 101.

Vivancos et al. [Viv01] proposed a parametric timing analysis method to determine the timing behavior of programs with indefinite loops. They iteratively test the worst-case execution time of the program, starting with zero iterations and increasing the number of iterations by one at each step. Their testing terminates when the WCET stabilizes.

Figure 6 shows a simple conditional, its associated CDFG, and the flow variables assigned to the edges in the CDFG.

Fig 6 : Program flow constraint in an if statement
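The branch-flip arithmetic in the Healy et al. example can be sketched directly. We assume, hypothetically, that j starts at 1 and is incremented by 3 per iteration, which reproduces the iteration counts quoted above:

```python
def first_iteration(cond, j0=1, step=3, limit=1000):
    """First iteration number on which a loop-dependent branch condition
    becomes true; None if it never fires within `limit` iterations."""
    j = j0
    for it in range(1, limit + 1):
        if cond(j):
            return it
        j += step
    return None

print(first_iteration(lambda j: j > 75))    # 26
print(first_iteration(lambda j: j > 300))   # 101
```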

Ermedahl et al. [Erm05] use a clustering technique to determine which parts of a program must be handled as a unit. They annotate the control flow graph of the program with flow facts, which include the defining scope, a context specifier, and a constraint expression. The constraints may involve execution count variables and constants. As illustrated in Figure 7, they cluster facts together to find the maximum scopes over which flow facts apply, such as in nested loops.

Fig 7 : Example code (top) and flow facts and clusters (bottom)

Healy et al. [Hea99b] use static cache simulation to categorize each instruction’s behavior in the cache. Given the CDFG for the program, they use an iterative algorithm to determine the program lines that may be in the cache during the entry and exit of each basic block. They categorized the behavior of instructions in the instruction cache.

Worst-case categories include the following:

  • Always miss — Not guaranteed to be in the cache when referenced.
  • Always hit — Guaranteed to be in the cache when it is referenced.
  • First miss — Not guaranteed to be in the cache on the first loop iteration but guaranteed to be in the cache on subsequent iterations.
  • First hit — Guaranteed to be in the cache on the first iteration but not guaranteed on subsequent iterations.

Best-case categories include the following:

  • Always miss — Guaranteed not to be in the cache when first referenced.
  • Always hit — May be in the cache every time it is referenced.
  • First miss — Guaranteed not to be in the cache on the first loop iteration but may be on later iterations.
  • First hit — Instruction may be in the cache on the first loop iteration but is guaranteed not to be in the cache for later iterations.
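A sketch of how the worst-case categories translate into a fetch-cycle bound over a loop; the hit/miss latencies and the instruction mix are hypothetical:

```python
HIT, MISS = 1, 10   # invented cycles for a cache hit / miss

def worst_case_fetch_cycles(category, iterations):
    """Worst-case instruction-fetch cycles across all loop iterations."""
    if category == "always_miss":
        return MISS * iterations
    if category == "always_hit":
        return HIT * iterations
    if category == "first_miss":             # miss once, hit afterward
        return MISS + HIT * (iterations - 1)
    if category == "first_hit":              # hit once, then assume misses
        return HIT + MISS * (iterations - 1)
    raise ValueError(category)

# three instructions in a 100-iteration loop
loop = [("i1", "first_miss"), ("i2", "always_hit"), ("i3", "always_miss")]
total = sum(worst_case_fetch_cycles(cat, 100) for _, cat in loop)
print(total)   # 109 + 100 + 1000 = 1209
```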

To analyze the behavior of the instructions in the pipeline, Healy et al. used a table to describe, for each type of instruction, worst-case and best-case number of cycles required at every stage of the pipeline.

They analyzed paths within a loop iteration or a single invocation of a function, which could consist of multiple basic blocks, and gave an algorithm for determining the worst-case timing along such a path.

Abstract interpretation executes a program using symbolic values rather than specific numeric values [The99]. This technique allows us to generalize information about program execution across sets of variable values in a reasonable amount of computation time. Abstract interpretation first determines concrete semantics for a program that describes a particular aspect of the program’s behavior, such as cache or pipeline behavior.

Based on the concrete semantics, an abstract semantic is built that operates on the program’s abstract domain. An element of the abstract domain of the program represents a set of members of the concrete domain. An abstract state represents a set of concrete states; the abstract representation may be conservative in that it includes concrete states that do not actually occur.

The concrete semantics for cache state represent how cache lines map into memory blocks. The abstract cache states map a cache line to a set of memory blocks. Theiling et al. [The99] described three types of analysis of this abstract state: must analysis, may analysis, and persistence analysis.

Must analysis looks at the upper bounds of the ages of the memory blocks. When combining two cache states, a memory block stays in the abstract cache only if it is in both predecessors and it receives the larger of their ages. This operation is similar to set intersection.

May analysis looks at the lower bounds on the ages of the memory blocks. A block is in the abstract cache if it is in one predecessor and it receives the youngest of the cache ages. Persistence analysis does not know whether the first access will be a hit or a miss but does guarantee that succeeding accesses will be hits.
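The joins for must and may analysis can be sketched directly from these descriptions. An abstract cache state for one line is modeled as a map from memory-block name to age (names and ages invented); the sketch ignores how ages advance on ordinary accesses:

```python
def must_join(sa, sb):
    """Must analysis: a block survives the join only if it is in both
    predecessor states (set intersection), keeping the larger age."""
    return {blk: max(sa[blk], sb[blk]) for blk in sa.keys() & sb.keys()}

def may_join(sa, sb):
    """May analysis: a block survives if it is in either predecessor
    (set union), keeping the smaller (youngest) age."""
    out = {}
    for blk in sa.keys() | sb.keys():
        out[blk] = min(s[blk] for s in (sa, sb) if blk in s)
    return out

a = {"m1": 0, "m2": 2}
b = {"m1": 1, "m3": 0}
print(must_join(a, b))   # {'m1': 1}: only m1 is guaranteed to be cached
print(may_join(a, b))    # m1, m2, and m3 may all be cached
```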

Wilhelm [Wil04] argues for a modular approach to timing analysis in which abstract interpretation is used to analyze processor behavior, and integer linear programming is used to analyze program path behavior. Abstract interpretation helps to exclude timing accidents that cannot occur and provides the worst-case execution time for basic blocks within their execution context. ILP determines an upper bound on execution time and the associated path.

Some aspects of program performance require detailed understanding of the processor state. So long as these effects can be isolated to small segments of the program, simulation is a reasonable and valuable tool for detailed WCET analysis. Engblom and Ermedahl [Eng99a] used simulation to analyze pipeline effects. They break the execution time of a code sequence into the individual execution times plus the timing effects for subsequences. Timing effects capture the interactions between instructions.

Engblom and Ermedahl use a cycle-accurate simulator to determine the execution time of a sequence of instructions. They do not use static analysis to determine whether combinations of instructions should have timing effects, because a CPU can exhibit a large number of inter-instruction dependencies that may depend on data values and may not be well documented.

Instead, Engblom and Ermedahl use the simulator to discover timing effects by executing successively longer code sequences. The length of a sequence that is tested is bounded by the window, typically equal to the length of the pipeline, in which the CPU can exhibit inter-instruction pipeline effects. Healy et al. [Hea99b] use tables of structural and data hazards to analyze pipeline interactions in instruction sequences.

Engblom and Jonsson [Eng02] developed a model for single-issue in-order pipelines. Given a pipeline with n stages, instruction i takes a certain number of clock cycles c_{i,j} in stage j, depending on the resources it needs. Let t_{i,j} be the time at which instruction i enters pipeline stage j. A sequence of m instructions in the pipeline obeys two constraints:

t_{i,j+1} >= t_{i,j} + c_{i,j} (EQ-11)

which says an instruction cannot enter the next stage until it has spent its required cycles in the current one; and

t_{i+1,j} >= t_{i,j} + c_{i,j} (EQ-12)

which constrains the next instruction from entering a stage until the last instruction has finished. Branch instructions generate dependencies between the stage b at which the branch is decided and instruction fetching:

t_{i+1,1} >= t_{i,b} + c_{i,b} (EQ-13)

Data dependencies generate additional dependencies, as

t_{i,j} >= t_{k,l} + c_{k,l} (EQ-14)

in which the data dependency is between instructions i and k, and instruction k must finish stage l before instruction i can proceed to stage j. Longest-path algorithms can be used to solve these systems of constraints.
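For a straight-line sequence, constraints (EQ-11) and (EQ-12) form an acyclic system, so the entry times can be computed by a longest-path recurrence. A minimal sketch with invented cycle counts, ignoring the branch and data dependencies of (EQ-13) and (EQ-14):

```python
def entry_times(c):
    """t[i][j]: time instruction i enters stage j, for per-stage cycle
    counts c[i][j] (stages indexed from 0 here)."""
    m, n = len(c), len(c[0])
    t = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if j > 0:   # EQ-11: finish stage j-1 before entering stage j
                t[i][j] = max(t[i][j], t[i][j - 1] + c[i][j - 1])
            if i > 0:   # EQ-12: wait until instruction i-1 leaves stage j
                t[i][j] = max(t[i][j], t[i - 1][j] + c[i - 1][j])
    return t

# three instructions in a three-stage pipeline; the middle one stalls
# for 3 cycles in the execute stage
c = [[1, 1, 1],
     [1, 3, 1],
     [1, 1, 1]]
t = entry_times(c)
finish = t[-1][-1] + c[-1][-1]
print(finish)   # 7: the stall delays the last instruction's completion
```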

Engblom and Jonsson used this formulation to determine which types of pipelines have long timing effects. They defined a crossing critical path and showed that pipelines with this property do not have long timing effects that increase execution time.

They showed that a single in-order pipeline has the crossing critical path property if each constraint introduced by branching and data dependencies either occurs between adjacent instructions or is subsumed by the basic constraints of (EQ-11) and (EQ-12).

Pipelines with forks do not in general exhibit the crossing critical path property, and such pipelines are widely used in high-performance processors. For example, most processors with a separate floating-point pipeline do not have the crossing critical path property.

That’s it for this blog.

Do share this blog with your friends to spread the knowledge.

Keep Learning :)

Writers :

  • Aaditi Badgujar
  • Sahil Adhav
  • Aditya Sood
  • Ayush Prasad
