Monday, February 11, 2008

Automated energy/performance macromodeling of embedded software

Abstract:

Efficient energy and performance estimation of embedded software is a critical part of any system-level design flow. Macromodeling-based estimation is an attempt to speed up estimation by exploiting the reuse that is inherent in the design process. Macromodeling involves pre-characterizing reusable software components to construct high-level models, which express the execution time or energy consumption of a sub-program as a function of suitable parameters. During simulation, macromodels can be used instead of detailed hardware models, resulting in orders of magnitude simulation speedup. However, in order to realize this potential, significant challenges need to be overcome in both the generation and use of macromodels, including how to identify the parameters to be used in the macromodel, how to define the template function to which the macromodel is fitted, etc. This paper presents an automatic methodology to perform characterization-based high-level software macromodeling, which addresses the aforementioned issues. Given a sub-program to be macromodeled for execution time and/or energy consumption, the proposed methodology automates the steps of parameter identification, data collection through detailed simulation, macromodel template selection, and fitting. We propose a novel technique to identify potential macromodel parameters and perform data collection, which draws from the concept of data structure serialization used in distributed programming. We utilize symbolic regression techniques to concurrently filter out irrelevant macromodel parameters, construct a macromodel function, and derive optimal coefficient values to minimize fitting error. Experiments with several realistic benchmarks suggest that the proposed methodology improves estimation accuracy and extends the applicability of macromodeling to complex embedded software, while realizing its potential for estimation speedup. We describe a case study of how macromodeling can be used to rapidly explore algorithm-level energy tradeoffs for the zlib data compression library.










Categories and Subject Descriptors
I.6.5 [Computing Methodologies]: Simulation and Modeling - Model development, Modeling methodologies; D.2.8 [Software]: Software Engineering - Metrics, Performance measures; C.4 [Computer Systems Organization]: Performance of Systems - Modeling Techniques
General Terms
Design, Measurement
Keywords
Data Serialization, Embedded Software, Genetic Programming, Macromodeling, Symbolic Regression

1. INTRODUCTION

Efficient performance and energy estimation are fundamental concerns in the design of embedded software. Simulating the execution of embedded software on models of the underlying processor platform is the most widely used approach for performance and energy estimation. While simulation efficiency has been the subject of significant research effort, rapid growth in the complexity of embedded software (the number of lines of code in a typical embedded application is estimated to double every 10 to 12 months on average, i.e., even faster than Moore's law) implies that efficient performance/energy estimation for embedded software will remain a challenge for the foreseeable future.

Our work is based on the observation that large embedded software programs are rarely written from scratch: reliable design, subject to stringent design turnaround time and design cost constraints, mandates substantial reuse. An analysis of the dynamic execution traces of embedded programs reveals that a large fraction of execution time is spent in reused software components (including embedded operating systems, middleware, run-time libraries, domain-specific algorithm libraries, etc.). As an example, our experiments with the popular compression utility gzip showed that, on average, 90% of gzip's execution time is spent in calls to the gzip library package, 8% in calls to standard C library functions, and only 2% in code specific to the gzip program, frequently known as "glue code." It is hence not surprising that reusable software modules account for a major fraction of simulation or estimation effort.

It is natural to wonder whether reuse, which saves significant design effort, can also be exploited to reduce estimation effort. Characterization-based macromodeling takes a step in this direction by enabling the extraction of fast, higher-level models of reusable software components, based on pre-characterization using more detailed, slower models. The effort expended in constructing a macromodel for a software module is amortized over the large number of times the macromodel is applied when the module is simulated in the context of all the programs that include it.

The rest of this paper is organized as follows. We describe the contributions of this paper in Section 1.1 and discuss related work in Section 1.2. In Section 2, we identify the major challenges involved in macromodeling. Section 3 describes in detail how the proposed macromodel generation methodology overcomes
the identified challenges. Our implementation and experimental results are presented in Section 4, and conclusions in Section 5.

1.1 Paper Contributions

The complexity of modern embedded software poses significant challenges for both the generation and use of macromodels. In this work, we identify key limitations of the state of the art in software macromodeling, notably the significant manual effort required from the software designer to identify suitable parameters and a template function on which the macromodel is based. Addressing these problems while maintaining sufficient generality to handle a wide range of embedded software programs is quite challenging. We propose a methodology that automates the critical steps of parameter identification, data collection through accurate simulation or measurement, and construction of the macromodel function, while simultaneously optimizing the values of the macromodel coefficients to achieve the best fit. Our work draws inspiration from concepts in distributed programming (automatic data structure serialization) and from recent advances in statistical data analysis (symbolic regression). We also demonstrate the practical application of macromodeling to software libraries of significant complexity.
Figure 1: Function bg_compute_scc and associated data structures

1.2 Related Work

We discuss related work in the areas of macromodeling for hardware power simulation, efficient software performance and energy estimation, and fast instruction set simulation. Macromodels for register-transfer level (RTL) components can be constructed through characterization of their logic-level hardware models and have been used extensively in RTL power estimation. Techniques to speed up cycle-accurate instruction set simulation have also received significant attention. Instruction-set simulation can be accelerated with little or no loss of accuracy using compiled simulation, combining compiled and interpreted simulation, or by optimizing the implementation of different functions such as instruction fetch and decode in the instruction set simulator (ISS). Hybrid simulation techniques for energy estimation have been proposed as well. Delay and energy caching (reusing the execution time and energy consumption results from previous simulations of a program segment) has been used to speed up estimation.

An alternative approach to embedded software power analysis is to use cycle-accurate, structure-aware architecture simulators, which can identify the architectural blocks activated in each cycle of a program's execution and record the stream of inputs seen by them. Software macromodeling at the granularity of functions or sub-programs has been explored, demonstrating that orders of magnitude speedup in estimation time can be obtained while maintaining high estimation accuracy. Performance characterization of library functions using statistical and semantic properties of function arguments was also recently presented. In summary, the importance of embedded software performance and energy estimation has fueled significant research effort, but macromodeling for software sub-programs of arbitrary complexity has remained a relatively unexplored area, and many important issues have not been addressed. To the best of our knowledge, this is the first work to automate all the key steps in macromodel generation and demonstrate the applicability of fully automatic macromodeling to software programs of realistic complexity.

2. MOTIVATION

In this section, we describe the key challenges involved in macromodel generation for complex software programs, and illustrate them through the task of constructing an energy macromodel for the bg_compute_scc function taken from a commercial graph data structure library. The C prototype of bg_compute_scc is shown in Figure 1, along with a description of its input data structures. The bgraph structure contains various dynamically allocated fields, including an adjacency list representation of the graph's connectivity, and fields to store the identified strongly connected components (SCCs). In addition to the software implementation of the graph data structure library and several testbenches that exercise its functions, we are given a cross-compilation tool chain for the target StrongARM-based embedded system, as well as a cycle-accurate ISS that reports energy consumption [14, 15]. Any automated approach to generating a macromodel needs to address the following key challenges:

-Selection of macromodel parameters: In general, macromodel parameters, which are the independent variables used in the macromodel, can include the size or value of any field nested arbitrarily deep within the input or output data structures. The number of candidate parameters can be very large even for simple software functions. However, an efficient and robust macromodel must include only relevant parameters that have an actual impact on energy consumption. For the bg_compute_scc function, if we consider the values of all nested fields of scalar data types and the sizes of all nested fields of complex types, we can identify 2n + e + s + 9 potential macromodel parameters for a graph with n vertices, e edges, and s SCCs. The number of possible subsets of parameters is therefore 2^(2n+e+s+9). While in some cases human understanding and insight may reveal that a small subset of parameters largely determines the execution time or energy consumption, an automatic tool processing the source code does not have the luxury of such insight.
-Data collection: Once a candidate set of macromodel parameters is identified, characterization experiments must be performed to obtain the values of the candidate macromodel parameters and the corresponding value of the dependent variable (energy or execution time) for numerous execution instances. Capturing the macromodel parameter values requires runtime tracing of the sizes of dynamically created data structures as well as the values of nested atomic fields. In practice, this is not a simple task: the number of levels of pointer traversal required to reach all scalar fields may vary dynamically, and conventional size computation utilities (such as sizeof in the C programming language) do not perform pointer traversal, i.e., they do not include the size of objects pointed to by fields in the given object (see the sketch following this list).
-Macromodel function construction: Given the data gathered from characterization, determining a suitable function to approximate the collected data can be a daunting task, since the search space of candidate functions is intractably large (infinite in the case of real-valued functions). Conventional approaches to macromodeling circumvent this problem by requiring the designer to manually specify a macromodel template. While various templates have been suggested in the literature, template identification is in practice an ad hoc and tedious process that demands a detailed understanding of the target function.
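To make the data-collection challenge concrete, the following sketch contrasts C's sizeof with a pointer-traversing "serialized size" on a deliberately simplified, hypothetical stand-in for the bgraph structure (the actual structure in Figure 1 contains additional fields, e.g., for SCC storage). Only the serialized size grows with the number of vertices and edges, which is what a useful macromodel parameter needs to capture.

/* A hypothetical, simplified version of the bgraph structure from Figure 1
 * (the real layout is not reproduced here). It illustrates why a plain
 * sizeof() cannot serve as a macromodel parameter: the adjacency list is
 * reached only through pointers, which sizeof() does not follow. */
#include <stdio.h>
#include <stdlib.h>

typedef struct listnode {
    int              vertex;   /* target vertex of one edge        */
    struct listnode *next;     /* next edge in this adjacency list */
} listnode, *LISTPTR;

typedef struct bgraph {
    int      no_vertices;      /* number of vertices, n                 */
    LISTPTR *vlist;            /* array of n adjacency lists (the edges) */
} bgraph;

/* sizeof(bgraph) is a compile-time constant (a few bytes) regardless of
 * how many vertices or edges the graph contains. A serialized size, by
 * contrast, traverses the pointers and grows with n and e. */
static size_t serialized_size(const bgraph *g)
{
    size_t bytes = sizeof(*g);
    for (int v = 0; v < g->no_vertices; v++) {
        bytes += sizeof(LISTPTR);                 /* the vlist[] slot */
        for (LISTPTR p = g->vlist[v]; p; p = p->next)
            bytes += sizeof(*p);                  /* one edge node    */
    }
    return bytes;
}

int main(void)
{
    bgraph g = { .no_vertices = 3, .vlist = calloc(3, sizeof(LISTPTR)) };
    /* add a single edge 0 -> 1 */
    listnode e = { .vertex = 1, .next = NULL };
    g.vlist[0] = &e;

    printf("sizeof(bgraph)  = %zu bytes\n", sizeof(bgraph));
    printf("serialized size = %zu bytes\n", serialized_size(&g));
    free(g.vlist);
    return 0;
}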

We present techniques using type analysis, data serialization concepts and symbolic regression to overcome these challenges, making it possible to significantly extend the applicability of macromodeling to complex software, while greatly simplifying macromodel construction and minimizing the need for human intervention.


Figure 2: Energy estimates from macromodeling and instruction-level simulation for bg_compute_scc
To illustrate the utility of our methodology, we used it to construct an energy consumption macromodel for the bg_compute_scc function shown in Figure 1. The resulting macromodel equation, which relates energy consumption to the size of the input argument bgr and the values of its member fields, is as follows:

Energy = (5.83E-6) * last * size(bgr) + (0.5934) * no_vertices - (0.576) * last + (3.625E-4) * size(bgr)

A model in terms of function arguments, like the one shown above, has the additional advantage of being well-suited to automated macromodel application within a larger estimation framework, because the model parameters are readily available in any software simulator. As shown in Figure 2, the energy estimates obtained using the macromodel are in close agreement with those from instruction-level simulation across a range of input instances, with an average estimation error of 0.7%.
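As a rough illustration of how such a macromodel might be applied inside an estimation framework, the sketch below evaluates the equation above for one execution instance. The helper estimate_energy() and the parameter values are hypothetical; in practice, last, no_vertices, and size(bgr) would be supplied by the simulation environment.

/* Hypothetical sketch: evaluating the energy macromodel for bg_compute_scc.
 * The coefficients are those reported above; the instance values are made up. */
#include <stdio.h>

static double estimate_energy(double last, double no_vertices, double size_bgr)
{
    return 5.83e-6  * last * size_bgr
         + 0.5934   * no_vertices
         - 0.576    * last
         + 3.625e-4 * size_bgr;
}

int main(void)
{
    /* Example parameter values for one (made-up) execution instance. */
    double last = 12.0, no_vertices = 64.0, size_bgr = 4096.0;
    printf("Estimated energy: %.3f\n",
           estimate_energy(last, no_vertices, size_bgr));
    return 0;
}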

3. AUTOMATIC MACROMODELING METHODOLOGY

Figure 3 presents an overview of the proposed macromodeling methodology. Starting with the source code for the target function to be macromodeled, and a testbench that thoroughly exercises the target function over a wide range of input instances, the methodology consists of a sequence of steps that culminates in the generation of macromodels that approximate the energy consumption or execution time of the function. Two parallel compilation and execution flows are used to collect the data necessary to construct the macromodel. First, the source code is subjected to parsing and type analysis, based on which our tool automatically generates data structure traversal and serialization routines and instruments the source code to invoke them at appropriate locations. The instrumented source code, traversal and serialization routines, and testbench are compiled and executed (any functionally accurate execution environment suffices for this step). During execution, the traversal and serialization routines dynamically enumerate and collect the values of candidate macromodel parameters. Cross-compilation and instruction-level simulation of the uninstrumented target source code and testbench are used to obtain the energy consumption and execution time of each execution instance of the target function. The resulting instance-by-instance profile database, which contains values for all the independent and dependent variables, is then fed to the symbolic regression engine to produce the macromodel.

The rest of this section describes the key steps of our methodology, which are highlighted as shaded rectangles in Figure 3.

Figure 3: Overview of the proposed automatic macromodeling methodology

3.1 Data Collection

Our data collection tool parses the input C files to collect information about data types and function arguments in the program, which is used to suitably instrument the input program.

We use argument sizes and values of the input and output data structures of the target function, as well as their nested fields, as model parameters.
We define the argument size of a data structure as the number of bytes it would occupy if it were serialized. Serialization is the process of converting structured data into a serial byte stream for storage or transmission, as in a remote procedure call.
We use type analysis to automatically generate code that computes argument sizes. In compiler theory, two kinds of data types are distinguished: basic types (e.g., int, char, float) and type constructors (e.g., pointers, arrays, and structures). Our object size calculations apply rules associated with each basic type and type constructor. The size of a basic type can be obtained directly using language facilities. The size of a structure is the sum of the sizes of its nested fields. Pointers are recursively traversed using indirection until a non-pointer type is reached, whose size is then taken as the size of the pointer. Array sizes are calculated similarly, but require knowledge of array bounds at runtime; while C implementations do not in general maintain array bounds, C functions that have array arguments usually also include other arguments specifying those bounds. A simplified sketch of these rules follows.
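As an illustration of what the generated size-computation code might look like, the sketch below hand-codes these rules for a small example type. The types shown (point, sample) are hypothetical; in the actual flow, per-type routines of this form are emitted automatically from the parsed type information.

/* Hypothetical example types; in the real flow, per-type size routines
 * like these are generated automatically from type analysis. */
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

typedef struct point  { int x, y; } point;
typedef struct sample {
    int     n;        /* runtime bound of the data[] array        */
    double *data;     /* dynamically allocated array of n doubles */
    point  *origin;   /* pointer to a nested structure            */
} sample;

/* Rule: the size of a basic type comes directly from the language. */
static size_t size_int(void)    { return sizeof(int); }
static size_t size_double(void) { return sizeof(double); }

/* Rule: the size of a structure is the sum of the sizes of its fields. */
static size_t size_point(const point *p)
{
    (void)p;
    return 2 * size_int();
}

/* Rules: a pointer contributes the size of the object it points to;
 * an array contributes its element size times its runtime bound. */
static size_t size_sample(const sample *s)
{
    size_t bytes = size_int();                 /* field n      */
    bytes += (size_t)s->n * size_double();     /* field data[] */
    if (s->origin)
        bytes += size_point(s->origin);        /* field origin */
    return bytes;
}

int main(void)
{
    point  o = { 0, 0 };
    sample s = { .n = 4, .data = malloc(4 * sizeof(double)), .origin = &o };
    printf("argument size of s: %zu bytes\n", size_sample(&s));
    free(s.data);
    return 0;
}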

Callee function argument sizes are computed dynamically by code inserted in the caller, immediately before and after the call to the target function. The framework described above enables run-time calculation of object sizes and other interesting information. For example, the size of a linked list object is calculated as the sum of the sizes of all its elements. As a more complex example, consider the bgraph structure shown in Figure 1. Most macromodel templates for bg_compute_scc would require data about the number of vertices, n, and the number of edges, e, in the graph. The value of n can be obtained directly from the field no_vertices. Calculating e requires recognizing that vlist (the graph's adjacency list) is actually an array of no_vertices LISTPTR objects; hence, the size of the vlist field serves as a good estimate of e. A hypothetical sketch of such caller-side instrumentation is shown below.
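The following self-contained sketch suggests what the caller-side instrumentation around a call to bg_compute_scc might look like. The routines serialized_size() and record_instance() stand in for the automatically generated traversal, serialization, and logging code, and the bgraph fields used (no_vertices, last) are assumptions based on the macromodel of Section 2; the actual generated code would differ.

/* Hypothetical, self-contained sketch of caller-side instrumentation.
 * serialized_size() and record_instance() are stubbed stand-ins for the
 * generated routines, and bg_compute_scc() is a stub for the library call. */
#include <stdio.h>
#include <stddef.h>

typedef struct bgraph { long no_vertices; long last; /* ... */ } bgraph;

static size_t serialized_size(const bgraph *g) { return sizeof(*g); } /* stub */
static void   bg_compute_scc(bgraph *g)        { (void)g; }           /* stub */

/* Generated logging routine: appends one row of candidate parameter values
 * to the instance-by-instance profile database (here, standard output). */
static void record_instance(const char *fn, size_t size_bgr,
                            long no_vertices, long last)
{
    printf("%s,%zu,%ld,%ld\n", fn, size_bgr, no_vertices, last);
}

int main(void)
{
    bgraph g = { .no_vertices = 64, .last = 12 };
    bgraph *bgr = &g;

    /* Instrumented call site (originally just: bg_compute_scc(bgr);) */
    record_instance("bg_compute_scc", serialized_size(bgr),
                    bgr->no_vertices, bgr->last);
    bg_compute_scc(bgr);
    return 0;
}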

3.2 Macromodel Construction Using Symbolic Regression

The data collected through characterization experiments is used to construct a macromodel relating the target function's energy or execution time to a subset of the potential macromodel parameters. We perform this critical step through the use of symbolic regression, which was first introduced by Koza as an application combining statistical data analysis and genetic programming (GP). GP is used to evolve formulae containing the identified model parameters and a chosen set of mathematical operators. Based on extensive experimentation, we found the operator set F = {+, -, ×, /, x^2, x^3, log(x)} to be adequate for our modeling needs.

We used an extended form of Koza’s symbolic regression technique, called Hybrid GP (HGP) [19], to increase the numerical robustness of symbolic regression. HGP extends Koza’s symbolic regression by introducing weights for all additive terms in the genetically derived regression formula. Classical linear regression is used to tune the weights.
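As a rough illustration of the HGP idea (not of the actual implementation), the sketch below represents a candidate macromodel as a set of GP-derived additive terms combined through per-term weights. In the real flow, the terms are evolved by GP and the weights are obtained by linear least-squares fitting over the profile database; here, the terms and weights simply mirror the bg_compute_scc model from Section 2.

/* Rough illustration of HGP's weighted-term representation. Each GP-evolved
 * additive term is a function of the candidate parameters; the macromodel is
 * a weighted sum of those terms, with the weights tuned by classical linear
 * regression over the characterization data. */
#include <stdio.h>

#define NPARAMS 3   /* parameter vector: [last, no_vertices, size_bgr] */
#define NTERMS  4

typedef double (*term_fn)(const double p[NPARAMS]);

/* GP-derived additive terms (hypothetical examples). */
static double t_last_times_size(const double p[]) { return p[0] * p[2]; }
static double t_no_vertices(const double p[])     { return p[1]; }
static double t_last(const double p[])            { return p[0]; }
static double t_size(const double p[])            { return p[2]; }

static const term_fn terms[NTERMS] = {
    t_last_times_size, t_no_vertices, t_last, t_size
};

/* Weights that would normally be obtained by least-squares fitting;
 * here we reuse the coefficients reported in Section 2. */
static const double weights[NTERMS] = { 5.83e-6, 0.5934, -0.576, 3.625e-4 };

static double evaluate(const double p[NPARAMS])
{
    double y = 0.0;
    for (int i = 0; i < NTERMS; i++)
        y += weights[i] * terms[i](p);
    return y;
}

int main(void)
{
    double p[NPARAMS] = { 12.0, 64.0, 4096.0 };  /* made-up instance */
    printf("macromodel estimate: %.3f\n", evaluate(p));
    return 0;
}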

4. IMPLEMENTATION AND RESULTS

In this section, we discuss our implementation and present experimental results demonstrating the benefits of the proposed methodology.

The instrumentation and data collection steps in our methodology were implemented using a YACC-based parser and several PERL scripts. Our implementation of symbolic regression is based on the GP kernel gpc++ and on libraries for symbolic and numerical manipulation.

We conducted several experiments using a variety of benchmark software programs to demonstrate the utility of our automatic macromodeling framework. Table 1 shows how the macromodels obtained using our framework perform compared to execution times and energy consumption values obtained through a combination of extensive simulations and measurements from real implementations.
The benchmarks have been given descriptive names to indicate their function. For each benchmark, a sample set of 500 input instances (data sets) was used to develop the macromodel. The error associated with a macromodel is defined as the root mean square (RMS) deviation from observed values (obtained through instruction-set simulation or measurement), taken over the entire sample set. The symbolic regression tool was programmed to terminate after fifty GP generations or when the error dropped to less than 1%, whichever occurred sooner.
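For concreteness, one plausible way to write this error metric (assuming per-instance relative deviations, which is consistent with the percentage-based termination criterion; the exact normalization used by the tool may differ) is

\[ \mathrm{Error} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{E_{\mathrm{est},i} - E_{\mathrm{obs},i}}{E_{\mathrm{obs},i}}\right)^{2}} \]

where E_est,i is the macromodel estimate, E_obs,i is the observed (simulated or measured) value for instance i, and N = 500 is the number of sample instances.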




Table 1: Macromodeling examples

We chose the SimIt-ARM-1.1 cycle-accurate ARM ISS as our measurement platform because of its high simulation speed. The execution time of a code segment was determined as the difference in execution times between two versions of the benchmark, one with the execution of the target function enabled and the other with it disabled. To compute energy consumption, we extended SimIt-ARM-1.1 to report processor and memory energy estimates.

4.1 Case Study: Energy Tradeoffs during Lossless Data Compression

In this section, we explore the use of macromodels in making algorithmic tradeoffs using the zlib [29] compression library. zlib can be embedded into any software application in order to perform lossless data compression. The compress2() function provided by zlib, whose interface is int compress2(Bytef *dest, uLongf *destLen, const Bytef *source, uLong sourceLen, int level), allows the user to vary the computational effort expended in compression through the level argument, which takes values from zero (no compression) to nine (maximum compression).
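As a point of reference, a minimal sketch of how an application might invoke compress2() at different compression levels is shown below (error handling abbreviated; link with -lz). The payload string and buffer sizes are arbitrary.

/* Minimal sketch of calling zlib's compress2() at each compression level. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *text = "an example payload an example payload an example payload";
    uLong src_len = (uLong)strlen(text) + 1;

    for (int level = 0; level <= 9; level++) {
        Bytef  dest[256];
        uLongf dest_len = sizeof(dest);   /* in: capacity, out: compressed size */

        int rc = compress2(dest, &dest_len, (const Bytef *)text, src_len, level);
        if (rc != Z_OK) {
            fprintf(stderr, "compress2 failed at level %d (rc=%d)\n", level, rc);
            continue;
        }
        printf("level %d: %lu -> %lu bytes\n",
               level, src_len, (unsigned long)dest_len);
    }
    return 0;
}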

We developed a macromodel for the energy consumed by the compress2 function using the proposed methodology, and used it to study the tradeoff between energy consumption and the compression ratio actually achieved, for various values of the level parameter, over 300 files of various types ranging in size from 1 byte to 1 MB. The results of this experiment, shown in Figure 4, indicate that average energy consumption increases monotonically with level but the compression ratio does not, so not all compression levels are Pareto-optimal with respect to these two metrics. The figure also shows that the macromodel estimates are in close agreement with energy estimates obtained using SimIt-ARM and follow the same relative trend, which makes the macromodel suitable for high-level design space exploration. The speed advantage of macromodeling is evident: estimation using the macromodel for all the input samples required less than a minute, while the ISS took over a day to complete.

Figure 4: Using macromodeling to explore compression vs. energy tradeoffs


5. CONCLUSIONS

We presented a systematic methodology to automate the generation of energy and performance macromodels for embedded software. The proposed methodology radically simplifies macromodel construction, while expanding its applicability to complex embedded software. Furthermore, the use of properties of program data structures, including function arguments, as model parameters simplifies macromodel use, enabling usage in conjunction with any simulation environment. For example, macromodels could be integrated into an instruction-level simulation environment, so that some parts of the code are handled using macromodels, while glue code or parts that are difficult to macromodel are simulated using conventional techniques.
