# System-level Dynamic Variation Margining in Presence of Monitoring and Actuation

Liangzhen Lai, Member, IEEE, Puneet Gupta, Member, IEEE

Abstract—Adaptive systems have become a popular approach to mitigating dynamic variations with the help of monitoring and actuation. However, design margin is still required due to imperfections in the entire adaptation process, including inaccuracy and latency of both monitors and actuators. In this work, we study the system-level margining problem in the presence of monitoring and actuation. We first analyze the margining strategies for different types of dynamic variation sources. Case studies are then performed with different monitor and actuator types for temperature variations. Our experimental results show that inappropriate design and selection of monitor and actuator can result in up to 2.8X more system margin. Based on the results, we give design guidelines for system design optimization.

#### I. INTRODUCTION

Dynamic variations affect the circuit performance and threaten system resiliency. Voltage fluctuations, temperature variations and circuit aging are typical dynamic variation sources that can degrade the circuit performance and cause timing errors. To guarantee correct functionality, increasing amounts of design margin are applied for dynamic variations [1].

The monitor-and-actuate approach has been widely adopted to dynamically mitigate circuit variability and reliability issues [1]–[7]. A typical monitor-and-actuate system consists of three components: monitor, actuator, and controller. The variation signatures, including voltage/temperature fluctuations and wear-out aging, are captured by the monitor at runtime. A controller with a given management policy uses the monitor readings to make the adaptation decision. The actuator, which is the mechanism for achieving adaptation, adjusts the operating conditions in reaction to the measured variations.

Previous work has sought to improve the quality of monitors or actuators. For example, various types of circuit-level monitors have been proposed [5], [8]–[11] with different levels of accuracy, overhead, and measurement latency. Different actuation mechanisms, e.g., clock gating [3], dynamic voltage and frequency scaling (DVFS) [6], [12], software-driven DVFS [4], [5], and task migration [13], have different levels of granularity, overhead, and actuation latency. Some monitor and actuator designs quantify the margin required for their own inaccuracy [5], [8], [9]. However, for dynamic variations, the system must account for both inaccuracy and latency in the *entire* adaptation process. System-level dynamic variation margining in the presence of monitoring and actuation is still understudied and ad hoc.

The authors are with the Electrical Engineering Department, University of California, Los Angeles, CA 90095 USA (e-mail: liangzhen@ucla.edu, puneet@ee.ucla.edu). This work is supported in part by NSF Variability Expedition grant CCF-1029030.

In this work, we study the system-level margining problem assuming a simple threshold-triggering policy. More sophisticated controller policies can reduce the system reaction latency through prediction, e.g., using signature-based [2] or model-based [14] approaches. However, they come with additional prediction overhead and, more importantly, imperfect prediction accuracy. Therefore, they may not be able to reduce the system design margin, which is typically worst-case driven.

The rest of the paper is organized as follows: Section II describes the system margining strategy. Section III presents the study of system margining for different types of dynamic variation sources. Section III-C discusses the design implications of dynamic variation margining. Section IV concludes the paper.

#### II. SYSTEM MARGINING STRATEGY

In this section, we study the margining strategy for dynamic variations in presence of system adaptations. The system margin breakdown is described in Section II-A. The exact amount of margin required is analyzed in Section II-B.

#### A. Margin Breakdown

Ideally, no margin is required if the system can adapt to the variations perfectly. Every source of imperfection in the monitor-and-actuate process, including monitor and actuator inaccuracy and adaptation latency, calls for margining. For dynamic variations, system margin can be classified as static margin and dynamic margin. Dynamic margin is the additional design guardband required due to the dynamic nature of the variations and the adaptation process. Static margin is the additional design guardband required due to the imperfection of each individual component, such as monitor coverage/inaccuracy or actuation granularity.

As mentioned in Section I, we assume a threshold-triggering policy as illustrated in Fig. 1. For a given variation metric (e.g., temperature, voltage droop, or delay), the system has a maximum allowed value (marked as the red line in Fig. 1) to ensure correct operation. A threshold-triggering policy has a triggering threshold value (dotted line in Fig. 1) for the monitor reading. The adaptation controller activates the actuation process if the monitor reading is higher than the threshold.

Fig. 1. Illustration of static and dynamic margin, and the threshold-triggering policy.

Due to various latencies in the entire monitor and actuation process, by the time actuation becomes effective, the variation metric may have become worse than what was monitored (blue dotted line in Fig. 1). This calls for dynamic margin, which is a function of the system response latency and the rate of change of the variation. In addition to the dynamic margin for response latency, some static margin is required for monitor and actuator inaccuracy. As shown in Fig. 1, the total system margin, including both static and dynamic margin, is dictated by the difference between the maximum allowed value and the trigger threshold.

Due to the simplicity of the threshold-triggering policy, we assume in this work that the controller response time is much smaller than the monitor and actuator latencies. The total system response latency therefore consists of only monitor latency and actuator latency. The monitor latency is the time between when the variation metric is sampled and when the monitor reading is available. This can be approximated by the time it takes to generate a monitor reading. The actuation latency is the time it takes to complete the actuation operation.

#### B. System Margin Analysis

As discussed earlier, the system margin can be classified as static margin and dynamic margin. Static margin is always required, regardless of variation type. Improving either monitor accuracy or actuator quality reduces the required static margin. For example, flexible monitor design methodologies [5], [8], [9] can trade off accuracy against overhead. The static margin  $M_s$  can be calculated as:

$$M_s = Err_m + Err_a \tag{1}$$

where  $Err_m$  and  $Err_a$  are the monitor and actuator errors.

The amount of dynamic margin, however, depends on the temporal profile of the specific variation type (i.e., how fast the variation changes) and the system response latency, i.e., both the monitor and actuator latencies. For example, process variation or circuit aging (e.g., BTI, HCI) requires little dynamic margin because it is either static or slowly changing. Temperature and supply voltage fluctuations, however, need a certain amount of dynamic margin, as the temperature increase or voltage droop can keep worsening before the system can monitor and effectively react to it. Based on the variation type, the margin required can be represented as a function V(t) of the latency t. The derivation of this function is discussed in Section III. The dynamic margin  $M_d$  can be calculated as

$$M_d = V(l_m + l_t) \tag{2}$$

where  $l_m$  and  $l_t$  are the latencies for monitor and actuator. Usually, the required margin increases with the latency. These latency values should be the worst-case latency for both monitor and actuator.

The total system margin M can be calculated as:

$$M = M_s + M_d = Err_m + Err_a + V(l_m + l_t) \tag{3}$$

This gives us the breakdown and dependence of the system margin in presence of monitor-and-actuate adaptations.
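Equations (1)–(3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the linear variation profile `linear_profile` (with its slope of 20 units per second) are hypothetical stand-ins, not values from this work. The real V(t) depends on the variation type, as discussed in Section III.

```python
from typing import Callable


def static_margin(err_monitor: float, err_actuator: float) -> float:
    """Eq. (1): M_s = Err_m + Err_a."""
    return err_monitor + err_actuator


def dynamic_margin(v: Callable[[float], float],
                   l_monitor: float, l_actuator: float) -> float:
    """Eq. (2): M_d = V(l_m + l_t), using worst-case latencies in seconds."""
    return v(l_monitor + l_actuator)


def total_margin(v: Callable[[float], float],
                 err_monitor: float, err_actuator: float,
                 l_monitor: float, l_actuator: float) -> float:
    """Eq. (3): M = M_s + M_d."""
    return (static_margin(err_monitor, err_actuator)
            + dynamic_margin(v, l_monitor, l_actuator))


def linear_profile(latency_s: float) -> float:
    """Hypothetical V(t): the metric worsens by 20 units per second (assumed)."""
    return 20.0 * latency_s


# Assumed example numbers: 1.0 + 0.5 units of static error, 1 ms + 10 ms latency.
m = total_margin(linear_profile, err_monitor=1.0, err_actuator=0.5,
                 l_monitor=1e-3, l_actuator=10e-3)
```

Passing V as a callable keeps the static part and the latency-dependent part separate, which mirrors the breakdown in Equation (3).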

#### III. CASE STUDY EXPERIMENTS

In this section, we study the system margining problem for various types of dynamic variation sources. The types of dynamic variation sources will be described in Section III-A. Case studies for temperature variations are presented in Section III-B.

#### A. Dynamic Variation Sources

In this work, we consider mainly three types of dynamic variation sources, i.e., voltage droops, wear-out aging and temperature variations. Out of these three variation sources, voltage droop is the fastest varying one, and wear-out aging is the slowest. We expect the dynamic margin to be the largest for voltage droop and smallest for aging.

The behavior of voltage droop depends on both the workload behavior and the power delivery network. A typical assumption for static and dynamic voltage droops is that they should be less than 5% and 15% of the supply voltage, respectively [15]. Monitor-and-actuate can therefore potentially reduce the total margin by 75% (the dynamic 15% out of the combined 20%). However, previous work reports that a step-on current can result in a peak voltage droop resonating at about 100 MHz [16]. To make the adaptation system effective, the entire monitor-and-actuate system should operate much faster than 100 MHz. If we consider clock gating as the actuator, it is extremely difficult to design the adaptation system with short enough latency. Considering the growing complexity of the clock network and the massive number of clock sinks, the clock distribution latency from the clock source to the end clock pins can be up to a few clock cycles [17], [18]. Therefore, we expect little benefit from dynamic adaptation for voltage droops.

Most wear-out aging mechanisms, unlike voltage droop, vary slowly over the entire hardware lifetime. Given a typical hardware lifetime of a few years, the monitor and actuator latencies are negligible. The system margin for wear-out aging will be dominated by the static margin.

Typical monitoring or actuation latencies range from sub-microsecond to tens of milliseconds. Any variation that is faster than this range (e.g., voltage droop) will have its system margin dominated by dynamic margin, so the adaptation system should be optimized for latency. Any variation that is slower (e.g., aging) will have mostly static margin, so the corresponding adaptation system should be optimized for accuracy. Only dynamic variations that fall within this time scale (e.g., temperature) need to consider both static and dynamic margin. Therefore, in this work, we focus on the case of temperature variations.

#### B. Case Studies for Temperature Variations

We demonstrate our strategy with case studies on systems equipped with different thermal monitors and corresponding actuators. We assume the threshold-triggering policy (described in Section II-A) in the experiments. For the system to function correctly, a bound is defined for the temperature variation, i.e., the maximum allowed temperature. The system keeps reading from the monitor and activates the actuator if the monitor reading exceeds a given threshold. As illustrated in Fig. 1, due to the imperfection of the monitor and actuator, the threshold must be conservative and tighter than the variation bound. The difference between the threshold and the variation bound represents the margin.

For temperature monitoring, there can be two sources of inaccuracy. The monitor itself can be inaccurate and deviate from the actual junction temperature. Meanwhile, the thermal hotspot may not be at the exact location of the thermal monitor, causing spatial inaccuracy. For temperature variation, the purpose of actuation is to mitigate overheating rather than to control the exact temperature. Actuator granularity and inaccuracy therefore affect the actuation quality (e.g., the performance impact) rather than the system margin, and analyzing the actuation quality is beyond the scope of this paper. Therefore, in this case study, the static margin is classified as monitor inaccuracy margin and spatial inaccuracy margin.

The problem of margining for temperature fluctuation is illustrated in Fig. 1. For the experiments, we set the temperature limit to 100 °C. Using the thermal simulator HotSpot [19], we simulate the temperature profile for a given power trace. In this work, we consider a chip with 16×16 homogeneous cores. The baseline power, i.e., Thermal Design Power (TDP), is set so that the highest on-chip temperature reaches the temperature limit of 100 °C. The thermal monitors and actuators considered are summarized in Table II and Table I, respectively. To allow cross-comparison of different types of temperature monitors, the area budget for monitor placement is fixed at 0.4 mm<sup>2</sup>. The spatial inaccuracy due to the limited number of thermal monitors is calculated using the equation from [20]:

$$\text{Spatial Error} = 25 - 4\log_2 N$$

where $N$ is the number of monitors.
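As a quick check of the spatial error formula, the sketch below derives the monitor count from the fixed 0.4 mm<sup>2</sup> area budget and the per-monitor areas listed in Table II. The helper names are hypothetical, and taking N as the number of monitors that fit in the budget is our reading of the setup rather than a stated procedure.

```python
import math

AREA_BUDGET_MM2 = 0.4  # fixed monitor placement budget from the text


def monitor_count(area_per_monitor_mm2: float,
                  budget_mm2: float = AREA_BUDGET_MM2) -> int:
    """Number of monitors that fit in the budget (assumed: simple floor division).

    The small epsilon guards against floating-point error in the ratio.
    """
    return math.floor(budget_mm2 / area_per_monitor_mm2 + 1e-9)


def spatial_error_c(n_monitors: int) -> float:
    """Spatial error in degrees C: 25 - 4 * log2(N)."""
    return 25.0 - 4.0 * math.log2(n_monitors)


# Per-monitor areas (mm^2) as listed in Table II.
n_small = monitor_count(0.01)   # M1, M3
n_large = monitor_count(0.04)   # M2, M4
err_small = spatial_error_c(n_small)
err_large = spatial_error_c(n_large)
```

With these inputs the budget holds 40 of the smaller monitors or 10 of the larger ones, giving spatial errors of about 3.71 °C and 11.7 °C, which are the spatial components reported in Table III.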

TABLE I LIST OF ACTUATORS

| Actuator                   | Latency |
|----------------------------|---------|
| A1: Clock gating           | 10 ns   |
| A2: DFS (with PLL re-lock) | 100 µs  |
| A3: DVFS                   | 10 ms   |
| A4: Task migration         | 50 ms   |

TABLE II LIST OF THERMAL MONITORS

| Monitor | Samples/s | Accuracy          | Normalized area at 65 nm (mm<sup>2</sup>) |
|---------|-----------|-------------------|-------------------------------------------|
| M1 [21] | 10000     | $4 \ ^{\circ}C$   | 0.01                                      |
| M2 [22] | 5000      | $2 \ ^{\circ}C$   | 0.04                                      |
| M3 [23] | 1000      | $1 \ ^{\circ}C$   | 0.01                                      |
| M4 [24] | 10        | $0.1 \ ^{\circ}C$ | 0.04                                      |

The dynamic margin is derived based on the maximum power consumption  $P_{max}$ . If  $P_{max} = TDP$ , the problem is not interesting, as the maximum temperature will never be higher than the temperature limit. We first simulate the temperature over time as T(t). With  $P_{max} > TDP$ , the temperature will cross the temperature limit (i.e.,  $100 \ ^{\circ}C$ ), say at time  $t_a$  with  $T(t_a) = 100 \ ^{\circ}C$ . The function V(t) defined in Section II-B can be calculated as

$$V(t) = T(t_a) - T(t_a - t)$$

The dynamic margin is calculated using Equation (2), with the total latency taken as the sum of the actuator latency (see Table I) and the monitor latency, i.e., the time to generate one reading at the sampling rate listed in Table II.
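The calculation of V(t) from a temperature trace can be sketched as follows. The trace below is a hypothetical first-order exponential heating curve with assumed parameters (80 °C start, 120 °C steady state, 2 s time constant). It stands in for a HotSpot simulation and does not reproduce the results in Table III. The monitor latency is approximated as the reciprocal of the sampling rate.

```python
import math
from typing import Callable

T_LIMIT_C = 100.0  # temperature limit used in the experiments


def crossing_time(temp: Callable[[float], float], limit: float,
                  t_hi: float, tol: float = 1e-9) -> float:
    """Bisection for t_a with temp(t_a) = limit.

    Assumes temp is increasing on [0, t_hi] and temp(0) < limit <= temp(t_hi).
    """
    lo, hi = 0.0, t_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if temp(mid) < limit:
            lo = mid
        else:
            hi = mid
    return hi


def v_of_latency(temp: Callable[[float], float], t_a: float,
                 latency_s: float) -> float:
    """V(t) = T(t_a) - T(t_a - t); the lookup time is clamped at zero."""
    return temp(t_a) - temp(max(t_a - latency_s, 0.0))


def monitor_latency_s(samples_per_s: float) -> float:
    """Approximate monitor latency as the time to generate one reading."""
    return 1.0 / samples_per_s


def example_trace(t_s: float) -> float:
    """Hypothetical heating curve (assumed parameters, not simulated data)."""
    t_start_c, t_final_c, tau_s = 80.0, 120.0, 2.0
    return t_start_c + (t_final_c - t_start_c) * (1.0 - math.exp(-t_s / tau_s))


t_a = crossing_time(example_trace, T_LIMIT_C, t_hi=20.0)
latency = monitor_latency_s(1000) + 10e-3   # M3 sampling rate plus A3 latency
margin = v_of_latency(example_trace, t_a, latency)
```

Bisection is used so that the same helper works for any monotonically increasing trace, including an interpolated simulator output, without needing a closed-form inverse.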

Some case study results are highlighted in Table III. A few interesting observations can be made from the results:

- Dynamic margin is comparable to or even larger than static margin, depending on  $P_{max}/TDP$  ratio.
- The amount of dynamic margin increases dramatically with increased  $P_{max}/TDP$  ratio. This becomes more important as we enter the dark silicon era, where power density increases much faster than packaging technology improves.
- Within the static margin, spatial error margin is considerably higher than the inaccuracy margin. This can be different if we allocate more resources and increase the number of thermal monitors.

Comparing cases with the same type of actuator, e.g., Case I and Case VI, the same monitor area with different monitor types can result in dramatically different amounts of static and dynamic margin. An inappropriate combination of monitor and actuator types, e.g., unmatched latencies, can result in up to a 2.8X increase in system margin, from  $7.71 \ ^{\circ}C$  in Case I to  $21.33 \ ^{\circ}C$  in Case VI (static plus dynamic margin at  $P_{max} = 1.75 \times TDP$ ).
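The totals behind this comparison follow directly from Equation (3) applied to the Table III entries. The short sketch below sums the monitor, spatial, and dynamic components at $P_{max} = 1.75 \times TDP$; the dictionary layout and helper name are illustrative only.

```python
# (monitor error, spatial error, dynamic margin at P_max = 1.75 * TDP), in deg C,
# copied from Table III.
CASES = {
    "I":   (4.0, 3.71, 0.0),
    "II":  (2.0, 11.7, 0.05),
    "III": (1.0, 3.71, 0.90),
    "IV":  (0.1, 11.7, 22.5),
    "V":   (4.0, 3.71, 4.13),
    "VI":  (0.1, 11.7, 9.53),
}


def total_margin_c(case: str) -> float:
    """Total system margin: static (monitor + spatial) plus dynamic."""
    return sum(CASES[case])


# Case I and Case VI share actuator A1 but use different monitors.
ratio = total_margin_c("VI") / total_margin_c("I")
```

The two totals are 7.71 °C and 21.33 °C, and their ratio rounds to 2.8.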

#### C. System Design Implication

When designing adaptation systems, monitors and actuators need to be matched to maximize the benefit. However, some parts of the system, especially the actuators, may also be used for other purposes such as power management, so their selection can be affected by many other considerations.

TABLE III CASE STUDY ON DYNAMIC AND STATIC MARGIN

| Case     | Monitor | Actuator | Total latency | Static margin (monitor + spatial)    | Dynamic margin ($P_{max} = 1.25 \times TDP$) | Dynamic margin ($P_{max} = 1.5 \times TDP$) | Dynamic margin ($P_{max} = 1.75 \times TDP$) |
|----------|---------|----------|---------------|--------------------------------------|----------------------------------------------|---------------------------------------------|----------------------------------------------|
| Case I   | M1      | A1       | 100 µs        | $4 \ ^{\circ}C + 3.71 \ ^{\circ}C$   | $0 \ ^{\circ}C$                              | $0 \ ^{\circ}C$                             | $0 \ ^{\circ}C$                              |
| Case II  | M2      | A2       | 0.6 ms        | $2 \ ^{\circ}C + 11.7 \ ^{\circ}C$   | $0 \ ^{\circ}C$                              | $0.02 \ ^{\circ}C$                          | $0.05 \ ^{\circ}C$                           |
| Case III | M3      | A3       | 11 ms         | $1 \ ^{\circ}C + 3.71 \ ^{\circ}C$   | $0.11 \ ^{\circ}C$                           | $0.41 \ ^{\circ}C$                          | $0.90 \ ^{\circ}C$                           |
| Case IV  | M4      | A4       | 150 ms        | $0.1 \ ^{\circ}C + 11.7 \ ^{\circ}C$ | $1.28 \ ^{\circ}C$                           | $7.07 \ ^{\circ}C$                          | $22.5 \ ^{\circ}C$                           |
| Case V   | M1      | A4       | 50 ms         | $4 \ ^{\circ}C + 3.71 \ ^{\circ}C$   | $0.48 \ ^{\circ}C$                           | $1.86 \ ^{\circ}C$                          | $4.13 \ ^{\circ}C$                           |
| Case VI  | M4      | A1       | 100 ms        | $0.1 \ ^{\circ}C + 11.7 \ ^{\circ}C$ | $1.01 \ ^{\circ}C$                           | $4.16 \ ^{\circ}C$                          | $9.53 \ ^{\circ}C$                           |

This work aims to give some guidelines for selecting and designing the adaptation system, especially the monitors, which are usually dedicated to this purpose and have greater design flexibility. A few design guidelines can be drawn from our observations in the case studies for temperature variations:

- Monitor inaccuracy, which is the main metric for monitor design, constitutes only a fraction of the entire system margin. Over-optimizing the monitoring accuracy at the cost of excessive area and latency can sometimes result in larger system margin.

- Monitor latency, which may be overlooked during monitor design, can affect the dynamic margin. This makes designing faster monitors as important as designing accurate ones.
- For spatially distributed phenomena like temperature variations, optimizing for smaller monitor area/overhead is also important from the system design perspective. In some cases, trading off individual monitor accuracy for a more area-efficient monitor design can result in smaller spatial margin and thus better system margin.

We understand that in real designs, the choice of adaptation system can be affected by many different considerations. The final decision can be made by trading off all metrics, e.g., final design margin, design overhead, and design complexity.

#### IV. CONCLUSION

In this work, we first study the system margining problem and analyze different types of margin for monitor-and-actuate systems. Margining strategies are studied for different types of dynamic variation sources. Case studies are performed for temperature variation margining. The experimental results show that inappropriate design and selection of monitor and actuator can reduce the adaptation benefit. Design guidelines are given for optimizing the system margin of monitor-and-actuate adaptations.

#### REFERENCES

- [1] P. Gupta *et al.*, "Underdesigned and opportunistic computing in presence of hardware variability," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 2012, Keynote Paper.
- [2] V. J. Reddi et al., "Voltage emergency prediction: Using signatures to reduce operating margins," in *High Performance Computer Architecture*, 2009. HPCA 2009. IEEE 15th International Symposium on. IEEE, 2009.
- [3] K. A. Bowman et al., "All-digital circuit-level dynamic variation monitor for silicon debug and adaptive clock control," *Circuits and Systems I: Regular Papers, IEEE Transactions on*, vol. 58, no. 9, 2011.
- [4] Y. Agarwal *et al.*, "Redcooper: Hardware sensor enabled variability software testbed for lifetime energy constrained application," Tech. Rep., http://escholarship.org/uc/item/1c21g217.
- [5] L. Lai et al., "Accurate and inexpensive performance monitoring for variability-aware systems," in *Design Automation Conference (ASP-DAC)*, 2014 19th Asia and South Pacific, Jan 2014.
- [6] C. R. Lefurgy *et al.*, "Active guardband management in power7+ to save energy and maintain reliability," *Micro, IEEE*, vol. 33, no. 4, 2013.
- [7] S. Sarma *et al.*, "Cyberphysical-system-on-chip (cpsoc): a self-aware mpsoc paradigm with cross-layer virtual sensing and actuation," in *IEEE/ACM Design, Automation and Test in Europe*. EDA Consortium, 2015.

- [8] T.-B. Chan et al., "Synthesis and analysis of design-dependent ring oscillator (ddro) performance monitors," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 22, no. 10, Oct 2014.
- [9] L. Lai et al., "Slackprobe: A flexible and efficient in situ timing slack monitoring methodology," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 33, no. 8, Aug 2014.
- [10] J. Tschanz *et al.*, "Tunable replica circuits and adaptive voltage-frequency techniques for dynamic voltage, temperature, and aging variation tolerance," in *VLSI Circuits, Symposium on*, Jun. 2009.
- [11] A. Drake *et al.*, "A distributed critical-path timing monitor for a 65nm high-performance microprocessor," in *Proc. IEEE International Solid State Circuits Conference*, Feb. 2007.
- [12] A. J. Drake *et al.*, "Single-cycle, pulse-shaped critical path monitor in the power7+ microprocessor," in *Low Power Electronics and Design* (ISLPED), 2013 IEEE International Symposium on. IEEE, 2013.
- [13] D. Cuesta *et al.*, "Adaptive task migration policies for thermal control in mpsocs," in *VLSI 2010 Annual Symposium*. Springer, 2011.
- [14] A. K. Coskun *et al.*, "Proactive temperature balancing for low cost thermal management in mpsocs," in *Computer-Aided Design*, 2008. *ICCAD 2008. IEEE/ACM International Conference on*. IEEE, 2008.
- [15] K. Arabi et al., "Power supply noise in socs: Metrics, management, and measurement," *IEEE Design & Test of Computers*, vol. 24, no. 3, 2007.
- [16] M. S. Gupta *et al.*, "Understanding voltage variations in chip multiprocessors using a distributed power-delivery network," in *Design*, *Automation & Test in Europe Conference & Exhibition*, 2007. DATE'07. IEEE, 2007.
- [17] P. Restle *et al.*, "Timing uncertainty measurements on the power5 microprocessor," in *Proc. IEEE International Solid State Circuits Conference*. IEEE, 2004.
- [18] P. J. Restle et al., "A clock distribution network for microprocessors," IEEE Journal of Solid State Circuits, vol. 36, no. 5, 2001.
- [19] W. Huang *et al.*, "Hotspot: A compact thermal modeling methodology for early-stage vlsi design," *Very Large Scale Integration (VLSI) Systems, IEEE Transactions on*, vol. 14, no. 5, 2006.
- [20] S. O. Memik et al., "Optimizing thermal sensor allocation for microprocessors," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 27, no. 3, 2008.
- [21] C.-C. Chung et al., "An autocalibrated all-digital temperature sensor for on-chip thermal monitoring," Circuits and Systems II: Express Briefs, IEEE Transactions on, vol. 58, no. 2, 2011.
- [22] K. Woo et al., "Dual-dll-based cmos all-digital temperature sensor for microprocessor thermal monitoring," in *Solid-State Circuits Conference-Digest of Technical Papers*, 2009. ISSCC 2009. IEEE International. IEEE, 2009.
- [23] P. Chen et al., "A time-to-digital-converter-based cmos smart temperature sensor," Solid-State Circuits, IEEE Journal of, vol. 40, no. 8, 2005.
- [24] A. L. Aita *et al.*, "A cmos smart temperature sensor with a batch-calibrated inaccuracy of  $\pm 0.25$ °C ( $3\sigma$ ) from -70 °C to 130 °C," in *Solid-State Circuits Conference-Digest of Technical Papers, 2009. ISSCC 2009. IEEE International.* IEEE, 2009.