Reliability Modeling and Management in Mobile Devices
Optimizing performance in mobile devices increases peak power consumption, causing the device temperature to rise quickly. This dramatically worsens the impact of the reliability degradation mechanisms of transistors and interconnects, which depend on temperature and voltage stress. Semiconductor devices are subject to degradation mechanisms that harm their performance over time and eventually cause them to fail. CMOS transistors are affected by Time-Dependent Dielectric Breakdown (TDDB), Negative Bias Temperature Instability (NBTI) and Hot Carrier Injection (HCI). Metal interconnects are affected by Electromigration (EM) and Thermal Cycling (TC). With the scaling of CMOS, the impact of degradation worsens as the dimensions of transistors and interconnects shrink, leading to early failure. The International Technology Roadmap for Semiconductors (ITRS) identifies reliability issues due to aging as a primary concern for integrated circuits. Uncertainties in reliability can lead to performance, cost and time-to-market penalties and can cause field failures that are costly to fix and damaging to reputation. In the past, it was possible to mitigate reliability issues with higher design margins, trading reliability against power and timing without affecting performance much. With the latest technologies, however, increasing the design margin further would severely jeopardize performance. For this reason, much effort has been spent in the last decade on developing innovative solutions to guarantee high reliability at the process, design, OS, software and application levels. Solutions for reliability at the process and design levels are difficult to realize in practice. Modifying the manufacturing process may not be affordable, while making the chip design more robust can significantly increase the design cost and decrease profit margins. Techniques at the compiler level, instead, introduce memory and performance overhead.
They rely on the ability of application and compiler programmers to adapt the software and enable reliability-aware execution. At the application level, proposed techniques guarantee the correctness of execution with no hardware overhead. However, they are often application-specific, and thus they lack portability. All the degradation mechanisms depend on voltage and temperature stress and can be described by a reliability function which, at any point in time, represents the probability that the device has not failed. Given this, the degradation of devices can be changed at runtime by managing the operating conditions that influence voltage and temperature. This requires a model-based estimation of reliability degradation over time. Such a strategy is referred to as Dynamic Reliability Management (DRM). The most promising abstraction layer on which to implement DRM strategies is the operating system, in close cooperation with hardware. Modern operating systems, in fact, have power management capabilities that handle factors impacting reliability, power and temperature, with negligible overhead at runtime. A dynamic reliability management policy developed for the operating system is relatively easy to implement, portable to other devices, requires no hardware overhead and is application-independent. However, an effective solution for reliability management requires detailed information about the status of the platform under control (through sensors), and also information about the quality requirements of the running application. Thus, a cross-layer approach is a promising solution for reliability. Reliability can be monitored at runtime through the use of degradation sensors. Unfortunately, such sensors are not available on commercial devices today. Therefore, implementing DRM on real devices requires the online emulation of reliability degradation.
Such emulation leverages readings from built-in voltage and temperature sensors as inputs to the reliability model and computes the current reliability value. This measure can then be used in a control loop to adapt the lifetime degradation of the target device. The scaling of CMOS and the consequent increase in degradation rate variability make dynamic reliability management even more important. Because of this variability, the distribution of circuit lifetimes becomes wider. DRM is then required to balance the degradation of devices over time and avoid early failures.
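The online emulation and control loop described above can be illustrated with a minimal Python sketch. It is not the model or policy of this work: it assumes a Weibull reliability function whose scale parameter shrinks with temperature (Arrhenius term) and voltage (exponential term), and it accumulates damage over piecewise-constant sensor readings. All names (WeibullAgingModel, ReliabilityEmulator, choose_level, run_control_loop) and all parameter values are hypothetical placeholders chosen for illustration only.

```python
import math
from dataclasses import dataclass

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


@dataclass
class WeibullAgingModel:
    """Hypothetical stress-dependent Weibull model; every value is an assumption."""
    beta: float = 2.0           # Weibull shape parameter
    alpha_ref_s: float = 3.0e8  # Weibull scale at reference conditions, in seconds
    t_ref_k: float = 330.0      # reference temperature, in kelvin
    v_ref: float = 1.0          # reference voltage, in volts
    ea_ev: float = 0.5          # activation energy, in eV
    gamma: float = 5.0          # voltage acceleration factor, in 1/V

    def scale(self, temp_k: float, volt: float) -> float:
        """Weibull scale under constant stress; smaller when T or V is higher."""
        arrhenius = math.exp(
            self.ea_ev / K_BOLTZMANN_EV * (1.0 / temp_k - 1.0 / self.t_ref_k)
        )
        voltage = math.exp(-self.gamma * (volt - self.v_ref))
        return self.alpha_ref_s * arrhenius * voltage


class ReliabilityEmulator:
    """Emulates a degradation sensor from temperature and voltage readings."""

    def __init__(self, model: WeibullAgingModel) -> None:
        self.model = model
        self.damage = 0.0  # accumulated normalized stress time

    def update(self, temp_k: float, volt: float, dt_s: float) -> float:
        """Add the damage of one interval spent at (temp_k, volt); return R."""
        self.damage += dt_s / self.model.scale(temp_k, volt)
        return self.reliability()

    def reliability(self) -> float:
        """Probability that the device has not failed so far."""
        return math.exp(-(self.damage ** self.model.beta))


def target_reliability(t_s: float, target_scale_s: float, beta: float) -> float:
    """Reference trajectory the controller tries to stay above (assumed form)."""
    return math.exp(-((t_s / target_scale_s) ** beta))


def choose_level(level: int, reliability: float, target: float, num_levels: int) -> int:
    """Step to a lower operating point when below target, otherwise step up."""
    if reliability < target:
        return max(level - 1, 0)
    return min(level + 1, num_levels - 1)


def run_control_loop(readings, dt_s, model, target_scale_s, num_levels, start_level):
    """Replay (temp_k, volt) readings and record (reliability, level) per step.

    In a real system the chosen level would change the next readings; here the
    readings are supplied by the caller to keep the sketch self-contained.
    """
    emulator = ReliabilityEmulator(model)
    level, elapsed, history = start_level, 0.0, []
    for temp_k, volt in readings:
        elapsed += dt_s
        r = emulator.update(temp_k, volt, dt_s)
        target = target_reliability(elapsed, target_scale_s, model.beta)
        level = choose_level(level, r, target, num_levels)
        history.append((r, level))
    return history
```

The one-step threshold policy is deliberately simple. It only shows where the emulated reliability value enters the loop; an actual DRM policy would also account for the quality requirements of the running application.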