Problems, Solutions, Limitations

Transient and Permanent Faults in Nanoelectronic ICs: Compensation and Repair
Problems, Solutions, Limitations
H. T. Vierhaus, BTU Cottbus, Computer Engineering

Outline
1. Introduction: Nanostructure Problems
2. Transient Faults
3. Repair of Permanent Faults
4. Bus Structures and NoCs
5. Diagnostic Test
6. A Lot of Things to do ...

1. Introduction
A bunch of new problems from nanostructures ...

Nanoelectronic Problems
- Lithography: The wavelength used to "map" structural information from masks to wafers is larger (4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm). Layouts must be adapted to correct mapping faults.
- Parameter variations: The number of atoms in MOS transistor channels becomes so small that statistical variations of doping densities have an impact on device parameters such as threshold voltages.

Doping Fluctuations in MOS Transistors
[Figure: MOS transistor cross-section with substrate, n regions, poly-Si gate and individual doping atoms.]
Density and distribution of doping atoms cause shifts in transistor threshold voltages!

Nanostructure Problems
- Individual device characteristics such as Vth depend more strongly on statistical variations of underlying physical features such as doping profiles.
- A significant share of basic devices will be "out of specs" and must be replaced by backup elements for yield improvement after production.
- As smaller features mean higher stress (field strength, current density), early failures "in the field" are also more likely and must be compensated.
- Recognizing and compensating transient errors "in time" is becoming a must, e.g. because charged particles can discharge circuit nodes.

Key Technologies
Fault tolerant computing
- Is required to handle intermittent and transient fault effects, e.g. induced by radiation.
- An old technology that is already heavily used in everyday computing (e.g. memory interfaces with ECC check and correction).
- Can handle only a limited number of permanent faults!
Built-in self test (BIST) and self-repair (BISR)
- Is required to handle permanent faults by self-repair using redundant elements.
- State-of-the-art for memories, not for logic.
- Can handle multiple faults (sequentially) until the resource of redundancy is exhausted.
Algorithms that are fully or partially "fault hard"
- Most DSP algorithms show an inherent "stability" and work even under fault conditions with reduced precision.
- The effect can be "HW-enhanced".

System-on-a-Chip (SoC)
SoCs are heterogeneous systems that require test & repair strategies for:
- logic (also in processors)
- memory blocks
- interconnects
- analog and D/A components

Fault Tolerant Computing
[Diagram: a fault event can be handled at three levels.]
- Software-based fault detection & compensation (specific): works only for transient faults!
- HW logic & RT-level detection & compensation (universal): typically works for transient and permanent faults!
- Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!

2. Transient Fault Effects

Storage Nodes and Particles
[Figure: storage node charge Q in fC (log scale, 1 to 100) versus technology (0.35, 0.25, 0.18, 0.09), with the alpha-particle charge level marked.]
A 1 MeV alpha particle generates 42 fC of charge!

Contribution to Soft-Error Rates
- Static combinational logic: %
- Sequential elements (FFs, latches): 49 %
- Unprotected SRAM: %
Source: S. Mitra, N. Seifert, M. Zhang, Q. Shi, K. S. Kim, "Robust System Design with Built-In Soft Error Resilience", IEEE Computer, Vol. 38, No. 2, Feb. 2005, pp

Spikes and Clock Rates in Logic
[Figure: a pulse of 100 ps shown against two clock waveforms over time t. In one case charge/status restoration is possible, in the other it is impossible.]
Fault probability in digital logic is roughly proportional to clock frequency!
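The proportionality can be made plausible with a rough estimate. The sketch below assumes a model that is not on the slide: a pulse is captured only if it overlaps a flip-flop's latching window, and the strike time is uniformly distributed over the clock period. The names `capture_probability`, `pulse_ps` and `window_ps`, and the 50 ps window, are illustrative assumptions.

```python
def capture_probability(pulse_ps: float, window_ps: float, clock_mhz: float) -> float:
    """Rough chance that one transient pulse is latched as a wrong value.

    Assumption: the strike time is uniform over the clock period, and the
    pulse is captured whenever it overlaps the latching window.
    """
    period_ps = 1e6 / clock_mhz           # 1 MHz corresponds to 1,000,000 ps
    vulnerable_ps = pulse_ps + window_ps  # strike times that reach the window
    return min(1.0, vulnerable_ps / period_ps)


if __name__ == "__main__":
    for mhz in (100, 500, 1000, 2000):
        p = capture_probability(pulse_ps=100, window_ps=50, clock_mhz=mhz)
        print(f"{mhz:5d} MHz: capture probability {p:.3f}")
```

As long as the clock period is much longer than the pulse, doubling the clock rate doubles the estimate, which is the trend the slide states.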

Logic Structures and Fault Events
[Figure: particle radiation hitting the logic between input FFs and output FFs.]
Flip-flops need fault tolerance / fault hardening in the first place, logic close to the outputs comes next.

Muller-C-Element

Fault-Tolerant Latch Design
[Figure: the input in feeds Latch 1 and Latch 2; their outputs outl1 and outl2 drive a Muller C-element that produces out (load CL). Waveform v(t) of the clock over t.]
If clock is high: out = in, outl1 = in, outl2 = in.
Otherwise: outl1, outl2 latched.

Fault Handling Muller-C-Element:
- If both inputs are equal: out = outl1, outl2
- If both inputs are not equal: out = previous (outl1, outl2)
- Under local fault conditions on the latch outputs (one of 2 latches false), the C-element preserves the output condition from the "charge" phase of the latch.
- Essentially 3 latches!
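The hold behaviour described above can be sketched as a small behavioural model. This is a logic-level illustration only, not a circuit description; the class `MullerC` and the helper `read_hardened_latch` are hypothetical names introduced here.

```python
class MullerC:
    """Two-input Muller C-element: follows the inputs when they agree,
    holds the previous output when they disagree."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, a: int, b: int) -> int:
        if a == b:          # inputs agree: output takes their value
            self.out = a
        return self.out     # inputs differ: previous output is kept


def read_hardened_latch(stored: int, flip_latch1: bool = False,
                        flip_latch2: bool = False) -> int:
    """Store a bit in two latches, optionally flip one or both, read via C-element."""
    c = MullerC()
    c.update(stored, stored)            # transparent phase: both latches agree
    outl1 = stored ^ int(flip_latch1)   # fault injection on latch 1
    outl2 = stored ^ int(flip_latch2)   # fault injection on latch 2
    return c.update(outl1, outl2)


if __name__ == "__main__":
    print(read_hardened_latch(1, flip_latch1=True))                    # single upset
    print(read_hardened_latch(1, flip_latch1=True, flip_latch2=True))  # double upset
```

A single upset in either latch leaves the output unchanged, while an upset in both latches at once is not masked.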

Intel‘s Scan Path Element

Intel‘s Scan Path Element plus Fault Compensation

TMR-Latch / Flip-Flop
[Figure: in feeds FF1, FF2 and FF3 (common clock); an XOR generates cout, which controls the output MUX.]
Out = L1out with cout = 1
Out = L2out with cout = 0
- Works with latches or flip-flops
- Can compensate static or dynamic faults in latches / FFs!
- FF1 is untestable (active redundancy)
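The selection rule can be checked with a few lines of code. The wiring is not fully recoverable from the slide text, so this sketch assumes that the XOR compares FF2 with FF3; only the MUX rule (FF1 when cout = 1, FF2 when cout = 0) is taken from the slide. `tmr_output` is a hypothetical name.

```python
def tmr_output(ff1: int, ff2: int, ff3: int) -> int:
    """Output of the MUX-based TMR element.

    Assumption (not confirmed by the slide): cout = FF2 XOR FF3.
    Slide rule: Out = FF1 when cout = 1, Out = FF2 when cout = 0.
    """
    cout = ff2 ^ ff3
    return ff1 if cout else ff2


def masks_all_single_faults() -> bool:
    """Flip each storage element in turn and check that the output is unaffected."""
    for value in (0, 1):
        for faulty in range(3):
            state = [value, value, value]
            state[faulty] ^= 1          # inject one flipped element
            if tmr_output(*state) != value:
                return False
    return True


if __name__ == "__main__":
    print(masks_all_single_faults())
```

Under this assumption FF1 only reaches the output when FF2 and FF3 disagree, which fits the remark that FF1 acts as active, untestable redundancy.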

TMR-Scan-Element

TMR Scan-Element
- Fault tolerant in functional mode
- Fault tolerant in scan mode
- Optional support of test strategies that require a specific sequence of 2 input bits!

Fault tolerant Latches and FFs

Fault Compensation in Combinational Logic
[Figure: particle radiation hitting combinational logic behind the input FFs; each output is captured through an MC element and a D element.]

Fault Compensation in Combinational Logic
[Figure: three waveforms V(t) over t: the fault-free signal, the signal with a glitch (marked "latch close"), and the signal with a delayed glitch (marked "time left to capture!"). MC phases: capture, no capture / hold, capture.]

3. Repair of Permanent Faults
Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not together with additional transient faults!

Memory Test & Repair
[Figure: memory array with address lines, read/write lines, columns and a spare column.]

Memory Test & Repair (2)
[Figure: the same memory array with address lines, read/write lines, columns and a spare column, extended by a memory BIST controller.]
... is already state-of-the-art!
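The principle behind spare-column repair can be sketched in software. This is a simplified behavioural model, not the controller from the slides: the class `RepairableMemory`, its fault-injection dictionary `stuck`, and the write/read test pattern are assumptions made for illustration.

```python
class RepairableMemory:
    """Bit-organized memory with spare columns and a simple self-test."""

    def __init__(self, rows: int, columns: int, spare_columns: int = 1) -> None:
        self.columns = columns
        self.cells = [[0] * (columns + spare_columns) for _ in range(rows)]
        self.free_spares = list(range(columns, columns + spare_columns))
        self.remap = {}   # logical column -> physical spare column
        self.stuck = {}   # fault injection: physical column -> stuck value

    def physical(self, col: int) -> int:
        return self.remap.get(col, col)

    def write(self, row: int, col: int, bit: int) -> None:
        self.cells[row][self.physical(col)] = bit

    def read(self, row: int, col: int) -> int:
        phys = self.physical(col)
        return self.stuck.get(phys, self.cells[row][phys])

    def _column_ok(self, col: int) -> bool:
        # Destructive test: write 0 and 1 to every cell and read them back.
        for row in range(len(self.cells)):
            for bit in (0, 1):
                self.write(row, col, bit)
                if self.read(row, col) != bit:
                    return False
        return True

    def self_test_and_repair(self) -> bool:
        """Replace faulty columns by spares; False if redundancy is exhausted."""
        for col in range(self.columns):
            if self._column_ok(col):
                continue
            if not self.free_spares:
                return False
            self.remap[col] = self.free_spares.pop(0)
            if not self._column_ok(col):   # the spare itself is faulty
                return False
        return True
```

With one spare column, one faulty column can be repaired; a second faulty column exhausts the redundancy and the repair attempt fails.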

Logic Self Repair

Granularity of Replacement

Levels of Repair

Replacement in Regular Structures (e.g. for DSP)

Parallel Backup Transistors
[Figure, two panels: basic gate with inputs in1, in2 and output out between VDD and GND; the same gate with redundant parallel transistors.]

Redundancy by „Active“ Parallel Transistors
- Active redundancy is not testable. Therefore there is no way to monitor the status of "available" redundancy in a logic circuit.
- Parallel transistors cannot compensate a fault of the "stuck-on" type (transistor always conducting).
- Faulty "backup" transistors may produce additional faults that cannot be corrected!
- Adding redundancy is not enough, fault isolation is a real problem!
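Why parallel transistors cover only one fault type can be seen from a switch-level sketch. The model treats each transistor as an ideal switch; `branch_conducts` and `parallel_pair_conducts` are hypothetical helper names.

```python
def branch_conducts(gate_on: bool, fault: str = "none") -> bool:
    """Ideal-switch model of one transistor with an optional permanent fault."""
    if fault == "stuck-off":
        return False        # never conducts
    if fault == "stuck-on":
        return True         # always conducts
    return gate_on


def parallel_pair_conducts(gate_on: bool, fault_a: str = "none",
                           fault_b: str = "none") -> bool:
    """Two transistors in parallel, driven by the same gate signal."""
    return branch_conducts(gate_on, fault_a) or branch_conducts(gate_on, fault_b)


if __name__ == "__main__":
    for fault in ("none", "stuck-off", "stuck-on"):
        behaviour = [parallel_pair_conducts(g, fault_a=fault) for g in (False, True)]
        print(f"{fault:9s} -> conducts for gate off/on: {behaviour}")
```

With a stuck-off transistor the healthy partner still follows the gate signal, but with a stuck-on transistor the pair conducts even when the gate is off, so the fault is not compensated.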

Configuration and Fault Isolation
[Figure: gate with inputs in1, in2 and output out between VDD and GND, with a stuck-on fault marked.]

The Gate-Short-Problem
[Figure: a driver feeding Load 1 and Load 2, with a gate short marked.]
GND shorts of input gates affect the whole fan-in network and make redundancy obsolete!

Gate Turn-off

Schematic Layout with VDD/GND Switches
[Figure, two panels: gate with parallel redundancy; gate with parallel redundancy and fault isolation.]

Transistor-Level Overhead
Redundancy                          parallel transistors   VDD / GND switches   separate gate poly lines
Overhead (cells only, estimates)    30-40 %                60-80 %              %
Stuck-off coverage                  yes                    yes                  yes
Stuck-on coverage                   no                     yes                  yes
Gate-short coverage                 no                     no                   yes
Control lines                       none                   one wire             mult. wires

Duplicate Standard Cells
[Figure: Gate 1 and Gate 2 with shared inputs in1, in2 and outputs out, supplied through VDD switches (VDD1, VDD2) with a VDD switch control.]

Again: Fault Isolation
[Figure: the duplicated gates with VDD switches (VDD1, VDD2) as before, with a gate input short and an output VDD / GND short marked.]

Administrated Duplicate Cells
[Figure: Gate 1 and Gate 2 with common gate inputs and outputs, VDD power switches (VDD1, VDD2) and GND switches (GND1, GND2) controlled by Act 1 and Act 2, with a gate short marked and the switch states for each configuration.]

Features
- Use "normal" cell designs.
- Four states of operation:
  Config. 1: Gate 1 active, Gate 2 isolated
  Config. 2: Gate 2 active, Gate 1 isolated
  Config. 3: Both gates active, operating in parallel
  Config. 4: Both gates isolated from VDD / GND
- Operations like "high / low power" are possible.
- Cells can be put to temporary "sleep" for stress relief.
- Permanent repair functions.
- The active cell output is connected only to "floating" outputs of the other cell.
- If twin tubs are used and cell-internal tubs are also disconnected, gate input / GND shorts are prevented.

Bistable Switching Cell
[Figure: bistable switching cell between VDD and GND that selects Gate 1 or Gate 2 via Act, with output separation and the logic levels for both states.]

Cell Duplication and Power Switch
- Possible for all types of cells (also flip-flops).
- Granularity of partitioning for replacements (single gates, blocks) can be selected upon demand.
- Combines favorably with dynamic circuit optimization.
- Good coverage potential for transistor faults.
- Significant overhead (above 100 %), but most likely below Triple Modular Redundancy (TMR).
- Redundancy may become exhausted and requires a further level of redundancy!

Gate Replacement
[Figure: a row of standard cells (gates) with a gate fault and a backup cell; insertion of the replacement cell.]

Regular Logic Wiring
[Figure: logic gates and backup cells with link, drive and feed connections to the next cell, controlled by a config block.]

Faults on Irregular Interconnects
[Figure: routing tree from signal source S to several sinks C, with a single fault (line break) marked.]

Redundant Wiring
[Figure: routing tree with loops from signal source S to several sinks C; an extra wire bypasses a single fault (line break).]
- Routing tree with loops .. plus double vias!
- Problem: classic delay calculation works well on trees only!

4. Bus Structures and „Networks on Chip“ (NoCs)
Technology forecasts predict that nano-wires may become the most vulnerable and unreliable circuit elements ...

Buses versus NoCs
[Figure, two panels: irregular bus structure (SoC) connecting several bus masters; regular network structure (NoC) as a grid of NoC nodes.]

Faults on Bus Structures
[Figure: bus masters BM 1 to BM 6 on a shared bus.]
A local defect affects the total network.

Bus Fault Conditions
- Technology forecasts predict a reliability problem with interconnects (nano-wires) in nano-technologies.
- A single permanent fault on a bus may affect the bus as a whole.
- Fault detection and compensation by methods developed for transient faults (Hamming code, ECC checks) can handle static faults, but are relatively expensive.
- Capabilities of handling transient faults on top of permanent faults are limited.

Bus Segmentation
[Figure: bus masters BM 1 to BM 6 on a bus that is divided by segment couplers (SC).]
Structure the bus into segments that can be repaired individually!

The Switching Problem
[Figure: switch stages mapping n lines onto n+k lines and back onto n lines; backup lines 1 ... p.]
n     k    p    switches   contr. states
8     1    1    16         9
16    1    1    32         33
32    2    2    128        65

Faults and Repair Actions
1. Line break: a section of a line is interrupted. Use spare wire!
2. Line short to GND: a section of a line is connected to GND. Use spare wire!
3. Dynamic coupling between adjacent lines:
   a. Re-allocate lines in bundle
   b. Insert grounded line for decoupling
4. Bridge between lines:
   a. Feed both lines with same signal
   b. Make one line "floating"

Single Line Replacement
[Figure: signal lines s0, s1, s2, s3, s4, ..., s(k-1) with a fault on one line, shifted onto the backup lines b0, b1, b2.]
Overhead:
- 2k switches, (k+1) logic states for 1 backup line
- 2pk switches, p (k+1) logic states for p backup lines
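The overhead figures above are easy to tabulate for a few bus widths. The function name `replacement_overhead` is an illustrative choice; the formulas are the ones stated on the slide.

```python
from typing import Tuple


def replacement_overhead(k: int, p: int = 1) -> Tuple[int, int]:
    """Switches and logic states for k signal lines with p backup lines.

    Formulas from the slide: 2pk switches and p(k+1) logic states.
    """
    switches = 2 * p * k
    states = p * (k + 1)
    return switches, states


if __name__ == "__main__":
    for k in (8, 16, 32):
        for p in (1, 2):
            switches, states = replacement_overhead(k, p)
            print(f"k={k:2d} p={p}: {switches:3d} switches, {states:2d} logic states")
```

Both quantities grow linearly with the bus width and with the number of backup lines.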

Inserting Lines for Decoupling
[Figure: signal lines s0, s1, s2, s3, s4, ..., s(k-1) with a fault, and backup lines b0, b1, b2.]
Multiple line insertion for de-coupling requires multiple shifts of lines, multiple switches and states!

Repair Mechanisms
- Buses with "extra" backup lines that need specific configuration for repair generate high cost in terms of switches and administration due to many "logic states" of the bus section.
- Such repair schemes are not suited to re-organize neighborhood relations on buses for de-coupling of lines.
- Try to cover all relevant fault conditions by a small set of states using permutation of lines!

Reconfiguration for De-Coupling
[Figure: lines i and k exchanged by a 2-way switch.]
2-way switches may be used!
... can help to minimize dynamic coupling faults!

Characteristics of 6 / 8 Wire Bundles
Given a bundle of 6 or 8 bus lines: Are there any permutations that create all-new neighbors for every single line in order to eliminate coupling faults?

6 lines   8 lines
NNP6      NNP81   NNP82   NNP83
0 - 2     0 - 2   0 - 3   0 - 5
1 - 4     1 - 6   1 - 5   1 - 7
2 - 0     2 - 0   2 - 7   2 - 4
3 - 5     3 - 5   3 - 0   3 - 6
4 - 1     4 - 7   4 - 6   4 - 2
5 - 3     5 - 3   5 - 1   5 - 0
          6 - 1   6 - 4   6 - 3
          7 - 4   7 - 2   7 - 1
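Whether a permutation really gives every line all-new neighbors can be checked mechanically. The sketch below reads each pair "a - b" as "line a moves to position b"; that reading is an assumption, and the 6-line mapping (0-2, 1-4, 2-0, 3-5, 4-1, 5-3) is reconstructed from the flattened table. The function names are illustrative.

```python
def wire_order(perm):
    """Line found at each physical position after moving line i to perm[i]."""
    order = [None] * len(perm)
    for line, position in enumerate(perm):
        order[position] = line
    return order


def creates_all_new_neighbors(perm):
    """True if no two originally adjacent lines are adjacent after the permutation."""
    if sorted(perm) != list(range(len(perm))):
        raise ValueError("not a permutation")
    order = wire_order(perm)
    return all(abs(a - b) != 1 for a, b in zip(order, order[1:]))


# Mappings as listed for the bundles (index = line, value = new position).
NNP6 = [2, 4, 0, 5, 1, 3]
NNP8 = [
    [2, 6, 0, 5, 7, 3, 1, 4],
    [3, 5, 7, 0, 6, 1, 4, 2],
    [5, 7, 4, 6, 2, 0, 3, 1],
]

if __name__ == "__main__":
    print(creates_all_new_neighbors(NNP6))
    print([creates_all_new_neighbors(p) for p in NNP8])
```

All four listed permutations pass the check, whereas the identity and a simple pair-wise swap of neighbors do not.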

6 Wires: Permutations and Replacement
[Figure: input wire mapping through the 1st, 2nd and 3rd switching column, marking the lines selected for backup and which lines can replace which.]
Administration:
- 4 logic states for 2 switching columns (2 extra wires)
- 6 logic states for 3 switching columns (1 extra wire)

Selection of Permutations
- All single faults must be repairable by selecting a minimum set of permutations.
- Those lines that can act as replacement for most of the others are selected as "backup lines".
- By permutation, non-faulty functional lines are re-arranged as well.
- No permutation used for repair may map a functional line to a faulty line.
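The selection rules can be explored with a small search. This sketch rests on assumptions that go beyond the slide: two switching columns, each either passing lines straight through or applying one fixed permutation (here the first two 8-line new-neighborhood permutations of this deck), which gives 4 logic states. A fault is repairable if some state routes a backup line, and therefore no functional line, onto the faulty wire. All names are hypothetical.

```python
from itertools import combinations

P1 = [2, 6, 0, 5, 7, 3, 1, 4]   # first new-neighborhood permutation (8 lines)
P2 = [3, 5, 7, 0, 6, 1, 4, 2]   # second new-neighborhood permutation (8 lines)
IDENTITY = list(range(8))


def compose(first, second):
    """Mapping obtained by applying `first`, then `second`."""
    return [second[first[x]] for x in range(len(first))]


# Four logic states of two switching columns (assumed arrangement).
STATES = {
    "pass/pass": IDENTITY,
    "P1/pass": P1,
    "pass/P2": P2,
    "P1/P2": compose(P1, P2),
}


def repair_state(faulty_wire, backup_lines):
    """Name of a state that puts a backup line on the faulty wire, else None."""
    for name, mapping in STATES.items():
        if any(mapping[b] == faulty_wire for b in backup_lines):
            return name
    return None


def covers_all_single_faults(backup_lines):
    return all(repair_state(w, backup_lines) is not None for w in range(8))


def candidate_backup_pairs():
    """All pairs of lines that can repair every single wire fault."""
    return [pair for pair in combinations(range(8), 2)
            if covers_all_single_faults(pair)]


if __name__ == "__main__":
    print(candidate_backup_pairs())
```

Because each mapping is a bijection, placing a backup line on the single faulty wire guarantees that every functional line lands on a healthy wire. Under these assumptions several pairs of backup lines, for example lines 0 and 1, cover all eight single faults with only four states.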

Permutations for 8-Wire-Bundles
New-neighborhood          Pair-wise symmetrical
NNP1    NNP2    NNP3      PW1     PW2     PW3
0 - 2   0 - 3   0 - 5     0 - 1   0 - 6   0 - 4
1 - 6   1 - 5   1 - 7     1 - 0   1 - 3   1 - 7
2 - 0   2 - 7   2 - 4     2 - 3   2 - 4   2 - 5
3 - 5   3 - 0   3 - 6     3 - 2   3 - 1   3 - 6
4 - 7   4 - 6   4 - 2     4 - 5   4 - 2   4 - 0
5 - 3   5 - 1   5 - 0     5 - 4   5 - 7   5 - 2
6 - 1   6 - 4   6 - 3     6 - 7   6 - 0   6 - 3
7 - 4   7 - 2   7 - 1     7 - 6   7 - 5   7 - 1

[Figure: selected backup wires.]
2 lines selected for backup!

4 lines selected for backup!

Overhead / Coverage for 6-Line-Bundle
Spare lines / switches    0/     /      / 24
Single line fault         -      +      +
Dyn. coupl. faults        +      +      +
Double line faults        -      -      50%

Overhead / Coverage for 8-Line-Bundle
Spare lines (out of 8) / switches:  0/   /   / 32   3 / 32   4 / 32
Single line fault:                  -  +  +  +  +
Dyn. coupl. fault:                  +  +  +  +  ++  ++
Double line faults:                 -  -  20%  30%  100 %
Note: The number of switches is reduced by a factor of 2 if full 2-way switches with 2 inputs / 2 outputs are used!

Results
- Bus segments can favorably be organized into bundles of 8 lines for reconfiguration. Wider bundles require even more columns of switches.
- In a bundle of 8 lines, all single faults can be repaired either by one backup line and 3 columns of switches or by two backup lines and 2 columns, with 6 / 4 logic states respectively.
- Two columns with 4 states also allow for two alternative modes of changing neighborhood relations for de-coupling. This configuration also covers a fraction of double-line faults.
- A full coverage of double-line faults requires 4 backup lines and 2 columns of switches or 2 backup lines and 4 columns.

Administration Scheme
[Figure: lines 0 to 7 in, two columns of switches (A, B) and lines 0' to 7' out, mirrored at the far end of the segment (B, A). Config bits C1, C2 are decoded by config logic on both sides, with matching configurations.]

Processor-Based Bus Test
[Figure: bus masters on a bus with data lines and a bus reflector; a test processor with reflector select, invert, control and clock signals.]

Test and Fault Diagnosis
[Figure: bus masters (BM) connected through segment couplers (SC); a test processor maintains the segment status list.]

Test Procedure & Fault Management
- The test processor can "reset" control of bus sections.
- The test processor runs a diagnostic test to identify faulty lines.
- In case of faults, a "trial and error" test identifies the faulty line segment(s).
- The test processor keeps a "fault list" for redundancy management & supervision.
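A strongly simplified version of this procedure can be sketched in software. The sketch assumes that each bus segment can be tested separately through a reflector that echoes the test pattern, and it models only stuck-at faults; `reflect`, `find_faulty_lines` and `build_fault_list` are hypothetical names, and the per-segment dictionaries are simulated fault injections, not part of the original scheme.

```python
def reflect(pattern, stuck):
    """Simulated segment plus reflector: echoes the pattern, except on stuck lines."""
    return [stuck.get(line, bit) for line, bit in enumerate(pattern)]


def find_faulty_lines(width, send):
    """Send all-zeros and all-ones and report lines whose echo differs."""
    faulty = set()
    for pattern in ([0] * width, [1] * width):
        echoed = send(pattern)
        faulty.update(line for line, (sent, got) in enumerate(zip(pattern, echoed))
                      if sent != got)
    return sorted(faulty)


def build_fault_list(segments, width):
    """Test each segment in turn and keep a fault list per segment."""
    fault_list = {}
    for name, stuck in segments.items():
        lines = find_faulty_lines(width, lambda p, s=stuck: reflect(p, s))
        if lines:
            fault_list[name] = lines
    return fault_list


if __name__ == "__main__":
    # Simulated bus: line -> stuck value for each segment.
    segments = {"seg0": {}, "seg1": {3: 0}, "seg2": {1: 1, 6: 0}}
    print(build_fault_list(segments, width=8))
```

The resulting dictionary plays the role of the fault list that the test processor keeps for redundancy management.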

Summary
- A simple scheme of re-arranging bus sections for repair of permanent faults.
- Simple control scheme based on few logic states.
- Modular approach based on bundles of lines is scalable to cover wider buses.
- Should work well with NoCs.
- Compatibility with regular schemes for bus test based on a dedicated test processor device.
- The number and the electrical effect of switches in complex bus systems may still cause problems.

5. Diagnostic Tests
Fault diagnosis by diagnostic (self-) test is possibly the real bottleneck in logic BISR!

Fault Diagnosis
- Memory cells are easy to diagnose in case of faults affecting single cells. BIST is possible.
- Diagnostic tests of buses that have to discover a single faulty line are straightforward. They can easily find which wires are affected, but not where the fault is.
- Detecting a faulty gate or even transistor in a logic block is a much more challenging problem.
- Diagnosis must be compatible with methods of test response compaction used in scan testing.
- Intelligent encoding for test responses! ... such as done by U. Potsdam and Infineon!
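Why compaction makes diagnosis hard can be seen with a generic multiple-input signature register (MISR). This is a textbook-style sketch, not the encoding scheme by U. Potsdam and Infineon mentioned above; the register width, the feedback taps `0x1D` and the response values are arbitrary assumptions.

```python
def misr_signature(responses, width=8, taps=0x1D, seed=0):
    """Compact a sequence of test response words into one signature.

    Each step shifts the register left, applies the feedback taps if the
    bit shifted out was 1, and XORs in the next response word.
    """
    mask = (1 << width) - 1
    state = seed & mask
    for word in responses:
        feedback = (state >> (width - 1)) & 1
        state = (state << 1) & mask
        if feedback:
            state ^= taps
        state ^= word & mask
    return state


if __name__ == "__main__":
    good = [0x3A, 0x00, 0xFF, 0x51, 0x7C]   # assumed fault-free responses
    bad = list(good)
    bad[2] ^= 0x04                          # one flipped response bit
    print(hex(misr_signature(good)), hex(misr_signature(bad)))
```

With these taps the register update is invertible, so a single corrupted response word always changes the signature, while several errors can cancel each other. The signature shows that something failed, but not which scan cell or pattern caused it, which is why the encoding of test responses matters for diagnosis.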

Combinational Logic Fault Diagnosis
[Figure: combinational logic between input FFs and output FFs.]
- Faults can occur within specific gates, on interconnects, or in a "distributed" manner.
- Identifying a specific faulty gate or line is difficult at best and sometimes close to impossible by logic testing.

Logic Test
[Figure: combinational logic with (pseudo-)inputs and (pseudo-)outputs; an input vector is applied and an output vector is observed.]

Scan Path Technology
[Figure: combinational logic with (pseudo-)inputs and (pseudo-)outputs; the flip-flops (ff) are chained into a scan path from scan-in to scan-out, carrying the input vector and the output vector.]

Scan-based Logic Test
[Figure: compacted / encoded test information passes a de-compactor into the scan chains of the logic blocks (CL); the outputs pass through coding into a test response compactor, followed by diagnosis.]

Fault Diagnosis on Compacted Output Data
[Figure: scan input generator (de-compactor) driven by the scan clock; d-value storage d1 ... d6 with AND gates feeding a MISR; the MISR clock is k * scan clock; the result is compared with a reference MISR.*]
* patented, U. Potsdam and Infineon Technologies AG

6. A Lot of Work to Do
- Logic fault diagnosis
- Efficient logic self repair
- Redundancy supervision and management
- Resource management under fault conditions
- Repair functions for interconnects
- Overall system-level fault management
