The authors set out to develop an FPGA-based processor architecture that can handle errors and defects at runtime. The emphasis is on single-event upsets (SEUs), in which exposure to radiation changes the state of memory cells or bits. Commonly used redundancy-based fault detection techniques can identify an error but offer limited capacity for correcting it at runtime. The authors therefore examine the dynamic partial reconfiguration capability of FPGAs, which makes memory scrubbing and reconfiguration possible at runtime.
The processor is divided into reconfigurable modules. Some space in each partition is left unused for the purpose of error detection. Each module is duplicated, and a few additional resources are attached to the duplicate. The duplicate modules are initially not placed in the reconfigurable area. When one of the modules in the processor’s core becomes defective, a controller detects the error. The duplicate module is then moved into the reconfigurable area, where the additional resources attached to it enable the configuration process. In other words, the next configuration of the module is deployed to replace the faulty one. This allows error detection as well as correction on the System-on-Chip. The error detection techniques employed in the hardware include logic-level and architectural techniques, which are redundancy techniques used in combinational modules.
These techniques detect errors by checking the state of the memory cells in each module. As soon as an error is detected, the Reconfiguration Controller (RC) carries out the reconfiguration. The RC identifies the faulty module by comparing it with the Embedded System Register (ESR), which contains the original values of the module. A mismatch between the module and the ESR values shows that an error has occurred. If the fault is transient, it is eliminated through partial scrubbing and reconfiguration. The approach has two distinct drawbacks. First, the processor loses memory and processing space, because the defective area of the FPGA is completely discarded. Second, system performance suffers because of the increased ‘interconnection overhead’ in the overall architecture. Nonetheless, the approach allows the runtime repair of any number of transient errors as well as the correction of a permanent fault in a duplicate module.
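The detect-and-repair flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the names `Module`, `repair`, `golden`, `stuck`, and `spare_available` are hypothetical. The `golden` list stands in for the reference values the text attributes to the ESR, and a "stuck" cell models a permanent fault that scrubbing cannot clear.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    """Hypothetical model of one reconfigurable module."""
    name: str
    golden: List[int]                     # reference words (role of the ESR in the text)
    config: List[int] = field(default_factory=list)
    stuck: Dict[int, int] = field(default_factory=dict)  # index -> stuck value (permanent fault)
    spare_available: bool = True          # one duplicate module held in reserve

    def __post_init__(self) -> None:
        if not self.config:
            self.config = list(self.golden)

    def read(self) -> List[int]:
        # A permanently damaged cell returns its stuck value regardless of what was written.
        return [self.stuck.get(i, w) for i, w in enumerate(self.config)]

    def write(self, words: List[int]) -> None:
        self.config = list(words)


def repair(module: Module) -> str:
    """Assumed controller policy: compare, scrub, then fall back to the duplicate."""
    if module.read() == module.golden:
        return "ok"
    # Transient fault: rewriting the reference configuration (scrubbing) clears it.
    module.write(module.golden)
    if module.read() == module.golden:
        return "scrubbed"
    # Permanent fault: deploy the duplicate on fresh fabric and discard the damaged area.
    if module.spare_available:
        module.stuck.clear()
        module.write(module.golden)
        module.spare_available = False
        return "replaced"
    return "failed"


alu = Module("alu", golden=[0b1010, 0b0110, 0b1111])
alu.config[1] ^= 0b0100        # simulated single-event upset: one bit flips
print(repair(alu))             # scrubbed
alu.stuck[2] = 0b0000          # simulated permanent defect in one cell
print(repair(alu))             # replaced
alu.stuck[0] = 0b0000          # a second permanent defect, with no spare left
print(repair(alu))             # failed
```

In this model, scrubbing can be repeated any number of times, while the single spare is consumed by the first permanent fault. That mirrors the trade-off in the text: transient errors are repaired indefinitely, but replacing a module costs fabric that is not recovered.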
One of the most important lessons from this research is that FPGAs allow runtime error correction, which saves considerable time and resources. By dividing each module into two parts, the design can use preconfigured modules to resolve an unlimited number of transient errors at runtime. However, only one permanent error can be resolved per module. Another point to keep in mind is the loss of memory and performance when a faulty module is replaced. A further hallmark of the approach is that it does not require checkpointing and rollback mechanisms, which recover a process after an error and let it continue as it would have without the error.