Machine Check and Recovery on Prodigy FPGA
Tachyum added Machine Check and Recovery (MCR) capabilities, integrated with the Linux Error Detection and Correction (EDAC) subsystem, to the Prodigy® Universal Processor and demonstrated a successful deployment on the FPGA emulation system.
MCR combined with the Linux EDAC driver is essential for data center applications: together they provide the critical information needed to predict and mitigate failures in the field. By detecting and seamlessly correcting errors caused by external events in the CPU’s internal memory blocks and attached DDR modules, Prodigy can run prolonged workloads without interruption, maintaining and improving the uptime of systems deployed at scale. When damage to Static Random-Access Memory (SRAM) is beyond repair, error detection allows the affected computations to be abandoned rather than return incorrect results.
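As an illustration of how an operator might consume this information, the sketch below reads the corrected-error and uncorrected-error counters that the Linux EDAC subsystem exposes through sysfs for each memory controller (`ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*`). It assumes that the Prodigy EDAC driver registers its DDR controllers in this standard location, which the announcement does not state. The `needs_attention` policy and its threshold are hypothetical.

```python
from pathlib import Path

# Standard Linux EDAC sysfs tree for memory controllers.
# Assumption: the Prodigy EDAC driver registers its controllers here.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_MC_ROOT):
    """Return {controller: {"ce": int | None, "ue": int | None}}."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux host
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


def needs_attention(counts, ce_threshold=100):
    """Hypothetical policy: flag a controller on any uncorrected error,
    or once corrected errors reach ce_threshold."""
    flagged = []
    for name, c in counts.items():
        if (c["ue"] or 0) > 0 or (c["ce"] or 0) >= ce_threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    snapshot = read_edac_counts()
    print(snapshot)
    print("flagged:", needs_attention(snapshot))
```

A rising corrected-error count on one controller is the kind of signal that supports replacing a module before it produces uncorrectable errors.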
Error injection is an essential part of testing. Prodigy contains an error-injection module that can inject both correctable and uncorrectable errors into the relevant CPU blocks, either as a limited number of errors or as a continuous stream at programmable intervals, to ensure the Prodigy architecture meets and exceeds data center requirements. Prodigy provides Double Error Correction and Triple Error Detection (DECTED), a key feature for improving uptime, which is complemented by EDAC to enable preventive maintenance.
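The guarantees of a DECTED code and the two injection modes can be made concrete with a small model. The sketch below is illustrative only: it does not reflect Prodigy's actual injection interface or ECC implementation, and the function names and the abstract "tick" time unit are hypothetical. It relies only on the general property that a double-error-correcting, triple-error-detecting code corrects up to two flipped bits per protected word and detects three.

```python
from itertools import count, islice


def dected_outcome(flipped_bits):
    """Classify what a DECTED code guarantees for a given number of
    flipped bits in one protected word."""
    if flipped_bits < 0:
        raise ValueError("flipped_bits must be non-negative")
    if flipped_bits == 0:
        return "clean"
    if flipped_bits <= 2:
        return "corrected"               # correctable error
    if flipped_bits == 3:
        return "detected_uncorrectable"  # computation can be abandoned
    return "not_guaranteed"              # beyond the code's guarantees


def injection_times(interval, limit=None):
    """Yield the ticks at which a hypothetical injector fires.

    limit=None models a continuous stream; an integer models a
    limited number of injected errors.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ticks = (n * interval for n in count(1))
    return ticks if limit is None else islice(ticks, limit)


if __name__ == "__main__":
    for bits in range(5):
        print(bits, dected_outcome(bits))
    print(list(injection_times(10, limit=3)))
```

In this model, injecting one or two bit flips exercises the correction path, while three exercises the detect-and-abandon path described above.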