Building Resilience into Real-time Systems

Resilient computing systems can withstand powerful threats and recover from attacks quickly. The question that many are asking themselves today is how to apply this level of resilience to real-time computing scenarios, or other equally extreme use cases.

It can be costly in lives and livelihoods when aeroplanes, self-driving cars, energy networks, financial services, and other critical information infrastructures fail due to malicious attacks or technological error. That is why Professor Marcus Völp is leading a team of “professionally paranoid” researchers, investigating resilience in “extreme computing” scenarios. His team, the Critical and Extreme Security and Dependability Research Group (CritiX), at the University of Luxembourg’s Interdisciplinary Centre for Security, Reliability and Trust (SnT), are looking for a novel approach to ensure very low computing failure rates even in the most dynamic and hostile environments. “We are working to construct systems that can operate safely even when under multiple different kinds of attack, even to the point of semi-autonomously repairing themselves,” explained Prof Völp, who is following on from the pioneering work of the CritiX group’s founder, Paulo Esteves-Veríssimo.

We are working to construct systems that can operate safely even when under multiple different kinds of attack

Marcus Völp, professor and head of SnT’s CritiX research group – Critical and Extreme Security and Dependability Research Group, SnT

“More precisely, we are designing resilient computing systems that can tolerate faults long enough to ensure continuous minimum functionality, while giving time for the full system to be restored,” he explained. “Our response is so fast in shooting down an attack and then bringing the system up again, the hacker can't gain a foothold for long enough to sustain their efforts.”

Resilient modular and distributed computing is the model the team are working on. “This means that even if individual computers fail, taken as a whole, the system will continue to work,” he said. An analogy is the way that cloud computing draws its processing power from servers located around the globe. An outage in one server, one data centre, or even one country can be compensated for by operations in different parts of the world.

This is an example of a large modular and distributed system, but these can also be designed at a small scale. “One of our recent projects was transforming a distributed systems algorithm in a system with multiple cores, and each core is an individual computer unit,” Prof Völp explained. Technology at this scale can be deployed in aircraft and autonomous vehicles.

For example, in a project being run in partnership with colleagues in Germany and Singapore, they are looking at what can happen following an IT outage in a driverless car. Such an outage could cause the system to misread the road situation and then issue potentially dangerous commands.

“We triplicated the components in this driving stack, and if two of them suggest following the correct path, and the other one tries to steer the car into oncoming traffic, the latter will be ignored,” he said. In addition, a repair process is triggered frequently, and at the latest when a malfunction is detected.
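The voting idea in this quote can be illustrated with a minimal sketch. It is not the CritiX implementation: the function names `vote` and `drive_step`, the string commands, and the `repair` callback are hypothetical, chosen only to show majority voting over three replicas, with the outvoted replica flagged for repair.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def vote(outputs: Sequence[str]) -> Optional[str]:
    """Return the output proposed by a strict majority of replicas, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def drive_step(outputs: Sequence[str], repair: Callable[[int], None]) -> Optional[str]:
    """Vote on the replica outputs and request repair of every outvoted replica."""
    decision = vote(outputs)
    if decision is not None:
        for index, output in enumerate(outputs):
            if output != decision:
                # Hypothetical hook: restart or restore the deviating replica.
                repair(index)
    return decision


# Two replicas agree; the third proposes a dangerous command and is ignored.
repaired = []
command = drive_step(["keep_lane", "keep_lane", "steer_left"], repaired.append)
print(command, repaired)
```

In this sketch repair is only requested when a replica is outvoted. The article also describes repairs being triggered frequently regardless of detection, which a periodic scheduler could add on top.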

“We are creating building blocks and techniques which can be applied to all kinds of applications, both large and small,” Prof Völp said. Having multiple replicated units that communicate and provide support in the case of full or partial failure is a key technique. However, these replicas must function in diverse ways so that, for example, a malicious attack team would find it difficult to take down each unit simultaneously.
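Why diversity matters can be shown with a small sketch under simplifying assumptions: each replica is built on one named implementation variant, and a single exploit compromises every replica that shares the targeted variant. The names `survives_exploit`, `variant_a`, and so on are hypothetical and serve only as illustration.

```python
from typing import List


def survives_exploit(variants: List[str], exploited_variant: str) -> bool:
    """Return True if a strict majority of replicas is unaffected by an
    exploit that compromises every replica built on `exploited_variant`."""
    healthy = sum(1 for variant in variants if variant != exploited_variant)
    return healthy > len(variants) // 2


identical = ["variant_a", "variant_a", "variant_a"]
diverse = ["variant_a", "variant_b", "variant_c"]

# One exploit defeats all identical replicas at once...
print(survives_exploit(identical, "variant_a"))
# ...but only one of three diverse replicas, so the majority still holds.
print(survives_exploit(diverse, "variant_a"))
```

With identical replicas, a single exploit defeats the whole group. With diverse replicas, the attacker would need a separate working exploit for a majority of the variants at the same time.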

These systems with multiple redundancies allow individual units to be more easily repaired and restored, and new protection layers to be applied, but this then makes it problematic to deploy them in certain scenarios.

“A little bit challenging and also fun,” is how Prof Völp describes ensuring that the interactions within these complex systems happen in real time. This also requires making speedy readjustments in response to persistent threats.

Rethinking Systems Design with SnT’s CritiX

“This amounts to a paradigm shift, moving to a comprehensive approach to facing extreme challenges,” he said. His CritiX research group revisits even first principles and reviews the architecture and design of systems, while developing appropriate middleware, algorithms and protocols. This enables protection to be provided automatically and incrementally, as adaptations are made to threats that differ in terms of scale, severity, and persistence. Some of these attacks may even be completely novel.

“Our ambition is to become known for our excellence in research applicable to systems that face difficult or extreme situations, be they environmental, operational, and so forth, and putting our work to practical use,” said Prof Völp. “We operate in scenarios when computer science and engineering are pushed to the functional extremes, and to fuel our research we need to partner with Luxembourgish industry, such as the finance and space sectors, to learn about the real-world challenges they face in building resilient systems.”

Interested in partnering with the CritiX team or another SnT Research Group? Contact the SnT Partnership Programme at