This WG will explore approaches to continuous execution in the presence of failures. The partners will work on innovative techniques to deal with hardware and system software failures or intentional changes within the complex system environment: resilient, reactive schedulers that can survive errors at the node and/or cluster level; cluster-level monitoring and assessment of failures, with proactive actions to remedy failures before they actually occur; and malleable applications that can adapt their resource usage at runtime.
Key objectives: monitoring and assessment of failures in ultra-large-scale systems; going beyond fail-stop errors to manage hard errors, transient errors, and failures in the SW stack; fault tolerance at the ultrascale system level, devising integrated design approaches to provide continuous service in the presence of continuous streams of errors; and understanding HW & SW dependencies and monitoring changes and their impact within complex systems.
Topics: fault tolerance in parallel programming models (e.g. partitioned global address space (PGAS), MPI, hybrid) and federated cooperative environments; proactive actions based on efficient and reliable fault-prediction mechanisms; exposing errors to the whole system and collective decision-making processes; dynamic replication of data and/or behaviour; supporting resilience at the application level: developing malleable applications; supporting resilience at the infrastructure level: developing resilient schedulers at the node (shared-memory) or cluster (across nodes) level; automatic HW & SW context and dependency detection; and methods for verifying (qualitative) process performance against predefined requirements in dynamic environments.
Resilience within Ultrascale Computing System: Challenges and Opportunities from Nesus Project.