Fault-Tolerance Techniques for High-Performance Computing (Record no. 58344)
000 -LEADER | |
---|---|
fixed length control field | 03801nam a22005055i 4500 |
001 - CONTROL NUMBER | |
control field | 978-3-319-20943-2 |
005 - DATE AND TIME OF LATEST TRANSACTION | |
control field | 20200421112542.0 |
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION | |
fixed length control field | 150701s2015 gw | s |||| 0|eng d |
020 ## - INTERNATIONAL STANDARD BOOK NUMBER | |
ISBN | 9783319209432 |
-- | 978-3-319-20943-2 |
082 04 - CLASSIFICATION NUMBER | |
Call Number | 004.24 |
245 10 - TITLE STATEMENT | |
Title | Fault-Tolerance Techniques for High-Performance Computing |
300 ## - PHYSICAL DESCRIPTION | |
Number of Pages | IX, 320 p. 113 illus. |
490 1# - SERIES STATEMENT | |
Series statement | Computer Communications and Networks, |
505 0# - FORMATTED CONTENTS NOTE | |
Remark 2 | Part I: General Overview -- Fault-Tolerance Techniques for High-Performance Computing -- Part II: Technical Contributions -- Errors and Faults -- Fault-Tolerant MPI -- Using Replication for Resilience on Exascale Systems -- Energy-Aware Checkpointing Strategies. |
520 ## - SUMMARY, ETC. | |
Summary, etc | This timely text/reference presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to the concepts of checkpoint protocols and scheduling algorithms, prediction, replication, silent error detection and correction, together with some application-specific techniques such as algorithm-based fault tolerance. Emphasis is placed on analytical performance models. This is then followed by a review of general-purpose techniques, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models. Topics and features: includes self-contained contributions from an international selection of preeminent experts; provides a survey of resilience methods and performance models; examines the various sources for errors and faults in large-scale systems, detailing their characteristics, with a focus on modeling, detection and prediction; reviews the spectrum of techniques that can be applied to design a fault-tolerant message passing interface; investigates different approaches to replication, comparing these to the traditional checkpoint-recovery approach; discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems, proposing a methodology to estimate such energy consumption. This authoritative volume is essential reading for all researchers and graduate students involved in high-performance computing. Dr. Thomas Herault is a Research Scientist in the Innovative Computing Laboratory (ICL) at the University of Tennessee, Knoxville, TN, USA. Dr. Yves Robert is a Professor in the Laboratory of Parallel Computing at the École Normale Supérieure de Lyon, France, and a Visiting Research Scholar in the ICL. |
650 #0 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
General subdivision | Reusability. |
700 1# - AUTHOR 2 | |
Author 2 | Herault, Thomas. |
700 1# - AUTHOR 2 | |
Author 2 | Robert, Yves. |
856 40 - ELECTRONIC LOCATION AND ACCESS | |
Uniform Resource Identifier | http://dx.doi.org/10.1007/978-3-319-20943-2 |
942 ## - ADDED ENTRY ELEMENTS (KOHA) | |
Koha item type | eBooks |
264 #1 - | |
-- | Cham : |
-- | Springer International Publishing : |
-- | Imprint: Springer, |
-- | 2015. |
336 ## - | |
-- | text |
-- | txt |
-- | rdacontent |
337 ## - | |
-- | computer |
-- | c |
-- | rdamedia |
338 ## - | |
-- | online resource |
-- | cr |
-- | rdacarrier |
347 ## - | |
-- | text file |
-- | |
-- | rda |
650 #0 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Computer science. |
650 #0 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Computer software |
650 #0 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Computer system failures. |
650 #0 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Numerical analysis. |
650 14 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Computer Science. |
650 24 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | System Performance and Evaluation. |
650 24 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Performance and Reliability. |
650 24 - SUBJECT ADDED ENTRY--SUBJECT 1 | |
-- | Numeric Computing. |
830 #0 - SERIES ADDED ENTRY--UNIFORM TITLE | |
-- | 1617-7975 |
912 ## - | |
-- | ZDB-2-SCS |