{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,9,14]],"date-time":"2023-09-14T00:10:47Z","timestamp":1694650247912},"reference-count":9,"publisher":"World Scientific Pub Co Pte Lt","issue":"02","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Parallel Process. Lett."],"published-print":{"date-parts":[[2011,6]]},"abstract":"<jats:p> An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also develop an analytical model of the performance for fault tolerance based on periodic checkpointing and compare this approach to both failure avoidance techniques. We find that this comparison is sensitive to the nature of the stochastic distribution of the time between failures, and that failure avoidance is likely inferior to fault tolerance in the long term. Regardless, our results show that each approach is likely to achieve poor utilization for large-scale platforms (e.g., 2<jats:sup>20<\/jats:sup> nodes) unless the mean time between failures is large. We show how bounding parallel job size improves utilization, but conclude that achieving good utilization in future large-scale platforms will require a combination of techniques. 
<\/jats:p>","DOI":"10.1142\/s0129626411000126","type":"journal-article","created":{"date-parts":[[2011,6,23]],"date-time":"2011-06-23T11:25:37Z","timestamp":1308828337000},"page":"111-132","source":"Crossref","is-referenced-by-count":16,"title":["PREVENTIVE MIGRATION VS. PREVENTIVE CHECKPOINTING FOR EXTREME SCALE SUPERCOMPUTERS"],"prefix":"10.1142","volume":"21","author":[{"given":"FRANCK","family":"CAPPELLO","sequence":"first","affiliation":[{"name":"INRIA-Illinois Joint Laboratory for Petascale Computing, Urbana-Champaign, Illinois, U.S.A."}]},{"given":"HENRI","family":"CASANOVA","sequence":"additional","affiliation":[{"name":"Information and Computer Sciences Dept., University of Hawai'i at Manoa, Honolulu, U.S.A."}]},{"given":"YVES","family":"ROBERT","sequence":"additional","affiliation":[{"name":"Laboratoire de l'Informatique du Parall\u00e9lisme, \u00c9cole Normale Sup\u00e9rieure de Lyon, France"}]}],"member":"219","published-online":{"date-parts":[[2011,11,21]]},"reference":[{"key":"rf1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347714"},{"key":"rf4","doi-asserted-by":"publisher","DOI":"10.1177\/1094342009347767"},{"key":"rf5","volume":"78","author":"Schroeder B.","journal-title":"Journal of Physics: Conference Series"},{"key":"rf7","doi-asserted-by":"publisher","DOI":"10.1147\/rd.452.0311"},{"key":"rf12","doi-asserted-by":"publisher","DOI":"10.1145\/511399.511362"},{"key":"rf14","doi-asserted-by":"publisher","DOI":"10.1016\/S0743-7315(03)00108-4"},{"key":"rf18","doi-asserted-by":"publisher","DOI":"10.1145\/361147.361115"},{"key":"rf19","doi-asserted-by":"publisher","DOI":"10.1016\/j.future.2004.11.016"},{"key":"rf20","doi-asserted-by":"publisher","DOI":"10.1016\/S0377-0427(00)00409-X"}],"container-title":["Parallel Processing 
Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0129626411000126","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,8,7]],"date-time":"2019-08-07T13:30:43Z","timestamp":1565184643000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0129626411000126"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,6]]},"references-count":9,"journal-issue":{"issue":"02","published-online":{"date-parts":[[2011,11,21]]},"published-print":{"date-parts":[[2011,6]]}},"alternative-id":["10.1142\/S0129626411000126"],"URL":"https:\/\/doi.org\/10.1142\/s0129626411000126","relation":{},"ISSN":["0129-6264","1793-642X"],"issn-type":[{"value":"0129-6264","type":"print"},{"value":"1793-642X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,6]]}}}