000 | 00000nam c2200205 c 4500 | |
001 | 000045828232 | |
005 | 20230718170752 | |
007 | ta | |
008 | 150107s2015 ulkad bmAC 000c eng | |
040 | ▼a 211009 ▼c 211009 ▼d 211009 | |
085 | 0 | ▼a 0510 ▼2 KDCP |
090 | ▼a 0510 ▼b 6YD36 ▼c 288 | |
100 | 1 | ▼a 김영필 |
245 | 1 0 | ▼a Reducing system failures for robust cloud computing / ▼d Youngpil Kim |
260 | ▼a Seoul : ▼b Graduate School, Korea University, ▼c 2015 | |
300 | ▼a xiv, 96 leaves : ▼b illustrations, charts ; ▼c 26 cm | |
500 | ▼a Advisor: 유혁 | |
500 | ▼a Appendices: 1. Proof of lemma in jobtracker model, 2. Proof of theorem in jobtracker model, 3. MTTF and downtime values of LANL clusters | |
502 | 1 | ▼a Thesis (doctoral)-- ▼b Graduate School, Korea University, ▼c Department of Computer Science, ▼d 2015. 2 |
504 | ▼a Bibliography: leaves 86-96 | |
530 | ▼a Also available as a PDF file; ▼c Requires PDF file reader (application/pdf) | |
653 | ▼a System software robustness ▼a System failure prediction ▼a Cloud computing ▼a Operating system ▼a Solid state drives | |
776 | 0 | ▼t Reducing system failures for robust cloud computing ▼w (DCOLL211009)000000056933 |
900 | 1 0 | ▼a Kim, Young-pil, ▼e author |
900 | 1 0 | ▼a 유혁, ▼g 柳爀, ▼d 1960-, ▼e advisor ▼0 AUTH(211009)153486 |
900 | 1 0 | ▼a Yoo, Hyuck, ▼e advisor |
945 | ▼a KLPA |
Holdings information
No. | Location | Call number | Registration number | Status | Due date | Reservation | Service |
---|---|---|---|---|---|---|---|
1 | Science Library/Thesis Stacks | 0510 6YD36 288 | 123051213 | Available for loan | | | |
Content information
Abstract
In this dissertation, we study how to improve the robustness of cloud computing. Specifically, we investigate three problems: 1) how to improve SSD lifetime, 2) how to reduce device driver failures, and 3) how to enhance the durability of Hadoop. First, we analyze storage failures. In particular, we focus on solid state drives (SSDs), which are based on NAND flash memory and offer high performance with good energy efficiency. Recent studies on speeding up cloud servers and applications have suggested that flash memory can enhance the performance of key-value stores when used as a block-level cache or as data storage. However, SSDs are not reliable because of bit errors. Periodic remapping has been suggested to overcome this, but it causes additional lifetime loss. To mitigate this problem, we propose the conditional remapping invocation method (CRIM). CRIM uses a probability-based threshold to decide when to invoke the remapping operation. Second, we analyze the prediction quality for device driver failures. Device driver failure is a critical problem that degrades the reliability of the operating system kernel. Failure prediction can reduce device driver failures because, if the prediction method works well, failures and their penalties can be avoided. The first step in device driver failure prediction is to find an appropriate prediction method. We therefore review previous prediction methods, design a new device driver failure prediction method, and evaluate it against other prediction methods in terms of failure prediction quality. Third, we analyze the failure penalty in cloud job processing. Specifically, we focus on the performance impact of JobTracker failure in Hadoop. A JobTracker failure is a serious problem that affects overall job processing performance. We describe the causes of failure and the system behavior that results from failed job processing in Hadoop.
Based on this analysis, we build a job completion time model that reflects failure effects. Our model is based on a stochastic process with a node crash probability. With our model, we simulate the performance impact using credible failure data from the USENIX computer failure data repository (USENIX CFDR), collected over the past nine years.
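The idea of a job completion time model that reflects failure effects can be illustrated with a small sketch. This is not the dissertation's model, whose details are not given in this record. It is a textbook restart-from-scratch model under assumptions introduced here: failures of the coordinating node arrive as a Poisson process with rate 1/MTTF, a failure discards all progress, and each failure costs a fixed downtime before the job restarts. The function names and the parameter values are hypothetical.

```python
import math
import random


def expected_completion_time(work, mttf, downtime):
    """Closed-form mean completion time for a restart-from-scratch job.

    Assumes (for illustration only) Poisson failures with rate 1/mttf,
    total loss of progress on failure, and a fixed downtime per failure.
    """
    return (mttf + downtime) * math.expm1(work / mttf)


def simulate_completion_time(work, mttf, downtime, trials=20000, seed=0):
    """Monte Carlo estimate of the same mean, used as a cross-check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        elapsed = 0.0
        while True:
            time_to_failure = rng.expovariate(1.0 / mttf)
            if time_to_failure >= work:
                elapsed += work  # the job finished before the next failure
                break
            # Progress is lost: pay the wasted time plus the downtime.
            elapsed += time_to_failure + downtime
        total += elapsed
    return total / trials


if __name__ == "__main__":
    # Hypothetical values in hours: a 10 h job, MTTF of 100 h, 5 h downtime.
    print(expected_completion_time(10.0, 100.0, 5.0))
    print(simulate_completion_time(10.0, 100.0, 5.0))
```

Under these assumptions the two numbers should agree closely, and the gap between the result and the failure-free job length gives a rough sense of the failure penalty. MTTF and downtime values measured on real clusters could be substituted for the hypothetical ones.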
Table of contents
Table of Contents
List of Tables
List of Figures
Abstract
Acknowledgements
1 Introduction
1.1 Research goals and scope
1.2 Motivations
1.2.1 Disk block failures in solid state drives
1.2.2 Device driver failures in operating system kernel
1.2.3 Node failures in cloud infrastructure
1.3 Summary
2 Background
2.1 Concept of cloud computing
2.1.1 Characteristics
2.1.2 Service model
2.1.3 Deployment model
2.1.4 Reference model
2.1.5 Summary
2.2 Flash memory and cloud server storage
2.2.1 Flash memory
2.2.2 Cloud server storage
2.3 Block failures handling in SSDs
2.3.1 Bit errors and the remapping method
2.3.2 Lifetime evaluation in SSDs
2.4 Device driver failures
2.4.1 Related work in device driver failures
2.4.2 Device driver failure handling mechanisms
2.5 Cloud infrastructure and Hadoop
2.5.1 Related work in cloud node failure
2.5.2 Hadoop overview
2.5.3 Failure handling in Hadoop job processing
2.6 System reliability, robustness and failure prediction
2.6.1 System reliability
2.6.2 Robustness
2.6.3 Failure prediction techniques
3 Conditional remapping invocation method
3.1 Bit error model
3.2 Conditional remapping invocation method
3.3 Evaluation
3.4 Summary
4 Applying Bayesian to Runtime Failure Prediction
4.1 Preliminaries
4.1.1 Definitions
4.1.2 Device driver failure model
4.2 Problem definition
4.2.1 Failure prediction problem
4.2.2 Probabilistic representation of the prediction problem
4.3 Suggested device driver failure predictor
4.3.1 Naïve Bayes classifier for failure prediction
4.4 Experiments
4.4.1 Metrics for predictor quality
4.4.2 Experimental details
4.5 Summary
5 Performance impact of JobTracker failure in Hadoop
5.1 Stochastic model of the failure of Hadoop
5.1.1 Preliminaries
5.1.2 Expected job completion time
5.1.3 Node crash probability model
5.2 Numerical results
5.3 Model validation
5.4 Reducing System Failure Approach
5.5 Summary
6 Concluding remark
Appendices
A Appendix
A.1 Proof of Lemma in Jobtracker model
A.2 Proof of Theorem in Jobtracker model
A.3 MTTF and downtime values of LANL clusters
Bibliography