Detailed information

Reducing system failures for robust cloud computing

Material type
Dissertation
Personal author
김영필 (Kim, Young-pil)
Title / Statement of responsibility
Reducing system failures for robust cloud computing / Youngpil Kim
Publication
Seoul : Graduate School, Korea University, 2015
Physical description
xiv, 96 leaves : illustrations, charts ; 26 cm
Other-format entry
Reducing system failures for robust cloud computing (DCOLL211009)000000056933
Dissertation note
Thesis (Ph.D.) -- Korea University Graduate School, Department of Computer Science, 2015. 2
Department code
0510 6YD36 288
General note
Advisor: 유혁 (Yoo, Hyuck)
Appendices: 1. Proof of lemma in jobtracker model, 2. Proof of theorem in jobtracker model, 3. MTTF and downtime values of LANL clusters
Bibliography note
References: leaves 86-96
Other available formats
Also available as a PDF file; Requires PDF file reader(application/pdf)
Uncontrolled subject terms
System software robustness, System failure prediction, Cloud computing, Operating system, Solid state drives
000 00000nam c2200205 c 4500
001 000045828232
005 20230718170752
007 ta
008 150107s2015 ulkad bmAC 000c eng
040 ▼a 211009 ▼c 211009 ▼d 211009
085 0 ▼a 0510 ▼2 KDCP
090 ▼a 0510 ▼b 6YD36 ▼c 288
100 1 ▼a 김영필
245 1 0 ▼a Reducing system failures for robust cloud computing / ▼d Youngpil Kim
260 ▼a Seoul : ▼b Graduate School, Korea University, ▼c 2015
300 ▼a xiv, 96장 : ▼b 삽화, 도표 ; ▼c 26 cm
500 ▼a 지도교수: 유혁
500 ▼a 부록: 1. Proof of lemma in jobtracker model, 2. Proof of theorem in jobtracker model, 3. MTTF and downtime values of LANL clusters
502 1 ▼a 학위논문(박사)-- ▼b 고려대학교 대학원, ▼c 컴퓨터학과, ▼d 2015. 2
504 ▼a 참고문헌: 장 86-96
530 ▼a PDF 파일로도 이용가능; ▼c Requires PDF file reader(application/pdf)
653 ▼a System software robustness ▼a System failure prediction ▼a Cloud computing ▼a Operating system ▼a Solid state drives
776 0 ▼t Reducing system failures for robust cloud computing ▼w (DCOLL211009)000000056933
900 1 0 ▼a Kim, Young-pil, ▼e
900 1 0 ▼a 유혁, ▼g 柳爀, ▼d 1960-, ▼e 지도교수 ▼0 AUTH(211009)153486
900 1 0 ▼a Yoo, Hyuck, ▼e 지도교수
945 ▼a KLPA

Holdings

No.
1
Location
Science Library / Thesis Stacks
Call number
0510 6YD36 288
Accession number
123051213
Status
Available for loan

Content information

Abstract

In this dissertation, we study how to improve the robustness of cloud computing.
Specifically, we investigate three problems: 1) how to improve SSD lifetime, 2) how to reduce device driver failures, and 3) how to enhance the durability of Hadoop.
First, we analyze storage failures. In particular, we focus on solid state drives (SSDs), which are based on NAND flash memory and offer high performance with good energy efficiency. Recent studies on speeding up cloud servers and applications have suggested that flash memory can enhance the performance of key-value stores when used as a block-level cache or as data storage. However, SSDs are not reliable because of bit errors.
To overcome this, periodic remapping has been suggested, but it causes additional lifetime loss. To mitigate this problem, we propose the conditional remapping invocation method (CRIM). CRIM uses a probability-based threshold to decide when to invoke the remapping operation.
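
As a rough illustration of a probability-based invocation condition, the following Python sketch invokes remapping only when the estimated probability of an uncorrectable error exceeds a threshold. The independent (binomial) bit error model, the ECC parameters, and the names `uncorrectable_probability` and `should_remap` are assumptions made for this example; the dissertation's actual bit error model and threshold are not given in this record.

```python
from math import comb


def uncorrectable_probability(n_bits: int, bit_error_rate: float, correctable: int) -> float:
    """Probability that more than `correctable` of `n_bits` bits are in error.

    Assumption for this sketch: bit errors are independent, so the number of
    erroneous bits follows a binomial distribution.
    """
    p_correctable = sum(
        comb(n_bits, k) * bit_error_rate**k * (1.0 - bit_error_rate) ** (n_bits - k)
        for k in range(correctable + 1)
    )
    # Clamp to guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - p_correctable)


def should_remap(n_bits: int, bit_error_rate: float, correctable: int, threshold: float) -> bool:
    """Hypothetical invocation condition: remap only above the risk threshold."""
    return uncorrectable_probability(n_bits, bit_error_rate, correctable) > threshold


if __name__ == "__main__":
    # Illustrative numbers only: a 4096-bit codeword with 8-bit correction.
    for ber in (1e-5, 1e-4, 1e-3, 5e-3):
        print(ber, should_remap(4096, ber, correctable=8, threshold=1e-6))
```

Compared with remapping on a fixed period, a condition of this kind skips the remapping operation while the estimated risk is still low, which is the source of the lifetime saving the abstract refers to.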
Second, we analyze the prediction quality for device driver failures. Device driver failure is a critical problem that degrades the reliability of the operating system kernel. Failure prediction can reduce device driver failures because, if the prediction method works well, failures and their penalties can be avoided. The first step in device driver failure prediction is to find an appropriate prediction method. We therefore review previous prediction methods, design a new device driver failure prediction method, and evaluate it against other prediction methods in terms of failure prediction quality.
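
The table of contents names a naïve Bayes classifier as the suggested predictor. The sketch below shows what such a predictor could look like in Python, under assumptions that are not taken from the dissertation: features are binary indicators (for example, whether a given symptom was observed in a time window), Laplace smoothing is used, and quality is measured with precision and recall. The function names and the feature encoding are hypothetical.

```python
from math import log


def train(samples, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model.

    samples: list of tuples of 0/1 features (hypothetical symptom indicators).
    labels:  list of bools, True if a failure followed the observation.
    alpha:   Laplace smoothing constant.
    """
    n_features = len(samples[0])
    model = {}
    for cls in (True, False):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        prior = (len(rows) + alpha) / (len(samples) + 2 * alpha)
        likelihood = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (prior, likelihood)
    return model


def predict(model, sample):
    """Return True if 'failure' has the higher posterior log-score."""
    scores = {}
    for cls, (prior, likelihood) in model.items():
        score = log(prior)
        for x, p in zip(sample, likelihood):
            score += log(p if x else 1.0 - p)
        scores[cls] = score
    return scores[True] > scores[False]


def precision_recall(predicted, actual):
    """Precision and recall of boolean failure predictions."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
    y = [True, True, True, False, False, False]
    model = train(X, y)
    predictions = [predict(model, x) for x in X]
    print(precision_recall(predictions, y))
```

Comparing predictors on the same labeled traces with metrics like these is one way to make "failure prediction quality" concrete; the metrics actually used in the dissertation are not stated in this record.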
Third, we analyze the failure penalty in cloud job processing. Specifically, we focus on the performance impact of JobTracker failure in Hadoop. A JobTracker failure is a serious problem that affects overall job processing performance. We describe the causes of failure and the system behavior that results from failed job processing in Hadoop. Based on this analysis, we build a job completion time model that reflects failure effects. Our model is based on a stochastic process with a node crash probability. With our model, we simulate the performance impact using credible failure data from the USENIX computer failure data repository (USENIX CFDR), collected over the past 9 years.
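
One simple way to see how failures inflate job completion time is the textbook restart model sketched below in Python. It assumes, purely for illustration, that crashes arrive as a Poisson process with rate λ = 1/MTTF, that a failed job restarts from scratch after a fixed downtime D, and that no crash occurs during the downtime. This is not the dissertation's model, whose details are not given in this record. Under these assumptions, a job that needs T units of failure-free work has expected completion time (1/λ + D)(e^(λT) - 1).

```python
import math


def expected_completion_time(work: float, mttf: float, downtime: float = 0.0) -> float:
    """Expected time to finish `work` units of failure-free processing.

    Assumptions of this sketch (not taken from the dissertation):
      - crashes follow a Poisson process with rate 1 / mttf,
      - every crash discards all progress and costs `downtime` before restart,
      - no crash happens during the downtime itself.
    All arguments must use the same time unit.
    """
    if work < 0 or mttf <= 0 or downtime < 0:
        raise ValueError("work and downtime must be >= 0 and mttf must be > 0")
    rate = 1.0 / mttf
    # expm1 keeps precision when rate * work is very small.
    return (mttf + downtime) * math.expm1(rate * work)


if __name__ == "__main__":
    # Illustrative values in hours; real MTTF and downtime figures would come
    # from failure traces such as those listed in the appendix.
    for mttf in (10_000.0, 1_000.0, 100.0):
        print(mttf, expected_completion_time(work=24.0, mttf=mttf, downtime=2.0))
```

When the MTTF is much larger than the job length, the result stays close to the failure-free time; as the MTTF approaches the job length, the expected completion time grows exponentially, which is why restart-on-failure behavior is costly for long jobs.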

Table of contents

Table of Contents
List of Tables
List of Figures
Abstract
Acknowledgements
1 Introduction
1.1 Research goals and scope
1.2 Motivations
1.2.1 Disk block failures in solid state drives
1.2.2 Device driver failures in operating system kernel
1.2.3 Node failures in cloud infrastructure
1.3 Summary
2 Background
2.1 Concept of cloud computing
2.1.1 Characteristics
2.1.2 Service model
2.1.3 Deployment model
2.1.4 Reference model
2.1.5 Summary
2.2 Flash memory and cloud server storage
2.2.1 Flash memory
2.2.2 Cloud server storage
2.3 Block failures handling in SSDs
2.3.1 Bit errors and the remapping method
2.3.2 Lifetime evaluation in SSDs
2.4 Device driver failures
2.4.1 Related work in device driver failures
2.4.2 Device driver failure handling mechanisms
2.5 Cloud infrastructure and Hadoop
2.5.1 Related work in cloud node failure
2.5.2 Hadoop overview
2.5.3 Failure handling in Hadoop job processing
2.6 System reliability, robustness and failure prediction
2.6.1 System reliability
2.6.2 Robustness
2.6.3 Failure prediction techniques
3 Conditional remapping invocation method
3.1 Bit error model
3.2 Conditional remapping invocation method
3.3 Evaluation
3.4 Summary
4 Applying Bayesian to Runtime Failure Prediction
4.1 Preliminaries
4.1.1 Definitions
4.1.2 Device driver failure model
4.2 Problem definition
4.2.1 Failure prediction problem
4.2.2 Probabilistic representation of the prediction problem
4.3 Suggested device driver failure predictor
4.3.1 Naïve Bayes classifier for failure prediction
4.4 Experiments
4.4.1 Metrics for predictor quality
4.4.2 Experimental details
4.5 Summary
5 Performance impact of JobTracker failure in Hadoop
5.1 Stochastic model of the failure of Hadoop
5.1.1 Preliminaries
5.1.2 Expected job completion time
5.1.3 Node crash probability model
5.2 Numerical results
5.3 Model validation
5.4 Reducing System Failure Approach
5.5 Summary
6 Concluding remark
Appendices
A Appendix
A.1 Proof of Lemma in Jobtracker model
A.2 Proof of Theorem in Jobtracker model
A.3 MTTF and downtime values of LANL clusters
Bibliography