CONTENTS
Preface = xiii
CHAPTER 1 Introduction = 1
1.1 What Is Parallel Computing? = 2
1.2 The Scope of Parallel Computing = 3
1.3 Issues in Parallel Computing = 4
1.4 Organization and Contents of the Text = 5
1.5 Bibliographic Remarks = 8
Problems = 9
References = 10
CHAPTER 2 Models of Parallel Computers = 15
2.1 A Taxonomy of Parallel Architectures = 16
2.1.1 Control Mechanism = 16
2.1.2 Address-Space Organization = 18
2.1.3 Interconnection Networks = 22
2.1.4 Processor Granularity = 22
2.2 An Idealized Parallel Computer = 23
2.3 Dynamic Interconnection Networks = 24
2.3.1 Crossbar Switching Networks = 24
2.3.2 Bus-Based Networks = 25
2.3.3 Multistage Interconnection Networks = 26
2.4 Static Interconnection Networks = 30
2.4.1 Types of Static Interconnection Networks = 30
2.4.2 Evaluating Static Interconnection Networks = 37
2.5 Embedding Other Networks into a Hypercube = 39
2.5.1 Embedding a Linear Array into a Hypercube = 39
2.5.2 Embedding a Mesh into a Hypercube = 41
2.5.3 Embedding a Binary Tree into a Hypercube = 42
2.6 Routing Mechanisms for Static Networks = 42
2.7 Communication Costs in Static Interconnection Networks = 45
2.7.1 Store-and-Forward Routing = 45
2.7.2 Cut-Through Routing = 46
2.8 Cost-Performance Tradeoffs = 48
2.9 Architectural Models for Parallel Algorithm Design = 49
2.10 Bibliographic Remarks = 51
Problems = 54
References = 60
CHAPTER 3 Basic Communication Operations = 65
3.1 Simple Message Transfer between Two Processors = 66
3.2 One-to-All Broadcast = 66
3.2.1 Store-and-Forward Routing = 67
3.2.2 Cut-Through Routing = 75
3.3 All-to-All Broadcast, Reduction, and Prefix Sums = 77
3.3.1 Store-and-Forward Routing = 78
3.3.2 Cut-Through Routing = 86
3.4 One-to-All Personalized Communication = 88
3.5 All-to-All Personalized Communication = 90
3.5.1 Store-and-Forward Routing = 90
3.5.2 Cut-Through Routing = 95
3.6 Circular Shift = 98
3.6.1 Store-and-Forward Routing = 98
3.6.2 Cut-Through Routing = 101
3.7 Faster Methods for Some Communication Operations = 101
3.7.1 Routing Messages in Parts = 102
3.7.2 All-Port Communication = 104
3.7.3 Special Hardware for Global Operations = 105
3.8 Summary = 106
3.9 Bibliographic Remarks = 107
Problems = 108
References = 114
CHAPTER 4 Performance and Scalability of Parallel Systems = 117
4.1 Performance Metrics for Parallel Systems = 117
4.1.1 Run Time = 117
4.1.2 Speedup = 118
4.1.3 Efficiency = 120
4.1.4 Cost = 120
4.2 The Effect of Granularity and Data Mapping on Performance = 121
4.3 The Scalability of Parallel Systems = 126
4.4 The Isoefficiency Metric of Scalability = 128
4.4.1 Problem Size = 129
4.4.2 The Overhead Function = 129
4.4.3 The Isoefficiency Function = 130
4.4.4 Cost-Optimality and the Isoefficiency Function = 133
4.4.5 A Lower Bound on the Isoefficiency Function = 134
4.4.6 The Degree of Concurrency and the Isoefficiency Function = 134
4.5 Sources of Parallel Overhead = 135
4.5.1 Interprocessor Communication = 135
4.5.2 Load Imbalance = 135
4.5.3 Extra Computation = 136
4.6 Minimum Execution Time and Minimum Cost-Optimal Execution Time = 136
4.7 Other Scalability Metrics and Bibliographic Remarks = 139
Problems = 141
References = 146
CHAPTER 5 Dense Matrix Algorithms = 151
5.1 Mapping Matrices onto Processors = 151
5.1.1 Striped Partitioning = 151
5.1.2 Checkerboard Partitioning = 152
5.2 Matrix Transposition = 153
5.2.1 Checkerboard Partitioning = 154
5.2.2 Striped Partitioning = 159
5.3 Matrix-Vector Multiplication = 160
5.3.1 Rowwise Striping = 160
5.3.2 Checkerboard Partitioning = 163
5.4 Matrix Multiplication = 168
5.4.1 A Simple Parallel Algorithm = 169
5.4.2 Cannon's Algorithm = 171
5.4.3 Fox's Algorithm = 173
5.4.4 The DNS Algorithm = 174
5.5 Solving a System of Linear Equations = 178
5.5.1 A Simple Gaussian Elimination Algorithm = 179
5.5.2 Gaussian Elimination with Partial Pivoting = 192
5.5.3 Solving a Triangular System: Back-Substitution = 195
5.5.4 Numerical Considerations in Solving Systems of Linear Equations = 196
5.6 Bibliographic Remarks = 197
Problems = 198
References = 204
CHAPTER 6 Sorting = 209
6.1 Issues in Sorting on Parallel Computers = 210
6.1.1 Where the Input and Output Sequences Are Stored = 210
6.1.2 How Comparisons Are Performed = 210
6.2 Sorting Networks = 212
6.2.1 Bitonic Sort = 214
6.2.2 Mapping Bitonic Sort onto a Hypercube and a Mesh = 216
6.3 Bubble Sort and Its Variants = 224
6.3.1 Odd-Even Transposition = 225
6.3.2 Shellsort = 227
6.4 Quicksort = 229
6.4.1 Parallelizing Quicksort = 231
6.4.2 A Comparison of Quicksort Formulations = 243
6.5 Other Sorting Algorithms = 243
6.5.1 Enumeration Sort = 243
6.5.2 Bucket Sort = 244
6.5.3 Sample Sort = 245
6.5.4 Radix Sort = 247
6.6 Bibliographic Remarks = 247
Problems = 250
References = 254
CHAPTER 7 Graph Algorithms = 257
7.1 Definitions and Representation = 257
7.2 Minimum Spanning Tree: Prim's Algorithm = 260
7.3 Single-Source Shortest Paths: Dijkstra's Algorithm = 265
7.4 All-Pairs Shortest Paths = 266
7.4.1 Matrix-Multiplication Based Algorithm = 266
7.4.2 Dijkstra's Algorithm = 268
7.4.3 Floyd's Algorithm = 271
7.4.4 Performance Comparisons = 276
7.5 Transitive Closure = 276
7.6 Connected Components = 278
7.6.1 A Depth-First Search Based Algorithm = 278
7.7 Algorithms for Sparse Graphs = 281
7.7.1 Single-Source Shortest Paths = 284
7.8 Bibliographic Remarks = 290
Problems = 293
References = 295
CHAPTER 8 Search Algorithms for Discrete Optimization Problems = 299
8.1 Definitions and Examples = 299
8.2 Sequential Search Algorithms = 304
8.2.1 Depth-First Search Algorithms = 304
8.2.2 Best-First Search Algorithms = 308
8.3 Search Overhead Factor = 308
8.4 Parallel Depth-First Search = 310
8.4.1 Important Parameters of Parallel DFS = 313
8.4.2 A General Framework for Analysis of Parallel DFS = 315
8.4.3 Analysis of Load-Balancing Schemes for Hypercubes = 318
8.4.4 Analysis of Load-Balancing Schemes for a Network of Workstations = 320
8.4.5 Termination Detection = 321
8.4.6 Experimental Results = 325
8.4.7 Parallel Formulations of Depth-First Branch-and-Bound Search = 328
8.4.8 Parallel Formulations of IDA* = 328
8.4.9 Parallel DFS on SIMD Computers = 329
8.5 Parallel Best-First Search = 332
8.6 Speedup Anomalies in Parallel Search Algorithms = 336
8.6.1 Analysis of Average Speedup in Parallel DFS = 338
8.7 Bibliographic Remarks = 340
Problems = 343
References = 348
CHAPTER 9 Dynamic Programming = 355
9.1 Serial Monadic DP Formulations = 357
9.1.1 The Shortest-Path Problem = 358
9.1.2 The 0/1 Knapsack Problem = 360
9.2 Nonserial Monadic DP Formulations = 362
9.2.1 The Longest-Common-Subsequence Problem = 363
9.3 Serial Polyadic DP Formulations = 365
9.3.1 Floyd's All-Pairs Shortest-Paths Algorithm = 365
9.4 Nonserial Polyadic DP Formulations = 366
9.4.1 The Optimal Matrix-Parenthesization Problem = 366
9.5 Summary and Discussion = 369
9.6 Bibliographic Remarks = 370
Problems = 371
References = 375
CHAPTER 10 Fast Fourier Transform = 377
10.1 The Serial Algorithm = 377
10.2 The Binary-Exchange Algorithm = 382
10.2.1 Hypercube = 383
10.2.2 Mesh = 388
10.2.3 Extra Computations in Parallel FFT = 390
10.3 The Transpose Algorithm = 393
10.3.1 Two-Dimensional Transpose Algorithm = 393
10.3.2 The Generalized Transpose Algorithm = 396
10.4 Cost-Effectiveness of Meshes and Hypercubes for FFT = 400
10.5 Bibliographic Remarks = 403
Problems = 404
References = 405
CHAPTER 11 Solving Sparse Systems of Linear Equations = 407
11.1 Basic Operations = 409
11.1.1 Storage Schemes for Sparse Matrices = 409
11.1.2 Vector Inner Product = 412
11.1.3 Sparse Matrix-Vector Multiplication = 413
11.2 Iterative Methods for Sparse Linear Systems = 426
11.2.1 Jacobi Iterative Method = 427
11.2.2 Gauss-Seidel and SOR Methods = 429
11.2.3 The Conjugate Gradient Method = 433
11.3 Finite Element Method = 446
11.4 Direct Methods for Sparse Linear Systems = 454
11.4.1 Ordering = 455
11.4.2 Symbolic Factorization = 458
11.4.3 Numerical Factorization = 458
11.4.4 Solving a Triangular System = 468
11.5 Multigrid Methods = 468
11.6 Bibliographic Remarks = 474
Problems = 477
References = 482
CHAPTER 12 Systolic Algorithms and Their Mapping onto Parallel Computers = 491
12.1 Examples of Systolic Systems = 493
12.1.1 Convolution = 493
12.1.2 Banded Matrix-Vector Multiplication = 496
12.1.3 Matrix Multiplication = 499
12.1.4 Optimal Matrix Parenthesization = 501
12.2 General Issues in Mapping Systolic Systems onto Parallel Computers = 505
12.2.1 Architectural Differences between Systolic Arrays and Parallel Computers = 505
12.2.2 Absolute Efficiency of Systolic Systems = 506
12.3 Mapping One-Dimensional Systolic Arrays = 507
12.3.1 Virtual Processors = 508
12.3.2 Block-Striped Mapping = 508
12.3.3 Other One-Dimensional Mappings = 511
12.4 Mapping Two-Dimensional Systolic Arrays = 513
12.4.1 Block-Checkerboard Mapping = 514
12.4.2 Cyclic-Checkerboard Mapping = 516
12.4.3 Summary of Two-Dimensional Mappings = 518
12.5 Bibliographic Remarks = 518
Problems = 520
References = 521
CHAPTER 13 Parallel Programming = 525
13.1 Parallel Programming Paradigms = 525
13.1.1 Explicit versus Implicit Parallel Programming = 525
13.1.2 Shared-Address-Space versus Message-Passing = 526
13.1.3 Data Parallelism versus Control Parallelism = 527
13.2 Primitives for the Message-Passing Programming Paradigm = 529
13.2.1 Basic Extensions = 529
13.2.2 nCUBE2 = 531
13.2.3 iPSC/860 = 532
13.2.4 CM-5 = 533
13.2.5 Example Program = 535
13.3 Data-Parallel Languages = 538
13.3.1 Data Partitioning and Virtual Processors = 539
13.3.2 C* = 540
13.3.3 CM Fortran = 549
13.4 Primitives for the Shared-Address-Space Programming Paradigm = 551
13.4.1 Primitives to Allocate Shared Variables = 552
13.4.2 Primitives for Mutual Exclusion and Synchronization = 552
13.4.3 Primitives for Creating Processes = 553
13.4.4 Sequent Symmetry = 553
13.4.5 Example Programs = 554
13.5 Fortran D = 558
13.5.1 Problem Mapping = 559
13.5.2 Machine Mapping = 561
13.5.3 Example Program = 562
13.6 Bibliographic Remarks = 564
References = 566
APPENDIX A Complexity of Functions and Order Analysis = 571
Author Index = 575
Subject Index = 583