CMSC714 High Performance Computing (Fall 2022)

Readings - CMSC714: High Performance Computing

Note: for each class (after the intro material), 3 students will be responsible for emailing me (als@cs.umd.edu) with ~4 discussion questions on the reading(s) for that day by 6PM the day before the class, and be prepared to ask those questions and help explain the paper to the rest of the class.

Introduction - What and Why?

8/30 Parallel Computing and Parallel Computers

from Lecture Notes

9/1 Applications of Parallel Computing

from Lecture Notes

Programming Models

9/6 Expressing Parallelism (Explicit Control - Message Passing and MPI)

J. J. Dongarra, S. W. Otto, M. Snir, and D. Walker, "A message passing standard for MPP and workstations," Communications of the ACM, 39(7), 1996, pp. 84-90. [PDF]
R. Thakur and W. Gropp, "Open Issues in MPI Implementation," In: L. Choi, Y. Paek, S. Cho (eds), Advances in Computer Systems Architecture. ACSAC 2007, 2007. [PDF]

9/8 Advanced MPI

R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, 19(1), 2005. [PDF]
T. Saif and M. Parashar, "Understanding the Behavior and Performance of Non-blocking Communications in MPI," In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds), Euro-Par 2004 Parallel Processing Conference, 2004. [PDF]

9/13 Expressing Parallelism (Implicit Control) - M. Arace, P. Arias Juarez, J. Davis

William W. Carlson et al., "Introduction to UPC and Language Specification," CCS-TR-99-157. [PDF]
Brent Leback, Michael Wolfe, and Douglas Miles, "The PGI Fortran and C99 OpenACC Compilers", In Proceedings of Cray User Group (CUG) meeting, 2012. [PDF]

9/15 Expressing Parallelism (Implicit Control, cont.) - C. Evuru, N. Froehlinger, A. Jayaweera

L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science & Engineering, 5(1), 1998. [PDF]
B.R. de Supinski et al., "The Ongoing Evolution of OpenMP", Proceedings of the IEEE, 106(11), 2018. [PDF]

9/20 Expressing Parallelism (Hybrids and Frameworks) - M. Johnson, J. Kitson, N. Koutsoheras

Steve W. Bova et al., "Parallel Programming with Message Passing and Directives", Computing in Science & Engineering, 3(5), 2001. [PDF]
S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, "Efficient Management of Parallelism in Object Oriented Numerical Software Libraries", In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997. [PDF]

9/22 Profiling Programs - D. Muir, K. Oh, J. Schweitzer

S.L. Graham, P.B. Kessler, and M.K. McKusick, "gprof: a Call Graph Execution Profiler," Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, ACM SIGPLAN Notices, Vol. 17, No. 6, 1982. [PDF]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N.R. Tallent, "HPCTOOLKIT: tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, Vol. 22, 2010. [PDF]

Architectures

9/27 Shared Memory - K. Marepally, Y. Wang, Y. Xu

J. Laudon and D. Lenoski, "The SGI Origin: a ccNUMA highly scalable server," In Proceedings of 1997 International Symposium on Computer Architecture (ISCA '97), May 1997. [PDF]
SGI, "Technical Advances in the SGI® UV Architecture™," SGI White paper, 2012. [PDF]

9/29 Custom Machines - D. Yuan, M. Arace, P. Arias

S.R. Alam, J.A. Kuehn, R.F. Barrett, J.M. Larkin, M.R. Fahey, R. Sankaran, P.H. Worley, "Cray XT4: An Early Evaluation for Petascale Scientific Simulation", In Proceedings of SC'07, Nov. 2007. [PDF]
A. Gara et al., "Overview of the Blue Gene/L system architecture", IBM Journal of Research and Development, 49(2/3), Fall 2005. [PDF]

10/4 GPUs - J. Davis, C. Evuru, N. Froehlinger

W.J. Dally, S.W. Keckler, D.B. Kirk, "Evolution of the Graphics Processing Unit (GPU)," IEEE Micro, 41(6), 2021. [PDF]
V.W. Lee et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", In Proceedings of 2010 International Symposium on Computer Architecture (ISCA), May 2010. [PDF]

10/6 High Performance Networks - A. Jayaweera, M. Johnson, J. Kitson

Mellanox Technologies white paper, "Introduction to InfiniBand", 2003. [PDF]
R. Alverson, D. Roweth, L. Kaplan, "The Gemini System Interconnect," In Proceedings of the 18th Symposium on High Performance Interconnects, Aug. 2010. [PDF]

10/11 Clouds - N. Koutsoheras, K. Marepally, D. Muir

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", In Proceedings of OSDI'04, pp. 137-150. [PDF]
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 53(1), Jan. 2010, pp. 64-71. [PDF]

10/13 Clouds, cont. - K. Oh, J. Schweitzer, Y. Wang

M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, "Apache Spark: A Unified Engine for Big Data Processing", Communications of the ACM, 59(11), Nov. 2016. [PDF]
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center", In Proceedings of 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI), USENIX, March 2011. [PDF]

Tools

10/18 Event Ordering and Race Detection - Y. Xu, D. Yuan, M. Arace

L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM, 21(7), 1978, pp. 558-564. [PDF]
C. von Praun, "Race Detection Techniques", In D. Padua (ed.) Encyclopedia of Parallel Computing, Springer, 2011, pp. 1697-1706. [PDF]

10/20 Data Collection and Instrumentation - P. Arias, J. Davis, C. Evuru

Nicholas Nethercote and Julian Seward, "Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation", In Proceedings of the 2007 ACM/SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007. [PDF]
B. R. Buck and J.K. Hollingsworth, "An API for Runtime Code Patching," International Journal of High Performance Computing Applications, 14(4), Winter 2000, pp. 317-329. [PDF]

10/25 Runtime Parallelization - N. Froehlinger, A. Jayaweera, M. Johnson

J. Nieplocha et al., "Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit", International Journal of High Performance Computing Applications, 20(6), 2006. [PDF]
G. Agrawal, A. Sussman, and J. Saltz, "An Integrated Runtime and Compile-time Approach for Parallelizing Structured and Block Structured Applications", IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995. [PDF]

10/27 Autotuning - J. Kitson, N. Koutsoheras, K. Marepally

P. Balaprakash et al., "Autotuning in High-Performance Computing Applications", Proceedings of the IEEE, 106(11), November 2018. [PDF]
R. Clint Whaley, Antoine Petitet and Jack J. Dongarra, "Automated empirical optimizations of software and the ATLAS project", Parallel Computing, 27(1), 2001. [PDF]

Systems Issues

11/1 Finding Idle Cycles - D. Muir, K. Oh, J. Schweitzer

M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations", In Proceedings of International Conference on Distributed Computing Systems, June 1988, pp. 104-111. [PDF]
- D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, Nos. 2-4, 2005. [PDF]
David P. Anderson, Carl Christensen and Bruce Allen, "Designing a Runtime System for Volunteer Computing", In Proceedings of SC'06, November 2006. [PDF]
- David P. Anderson, "BOINC: A Platform for Volunteer Computing", Journal of Grid Computing, Vol. 18, 2020. [PDF]

11/3 Job Scheduling - Batch Queues - Y. Wang, Y. Xu, D. Yuan

Robert L. Henderson, "Job scheduling under the Portable Batch System", In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 1995. [PDF]
Y. Zhang, H. Franke, J.E. Moreira, A. Sivasubramaniam, "Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques", In Proceedings 14th International Parallel and Distributed Processing Symposium (IPDPS), 2000. [PDF]

11/8 Parallel I/O - M. Arace, P. Arias, J. Davis

Terry Jones, Alice Koniges, and R. Kim Yates, "Performance of the IBM General Parallel File System", In Proceedings of 14th International Parallel and Distributed Processing Symposium (IPDPS'00), April 2000. [PDF]
Wahid Bhimji et al., "Accelerating Science with the NERSC Burst Buffer Early User Program", In Proceedings of the 2016 Cray Users Group (CUG) Conference, May 2016. [PDF]

11/10 Midterm Exam

11/15 No Class - SC22 - work on research projects!

11/17 Topology Aware Mapping - C. Evuru, N. Froehlinger, A. Jayaweera

G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton, R. Walkup, "Optimizing Task Layout on the Blue Gene/L Supercomputer", IBM Journal of Research and Development, Vol. 49, Nos. 2/3, 2005. [PDF]
Abhinav Bhatele, William D. Gropp, Nikhil Jain, Laxmikant V. Kale, "Avoiding Hot-spots on Two-level Direct Networks", In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Nov. 2011. [PDF]

Applications

11/22 Molecular Dynamics - M. Johnson, J. Kitson, N. Koutsoheras

J.C. Phillips, Gengbin Zheng, S. Kumar, L.V. Kale, "NAMD: Biomolecular Simulation on Thousands of Processors", In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing (SC02), Nov. 2002. [PDF]
David E. Shaw et al., "Millisecond-scale molecular dynamics simulations on Anton", In Proceedings of the 2009 Conference on High Performance Computing Networking, Storage and Analysis (SC09), Nov. 2009. [PDF]

11/24 Thanksgiving

11/29 Data Intensive Applications - K. Marepally, D. Muir, K. Oh

U. Catalyurek, M. Beynon, C. Chang, T. Kurc, A. Sussman, and J. Saltz, "The Virtual Microscope", IEEE Transactions on Information Technology in Biomedicine, Vol. 7, No. 4, 2003. [PDF]
Hisashi Yashiro et al., "A 1024-Member Ensemble Data Assimilation with 3.5-Km Mesh Global Weather Simulations", In Proceedings of the 2020 Conference on High Performance Computing Networking, Storage and Analysis (SC20), Nov. 2020. [PDF]

12/1 Machine Learning and HPC - J. Schweitzer, Y. Wang, Y. Xu, D. Yuan

A. Bhatele, J.J. Thiagarajan, T. Groves, R. Anirudh, S.A. Smith, B. Cook, D. K. Lowenthal, "The Case of Performance Variability on Dragonfly-based Systems", In Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2020. [PDF]
Shen Li et al., "PyTorch distributed: experiences on accelerating data parallel training", Proceedings of the VLDB Endowment, Vol. 13, No. 12, 2020. [PDF]

12/6 Project Demos

Hybrid Computing Framework for Laplace Equation - M. Arace, P. Arias, K. Marepally, D. Muir
GPU n-body Simulation with Compute Shaders & Ray Tracing - C. Evuru, N. Froehlinger, J. Schweitzer, M. Johnson
Real-time Face Swapping - D. Yuan, Y. Xu
The OpenMP Cluster Programming Model on Zaratan - J. Davis, J. Kitson, K. Oh
A Realistic Parallel Simulation of Particles Moving About a Fixed Grid - N. Koutsoheras, Y. Wang, A. Jayaweera

12/8 Project Demos (cont.) and SC21 Gordon Bell award finalist

David E. Shaw et al., "Anton 3: twenty microseconds of molecular dynamics simulation before lunch", In Proceedings of 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), Nov. 2021. [PDF]