CMSC714 High Performance Computing (Spring 2025)

Readings - CMSC714: High Performance Computing

Note: for each class (after the intro material), 4 students will be responsible for emailing me (asussman@umd.edu) with ~4 discussion questions on the reading(s) for that day by 6PM the day before the class, and be prepared to ask those questions and help explain the paper to the rest of the class.

Introduction - What and Why?

1/28 Parallel Computing and Parallel Computers

from Lecture Notes

1/30 Applications of Parallel Computing

from Lecture Notes

Programming Models

2/4 Expressing Parallelism (Implicit Control)

L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science & Engineering, 5(1), 1998. [PDF]
B.R. de Supinski et al., "The Ongoing Evolution of OpenMP", Proceedings of the IEEE, 106(11), 2018. [PDF]

2/6-2/11 Expressing Parallelism (Implicit Control, cont.) - A. Borhani, C. Charles, A. Coppens, S. Cui

B.L. Chamberlain, Chapel chapter in Programming Models for Parallel Computing, edited by Pavan Balaji, MIT Press, 2015. [PDF]
J. Bezanson, A. Edelman, S. Karpinski, and V.B. Shah, "Julia: A Fresh Approach to Numerical Computing", SIAM Review, Vol. 59, No. 1, 2017. [PDF]

2/13 Expressing Parallelism (Explicit Control - Message Passing and MPI)

J. J. Dongarra, S. W. Otto, M. Snir, and D. Walker, "A message passing standard for MPP and workstations," Communications of the ACM, 39(7), 1996, pp. 84-90. [PDF]
R. Thakur and W. Gropp, "Open Issues in MPI Implementation," In: L. Choi, Y. Paek, S. Cho (eds), Advances in Computer Systems Architecture. ACSAC 2007, 2007. [PDF]

2/18 Advanced MPI - J. Dai, T. Dao, A. Frolov, L. Hough

R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, 19(1), 2005. [PDF]
T. Saif and M. Parashar, "Understanding the Behavior and Performance of Non-blocking Communications in MPI," In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds), Euro-Par 2004 Parallel Processing Conference, 2004. [PDF]

2/20-25 Expressing Parallelism (Hybrids and Frameworks) - N. Kadawedduwa, J. Kim, J. Liu, Z. Lu

Steve W. Bova et al., "Parallel Programming with Message Passing and Directives", Computing in Science & Engineering, 3(5), 2001. [PDF]
S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, "Efficient Management of Parallelism in Object Oriented Numerical Software Libraries", In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997. [PDF]

2/27 CUDA and GPUs

No readings assigned, slides only

3/4 Profiling Programs - R. Misra, A. Namjoo, Q. Nguyen, R. Parsons

S.L. Graham, P.B. Kessler, and M.K. McKusick, "gprof: a Call Graph Execution Profiler," Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, ACM SIGPLAN Notices, Vol. 17, No. 6, 1982. [PDF]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N.R. Tallent, "HPCTOOLKIT: tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, Vol. 22, 2010. [PDF]

3/6 Scientific Workflows - C. Hutton, X. Qi, M. Sukanya, X. Tian

E. Deelman, K. Vahi, et al., "Pegasus, a workflow management system for science automation," Future Generation Computer Systems, Vol. 46, May 2015. [PDF]
J.M. Wozniak, T.G. Armstrong, M. Wilde, D. Katz, E. Lusk, and I.T. Foster, "Swift/T: Large-scale application composition via distributed-memory data flow processing," Proceedings of 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013. [PDF]

Architectures

3/11 Shared Memory - J. Umeike, R. Vishnoi, F. H.-Y. Yeh, A. Borhani

J. Laudon and D. Lenoski, "The SGI Origin: a ccNUMA highly scalable server," In Proceedings of 1997 International Symposium on Computer Architecture (ISCA '97), May 1997. [PDF]
SGI, "Technical Advances in the SGI® UV Architecture™," SGI White paper, 2012. [PDF]

3/13 Custom Machines - C. Charles, A. Coppens, S. Cui, J. Dai

S.R. Alam, J.A. Kuehn, R.F. Barrett, J.M. Larkin, M.R. Fahey, R. Sankaran, P.H. Worley, "Cray XT4: An Early Evaluation for Petascale Scientific Simulation", In Proceedings of SC'07, Nov. 2007. [PDF]
A. Gara et al., "Overview of the Blue Gene/L system architecture", IBM Journal of Research and Development, 49(2/3), Fall 2005. [PDF]

3/25 GPUs - T. Dao, A. Frolov, L. Hough, C. Hutton

W.J. Dally, S.W. Keckler, D.B. Kirk, "Evolution of the Graphics Processing Unit (GPU)," IEEE Micro, 41(6), 2021. [PDF]
V.W. Lee et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", In Proceedings of 2010 International Symposium on Computer Architecture (ISCA), May 2010. [PDF]

3/27 High Performance Networks - N. Kadawedduwa, J. Kim, J. Liu, R. Misra

Mellanox Technologies white paper, "Introduction to InfiniBand", 2003. [PDF]
R. Alverson, D. Roweth, L. Kaplan, "The Gemini System Interconnect," In Proceedings of the 18th Symposium on High Performance Interconnects, Aug. 2010. [PDF]

4/1 Clouds - Z. Lu, A. Namjoo, Q. Nguyen, R. Parsons

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", In Proceedings of OSDI'04, pp. 137-150. [PDF]
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 53(1), Jan. 2010, pp. 64-71. [PDF]

4/3 Clouds, cont. - X. Qi, M. Sukanya, X. Tian, J. Umeike

M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, "Apache Spark: A Unified Engine for Big Data Processing", Communications of the ACM, 59(11), Nov. 2016. [PDF]
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center", In Proceedings of 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI), USENIX, March 2011. [PDF]

Tools

4/8 Event Ordering and Race Detection

L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM, 21(7), 1978, pp. 558-564. [PDF]
C. von Praun, "Race Detection Techniques", In D. Padua (ed.) Encyclopedia of Parallel Computing, Springer, 2011, pp. 1697-1706. [PDF]

4/10 Data Collection and Instrumentation - R. Vishnoi, H.-Y. Yeh, A. Borhani, A. Coppens

Nicholas Nethercote and Julian Seward, "Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation", In Proceedings of the 2007 ACM/SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007. [PDF]
B. R. Buck and J.K. Hollingsworth, "An API for Runtime Code Patching," International Journal of High Performance Computing Applications, 14(4), Winter 2000, pp. 317-329. [PDF]

4/15 Midterm Exam

4/17 Runtime Parallelization - S. Cui, J. Dai, T. Dao, A. Frolov

J. Nieplocha et al., "Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit", International Journal of High Performance Computing Applications, 20(6), 2006. [PDF]
G. Agrawal, A. Sussman, and J. Saltz, "An Integrated Runtime and Compile-time Approach for Parallelizing Structured and Block Structured Applications", IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995. [PDF]

4/22 Autotuning - L. Hough, C. Hutton, N. Kadawedduwa, J. Kim

P. Balaprakash et al., "Autotuning in High-Performance Computing Applications", Proceedings of the IEEE, 106(11), November 2018. [PDF]
R. Clint Whaley, Antoine Petitet and Jack J. Dongarra, "Automated empirical optimizations of software and the ATLAS project", Parallel Computing, 27(1), 2001. [PDF]

Systems Issues

4/24 Finding Idle Cycles - J. Liu, Z. Lu, R. Misra, A. Namjoo

M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations", In Proceedings of International Conference on Distributed Computing Systems, June 1988, pp. 104-111. [PDF]
- D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, Nos. 2-4, 2005. [PDF]
David P. Anderson, Carl Christensen and Bruce Allen, "Designing a Runtime System for Volunteer Computing", In Proceedings of SC'06, November 2006. [PDF]
- David P. Anderson, "BOINC: A Platform for Volunteer Computing", Journal of Grid Computing, Vol. 18, 2020. [PDF]

4/29 Parallel I/O - Q. Nguyen, R. Parsons, X. Qi, M. Sukanya

Terry Jones, Alice Koniges, and R. Kim Yates, "Performance of the IBM General Parallel File System", In Proceedings of 14th International Parallel and Distributed Processing Symposium (IPDPS'00), April 2000. [PDF]
Wahid Bhimji et al., "Accelerating Science with the NERSC Burst Buffer Early User Program", In Proceedings of the 2016 Cray Users Group (CUG) Conference, May 2016. [PDF]

Applications

5/1 HPC Applications - X. Tian, J. Umeike, R. Vishnoi, H.-Y. Yeh

David E. Shaw et al., "Millisecond-scale molecular dynamics simulations on Anton", In Proceedings of the 2009 Conference on High Performance Computing Networking, Storage and Analysis (SC09), Nov. 2009. [PDF]
U. Catalyurek, M. Beynon, C. Chang, T. Kurc, A. Sussman, and J. Saltz, "The Virtual Microscope", IEEE Transactions on Information Technology in Biomedicine, Vol. 7, No. 4, 2003. [PDF]

5/6 Project Demos

CPU vs GPU: A Modern Update - S. Cui
3D Gaussian Splatting Parallelization - N. Kadawedduwa, Q. Nguyen, T.D. Dao
GAN Training with Distributed Memory Parallelism - Z. Lu, X. Qi
A Comparative Study of Parallel Programming Frameworks for HPC Architectures - A. Borhani, A. Coppens, A. Namjoo, J. Umeike

5/8 Project Demos (cont.)

Massively Parallel Computational Model (MPC) - C. Hutton
Whole Genome Alignment - R. Parsons, J. Dai, J. Liu
Parallelization of Magnetic Models using MPI, OpenMP, CUDA, and x86 SIMD Instructions - D. Hough, C. Charles, R. Vishnoi, A. Frolov
Parallelization of Gather and Reduce in Recommendation Systems - H.-Y. Yeh, M. Sukanya
Evaluating modern CPU vs. GPU performances on common parallel applications - J. Kim, R. Misra, X. Tian

5/13 Gordon Bell Award Applications

S. Singh et al., "Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers", In Proceedings of 2024 International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), Nov. 2024. [PDF]
David E. Shaw et al., "Anton 3: twenty microseconds of molecular dynamics simulation before lunch", In Proceedings of 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), Nov. 2021. [PDF]