CMSC714 High Performance Computing (Fall 2023)

Readings - CMSC714: High Performance Computing

Note: for each class (after the intro material), 3 students will be responsible for emailing me (als@cs.umd.edu) ~4 discussion questions on the reading(s) for that day by 6PM the day before the class, and should be prepared to ask those questions and help explain the paper to the rest of the class.

Introduction - What and Why?

8/29 Parallel Computing and Parallel Computers

from Lecture Notes

8/31 Applications of Parallel Computing

from Lecture Notes

Programming Models

9/5 Expressing Parallelism (Explicit Control - Message Passing and MPI)

J. J. Dongarra, S. W. Otto, M. Snir, and D. Walker, "A message passing standard for MPP and workstations," Communications of the ACM, 39(7), 1996, pp. 84-90. [PDF]
R. Thakur and W. Gropp, "Open Issues in MPI Implementation," In: L. Choi, Y. Paek, S. Cho (eds), Advances in Computer Systems Architecture. ACSAC 2007, 2007. [PDF]

9/7 Advanced MPI

R. Thakur, R. Rabenseifner, and W. Gropp, "Optimization of Collective Communication Operations in MPICH," The International Journal of High Performance Computing Applications, 19(1), 2005. [PDF]
T. Saif and M. Parashar, "Understanding the Behavior and Performance of Non-blocking Communications in MPI," In: Danelutto, M., Vanneschi, M., Laforenza, D. (eds), Euro-Par 2004 Parallel Processing Conference, 2004. [PDF]

9/12-14 Expressing Parallelism (Implicit Control) - V. Agarwal, H. Bauman, E. Bennewitz

L. Dagum and R. Menon, "OpenMP: An Industry-Standard API for Shared-Memory Programming," IEEE Computational Science & Engineering, 5(1), 1998. [PDF]
B.R. de Supinski et al., "The Ongoing Evolution of OpenMP", Proceedings of the IEEE, 106(11), 2018. [PDF]

9/19 Expressing Parallelism (Implicit Control, cont.) - B. Colelough, Q. De Man, R. Dhakal

B.L. Chamberlain, Chapel chapter in Programming Models for Parallel Computing, edited by Pavan Balaji, MIT Press, 2015. [PDF]
J. Bezanson, A. Edelman, S. Karpinski, and V.B. Shah, "Julia: A Fresh Approach to Numerical Computing", SIAM Review, Vol. 59, No. 1, 2017. [PDF]

9/21 Expressing Parallelism (Hybrids and Frameworks) - H. Hosseini, A. Marwah, A. Movsesyan

Steve W. Bova et al., "Parallel Programming with Message Passing and Directives", Computing in Science & Engineering, 3(5), 2001. [PDF]
S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, "Efficient Management of Parallelism in Object Oriented Numerical Software Libraries", In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997. [PDF]

9/26 Profiling Programs - B. Moyer, C. Saborio, P. Singhania

S.L. Graham, P.B. Kessler, and M.K. McKusick, "gprof: a Call Graph Execution Profiler," Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, ACM SIGPLAN Notices, Vol. 17, No. 6, 1982. [PDF]
L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N.R. Tallent, "HPCTOOLKIT: tools for performance analysis of optimized parallel programs," Concurrency and Computation: Practice and Experience, Vol. 22, 2010. [PDF]

9/28 Scientific Workflows - P. Sivaraman, L. Swaisgood, O. Taylor

E. Deelman, K. Vahi, et al., "Pegasus, a workflow management system for science automation," Future Generation Computer Systems, Vol. 46, May 2015. [PDF]
J.M. Wozniak, T.G. Armstrong, M. Wilde, D. Katz, E. Lusk, and I.T. Foster, "Swift/T: Large-scale application composition via distributed-memory data flow processing," Proceedings of 13th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2013. [PDF]

Architectures

10/3 Shared Memory - R. Van Voorhis, S. Yanamaddi, N. Yu

J. Laudon and D. Lenoski, "The SGI Origin: a ccNUMA highly scalable server," In Proceedings of 1997 International Symposium on Computer Architecture (ISCA '97), May 1997. [PDF]
SGI, "Technical Advances in the SGI® UV Architecture™," SGI White paper, 2012. [PDF]

10/5 Custom Machines - V. Agarwal, H. Bauman, E. Bennewitz

S.R. Alam, J.A. Kuehn, R.F. Barrett, J.M. Larkin, M.R. Fahey, R. Sankaran, P.H. Worley, "Cray XT4: An Early Evaluation for Petascale Scientific Simulation", In Proceedings of SC'07, Nov. 2007. [PDF]
A. Gara et al., "Overview of the Blue Gene/L system architecture", IBM Journal of Research and Development, 49(2/3), Fall 2005. [PDF]

10/10 GPUs - B. Colelough, Q. De Man, R. Dhakal

W.J. Dally, S.W. Keckler, D.B. Kirk, "Evolution of the Graphics Processing Unit (GPU)," IEEE Micro, 41(6), 2021. [PDF]
V.W. Lee et al., "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", In Proceedings of 2010 International Symposium on Computer Architecture (ISCA), May 2010. [PDF]

10/12 High Performance Networks - H. Hosseini, A. Marwah, A. Movsesyan

Mellanox Technologies white paper, "Introduction to InfiniBand", 2003. [PDF]
R. Alverson, D. Roweth, L. Kaplan, "The Gemini System Interconnect," In Proceedings of the 18th Symposium on High Performance Interconnects, Aug. 2010. [PDF]

10/17 Clouds - B. Moyer, C. Saborio, P. Singhania

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", In Proceedings of OSDI'04, pp. 137-150 [PDF]
Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, Alexander Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, 53(1), Jan. 2010, pp. 64-71. [PDF]

10/19 Clouds, cont. - P. Sivaraman, L. Swaisgood, O. Taylor

M. Zaharia, R. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, "Apache Spark: A Unified Engine for Big Data Processing," Communications of the ACM, 59(11), Nov. 2016. [PDF]
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A.D. Joseph, R. Katz, S. Shenker and I. Stoica, "Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center", In Proceedings of 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI), USENIX, March 2011. [PDF]

Tools

10/24 Event Ordering and Race Detection - R. Van Voorhis, S. Yanamaddi, N. Yu

L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System", Communications of the ACM, 21(7), 1978, pp. 558-564. [PDF]
C. von Praun, "Race Detection Techniques", In D. Padua (eds) Encyclopedia of Parallel Computing, Springer, 2011, pp. 1697-1706. [PDF]

10/26 Data Collection and Instrumentation - V. Agarwal, H. Bauman, E. Bennewitz

Nicholas Nethercote and Julian Seward, "Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation", In Proceedings of the 2007 ACM/SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2007. [PDF]
B.R. Buck and J.K. Hollingsworth, "An API for Runtime Code Patching," International Journal of High Performance Computing Applications, 14(4), Winter 2000, pp. 317-329. [PDF]

10/31 Runtime Parallelization - B. Colelough, Q. De Man, R. Dhakal

J. Nieplocha et al., "Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit", International Journal of High Performance Computing Applications, 20(6), 2006. [PDF]
G. Agrawal, A. Sussman, and J. Saltz, "An Integrated Runtime and Compile-time Approach for Parallelizing Structured and Block Structured Applications", IEEE Transactions on Parallel and Distributed Systems, 6(7), 1995. [PDF]

11/2 Autotuning - H. Hosseini, A. Marwah, A. Movsesyan

P. Balaprakash et al., "Autotuning in High-Performance Computing Applications", Proceedings of the IEEE, 106(11), November 2018. [PDF]
R. Clint Whaley, Antoine Petitet and Jack J. Dongarra, "Automated empirical optimizations of software and the ATLAS project", Parallel Computing, 27(1), 2001. [PDF]

Systems Issues

11/7 Finding Idle Cycles - B. Moyer, C. Saborio, P. Singhania

M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations", In Proceedings of International Conference on Distributed Computing Systems, June 1988, pp. 104-111. [PDF]
- D. Thain, T. Tannenbaum, and M. Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, Nos. 2-4, 2005. [PDF]
David P. Anderson, Carl Christensen and Bruce Allen, "Designing a Runtime System for Volunteer Computing", In Proceedings of SC'06, November 2006. [PDF]
- David P. Anderson, "BOINC: A Platform for Volunteer Computing", Journal of Grid Computing, Vol. 18, 2020. [PDF]

11/9 Job Scheduling - Batch Queues - P. Sivaraman, L. Swaisgood, O. Taylor

Robert L. Henderson, "Job scheduling under the Portable Batch System", In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), 1995. [PDF]
Y. Zhang, H. Franke, J.E. Moreira, A. Sivasubramaniam, "Improving Parallel Job Scheduling by Combining Gang Scheduling and Backfilling Techniques", In Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS), 2000. [PDF]

11/14 No Class - SC23 - work on research projects!

11/16 Midterm Exam

11/21 No Class - Thanksgiving

11/23 No Class - Thanksgiving

11/28 Parallel I/O - R. Van Voorhis, S. Yanamaddi, N. Yu

Terry Jones, Alice Koniges, and R. Kim Yates, "Performance of the IBM General Parallel File System", In Proceedings of 14th International Parallel and Distributed Processing Symposium (IPDPS'00), April 2000. [PDF]
Wahid Bhimji et al., "Accelerating Science with the NERSC Burst Buffer Early User Program", In Proceedings of the 2016 Cray Users Group (CUG) Conference, May 2016. [PDF]

Applications

11/30 HPC Applications

David E. Shaw et al., "Millisecond-scale molecular dynamics simulations on Anton", In Proceedings of the 2009 Conference on High Performance Computing Networking, Storage and Analysis (SC09), Nov. 2009. [PDF]
U. Catalyurek, M. Beynon, C. Chang, T. Kurc, A. Sussman, and J. Saltz, "The Virtual Microscope", IEEE Transactions on Information Technology in Biomedicine, Vol. 7, No. 4, 2003. [PDF]

12/5 Project Demos

Parallelization of Discrete Voronoi Decomposition - B. Moyer, E. Bennewitz
Evaluating OpenMP 4.5 Support on Compilers - R. Dhakal, P. Sivaraman, S. Yanamaddi
Lateness Detection in Pipit Trace Analysis Tool - P. Singhania, A. Marwah, A. Movsesyan
Accelerating Drone’s Mapping System Using Parallel Tools - N. Yu

12/7 Project Demos (cont.) and SC21 Gordon Bell award finalist

Grid Adaptation - GridOptiMesh - H. Bauman, B. Colelough, R. Van Voorhis
Semisorting with ParlayLib - H. Hosseini, Q. De Man, V. Agarwal
Parallel Implementation of an Elastic Rotor Blade Coupled Trim Tool - L. Swaisgood, C. Saborio
David E. Shaw et al., "Anton 3: twenty microseconds of molecular dynamics simulation before lunch", In Proceedings of 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC21), Nov. 2021. [PDF]