### Systems for Machine Learning (CMSC828G)







Abhinav Bhatele, Daniel Nichols

# Getting started with Zaratan

- Over 360 nodes with AMD Milan processors (128 cores/node, 512 GB memory/node)
- 20 nodes with four NVIDIA A100 GPUs (40 GB per GPU)
- 8 nodes with four NVIDIA H100 GPUs (80 GB per GPU)
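The per-node figures above turn into aggregate resources with simple arithmetic. The sketch below uses only the numbers on this slide; because the CPU partition is described as "over 360 nodes", it treats the count as exactly 360, so the CPU totals are lower bounds. All names are illustrative.

```python
# Aggregate cluster resources from the per-node figures on this slide.
# "Over 360 nodes" is treated as exactly 360, so CPU totals are lower bounds.
CPU_NODES = 360
CORES_PER_NODE = 128
MEM_GB_PER_NODE = 512

GPU_PARTITIONS = {
    # name: (nodes, GPUs per node, memory per GPU in GB)
    "A100": (20, 4, 40),
    "H100": (8, 4, 80),
}


def cpu_totals():
    """Return (total cores, total memory in GB) across the CPU nodes."""
    return CPU_NODES * CORES_PER_NODE, CPU_NODES * MEM_GB_PER_NODE


def gpu_totals(name):
    """Return (GPU count, total GPU memory in GB) for one GPU partition."""
    nodes, gpus_per_node, mem_gb = GPU_PARTITIONS[name]
    count = nodes * gpus_per_node
    return count, count * mem_gb


if __name__ == "__main__":
    print("CPU (cores, GB):", cpu_totals())
    for name in GPU_PARTITIONS:
        print(name, "(GPUs, GB):", gpu_totals(name))
```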

ssh username@login.zaratan.umd.edu



Abhinav Bhatele, Alan Sussman (CMSC416 / CMSC616)



### Data center / HPC cluster

- A set of nodes or processing elements connected by a network.
- Compute node: A shared-memory unit (optionally has GPUs)








## Cores, sockets, nodes

- Core: a single execution unit that has a private L1 cache and can execute instructions independently
- Processor: several cores on a single integrated circuit (IC) or chip, called a multi-core processor
- Socket: the physical connector into which an IC/chip or processor is inserted
- Node: a packaging of sockets; a motherboard or printed circuit board (PCB) that has multiple sockets
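The containment hierarchy in these definitions (a node holds sockets, each socket holds a multi-core processor) can be sketched as a small data structure. The class names and the layout of two sockets with 64 cores each are assumptions for illustration; the slides only give 128 cores per node for Zaratan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Socket:
    cores: int  # cores on the processor plugged into this socket


@dataclass(frozen=True)
class Node:
    sockets: tuple  # one Socket entry per physical socket on the board

    @property
    def total_cores(self):
        """Cores visible on the node, summed over all sockets."""
        return sum(s.cores for s in self.sockets)


def make_node(num_sockets, cores_per_socket):
    """Build a node whose sockets all hold identical processors."""
    return Node(sockets=tuple(Socket(cores_per_socket) for _ in range(num_sockets)))


# Assumed layout (not stated in the slides): 2 sockets x 64 cores.
node = make_node(num_sockets=2, cores_per_socket=64)

if __name__ == "__main__":
    print(len(node.sockets), "sockets,", node.total_cores, "cores")
```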





https://hpc-wiki.info/hpc/HPC-Dictionary


### Shared memory architecture



### **Uniform Memory Access**

https://computing.llnl.gov/tutorials/parallel\_comp/#SharedMemory




### **Non-uniform Memory Access (NUMA)**



# Hopper H100 SM

- CUDA Core
  - Single serial execution unit
- Each H100 Streaming Multiprocessor (SM) has:
  - 128 FP32 cores
  - 64 INT32 cores
  - 64 FP64 cores
  - 4 Tensor cores
- CUDA capable device or GPU
  - Collection of SMs
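Since a GPU is a collection of SMs, device-wide core counts and a rough theoretical peak follow from the per-SM figures. In the sketch below, the SM count (132) and clock rate (1.98 GHz) are assumed values for one H100 configuration and are not taken from the slides. The peak counts one fused multiply-add (2 floating-point operations) per core per clock, which is an upper bound rather than an achievable rate.

```python
FP32_CORES_PER_SM = 128  # per-SM figures from the list above
INT32_CORES_PER_SM = 64
FP64_CORES_PER_SM = 64

# Assumptions for illustration only; actual values depend on the GPU variant.
ASSUMED_SMS = 132
ASSUMED_CLOCK_HZ = 1.98e9


def device_cores(num_sms, cores_per_sm):
    """Total cores of one type across the whole device."""
    return num_sms * cores_per_sm


def peak_flops(num_sms, cores_per_sm, clock_hz):
    """Theoretical peak: 2 FLOPs (one fused multiply-add) per core per clock."""
    return 2 * device_cores(num_sms, cores_per_sm) * clock_hz


if __name__ == "__main__":
    cores = device_cores(ASSUMED_SMS, FP32_CORES_PER_SM)
    tflops = peak_flops(ASSUMED_SMS, FP32_CORES_PER_SM, ASSUMED_CLOCK_HZ) / 1e12
    print(f"{cores} FP32 cores, about {tflops:.1f} TFLOP/s peak FP32")
```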



Abhinav Bhatele (CMSC416 / CMSC616)




### SM

[Figure: H100 SM block diagram. The SM is split into four partitions, each with an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a register file (16,384 x 32-bit), INT32, FP32, and FP64 units, a 4th-generation Tensor Core, LD/ST units, and an SFU. Shared across the SM: an L1 instruction cache, the Tensor Memory Accelerator, 256 KB of L1 data cache / shared memory, and texture (Tex) units.]



## NVIDIA H100 chip



## H100 tensor cores

- Tensor cores are specialized cores for matrix multiply-accumulate operations
- Operate in parallel across all SMs
- Multiply two 4 x 4 FP16 matrices and add to a 4 x 4 FP16 or FP32 matrix
- Mixed precision
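The multiply-accumulate described above, D = A x B + C with FP16 inputs and a wider accumulator, can be emulated on the CPU to see what mixed precision means numerically. This is a minimal sketch and not how the hardware is programmed: the function names are hypothetical, FP16 and FP32 rounding are emulated with Python's `struct` module, and the element-by-element accumulation order is an assumption.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value.

    Raises OverflowError for magnitudes beyond the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4 x 4 matrices given as lists of lists.

    A and B are rounded to FP16 on input; the running sum is kept in FP32.
    """
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + prod)  # accumulate in the wider format
            d[i][j] = acc
    return d


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    ones = [[1.0] * 4 for _ in range(4)]
    for row in mma_4x4(identity, b, ones):
        print(row)
```

Inputs such as 0.1 lose precision when rounded to FP16, while the FP32 accumulator limits how much additional error builds up during the sum.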

https://resources.nvidia.com/en-us-tensor-core







# **Nodes with GPUs**

- NIC: Network interface card that connects the node to the network
- PCIe: high-speed interface often used to connect CPUs and GPUs
- NVLink: NVIDIA's high-speed interface often used between GPUs
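Which interconnect a transfer crosses matters because transfer time scales inversely with link bandwidth. The sketch below uses the simple model time = latency + bytes / bandwidth. The bandwidth numbers are placeholder assumptions chosen for illustration, not measured values or quoted specifications, and the function name is hypothetical.

```python
# Placeholder bandwidths in GB/s (assumptions; real values depend on the
# link generation, lane count, and direction).
ASSUMED_BANDWIDTH_GB_S = {
    "PCIe": 32.0,
    "NVLink": 300.0,
}


def transfer_seconds(num_bytes, bandwidth_gb_s, latency_s=0.0):
    """Estimate transfer time with a latency + size / bandwidth model."""
    return latency_s + num_bytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    payload = 40e9  # e.g. filling a 40 GB GPU memory
    for link, bw in ASSUMED_BANDWIDTH_GB_S.items():
        print(f"{link}: {transfer_seconds(payload, bw):.3f} s")
```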





# Alternative node diagram










### **Distributed memory architecture**

- Groups of processors/cores have access to their local memory
- Writes in one group's memory have no effect on another group's memory
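The two properties above can be observed with separate operating-system processes, which stand in here for groups of cores on different nodes. In this minimal sketch the worker only sees data that is explicitly sent to it, and its writes are invisible to the parent until it sends a message back. The names `WORKER_SOURCE`, `run_worker`, and `demo` are hypothetical; on a real cluster the exchange would typically go through a message-passing library such as MPI rather than pipes.

```python
import json
import subprocess
import sys

# Code run by the worker process. It has its own address space, so it only
# sees the data that the parent explicitly sends over stdin.
WORKER_SOURCE = """
import json, sys
data = json.loads(sys.stdin.read())
data[0] = 999            # write to the worker's local copy only
print(json.dumps(data))  # explicit message back to the parent
"""


def run_worker(data):
    """Send `data` to a separate process and return the list it sends back."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def demo():
    """Return (parent's list after the call, list received from the worker)."""
    local = [1, 2, 3]
    received = run_worker(local)
    return local, received


if __name__ == "__main__":
    local, received = demo()
    print("parent memory:", local)
    print("message from worker:", received)
```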



### Shared memory (NUMA)





### **Distributed memory**


### A realistic cluster









# **Google's Tensor Processing Unit**

- TPU is an ASIC (Application-specific Integrated Circuit)
- Co-processor just like GPUs
- Each TPU can have one or more matrix multiply units (MMUs)
- TPU Pod is a collection of TPUs



Abhinav Bhatele, Daniel Nichols (CMSC828G)



### Network components

- Network interface controller or card
- Router or switch
- Network cables: copper or optical










Sources (message origin points): destination, frequency, size, etc. are determined by the application (1 microsecond to 10s of seconds)






## Parallel file system or I/O sub-system










# **Group Projects**

- Self-form into groups of 2-3
- Projects should ideally be at the intersection of systems and ML
  - Using parallel systems to optimize an ML workload
- Timeline (all deadlines are midnight):
  - Group formation and project proposal: March 4
  - Interim report: April 17
  - Final presentation: May 6-13
  - Final report and code: May 15








