#### EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011

#### Lecture 18 – GPUs (III)

#### Mattan Erez



The University of Texas at Austin



EE382N: Principles of Computer Architecture, Fall 2011 -- Lecture 18 (c) Mattan Erez

#### Make the Compute Core the Focus of the Architecture




ECE 498AL, University of Illinois, Urbana-Champaign


# Streaming Multiprocessor (SM)

- Streaming Multiprocessor (SM)
  - 8 Streaming Processors (SP)
  - 2 Special Function Units (SFU)
- Multi-threaded instruction dispatch
  - Vectors of 32 threads (warps)
  - Up to 16 warps per thread block
    - HW masking of inactive threads in a warp
  - Threads cover latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- 32 KB in registers
- DRAM texture and memory access
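
The figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope check: the 1.35 GHz shader clock is an assumption (the GeForce 8800 GTX figure, not stated in the lecture), as is counting one multiply-add per SP per cycle as two floating-point operations. All function names are hypothetical.

```python
# Back-of-the-envelope check of the per-SM numbers on the slide.
SP_PER_SM = 8                    # from the slide
WARP_SIZE = 32                   # from the slide
REGISTER_FILE_BYTES = 32 * 1024  # from the slide
BYTES_PER_REGISTER = 4           # 32-bit registers

# ASSUMPTIONS (not stated in the lecture):
SHADER_CLOCK_GHZ = 1.35          # GeForce 8800 GTX shader clock
FLOPS_PER_SP_PER_CYCLE = 2       # one multiply-add counted as two operations


def sm_peak_gflops():
    """Peak multiply-add throughput of one SM in GFLOPS."""
    return SP_PER_SM * FLOPS_PER_SP_PER_CYCLE * SHADER_CLOCK_GHZ


def cycles_to_issue_warp():
    """Cycles needed to push one 32-thread warp through 8 SPs."""
    return WARP_SIZE // SP_PER_SM


def registers_per_sm():
    """Number of 32-bit registers that fit in the register file."""
    return REGISTER_FILE_BYTES // BYTES_PER_REGISTER


if __name__ == "__main__":
    print(f"peak per SM: {sm_peak_gflops():.1f} GFLOPS")
    print(f"cycles per warp instruction: {cycles_to_issue_warp()}")
    print(f"registers per SM: {registers_per_sm()}")
```

Under these assumptions the result is consistent with the "20+ GFLOPS" bullet, and the other two values match the 4-cycle warp dispatch and the 8192-register figure quoted later in the lecture.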



## **Thread Life Cycle in HW**

- Kernel is launched on the SPA
  - Kernels known as grids of thread blocks
- Thread Blocks are serially distributed to all the SMs
  - Potentially >1 Thread Block per SM
  - At least 96 threads per block
- Each SM launches Warps of Threads
  - 2 levels of parallelism
- SM schedules and executes Warps that are ready to run
- As Warps and Thread Blocks complete, resources are freed
  - SPA can distribute more Thread Blocks

[Figure: the Host launches Kernel 1 and Kernel 2 on the Device as Grid 1 and Grid 2. A grid holds Blocks (0, 0) through (2, 1); Block (1, 1) is expanded into Threads (0, 0) through (4, 2).]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

#### **SM Executes Blocks**



- Threads are assigned to SMs in Block granularity
  - Up to 8 Blocks to each SM as resource allows
  - SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) \* 3 blocks
    - Or 128 (threads/block) \* 6 blocks, etc.
- Threads run concurrently
  - SM assigns/maintains thread IDs
  - SM manages/schedules thread execution


## Thread Scheduling/Execution

- Each Thread Block is divided into 32-thread Warps
  - This is an implementation decision
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in an SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 \* 3 = 24 Warps
  - At any point in time, only one of the 24 Warps will be selected for instruction fetch and execution.
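
The warp arithmetic above can be sketched in a few lines. The sketch also shows which warp a given thread lands in, assuming CUDA's documented row-major linearization of thread indices (thread ID = x + y * blockDim.x) with consecutive IDs packed into warps. The function names are hypothetical.

```python
WARP_SIZE = 32  # implementation decision in G80, per the slide


def warps_per_block(threads_per_block, warp_size=WARP_SIZE):
    """Number of warps a block is split into (ceiling division)."""
    return -(-threads_per_block // warp_size)


def warps_per_sm(blocks_on_sm, threads_per_block):
    """Total warps resident on an SM holding the given blocks."""
    return blocks_on_sm * warps_per_block(threads_per_block)


def warp_of_thread(tx, ty, block_dim_x, warp_size=WARP_SIZE):
    """Return (warp index, lane) of thread (tx, ty) in a 2D block.

    Assumes row-major linearization: linear ID = ty * block_dim_x + tx.
    """
    linear_id = ty * block_dim_x + tx
    return linear_id // warp_size, linear_id % warp_size


if __name__ == "__main__":
    print(warps_per_block(256))      # warps in one 256-thread block
    print(warps_per_sm(3, 256))      # warps when 3 such blocks share an SM
    print(warp_of_thread(0, 2, 16))  # first thread of row 2 in a 16x16 block
```

For a 16X16 block this means each warp covers two consecutive rows of the block, which matters later when reasoning about memory access patterns.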



## **SM Warp Scheduling**

- SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - All threads in a Warp execute the same instruction when selected
  - Scoreboard scheduler
- 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G80
  - If one global memory access is needed for every 4 instructions
  - A minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency (200 / (4 \* 4) = 12.5, rounded up)
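
The 13-warp figure can be written as a short derivation. The symbols are introduced here for illustration and do not appear in the lecture: $L$ is the memory latency in cycles, $I$ the number of instructions a warp issues per global memory access, and $C$ the cycles needed to dispatch one warp instruction.

$$
N_{\text{warps}} \;\ge\; \left\lceil \frac{L}{I \cdot C} \right\rceil
= \left\lceil \frac{200}{4 \cdot 4} \right\rceil
= \left\lceil 12.5 \right\rceil
= 13
$$

Each warp keeps the SM busy for $I \cdot C = 16$ cycles before it stalls on its next load, so enough warps must be resident for their combined work to span the 200-cycle wait.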



[Figure: SM multithreaded Warp scheduler. Along the time axis it interleaves instructions from different warps, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, then warp 8 instruction 12 and warp 3 instruction 96.]

#### SM Instruction Buffer – Warp Scheduling

- Fetch one warp instruction/cycle
  - from instruction L1 cache
  - into any instruction buffer slot
- Issue one "ready-to-go" warp instruction/cycle
  - from any warp instruction buffer slot
  - operand scoreboarding used to prevent hazards
- Issue selection based on round-robin/age of warp
- SM broadcasts the same instruction to 32 Threads of a Warp



## Scoreboarding

- All register operands of all instructions in the Instruction Buffer are scoreboarded
  - Status becomes ready after the needed values are deposited
  - prevents hazards
  - cleared instructions are eligible for issue
- Decoupled Memory/Processor pipelines
  - any thread can continue to issue instructions until scoreboarding prevents issue
  - allows Memory/Processor ops to proceed in the shadow of other Memory/Processor ops
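
The interplay of scoreboarding and warp scheduling can be illustrated with a toy simulation. This is a simplified sketch, not the G80 implementation: it assumes in-order issue within each warp, one issue slot per cycle, a pure round-robin pick among ready warps, and made-up latencies. The `simulate` function and the tuple encoding of instructions are hypothetical.

```python
def simulate(programs, max_cycles=1000):
    """Toy scoreboarded warp scheduler.

    programs: one instruction list per warp; each instruction is a tuple
    (dest_register, source_registers, latency_in_cycles).
    Returns a trace of (cycle, warp, instruction_index) issue events.
    """
    n = len(programs)
    pc = [0] * n                        # next instruction index per warp
    pending = [dict() for _ in range(n)]  # scoreboard: register -> cycle value arrives
    trace = []
    last = n - 1                        # round-robin search starts at warp 0
    for cycle in range(max_cycles):
        if all(pc[w] == len(programs[w]) for w in range(n)):
            break
        for offset in range(1, n + 1):
            w = (last + offset) % n
            if pc[w] == len(programs[w]):
                continue                # this warp has finished
            dest, sources, latency = programs[w][pc[w]]
            registers = list(sources) + [dest]
            # Eligible only if no operand is still waiting for a value.
            if all(pending[w].get(r, 0) <= cycle for r in registers):
                pending[w][dest] = cycle + latency
                trace.append((cycle, w, pc[w]))
                pc[w] += 1
                last = w
                break                   # one issue per cycle
    return trace


# A load (4-cycle latency), an independent op, then a dependent op.
PROGRAM = [("r1", (), 4), ("r3", ("r0",), 1), ("r2", ("r1", "r3"), 1)]

if __name__ == "__main__":
    print(simulate([PROGRAM]))           # one warp: stalls on the load
    print(simulate([PROGRAM, PROGRAM]))  # two warps: the stall is hidden
```

With a single warp the dependent instruction waits for the load, leaving idle cycles. With two warps, each warp's independent work runs in the shadow of the other's load and the issue slot is never idle in this small example.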



### Granularity and Resource Considerations


- For Matrix Multiplication, should I use 8X8, 16X16 or 32X32 tiles (1 thread per tile element)?
  - For 8X8, we have 64 threads per Block. Since each SM can take up to 768 threads, it could take up to 12 Blocks. However, because each SM can take at most 8 Blocks, only 512 threads will go into each SM!
  - For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32X32, we have 1024 threads per Block. Not even one Block can fit into an SM!
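
The tile-size comparison can be reproduced with a small calculation. This is a minimal sketch that models only the two G80 limits quoted in the lecture (768 threads and 8 Blocks per SM); registers and shared memory are ignored, and the function names are hypothetical.

```python
MAX_THREADS_PER_SM = 768  # G80 limit from the lecture
MAX_BLOCKS_PER_SM = 8     # G80 limit from the lecture


def resident_blocks(threads_per_block):
    """Blocks that fit on one SM under the thread and block limits only."""
    if threads_per_block > MAX_THREADS_PER_SM:
        return 0  # a single block already exceeds the SM's thread capacity
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)


def tile_report(tile_width):
    """Return (blocks per SM, threads per SM) for a square tile."""
    threads_per_block = tile_width * tile_width  # 1 thread per tile element
    blocks = resident_blocks(threads_per_block)
    return blocks, blocks * threads_per_block


if __name__ == "__main__":
    for width in (8, 16, 32):
        blocks, threads = tile_report(width)
        print(f"{width}X{width}: {blocks} blocks, {threads} of "
              f"{MAX_THREADS_PER_SM} threads")
```

Only the 16X16 tile fills the SM's thread capacity: the 8X8 tile is capped by the block limit, and the 32X32 tile does not fit at all.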


#### **SM Memory Architecture**



## **SM Register File**

- Register File (RF)
  - 32 KB (1 Kword per SP)
  - Provides 4 operands/clock
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write RF



## **Programmer View of Register File**

- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all Blocks assigned to the SM
  - Once assigned to a Block, the register is NOT accessible by threads in other Blocks

  - Each thread in the same Block can only access registers assigned to itself




### Matrix Multiplication Example

- If each Block has 16X16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each Block requires 10 \* 256 = 2560 registers
  - 8192 = **3** \* 2560 + 512
  - So, three Blocks (768 threads) can run on an SM as far as registers are concerned
- How about if each thread increases the use of registers by 1?
  - Each Block now requires 11 \* 256 = 2816 registers
  - 8192 < 2816 \* 3 = 8448
  - Only two Blocks (512 threads) can run on an SM, a 1/3 reduction of parallelism!
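
The register arithmetic above generalizes to any per-thread register count. The sketch below assumes registers are handed out in whole-Block units with no other allocation granularity, which is a simplification. The function name is hypothetical.

```python
REGISTERS_PER_SM = 8192  # G80 figure from the lecture


def blocks_by_registers(regs_per_thread, threads_per_block):
    """Blocks that fit on one SM when only registers are considered."""
    regs_per_block = regs_per_thread * threads_per_block
    return REGISTERS_PER_SM // regs_per_block


if __name__ == "__main__":
    for regs in (10, 11):
        blocks = blocks_by_registers(regs, 16 * 16)
        print(f"{regs} registers/thread: {blocks} blocks, "
              f"{blocks * 256} threads per SM")
```

The step from 10 to 11 registers per thread crosses a Block boundary, which is why one extra register costs a whole Block of parallelism.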

## More on Dynamic Partitioning

- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
    - This allows for finer-grain threading than traditional CPU threading models.
  - The compiler can trade off instruction-level parallelism against thread-level parallelism

#### ILP vs. TLP Example

- Assume that a kernel has 256-thread Blocks, 4 independent instructions for each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles
  - 3 Blocks can run on each SM
- If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load:
  - Only two Blocks can run on each SM
  - However, one only needs 200/(8\*4) = 6.25, rounded up to 7, Warps to tolerate the memory latency
  - Two Blocks have 16 Warps. The performance can actually be higher!
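
The trade-off can be checked numerically by comparing the warps available under the register limit with the warps needed to cover the load latency. This is a sketch under the lecture's numbers (8192 registers, 768 threads per SM, 4 cycles per warp instruction, 200-cycle loads). The function names are hypothetical.

```python
import math

REGISTERS_PER_SM = 8192
MAX_THREADS_PER_SM = 768
WARP_SIZE = 32
CYCLES_PER_WARP_INSTRUCTION = 4
LOAD_LATENCY_CYCLES = 200


def warps_available(regs_per_thread, threads_per_block):
    """Resident warps per SM, limited by registers and the thread cap."""
    by_registers = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    by_threads = MAX_THREADS_PER_SM // threads_per_block
    blocks = min(by_registers, by_threads)
    return blocks * (threads_per_block // WARP_SIZE)


def warps_needed(independent_instructions):
    """Warps required to cover one global load's latency."""
    work_per_warp = independent_instructions * CYCLES_PER_WARP_INSTRUCTION
    return math.ceil(LOAD_LATENCY_CYCLES / work_per_warp)


if __name__ == "__main__":
    for regs, ilp in ((10, 4), (11, 8)):
        print(f"{regs} regs, {ilp} independent instructions: "
              f"{warps_available(regs, 256)} warps available, "
              f"{warps_needed(ilp)} needed")
```

In both configurations the available warps exceed what is needed, but the higher-ILP version has far more slack, so losing a Block to register pressure does not have to cost performance.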




#### Constants

- Immediate address constants
- Indexed address constants
- Constants stored in DRAM, and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a Warp
  - Extremely efficient way of accessing a value that is common for all threads in a Block!



#### **Textures**

- Textures are 2D arrays of values stored in global DRAM
- Textures are cached in L1 and L2
- Read-only access
- Caches optimized for 2D access:
  - Threads in a warp that follow 2D locality will achieve better memory performance






### **Shared Memory**

- Each SM has 16 KB of Shared Memory
  - 16 banks of 32-bit words
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  - read and write access
- Not used explicitly for pixel shader programs
  - we dislike pixels talking to each other
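
The banked organization matters for access patterns. The sketch below counts how many distinct words land in the same bank for a strided access. It relies on two assumptions not stated in the lecture: successive 32-bit words map to successive banks, and conflicts are evaluated over a half-warp of 16 threads, as on G80-class hardware. The function names are hypothetical.

```python
NUM_BANKS = 16   # from the lecture
WORD_BYTES = 4   # 32-bit words, from the lecture
HALF_WARP = 16   # ASSUMPTION: conflicts are evaluated per half-warp


def bank_of(byte_address):
    """Bank holding a byte address, assuming word-interleaved banks."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(stride_words, threads=HALF_WARP):
    """Worst-case number of distinct words requested from a single bank
    when thread t reads the word at index t * stride_words."""
    words_per_bank = {}
    for t in range(threads):
        word = t * stride_words
        bank = bank_of(word * WORD_BYTES)
        words_per_bank.setdefault(bank, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 3, 16):
        print(f"stride {stride}: {conflict_degree(stride)}-way access per bank")
```

A degree of 1 means every bank serves at most one word, so the access is conflict-free. Under these assumptions odd strides stay conflict-free, while a stride of 16 words sends every thread to the same bank.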


