EE382V (17325): Principles in Computer Architecture: Parallelism and Locality, Fall 2007

Lecture 23 - Data Parallel Memory Systems

Mattan Erez



The University of Texas at Austin

# Outline

- DRAM technology
- Impact on memory system
  - Stream architecture review
- DRAM organization, mechanism, and trends
- Memory system organization and design options
- A few results

Most slides courtesy Jung Ho Ahn, HP Labs

### **DRAM** is Expensive

- \$/bit is almost nothing
  - \$2 x 10<sup>-9</sup>
- Memory in system is expensive
  - 10%-50% of system cost
- Rule of thumb: the more memory the better





### DRAM Cell

- Capacity → density → 3D
  - Recessed Channel Array Transistor (capacitor on top)
    - Samsung, Hynix, Elpida, Micron
  - Trench capacitor
    - Infineon, Nanya, ProMos, Winbond
    - IBM, Toshiba in embedded-DRAM



Hynix 80nm



Infineon 80nm

# DRAM Array



### DRAM Sense Amplifier






### **DRAM Optimization**

- The more memory the better -> optimize capacity
- Also need to worry about power and drivers
- Result is compromise in latency
  - Small capacitors and large sub-arrays increase access time
- What about bandwidth?
  - Bandwidth is expensive:
    - \$0.05 to \$0.10 per package pin
    - DDR2 requires 80 pins
- Secondary goal is optimizing BW/pin
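As a rough worked example of the pin cost above, the slide's figures (\$0.05 to \$0.10 per pin, 80 pins for DDR2) imply a few dollars of packaging cost per DDR2 interface. The sketch below only multiplies those numbers out; working in cents and the function name are illustrative choices, not part of the lecture.

```python
# Pin-cost arithmetic using only the figures quoted on the slide.
PINS_PER_DDR2_INTERFACE = 80        # "DDR2 requires 80 pins"
COST_PER_PIN_CENTS = (5, 10)        # $0.05 to $0.10 per package pin

def interface_cost_cents(pins, cents_per_pin):
    """Packaging cost of one memory interface, in cents."""
    return pins * cents_per_pin

low, high = (interface_cost_cents(PINS_PER_DDR2_INTERFACE, c)
             for c in COST_PER_PIN_CENTS)
print(f"${low / 100:.2f} to ${high / 100:.2f} per DDR2 interface")
```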




### Stream processor architecture



BW <u>demand</u> of ALUs >> BW <u>supply</u> from DRAMs






LRF: local register file

SRF: stream register file

- LRF and SRF provide a hierarchy of bandwidth and locality
- SRF decouples execution from memory


### **Streaming Memory Systems (SMSs)**





Radeon X1800 memory controller [source: ATI]

- Off-chip DRAMs need to meet the processor's bandwidth demands
  - Multiple address-interleaved memory channels
  - High bandwidth DRAM per channel



1) Because of load imbalance between multiple memory channels
2) Because the performance of modern DRAMs is very sensitive to access patterns
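The load-imbalance point can be illustrated with a small sketch. It assumes a simple word-interleaved mapping, `channel = address mod num_channels`; the function names, the 8-channel configuration, and the strides are illustrative assumptions rather than anything taken from the slides.

```python
from collections import Counter

def channel_of(word_addr, num_channels):
    """Hypothetical word-interleaved mapping: consecutive words rotate across channels."""
    return word_addr % num_channels

def channel_load(start, stride, count, num_channels):
    """Number of requests each channel receives for a strided access."""
    load = Counter(channel_of(start + i * stride, num_channels) for i in range(count))
    return [load[ch] for ch in range(num_channels)]

# Unit stride spreads 64 requests evenly over 8 channels.
balanced = channel_load(start=0, stride=1, count=64, num_channels=8)
# A stride equal to the channel count sends every request to channel 0.
imbalanced = channel_load(start=0, stride=8, count=64, num_channels=8)
print(balanced)
print(imbalanced)
```

Under this assumed mapping, the second access leaves seven of the eight channels idle, so the achieved bandwidth is bounded by a single channel.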







Parallelism and locality are necessary for efficient DRAM usage

A stream load or store operation yields a large number of related memory accesses.










- Because stream accesses are blocked, an SMS can exploit
  - Parallelism by generating multiple references per cycle per thread
  - Parallelism by generating references from multiple threads
  - Parallelism by both of the above
  - Locality by generating all the references of a thread
It is important to understand how the interactions between these different factors affect performance



## A DRAM chip is organized as a number of memory banks where each bank is a 2-D array



- Each location is addressed as (bank, row, column)
- Many resources on a DRAM chip are shared
  - Row and column accesses share the request path
  - Data is read and written through a shared data path
  - All banks share the request and data paths

This sharing and the dynamic nature of DRAM result in strict access rules and timing constraints

### A DRAM follows rules and occupies resources to access a location








### Simplified Bank State Diagram










## DRAM operation sequences for two memory read requests






Internal bank conflict: requires more commands and cycles
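The difference between the three read sequences can be sketched by counting the commands needed for two reads. This is a simplified model, assuming one open row per bank and ignoring all timing parameters; the command names `ACT`, `RD`, `PRE` and the helper `commands_for_reads` are illustrative.

```python
def commands_for_reads(requests):
    """Return the command list for a sequence of (bank, row, column) reads.

    Simplified model: each bank has at most one open row and starts idle.
    Timing constraints are ignored; only the commands are listed.
    """
    open_row = {}          # bank -> currently active row
    commands = []
    for bank, row, col in requests:
        if open_row.get(bank) != row:
            if bank in open_row:                 # a different row is open
                commands.append(("PRE", bank))   # close it first
            commands.append(("ACT", bank, row))  # open the requested row
            open_row[bank] = row
        commands.append(("RD", bank, col))
    return commands

same_row   = commands_for_reads([(0, 0, 0), (0, 0, 1)])  # same bank, same row
other_bank = commands_for_reads([(0, 0, 0), (1, 0, 0)])  # different banks
conflict   = commands_for_reads([(0, 0, 0), (0, 1, 0)])  # same bank, different row
print(len(same_row), len(other_bank), len(conflict))
```

In this model the internal bank conflict is the only case that needs a precharge between the two reads, and with real timing constraints that precharge and the following activate cannot be overlapped with work in another bank.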



### Activate to active time determines random access performance









## Switches between read/write commands and data transfer require timing delay



### DRAM parameter trends over various DRAM generations






Timing trends show that DRAM performance is very sensitive to the presented access patterns


#### **Streaming Memory Systems**

- Bulk stream loads and stores
  - Hierarchical control
- Expressive and effective addressing modes
  - Can't afford to waste memory bandwidth
  - Use hardware when performance is non-deterministic



Stream memory system helps the programmer and maximizes I/O throughput



### A streaming memory system consists of AGs, crosspoint switch and MCs



- AG: address generator
- MC: memory channel

## An AG translates a memory access thread into a sequence of individual memory requests

- AG: address generator
- [address, data]
- Record size: # of consecutive words per data record mapped to a PE
- Stride: address gap between consecutive records
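A minimal sketch of what an AG does for a strided access, using the record size and stride defined above. It assumes word addressing and that the stride is measured from the start of one record to the start of the next; the function and parameter names are illustrative.

```python
def generate_addresses(base, record_size, stride, num_records):
    """Expand a strided stream access into individual word addresses.

    record_size: consecutive words per record
    stride:      gap (in words) between the starts of consecutive records
    """
    addresses = []
    for r in range(num_records):
        record_start = base + r * stride
        addresses.extend(range(record_start, record_start + record_size))
    return addresses

# Three 2-word records, 4 words apart, starting at word address 100.
print(generate_addresses(base=100, record_size=2, stride=4, num_records=3))
# [100, 101, 104, 105, 108, 109]
```

When the stride equals the record size, the generated addresses are fully contiguous.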


#### **Cross-point Switch**



On-chip and off-chip have <u>different address spaces</u>


### A memory channel contains MSHRs, channel buffer entries, and a memory controller








#### The design space of a Streaming Memory System



- Address generator
  - # of AGs
  - AG width
- Memory channel
  - # of channel buffer entries
  - MAS (memory access scheduling) policy
  - Channel-split configuration


#### AG design space



- A single wide AG vs. multiple narrow AGs
  - Inter-thread vs. intra-thread parallelism
  - Load balancing across MCs
- The AG width
  - # of accesses each AG can generate per cycle


- Per DRAM command cycle, the memory controller
  - Looks at the status of every DRAM bank
  - Finds an available command per pending access without violating timing and resource constraints
  - Selects the command to issue among all available commands based on the priority of the chosen policy

#### DRAM bank status

| Bank | Status            |
|------|-------------------|
| 0    | row 2 active      |
| 1    | row 0 active      |
| 2    | row 3 precharging |
| 3    | idle              |

#### Pending requests in channel buffer

| Request (bank, row, column) | Access | Available command |
|-----------------------------|--------|-------------------|
| (0, 0, 0)                   | write  | precharge         |
| (1, 0, 0)                   | read   | read              |
| (0, 2, 1)                   | write  | none              |
| (1, 0, 1)                   | read   | read              |

A read occurred in the previous cycle, so the write to (0, 2, 1) cannot issue yet.
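The "find an available command" step can be sketched for the bank status and pending requests shown above. This is a deliberately simplified model: the only timing rule it includes is that a write cannot directly follow a read, and the function name, state names, and data layout are assumptions made for illustration.

```python
IDLE, ACTIVE, PRECHARGING = "idle", "active", "precharging"

def available_command(request, banks, prev_cmd):
    """Return the command a pending request could issue this cycle, or None.

    request:  (bank, row, column, op) with op "read" or "write"
    banks:    bank -> (state, open_row)
    prev_cmd: command issued in the previous cycle
    Assumed simplification: the only timing rule modeled is that a write
    cannot directly follow a read (read-to-write turnaround).
    """
    bank, row, _col, op = request
    state, open_row = banks[bank]
    if state == PRECHARGING:
        return None                      # bank is busy
    if state == IDLE:
        return "activate"                # no row open yet
    if open_row != row:
        return "precharge"               # a different row is open
    if op == "write" and prev_cmd == "read":
        return None                      # turnaround delay
    return op                            # row hit: column command

banks = {0: (ACTIVE, 2), 1: (ACTIVE, 0), 2: (PRECHARGING, 3), 3: (IDLE, None)}
pending = [(0, 0, 0, "write"), (1, 0, 0, "read"),
           (0, 2, 1, "write"), (1, 0, 1, "read")]
available = [available_command(r, banks, prev_cmd="read") for r in pending]
print(available)
```

A real controller checks many more constraints per bank and per channel; the point here is only that each pending request maps to at most one legal command per cycle.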


#### Scheduling policies

- Open: a row is precharged when there are no pending accesses to the row and there is at least one pending access to a different row in the same bank
- <u>Closed</u>: a row is precharged <u>as soon as the last</u> available reference to that row is performed

#### Pending requests in channel buffer

| Case 1          | Case 2          |
|-----------------|-----------------|
| (1, 0, 0) write | (1, 0, 0) write |
| (1, 0, 0) read  | (1, 3, 0) read  |
| (2, 0, 1) write | (2, 5, 1) write |
| (2, 0, 1) read  | (0, 1, 1) read  |

Bank status: row 0 of bank 0 is active
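The two precharge policies can be compared on Case 1 and Case 2 with a short sketch. It assumes the request tuples are (bank, row, column), as in the earlier slides, and that "last available reference" means no pending request targets the active row; `should_precharge` is an illustrative name.

```python
def should_precharge(policy, bank, active_row, pending):
    """Decide whether to precharge the active row of a bank.

    pending: list of (bank, row, column) requests in the channel buffer.
    """
    same_bank = [req for req in pending if req[0] == bank]
    hits = [req for req in same_bank if req[1] == active_row]
    conflicts = [req for req in same_bank if req[1] != active_row]
    if hits:
        return False                    # the open row is still needed
    if policy == "closed":
        return True                     # close once the row has no more references
    if policy == "open":
        return bool(conflicts)          # close only if another row in this bank is wanted
    raise ValueError(f"unknown policy: {policy}")

case1 = [(1, 0, 0), (1, 0, 0), (2, 0, 1), (2, 0, 1)]
case2 = [(1, 0, 0), (1, 3, 0), (2, 5, 1), (0, 1, 1)]
for name, pending in (("Case 1", case1), ("Case 2", case2)):
    decisions = {p: should_precharge(p, bank=0, active_row=0, pending=pending)
                 for p in ("open", "closed")}
    print(name, decisions)
```

Under these assumptions the policies differ only in Case 1, where nothing is pending for bank 0: the open policy leaves row 0 active, while the closed policy precharges it.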


The firstready policy checks and processes pending requests one by one, starting from the oldest, until it has looked at all the channel buffer entries (CBEs)

| Algorithm  | Window size | Reorder row commands | Reorder column commands | Precharging | Access selection |
|------------|-------------|----------------------|-------------------------|-------------|------------------|
| inorder    | 1           | N/A                  | N/A                     | N/A         | N/A              |
| inorderla  | nCB         | Yes                  | No                      | Open        | Column First     |
| firstready | nCB         | Yes                  | Yes                     | Open        | N/A              |
| opcol      | nCB         | Yes                  | Yes                     | Open        | Column First     |
| oprow      | nCB         | Yes                  | Yes                     | Open        | Row First        |
| clcol      | nCB         | Yes                  | Yes                     | Closed      | Column First     |
| clrow      | nCB         | Yes                  | Yes                     | Closed      | Row First        |

nCB: number of channel buffer entries
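The "Access selection" column can be sketched as a priority rule over the commands that are legal in a given cycle. The sketch assumes that "Column First" prefers column commands (read, write) over row commands (activate, precharge), that "Row First" does the opposite, and that ties go to the oldest request; this reading and the name `select_command` are assumptions, not definitions from the slides.

```python
COLUMN_COMMANDS = {"read", "write"}
ROW_COMMANDS = {"activate", "precharge"}

def select_command(available, access_selection):
    """Pick one command from `available`, a list ordered oldest request first.

    available:        list of (request_id, command) pairs legal this cycle
    access_selection: "column_first", "row_first", or None (oldest ready wins)
    """
    if not available:
        return None
    if access_selection == "column_first":
        preferred = COLUMN_COMMANDS
    elif access_selection == "row_first":
        preferred = ROW_COMMANDS
    else:
        return available[0]
    for candidate in available:          # oldest command of the preferred kind wins
        if candidate[1] in preferred:
            return candidate
    return available[0]                  # fall back to the oldest legal command

ready = [(0, "precharge"), (1, "read"), (3, "read")]
print(select_command(ready, "column_first"))  # (1, 'read')
print(select_command(ready, "row_first"))     # (0, 'precharge')
print(select_command(ready, None))            # (0, 'precharge')
```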



### A new micro-architecture design for a memory channel – a channel-split configuration



- MSHR and CBE <u>per AG</u>
- Switch thread when the requests are completely drained
- Avoid
  - Resource monopolization
  - Internal bank conflicts
  - Read/write turnaround penalty


### Six applications from multimedia and scientific domains are used for study

- DEPTH: stereo depth encoder
- MPEG: MPEG-2 video encoder
- RTSL: graphics rendering pipeline
- QRD: complex matrix → upper triangular and orthogonal matrices
- FEM: finite element method
- MOLE: n-body molecular dynamics



### A cycle-accurate Imagine simulator is used for performance evaluation



- 1 GHz
- 8 processing elements




### Memory system performance for representative configurations on six apps








Small record sizes, strides, and index ranges mean high spatial locality between the requests generated from an access thread




Streams with small record sizes and large index ranges lack spatial locality between generated requests




Long bursts hurt memory system performance



## The key memory system related characteristics of six applications

| Application | Avg. strided record size (W) | Avg. strided stream length (W) | Avg. strided stride/record | Avg. indexed record size (W) | Avg. indexed stream length (W) | Avg. indexed index range | Strided access | Read access |
|-------------|------------------------------|--------------------------------|----------------------------|------------------------------|--------------------------------|--------------------------|----------------|-------------|
| DEPTH       | 1.96                         | 1802                           | 1.95                       | 1                            | 1170                           | 1180                     | 46.6%          | 63.0%       |
| MPEG        | 1                            | 1515                           | 1                          | 1                            | 1280                           | 2309                     | 90.1%          | 70.2%       |
| RTSL        | 4                            | 1170                           | 4                          | 1                            | 264                            | 216494                   | 65.1%          | 83.5%       |
| MOLE        | 1                            | 480                            | 1                          | 9                            | 3252                           | 7190                     | 9.9%           | 99.5%       |
| QRD         | 115                          | 1053                           | 350                        | N/A                          | N/A                            | N/A                      | 100%           | 69.0%       |
| FEM         | 12.4                         | 1896                           | 12.4                       | 24                           | 3853                           | 20334                    | 48.8%          | 74.0%       |

Large record size means high spatial locality in generated requests






#### Performance sensitivity to the size of MC buffers






### Performance sensitivity to the size of MC buffers rises as the number of AGs is increased



### Reordering row & column commands is important, but specific MAS policies are not



#### Conclusion

- DRAM trends
  - Data BW increases rapidly while latency and cmd BW improve slowly
    - DRAM access granularity grows
    - Throughput is very sensitive to access patterns
- Locality must be exploited
  - To minimize internal bank conflicts and read-write turnaround penalties
- Memory system design space
  - Number of AGs: inter-thread vs. intra-thread parallelism
  - Load balance across MCs: channel interleaving, multiple threads, and AG width
  - The amount of MC buffering determines the window size of MAS
- Design suggestions
  - A single wide AG exploits DRAM locality well
  - The channel-split mechanism exploits locality and balances loads across multiple channels simultaneously, at the cost of additional hardware