EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011

#### Lecture 23 – Memory Systems

#### Mattan Erez



The University of Texas at Austin



### Outline

- DRAM technology
- DRAM organization and mechanism
- Memory system organization and design options
- New emerging trends

• Most slides courtesy Jung Ho Ahn, SNU



### **DRAM is Expensive**

- \$/bit is almost nothing
  - \$2 x 10<sup>-9</sup>
- Memory in system is expensive
  - ~10% to 50% of system cost
- Rule of thumb: the more memory the better
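
As a rough worked example of why cheap bits still add up, the sketch below multiplies the per-bit price quoted above by the number of bits in a given capacity. The 1 GiB and 16 GiB capacities are illustrative assumptions, not figures from the lecture.

```python
# Per-bit price taken from the slide; the capacities below are assumed examples.
DOLLARS_PER_BIT = 2e-9
BITS_PER_GIB = 8 * 2**30


def dram_cost(gib: float) -> float:
    """Return the DRAM cost in dollars for a capacity given in GiB."""
    return gib * BITS_PER_GIB * DOLLARS_PER_BIT


for capacity in (1, 16):
    print(f"{capacity:>2} GiB -> ${dram_cost(capacity):.2f}")
```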

#### What is DRAM?



#### **DRAM Cell**

- Capacity  $\rightarrow$  density  $\rightarrow$  3D
  - Recessed Channel Array Transistor (capacitor on top)
    - Samsung, Hynix, Elpida, Micron
  - Trench capacitor
    - Infineon, Nanya, ProMos, Winbond
    - IBM, Toshiba in embedded-DRAM





Infineon 80nm



Hynix 80nm

#### **DRAM** Array






#### **DRAM Sense Amplifier**





### **DRAM Optimization**

- The more memory the better → optimize capacity
- Also need to worry about power and drivers
- Result is compromise in latency
  - Small capacitors and large sub-arrays increase access time
- What about bandwidth?
  - Bandwidth is expensive:
    - \$0.05 to \$0.10 per package pin
    - DDR2 requires 80 pins
- Secondary goal is optimizing BW/pin
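
A quick check of the pin-cost arithmetic, using only the two figures quoted above. Treating the cost as a simple product of pin count and per-pin price is an assumption made for illustration.

```python
from typing import Tuple

# Figures from the slide: cost range per package pin and the DDR2 pin count.
COST_PER_PIN_LOW, COST_PER_PIN_HIGH = 0.05, 0.10
DDR2_PINS = 80


def interface_pin_cost(pins: int) -> Tuple[float, float]:
    """Return the (low, high) pin cost in dollars for one memory interface."""
    return pins * COST_PER_PIN_LOW, pins * COST_PER_PIN_HIGH


low, high = interface_pin_cost(DDR2_PINS)
print(f"${low:.2f} to ${high:.2f} in pins per DDR2 interface")
```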



#### Massively parallel processor architecture

BW <u>demand</u> of ALUs >> BW <u>supply</u> from DRAMs

#### Stream processor architecture



- LRF and SRF provide a hierarchy of bandwidth and locality
  - SRF decouples execution from memory

#### Streaming Memory Systems (SMSs)



- Off-chip DRAMs need to meet the processor's bandwidth demands
  - Multiple address-interleaved memory channels
    - High bandwidth DRAM per channel



 1) Because of load imbalance between multiple memory channels
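
A minimal sketch of how address interleaving can produce load imbalance. The channel count, the interleaving granularity, and the mapping function `channel_of` are assumptions chosen for illustration; the lecture does not specify them.

```python
from collections import Counter
from typing import Iterable, List


def channel_of(word_addr: int, num_channels: int = 4, interleave_words: int = 4) -> int:
    """Map a word address to a channel (assumed block-interleaved mapping)."""
    return (word_addr // interleave_words) % num_channels


def channel_load(addresses: Iterable[int], num_channels: int = 4,
                 interleave_words: int = 4) -> List[int]:
    """Count how many of the given accesses land on each channel."""
    counts = Counter(channel_of(a, num_channels, interleave_words) for a in addresses)
    return [counts.get(c, 0) for c in range(num_channels)]


sequential = list(range(64))            # unit-stride stream
strided = [i * 16 for i in range(64)]   # stride equals channels * interleave

print("sequential:", channel_load(sequential))
print("strided:   ", channel_load(strided))
```

With these assumed parameters the unit-stride stream spreads evenly over the four channels, while the stride-16 stream sends every access to the same channel.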










 2) Because the performance of modern DRAMs is very sensitive to access patterns







 Parallelism and locality are necessary for efficient DRAM usage

## Memory system designs rely on the inherent parallelism/locality of memory accesses



 A stream load or store operation yields a large number of related memory accesses,



- Because accesses come in blocks, an SMS can exploit
  - Parallelism by generating multiple references per cycle per thread
  - Parallelism by generating references from multiple threads
  - Parallelism by both of the above
  - Locality by generating all references of a thread

#### It is important to understand how the interactions between these different factors affect performance

### Outline

- DRAM technology
- DRAM organization, mechanism, and trends
- Memory system organization and design options
- A few results

• Most slides courtesy Jung Ho Ahn, HP Labs



#### A DRAM chip contains multiple memory banks, where each bank is a 2-D array



- (bank, row, column)
- Many shared resources on a DRAM chip
  - Row & column accesses shared by request path.
  - Data read & written through shared data path.
  - All banks share request and data path.

This sharing and the dynamic nature of DRAM result in strict access rules and timing constraints
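
To make the (bank, row, column) coordinates concrete, here is a small sketch that splits a flat word address into those three fields. The geometry (8 banks, 4096 rows, 256 columns) and the field order are assumptions for illustration; real devices and controllers choose their own mappings.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    bank: int
    row: int
    column: int


# Assumed geometry, for illustration only.
NUM_BANKS, NUM_ROWS, NUM_COLUMNS = 8, 4096, 256


def decode(word_addr: int) -> DramAddress:
    """Split a flat word address laid out as [row | bank | column] (assumed)."""
    column = word_addr % NUM_COLUMNS
    bank = (word_addr // NUM_COLUMNS) % NUM_BANKS
    row = (word_addr // (NUM_COLUMNS * NUM_BANKS)) % NUM_ROWS
    return DramAddress(bank, row, column)


for addr in (0, 256, 2053):
    print(addr, decode(addr))
```

Placing the bank bits just above the column bits, as assumed here, makes consecutive rows' worth of data fall into different banks.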

## A DRAM follows rules and occupies resources to access a location

#### Simplified Bank State Diagram

**Operation Resource Utilization**
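
The access rules can be sketched as a three-case decision on the state of the target bank. This is a simplified model written for illustration: it ignores all timing constraints, and the function name and command strings are hypothetical.

```python
from typing import List, Optional


def commands_for_access(active_row: Optional[int], target_row: int, op: str) -> List[str]:
    """List the commands needed to perform `op` ('read' or 'write') on target_row.

    active_row is the row currently open in the bank, or None when the
    bank is idle (precharged).
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if active_row == target_row:
        return [op]                           # row hit: column command only
    if active_row is None:
        return ["activate", op]               # idle bank: open the row first
    return ["precharge", "activate", op]      # row conflict: close, then open


print(commands_for_access(2, 2, "read"))
print(commands_for_access(None, 2, "read"))
print(commands_for_access(3, 2, "write"))
```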

## DRAM operation sequences for two memory read requests





EE382N: Principles of Computer Architecture, Fall 2011 -- Lecture 23 (c) Mattan Erez, Jung Ho Ahn


## Activate to active time determines random access performance





## Switches between read/write commands and data transfer require timing delay



## DRAM parameter trends over various DRAM generations






Timing trends show that DRAM performance is very sensitive to the presented access patterns

### Outline

- DRAM technology
- DRAM organization, mechanism, and trends
- Memory system organization and design options
- A few results

• Most slides courtesy Jung Ho Ahn, SNU



### **Streaming Memory Systems**

- Bulk stream loads and stores
  - Hierarchical control
- Expressive and effective addressing modes
  - Can't afford to waste memory bandwidth
  - Use hardware when performance is non-deterministic



- Automatic SIMD alignment
  - Makes SIMD trivial (SIMD ≠ short-vector)

### Stream memory system helps the programmer and maximizes I/O throughput

# A streaming memory system consists of AGs, cross-point switch and MCs



- AG : address generator
- MC : memory channel

## An AG translates a memory access thread into a sequence of individual memory requests

- AG : address generator
- [address, data]
- <u>Record size</u> : # of consecutive words per data record mapped to a PE
- <u>Stride</u> : address gap between consecutive records
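
A minimal sketch of what an AG does for a strided stream, using the record size and stride defined above. It assumes the stride is measured in words between the first words of consecutive records; the function name and parameters are hypothetical.

```python
from typing import Iterator


def strided_addresses(base: int, record_size: int, stride: int,
                      num_records: int) -> Iterator[int]:
    """Yield the word addresses of a strided stream, record by record."""
    for record in range(num_records):
        start = base + record * stride      # first word of this record
        for word in range(record_size):     # consecutive words in the record
            yield start + word


print(list(strided_addresses(base=100, record_size=2, stride=5, num_records=3)))
```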



### **Cross-point Switch**



 On-chip and off-chip have <u>different address spaces</u>



## A memory channel contains MSHRs, channel buffer entries, and a memory controller





**DRAM** commands



### The design space of a Streaming Memory System



### AG design space



- A single wide AG vs. multiple narrow AGs
  - Inter-thread vs. intra-thread parallelism
  - Load balancing across MCs

Number of accesses each AG can generate per cycle

### Memory controller in an MC determines DRAM command per cycle based on the MAS policy

- Per DRAM command cycle, the memory controller
  - Looks at the status of every DRAM bank
  - Finds an available command per pending access without violating timing and resource constraints
  - Selects the command to issue among all available commands based on the priority of the chosen policy

#### DRAM bank status

- bank 0 : row 2 active
- bank 1 : row 0 active
- bank 2 : row 3 precharging
- bank 3 : idle

#### Pending requests in channel buffer

- (0, 0, 0) write - precharge
- (1, 0, 0) read - read
- (0, 2, 1) write
- (1, 0, 1) read - read

A read occurred in the previous cycle.


### Memory controller in an MC determines DRAM command per cycle based on the MAS policy

- Scheduling policies
  - <u>Open</u>: a row is precharged when there are <u>no pending</u> accesses <u>to the row</u> and there is at least <u>one pending</u> access <u>to a different row</u> in the same bank
  - <u>Closed</u>: a row is precharged <u>as soon as the last</u> available reference to that row is performed

#### Pending requests in channel buffer

| Case 1          | Case 2          |
|-----------------|-----------------|
| (1, 0, 0) write | (1, 0, 0) write |
| (1, 0, 0) read  | (1, 3, 0) read  |
| (2, 0, 1) write | (2, 5, 1) write |
| (2, 0, 1) read  | (0, 1, 1) read  |
| ...             | ...             |

bank 0 : row 0 is active
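
The two precharge policies can be compared on the two cases above, with row 0 active in bank 0. This is a sketch of one reading of the policy definitions; `should_precharge` and the request encoding are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[Tuple[int, int, int], str]   # ((bank, row, column), operation)

CASE_1: List[Request] = [
    ((1, 0, 0), "write"), ((1, 0, 0), "read"),
    ((2, 0, 1), "write"), ((2, 0, 1), "read"),
]
CASE_2: List[Request] = [
    ((1, 0, 0), "write"), ((1, 3, 0), "read"),
    ((2, 5, 1), "write"), ((0, 1, 1), "read"),
]


def should_precharge(policy: str, bank: int, active_row: int,
                     pending: List[Request]) -> bool:
    """Decide whether the open row in `bank` should be precharged now."""
    same_row = any(b == bank and r == active_row for (b, r, _c), _op in pending)
    other_row = any(b == bank and r != active_row for (b, r, _c), _op in pending)
    if policy == "closed":
        return not same_row                  # last reference to the row is done
    if policy == "open":
        return not same_row and other_row    # wait until another row is needed
    raise ValueError("policy must be 'open' or 'closed'")


for name, case in (("Case 1", CASE_1), ("Case 2", CASE_2)):
    for policy in ("open", "closed"):
        print(name, policy, should_precharge(policy, 0, 0, case))
```

In this reading, the open policy leaves bank 0 alone in Case 1 because nothing else wants that bank, while the closed policy precharges it in both cases.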



| Algorithm  | Window size | Reorder row<br>commands | Reorder column<br>commands | Precharging | Access<br>selection |
|------------|-------------|-------------------------|----------------------------|-------------|---------------------|
| inorder    | 1           | N/A                     | N/A                        | N/A         | N/A                 |
| inorderla  | nCB         | Yes                     | No                         | Open        | Column First        |
| firstready | nCB         | Yes                     | Yes                        | Open        | N/A                 |
| opcol      | nCB         | Yes                     | Yes                        | Open        | Column First        |
| oprow      | nCB         | Yes                     | Yes                        | Open        | Row First           |
| clcol      | nCB         | Yes                     | Yes                        | Closed      | Column First        |
| clrow      | nCB         | Yes                     | Yes                        | Closed      | Row First           |

- The inorder policy processes pending requests **one by one**, effectively having a window size of 1
- Inorderla looks ahead at other pending requests and generates row commands, but not column commands, for them
- The firstready policy checks and processes pending requests one by one from the oldest until it finishes looking at all the CBEs
The remaining four policies reorder both row and column commands, using open or closed precharging and column-first or row-first access selection.

### A new micro-architecture design for a memory channel: a channel-split configuration



- MSHR and CBE per AG
- <u>Switch</u> thread when the requests are <u>completely</u> <u>drained</u>
- Avoid
  - <u>Resource monopolization</u>
  - Internal bank conflicts
  - <u>Read/write turnaround penalty</u>



### Outline

- DRAM technology
- Impact on memory system
  - Stream architecture review
- DRAM organization, mechanism, and trends
- Memory system organization and design options
- A few results

• Most slides courtesy Jung Ho Ahn, HP Labs



# Six applications from multimedia and scientific domains are used for study

- **<u>DEPTH</u>** : stereo depth encoder
- **MPEG** : MPEG-2 video encoder
- **<u>RTSL</u>** : graphics rendering pipeline
- **<u>QRD</u>** : complex matrix → upper triangular & orthogonal
- **<u>FEM</u>** : finite element method
- **MOLE** : n-body molecular dynamics



# A cycle-accurate Imagine simulator is used for performance evaluation

- DRAM burst length : 4 words
- Peak DRAM BW : 4 GW/s
- # of internal DRAM banks : 8
- DRAM timing params : XDR
- Peak DRAM BW : 2
- # of DRAM commands for accessing an inactive row : 3

# The key memory system related characteristics of six applications

| Application | Avg. strided<br>record size (W) | Avg. strided<br>stream length (W) | Avg. strided<br>stride/record | Avg. indexed<br>record size (W) | Avg. indexed<br>stream length (W) | Avg. indexed<br>index range | Strided<br>access | Read<br>access |
|-------------|------|------|------|-----|------|--------|-------|-------|
| DEPTH       | 1.96 | 1802 | 1.95 | 1   | 1170 | 1180   | 46.6% | 63.0% |
| MPEG        | 1    | 1515 | 1    | 1   | 1280 | 2309   | 90.1% | 70.2% |
| RTSL        | 4    | 1170 | 4    | 1   | 264  | 216494 | 65.1% | 83.5% |
| MOLE        | 1    | 480  | 1    | 9   | 3252 | 7190   | 9.9%  | 99.5% |
| QRD         | 115  | 1053 | 350  | N/A | N/A  | N/A    | 100%  | 69.0% |

- DEPTH & MPEG have small record sizes, strides, and index ranges
- Streams with small record sizes and large index ranges lack spatial locality between generated requests
- Large record size means high spatial locality in generated requests

# Memory system performance for representative configurations on six apps

- Small record sizes, strides, and index ranges mean high spatial locality between the requests generated from an access thread
- Long bursts hurt memory system performance
- QRD & FEM perform similarly to DEPTH & MPEG

#### Performance sensitivity to the size of MC buffers

### Performance sensitivity to the size of MC buffers rises as the number of AGs is increased





## Reordering row & column commands is important, but specific MAS policies are not





### **Emerging Memory Technology**

- New non-volatile storage devices
  - Fine-grained access like DRAM, non-volatile like FLASH
    - No refresh (or almost no refresh)
  - Density equal to or better than DRAM
    - Potentially more scalable than DRAM
  - Higher latencies, especially for writes
  - Higher write energy
  - Possible endurance issues
  - Very similar array structure and interface design to DRAM
- New interface options
  - 3D integration
    - Memory on top of a processor
    - Memory cubes
  - Optical interconnect

### Conclusion

- DRAM trends
  - Data BW increases rapidly while latency and cmd BW improve slowly
    - DRAM access granularity grows
    - Throughput is very sensitive to access patterns
  - Locality must be exploited
    - To minimize internal bank conflicts and read-write turnaround penalties
- Memory system design space
  - Number of AGs: inter-thread vs. intra-thread parallelism
  - Load balance across MCs: channel interleaving, multiple threads, and AG width
  - The amount of MC buffering determines the window size of MAS
- Design suggestions
  - A single wide AG exploits DRAM locality well
  - Channel-split mechanism exploits locality and balances loads across multiple channels simultaneously at the cost of additional hardware

