Lecture 4 (9/9/2009) — Locality + Cache Aware

These are notes from last year, but I think they cover things well — feel free to edit.

Last Time

  • Overview of CPU
  • Memory vs. Registers
  • Importance of locality
  • Future of Wires
    • Why is locality so important? Wires!
      • Power
      • Latency
      • Bandwidth

VLSI Interconnect

Resistance, Capacitance, Inductance, Delay

Resistance

Resistance, R = \frac{length}{height * width} * \rho

Now, let \alpha = \frac{\lambda_\textit{new}}{\lambda_\textit{old}}

Assume a local wire, so length scales down at the same rate as width (and, ideally, height).

Then R_\textit{new} = \frac{\alpha\cdot length}{\left(\alpha\cdot height\right)\left(\alpha\cdot width\right)}\rho so \frac{R_\textit{new}}{R_\textit{old}} = \frac{1}{\alpha}

However, height tends to shrink more slowly than \alpha (the aspect ratio increases) to keep resistance down, so in practice it is not that bad: \frac{R_\textit{new}}{R_\textit{old}} < \frac{1}{\alpha}
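The resistance ratio can be checked numerically. This is a minimal sketch, assuming resistivity \rho is unchanged and that length and width both scale by \alpha; the function name, the 0.7 shrink factor, and the 0.85 height factor are illustrative choices, not values from the lecture.

```python
def resistance_ratio(alpha, height_scale=None):
    """Return R_new / R_old for a wire whose length and width scale by alpha.

    height_scale is the factor applied to the wire height. It defaults to
    alpha (ideal scaling). Resistivity is assumed unchanged (an assumption).
    """
    if height_scale is None:
        height_scale = alpha
    # R = rho * length / (height * width); rho cancels in the ratio.
    return alpha / (height_scale * alpha)


ideal = resistance_ratio(0.7)                    # ideal scaling: 1 / alpha
tall = resistance_ratio(0.7, height_scale=0.85)  # height shrinks more slowly
print(ideal, tall)
```

With these made-up factors the taller wire's ratio is smaller than 1 / \alpha, which is the inequality stated above.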

Capacitance

There are four main contributors to wire capacitance: the bottom plate, the top plate, and the sidewalls to neighboring wires on each side.

Capacitance, C \propto length

Then per unit length, \frac{C_\textit{new}}{C_\textit{old}} = 1

However, due to better dielectrics, for unit length, \frac{C_\textit{new}}{C_\textit{old}} < 1

Inductance

Inductance doesn’t contribute much to delay because the inductance, L, is relatively small.

Delay

Now, Delay D \propto RC \propto \frac{length}{height * width} * length

So, as we scale down: if the wire's length scales too (local wire), the delay stays constant; if the length does not scale (global wire), delay scales with \frac{1}{\alpha^2}
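The two cases can be worked through numerically. This is a minimal sketch, assuming the delay model above (delay proportional to length squared over height times width) with height and width both scaling ideally by \alpha; the function name and the 0.7 shrink factor are illustrative, not from the lecture.

```python
def delay_ratio(alpha, length_scales):
    """Return D_new / D_old for delay proportional to length^2 / (height * width).

    Height and width are both assumed to scale by alpha (ideal scaling).
    length_scales=True models a local wire, False a global wire.
    """
    length_scale = alpha if length_scales else 1.0
    return (length_scale * length_scale) / (alpha * alpha)


local = delay_ratio(0.7, length_scales=True)     # local wire: stays at 1
global_ = delay_ratio(0.7, length_scales=False)  # global wire: 1 / alpha^2
print(local, global_)
```

For a 0.7x shrink the global-wire delay roughly doubles per generation under these assumptions, while the local-wire delay is unchanged.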

So long global wires are a problem because delay goes up quadratically. What to do?

  1. By using repeaters for global wires, delay scales only with \frac{1}{\alpha}
    • delay is now linear
    • power is high due to repeaters
    • vias cause major congestion and problems for CAD tools
  2. By taking advantage of RLC we can build transmission lines on die
    • speed of light transmission
    • for low latency, less than a cycle delay
    • power and area problems
  3. Capacitive feed-forward low-swing
    • delay is reasonable
    • better power
    • better bandwidth per wire, but fewer wires overall, so perhaps constant total bandwidth
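The "delay is now linear" point for repeaters can be illustrated with a toy model. This is a sketch under stated assumptions: a distributed-RC delay proportional to length squared, equal-length segments, and a fixed delay per repeater. All names and numbers (0.1 ns/mm², 0.05 ns per repeater, 1 mm segments) are hypothetical and chosen only to show the trend.

```python
RC_NS_PER_MM2 = 0.1   # hypothetical wire RC constant, ns per mm^2
REPEATER_NS = 0.05    # hypothetical delay added by each repeater, ns
SEGMENT_MM = 1.0      # hypothetical spacing between repeaters, mm


def unrepeated_delay(length_mm):
    """Distributed RC delay: grows with the square of the wire length."""
    return RC_NS_PER_MM2 * length_mm ** 2


def repeated_delay(length_mm):
    """Delay of a wire split into equal segments, one repeater per segment."""
    segments = length_mm / SEGMENT_MM
    per_segment = RC_NS_PER_MM2 * SEGMENT_MM ** 2 + REPEATER_NS
    return segments * per_segment


u10, u20 = unrepeated_delay(10), unrepeated_delay(20)
r10, r20 = repeated_delay(10), repeated_delay(20)
print(u20 / u10, r20 / r10)  # doubling length: about 4x vs. about 2x
```

Doubling the length quadruples the unrepeated delay but only doubles the repeated one; the price, as noted in the list above, is repeater power and via congestion, which this model ignores.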

Bandwidth

  • unrepeated wires, simply \frac{1}{\mathit{delay}} \cdot \mathit{density}
    • delay and density go down with wider wires
      • an optimal width exists, mostly relevant for global wires
  • repeated, same but per wire-segment (which does scale)
    • \frac{1}{\mathit{segment\ delay}} \cdot \mathit{density}
    • repeated wires have increasing BW with technology
  • global wires have higher BW per wire, but fewer wires
    • around 500 50K blocks on .18 so 10X more BW in semiglobal wires
    • another order of magnitude for local wires
  • low-swing techniques can have higher BW per wire
    • but again, fewer wires because of extra circuits, so perhaps ~constant BW.

Power

Power for unrepeated wires is P = \alpha C f V \Delta V + P_\textit{static}, where \alpha here is the switching activity factor (not the scaling ratio defined earlier).

  • Frequency goes up since we want our chips to run faster
  • Supply voltage
    • Used to scale down with technology (same ratio as length).
    • Not anymore, though: voltage has stabilized at around 0.9 - 1 V.
    • Why? Balance of leakage and speed ( V_\textit{th} can’t keep going down).
    • We want V_\textit{DD} - V_\textit{th} fairly high for speed, and V_\textit{th} itself high enough that the transistor is really off. As V_\textit{th} decreases, leakage current goes up.
  • Repeated wires make things even worse because repeaters are power hungry
    • Example:
      • roughly speaking, a constant 0.35 - 0.45 pF per repeated mm across technologies (depending on who is estimating).
      • V is stuck at ~0.8–1.1 V
      • power is 0.25 - 0.32 mW/(Gbit/s) per mm (or 0.28 - 0.32 pJ/bit/mm)
  • Solutions?
    1. Communicate over shorter distances so C drops
      • Locality!
    2. Reduce \Delta V using capacitive low swing signaling
      • Two capacitors in series so C goes down
      • Differential low swing signals are amplified using something like DRAM sense amps
      • 10X drop in power or more
      • Pay area:
        • Capacitors
        • Low-swing detectors
      • Area spent on these circuits leaves room for fewer wires, which means reduced total bandwidth
      • power consumed when not transmitting
        • state of the art (Mensink, E., Schinkel, D. and Klumperink, E. (2007) A 0.28 pJ/b 2Gb/s/ch Transceiver in 90nm CMOS for 10mm On-Chip interconnects. Solid-State Circuits Conference, 2007. ISSCC 2007. Digest of Technical Papers. IEEE International.) is 0.028 pJ/bit/mm (0.12 pJ/bit static!) in 90nm
        • but pitch is 15 µm instead of 0.5 µm
        • pitch can be much improved though with same general technique.
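The dynamic term of the power formula can be turned into an energy-per-bit estimate, since mW per Gbit/s is the same unit as pJ per bit. This is a minimal sketch that ignores static power and the capacitance reduction from series capacitors; the function name, the 0.4 pF/mm and 0.9 V operating point (picked from inside the ranges in the example above), the activity factor of 1, and the 0.09 V swing are all assumptions for illustration.

```python
def energy_per_bit_pj(cap_pf_per_mm, v_supply, v_swing=None, activity=1.0):
    """Dynamic energy per bit per mm in pJ: activity * C * V * delta_V.

    Static power is ignored. With v_swing=None the wire swings rail to rail.
    """
    if v_swing is None:
        v_swing = v_supply  # full-swing signaling
    # pF * V * V gives pJ
    return activity * cap_pf_per_mm * v_supply * v_swing


full = energy_per_bit_pj(0.4, 0.9)                # full swing
low = energy_per_bit_pj(0.4, 0.9, v_swing=0.09)   # hypothetical 10x smaller swing
print(full, low, full / low)
```

The full-swing value comes out at about 0.32 pJ/bit/mm, the same ballpark as the figures quoted above, and cutting \Delta V by 10x cuts the dynamic energy by 10x in this simplified model.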

Off-chip

  • high-speed serial links
    • 2 - 3 pJ/bit for custom specialized designs from Rambus and Intel
    • 20 - 30 pJ/bit more standard
    • efficient designs are ~5–10 Gbps per pin pair
      • area is high, 0.3 mm² per channel or more (and probably not really shrinking).
      • can economically do 1 - 4 Tbps of chip I/O
  • Proximity
    • currently 3 pJ/bit but can come down
    • very high BW demonstrated, 10s Tbps possible
  • Optical
    • today > 3pJ/bit and not all that competitive
    • maybe good in future if we have on-chip optical for power dissipation on chip.
  • 3D stacking
    • pins → vias !
    • heat dissipation problems and limited scalability.
      • pushes the problem out but doesn’t solve it.
  • The most aggressive research seems to be in optical.

Summary

Wires don’t scale much in terms of power; they mostly just improve in BW over time. Only very small improvements in energy are likely.

| Wire Type | Energy | Qualitative |
|---|---|---|
| Scaled local | 0.25 - 0.3 pJ/bit/mm | High bandwidth density |
| Repeated global | 0.25 - 0.3 pJ/bit/mm | Good BW density |
| Capacitive feed-forward + transmission lines (low swing) | 0.2 - 0.3 pJ/bit (any distance) | Much lower BW than repeaters |
| Off-chip (in research) | 2 - 3 pJ/bit | |
| Off-chip (in products) | 20 - 30 pJ/bit | |
| Off-chip far out (optics) | 0.5 - 2 pJ/bit | High BW |
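To get a feel for what the summary numbers mean for locality, the sketch below compares moving one word across the die with sending it off chip. The energy ranges are copied from the summary; the 64-bit word and the 10 mm distance are hypothetical, as are the function names.

```python
# Energy ranges (low, high), copied from the summary above.
ON_CHIP_PJ_PER_BIT_MM = (0.25, 0.3)         # scaled local / repeated global
OFF_CHIP_PRODUCT_PJ_PER_BIT = (20.0, 30.0)  # off-chip links in products


def on_chip_energy_pj(bits, distance_mm):
    """Return the (low, high) energy in pJ to move bits over distance_mm on chip."""
    low, high = ON_CHIP_PJ_PER_BIT_MM
    return bits * distance_mm * low, bits * distance_mm * high


def off_chip_energy_pj(bits):
    """Return the (low, high) energy in pJ to send bits over an off-chip link."""
    low, high = OFF_CHIP_PRODUCT_PJ_PER_BIT
    return bits * low, bits * high


word_on = on_chip_energy_pj(64, 10)  # hypothetical: 64-bit word across 10 mm
word_off = off_chip_energy_pj(64)    # same word over a product-grade link
print(word_on, word_off)
```

Under these assumptions the off-chip transfer costs several times more than crossing 10 mm of die, and a 1 mm on-chip hop costs ten times less again, which is the quantitative case for keeping communication local.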