Hardware Trends and SW/HW Co-Tuning Opportunities

Organizing Committee: Mattan Erez, University of Texas at Austin

Reaching exascale will involve significant changes to the underlying system components, such as the processor, memory, and interconnect. These changes include new technology as well as continued advances in, and improved efficiency of, known techniques. For example, emerging non-volatile memory can potentially be integrated as part of the main memory system rather than used only as solid-state disks, and integrated optical interconnect can significantly change communication tradeoffs. At the same time, new opportunities are emerging for improving the efficiency of the processor architecture itself and of both on-chip and off-chip electrical links. In this workshop we will focus on trends in hardware components, projections of future capabilities and constraints, and the implications for applications. We will also discuss opportunities for co-tuning software and hardware and the potential for new paradigms. The goals of the workshop are to present predictions on where hardware is heading and to identify the potential problems and opportunities this new technology creates for applications and software.

The workshop is organized into three sessions, each with two talks on trends and opportunities followed by a brief mini-panel to develop ideas and start discussions. At the end of the day we will hold an audience-wide discussion to identify the implications for applications and software of interest and to explore opportunities for co-tuning and co-design.

8:15 - 8:30    Welcome and Introduction, Mattan Erez
8:30 - 9:05    Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?, James C. Hoe
9:05 - 9:40    From GPU Computing to Exascale: Technology Trends, Brucek Khailany
9:40 - 10:00   Minipanel: Processor Trends
10:00 - 10:30  Break
10:30 - 11:05  Sustainable Silicon: Energy-Efficient VLSI Interconnects, Patrick Chiang
11:05 - 11:40  Optical Interconnects for Exascale Systems, Moray McLaren
11:40 - 12:00  Minipanel: Interconnect Trends
12:00 - 1:30   Lunch
1:30 - 2:05    Low-power/Low-voltage Computing, Shih-Lien Lu
2:05 - 2:40    Processors have evolved, why haven’t main memories?, Al Davis
2:40 - 3:00    Minipanel: On- and Off-Chip Memories
3:00 - 3:30    Break
3:30 - 4:00    Quick Recap, Mattan Erez
4:00 - 5:00    Discussion: Implications on Software and Opportunities for Co-Tuning (Co-Design)

Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?

SPEAKER: James C. Hoe, Carnegie Mellon University


ABSTRACT:

To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. To understand the relative merits of different approaches in the face of technology constraints, we have developed a parameterized analytical performance model for heterogeneous multicores with U-core support. Unlike prior multicore performance models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation about future designs based on scaling trends predicted by the ITRS roadmap. The predictive power of our model depends upon U-core-specific parameters derived by measuring the performance and power of tuned applications on today’s state-of-the-art multicores, GPUs, FPGAs, and ASICs. The results from our study reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits. This work is the thesis research of Eric Chung at Carnegie Mellon University in collaboration with Peter Milder and Ken Mai.

BIO:

James C. Hoe is Professor and Associate Department Head of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in EECS from Massachusetts Institute of Technology in 2000. He is interested in many aspects of computer architecture and digital hardware design, including the specific areas of fault-tolerant processors and systems; high-level hardware description and synthesis; and computer simulation and prototyping technologies. He co-directs the Computer Architecture Lab at Carnegie Mellon (CALCM) and is affiliated with the Center for Silicon System Implementation (CSSI) and the Carnegie Mellon CyLab. He heads the OpenSPARC Center of Excellence at Carnegie Mellon. For more information, please visit http://www.ece.cmu.edu/~jhoe.

From GPU Computing to Exascale: Technology Trends

SPEAKER: Brucek Khailany, NVIDIA

ABSTRACT:

The transition to Exascale computing over the next decade presents several significant processor and memory system architecture challenges. In this talk, we will highlight two key challenges: energy efficiency, and hardware/software mechanisms for supporting many more parallel threads of execution.

Energy efficiency is expected to be the most significant challenge for scaling today’s processor architectures to Exascale. A sustained 20 pJ/FLOP (50 GFLOPS/W) on double-precision floating-point intensive workloads is expected to be needed to meet thermal and power delivery requirements in Exascale systems. Today’s multi-core CPU-based systems would need to improve by 100x, from 2 nJ/FLOP, to hit this goal. Today’s high-throughput GPU architectures already have a 5–10x energy-efficiency (pJ/FLOP) advantage over multi-core CPU architectures. A natural candidate for Exascale computing would be composed of heterogeneous elements: latency-optimized cores to run the O/S, runtime, and serial tasks, and throughput-optimized cores to provide high computation capability and energy efficiency for code sections with ample parallelism. In this talk, we will highlight why today’s GPU architectures provide such a large energy-efficiency advantage and describe some research opportunities for getting to the 20 pJ/FLOP needed for Exascale.
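The arithmetic behind these targets is straightforward to check. A quick sketch, using only the figures quoted in the abstract above:

```python
# Energy targets for an exascale machine (numbers from the abstract above).
EXAFLOPS = 1e18          # sustained 1 exaFLOP/s goal
TARGET_PJ_PER_FLOP = 20  # 20 pJ per double-precision FLOP
CPU_PJ_PER_FLOP = 2000   # today's multi-core CPUs: ~2 nJ/FLOP = 2000 pJ/FLOP

# 20 pJ/FLOP is the same as 50 GFLOPS/W (FLOP per joule == FLOP/s per watt).
flops_per_watt = 1e12 / TARGET_PJ_PER_FLOP   # pJ -> J conversion
print(flops_per_watt / 1e9)                  # -> 50.0 GFLOPS/W

# Required improvement over today's CPU-based systems:
print(CPU_PJ_PER_FLOP / TARGET_PJ_PER_FLOP)  # -> 100.0 (the 100x in the text)

# Power drawn by the arithmetic alone at a sustained exaFLOP/s:
watts = EXAFLOPS / flops_per_watt
print(watts / 1e6)                           # -> 20.0 MW
```

The 20 MW figure covers compute energy only; memory, interconnect, and cooling add to the system budget.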

Another key challenge for Exascale will be supporting parallel execution of millions to billions of threads while providing a productive environment for programmers. This is another area where recent successes in GPU computing offer an opportune starting point. GPUs already provide hardware and software mechanisms for creating and scheduling tens of thousands of lightweight threads, for thread synchronization, and for thread communication. In this talk, we will describe these mechanisms and present research opportunities for extending these concepts to many more threads over many nodes.

BIO:

Brucek Khailany joined NVIDIA Research in December 2009 as a Senior Research Scientist. His research interests include energy-efficient throughput-oriented processor architectures and circuits, software-managed memory and register hierarchies, VLSI design methodology, and computer arithmetic. Prior to joining NVIDIA, Dr. Khailany was a Co-Founder and Principal Architect at Stream Processors, Inc. (SPI), where he led the design and implementation of highly parallel programmable processors. From 2004–2009, under his technical leadership, SPI developed the industry’s first commercially available stream processor architecture targeting signal and image processing applications. Brucek received his Ph.D. from Stanford University in 2003, where he led the silicon implementation of the Imagine stream processor, a research chip that introduced the concepts of stream processing and partitioned register organizations. He is a member of the IEEE and ACM, received his M.S. from Stanford University in 1999, and received a BSE from the University of Michigan in 1997.

Sustainable Silicon: Energy-Efficient VLSI Interconnects

SPEAKER: Patrick Chiang, Oregon State University


ABSTRACT:

The goal of the OSU-VLSI research group is to develop energy-efficient VLSI interconnect circuits and systems that will facilitate future massively parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism at multiple vertical levels, from thousands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on-chip, off-chip, off-rack) will be the critical limitation on energy efficiency.

In this talk, I will describe current research in improving upon the state-of-the-art in energy efficiency (pJ/b) of short-range wireline interconnects. Specifically:

  1. Energy-Efficient, On-Chip Links – Because both VDD and wire capacitance have essentially stopped scaling with CMOS technology, the energy consumed has remained pegged at approximately 150 fJ/b/mm. In this talk, I will describe two measured testchips that improve on-chip interconnect energy by 4x (40 fJ/b/mm) and 20x (8 fJ/b/mm).
  2. Sub-1mW/Gbps Off-Chip Links – The power consumption of off-chip links is critical for future many-core systems, which will be limited by the data communication needed to keep their functional units busy. Here, we will present two 8 Gbps serial link receivers that achieve measured energy efficiencies of 0.6 mW/Gbps and 0.1 mW/Gbps, approximately 5x better than the current state-of-the-art.
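To put these per-bit figures in context, here is a back-of-the-envelope sketch using the numbers quoted above; the 64-bit word size and 10 mm wire length are illustrative assumptions, not figures from the talk:

```python
# Energy to move one double-precision word across an on-chip wire,
# at the fJ/b/mm figures quoted in the abstract above.
BITS = 64      # one 64-bit word (illustrative)
DIST_MM = 10   # assumed on-chip distance (illustrative)

def onchip_energy_pj(fj_per_bit_mm, bits=BITS, dist_mm=DIST_MM):
    """Energy in pJ to move `bits` bits over `dist_mm` mm of wire."""
    return bits * dist_mm * fj_per_bit_mm * 1e-3  # fJ -> pJ

for label, e in [("150 fJ/b/mm baseline", 150),
                 ("4x testchip (40 fJ/b/mm)", 40),
                 ("20x testchip (8 fJ/b/mm)", 8)]:
    print(f"{label}: {onchip_energy_pj(e):.1f} pJ per 64-bit word")

# For off-chip links, mW/Gbps is numerically identical to pJ/bit
# (1 mW / 1 Gbps = 1e-3 J/s / 1e9 b/s = 1e-12 J/b), so the receivers
# above operate at roughly 0.6 pJ/b and 0.1 pJ/b.
```

At the 150 fJ/b/mm baseline the word costs 96 pJ, which dwarfs the ~20 pJ exascale per-FLOP budget discussed elsewhere in this workshop; that is the sense in which communication, not arithmetic, limits efficiency.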

To conclude, I will describe how these circuit-level optimizations will require significant interfacing with the software layer, such as explicit coarse- and fine-grained control of transceiver operating conditions.

BIO:

Patrick Chiang received the B.S. degree in electrical engineering and computer sciences from the University of California, Berkeley, in 1998, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University in 2001 and 2007. He is currently an assistant professor of electrical and computer engineering at Oregon State University. In 1998, he was with Datapath Systems (now LSI Logic), working on analog front-ends for DSL chipsets. In 2002 he was a research intern at Velio Communications (now Rambus) working on 10GHz clock synthesis architectures. In 2004 he was a consultant at startup Telegent Systems, evaluating low phase noise VCOs for CMOS mobile TV tuners. In 2006 he was a visiting NSF postdoctoral researcher at Tsinghua University, China, investigating low power, low voltage RF transceivers. In Summer 2007, he was a visiting professor at the Institute of Computing Technology, Chinese Academy of Sciences. He is the recipient of the 2010 Department of Energy Early CAREER award.

Optical Interconnects for Exascale Systems

SPEAKER: Moray McLaren, HP Labs


ABSTRACT:

The roadmap to Exaflop systems contains significant challenges in network design. Optical interconnects will play a significant part in achieving the aggressive power and bandwidth targets for these systems. The open question is whether optical interconnects will lead to radically different system architectures, or simply to more efficient implementations of traditional designs. New short-range optical interconnects have reduced the cost of converting to and from the optical domain, but this can be regarded as simply “better wires”. On the other hand, technologies such as integrated CMOS photonics will allow system architects to be unconstrained by chip boundaries, challenging the assumptions that have underpinned high-performance processor design.

BIO:

Moray McLaren is a Distinguished Technologist with HP Labs, working in the Exascale Computing Laboratory. His recent research activities have focused on the impact of nanophotonics on future computer architectures. The two main areas of study have been high-speed networking and memory architectures. Prior to joining HP Labs in January 2007, he worked on the development of high-speed interconnects for parallel processors. These interconnects were successfully deployed in a significant number of supercomputing systems around the world. He holds a number of patents in the area of high-speed network interface design. His previous experience also includes the development of parallel systems architectures and CMOS microprocessors. He holds a 1st class honours degree in microelectronics from the University of Edinburgh.

Processors have evolved, why haven’t main memories?

SPEAKER: Al Davis, University of Utah


ABSTRACT:

Both processors and main memory systems were continually optimized over the years to improve single-thread performance. A decade ago, processor architectures started to change direction, optimizing parallel workload throughput rather than single-thread latency. Today, increasing the number of cores per processor is the only game in town for high-performance computing. The DRAM devices that populate main memory have evolved more slowly, and interestingly not in a direction that benefits highly parallel workloads. In addition, new non-volatile memory technologies are rapidly taking their place in the memory hierarchy. This talk will focus on the problems with conventional memory systems and present a brief snapshot of a variety of solution alternatives for main memory, along with an introduction to emerging non-volatile memory device technologies.

BIO:

Al Davis is presently a Professor of Computer Science at the University of Utah and a part-time visiting scientist at HP Laboratories. Previously he held a faculty position at the University of Waterloo, and has also worked in industrial research laboratories for Burroughs, Fairchild, Schlumberger, Intel, and Hewlett-Packard. His research interests include computer and memory system micro-architecture, interconnection networks, embedded systems, EDA CAD tool development, VLSI circuit design, and silicon nanophotonics. He has published over 75 peer-reviewed conference papers, journal articles, and book chapters, and holds (or has pending) 31 patents.

Low-power/Low-voltage Computing

SPEAKER: Shih-Lien Lu, Intel


ABSTRACT:

One of the most effective techniques to reduce a processor’s power consumption is to reduce supply voltage. However, reducing voltage in the context of dynamic and static variations can cause circuits to fail. As a result, voltage scaling is limited by a minimum voltage below which circuits may not operate reliably. In this talk we will discuss opportunities for resiliency to improve energy efficiency of computing in scaled CMOS technologies.
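The leverage of voltage scaling comes from the standard CMOS dynamic-power relation P = a·C·V²·f; this background relation and the numbers below are illustrative, not figures from the talk:

```python
# Dynamic (switching) power in CMOS: P = activity * C_eff * Vdd^2 * f.
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """Switching power in watts from activity factor, effective
    capacitance (F), supply voltage (V), and clock frequency (Hz)."""
    return activity * c_eff * vdd ** 2 * freq

# Illustrative (assumed) operating points: 1 nF effective capacitance, 2 GHz.
p_nominal = dynamic_power(1e-9, 1.0, 2e9)  # Vdd = 1.0 V
p_scaled = dynamic_power(1e-9, 0.7, 2e9)   # Vdd = 0.7 V, same frequency
print(p_scaled / p_nominal)  # roughly 0.49: quadratic savings from voltage alone
```

The quadratic payoff is exactly why designs push Vdd toward the minimum reliable operating voltage, and why the variation-induced failures the abstract describes become the limiting factor.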

BIO:

Shih-Lien Lu received his B.S. in EECS from UC Berkeley, and his M.S. and Ph.D. in CSE from UCLA. He worked as a design manager on the MOSIS project at USC/ISI and served as a faculty member in Oregon State University’s ECE department. Currently he is a Principal Scientist at Intel Labs, where he leads a research group on microarchitecture.