

#### **Designing Tomorrow's Microprocessors**

#### Antonio González

Director, Intel Barcelona Research Center

Professor, Computer Architecture Department, UPC

#### Lecturers:

Josep M Codina, Ayose Falcón, Antonio González, Enric Herrero, Marc Lupon, Pedro Marcuello, Raúl Martínez, Tanausu Ramírez, Kyriakos Stavrou, Ferad Zyulkyarov

Aula Empresa, Facultat d'Informàtica de Barcelona, January 29-31, 2013

© Intel Corporation, 2013

#### **Overview of Today's Microprocessors and Future Trends**

- Computing Evolution
- Technology Scaling
- Microprocessor Families
- Future Challenges
- Some Research Projects at Intel Labs

#### **Agenda**

- Overview of Today's Microprocessors and Future Trends (Antonio González)
- Microarchitecture of Current Microprocessors
- Design Cycle for Microprocessors
- Methodology for Research in Microprocessors (Pedro Marcuello)
- Profiling and Performance Evaluation (Ferad Zyulkyarov)
- Multi-core Processor Architectures
- Parallel Programming
- Josep M Codina
- Reliability
   Enric Herrero and Tanausu Ramirez
- Hw/Sw Co-designed Microprocessors

























#### Gordon Moore's Law

- The number of transistors in a chip doubles every 2 years
  - Based on 4 points (year, transistor count): (1959, 1), (1962, 8), (1964, 32), (1965, 64)
  - Published on April 19, 1965 in Electronics magazine, predicting 65000 transistors in 1975
- Revised in 1975 at the IEEE International Electron Devices Meeting
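
As a quick arithmetic check of the 1975 projection, the sketch below extrapolates from the last data point, (1965, 64). It assumes a doubling every year, which is the rate implied by the last two data points (32 in 1964, 64 in 1965). The function name `project_transistors` and its parameters are illustrative and not part of the slides.

```python
def project_transistors(base_year, base_count, target_year, years_per_doubling=1):
    """Extrapolate a transistor count assuming a fixed doubling period."""
    doublings = (target_year - base_year) / years_per_doubling
    return base_count * 2 ** doublings


# Yearly doubling from (1965, 64): ten doublings by 1975.
yearly = project_transistors(1965, 64, 1975)
# Same span with a two-year doubling period: five doublings.
biennial = project_transistors(1965, 64, 1975, years_per_doubling=2)

print(f"1975, doubling every year:    {yearly:.0f}")
print(f"1975, doubling every 2 years: {biennial:.0f}")
```

Under the yearly assumption the result is 65536, in line with the roughly 65000 transistors quoted for 1975; the two-year period gives 2048 over the same decade.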



| Intel's Technology Roadmap |                  |                  |                  |                  |                  |             |             |                   |
|----------------------------|------------------|------------------|------------------|------------------|------------------|-------------|-------------|-------------------|
| Process Name               | P856             | P858             | Px60             | P1262            | P1264            | P1266       | P1268       | P1270             |
| 1st Production             | 1997             | 1999             | 2001             | 2003             | 2005             | 2007        | 2009        | 2011              |
| Process Generation         | 0.25 µm          | 0.18 µm          | 0.13 µm          | 90 nm            | 65 nm            | 45 nm       | 32 nm       | 22 nm             |
| Wafer Size (mm)            | 200              | 200              | 200/300          | 300              | 300              | 300         | 300         | 300               |
| Interconnect               | Al               | Al               | Cu               | Cu               | Cu               | Cu          | Cu          | Cu                |
| Metal layers               | 5                | 6                | 6                | 7                | 8                | 9           | 9           | 9                 |
| Channel                    | Si               | Si               | Si               | Strained Si      | Strained Si      | Strained Si | Strained Si | Strained Si       |
| Gate dielectric            | SiO<sub>2</sub>  | SiO<sub>2</sub>  | SiO<sub>2</sub>  | SiO<sub>2</sub>  | SiO<sub>2</sub>  | High-k      | High-k      | Tri-gate, High-k  |
| Gate electrode             | Poly-silicon     | Poly-silicon     | Poly-silicon     | Poly-silicon     | Poly-silicon     | Metal       | Metal       | Metal             |
| Lithography                | 248 nm           | 248 nm           | 248 nm           | 193 nm           | 193 nm           | Imm. 193 nm | Imm. 193 nm | Imm. 193 nm       |

# **Microprocessor Families**

- Pipelined / non-pipelined processors
- In-order / out-of-order processors
- Scalar / superscalar processor
- Vector processors
- Multithreaded processors




# **Microprocessor Segments**

- Servers
- Desktop
- Mobile
- Embedded



- A wide range of smart devices, beyond servers and PCs
- Driven by the need and desire of being always connected
- Making computing and communication a seamless experience











#### Intel Labs Barcelona Mission

Develop novel processor microarchitectures
that provide dramatic improvements in performance
with no increase in power and
without compromising reliability
for the increasing diversity of computing systems





#### **MAIN RESEARCH AVENUES**











# Microarchitecture of Current Microprocessors

Marc Lupon marc.lupon@intel.com







#### **Outline**

- Evolution of computing market segments
  - The old days: Performance at whatever cost
  - Yesterday: Climbing walls
  - Today: Let's move!
- Requirements and micro-architectural techniques for distinct market segments
  - Desktops and laptops (PC domain)
  - Embedded and Mobile (Embedded domain)
  - Servers and Supercomputers (HPC domain)
- Characteristics of products in the market
  - Processors for gaming consoles (XBOX vs PS3)
  - Processors for mobile devices
  - Processors for high-computing domains (Atom vs Xeon)
- What's next?
  - User experience and novel devices
  - Design trends for products to come

#### The old days (<2005)

#### Embedded devices

- First goal: Real-time
- Applications: Sequential
- Performance: Medium
- Power: Efficient
- Cost: Cheap
- Challenge: Be cheap
- Example: TI TMS family

Cheap and efficient!

- Microcontrollers
- Accelerators
- Small area and width
- Easy to validate
- Few instructions
- Programmable devices

#### Personal Computer

- First goal: Performance
- Applications: Sequential
- Performance: High
- Power: Don't care
- Cost: Medium
- Challenge: ILP
- Example: Pentium family

#### HPC World

- First goal: Performance
- Applications: Parallel
- Performance: High
- Power: Don't care
- Cost: High
- Challenge: Communication
- Example: Cray family

#### Performance at whatever cost!

- Big cores running at high frequency
- Superscalar, deeply pipelined processors
- Aggressive out-of-order pipelines
- Huge, very accurate branch predictors
- Objective: Extract ILP (CISC-like processors)
- Objective: Extract DLP (vector processors)

#### Yesterday (<2010)

#### Embedded devices

- First goal: Ubiquitous
- Applications: Sequential
- Performance: Medium
- Power: Efficient
- Cost: Cheap/Medium
- Challenge: General PP
- Example: ARM9 family

Simple and general

- Small area and width
- General purpose
- More instructions

#### Personal Computer

- First goal: Low-power
- Applications: Parallel
- Performance: Medium
- Power: Efficient
- Cost: Medium
- Challenge: TLP
- Example: Core family

#### HPC World

- First goal: Fast response
- Applications: Parallel
- Performance: High
- Power: Efficient
- Cost: High
- Challenge: I/O
- Example: Itanium family

#### Climbing walls

Techniques to climb the power and memory wall:

- SMT
- Objective: Extract ILP
- RISC-like processors
- In-order
- Simple designs
- SOC with accelerators
- Memory hierarchy
- Big area, huge amount of transistors
- Chip multiprocessor
- Medium frequency
- Power efficiency

#### Keep running

- Huge cores running at high frequency
- Wider architectures
- IA-64 VLIW
- Predication

#### Today (2010-2015)

#### Personal Computer

- First goal: Performance
- Applications: Parallel
- Performance: High
- Power: Efficient
- Cost: Medium
- Challenge: Perf/Power
- Example: HSW family

#### Embedded devices

- First goal: Survive!
- Applications: Parallel
- Performance: Variable
- Power: Efficient
- Cost: Medium
- Challenge: Perf/Power
- Example: Tegra family

#### Let's move!

- Performance
- General purpose
- Go to low-power
- Multiprocessors
- Out-of-order
- Objective: Extract TLP
- Heterogeneity
- Minimal power
- Always connected
- Win all domains
- Reconfigurable
- Adaptable
- Reliable
- Simplify parallel programming

#### HPC World

- First goal: Reduce cost
- Applications: Parallel
- Performance: High
- Power: Efficient
- Cost: Low
- Challenge: Perf/Power
- Example: Atom family

#### Go cheap!

- Reliable
- Responsive
- Redundancy
- Simple cores...
- ...but lots of cores
- Objective: Increase Instr/Watt
- Enable parallelism











[Figure: issue slots per processor cycle for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 to Thread 5, idle slot]

#### **Low-Power HPC Domain**

- Requirements for the PC domain (<2005)
  - Extract maximum parallelism (20 threads)
  - High power budget (130 W)
  - Big area (500 mm²)
- Techniques to implement a low-power HPC processor
  - Voltage/frequency scaling: reduce supply voltage and/or frequency when the processor is idle
  - Clock gating: disable clocks to inactive components
  - Reduce power consumption of memory components
  - Pipeline gating: reduce mis-speculated instruction execution
  - Pipeline balancing: adjust effective pipeline ways for available IPC
  - Efficient issue logic: cluster structure, adjust effective issue queue size, no matching for ready entries, reducing tag matching entries

#### **Outline**

- Evolution of computing market segments
  - The old days: Performance at whatever cost
  - Yesterday: Climbing walls
  - Today: Let's move!
- Requirements and micro-architectural techniques for distinct market segments
  - Desktops and laptops (PC domain)
  - Embedded and Mobile (Embedded domain)
  - Servers and Supercomputers (HPC domain)
- Characteristics of products in the market
  - Processors for gaming consoles (XBOX vs PS3)
  - Processor for graphics mobile devices (Tegra4)
- What's next?
  - User experience and novel devices
  - Design trends for products to come

















#### **Conclusions**

- The world moves fast, creating new devices (computing domains)
- Computing domains used to be independent...
- ...but now they face the same challenges
  - Low power
  - Parallel computing
  - Cheap (cooling, area, validation)
  - General-purpose applications
  - Performance (when needed)
- We have seen how different micro-architectures attempt to approach the same problems, with their strengths and weaknesses
- Processors must help to increase productivity (profiler, debugger), not the other way around (complex ISA, explicit synchronization, etc)!
- We don't know what the future will bring, but the door is open to any bright idea (and this is )

























## Microarchitecture design

- Microarchitecture changes are not visible to the programmer and can improve performance without software changes.
- Because microarchitectural changes maintain software compatibility, processor microarchitectures have changed much more quickly than architectures.
- Today's higher integration capacity allows more complex techniques to be implemented.




#### Microarchitecture design

The microarchitecture defines the different functional units on the processor as well as the interactions and division of work between them.


## Microarchitecture design

- Designing a processor microarchitecture involves trade-offs of IPC, frequency, die area, power, and design complexity.
  - Number of stages of the pipeline.
  - Instruction issue width.
  - Methods to resolve control dependencies.
  - Methods to resolve data dependencies.
  - Memory hierarchy.
  - In-order / out-of-order execution
  - Multi threading
  - Branch prediction
  - Number and type of functional units






#### Logic design

- In order to obtain this model, a Hardware Description Language (HDL) is used to describe the processor.
- HDLs such as Verilog and VHDL are high-level programming languages created specifically to describe and simulate hardware designs.

HDL levels of abstraction, in order of increasing detail:

| Level                         | Description |
|-------------------------------|-------------|
| Behavioral level              | Includes all the important events but does not specify their exact timing. |
| Register transfer level (RTL) | Models the processor clock and the events/signals that happen at each cycle. An RTL model should be an accurate simulation of the state of the processor at each cycle boundary. |
| Structural level              | Shows the detailed logic gates to be used within each cycle. |



















#### Post-silicon validation methods

- Random Instruction Testing (RIT)
- Real software testing
- Output checked against architectural simulators


#### Post-silicon validation

- Several iterations or "steppings"

[Figure: project timeline covering development (RTL coding, schematics), tapeout, start of post-silicon validation, qualification, and production]

#### **Platform Validation**

Extensive operating system, network and application testing ensures compatibility in today's environment


#### **DFT/DFV** features

- Design for Test/ Design for Validation
  - Specific HW features to ease testing/debug
- Scan logic
  - Logic to load and extract fine grain data
  - Allow reachability
  - Allow observability


## **Cost of Silicon Bug**

"Finding bugs in model testing is the least expensive approach, but the cost of a bug goes up 10x if it's detected in component test, 10x more if it's discovered in system test, and 10x more if it's discovered in the field, leading to a failure, a recall, or damage to a customer's reputation."

John Bourgoin, MIPS CEO at a DesignCon 2006 panel






| Design Type   | Reuse                                                    |
|---------------|----------------------------------------------------------|
| Lead          | Little to no reuse                                       |
| Proliferation | Significant logic changes and new manufacturing process  |
| Compaction    | Little or no logic changes, but new manufacturing process |
| Variation     | Some logic changes on same manufacturing process         |
| Repackage     | Identical die in different package                       |



# Design Cycle for Microprocessors

Raúl Martínez (raul.martinez@intel.com)

Intel Barcelona Research Center

Aula Empresa, Facultat d'Informàtica de Barcelona

© Intel Corporation, 2013

#### **Conclusions**

- Moore's Law predicts the increase in transistor density.
- Transistor scaling and growing transistor budgets have allowed microprocessor performance to increase at a dramatic pace, but they have also increased the effort of microprocessor design.
- Each new fabrication generation is inevitably more complex to produce than the previous ones.
- This implies a higher validation effort at all design levels.
- There is a need for new and better methodologies and tools to help in the different tasks.
- Sustained research is required at every step, especially in microarchitecture, process technology, and validation.


#### Methodology for Research in Microprocessors

Pedro Marcuello

Intel Barcelona Research Center

Aula Empresa, Facultat d'Informàtica de Barcelona, February 2010

© Intel Corporation, 2010

#### Presenter's note

This presentation reflects only personal opinions, not those of Intel Corporation

#### Course agenda

- Introduction and future trends in microarchitecture
- Life cycle of microprocessor design
- Research in microprocessors
- Multi-core architectures
- Parallel programming
- Systems-on-Chip
- Power reduction
- Reliability
- Co-designed virtual machines

#### Goal of this block

- Show what a researcher's work is like
  - What tasks are carried out
  - Where the ideas come from
  - How the ideas are evaluated
  - What is done with these ideas
- Convince you that it can be a good job

## **Agenda**

- Scientific research
- The scientific method
- Studying the environment. Proposing ideas
- Preparing and validating experiments
- Interpreting results
- Disseminating results

#### What do I need to be a researcher?

- An endless desire to learn
- A lot of patience
- A lot of dedication
- A strong critical mind
- A lot of initiative and inventiveness

#### **Definitions**

- Research:
  - RAE: Activity whose purpose is to expand scientific knowledge without pursuing, in principle, any practical application
  - Wikipedia: Scientific research is the deliberate search for knowledge or for solutions to scientific problems
- Researcher:
  - RAE: A person who carries out research
  - A person firmly convinced that their knowledge can improve a process or solve a problem

#### What do I get from being a researcher?

- Recognition
- Personal fulfillment
- Becoming a world-class expert on a specific topic

#### Recognition

Video "Intel's Rock Star"

#### My workplace

#### Recognition?

- Everybody knows what you work on ...
  - 'I don't know, but you spend a lot of hours there'
  - 'Your job is to speed up computers'
  - 'You make computers smaller and make them run hotter'
- ... and they value it
  - A bank considered doing research at Intel a high-risk profession when granting a mortgage, because detectives (in Spanish, "investigador" means both researcher and investigator) had no fixed income
- Not to mention the media

#### So then?

- If you give a researcher the following survey
  - Do you like your job?
  - Would you change it for another one?
  - Would you change it for another reasonable job related to computing?
- The answers would be, by a wide majority
  - Yes → 95%
  - Yes → 70%
  - No → 90%


#### Sources of knowledge

- Chance
- The scientific method

## Great discoveries made by chance

- Gastronomy
  - Cheese
  - Cognac
  - "LSD"
  - Potato chips
- Other fields
  - Vulcanized rubber

And in science ... ???

#### Chance in science

- Alexander Fleming: Penicillin
- Wilhelm Roentgen: X-rays
- Percy Spencer: Microwave oven

#### Question: What do they have in common?

- a) All three received the Nobel Prize for their arduous research and discoveries
- b) All three were American
- c) All three were doing research when luck smiled on them
- d) All three of the above are correct

#### Applied/technological research

- Research carried out in engineering that generates knowledge for the productive sector
  - Pharmaceuticals
    - New vaccines
  - Automotive
    - Safety
    - Performance
    - Fuel consumption
  - Technology
    - Security
    - Performance / Speed
    - Power consumption

#### The scientific method

A set of rules accepted by the scientific community to guide research

- 1. Observation of the environment
- 2. Formulation of hypotheses
- 3. Testing the hypotheses through experimentation
- 4. Thesis or scientific theory (conclusions)

# The scientific method in the study of microprocessors

- 1. Observation of the environment
  - Study the environment
  - Detect what needs to be improved
- 2. Formulation of hypotheses
  - Propose how it can be improved
- 3. Testing the hypotheses
  - Preparation of experiments
- 4. Scientific theory
  - Patents/Prototypes
  - Papers


[Figure: the microprocessor in its environment: applications, compiler, ISA, architecture, and technology; within the microprocessor: execution models, pipeline stages, and internal structures]

#### Studying the environment / Observation

- Observe the phenomenon in order to formulate a hypothesis
- Includes
  - Taking measurements of the process
  - Comparing different instances of the event

## Possible improvements

- Reduce the execution time of applications
  - $T_{exec} = \#inst \times CPI / Freq$
- Consume less (longer battery life)
  - $P = Freq \times V^2 \times C \times Act$
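
The two relations above can be explored numerically. The following is a minimal sketch, assuming execution time is instruction count times CPI divided by clock frequency, and that power follows the frequency, squared voltage, capacitance, and activity product given on the slide. All numeric values and the function names are hypothetical and chosen only for illustration.

```python
def execution_time(instructions, cpi, freq_hz):
    """Execution time in seconds: instructions * CPI / frequency."""
    return instructions * cpi / freq_hz


def dynamic_power(freq_hz, voltage, capacitance, activity):
    """Dynamic power in watts: frequency * V^2 * C * activity factor."""
    return freq_hz * voltage ** 2 * capacitance * activity


# Hypothetical baseline: 1e9 instructions, CPI of 1.0, 2 GHz clock.
base_time = execution_time(1e9, 1.0, 2.0e9)
# A microarchitectural idea that lowers CPI to 0.8 at the same frequency.
better_time = execution_time(1e9, 0.8, 2.0e9)

# Hypothetical baseline power, then frequency scaled by 0.8 and voltage by 0.9.
base_power = dynamic_power(2.0e9, 1.0, 1e-9, 0.5)
scaled_power = dynamic_power(1.6e9, 0.9, 1e-9, 0.5)

print(f"time: {base_time:.3f} s -> {better_time:.3f} s")
print(f"power ratio after scaling: {scaled_power / base_power:.3f}")
```

Because voltage enters squared, lowering it together with frequency reduces power faster than it reduces performance, which is why both levers appear among the low-power techniques.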



## What can be done?

- Current processors: IPC ≈ 1
  - With an issue bandwidth of 4/6
  - Memory performance: ~70/80% hit rate
  - Exploit TLP / MLP in addition to ILP
- The Pentium 4 consumed close to 80 W at 3.6 GHz
  - Current Core 2 Duo parts are around 35 W at 2.2 GHz

#### Usefulness

- The idea must be new
  - Do an exhaustive study of the topic
- The idea must improve a real / current situation
  - The detected problem must not be artificial

#### **Examples**

- Problem: SQRT costs ~40 cycles
  - Find an algorithm that takes 10 cycles on average
- Problem: The cache miss rate is very high
  - Find a better prefetcher based on grouping dependent instructions
- Problem: the ROB fills up on L2 misses
  - Commit instructions speculatively and, for dependent instructions, predict the value that will come from L2

## Impact of the idea. Amdahl's Law

$$A = \frac{1}{(1 - F_m) + F_m/A_m}$$
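
To get a feel for the formula, here is a small sketch, assuming $F_m$ is the fraction of execution time affected by the improvement and $A_m$ is the speedup of that fraction, as in the usual statement of Amdahl's Law. The function name and the sample fractions are illustrative only.

```python
def amdahl_speedup(fraction_improved, local_speedup):
    """Overall speedup A = 1 / ((1 - F_m) + F_m / A_m)."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / local_speedup)


# The same 4x local improvement applied to a tiny and to a large
# fraction of the execution time (hypothetical fractions).
for fraction in (0.001, 0.30):
    overall = amdahl_speedup(fraction, 4.0)
    print(f"F_m = {fraction:<5} A_m = 4.0 -> A = {overall:.4f}")
```

A large local speedup on a rarely used operation barely moves the overall result, which is the point of checking the impact of an idea before investing in it.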




# The time factor in the study of microprocessors

- Microprocessor research looks 5-7 years ahead
  - The last 4 years go to manufacturing the chip, but the design is already frozen
  - The manufacturing technology does not exist yet
- This affects the environment
  - What will consumers want 6 years from now?
  - What will applications look like by then?

## Do our examples have an impact?

- SQRT?
  - On average they account for less than 0.01% of the instructions
- Memory misses?
  - ~30% of instructions are loads
  - ~10% of memory accesses are misses
- ROB?
  - L2 latency keeps growing
  - More and more instructions are fetched
  - The ROB does not grow significantly with each generation
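
The figures above can be put on a common scale. The sketch below counts events per 1000 instructions; it assumes, for illustration, that the ~10% miss rate applies to loads and takes 0.01% as an upper bound for SQRT. The helper name is hypothetical.

```python
def events_per_kilo_instruction(fraction_of_instructions, event_rate=1.0):
    """Events per 1000 instructions for an instruction class and a per-use event rate."""
    return 1000 * fraction_of_instructions * event_rate


# SQRT: at most 0.01% of the instructions.
sqrt_pki = events_per_kilo_instruction(0.0001)
# Loads: ~30% of instructions, ~10% of which are assumed to miss.
load_miss_pki = events_per_kilo_instruction(0.30, 0.10)

print(f"SQRT instructions per 1000 instructions: {sqrt_pki:.2f}")
print(f"Missing loads per 1000 instructions:     {load_miss_pki:.2f}")
```

Under these assumptions a missing load occurs hundreds of times more often than a SQRT, so a memory-side improvement has far more room to affect overall execution time.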


#### But first ...

- Detail the proposal
  - Identify the affected blocks
  - Fit the technique into the pipeline
    - Decide when and for how long each step of the technique takes place
  - Detect any possible pathological case
    - Rollbacks
- Establish our baseline microarchitecture
  - Current
  - Fair

#### **Our examples**

- SQRT:
  - Only affects the ALU that contains the divider/SQRT unit
- Memory:
  - Additional table to pair stores/loads
    - IP of the memory operation
    - Address accessed by the memory operations
  - Pathological cases
    - What if, within a store/load pair, a store in between writes to the same address?
  - Timing
    - Updates at commit or at execution

#### Our examples II

- ROB:
  - Value predictor for loads
  - Rollback hardware
    - Save the entire architectural state
  - Timing
    - Predictor updates
    - Predictor access

#### Potential study

- A simple simulator that can give us an indication of the potential
  - The technique must be modeled in detail
  - The other parts of the processor, according to their impact

If the potential is much lower than expected

Go back to square one

#### How to experiment?

| Type                | Development    | Debugging     | Adaptability | Error      | O(t)      |
|---------------------|----------------|---------------|--------------|------------|-----------|
| Chip design         | Extremely high | Very costly   | None         | 0          | Equal     |
| FPGA                | Very high      | Very costly   | Yes          | 0          | Medium    |
| Detailed simulator  | High           | Costly        | Yes          | Low        | Very slow |
| Schematic simulator | Medium/High    | Medium/Costly | Yes          | Medium/Low | Slow      |


#### What to experiment with?

- Test programs (benchmarks)
  - Consistent with the proposal
  - Reflect the problem we want to address
  - Available to everyone
  - Real / synthetic programs
- Recognized suites
  - SPEC.org
  - MediaBench, PhysicBench, etc.

#### What should I look at?

- Fundamentally, whatever we are trying to improve
  - Speed → Execution time
  - Power → Power consumed
- But one must try to explain the reason for the improvement / degradation
  - Handling of outliers
  - Strange numbers
  - Out-of-range values
  - Anomalous behavior

#### Metrics

- Real metrics
  - Execution time (s)
  - Power consumed (W)
  - Failures per year
- Other metrics
  - IPC
  - Cache misses / Mispredictions
  - ED / ED<sup>2</sup>

#### Summarizing everything in one number

- Means
  - Arithmetic
    - Normal distributions
    - Time, miss rates
    - Sensitive to outliers
  - Geometric
    - Non-normal distributions
    - Insensitive to outliers
    - Performance
  - Harmonic
    - Insensitive to outliers
    - Performance, speed
- DO NOT MISUSE THEM !!!
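
The difference between the three means is easy to see on a small data set. The sketch below uses Python's standard `statistics` module on hypothetical per-benchmark speedups, one set with an outlier and one without; the numbers are invented for illustration.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical per-benchmark speedups; the last value of the second set is an outlier.
without_outlier = [1.10, 1.12, 1.08, 1.10]
with_outlier = [1.10, 1.12, 1.08, 5.00]

for label, data in (("no outlier", without_outlier), ("one outlier", with_outlier)):
    print(
        f"{label:12s} arithmetic={mean(data):.3f} "
        f"geometric={geometric_mean(data):.3f} "
        f"harmonic={harmonic_mean(data):.3f}"
    )
```

For positive values the harmonic mean never exceeds the geometric mean, which never exceeds the arithmetic mean, so the single outlier inflates the arithmetic mean the most.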



#### Number engineering

- Outliers
  - ... technique A achieves a 38% performance improvement

- ... technique A (without B) achieves a 17% improvement


#### Number engineering IV

- Imagination
  - Our technique achieves a 15% improvement in the ratio of cache misses per 1000 branch mispredictions

#### Dissemination of results

- Share your ideas with the other researchers around the world
  - Write a paper
    - Present it at a conference
    - Publish it in a journal
  - Write a patent
  - Write a thesis

#### And then ...

- Fame, recognition, work and...

#### Start over

#### Acknowledgments

- Ramon Canal
- Josep-Llorenç Cruz
- Pepe González
- Fernando Latorre
- Javier Lira
- Grigorios Magklis
- Raúl Martínez


### Application Profiling and Performance Evaluation

Ferad Zyulkyarov Intel Labs Barcelona January 29, 2013

#### Outline

- Introduction to profiling
- Platform independent profiling
- Platform dependent profiling
- Profiling parallel programs
- Power profiling




#### Introduction to Profiling

- What is profiling?
  - Recording and analyzing the behavior and the characteristics of a program at runtime
  - Examples:
    - Functions which are called most
    - The type of instructions executed
    - Cache misses
    - Etc.
- Why is profiling important?
  - Profiling is important for optimization
  - Optimization: identify and resolve bottlenecks
    - Identify the parts of the program that cause it to run slowly and improve them
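
As an illustration of the first example (functions which are called most), the sketch below records function calls at runtime with Python's `sys.setprofile` hook. The helper `count_calls` and the toy workload are hypothetical; real profilers collect much more than call counts.

```python
import sys
from collections import Counter


def count_calls(func, *args):
    """Run func and count how many times each Python function is called."""
    counts = Counter()

    def on_event(frame, event, arg):
        # "call" is reported whenever a Python function is entered.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(on_event)
    try:
        func(*args)
    finally:
        sys.setprofile(None)
    return counts


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def workload():
    fib(5)
    fib(3)


counts = count_calls(workload)
for name, calls in counts.most_common():
    print(f"{name}: {calls} calls")
```

The most frequently called function is the natural first candidate when looking for a bottleneck, although call count alone says nothing about the time spent per call.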


#### Platform Independent Profiling

- Platform independent results
  - Obtained by analyzing the program itself without considering the platform
  - Expressed with platform independent metrics
- Application level optimization
  - Optimizations will improve the performance of the program for all platforms






#### Other Platform Independent Metrics

- Same memory reference distance
- Number of instructions
  - Static
  - Dynamic
- Working data set

#### Other Platform Independent Metrics

- Same memory reference distance
  - Useful to understand whether the program is cache friendly (hit/miss rate)

| Cache size | Hits | Misses |
|------------|------|--------|
| 10         | 5    | 7      |
| 5          | 3    | 9      |
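
The relation between this metric and cache behavior can be sketched in a few lines. The code below assumes that "same memory reference distance" corresponds to what is often called reuse distance: the number of distinct addresses touched between two references to the same address. It also assumes a fully associative LRU cache, where an access hits exactly when its distance is smaller than the cache size. The trace is hypothetical and is not the one from the slide.

```python
def reuse_distances(trace):
    """Return, per access, the number of distinct addresses since the last use (None if first use)."""
    stack = []  # most recently used address is kept at the end
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_hits_misses(trace, cache_size):
    """Hits and misses of a fully associative LRU cache, derived from reuse distances."""
    hits = sum(1 for d in reuse_distances(trace) if d is not None and d < cache_size)
    return hits, len(trace) - hits


trace = list("ABCABC")  # hypothetical address trace
for size in (3, 2):
    hits, misses = lru_hits_misses(trace, size)
    print(f"cache size {size}: {hits} hits, {misses} misses")
```

The distances depend only on the program's access pattern, which is what makes the metric platform independent; the cache size is applied afterwards.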




#### Other Platform Independent Metrics

- Number of instructions
  - Static
  - Dynamic

To optimize effectively, it is necessary to find the part of the program where most of the instructions are executed.
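
The difference between the static and the dynamic count can be shown with a toy program. The sketch below uses an invented four-instruction machine (`set`, `dec`, `jnz`, `halt`), which is purely illustrative, and counts how often each instruction executes.

```python
def run(program):
    """Execute a toy program and return how many times each instruction ran."""
    executed = [0] * len(program)
    counter = 0
    pc = 0
    while True:
        op, arg = program[pc]
        executed[pc] += 1
        if op == "set":
            counter = arg
        elif op == "dec":
            counter -= 1
        elif op == "jnz":
            # Jump back to instruction `arg` while the counter is non-zero.
            if counter != 0:
                pc = arg
                continue
        elif op == "halt":
            return executed
        pc += 1


program = [("set", 3), ("dec", None), ("jnz", 1), ("halt", None)]
executed = run(program)

static_count = len(program)    # instructions in the program text
dynamic_count = sum(executed)  # instructions actually executed

print(f"static: {static_count}, dynamic: {dynamic_count}")
print(f"executions per instruction: {executed}")
```

The per-instruction counts show that the two loop instructions account for most of the dynamic count, so they are where an optimization would pay off.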



#### Other Platform Independent Metrics

- Working data set
  - The amount of memory the application operates on
  - Important to know and organize with respect to the available system memory and CPU caches
  - Example: operations A, B and C are performed on every element of the data
  - A more cache-friendly implementation reuses the data that is already in the CPU caches or in memory
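
The effect of reusing data while it is still cached can be sketched with a toy cache model. The sizes below are hypothetical, each element is assumed to occupy one cache line, and the cache is modeled as fully associative LRU; real caches differ, so treat the numbers as an illustration only.

```python
from collections import OrderedDict

def count_misses(accesses, cache_lines):
    """Count misses of a fully associative LRU cache with cache_lines entries."""
    cache = OrderedDict()
    misses = 0
    for addr in accesses:
        if addr in cache:
            cache.move_to_end(addr)
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)
    return misses

N_ELEMENTS = 1000        # hypothetical data size, one element per cache line
CACHE_LINES = 100        # hypothetical cache capacity, smaller than the data
OPERATIONS = ("A", "B", "C")

# Version 1: one full pass over the data per operation.
separate_passes = [i for _op in OPERATIONS for i in range(N_ELEMENTS)]

# Version 2: apply A, B and C to an element before moving to the next one.
fused_pass = [i for i in range(N_ELEMENTS) for _op in OPERATIONS]

misses_separate = count_misses(separate_passes, CACHE_LINES)
misses_fused = count_misses(fused_pass, CACHE_LINES)
print(misses_separate, misses_fused)
```

Both versions perform the same work, but the second one touches each element three times in a row, so only the first touch misses.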



#### **Outline**

- Introduction to profiling
- Platform independent profiling
- Platform dependent profiling
- Profiling parallel programs
- Power profiling

#### Platform Dependent Profiling

- Platform dependent results
  - Obtained by analyzing the underlying system
  - Expressed through platform dependent metrics
- Platform specific optimizations
  - Optimizations will improve the performance of the program for that specific (or similar) platform

#### **CPU Performance Monitoring**

- Execution unit
  - Instructions executed
  - Type of instructions
  - Instructions per cycle
- Cache
  - Hits/misses
- Branch predictor
  - Mispredicts
- TLB
  - Hits/misses
  - Page faults
- Others
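
Raw event counts are usually turned into ratios before they are compared. A minimal sketch, assuming a dictionary of counter values; the key names and the sample numbers are hypothetical and do not come from any real measurement or tool.

```python
def derived_metrics(counters):
    """Turn raw event counts into rates that are easier to compare."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "cache_miss_rate": counters["cache_misses"] / counters["cache_references"],
        "branch_mispredict_rate": counters["branch_mispredicts"] / counters["branches"],
    }

# Hypothetical counter readings for one run of a program.
sample = {
    "instructions": 2_000_000,
    "cycles": 1_000_000,
    "cache_references": 400_000,
    "cache_misses": 20_000,
    "branches": 250_000,
    "branch_mispredicts": 5_000,
}

metrics = derived_metrics(sample)
print(metrics)
```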






#### Performance Analysis Techniques

- Hardware monitors
- Hardware counters
- Event based
- Statistical sampling
- Instrumentation
- Emulation and simulation
- Hybrid
  - Combination of some of the above



#### **Example Profiling Tools**

- VTune
- CodeAnalyst
- Gprof
- Visual Studio Profiler
- PAPI
- Pin
- Valgrind
- Netbeans Profiler
- JProfiler
- And many others

#### Identifying Problematic Program Code

- Code where most of the
  - CPU cycles are burnt
  - Cache misses happen
  - TLB misses happen
  - Branch mispredictions happen
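
As a small illustration of finding where the time goes, the sketch below uses Python's standard `cProfile` and `pstats` modules rather than one of the tools listed earlier. The functions `hot_loop`, `cold_setup` and `workload` are made up for the example.

```python
import cProfile
import io
import pstats

def hot_loop(n):
    return sum(i * i for i in range(n))

def cold_setup():
    return list(range(10))

def workload():
    cold_setup()
    return hot_loop(200_000)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and show the five most expensive entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists `hot_loop` near the top, which marks it as the first candidate for optimization.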

#### Outline

- Introduction to profiling
- Platform independent profiling
- Platform dependent profiling
- Profiling parallel programs
- Power profiling

#### **Profiling Parallel Programs**

What do we want to find when profiling parallel programs?

We want to find when, why, and how much of the program does not execute in parallel.

- Find lock contention
  - which causes a program execution to serialize
- Find critical path
- Find bottlenecks
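
Lock contention can be made visible by timing how long threads wait for a lock. This is a minimal sketch with a hypothetical `TimedLock` wrapper around `threading.Lock`; dedicated profilers collect the same kind of information with far less overhead.

```python
import threading
import time

class TimedLock:
    """A lock that records how long callers wait to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_seconds = 0.0
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        # The lock is held here, so updating the statistics is safe.
        self.wait_seconds += time.perf_counter() - start
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False

def run(n_threads=4, iterations=1000):
    lock = TimedLock()
    counter = [0]

    def worker():
        for _ in range(iterations):
            with lock:                 # serialized section
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], lock

total, lock = run()
print(f"{lock.acquisitions} acquisitions, {lock.wait_seconds:.6f} s spent waiting")
```

A large waiting time relative to the total run time indicates that the program is serializing on that lock.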














#### Outline

- Introduction to profiling
- Platform independent profiling
- Platform dependent profiling
- Profiling parallel programs
- Power profiling

#### **Power Profiling**

- Power consumption is very important in the world of battery powered (mobile) devices
- There exist guidelines for writing power-aware programs which consume less power



#### System Level Power Profiling

- Identify the applications which consume the most power
- Used for manual or automated power management
- Example:
  - PowerTOP



#### Power Optimization Strategies: Application Level

- Use events instead of continuous polling
- Update graphics less often
- Redraw only the window components that change
- Operate on small data sets that fit in the caches
- Avoid or reduce background activity
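
The first guideline can be illustrated by counting wake-ups. The sketch below compares a polling loop with a blocking wait on `threading.Event`; the delay and polling interval are arbitrary values chosen for the example, and the wake-up count is only a rough proxy for power.

```python
import threading
import time

def wait_by_polling(flag, interval=0.001):
    """Wake up repeatedly to check the flag; returns the number of wake-ups."""
    wakeups = 0
    while not flag.is_set():
        time.sleep(interval)
        wakeups += 1
    return wakeups

def wait_by_event(flag):
    """Block until the flag is set; the waiting thread is woken once."""
    flag.wait()
    return 1

def measure(waiter, delay=0.05):
    """Set the flag after `delay` seconds and report the waiter's wake-ups."""
    flag = threading.Event()
    timer = threading.Timer(delay, flag.set)
    timer.start()
    wakeups = waiter(flag)
    timer.join()
    return wakeups

polling_wakeups = measure(wait_by_polling)
event_wakeups = measure(wait_by_event)
print(polling_wakeups, event_wakeups)
```

Every extra wake-up keeps the CPU from staying in a low-power state, which is why the event-driven version is preferred.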

Power Efficiency: Developing Power Aware Apps: http://software.intel.com/en-us/articles/energy-efficient-software-developing-power-aware-apps/

Optimizing Software Applications for Power: http://software.intel.com/en-us/blogs/2010/11/29/optimizing-software-applications-for-power-part-1-of-13

#### **Application Level Power Profiling**

- Identify parts of the program which cause excessive power consumption
  - Functions which excessively use GPS, WiFi, 3G, graphics
  - Functions which prevent the device from entering sleep mode
  - Functions which use power-expensive instructions
- Compare the power consumption of alternative implementations
- The optimizations target rewriting the program code in a more power-efficient manner

#### Questions?

Contact: feradz@gmail.com









#### Parallel computing promises

- Increase speed and concurrency (big datacenters)
- Solve larger problems (human genome deciphering)
- Process huge amounts of data
- Solve problems in real time and in due time
- More economical (Google cluster computing)
- Power efficiency (embedded computing)

Ayose Falcón Multi-core Processor Architectures
Intel Barcelona Research Center Seminaris d'Empresa 2013



#### **Agenda**

- Motivation
- Parallel architectures
- Cache coherency and consistency
- Interconnecting the cores
- Multithreading
- Concluding remarks


A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast.





















### Bell's taxonomy of MIMD computers

- Multicomputers
  - System consists of multiple computers, called nodes
  - Nodes are interconnected by a message-passing network
  - Each node has its own processor, memory, NIC, and I/O devices
- Multiprocessors
  - Further classified based on how memory is accessed:
    - Uniform Memory Access (UMA)
    - Non-Uniform Memory Access (NUMA)
    - Cache-Only Memory Access (COMA)
    - Cache-Coherent Non-Uniform Memory Access (cc-NUMA)

### Multi-core processor is a special kind of multiprocessor

- All processors are on the same chip
- Multi-core processors are MIMD:
  - Different cores execute different threads (Multiple Instructions)
  - Operate on different parts of memory (Multiple Data)
- Multi-core is a **shared memory multiprocessor**:
  - All cores share the same address space

#### Why multi-core processors?

- For individuals:
  - Solve larger problems → divide & conquer
  - Concurrency
  - Power efficiency
- For companies:
  - Time-to-market → simply replicate cores
  - Easier to extract performance → but programmers have to do it
  - Better power and thermal management → die layout

#### Multi-core programming

- Programmers must use threads or processes
  - Threads are relevant for business/desktop apps
  - Multiple processes with complex systems or SPMD based parallel applications
- Spread the workload across multiple cores
- OS maps threads/processes to cores
  - Transparent to the programmer
- Write parallel algorithms
  - True for scientific/engineering applications
  - Programmer needs to define the mapping to cores
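
Spreading a workload over several workers might look like the sketch below. It uses `concurrent.futures.ThreadPoolExecutor` from the Python standard library; `chunked` and `parallel_sum_of_squares` are illustrative names. Note that in CPython threads share one interpreter lock, so this shows the structure (split, map to workers, combine) rather than a real speedup; the OS still decides which core runs each thread.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n_chunks):
    """Split data into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def sum_of_squares(chunk):
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, n_workers=None):
    n_workers = n_workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunked(data, n_workers))
        return sum(partial_sums)     # combine the per-worker results

data = list(range(1000))
print(parallel_sum_of_squares(data, n_workers=4))
```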




#### **Agenda**

- Motivation
- Parallel architectures
- Cache coherency and consistency
- Interconnecting the cores
- Multithreading
- Concluding remarks






#### Caches for uniprocessors and CMPs

- For single core systems caches are nice:
  - Reduce average access time
  - Bigger → performance
  - Generally power efficient
- For multi-core still nice, but shared memory requires:
  - Cache Coherency
  - Memory Consistency
































#### Speculative multithreading

- Speculatively parallelize a serial application
  - Use speculation to overcome ambiguous dependences
  - Hardware support to recover from mis-speculations
- Different implementations
  - Thread Level Speculation (extract parallelism)
  - Helper threads (prefetch to memory)
  - Multi-path (follow all paths on a branch)
  - Etc.



#### **Summary—Conclusions**

- Parallel computing is here to stay
- Today we discussed:
  - Why we need parallelism
  - What the types of parallel architectures are
  - How we keep memory and caches coherent and consistent
  - Multi-core interconnects
  - Multithreading
- Tip of the iceberg: loads of exciting research happening!!







#### **Parallel Programming**

Josep M. Codina

Intel Barcelona Research Center, Intel Labs

Aula Empresa, Facultat d'Informàtica de Barcelona, 2013

© Intel Corporation, 2013

#### **Why Parallel Programming?**

- Applications are naturally parallel
- Multi-cores are out there
- We need to think in parallel!!!!

Figure: a sequential application versus a parallel application; 48 IA cores in the Intel "Single-chip Cloud Computer".

#### **Parallel Programming?**

I want to divide the effort of eating pizza to eat all slices as fast as possible!!





But, I wonder... How do we split this hard task? ☺

#### **Agenda**

- Motivation
- Designing Parallel Applications
- Parallel Programming Models
- Automatic Parallelization
- Parallel Programming for Games
- Concluding Remarks

#### **Design of Parallel Applications**

- Understand the Problem
  - Identify hotspots
    - Profile application
  - Identify bottlenecks to the parallelization
    - Communication
    - Synchronization
  - Consider alternative versions of the application
- Split Application in Parts
  - Maximum parallelism
  - Minimum imbalance
  - Minimum waiting time



#### **Who Creates Parallel Applications?**

- User / Programmer
  - Parallel languages and libraries
- Compiler
  - Automatic Parallelization
- Parallel Architectures
  - Multi-core in a single chip
  - Many multi-core chips

Figure: SW ecosystem (user, applications / libraries, compilers).



#### When to Parallelize an Application?

Always!!!!

- Stop thinking in terms of sequential applications
- The multicore era is here and we have to leverage the computing capabilities that we have out there


#### **Parallel Programming Models**

- Data Parallel
- Message Passing
- Shared Memory
- Distributed Shared Memory Model
- NOTE: These models are orthogonal to the actual hardware!!!











#### **Traditional Auto Parallelization**

- Automatic decomposition of applications into threads
- No need for inserting directives or pragmas
- Compiler identifies suitable parts of the application for parallelization
  - Typically simple loops
  - Considering simple memory disambiguation schemes
- This approach is typically limited to simple applications
  - Dependences limit its applicability to large scale applications
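
Why dependences limit automatic parallelization can be seen by running loop iterations in a different order. In the sketch below (illustrative function names, not a compiler algorithm), the first loop has independent iterations and gives the same result in any order, so its iterations could be distributed over threads. The second loop reads the result of the previous iteration, so reordering changes the answer.

```python
def scale_any_order(values, factor, order):
    """Each iteration writes out[i] from values[i] only: no dependence."""
    out = [0] * len(values)
    for i in order:
        out[i] = values[i] * factor
    return out

def prefix_sum_any_order(values, order):
    """Iteration i reads out[i - 1]: a loop-carried dependence."""
    out = list(values)
    for i in order:
        if i > 0:
            out[i] = out[i - 1] + values[i]
    return out

values = [1, 2, 3, 4]
forward = range(len(values))
backward = range(len(values) - 1, -1, -1)

print(scale_any_order(values, 2, forward), scale_any_order(values, 2, backward))
print(prefix_sum_any_order(values, forward), prefix_sum_any_order(values, backward))
```

A compiler has to prove the absence of such dependences before splitting a loop, which is easy for simple array loops and hard once pointers or indirect accesses are involved.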



















#### **Summary - Conclusions**

- Parallel applications are required to leverage parallel architectures
- Today we discussed:
  - Why we need parallel applications
  - Considerations when designing parallel applications
  - Parallel programming models
  - Auto parallelization and speculative multithreading
  - The importance of threading in gaming
- We need to think in parallel to create parallel applications!!!

#### Designing Tomorrow's Microprocessors

Tanausu Ramírez and Enric Herrero

Intel Barcelona Research Center

Aula Empresa, Facultat d'Informàtica de Barcelona, January 2013

© Intel Corporation, 2013

#### Course Agenda

- Overview of Today's Microprocessors and Future Trends
- Microarchitecture of Current Microprocessors
- Design Cycle for Microprocessors
- Methodology for Research in Microprocessors
- Profiling and Performance Evaluation
- Multi-core Processor Architectures
- Parallel Programming
- Reliability
- Hardware/Software Co-designed Microprocessors


#### **Acknowledgment**

- Raimat/TRAMS Project
  - Xavi Vera
  - Nicholas Axelos, Javier Carretero, Daniel Sánchez, Matteo Monchiero
- Phoenix Project
  - Jaume Abella, Pedro Chaparro, Osman Unsal, Oguz Ergin
- Intel
  - J. Tschanz, S. Mitra, S. Iacobovici, K. Bowman, C. Wilkerson, and many others!
































### Degradation

| Mechanism                                    | Affects           | Weakest | Worst input        |
|----------------------------------------------|-------------------|---------|--------------------|
| Electromigration                             | Connections       | Longest | "1"                |
| Stress migration                             | Connections       | Longest | Any                |
| Time-dependent dielectric breakdown (TDDB)   | Gate oxide        | Widest  | "0" PMOS, "1" NMOS |
| Negative Bias Temperature Instability (NBTI) | Gate oxide (PMOS) | Widest  | "0"                |
| Large thermal cycling                        | Package           |         |                    |
| Short thermal cycling                        | Unknown           |         |                    |

### **Degradation: Current practice**

- Speed guardbands
  - Very expensive
- Summary
  - If we add temperature variations we open the door to... OVERCLOCKING!







### Soft Errors (2)

- Soft errors can cause problems in different ways
  - Change the data value in the caches and memory
  - Corrupt the execution of an instruction due to a bit flip in the pipeline registers
  - Change the character of an SRAM-based FPGA circuit (firm error)
  - Datapath logic SET (Single Event Transient) caught by registers/memory

### Soft Errors (3): Evidence

- Error logs of large servers
  - Normand, IEEE T. Nuclear Science, Dec. 1996.
- Sun Microsystems, 2000 (from Baumann, IRPS 2002)
  - Cosmic ray strikes on L2 cache
    - Mysterious crashes of Sun flagship servers
  - Companies affected
    - Baby Bell (Atlanta), America Online, eBay, and dozens of others
    - Verisign moved to IBM Unix servers (for the most part)

An altitude of 30,000 feet and a route crossing the North Pole both increase the neutron flux. A system with four 1M 130 nm SRAM-based FPGAs would be subject to 0.074 upsets per day, i.e. about 324 hours between upsets. Assume one such system on board each commercial aircraft, 4,000 civilian flights per day, and a 3-hour average flight time. Then nearly 37 aircraft per day will experience a neutron-induced SRAM-based FPGA configuration failure during their flight.
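
The arithmetic behind these figures can be checked directly. The sketch assumes a constant upset rate and treats the daily flight hours as one pool of exposure time; the variable names are chosen for this example.

```python
upsets_per_day = 0.074                       # per system, from the figures above
hours_between_upsets = 24 / upsets_per_day   # mean time between upsets

flights_per_day = 4000
average_flight_hours = 3
flight_hours_per_day = flights_per_day * average_flight_hours

# Expected number of flights per day that see an upset.
expected_upsets_per_day = flight_hours_per_day / hours_between_upsets

print(round(hours_between_upsets), round(expected_upsets_per_day))
```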

### Summary: VLSI Trends & Reliability

| Reliability Problems                      | VLSI Trends                                                                         |
|-------------------------------------------|-------------------------------------------------------------------------------------|
| Power supply, signal integrity problems   | High speed, low voltage, large current                                              |
| Process variation (die-to-die, intra-die) | Many transistors, nanofabrication, high speed                                       |
| Manufacturing defects                     | Many transistors, new material, burn-in issues                                      |
| Degradation over time                     | Large current, increased electric field, new material, thin oxide, stress void, High-K |
| SEUs due to radiation                     | Many transistors, low voltage, high speed                                           |



### **General Solutions for Faults**

Design systems that minimize the presence of faults

### Fault avoidance / removal

- Rigorous design and verification: Design a system with minimal faults
- Comprehensive testing: Validate/test a system to remove the presence of faults

### Fault tolerance

- Living with and dealing with faults!
- Built-in error detection
  - e.g. redundancy

### **High Availability Building Blocks**

Figure: building blocks of high availability, grouped under fault tolerance and fault avoidance. Labels include failure masking, data integrity, detect and isolate, recover, repair, spare/degrade, design verification, validation and testing, system integration, system design, reliability, and technology.





### **Testing Challenges**

- Technology issues/integration consequences
  - Testing time (vs. time to market)
  - Yield vs Debug
  - Low Operating voltages (Vcc): decrease margins, more noise
- "New" on-die complex structures
  - Interaction of multiple cores
  - Interconnection network, validate the coherence protocol
  - DVS/DVFS


### **Design For Testing (DFT)**

- DFT consists of designing with "testability features" to speed up testing and reduce costs
  - Increasing the observability and controllability of circuits
  - Putting hardware in place to allow testing
    - e.g. scan chain, test pins, Built-In Self Test (BIST), etc.
  - Space & time redundancy
    - Self-checking blocks: some properties of the outputs are checked for correctness (e.g. residue, parity, ECC)
    - Full hardware replication: operations are repeated in replicated hardware
    - Outputs are latched twice at different times to detect timing errors


### **Testing Cost**

- Circuit issues are becoming a much larger fraction of overall post-silicon bugs
  - Functional issues are losing weight
- Test costs rise sharply from generation to generation
  - Regular pattern quantity goes down every generation to keep costs in check
    - More corner cases
    - More complexity to test
- Circuit bugs take much longer to root-cause
  - Long latency between failure and syndrome

### Fault-tolerant design

- Providing fault-tolerant design for every component is normally not an option
  - How critical is the component?
  - How likely is the component to fail?
  - How expensive is it to make the component fault-tolerant?

fault-tol-er-ant \fölt-'täl(- a)-ron

adj : able to function in the




### **Error Detection**

- The most important factor, because a processor cannot tolerate a problem it is not aware of
- The key to error detection is redundancy
  - Without redundancy a processor cannot detect any errors

| Type of Redundancy | Basic Idea                   | Example                                                     |
|--------------------|------------------------------|-------------------------------------------------------------|
| Physical/spatial   | Add redundant hardware       | Replicate modules and compare the results of the replicas   |
| Temporal           | Perform redundant operations | Run a program twice on the same HW and compare the results  |
| Information        | Add redundant bits           | Add a parity bit to a word in memory                        |

### Physical redundancy

- No error on outputs → failure masking
  - Triple Modular Redundancy (TMR)
  - N-Modular Redundancy (NMR)

Figure: Module 1 to Module 3 feed a voter that produces the voted outputs.
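
The voting step can be sketched in software as follows. This is only an illustration of the idea: `majority_vote` and `tmr` are hypothetical helper names, the module is an ordinary Python function, and the fault is injected by flipping one bit of one replica's output.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else raise."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority: the fault cannot be masked")
    return value

def tmr(module, x, faulty_replica=None):
    """Run three replicas of module; optionally corrupt one replica's output."""
    outputs = [module(x) for _ in range(3)]
    if faulty_replica is not None:
        outputs[faulty_replica] ^= 1     # inject a single bit flip
    return majority_vote(outputs)

def increment(x):
    return x + 1

print(tmr(increment, 41), tmr(increment, 41, faulty_replica=1))
```

With one faulty replica the other two still agree, so the error never reaches the output. The same voter works for N modules as long as a strict majority is fault free.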

### **Temporal Replication**

- RMT: redundant threads execute at the same time to check for errors
  - Hardware implementation on SMT
- Multi-core based RMT using fingerprints
  - Compare signatures of instruction sequences only
  - Fewer compares are needed → reduces thread communication
- Optimized versions: selective/partial replication
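
The fingerprint idea can be sketched as follows: each redundant thread hashes its stream of results, and only the short signatures are compared. SHA-256 from `hashlib` is used here purely as a stand-in for whatever signature function the hardware would use; `run_thread` and its injected bit flip are hypothetical.

```python
import hashlib

def fingerprint(results):
    """Hash a sequence of instruction results into one short signature."""
    digest = hashlib.sha256()
    for value in results:
        digest.update(value.to_bytes(8, "little", signed=True))
    return digest.hexdigest()

def run_thread(values, flip_at=None):
    """Accumulate values; optionally corrupt the accumulator at one step."""
    results = []
    acc = 0
    for i, v in enumerate(values):
        acc += v
        if i == flip_at:
            acc ^= 1 << 3                # hypothetical soft error
        results.append(acc)
    return results

leading = run_thread([1, 2, 3, 4])
trailing = run_thread([1, 2, 3, 4])
faulty = run_thread([1, 2, 3, 4], flip_at=2)

print(fingerprint(leading) == fingerprint(trailing))
print(fingerprint(leading) == fingerprint(faulty))
```

One comparison per interval replaces one comparison per instruction, at the cost of detecting the error only at the end of the interval.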


### **Information Redundancy**

- Coding, such as error detecting codes (EDC)
  - Add redundant bits to a datum to detect when it has been affected by an error
- Parity
  - Simple and cheap solution (dataword + parity bit)
- ECC
  - More sophisticated and can also correct errors (SECDED)
  - Used in large caches and memory
- Extended protection against multiple upsets
  - Interleaving (spatial), scrubbing (temporal)
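
A parity check is small enough to show in full. The sketch below uses even parity over an arbitrary example word; it also shows the limitation that motivates stronger codes: a double bit flip goes unnoticed.

```python
def parity_bit(word):
    """Even parity: the bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2

def check(word, stored_parity):
    """True if the word still matches the parity stored with it."""
    return parity_bit(word) == stored_parity

word = 0b10110010                 # arbitrary example value
stored = parity_bit(word)

print(check(word, stored))             # unchanged word
print(check(word ^ 0b100, stored))     # single bit flip
print(check(word ^ 0b110, stored))     # double bit flip
```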

### Coding Like Logic – FUs - ALUs

- Residue/arithmetic codes
  - Use residues, with coverage close to that achieved by FU replication at a fraction of the area and power
- Mod D residue of a number N: the remainder of N divided by D
  - For arithmetic ops: $X \text{ op } Y = Z \rightarrow (X \text{ op } Y) \bmod D = ((X \bmod D) \text{ op } (Y \bmod D)) \bmod D$
- Example:
  - X = 46238; 46238 mod 3 = 2
  - Y = 56788; 56788 mod 3 = 1
  - X + Y = 103026; 103026 mod 3 = 0 ((2 + 1) mod 3 is 0!)
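
The residue check for an adder can be sketched as below, using D = 3 and the operands from the example. `checked_add` is an illustrative name, and the fault is modeled as a single bit flip in the adder output.

```python
D = 3

def checked_add(x, y, faulty=False):
    """Add x and y and verify the result against the mod-D residues."""
    z = x + y
    if faulty:
        z ^= 1 << 4                      # hypothetical bit flip in the adder output
    residue_ok = z % D == ((x % D) + (y % D)) % D
    return z, residue_ok

print(checked_add(46238, 56788))
print(checked_add(46238, 56788, faulty=True))
```

A single bit flip changes the result by a power of two, which is never a multiple of 3, so a mod 3 residue detects every single-bit error in the sum.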




### **Software Assisted Error Detection**

- Software-implemented hardware fault-tolerance
  - It introduces the redundant operations and checks/assertions
  - Hardened values and subroutines
  - Programmer may choose to check only critical code
- EDDI, SWIFT approaches

(a) Original code

    ld r12=[GLOBAL]
    add r11=r12,r13
    st m[r11]=r12

(b) Duplicated-instruction code

       ld r12=[GLOBAL]
    1: ld r22=[GLOBAL+offset]
       add r11=r12,r13
    2: add r21=r22,r23
    3: cmp.neq.unc p1,p0=r11,r21
    4: cmp.neq.or p1,p0=r12,r22
    5: (p1) br faultDetected
       st m[r11]=r12
    6: st m[r21+offset]=r22
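
The same pattern can be written in a high-level language. This is a loose sketch of the duplicated-instruction listing, not the EDDI or SWIFT implementation: memory and its shadow copy are modeled as two Python dictionaries instead of an address offset, and `protected_store` and `FaultDetected` are names invented for the example.

```python
class FaultDetected(Exception):
    """Raised when the original and shadow computations disagree."""

def protected_store(memory, shadow, global_addr, r13, inject_fault=False):
    r23 = r13                        # shadow copy of the second operand
    r12 = memory[global_addr]        # original load
    r22 = shadow[global_addr]        # duplicated load from the shadow copy
    r11 = r12 + r13                  # original add
    r21 = r22 + r23                  # duplicated add
    if inject_fault:
        r11 ^= 1                     # hypothetical soft error in the original add
    if r11 != r21 or r12 != r22:     # compare before the store becomes visible
        raise FaultDetected("original and shadow values differ")
    memory[r11] = r12                # original store
    shadow[r21] = r22                # duplicated store

memory, shadow = {0: 5}, {0: 5}
protected_store(memory, shadow, global_addr=0, r13=10)
print(memory, shadow)
```

The comparison is placed just before the store because a store is the point where a corrupted value would escape into memory.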







### Multi-layer Approach

- A lot has been said about solutions that combine circuit, µarch, and SW approaches
  - Few to none seen
  - Each layer tries to do its best... probably paying a high price
- We need to design solutions bottom-up considering all different layers
  - Clearly identify error detection and recovery requirements for each level
  - Each layer contributes with its own detection and recovery capabilities
  - Co-design CKT/HW/SW solutions!


### Adapt the System

- Reconfiguration
  - Transistor level (manufacturing time)
    - Spare transistors
  - Circuit level
    - Spare cache lines
  - Microarchitecture
    - Memory, processor, routers, networks
  - Software
    - Components that can be virtualized (e.g., bypass levels in in-order cores)
  - OS
    - Task allocation (heterogeneity)











### HW/SW co-design targets the major challenges

- Power Consumption
  - Through using simpler Hardware
  - Through using less Hardware
  - By optimizing code once / using many times
- Design Complexity
  - Co-designed processors have simpler HW
  - Simpler HW is much easier to validate
- Performance
  - Synergy between the HW and SW
  - Exploit dynamic information



### **Outline**

- Overview of the technology
  - What are the HW/SW co-designed processors
- Key Ideas and Advantages
  - From the HW-only CPU to the co-design paradigm
- Research Projects / Market examples
  - Academic Research
  - Products
- Potential and Open Issues
  - A glance to the huge improvement potential



### The key idea

- In a traditional design
  - There must be different HW for different optimizations
    - This HW needs to be designed and validated
    - The cost to optimize the same code is paid multiple times
  - The hardware is very aggressive
    - Consumes a lot of power
  - The hardware tries to optimize always
    - Optimizations do not always pay off (power / performance)
  - HW exploits limited information (limited #instructions in the queue)
- Non-efficient resource utilization
  - Area increase
  - Validation cost
  - Power consumption increase
  - Pay the optimization cost multiple times







### The key idea

- Cost of running the SW:
  - Running the SW comes with some cost
    - "Observing" the dynamic code
    - Optimizing the code
    - Store optimized regions
  - The SW cost is amortized
    - The optimized segments are executed many times
    - Using staged optimization helps significantly:
      - Frequent regions : Few optimizations / low SW cost
      - Hot regions : Further optimization / medium SW cost
      - Critical regions : Maximum optimizations / high SW cost
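
Staged optimization can be sketched as a counter per code region plus a few thresholds. The threshold values, the `unoptimized` starting tier, and the `CodeCache` class are assumptions made for this illustration; a real co-designed system chooses its own promotion policy.

```python
TIERS = [                # (minimum executions, tier name), hypothetical thresholds
    (1000, "critical"),  # maximum optimizations / high SW cost
    (100, "hot"),        # further optimization / medium SW cost
    (10, "frequent"),    # few optimizations / low SW cost
]

def tier_for(executions):
    for minimum, name in TIERS:
        if executions >= minimum:
            return name
    return "unoptimized"

class CodeCache:
    """Tracks how often each region runs and when the optimizer is invoked."""

    def __init__(self):
        self.executions = {}
        self.tier = {}
        self.optimizer_runs = 0

    def execute(self, region):
        count = self.executions.get(region, 0) + 1
        self.executions[region] = count
        wanted = tier_for(count)
        if wanted != self.tier.get(region, "unoptimized"):
            self.tier[region] = wanted       # re-optimize and store the result
            self.optimizer_runs += 1
        return self.tier.get(region, "unoptimized")

cache = CodeCache()
for _ in range(1000):
    last_tier = cache.execute("inner_loop")
startup_tier = cache.execute("startup_code")

print(last_tier, cache.optimizer_runs, startup_tier)
```

The loop region triggers the optimizer only three times over a thousand executions, which is the amortization argument: the optimization cost is paid rarely and the optimized code is reused many times.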




### The key idea

- The benefits
  - Similar performance (often higher performance)
  - Less power (extended battery life)
  - Smaller Area (lower cost)
  - Easier to design and validate (lower cost / shorter time-to-market)
- How?
  - Use SW instead of HW for optimizing
  - SW is usually easier to debug than HW
  - Keep the optimized code for future use
  - Efficient resource utilization (optimize once, use many times)


### The key idea

- Comparison between HW-only and co-designed CPUs
  - "Amount" of hardware
    - Traditional approach : More hardware
    - Co-designed processors : Less hardware
  - Complexity of hardware
    - Traditional approach : Very complex (e.g. support for OoO)
    - Co-designed processors : Much simpler
  - Power Consumption
    - Traditional approach : High (optimization, HW complexity)
    - Co-designed processors : Significantly lower
  - Performance
    - Traditional approach : High
    - Co-designed processors : In the same order



### **Outline**

- Overview of the technology
  - What are the HW/SW co-designed processors
- Key Ideas and Advantages
  - From the HW-only CPU to the co-design paradigm
- Research Projects / Market examples
  - Academic Research
  - Products
- Potential and Open Issues
  - A glance to the huge improvement potential



### Research projects / Market examples

- Many researchers identify the value of the approach
- A lot of work in academia
  - Parrot
    - ISCA 2004: "Power Awareness through Selective Dynamically Optimized Traces"
    - Targets both performance and power
    - The processor has 2 pipelines
      - Simple, lower power for "cold" regions
      - Aggressive, higher power for "hot" regions
    - Operation
      - Instructions initially go through the "cold pipeline"
      - Hot regions are identified and optimized
      - Optimized regions are stored and reused



### Research projects / Market examples

- Market Examples
  - Transmeta™ Corporation (1995-2009)
  - Transmeta™ built the first co-designed processors
    - Crusoe™ 2000
    - Efficeon™ 2004
  - VLIW architectures
    - Guest ISA: x86
    - Elaborate software: Code Morphing Software
    - Simple Hardware which provides special support for the optimizer
  - Main target
    - 100% compatibility
    - Similar performance
    - Lower power




### Research projects / Market examples

- Many researchers identify the value of the approach
- A lot of work in academia
  - rePLay
    - IEEE Transactions on Computers, 2001: "rePLay: A Hardware Framework for Dynamic Optimization"
    - Mainly targets higher performance
    - It is a HW-only solution but follows the same principles
    - Is equipped with an optimization engine
      - Identify hot regions / optimize / store
      - Includes HW mechanisms to enable more optimizations
      - Uses aggressive HW



### **Agenda**

- Overview of the technology
  - What are the HW/SW co-designed processors
- Key Ideas and Advantages
  - From the HW-only CPU to the co-design paradigm
- Research Projects / Market examples
  - Academic Research
  - Products
- Potential and Open Issues
  - A glance to the huge improvement potential




### **Open Issues**

- Conventional processors have been evolving for many years
- Co-designed processors are a new paradigm
  - A lot of work is needed for full exploitation
  - This is an amazing topic! Compilers + Computer Architecture + ...
- Some important questions
  - Exploit the dynamic information
    - Mechanisms to exploit dynamic information
    - Speculation techniques for performance / power
  - Leverage for multiprocessing
    - Software for efficient execution of parallel code
    - Fault tolerance




### Market examples / Research projects

- Reading:
  - "The Architecture of Virtual Machines", IEEE Computer 2005, James E. Smith, Ravi Nair
    - Gives an excellent introduction to the whole technology
  - "Power Awareness through Selective Dynamically Optimized Traces", ISCA 2004, Rosner et al.
    - Easy to understand and follow overview of where the benefits come from
  - "The Technology Behind Crusoe™ Processors", White Paper 2000
    - A lot of information on the underlying implementation issues

### **Open Issues**

- Conventional processors have been evolving for many years
- Co-designed processors are a new paradigm
  - A lot of work is needed for full exploitation
  - This is an amazing topic! Compilers + Computer Architecture + ...
- Some important questions
  - Segment Specific designs
    - Can we have CPUs optimized for different market segments?
    - One HW and different SW (maximize for performance / power)
  - Are traditional mechanisms good enough?
    - e.g. pre-fetchers, branch predictors
    - How to take advantage of the simpler circuitry



### Conclusions

- Radical improvements usually come from radical solutions
- Do not use HW for everything
  - Consumes power, more complex design, higher cost...
  - Non-efficient resource utilization
- Co-designed processors: CPU=SW+HW
  - Efficient resource utilization
  - Less power, less complexity, lower cost, similar (higher) performance
  - Huge room for technology innovation
  - A huge interest in the research community



