

# Understanding Performance Gains of Accelerator-Rich Architectures

#### **Zhenman Fang**

**Assistant Professor** 

Computer Engineering, SFU

Email: zhenman@sfu.ca

http://www.sfu.ca/~zhenman

### The Power Wall and Customized Accelerators

#### The famous power wall!

#### **Customized accelerators!**

#### ASIC



Source: Shekhar Borkar, Intel

e.g., Google TPU v3

#### GPU



e.g., Nvidia Tesla GPUs

#### FPGA



e.g., Xilinx Alveo FPGAs

### Trend of Accelerator-Rich Architectures (ARA)



### **PARADE: Platform for ARA Design & Exploration**



Paper at [ICCAD'15], Tutorials at [ISCA'15 & MICRO'16]

PARADE open source link: http://vast.cs.ucla.edu/software/parade-ara-simulator

### **Example Acceleration Results.. and Insights?**



### **Gains from Both Computation and Memory**

#### **CPU performance [optimized]**

■ CPU-Computation ■ CPU-Memory

#### Accelerator performance

■ Acc-Computation ■ Acc-Memory



### **Gains from Computation Customization**



$$1/\sqrt{\sum_{i=0}^{5} (x_c - x_i)^2}$$

**#2 customized accelerator pipeline** 

6.1x speedup



a) fine-grained parallelism: more flexible than SIMD

b) customized pipeline: no instruction overhead

| CPU execution |
|---------------|
| load Xc       |
| load Xi       |
| sub Xc – Xi   |
| store result  |

↓

Acc execution

c) coarse-grained parallelism: by duplicating this pipeline

### **Memory Customization**



### **Gains from Memory Customization**

#### #3 Memory access reduction is not the key!

#### #4 Memory-level parallelism improvement is the key!
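A first-order model may help show why insight #4 can outweigh insight #3. The sketch below assumes a purely latency-bound memory system in which memory time is (accesses × latency) / (requests in flight). The function name and every number in the usage note are hypothetical and are not taken from the PARADE measurements.

```c
#include <assert.h>

/* Latency-bound estimate of memory time, in cycles.
 *   accesses: number of off-chip requests
 *   latency:  cycles per request
 *   mlp:      average number of requests in flight
 *             (memory-level parallelism) */
double memory_cycles(double accesses, double latency, double mlp) {
    assert(accesses >= 0.0 && latency > 0.0 && mlp >= 1.0);
    return accesses * latency / mlp;
}
```

With a hypothetical baseline of 1e6 accesses, 200-cycle latency, and 2 requests in flight, the model gives 1e8 cycles. Halving the accesses yields 5e7 cycles (2x), while raising the requests in flight from 2 to 16 yields 1.25e7 cycles (8x). Under this model, access reduction is bounded by how much reuse the kernel has, whereas customized buffers and DMA can raise the number of outstanding requests by a larger factor.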



### ARA (Multi-PE) vs. GPU



### **Programmable Accelerators: FPGA vs. GPU**

Although their performance and energy advantages are clear, ASICs have high design cost and lack flexibility

#### Let's look at more programmable accelerators



### Applying the Insights to FPGA Accelerators



For a fair comparison, we port the widely recognized GPU benchmark suite Rodinia to FPGA using HLS C, applying the prior insights during porting

### **Preliminary GPU-FPGA Comparison [FCCM 2018]**



- **Performance**: out of 15 kernels, 3 FPGA kernels win and 3 are comparable
- **Performance/watt**: 6 FPGA kernels win; 4 kernels are > 2x worse than GPU
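More kernels can win on performance/watt than on raw performance because the second metric also divides by power. The helper below is a minimal sketch of that arithmetic, taking performance as 1 / runtime; the function name and the numbers in the usage note are hypothetical, not results from the FCCM 2018 study.

```c
#include <assert.h>

/* Performance per watt, with performance taken as 1 / runtime.
 * This equals 1 / energy, where energy = runtime * power. */
double perf_per_watt(double runtime_s, double power_w) {
    assert(runtime_s > 0.0 && power_w > 0.0);
    return 1.0 / (runtime_s * power_w);
}
```

For example, a hypothetical FPGA kernel that takes 2 s at 50 W scores 0.01, while a hypothetical GPU kernel that takes 1 s at 250 W scores 0.004: the FPGA loses on runtime yet wins on performance/watt.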

### **Conclusion and Future Directions**

The power wall has led to the trend of heterogeneous accelerator-rich architectures (ARAs)

#### The performance gains of ARAs come from

- Computation customization: 1) customized accelerator pipeline, and
  2) coarse-grained parallelism
- Memory customization (often more important): 1) memory access reduction and 2) improved memory-level parallelism (often the key)

#### **Future directions**

- Better understand when apps run better on FPGAs and when on GPUs
- Near data acceleration architectures and systems, with corresponding programming, compiler, and runtime support



#### More info at http://www.sfu.ca/~zhenman

#### **Past Sponsors**





#### IDRE Postdoc Fellowship



# **Backup Slides**

### **HLS-based Automatic Accelerator/App Generation**



### Customize Your Own Accelerator (e.g., Denoise)

Denoise core computation:





ABB: Accelerator Building Block

Auto-generated application using accelerator chaining data flow
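The chaining idea can be sketched in plain C: each ABB is modeled as a function, and the output of one ABB feeds the next directly instead of being written back to memory between stages. The two ABBs below (`abb_square`, `abb_add_one`) and the `run_chain` helper are hypothetical stand-ins, not the building blocks the tool actually generates.

```c
#include <assert.h>
#include <stddef.h>

/* An ABB is modeled as a function from one value to one value. */
typedef float (*abb_fn)(float);

/* Two hypothetical ABBs used only for illustration. */
float abb_square(float x) { return x * x; }
float abb_add_one(float x) { return x + 1.0f; }

/* Stream n inputs through a chain of n_abb building blocks. */
void run_chain(const abb_fn *chain, size_t n_abb,
               const float *in, float *out, size_t n) {
    assert(chain && in && out);
    for (size_t i = 0; i < n; i++) {
        float v = in[i];
        for (size_t s = 0; s < n_abb; s++)
            v = chain[s](v); /* output of ABB s feeds ABB s+1 directly */
        out[i] = v;
    }
}
```

Only the first ABB reads from `in` and only the last writes to `out`, which is the property that chaining in hardware is meant to preserve.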

### (Automated) Application Execution on ARA



| Extended ISA instruction | Operands      |
|--------------------------|---------------|
| acc-req                  | type          |
| acc-rsrv                 | id, time      |
| acc-cmd                  | id, cmd, addr |
| acc-free                 | id            |

1. Request available accelerators (acc-req)
2. Respond with the available ones and their waiting time
3. Request reservation (acc-rsv) and wait
4. Reserve the accelerator and send it the core ID
5. The core shares a task description and starts the accelerator (acc-cmd)
6. Read the task and start work
7. Work done; notify the GAM (global accelerator manager)
8. Free accelerators (acc-free)

Users don't have to worry about these steps: we provide a dataflow language and a tool that automatically generate the library
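To make the request/reserve/command/free sequence concrete, here is a minimal software mock of the accelerator manager as a state machine. It is a sketch under assumptions, not the PARADE implementation: the type and function names, the fixed pool size `NUM_ACC`, and the state names are hypothetical, and the waiting-time and `cmd` operands from the ISA table are omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ACC 2 /* hypothetical pool size */

typedef enum { ACC_IDLE, ACC_RESERVED, ACC_RUNNING, ACC_DONE } acc_state;

typedef struct {
    int type;            /* accelerator type */
    acc_state state;
    int owner;           /* core ID, or -1 when free */
    uintptr_t task_addr; /* address of the task description */
} accelerator;

typedef struct { accelerator acc[NUM_ACC]; } gam;

void gam_init(gam *g, const int types[NUM_ACC]) {
    for (int i = 0; i < NUM_ACC; i++) {
        g->acc[i].type = types[i];
        g->acc[i].state = ACC_IDLE;
        g->acc[i].owner = -1;
        g->acc[i].task_addr = 0;
    }
}

/* Request and response: id of an idle accelerator of this type, or -1. */
int acc_req(const gam *g, int type) {
    for (int i = 0; i < NUM_ACC; i++)
        if (g->acc[i].type == type && g->acc[i].state == ACC_IDLE)
            return i;
    return -1;
}

/* Reservation: bind the accelerator to the requesting core. */
bool acc_rsrv(gam *g, int id, int core_id) {
    assert(id >= 0 && id < NUM_ACC);
    if (g->acc[id].state != ACC_IDLE) return false;
    g->acc[id].state = ACC_RESERVED;
    g->acc[id].owner = core_id;
    return true;
}

/* Command: the owning core passes the task address and starts the work. */
bool acc_cmd(gam *g, int id, int core_id, uintptr_t addr) {
    assert(id >= 0 && id < NUM_ACC);
    accelerator *a = &g->acc[id];
    if (a->state != ACC_RESERVED || a->owner != core_id) return false;
    a->task_addr = addr;
    a->state = ACC_RUNNING;
    return true;
}

/* Completion: the accelerator notifies the manager that work is done. */
void acc_done(gam *g, int id) {
    assert(id >= 0 && id < NUM_ACC);
    assert(g->acc[id].state == ACC_RUNNING);
    g->acc[id].state = ACC_DONE;
}

/* Free: only the owner can release, and only after completion. */
bool acc_free(gam *g, int id, int core_id) {
    assert(id >= 0 && id < NUM_ACC);
    accelerator *a = &g->acc[id];
    if (a->owner != core_id || a->state != ACC_DONE) return false;
    a->state = ACC_IDLE;
    a->owner = -1;
    return true;
}
```

A core would call `acc_req`, `acc_rsrv`, and `acc_cmd` in that order, wait for `acc_done`, and then call `acc_free`; calls made out of order or by a non-owning core are rejected.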

### FPGA vs GPU Results

- Ported a comprehensive set of 15 kernels from the widely used GPU benchmark suite Rodinia to FPGA using HLS C
  - Performance: 3 FPGA kernels win, 3 kernels are comparable
  - Perf/watt: 6 FPGA kernels win, 4 kernels are > 2x worse than GPU
- Proposed an analytical model with new metrics (pipe\_OPC and e\_para\_factor) to analyze FPGA and GPU performance
  - FPGAs often have better pipe\_OPC due to their pipeline customization
  - FPGAs often lose in e\_para\_factor due to off-chip bandwidth (BW) limitations
  - With higher BW, 9 out of 15 FPGA kernels achieve more than half of GPU performance
- Future work: port to the latest Amazon F1 instance for FPGA and P3 instance for GPU, then compare and analyze perf and perf/dollar