

### Photonic Processor for Fully Discretized Neural Networks

<u>Jeff Anderson</u>, Shuai Sun, Yousra Alkabani, Volker Sorger, Tarek El-Ghazawi

The George Washington University

July 2019

### Introduction to ML

#### Machine Learning (ML) is everywhere!!





https://www.datasciencecentral.com/profiles/blog/show?id=6448529%3ABlogPost%3A598753&commentId=6448529%3AComment%3A599182&xg\_source=activity





### LeNet-5 is used for handwriting recognition



- Feature extraction identifies interesting features
- Classification uses features to identify digit
- NNs are composed of layers of neurons
  - Neurons ($Y_j$) execute multiply-accumulate (MAC): $Y_j = \mathrm{bias} + \sum_{i=1}^{3} \mathrm{in}_{i,j-1} \times w_{i,j}$
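The multiply-accumulate above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_output` and the sample inputs, weights, and bias are assumptions, not values from the talk.

```python
def neuron_output(inputs, weights, bias):
    """Return bias + sum(in_i * w_i) for one neuron (one MAC per input)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w  # one multiply-accumulate step
    return acc


# Hypothetical three-input neuron, matching the upper limit of 3 in the sum.
y = neuron_output(inputs=[1, 0, 1], weights=[2, 5, -1], bias=3)
print(y)
```

The multiplies are independent of each other, while the accumulation depends on every product.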



https://www.d2l.ai/chapter\_convolutional-neural-networks/lenet.html


#### How many MAC operations are needed?







C1: 28 × 28 outputs × 6 feature maps × 25 weights (5 × 5 kernel) = 117,600 MACs
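The C1 count can be reproduced with a small helper. This is a sketch: `conv_layer_macs` and its parameter names are hypothetical, and the single input map for C1 is an assumption based on the standard LeNet-5 grayscale input.

```python
def conv_layer_macs(out_height, out_width, out_maps, kernel_size, in_maps=1):
    """Count multiply-accumulates for one convolutional layer.

    Each output value needs kernel_size * kernel_size * in_maps MACs.
    """
    macs_per_output = kernel_size * kernel_size * in_maps
    return out_height * out_width * out_maps * macs_per_output


# C1 of LeNet-5: 28 x 28 outputs, 6 feature maps, 5 x 5 kernel.
c1_macs = conv_layer_macs(out_height=28, out_width=28, out_maps=6, kernel_size=5)
print(c1_macs)  # 117600
```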






- Parallelize layers to reduce latency
  - Cost: increase in hardware





# **NN Architectural Optimizations**

### Discretization of the NN

- Partially discretized NN reduces weights
  - {-1,0,1} (Ternary Connect)
  - {-1,1} (Binary Connect)
- Fully discretized NN reduces weights and I/O
  - {-1,0,1} (Ternarized NN)
  - {-1,1} (Binarized NN)
- Reduces latency
  - Smaller HW footprint
  - Simple operations
    - XOR, SUM
- Sum called "popcount"
  - Population count
  - Smaller than accumulator





## **Accumulation Drives Latency**

MAC consists of parallel multiplies and a summation

- Latency of parallel multiplies = latency of one multiply
- Latency of summation is a system bottleneck



### How to Improve on Analog Summation?

- A micro-ring resonator (MRR) enables an optical equivalent of analog electrical summation
  - Latency is not influenced by the RC constant
  - Wavelength division multiplexing (WDM) enables parallel operation with no increase in hardware




### **MRRs As Discretized Multiplier**

Figure panels: MRR as multiplier for ×(-1) and ×0



Digital values (left three columns) and their photonic encoding (remaining columns):

| A (digital) | B (digital) | Y (digital) | A: in1 (%) | A: in2 (%) | B: bias | Y: through (%) | Y: drop (%) | Y: through/drop |
|----|----|----|-----|-----|----|-----|-----|-----|
| 0  | 0  | 0  | 0   | 0   | 0  | 0   | 0   | 1   |
| 0  | 1  | 0  | 0   | 0   | 1  | 0   | 0   | 1   |
| 0  | -1 | 0  | 0   | 0   | -1 | 0   | 0   | 1   |
| 1  | 0  | 0  | 100 | 0   | 0  | 50  | 50  | 1   |
| 1  | 1  | 1  | 100 | 0   | 1  | 100 | 0   | > 1 |
| 1  | -1 | -1 | 100 | 0   | -1 | 0   | 100 | < 1 |
| -1 | 0  | 0  | 0   | 100 | 0  | 50  | 50  | 1   |
| -1 | 1  | -1 | 0   | 100 | 1  | 0   | 100 | < 1 |
| -1 | -1 | 1  | 0   | 100 | -1 | 100 | 0   | > 1 |
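The truth table can be reproduced with a small behavioral model. This is one way to read the table, not a physical simulation: the routing fractions in `IN1_TO_THROUGH`, the assumption that in2 is routed opposite to in1, and the function names are all illustrative.

```python
# Photonic encoding of the digital input A, read off the table: (in1 %, in2 %).
ENCODE_A = {0: (0, 0), 1: (100, 0), -1: (0, 100)}

# Assumed fraction of in1 routed to the through port for each bias value B.
# in2 is assumed to be routed the opposite way.
IN1_TO_THROUGH = {1: 1.0, 0: 0.5, -1: 0.0}


def mrr_multiply(a, b):
    """Return (through %, drop %) for ternary inputs a and b."""
    in1, in2 = ENCODE_A[a]
    f = IN1_TO_THROUGH[b]
    through = in1 * f + in2 * (1 - f)
    drop = in1 * (1 - f) + in2 * f
    return through, drop


def decode(through, drop):
    """Map port powers back to a ternary value by comparing the two ports."""
    if through > drop:
        return 1
    if through < drop:
        return -1
    return 0  # balanced ports (including no light at all) decode to 0


for a in (0, 1, -1):
    for b in (0, 1, -1):
        through, drop = mrr_multiply(a, b)
        print(a, b, decode(through, drop), through, drop)
```

Comparing the two ports instead of dividing them avoids a division by zero when A = 0 and no light enters the ring.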





### **Photonic Neuron**

- System simulations using INTERCONNECT
  - S-parameters derived from MODE results





Neuron Output





### **Photonic NN Processor Architecture**

Loosely coupled architecture controlled by an MSP430 microcontroller

- Weight memory selects Vref to bias MRR
- Discretized values enable HW reduction
  - Reduces latency since analog MUX selects voltage
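The weight-to-voltage selection described above can be sketched as a lookup, standing in for the analog MUX. The voltage values and the names `VREF_BY_WEIGHT`, `select_vref`, and `bias_voltages` are hypothetical; the talk does not give actual reference voltages.

```python
# Hypothetical reference voltages in volts, one per discretized weight value.
VREF_BY_WEIGHT = {-1: 0.0, 0: 0.5, 1: 1.0}


def select_vref(weight):
    """Model the analog MUX: a discretized weight selects one fixed Vref."""
    try:
        return VREF_BY_WEIGHT[weight]
    except KeyError:
        raise ValueError(f"unsupported weight: {weight!r}") from None


def bias_voltages(weight_memory):
    """Return the bias voltage applied to each MRR for the stored weights."""
    return [select_vref(w) for w in weight_memory]


print(bias_voltages([1, -1, 0]))
```

Because only a few weight values exist, a selection among fixed voltages is enough, which is the hardware reduction the slide refers to.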





### **Photonic Computation of a CNN**

- Partial unrolling of convolutional layer reduces hardware requirements with minimal performance impact
  - Shift register enables emulation of a sliding window
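The shift-register idea can be illustrated in one dimension. This is a simplified sketch under stated assumptions: `sliding_windows` is a hypothetical helper, and a real convolutional layer would slide a 2D window rather than the 1D one shown here.

```python
from collections import deque


def sliding_windows(pixels, window_size):
    """Yield successive windows as values are shifted into a fixed-size register."""
    register = deque(maxlen=window_size)
    for pixel in pixels:
        register.append(pixel)  # the oldest value is shifted out automatically
        if len(register) == window_size:
            yield tuple(register)


for window in sliding_windows([1, 2, 3, 4, 5], window_size=3):
    print(window)
```

Only one new value enters per step, so the same hardware is reused for every window position.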





## LeNet and AlexNet Simulation Results

- AlexNet performance lagged due to network size
- Larger NN processor reduces latency






### **LeNet Layer Analysis**

#### Execution time is write-dominated



**Execution Time Breakout for LeNet** 



### **Questions?**







# **Efficiency Analysis**

#### Due to WDM, large optical components rival performance per area of CMOS counterparts


