This project is mostly inspired by NNOM and CMSIS-NN; I want to do something for Edge AI.
However, I think NNOM is not well designed for multiple runtimes (CPU, DSP, GPU, NPU, etc.): it has no clear path for handling different types of runtime. I have also wanted to study OpenCL for a while, and when I came across MACE I found a bunch of CL kernels that can be used directly.
So I decided to do something meaningful: study OpenCL and, at the same time, create a Lightweight Neural Network (lwnn) that is suitable for devices such as PCs, mobile phones, and MCUs.
To support various deep learning frameworks such as TensorFlow/Keras/Caffe2 and PyTorch, lwnn supports ONNX. Older frameworks that do not support ONNX, such as Caffe and Darknet, are supported through special handling.
| Layers/Runtime | cpu float | cpu s8 | cpu q8 | cpu q16 | opencl | comments |
|---|---|---|---|---|---|---|
| Conv1D | Y d | Y | Y | Y | Y | based on Conv2D |
| Conv2D | Y d | Y | Y | Y | Y | |
| DeConv2D | Y | Y | Y | Y | Y | |
| DepthwiseConv2D | Y | Y | Y | Y | Y | |
| DilatedConv2D | Y | N | N | N | Y | |
| ElementWise Max | Y d | Y | Y | Y | Y | |
| ReLU | Y d | Y | Y | Y | Y | |
| PReLU | Y d | N | N | N | Y | |
| MaxPool1D | Y d | Y | Y | Y | Y | based on MaxPool2D |
| MaxPool2D | Y d | Y | Y | Y | Y | |
| Dense | Y | Y | Y | Y | Y | |
| Softmax | Y d | Y | Y | Y | Y | |
| Reshape | Y d | Y | Y | Y | Y | |
| Pad | Y | Y | Y | Y | Y | |
| BatchNorm | Y | Y | Y | Y | Y | |
| Concat | Y | Y | Y | Y | Y | |
| AvgPool1D | Y d | Y | Y | Y | Y | based on AvgPool2D |
| AvgPool2D | Y d | Y | Y | Y | Y | |
| Add | Y d | Y | Y | Y | Y | |
| PriorBox | Y | N | N | N | F | |
| DetectionOutput | Y | F | F | F | F | |
| Upsample | Y | Y | Y | Y | Y | |
| Yolo | Y | F | F | F | F | |
| YoloOutput | Y | F | F | F | F | |
| Mfcc | Y | F | F | F | F | |
| LSTM | Y | N | Y | N | F | |
| Proposal | Y | N | N | N | N | |
| Mul | Y d | N | N | N | Y | |
- F means the layer falls back to another runtime that supports it.
- d means dynamic shape support.
- s8/q8/q16: all are in Q format.
- s8: 8-bit symmetric quantization with a zero offset, very similar to TFLite quantization.
- q8/q16: 8/16-bit symmetric quantization with no zero offset.
- q8/s8/q16 activations (ReLU/Clip) reuse their input layer's buffer, so the activation layer must be the only consumer of its input layer.
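
To make the s8/q8/q16 notes above more concrete, here is a minimal Python sketch of Q-format quantization. It is not lwnn's actual code. It assumes the common convention that a real value is stored as `round(value * 2**Q)` in a signed integer, and that the s8 variant additionally adds an integer zero offset. The helper names `choose_q`, `quantize`, and `dequantize` are hypothetical and used only for illustration.

```python
import math


def choose_q(values, bits=8):
    """Pick the number of fractional bits Q so that the largest magnitude
    in `values` fits in a signed integer of width `bits` (assumed convention)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bits - 1
    # Integer bits needed to hold the magnitude (the sign bit is counted separately).
    int_bits = math.floor(math.log2(max_abs)) + 1
    return (bits - 1) - int_bits


def quantize(values, q, bits=8, offset=0):
    """Scale by 2**q, round, add the zero offset, and saturate to the integer range.
    offset=0 corresponds to the q8/q16 case; a non-zero offset mimics s8."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return [min(hi, max(lo, round(v * 2 ** q) + offset)) for v in values]


def dequantize(qvalues, q, offset=0):
    """Inverse mapping: remove the zero offset and scale back by 2**-q."""
    return [(x - offset) / 2 ** q for x in qvalues]


if __name__ == "__main__":
    data = [0.5, -1.25, 3.0]
    q = choose_q(data)            # fractional bits for an 8-bit container
    packed = quantize(data, q)    # q8-style: no zero offset
    print(q, packed, dequantize(packed, q))
```

With this convention the scale is always a power of two, so rescaling between layers can be done with shifts. The price is coarser resolution than an arbitrary floating-point scale, which may be one reason for the accuracy drop seen when quantizing larger models.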
Below is a list of commands to run the above models on the OpenCL or CPU runtime.
# object detection
lwnn_gtest --gtest_filter=*CL*SSDFloat -i images/dog.jpg
lwnn_gtest --gtest_filter=*CPU*SSDFloat -i images/dog.jpg
lwnn_gtest --gtest_filter=*CL*YOLOV3Float -i images/dog.jpg
lwnn_gtest --gtest_filter=*CPU*YOLOV3Float -i images/dog.jpg
lwnn_gtest --gtest_filter=*CPU*MASKRCNNFloat -i images/dog.jpg
# semantic segmentation
lwnn_gtest --gtest_filter=*CL*ENETFloat -i ENet/example_image/munich_000000_000019_leftImg8bit.png
lwnn_gtest --gtest_filter=*CPU*ENETFloat -i ENet/example_image/munich_000000_000019_leftImg8bit.png
# speech to text
lwnn_gtest --gtest_filter=*CPU*DSFloat -i speech_dataset/bird/042ea76c_nohash_0.wav
stt 49/29: b irr d
Note: These models show a large accuracy drop when quantized; I think quantization-aware training or something like TensorRT calibration is necessary.
conda create -n lwnn python=3.6
source activate lwnn
conda install scons
pip install tensorflow keras keras2onnx onnxruntime
sudo apt install nvidia-opencl-dev
scons