MegEngine

Bugfix

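Common components

Improve the error message reported when channels mismatch in grouped convolution so that it is easier to understand.

CUDA

Fix redundant error messages from the CUDA version check when running import megengine, so that the report is more reasonable.
Fix a memory leak introduced by TRT7.
Fix libnvrtc-builtins.so not being found in some cases.

Model serialization

When serializing an fbsv2 model, the user can configure whether intermediate tensor information is serialized; omitting it reduces the model size.

Peripheral tools

Fix the future_error: no state error raised when megbrain, while accessing the redis server, repeatedly called get on the same std::future.

Quantization

Fix SyncExponentialMovingAverageObserver being unusable in single-machine scenarios.

Python API

Fix incorrect FLOPs statistics for deconv.
Fix Tensor values smaller than 1e-4 being printed as 0.
Add support for non-float32 inputs: when the input type does not meet the requirements but is still numeric, output a bool Tensor whose elements are all False.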

New Features

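Python API

Add an implementation of the meshgrid operator.

Common components

Add the output_padding parameter to conv_transpose to control the output image size.
Upload the model format file that MegEngine defines with flatbuffer to GitHub.
Add backward support for the warp_affine operator.

ARM

Add a winograd f43 implementation for ARM nchw, speeding up some 3x3 convolutions on ARM under nchw by 6% to 74%.
Add a winograd f43 implementation for ARM nchw44.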

Improvements

ARM

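Optimize sigmoid for ARM fp32, improving inference speed by 10%.

Documentation

Improve the description of the batch_norm parameters and usage in the documentation to make it more complete and clear.
Update the max_pool2d/copy interface documentation to make it more complete and clear.
Improve the docstrings of some Python interfaces.

Dataloader

Optimize the dataloader; performance on basekps improves by 5x on average.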

MegEngine

Bugfix

Common components

Make the error message for grouped convolution more readable when input channels mismatch.

CUDA

Make the error message of CUDA version detection more reasonable when importing megengine.
Fix TRT7 workspace memory leak.
Fix missing libnvrtc-builtins.so in some environments.

Model serialization

When serializing the fbsv2 model, the user can configure whether to serialize the middle_tensor tensor information.

Peripheral tools

Fix the future_error: no state raised when megbrain repeatedly called get on the same std::future while accessing the redis server.

Quantization

Fix SyncExponentialMovingAverageObserver not being available in non-distributed mode.

Python API

Fix stats error for deconv flops.
Fix the problem that the tensor value is printed as 0 when it is less than 1e-4.
Add non-float32 input support, output a bool type Tensor whose elements are all False when the input type does not meet the requirements but is still numeric.

New Features

Python API

Add meshgrid opr.

Common components

Add the output_padding parameter for conv_transpose to control the output image size.
Upload the model format file defined by flatbuffer in MegEngine to github.
Support the backward of warp_affine operator.
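
To see how output_padding can control the output size of conv_transpose, the sketch below uses the conventional transposed-convolution size formula. Both the helper name and the formula are assumptions made for illustration; they are not taken from these notes, and MegEngine's exact parameter handling may differ.

```python
def conv_transpose_out_size(in_size, kernel, stride=1, padding=0,
                            dilation=1, output_padding=0):
    """Output length along one spatial axis under the conventional formula."""
    return ((in_size - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


# With kernel 3, stride 2 and padding 1, a forward convolution maps inputs of
# length 7 and length 8 to the same output length 4. The transposed operation
# therefore needs output_padding to pick which of the two sizes to produce.
size_default = conv_transpose_out_size(4, kernel=3, stride=2, padding=1)
size_padded = conv_transpose_out_size(4, kernel=3, stride=2, padding=1,
                                      output_padding=1)
print(size_default, size_padded)
```

Under this formula, output_padding only extends one side of the output; it does not add zeros to the input.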

ARM

Added winograd f43 implementation for ARM nchw.
Add FP32 winograd F43 NCHW44 algo.

Improvements

ARM

Optimize the sigmoid of arm fp32, improve the performance by 10%.

CUDA

Document

Optimize the description of the parameters and usage of batch_norm in the document to make it more complete and clear.
Update the interface document of max_pool2d/copy operator.
Optimize the docstring of some Python API.

Dataloader

Optimize the dataloader; performance on basekps improves by 5x on average.

MegEngine

HighLight

Add CUDA INT4 support. Verified with cuda11.4 + cudnn8.2.1 + trt7.2.2.3 on an A2 card: compared with Float32, ResNet-50 top-1 accuracy drops by 0.993% and speed improves 5.8x (557.969 ms -> 96.726 ms); compared with INT8, ResNet-50 top-1 accuracy drops by 0.131% and speed improves 1.3x (125.76 ms -> 96.726 ms). See the MegEngine example for details.
Early-access channel: python3 -m pip install megengine==1.11.0+cu114 -f https://megengine.org.cn/whl/mge.html
Netron can now visualize Traced Module! You are welcome to try it: https://netron.app/
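
The INT4 speedup factors quoted in this highlight follow directly from the reported latencies. A small arithmetic check, using only the numbers given above (the variable names are illustrative):

```python
# Latencies reported in the highlight, in milliseconds.
fp32_ms, int8_ms, int4_ms = 557.969, 125.76, 96.726

speedup_vs_fp32 = fp32_ms / int4_ms
speedup_vs_int8 = int8_ms / int4_ms

print(f"{speedup_vs_fp32:.1f}x vs Float32, {speedup_vs_int8:.1f}x vs INT8")
```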

Bugfix

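Release process

Fix an error caused by renaming tensors in traced module.

Common components

Fix the condition used to decide whether to skip an algorithm during fastrun.
Fix OOM errors triggered by excessive GPU memory usage during fastrun.
Fix the process being unable to exit under the combination of Windows 7 + 32-bit + multithreading.
Fix tensor format information being lost during parameter initialization.
Change the algorithm selection for the nchw44 broadcast_vec case, fixing a performance defect of nchw44 elemwise.
Fix the source-tree pollution problem so that git status again shows only the user's own changes.
Improve the messages printed when convolution channels mismatch or Matmul shapes mismatch, making them easier to understand.
Fix occasional data read failures caused by network issues while reading the persist cache.
Fix a hang caused by parameter tensor initialization not taking DTR into account.
Fix record2 optimization being unavailable because softmax dynamically created elemwise and other oprs at runtime.
Fix a forward compatibility problem caused by elemwise multitype, so that earlier load and run builds can run models dumped by this version.
Fix the Repeat operator being unable to run in trace mode.
Fix settings having no effect in load_and_run fitting mode when only the input shape or only the input batch size is given.
Fix large ReduceMean errors between different versions, and between CPU and GPU within the same version.
Fix the increased model memory usage in version 1.10.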

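CUDA

Fix cutlass taking too long to compile, or failing to compile, for SM86.
Change the detection logic for multi-GPU environments: remove the check and prompt at initialization on whether every installed GPU supports import megengine, and report an error only when the GPU actually used at runtime is unsupported.
Fix the compilation failure with cudnn8.
Fix the TensorRT8 build failing when LIBRARY_PATH is not specified.

Peripheral tools

Fix record_comp_seq not taking effect in load_and_run.
Fix inaccurate model computation counts in the parameter and computation statistics tool caused by the limited range of the long type.
Fix an error in load_and_run when a model containing test cases is dumped with global graph optimization.
Fix the parameter and computation statistics tool module_stats counting shared weights more than once.
Fix megengine.tools.network_visualize raising an error because CondTake was not supported.
Fix load and run showing no speedup after multithread is set.

ROCM

Fix convolution operators being unable to run on the ROCM platform because the conv bias implementation was missing.

Distributed training

Fix training hanging when async_level is set to 0 in multi-GPU training.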

New Features

Python API

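Expose the following new APIs: is_cambricon_available, is_atlas_available, is_rocm_available, what_is_xpu.

Common components

resize backward now supports fp16 and the nhwc data format.
The cache of the CPU and CUDA algo policy is now written in append mode.
Add oprs with bool output type to elemwise multitype to improve the performance of megengine.functional.isnan, megengine.functional.not_equal, megengine.functional.less_equal, megengine.functional.greater_equal, megengine.functional.greater, megengine.functional.less, megengine.functional.isinf and megengine.functional.equal. After the optimization, overall performance is on par with pytorch, and megengine.functional.isinf and megengine.functional.equal outperform pytorch.
Add interfaces for querying the trt, cudnn and cuda versions bundled in the whl package: megengine.get_cuda_version, megengine.get_cudnn_version, megengine.get_tensorrt_version.
Use VF instructions to optimize GI direct convolution, winograd convolution, nchw_nchw44 convolution and matrix multiplication on X86 and RVV. ResNet18 was verified to be 50 ms faster on amax04.
Matrix multiplication: 12 Gflops -> 20 Gflops on E5-2620 v4 @ 3.0GHz (amax), 0.3 Gflops -> 1.2 Gflops on nezha D1.
Remove the FIXLEN dependency from the GI algo on RVV, avoiding the extra load/store operations FIXLEN generated and speeding up inference; the resnet18 model is 5% to 10% faster on RVV.
Optimize the softmax implementation. On ARM devices, the optimized softmax is about 10x faster than the previous proxy-based softmax.
Add a toolchain for building with TensorRT8.
load_and_run adds the optimize_for_inference option for mdl models, which applies optimize-for-inference graph optimizations such as bn fusion.

ARM

For the pooling operator, support fusing reduce and elemwise operators under the nchw44 format.

Third-party hardware

Optimize X86+RISC-V performance; verified at a 1.1x speedup on resnet18.

Peripheral tools

load and run can now be given the loader init interface at runtime, so that a business-side loader that has renamed its init API can still be loaded by passing a parameter. The option is --c-opr-init-interface.
Usage example: ./load_and_run --c-opr-init-interface="your_loader_init_API".
The default value of c-opr-init-interface is mgb_c_opr_init. For example, a value that might be used in production is anc_c_opr_init.
load_nerwork_and_run supports weight preprocessing and setting the number of warm-up iterations.

Release process

Add a way to build the cu114 whl package.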

Improvements

ARM

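Optimize the forward performance of the reduce opr on CPU when reducing over the last dimension of shapes (xxx, xxx, 2/3/4), about a 10x improvement.

CUDA

Optimize conv2d performance when padding mode is reflect; the gain is significant for large shapes, verified at about 50%.

Documentation

Improve the documentation of roi_pooling, roi_align, nms, remap, warp_affine, warp_perspective and interpolate in the functional.vision module.
Improve the description of the mode parameter in the pad documentation to make it more accurate.
Improve the documentation of dataloader, Dataset and MNIST dataset to make it more complete and clear.

MegEngine Lite

Bugfix

Fix memory leaks caused by repeatedly calling the get_io_tensor, slice and concat interfaces in the MegengineLite python interface.
Fix a crash in lite when fast_run and nchw44 are enabled at the same time.

New Features

The LiteConfig of MegEngine Lite adds the auto_optimize_inference option for device detection, which automatically sets the matching layout optimization options according to the CPU information at inference time.
Add an assertion to the set_data_by_share and set_data_by_copy interfaces in Lite that a numpy ndarray input must be contiguous.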

MegEngine

HighLight

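MegEngine models support forward compatibility: a model serialized by a newer version of MegEngine can be loaded by an older version.
- Forward compatibility is available from this version onward.
- Some scenarios are not forward compatible, for example when an opr newly added in the newer version is used.
Add support for python3.9.

Known Issue

In v1.10, sublinear and static-graph dtr do not take effect in trace mode.
ResNet50 inference on 2080ti cuda is slightly slower than in v1.9.
VGG inference on Raspberry Pi is slightly slower than in v1.9.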

Bugfix

Python API

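Restrict the cases in which inputs are automatically converted to tensors: only elemwise converts its inputs automatically.
Fix megengine.functional.matmul crashing during backward in dynamic graph mode.
Fix incorrect shape inference in megengine.functional.transpose.
Fix empty-tensor problems in conv backward and the megengine.random.RNG operators.
Restrict the type conversion of non-tensor inputs when megengine.functional.concat is applied in trace mode.
Fix comparison functions in megengine.functional returning a dtype other than bool.

Mixed-precision training

Fix the increased GPU memory usage of some networks on BaseCls in v1.9.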

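Common components

Fix AMP not working with fp16 parameters.
Pin the cpuinfo version to avoid a possible memory leak during dlopen on ARM.
Fix ndim being set incorrectly in adaptive_pooling when the shape cannot be inferred.
Fix riscv64 gcc errors when using optimization levels above O0.
Fix errors when reading and writing tensor shapes asynchronously.
Fix incorrect gradients in advanced indexing when one element is taken multiple times.
Fix a commit change causing a large number of files to be recompiled.
Fix cache confusion when fastrun and heuristic are mixed.
Fix errors in some cases when getting the cuda environment through megengine.get_cuda_compute_capability after a fork.
Fix being unable to attach a Tensor that is already on the gradient path.
Fix crashes after midout in Oprs such as softmax whose computation is composed of other Oprs.
Fix the execution policy being missing in pooling and matmul.
Fix an error after reset memory when running inference with MegEngineLite; specifically, fix the reduce opr failing when the input's memory address changes, by adding an update step before actual execution.
Fix jit-related functions crashing when nvcc is not in the path.
Fix the number of dimensions differing before and after reduce, after the default value of the reduce operator's keepdims parameter was changed from True to False in v1.9.
Fix unstable layernorm training and slow speed when the normalized dimension is small.
Fix a hang when getting the shape, which occurred with very low probability when a tensor was created with incomplete shape information.
Fix an exception thrown when a tensor is passed as tshape to adaptive_pooling.
Fix reduce throwing an exception while building the backward graph when it does not take part in the backward computation and has no gradient.
Make every op that takes an axis option support negative axis values.
Fix a memory leak when running an mge computing graph with GraphInference.
Fix the condition used to decide whether to skip an algorithm during fastrun.
Fix OOM errors triggered by excessive GPU memory usage during fastrun.
Fix the incorrect gradient of maximum(x,x).
Add the MGE_WITH_BENCHMARK option to cmake to allow compiling BENCHMARK in DNN.
Fix inplace operations in Function.
Fix broadcast_to not being traceable.
Check dtype, device and other arguments when constructing a new tensor from an existing tensor.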
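
The advanced indexing fix in the list above concerns the case where one element is taken several times. In that case the backward pass has to accumulate the incoming gradients rather than overwrite them. The pure-Python sketch below shows that rule on a 1-D input; the function name is hypothetical and this is not MegEngine's implementation.

```python
def index_backward(input_len, indices, grad_output):
    """Gradient of y = x[indices] with respect to x, for a 1-D x."""
    grad_input = [0.0] * input_len
    for idx, g in zip(indices, grad_output):
        grad_input[idx] += g  # accumulate: the same element may be picked twice
    return grad_input


# Element 0 is taken twice, so it receives the sum of both output gradients.
grad = index_backward(3, [0, 0, 2], [1.0, 1.0, 1.0])
print(grad)
```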

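Release process

Fix an error caused by renaming tensors in traced module.
Fix an exception that traced module could raise incorrectly.
Fix compatibility issues in traced module.

ARM

Fix the error message when executing an NHWCD4 model on ARM.

Peripheral tools

Fix an exception thrown in load_and_run fitting mode for models whose shape changes when the user enables const_shape.
Fix record_comp_seq not taking effect in load_and_run.
Fix the event sync problem of Atlas when profiling.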

New Features

Python API

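Remove Symbolvar from the Imperative python interface and implement its functionality in Tensor (compatible with earlier mgo graph-surgery code).
Add the lamb optimizer, which supports large batch size training.
The megengine.functional.nn.roi_align operator supports empty tensor input.
Add the swapaxes interface to support swapping dimensions.

Common components

Optimize the preparation of third_party and add options to improve the experience of training-only or inference-only users. Adding EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" before cmake automatically adjusts the THIRD_PARTY libraries needed for the build.
Add a check of whether the local CUDA version matches the CUDA version MegEngine depends on; a warning is printed if they do not match.
Support astype on uint16 tensors.
Add warmup to the profile mode of fastrun to make its judgments more accurate.
MegEngine models support forward compatibility: a model serialized by a newer version of MegEngine can be loaded by an older version.
Complete gi support for risc-v.
Add support for python3.9.

ARM

Add chanwise 9x9 and 11x11 dot product operations to arm_common. The 9x9 case has 25% useless computation while the 11x11 case has only 8.3%; when alignment is satisfied, the measured times of 9x9 and 11x11 differ little, so the 11x11 version is recommended.
Implement a gi version of the non-mk4 gemm under dnn/src/fallback/matrix_mul.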
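
One reading that reproduces the 25% and 8.3% figures quoted for the 9x9 and 11x11 kernels is that each kernel row is padded up to a multiple of 4 elements before the dot product instruction is applied. That padding rule is an assumption made here for illustration; the notes do not state how the numbers were derived.

```python
def wasted_fraction(kernel_width, lanes=4):
    """Share of multiply-accumulates spent on padding, under the assumed rule."""
    padded = -(-kernel_width // lanes) * lanes  # round up to a multiple of lanes
    return (padded - kernel_width) / padded


waste_9 = wasted_fraction(9)    # 9 padded to 12
waste_11 = wasted_fraction(11)  # 11 padded to 12
print(f"9x9: {waste_9:.1%}, 11x11: {waste_11:.1%}")
```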

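CUDA

Support a basic implementation of int1 conv.

Third-party hardware

Support Atlas710 hardware.

Peripheral tools

Improve the cmake build instructions; if you find problems, you are welcome to submit a PR or give feedback on the forum.
Add a fitting mode interface to load_and_run.
The --input option of load_and_run adds a way to specify the input shape. Format: --input="data_name:{d0,d1,d2, ...,dn}".
load_and_run adds the layout_transform_batch_size option, which specifies the input batch size for global graph optimization.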

Improvements

Python API

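Improve the performance of megengine.functional.nn.pixel_shuffle for small shapes by up to 500%.
Improve the performance of megengine.functional.matmul for small shapes by about 15%.

Common components

Optimize cross-stream tensor copies.
Optimize the adaptive_pooling implementation. In imperative mode, megengine.functional.nn.adaptive_avg_pool2d and megengine.functional.nn.adaptive_max_pool2d are about 6.5x faster.
Optimize the megengine.functional.nn.conv_transpose3d implementation; it is about 2x faster in imperative mode.
Optimize the pooling implementation. In imperative mode, megengine.functional.nn.avg_pool2d and megengine.functional.nn.max_pool2d are about 5x faster.
Optimize the megengine.functional.nn.conv_transpose2d implementation; it is about 3x faster in imperative mode.
Use a simpler way of constructing keys in the heuristic cache for better performance.
Rewrite the custom gradient rules of matmul and batchmatmul to speed up their backward computation; compared with version 1.9, the time of a single training iteration of the vit model drops from 354 ms to 350 ms.
Reduce the cuda compile time for a single sm to 2/3 of the original.

CUDA

Optimize the CUDA direct algorithm for large convolutions; forward speed reaches more than 80% of peak.

MegEngine Lite

Bugfix

Fix lite_shared.dll missing from the install directory.
Fix an error when copying data from numpy to a device tensor.
Fix MegEngine Lite still using a single thread when running multithreaded under cpu:default.
Fix an interface name in pylite: set_tensorrt_cache → set_redis_cache
Fix a compatibility problem where older versions of load_and_run could not parse previously packed models.

New Features

Add uploading and downloading of the redis cache to MegEngine Lite.
Add the LITE_extra_configure interface to MegEngine Lite, which lets users choose whether model information is used to configure the network.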

MegEngine

Bugfix

Python API

Restrict using convert_inputs in py_apply.
Fix megengine.functional.matmul grad error.
Fix megengine.functional.transpose shape infer.
Fix empty tensor bug of conv_bwd and megengine.random.RNG.
Restrict value converts to tensor for megengine.functional.concat.
Fix return dtype of comparison.
Fix the problem that cuda environment cannot be used after fork.
Fix the problem that tensors already in gradient path cannot be attached.
Fix the crash of some Operators running after midout, these Operators will call other Operator to finish compute task, such as softmax.
Fix the problem that policy is missed for pooling and matmul.
Fix the problem of reporting an error when the input memory address changes in reduce opr, and add the update function to fix it before the actual execution.

AMP

Fix the increased memory usage of some networks on basecls in v1.9.

Common components

Fix an AMP error that occurred when some parameters had float16 dtype.
Fix cpuinfo version to avoid memory leakage when dlopen on arm.
Fix incorrect ndim when the shape could not be inferred for adaptive_pooling.
Fix riscv64 gcc error when using compilation optimization options greater than O0.
Fix a bug when asynchronously reading or writing a tensor's shape.
Print a warning when the CUDA version on the user's machine does not match the CUDA version MegEngine depends on.
Fix advanced indexing grad error.
Fix many objects needing recompilation when the commit id changes.
Fix lookup heuristic cache even in fastrun.
Fix jit related functions failing when NVCC is not in path.
Fix the default behavior of the reduce operation with keepdims being inconsistent with older versions.
Fix unstable layernorm training and slow speed with small normalization dimensions.
Fix a rare hang when getting the shape of a tensor that was created with incomplete shape information.
Fix an exception when passing a tensor as tshape in adaptive_pooling.
Fix reduce throwing an exception while constructing the backward graph when it does not take part in the backward computation and has no gradient.
Make every op that takes an axis option support negative axis values.
Fixed memory leak when using GraphInference to run mge calculation graphs.
Fix skip condition in fastrun.
Fix OOM error in fastrun.
Fix grad of maximum(x, x).
Add the MGE_WITH_BENCHMARK option to cmake to allow the compilation of BENCHMARK in DNN.
Fix inplace operation on autodiff.Function.
broadcast_to supports mutable target shape.
Check args when constructing a tensor from an existing tensor.
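
Supporting a negative axis, as one of the fixes in this list describes, usually means counting the axis from the last dimension. A minimal sketch of that convention follows; normalize_axis is a hypothetical helper written for this example, not a MegEngine function.

```python
def normalize_axis(axis, ndim):
    """Map an axis in [-ndim, ndim) to its non-negative equivalent."""
    if not -ndim <= axis < ndim:
        raise ValueError(f"axis {axis} is out of range for {ndim} dimensions")
    return axis + ndim if axis < 0 else axis


last_axis = normalize_axis(-1, 4)   # the last of four dimensions
same_axis = normalize_axis(2, 4)    # non-negative axes are unchanged
print(last_axis, same_axis)
```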

Release process

Fix the bug occurred when renaming tensor in traced module.
Fix trace_module function may raise error in finally scope
Fix traced module compatible issues.

ARM

Fix error message when executing NHWCD4 model on ARM.

Peripheral tools

Fix the problem that the model whose shape changes when the user turns on const_shape in load_and_run fitting mode throws an exception.
Fix the bug that record_comp_seq in load_and_run does not take effect.
Fix the bug of event sync of Atlas when profiling.

New Features

Python API

Remove Symbolvar and implement its function in Tensor.
Add lamb optimizer that supports large batch size training.
megengine.functional.nn.roi_align operator supports empty tensor input.
Add swapaxes interface to support dimension swapping.

Common components

Optimize third_party's prepare, add options, and improve the experience of training-only or inference-only users. Adding EXTRA_CMAKE_ARGS="-DMGE_SYNC_THIRD_PARTY=ON" before cmake will automatically adjust the THIRD_PARTY library required for compilation.
Add warmup before profile in fastrun.
MegEngine models support forward compatibility. That is, the model serialized by the new version of MegEngine can be loaded in the old version of MegEngine.
Complete gi support for risc-v.
Support python3.9.

Third-party hardware

Support Atlas710.

ARM

Added chanwise's 11x11 & 9x9 dot product operation in arm_common.
Implement a gi version of gemm's non-mk4 algorithm under dnn/src/fallback/matrix_mul.

CUDA

Support simple implementation of int1 conv.

Peripheral tools

Improve the cmake build notes; if you have any questions, you are welcome to contribute or give feedback on the forum.
Added fitting mode interface for load_and_run.
Add the usage of specifying input shape to the --input option of load_and_run. format: --input="data_name:{d0,d1,d2, ...,dn}".
Add layout_transform_batch_size option for load_and_run to specify global layout transform input batch size.
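
The --input shape format quoted above, data_name:{d0,d1,...,dn}, can be read as an input name followed by a brace-enclosed list of dimensions. The parser below is only a sketch of that reading: parse_input_shape and the input name "data" are hypothetical, and this is not how load_and_run itself parses the option.

```python
def parse_input_shape(spec):
    """Split 'name:{d0,d1,...}' into the input name and a tuple of ints."""
    name, sep, body = spec.partition(":")
    body = body.strip()
    if not sep or not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"expected 'name:{{d0,d1,...}}', got {spec!r}")
    dims = tuple(int(d) for d in body[1:-1].split(","))
    return name, dims


name, shape = parse_input_shape("data:{1,3,224,224}")
print(name, shape)
```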

Improvements

Python API

Speed up `megengine.functional.nn...

MegEngine

Bugfix

Python API

Fix empty-tensor problems in conv backward and the megengine.random.RNG operators.
Restrict the type conversion of non-tensor inputs when megengine.functional.concat is applied in trace mode.

MegEngine Lite

Bugfix

Fix MegEngine Lite still using a single thread when running multithreaded under cpu:default.

MegEngine

Known Issue

Training reports an error when the inputs of megengine.random.RNG include a 0-dimensional tensor.

HighLight

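This release brings a substantial performance improvement: most network training is about 10% faster, and host-bound scenarios such as detection models and QAT training are 20%~40% faster. The speedup is especially noticeable with small batches, amp and similar cases. Verified on multi-GPU BaseCls training, the average speedup is 15.4%.
- In some operators, the output tensor can share data with the input tensor (Memory Forwarding). No data copy happens in this case; a copy is triggered only when a tensor whose data is shared gets modified, which guarantees that the other tensors sharing that data are unaffected. The operators involved include megengine.functional.transpose, megengine.functional.broadcast_to, megengine.functional.reshape, megengine.functional.expand_dims, megengine.functional.split, tensor indexing and others. This avoids data copies as far as possible and improves performance. To guard against abnormal GPU memory usage in extreme cases, megengine.config.disable_memory_forwarding is provided to disable this feature.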
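
The sharing behavior described for Memory Forwarding is a copy-on-write scheme. The toy class below illustrates the idea with plain Python lists: views share storage until one of them is written. SharedBuffer and its methods are hypothetical names for this sketch and say nothing about how MegEngine implements the feature internally.

```python
class SharedBuffer:
    """Toy tensor whose views share storage until one of them is written."""

    def __init__(self, data):
        self._storage = list(data)
        self._users = [1]  # one-element list: a counter shared by all views

    def view(self):
        """Return a new object that shares this buffer's storage (no copy)."""
        other = SharedBuffer.__new__(SharedBuffer)
        other._storage = self._storage
        other._users = self._users
        self._users[0] += 1
        return other

    def write(self, index, value):
        if self._users[0] > 1:
            # Storage is shared: detach with a private copy before modifying.
            self._users[0] -= 1
            self._storage = list(self._storage)
            self._users = [1]
        self._storage[index] = value

    def read(self):
        return list(self._storage)


a = SharedBuffer([1, 2, 3])
b = a.view()
shared_before_write = a._storage is b._storage  # no copy has happened yet
b.write(0, 99)                                   # triggers the copy
print(a.read(), b.read())
```

Only the writer pays for the copy, so read-only views such as a transpose or reshape result stay free.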

Notice

Support for python3.5 is kept in this release and will be dropped starting with the next release, MegEngine v1.10 (MegBrain v8.17). Please prepare in advance.

Bug fixes

Python API

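Make the @ operator behave consistently with megengine.functional.matmul.
Fix the output Tensor of megengine.functional.nn.pad possibly being all zeros.
Add the nearest mode to megengine.functional.nn.remap and megengine.data.transform.Resize.

Common components

Fix megengine.functional.nn.sync_batch_norm being unusable in mixed-precision training.
Fix an error in global optimization when fusing conv with two nonlinear operators.
Fix large-kernel convolution being slow when fastrun is not enabled.
Fix no error being reported, and no gradient being produced, when differentiating with respect to a non-float32 input.
Fix IO interruptions in RPC communication during distributed training.
Fix BatchNorm's support for second-order derivatives.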

New Features

Python API

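megengine.functional.nn.conv1d and megengine.functional.nn.conv2d add the padding_mode parameter, which supports the zeros, reflect and replicate modes.

CUDA

Add a direct conv implementation for large kernels.
Add an implicit bmm implementation of large-kernel depthwise conv.
The nearest mode of resize on CUDA supports multi-channel inputs with channel counts other than 1 and 3.

Common components

Apply cd4 optimizations based on a production denoising model, mainly adding conversion between the NHWC and NHWCD4 formats. Performance on the production denoising model improves by about 15%.
Add support for the int1 data type.
Support np.newaxis(None) in tensor indexing.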

Improvements

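Common components

Optimize performance: most network training is about 10% faster, and host-bound vit and detection models are 20%~40% faster in QAT scenarios.
Improve the performance of the op dispatch system. The performance problems of the new dispatch system used in v1.8 are fixed, and performance is now on par with v1.7.
Improve the jit trace performance of the dispatch system; performance is slightly better than v1.7. With trace enabled, training performance of some models improves as follows: ResNet50 by 0.7%, ShuffleNet by 9%, ATSS by 10%.
subgraph op supports shape inference and jit fusion optimization, and some poorly performing ops composed of elemwise operations were rewritten with subgraph op. After the optimization, megengine.functional.nn.hsigmoid, megengine.functional.nn.relu6, megengine.functional.nn.prelu, megengine.module.LeakyReLU, megengine.functional.nn.softplus, megengine.functional.nn.logsigmoid and megengine.functional.where perform on par with pytorch for large input shapes.
Improve batch_norm performance, by 4.3x for small sizes.
Optimize reduce op performance; it is 75% faster.

CUDA

Fuse conv and h_swish, improving the performance of some models.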

MegEngine Lite

Bug fixes

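Fix the lite global graph optimization interface being unusable on cuda devices because symbolvar replacement was incomplete.
Fix the conflict between the global graph optimization interface and the fast-run interface for lite models in load_and_run.
Fix an error when load_and_run is used with the "--cuda" parameter.

New Features

Add error codes to the lite-c interface, along with LITE_get_last_error_code, an interface for getting the last error code globally.
Add an interface to lite for querying a physical address from a virtual address.
load_and_run supports global graph optimization for lite models.

Improvements

Optimize the performance of the get_data_by_share python interface in Lite, giving a slight performance improvement on models in the algorithm repository.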

MegEngine

Bug fixes

Python API

Make the "@" operator behave consistently with megengine.functional.matmul.
Fix the output tensor of megengine.functional.nn.pad possibly being all 0.
Add the nearest mode for megengine.functional.nn.remap and megengine.data.transform.Resize.

Common components

Fix megengine.functional.nn.sync_batch_norm not being available when training with mixed precision.
Fix a bug when fusing conv bias with two nonlinear oprs.
Fix the problem of poor performance of the large kernel convolution without fastrun.
Fix gm attach on a non-float type producing no gradient without reporting an error.
Fix the IO interruption for RPC communication when distributed training.
Fix BatchNorm support for higher-order differentiation.

New Features

Python API

Add the padding_mode parameter, supporting the zeros, reflect and replicate modes, to megengine.functional.nn.conv1d and megengine.functional.nn.conv2d.
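
The three padding_mode names can be illustrated on a 1-D sequence. The sketch below follows the conventional meaning of these modes (as in NumPy's constant, reflect and edge padding) and assumes MegEngine uses the same definitions; pad1d is a hypothetical helper, not a MegEngine function.

```python
def pad1d(values, pad, mode="zeros"):
    """Pad a 1-D sequence by `pad` elements on both sides."""
    n = len(values)
    if mode == "zeros":
        left = right = [0] * pad
    elif mode == "replicate":
        # Repeat the edge value.
        left, right = [values[0]] * pad, [values[-1]] * pad
    elif mode == "reflect":
        # Mirror around the edge element without repeating it.
        if pad >= n:
            raise ValueError("reflect padding must be smaller than the input")
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 1 - i] for i in range(1, pad + 1)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + list(values) + right


row = [1, 2, 3, 4]
padded = {mode: pad1d(row, 2, mode) for mode in ("zeros", "reflect", "replicate")}
print(padded)
```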

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.
The nearest mode of resize on CUDA supports multi-channel inputs with channel counts other than 1 and 3.

Common components

Add conversion between NHWC and NHWCD4 formats.
Add support for int1 dtype.
Add np.newaxis(None) for tensor indexing.

Improvements

Common components

Optimize performance: most networks train about 10% faster, and host-bound models such as ViT or detection models speed up by 20% to 40% in QAT scenarios.
Improve the performance of the op dispatch system. Fix the performance problems of the new dispatch system in version 1.8. After the repair, the performance is the same as that of version 1.7.
Improve the jit trace performance of the dispatch system. The performance is slightly improved compared to the 1.7 version.
When trace is enabled, the training performance of some models is improved as follows, resnet50 0.7%, shufflenet 9%, and atss 10%.
Subgraph op supports shape infer and jit fusion optimization, and rewrites some ops with it.
Performance of megengine.functional.nn.hsigmoid, megengine.functional.nn.relu6, megengine.functional.nn.prelu, megengine.module.LeakyReLU, megengine.functional.nn.softplus, megengine.functional.nn.logsigmoid and megengine.functional.where is on par with pytorch for large input shapes.
Improve the performance of the batch_norm op by 4.3 times for small sizes.
Improve the performance of the reduce op; it is 75% faster.

CUDA

Fusion of conv and h_swish, the performance of some models is improved.

MegEngine Lite

Bug fixes

Fix lite global layout transform symbolvar replace error.
Fix the conflict between load_and_run lite model global layout transform optimization interface and fast-run interface.
Fix load_and_run error when using "--cuda" parameter.

New Features

Add 'LITE_get_last_error_code' interface in lite-c.
Add an interface in lite for getting the physical address.
load_and_run supports lite model global layout transform optimization.

Improvements

Optimize the get_data_by_share interface of LiteTensor.

MegEngine

Known Issue

GPU memory usage (MiB) for training and inference increases to varying degrees across models.

New Features

CUDA

Add a direct conv implementation for large kernels.
Add an implicit bmm implementation of large-kernel depthwise conv.

MegEngine

New Features

CUDA

Add implementation of large kernel's direct conv algo.
Add implementation of large kernel's depthwise conv by implicit bmm.

MegEngine

Notice

Support for python3.5 will be dropped starting with the next release, MegEngine v1.9. Please prepare in advance.

HighLight

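megengine.functional.topk adds the "descending" parameter to define the sort order. In this release it defaults to False, keeping ascending order, and a warning is shown if it is not specified. In v1.12 the default of "descending" will change to True to match the usual definition of topK, that is, selecting the Top-K largest elements of a 2D matrix.
MegEngine supports on-device training; refer to the linked documentation for usage.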

Bug fixes

Python API

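Fix incorrect results of megengine.functional.floor_div for integer inputs with opposite signs.
Make megengine.functional.broadcast_to accept None, meaning that the corresponding dimension does not need to be broadcast, to support automatic inference of -1 shapes.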
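
One way to picture None in a broadcast_to target shape is as "keep the input's size on this dimension". The sketch below resolves such a target shape under that assumed reading, with dimensions aligned from the right; resolve_broadcast_shape is a hypothetical helper and the exact MegEngine semantics may differ.

```python
def resolve_broadcast_shape(input_shape, target_shape):
    """Replace None entries in target_shape with the matching input dimension."""
    offset = len(target_shape) - len(input_shape)
    if offset < 0:
        raise ValueError("target shape has fewer dimensions than the input")
    resolved = []
    for i, dim in enumerate(target_shape):
        src = input_shape[i - offset] if i >= offset else None
        if dim is None:
            if src is None:
                raise ValueError("None needs a matching input dimension")
            resolved.append(src)          # keep this dimension as it is
        elif src is None or src == 1 or src == dim:
            resolved.append(dim)          # new, broadcast, or already equal
        else:
            raise ValueError(f"cannot broadcast dimension {src} to {dim}")
    return tuple(resolved)


shape_a = resolve_broadcast_shape((3, 1), (None, 4))
shape_b = resolve_broadcast_shape((1, 5), (2, None, 5))
print(shape_a, shape_b)
```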

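Release process

Fix graph surgery failing when a TM model serialized by MegEngine v1.7 is loaded by MegEngine v1.8.
TracedModule bug fixes:
- Fix ops in third-party backends being impossible to serialize.
- Fix Input-type exprs not being bound to top_graph.
- Fix the insertion position of an expr being computed incorrectly when a ModuleNode is used as input in graph surgery.
- Fix TracedModule being unable to run models from v1.7 and earlier that contain ones or zeros.
- Fix TracedModule recursing too deeply in some cases.
- Fix TracedModule being unable to trace repeatedly.
- Fix TracedModule failing to recognize pad correctly.
- Improve TracedModule's error messages for invalid inputs.
Fix an error when global graph optimization and fastrun are both enabled and the only selected algorithm is naive.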

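CUDA

Check earlier whether the input Tensor is too large and improve the error message, instead of surfacing the raw cuDNN error.
Fix incorrect output inference when tensorrt changes shape.

Common components

Fix the ConvBias operator of the MegDNN fallback being unavailable.
Fix matmul and pooling in a model being unable to use fastrun after graph optimization.
Fix MegEngine failing to build in low-memory environments (8G).
Fix extra memory being used when converting a large numpy array to a tensor or a large tensor to a numpy array.
Add checking and reporting of asynchronous errors on compute devices.
Fix indexing operations being untraceable when the ndim of the tensor is unknown.

Peripheral tools

Fix data entered on the load and run command line being unparsable.
Fix dump errors for the qint4 and bool data types in io dump.
Fix megengine.utils.module_stats being unusable because related libraries were not imported.
Fix an error when building load and run with cuda.
Remove the dump_with_testcase tool.
Fix load and run being unable to recognize models serialized with flatbuffer.
Fix the parameter and computation statistics tool module_stats being unable to collect statistics when inputs is a dict.
Fix incorrect weight information reported by the load and run tool with the --get-static-mem-info option.
Fix a parameter parsing error in the load_and_run tool for options such as --input "ratio:1.0".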

New Features

Python API

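Add the megengine.functional.diag operator.

Release process

TracedModule supports renaming Nodes during graph surgery.
TracedModule provides an enable_expr_checker switch to perform more checks during tracing.

ARM

Optimize the implementation of some math computations on Arm, for a slight performance gain.
The ARM backend supports the rnn_cell/lstm_cell/lstm operators.
Add multithreading support for some elemwise cases, to support performance optimization of some models in the TS project.

Third-party hardware

Add support for Cambricon MLU270.
TensorRT Runtime Opr supports dynamic-shape models and can actively select the closest IOptimizationProfile according to the input shape.

Common components

CPU supports running int4 models.
megengine.functional.nn.remap supports differentiation with float16 dtype.
Optimize typecvt performance for non-contiguous inputs.
Add on-device training support; refer to the linked documentation for details.
On Windows, load_and_run adds support for dynamically linking MegEngine.

Peripheral tools

Add a cmake formatting tool that formats cmake-related files when run.
Custom Op adds a JIT build tool; documentation is to be added.
Support building Android whl packages.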

Improvements

Python API

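Optimize the elemwise overhead of the megengine.random.RNG.uniform API when low=0 and high=1; single-operator performance improves by about 75%.

CUDA

Improve the performance of megengine.functional.nn.softmax on the CUDA platform by about 200%~450% when axis is a scalar.
Improve the performance of megengine.functional.nn.dropout on the CUDA platform by up to about 650%.
Improve the performance of megengine.functional.nn.layer_norm on the CUDA platform by up to about 540%.

ARM

When a tensor needs an int16/uint16 → float conversion and the converted data then goes through Mul/ADD, fuse the operations into ElemwiseMultiType. Verified on model 369 of project 010, performance improves by about 20x (23512.8 us → 1208 us).

Common components

Improve dynamic AMP performance, by about 1% as verified on several models.
Optimize the time of jit.trace in a cpu environment. Verified with the VGG16 model at bs 256, jit.trace drops from about 4 minutes to 2 minutes.
Fix models running too slowly on cpu; verified on VGG16 at bs 10, the time drops from 10 minutes to about 6 s.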

MegEngine Lite

Bug fixes

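Fix the device id being set incorrectly in TensorBatchCollector in lite.
Add the data type information of the output Tensor to the to_numpy method of empty tensors in Lite.
Fix inference failing for some models when the user customizes the model output space.
Fix the device configuration interface of MegEngine Lite so that it only sets devices whose device type is xpu to the user-specified device type.
Fix the MegEngine Lite python interface printing no error log when a batch id in TensorBatchCollector is wrong.
Fix an error when MegEngine Lite enables "record level 2".

New Features

Add Cambricon support to lite.
MegEngineLite adds an interface named get_data_by_share, through which users can obtain the numpy object of a lite tensor with zero copy.
Add cv classification and detection examples.
Add support for global graph optimization.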

MegEngine

Notice

Drop support for python3.5 from MegEngine v1.9.

HighLight

megengine.functional.topk will default to descending order in v1.12. Please specify the "descending" argument during the transition period.
MegEngine supports on-device training; refer to the linked documentation for usage.
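
The effect of the "descending" argument can be shown with a pure-Python sketch of top-k selection on a flat list. This only illustrates the ordering semantics described above; it is not MegEngine's implementation, and the (values, indices) return layout here is chosen for the example.

```python
def topk(values, k, descending=False):
    """Return the k smallest (or largest, if descending) values and their indices."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=descending)[:k]
    return [values[i] for i in order], order


scores = [3, 1, 4, 1, 5]
smallest, smallest_idx = topk(scores, 2)                  # ascending order
largest, largest_idx = topk(scores, 2, descending=True)   # descending order
print(smallest, largest)
```

Passing the argument explicitly keeps results stable across the planned change of default.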

Bug fixes

Python API

Correct behavior of megengine.functional.floor_div for integers with opposite sign.
Allow passing None to megengine.functional.broadcast_to , meaning the corresponding axis should not broadcast.
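
Opposite-sign inputs are the case where floor division and truncating division disagree, which is why they are easy to get wrong. The notes do not say what the incorrect result was; the sketch below only shows the difference between the two rounding rules, with trunc_div as a hypothetical helper.

```python
def trunc_div(a, b):
    """Integer division rounding toward zero, as C-style integer division does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


floor_result = -7 // 2            # Python's // rounds toward negative infinity
trunc_result = trunc_div(-7, 2)   # rounds toward zero instead
same_sign_agree = (7 // 2) == trunc_div(7, 2)
print(floor_result, trunc_result, same_sign_agree)
```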

Release process

Fix a compatibility issue with TracedModule.
TracedModule bug fixes:
- Fix the problem that ops in third-party backend such as tensorrt can not be serialized.
- Fix Input exprs failing to bind to top_graph.
- Fix the problem of incorrect calculation of expr insertion position when ModuleNode is used as input of graph operation.
- Fix a bug of v1.7: the model with ones or zeros can't work.
- Fix a recursion too deep issue when copying traced module.
- Fix an error that prevents traced module from tracing a module more than once.
- Fix traced module not recognizing pad.
- Improve error message for illegal inputs feed into traced module.
Fixed the problem that when global graph optimization and fastrun are enabled at the same time, an error will be reported when the selected algorithm is only naive.

CUDA

Check earlier whether the input Tensor is too large and improve the error message, instead of surfacing the raw cuDNN error.
Fixed output derivation error when tensorrt changed shape.

Common components

Fix the problem that the ConvBias operator of MegDNN fallback is not available.
matmul, pooling operators support fastrun, which will lead to better inference performance for C++ models.
Fix the build failure in low-memory environments (8G).
Reduce memory consumption when a large numpy array is converted to tensor or a large tensor is converted to numpy array
Add out-of-bound access check for some operators.
Fix the problem that the indexing operation cannot be traced when the ndim of the tensor is unknown.

Peripheral tools

Fixed the problem that the data entered in the load and run command line could not be parsed.
Fix qint4 and bool data type dump errors in io dump.
Fix the problem that megengine.utils.module_stats cannot be used without import related libraries.
Fix load and run build error when build with CUDA.
Remove dump_with_testcase tool.
Fix the problem that load and run cannot recognize the serialized model with flatbuffer.
Fix a bug in megengine.tools.network_visualize when inputs is an instance of dict.
Fix wrong statistics being reported when using --get-static-mem-info.
Fix a parsing error in load_and_run for options such as --input "ratio:1.0".

New Features

Python API

Add megengine.functional.diag operator.

Release process

Support modifying the name of a node during graph surgery in TracedModule.
Add an enable_expr_checker switch for traced module, which adds more checks during tracing.

ARM

Optimize the implementation of some mathematical calculations in arm, the performance is slightly improved.
Add arm rnn_cell/lstm_cell/lstm operator.
Support part of arm ternary elemwise multithread.

Third-party hardware

Added support for cambricon MLU270.
Support dynamic-shape models in TensorRT Runtime Opr and automatically select the closest IOptimizationProfile according to the input shape.

Common components

CPU supports running int4 model.
Support backward computation for float16 dtype in remap.
Optimize the performance of typecvt in non-continuous situations.
Add training based on the cpp interface.
On Windows, load_and_run now supports dynamically linking megengine.

Peripheral tools

Added a cmake formatting tool: cmakeformat.py.
Add the JIT builder for Custom Op.
Support build python wheel for Android(termux env).

Improvements

Python API

Add fastpath when low=0 and high=1 for megengine.random.RNG.uniform to improve performance.

CUDA

Improve performance of softmax when axis is scalar on CUDA platforms, by 200% - 450%.
Enhance performance of dropout on CUDA platforms by up to 650%.
Enhance performance of layer_norm on CUDA platforms, by up to 540%.

ARM

Add an operator fusion case for TypeCvt and Elemwise: a pass fuses a TypeCvt (uint16 to float) operator and one Elemwise operator (Mul/ADD) into an ElemwiseMultiType operator, with a corresponding kernel developed for aarch64.

Common components

Improve dynamic AMP performance, by about 1% as verified on several models.
Optimize the placement order of algorithms in matrixmul under the x86 platform in dnn to improve the dump time of jit.trace(bs256 VGG16, 4min -> 2min).
Fix the problem that the model speed on CPU is too slow (bs10 VGG16,10min -> 6s).

MegEngine Lite

Bug fixes

Fix the device ID setting error of tensorbatchcollector in lite.
Add data type information when call empty tensor to_numpy method.
Fix the problem that some model inferences fail when users customize the output space of the model.
Fix device type configuration for megengine lite. Now only the devices of which the device type is unspecified will be modified.
Add warning for megengine lite python interface, when error of batch indexes occurs in the TensorBatchCollector.
Fix runtime error when record level of megengine lite is 2.

New Features

Add interface for cambricon models in lite.
Add a new interf...

Releases: MegEngine/MegEngine

MegEngine v1.11.1


MegEngine v1.11.0


MegEngine v1.10.0
