Releases · MegEngine/MegEngine
MegEngine v1.1.0
Bug Fixes
- Fix static shape inference in trace to allow training larger models
- Fix the execution order of I/O and operators in trace to avoid deadlock
- Fix cd4 conversion errors when group=1 in convolution and in some elemwise cases
- Fix shape mismatch at inference time when the bias shape is fixed in the fuse conv bias optimization pass
- Fix abnormal results of the LOG mode of elemwise computed with MKL
- Fix error handling when load_and_run --input does not specify the correct input name for a single input
New Features
- Support representation of scalar-type tensors
- Enable users to control error checking during asynchronous execution via the async_level parameter
- Add operators including group_norm, instance_norm, layer_norm, conv1d, and remap
- Use weak references for tensors in GradManager.attach (see the training-loop sketch below)
- Support distributed quantization-aware training
- Release the original weight memory during inference after weight preprocessing
- Fully support Elemwise and DimShuffle operators in the MLIR JIT backend
- Add DCT operator support in CV
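For context on the attach change, a minimal sketch of a GradManager-based training loop in the MegEngine 1.x Python API (the weak-reference behavior of attach is internal and does not change the loop itself):

```python
import numpy as np
import megengine as mge
import megengine.module as M
import megengine.optimizer as optim
from megengine.autodiff import GradManager

net = M.Linear(4, 2)
gm = GradManager().attach(net.parameters())  # attach now holds weak references
opt = optim.SGD(net.parameters(), lr=0.01)

x = mge.tensor(np.random.rand(8, 4).astype("float32"))
target = mge.tensor(np.zeros((8, 2), dtype="float32"))
with gm:                                     # record ops for autodiff
    loss = ((net(x) - target) ** 2).mean()
    gm.backward(loss)                        # accumulate grads on attached params
opt.step()
opt.clear_grad()
```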
Optimization
- Reduce host overhead for operators including batch normalization, elementwise, and broadcast
- Improve performance of the step function in optimizers
- Improve quantized training performance
- Optimize arm64 int8X8X16_mk4_k8x8x8 matmul operator
Compatibility Violations
- None
MegEngine v1.0.0
New Features
- Add bool dtype support for IndexingMultiAxisVec
- Add Windows and macOS packaging capabilities
- Add adaptive pooling opr (see the sketch after this list)
- Add llvm-lit support for the MLIR JIT
- Add a weights-preprocess option to control whether preprocessed weights are cached during the inference phase to improve performance
- Add macro MGB_USE_ATLAS_ASYNC_API to control whether asynchronous API calls are enabled
- Update dtype promotion rules to match user expectations
- Add parameter checks to the tensor broadcast method
- Update the device repr method to show physical placement
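A minimal sketch of adaptive pooling; the functional-level name adaptive_avg_pool2d and its output-shape argument are assumptions based on the 1.x Python API:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.random.rand(2, 3, 32, 32).astype("float32"))
# Adaptive pooling fixes the output spatial size regardless of input size.
y = F.adaptive_avg_pool2d(x, (7, 7))
print(y.shape)  # (2, 3, 7, 7)
```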
Bug Fixes
- Fix cpuinfo compilation warnings under ARM Linux
- Fix a hang in the Imperative Runtime caused by an exception not being set correctly on error
- Fix crash when enabling --output-strip-info in dump_with_testcase if the file does not exist
- Fix the NCHW → NCHW4 pass when handling float types
- Handle the deconv opr in the float-to-io16c32 pass
- Disable thread_local support on iOS due to an Xcode issue
- Fix a refcnt counting problem in ParamPackSplit during multi-machine training
- Fix a crash when multiple models use the same compnode with the record feature enabled under multithreading
- Fix the NCHW → NCHWxx pass when processing conv_bias oprs with an empty bias
- Fix jit.trace so that an error no longer makes subsequent traces completely unusable
- Fix bool.sum()
- Fix graph binding error handling that caused the graph to be collected incorrectly
- Fix topk/warp/nms ops when using jit.trace
- Fix group support for the LocalConv2d operator
- Fix a bug of optimize_for_inference in dump
- Fix bugs of NMSKeep, topk, and warp_perspective during trace
Compatibility Violations
- Adjust names, parameters, or import paths of some functional APIs; remove duplicated APIs
MegEngine v1.0.0-rc1
Highlights
- A brand-new Imperative Runtime: the dynamic execution engine has been rewritten, removing the dynamic-graph limitations of past versions, fixing a series of resource-release issues, and greatly increasing dynamic flexibility, so that computing on GPU feels as natural as NumPy. Combined with stronger dynamic-to-static conversion, the more capable dynamic training code can still be trained statically. Tracing is more powerful and transparent: the Python code still runs on every call. Dumping is more reliable, with validation to ensure no spurious branch logic is recorded into the serialized format, so users get the benefits of both dynamic training and static inference at low cost. (A minimal trace-and-dump sketch follows the install command below.)
- midout code trimming: users can automatically trim code to the operators their own network uses, minimizing the inference binary size without manual configuration and greatly improving on-device inference efficiency.
- A series of inference performance and feature improvements: faster code compilation; Winograd algorithm speedups; GEMM performance optimization on small ARM cores such as A7 / A55 / A53; performance optimization of 8x8x16 channel-wise conv and 8x8x16 direct conv under NCHW44; First Conv optimization under NCHW44; a fix for the Atlas H2D copy issue; restored CUDA9 support; a Winograd F(7,3) implementation under NCHW44; cpuinfo-based CPU architecture detection so users no longer need to specify the architecture in advance; finer-grained trimming of framework code, which together with midout can greatly reduce binary size; CMake refactoring; and a fix for a segfault in ShuffleShuffleRemovePass on certain inputs.
- More domestic hardware support: mainstream domestic chips and ROCm can now be integrated, enabling inference on domestic NPU chips.
- MLIR integration: MLIR serves as the static-graph JIT fusing engine, fusing consecutive memory-bound operators; MLIR is also used as a tool to improve the maintainability of the framework code.
- A newly designed, more Pythonic API with clearer and more reliable control.
- Native support for Windows / Mac, with PyPI packages released for Linux / Windows / Mac.
- Early access:
pip3 install megengine==1.0.0rc1 -f https://megengine.org.cn/whl/mge.html -i https://pypi.douban.com/simple/
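A minimal sketch of the trace-and-dump workflow described above; the trace arguments and dump signature shown here follow the 1.x Python API and are illustrative rather than authoritative for this release candidate:

```python
import numpy as np
import megengine as mge
import megengine.functional as F
from megengine.jit import trace

@trace(symbolic=True, capture_as_const=True)  # static-graph trace of a Python function
def pred(x):
    return F.softmax(x)

# Run once with real data so the trace is recorded, then serialize for inference.
pred(mge.tensor(np.random.rand(1, 10).astype("float32")))
pred.dump("model.mge", arg_names=["data"])
```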
Improvements
- Reduce host-side overhead and improve performance by adding a CUDA pinned-memory pool, which cuts the number of cudaHostAlloc calls.
- Further optimize int8 and int8x8x16 performance on small cores such as A7 / A53 / A55.
- Add Winograd F(7,3), bringing a 10%-30% speedup on some networks (currently disabled by default due to precision issues).
- Optimize CUDA device memory allocator performance under a single stream.
Bug Fixes
- Fix MegEngine cross compilation.
- Fix the format not being set correctly when fastrun searches Winograd algorithms for conv NCHW88.
- Fix out-of-bounds memory access in some ARM kernels.
- Fix cpuinfo compilation under uclibc.
- Fix some backends falling back to the naive algorithm for certain special shapes that were left unoptimized (by rewriting nchwxx → dimshuffle + nchw in an optpass).
- Fix Cambricon compilation.
- Fix models failing to dump because the x86 im2col algorithm does not support asymmetric quantization.
- Fix a crash when fastrun writes its cache, caused by incorrect snprintf usage.
- Fix test cases failing to run due to stack overflow on some platforms (EV300).
- Fix conflicts caused by repeated registration of multi-machine oprs with the same group and the same name.
MegEngine v0.6.0
New Features
- Add matidx support to the NHWC warp perspective operator (see the sketch after this list).
- Add remap operator support for CUDA.
- Support compiling whl packages for iOS.
- MegEngine quantized models support TensorCore acceleration.
- Add replica_mode to Parameter to specify whether synchronization is required.
- Add local_grad parameter to collective_comm operators.
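A rough sketch of a batched warp perspective call; the functional name warp_perspective and the mat_idx parameter are assumptions here (the release note above concerns the NHWC variant), so treat this as illustrative only:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

img = mge.tensor(np.random.rand(1, 3, 32, 32).astype("float32"))
# Two 3x3 transform matrices applied to the same source image.
mats = mge.tensor(np.tile(np.eye(3, dtype="float32"), (2, 1, 1)))
# mat_idx (assumed parameter) maps each matrix to a batch index of the input,
# so the number of matrices may differ from the input batch size.
out = F.warp_perspective(img, mats, (16, 16), mat_idx=mge.tensor([0, 0]))
print(out.shape)  # (2, 3, 16, 16)
```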
Optimization
- Continue to optimize NCHW44 performance on CPU, improving production models by 5%-30%.
- Add more midout support to further reduce binary size.
Bug Fixes
- Fix slow execution of megbrain compiled with vs2015.
- Fix an occasional memory allocation problem on the CPU side where free < total.
- Fix a performance problem caused by the GCC compiler being unable to inline small functions on ARM Linux.
- Fix a computation error in the CUDA warp perspective operator when batch * img_size exceeds INT_MAX.
- Fix a CUDA elemwise calculation error in the int8 broadcast case (does not affect current CUDA NCHW4 models).
- Fix the indexing logic of the psroi_pooling operator.
- Fix several JIT gradient errors.
- Fix a compilation problem under GCC7.
- Fix some converter bugs of NCHW → NCHWxx.
- Fix gradient computation of reduce and gather.
- Fix the FBS model format failing to load models containing multiple graphs (does not affect the internal MDL model format).
- Fix a possible undefined reference issue with the warp perspective operator when midout is enabled.
- Fix sensitive words introduced by enabling exceptions.
- Fix incorrect contiguous id generation caused by out-of-order categories in annotations.
- Fix incorrect destruction behavior of MGE_PLASMA_STORE_MANAGER when multiple dataloader instances exist in a process.
- Fix loading of quantized int8 pkl models.
- Fix the docstring of nn.flatten. @ChaiMind
Thanks to our Contributors
- Many thanks to @ChaiMind for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.5.1
New Features
- Insert nchw->nchw4 pass before tensorcore pass.
- NCHW→NCHW4 pass supports ops such as Pooling/WarpPerspective/Resize.
- Pretrained int8 models can now be loaded and then dumped.
Bug Fix
- Fix MGE_PLASMA_STORE_MANAGER being destroyed incorrectly when multiple dataloader instances exist in a process.
- Allow FakeQuantize and Observer to use different qmin for weights and activations, avoiding numerical overflow in extreme cases.
- Fix implementation mistakes in mgb.opr.deformable_psroi_pooling.
- Fix CUDA int8 NCHW4 support for channels fewer than 4.
- Fix a typo in the network building docs. @ztjryg4
Thanks to our Contributors
- Many thanks to @ztjryg4 for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.5.0
New Features
- Enable CUDA algos for NCHW quantized data.
- Update conv1x1 to support gemv.
- NCHW4 layout is now supported on CPU (X86, ARM).
- Optimized nchw44-dot layout is available for the Armv8.2-a+dotprod instruction set.
- nchw44 is incorporated to optimize float-typed computation, including direct convolution, channel-wise convolution, hybrid-layout convolution, winograd, and mk4-matmul, along with algorithm optimization of pooling and elemwise.
- Reorganize graph optimization so that common conversions are supported in both the runtime and dump phases.
- Synchronized BN statistics are now available for multi-device training.
- PackAllReducePass is introduced into graph optimization for multi-device training; it packs parameters that need AllReduce, reducing the number of inter-device communications.
- The Calibration quantization training interface is now available.
- QAT quantization training updates: ConvBn can now perform the composed operation of BN and fake quantization/observer; quantization is enabled for Conv and Linear; quantize_qat can skip Modules as needed.
- API adjustments: F.eye is moved from functional/nn.py to core/tensor_factory.py; F.add_axis and F.remove_axis now accept only int axes and no longer allow list axes (see the sketch after this list).
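A small sketch of the stricter axis arguments, using the 0.5-era names F.add_axis / F.remove_axis (later versions rename these; the exact error raised for a list axis is not shown):

```python
import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.zeros((2, 3), dtype="float32"))
y = F.add_axis(x, 0)        # OK: int axis, result shape (1, 2, 3)
z = F.remove_axis(y, 0)     # OK: int axis, back to (2, 3)
# F.add_axis(x, [0, 1])     # no longer accepted: list axes are rejected
```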
Bug Fix
- The HSwish activation function is supported in the FuseConvBiasWithZ pass, and QFUSE_ADD_H_SWISH is folded into the conv bias operator to enhance performance.
- Fix the CUDA 'invalid parameter' error raised by cuda-TopK when batch exceeds 65535, which violates the grid y-dimension limit.
- Drop the path restriction of libcuda.so in cuda-stub.
- Fix a performance problem caused by conv1x1 mistakenly using the is_prefered method of its base class.
- Fix ConvDirectUnrollBuffer, where data fetched when loading src could unexpectedly become 0 (inserting printf statements or removing the loop unrolling optimization avoided the issue).
- Fix an issue where an endpoint would be replaced twice when paramfuse was processing midconst.
- Fix a ReorderArithChainPass bug in gopt present since 8.3.0 (inclusive).
- Fix cond op not supporting empty shapes.
- Fix SetMeshIndexing when indexing with multiple axes.
- Fix a typo in assert locator.device < sd.MAX_NR_DEVICE in CompNode. @zjd1988
- Fix typos in voc and objects365.
- Fix an incorrect class name in voc.
- Fix default_comp_graph usage of Tensor.
- Fix a graph optimization failure when saved_tensors in Function cannot be copied in a static graph.
- Fix the API documentation of scatter to avoid errors on GPU.
- Fix issues of unused var ins.
- Fix the non-str key exception in Module fields.
- Fix scale and zero_point being unexpectedly updated in eval mode for models trained with QAT.
- Disable image algorithms on all Mali-series machines.
Thanks to our Contributors
- Many thanks to @zjd1988 for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.4.0
New Features
- Add cmake cross build CI
- Add inference cross build support
- Add compnode multithread in Python
- Support copying DeviceTensorND from CPU to CUDA
- Add a profiler for dynamic graphs
- Expose sublinear related parameters at the MGE API level
- Add objects365 dataset
- VOC dataset supports detection
- Add arange, conv_transpose2d, isnan, isinf, normalize oprs (see the sketch after this list)
- Support negative axis in math.py
- Add lower and upper bounds for uniform sampling
- Add a negative slope attribute for the LeakyReLU module
- Add user-defined keys to pack params
- Add quantization interface
- Support NCHW44 layout for ARM, with optimized operators that are faster than NCHW
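A quick sketch exercising a few of the new oprs; the exact signatures (the arange argument order and the LeakyReLU slope argument) are assumptions for the 0.4-era API:

```python
import numpy as np
import megengine as mge
import megengine.functional as F
import megengine.module as M

x = F.arange(0, 6, 1)                                          # new arange opr
nan_mask = F.isnan(mge.tensor(np.array([np.nan, 1.0], dtype="float32")))
leaky = M.LeakyReLU(0.1)                                       # new negative-slope attribute
y = leaky(mge.tensor(np.array([-1.0, 2.0], dtype="float32")))  # -> [-0.1, 2.0]
```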
Bug Fix
- Fix cublas matmul on sm60
- Fix im2col thread safety problem
- Fix gcc5 -O0 compiler error
- Return -1 when create_mm_server fails
- Fix the grad function of reduce mean
- Fix boundary handling in F.clamp
- Fix a NaN bug in the COCO dataset
- Fix batch_norm2d and linspace docs
- Add more warnings in load_state_dict
- Release GPU memory of old parameters
- F.where supports empty x or y
- Fix one_hot: no default value for num_classes
- Fix wrong class name in voc
- Fix the momentum equation in the batch norm doc
Thanks to our Contributors
Many thanks for the valuable feedback and suggestions from developers: clhne, cydiachen, daa233, DaHaiHuha, junjie18, KaiyuYue, mgno32, xiaoguai0992. Some of it is addressed in this release; the rest will follow in upcoming releases.
We would also like to thank everyone who asked questions.
MegEngine v0.3.4
Enhancement
- Expose sublinear related parameters at the MGE API level
- Add an example using the modified sublinear API
Bug Fix
- Fix the multi-machine macro and add test_distributed
MegEngine v0.3.2
Enhancement
- Speed up parameter initialization by using MegEngine's random distribution generator function instead of NumPy.
Bug Fix
- Support scalar inputs in element-wise functions
- Remove internal code from the load_and_run README.md and add the missing example file xornet.py
- Move the collator into data_collection_process, so no additional collate logic is needed in divide mode
- Remove the strict attribute inside _ParallelDataLoaderIter, and throw an exception directly when any worker raises one
- Replace internal random number generation from np.random.RandomState with the megengine.random.rng module to ensure reproducibility of data supply
- Fix ImageNet extraction error
- Fix DataLoader multi-process error under Python 3.7 (TypeError: can't pickle weakref objects)
- Fix typo InternalGraphGenerator
- Fix docstrings for batchnorm, loss, transform, and tensor
MegEngine v0.3.1: Hello World!
New Features:
- An auto-differential numerical framework for Tensors, with Imperative and Tracing modes.
- Various Tensor-based operators.
- Unified code for both Imperative and Tracing modes via the @jit.trace decorator.
- Module-based APIs to build neural networks, with save/load methods for parameters (see the sketch after this list).
- Common mathematical functions in the megengine.functional package.
- Basic modules in the megengine.module package.
- High-performance operators for X86 and CUDA:
- Support for instruction sets such as SSE4, VNNI, etc.
- Broad support for NVIDIA GPUs.
- Automatic kernel selection by profiling.
- A DataLoader for loading and preprocessing model data.
- A hub module for loading online models and pre-trained models.
- trace.dump() for model serialization, plus sample C++ code for loading a model and running inference.
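A minimal sketch of the Module-based workflow with weight persistence, assuming the megengine.save / megengine.load helpers and 0.3-era module names:

```python
import megengine as mge
import megengine.functional as F
import megengine.module as M

class TinyNet(M.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = M.Linear(4, 8)
        self.fc2 = M.Linear(8, 2)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

net = TinyNet()
mge.save(net.state_dict(), "tiny.pkl")      # persist weights
net.load_state_dict(mge.load("tiny.pkl"))   # restore them later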
Experimental Features:
- Import PyTorch Modules.
Known Issues:
- Memory usage and performance need to be optimized for Imperative mode.
- Memory consumption might be high in the Imperative mode.
- Dynamically created Tensors in Imperative mode cannot be released automatically for now; the user has to call the set_value() method to reuse Tensors and avoid growing memory usage.
- Some operators, such as PyTorchModule and megengine.Function, will keep allocating more memory.
- Tracing related issues.
- Traced functions cannot be further differentiated with backward().
- Complex nesting of jit.trace and jit.sideeffect may result in undefined behavior.
- Performance issues.
- The step() of the Adam optimizer is slow.
- Random initialization for parameters is slow.
Next Plans:
- Improving memory consumption and speed of the Imperative mode.
- Supporting more devices, e.g., ARM.
- More operators.
- Providing more docs, CI, build tools, and other supporting infrastructure.