Releases · MegEngine/MegEngine
MegEngine v1.1.0
Bug Fixes
- Fix static shape inference in trace to allow training larger models
- Fix the execution order of I/O and operators in trace to avoid deadlock
- Fix cd4 conversion errors when group=1 in convolution and in some elemwise cases
- Fix shape mismatch at inference time when the bias shape is fixed in the fuse conv bias optimization pass
- Fix abnormal results of the LOG mode of elemwise computed with MKL
- Fix error handling when load_and_run --input does not specify the correct input name for a single input
New Features
- Support representation of scalar-type tensors
- Enable users to control error checking during asynchronous execution via the async_level parameter
- Add operators including group_norm, instance_norm, layer_norm, conv1d, and remap
- Use weak references for tensors in GradManager.attach (see the training-loop sketch below)
- Support distributed quantization-aware training
- Release the original weight memory during inference after weight preprocessing
- Fully support Elemwise and DimShuffle operators in the MLIR JIT backend
- Add DCT operator support in CV
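For context on the attach change, a minimal sketch of a GradManager-based training loop in the MegEngine 1.x Python API (the weak-reference behavior of attach is internal and does not change the loop itself):

```python
import numpy as np
import megengine as mge
import megengine.module as M
import megengine.optimizer as optim
from megengine.autodiff import GradManager

net = M.Linear(4, 2)
gm = GradManager().attach(net.parameters())  # attach now holds weak references
opt = optim.SGD(net.parameters(), lr=0.01)

x = mge.tensor(np.random.rand(8, 4).astype("float32"))
target = mge.tensor(np.zeros((8, 2), dtype="float32"))
with gm:                                     # record ops for autodiff
    loss = ((net(x) - target) ** 2).mean()
    gm.backward(loss)                        # accumulate grads on attached params
opt.step()
opt.clear_grad()
```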
Optimization
- Reduce host overhead for operators including batch normalization, elementwise, and broadcast
- Improve performance of the step function in optimizers
- Improve quantized training performance
- Optimize arm64 int8X8X16_mk4_k8x8x8 matmul operator
Compatibility Violations
- None
MegEngine v1.0.0
New Features
- Add bool dtype support for IndexingMultiAxisVec
- Add Windows and macOS packaging capabilities
- Add adaptive pooling opr (see the sketch after this list)
- Add llvm-lit support for the MLIR JIT
- Add a weights-preprocess option to control whether preprocessed weights are cached during the inference phase to improve performance
- Add macro MGB_USE_ATLAS_ASYNC_API to control whether asynchronous API calls are enabled
- Update dtype promotion rules to match user expectations
- Add parameter checks to the tensor broadcast method
- Update the device repr method to show physical placement
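A minimal sketch of adaptive pooling; the functional-level name adaptive_avg_pool2d and its output-shape argument are assumptions based on the 1.x Python API:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.random.rand(2, 3, 32, 32).astype("float32"))
# Adaptive pooling fixes the output spatial size regardless of input size.
y = F.adaptive_avg_pool2d(x, (7, 7))
print(y.shape)  # (2, 3, 7, 7)
```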
Bug Fixes
- Fix cpuinfo compilation warnings under ARM Linux
- Fix a hang in the Imperative Runtime caused by an exception not being set correctly on error
- Fix crash when enabling --output-strip-info in dump_with_testcase if the file does not exist
- Fix the NCHW → NCHW4 pass when handling float types
- Handle the deconv opr in the float-to-io16c32 pass
- Disable thread_local support on iOS due to an Xcode issue
- Fix a refcnt counting problem in ParamPackSplit during multi-machine training
- Fix a crash when multiple models use the same compnode with the record feature enabled under multithreading
- Fix the NCHW → NCHWxx pass when processing conv_bias oprs with an empty bias
- Fix jit.trace so that an error no longer makes subsequent traces completely unusable
- Fix bool.sum()
- Fix graph binding error handling that caused the graph to be collected incorrectly
- Fix topk/warp/nms ops when using jit.trace
- Fix group support for the LocalConv2d operator
- Fix a bug of optimize_for_inference in dump
- Fix bugs of NMSKeep, topk, and warp_perspective during trace
Compatibility Violations
- Adjust names, parameters, or import paths of some functional APIs; remove duplicated APIs
MegEngine v1.0.0-rc1
Highlights
- A brand-new Imperative Runtime: the dynamic execution engine has been rewritten, removing the dynamic-graph limitations of past versions, fixing a series of resource-release issues, and greatly increasing dynamic flexibility, so that computing on GPU feels as natural as NumPy. Combined with stronger dynamic-to-static conversion, the more capable dynamic training code can still be trained statically. Tracing is more powerful and transparent: the Python code still runs on every call. Dumping is more reliable, with validation to ensure no spurious branch logic is recorded into the serialized format, so users get the benefits of both dynamic training and static inference at low cost. (A minimal trace-and-dump sketch follows the install command below.)
- midout code trimming: users can automatically trim code to the operators their own network uses, minimizing the inference binary size without manual configuration and greatly improving on-device inference efficiency.
- A series of inference performance and feature improvements: faster code compilation; Winograd algorithm speedups; GEMM performance optimization on small ARM cores such as A7 / A55 / A53; performance optimization of 8x8x16 channel-wise conv and 8x8x16 direct conv under NCHW44; First Conv optimization under NCHW44; a fix for the Atlas H2D copy issue; restored CUDA9 support; a Winograd F(7,3) implementation under NCHW44; cpuinfo-based CPU architecture detection so users no longer need to specify the architecture in advance; finer-grained trimming of framework code, which together with midout can greatly reduce binary size; CMake refactoring; and a fix for a segfault in ShuffleShuffleRemovePass on certain inputs.
- More domestic hardware support: mainstream domestic chips and ROCm can now be integrated, enabling inference on domestic NPU chips.
- MLIR integration: MLIR serves as the static-graph JIT fusing engine, fusing consecutive memory-bound operators; MLIR is also used as a tool to improve the maintainability of the framework code.
- A newly designed, more Pythonic API with clearer and more reliable control.
- Native support for Windows / Mac, with PyPI packages released for Linux / Windows / Mac.
- Early access:
pip3 install megengine==1.0.0rc1 -f https://megengine.org.cn/whl/mge.html -i https://pypi.douban.com/simple/
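A minimal sketch of the trace-and-dump workflow described above; the trace arguments and dump signature shown here follow the 1.x Python API and are illustrative rather than authoritative for this release candidate:

```python
import numpy as np
import megengine as mge
import megengine.functional as F
from megengine.jit import trace

@trace(symbolic=True, capture_as_const=True)  # static-graph trace of a Python function
def pred(x):
    return F.softmax(x)

# Run once with real data so the trace is recorded, then serialize for inference.
pred(mge.tensor(np.random.rand(1, 10).astype("float32")))
pred.dump("model.mge", arg_names=["data"])
```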
Improvements
- Reduce host-side overhead and improve performance by adding a CUDA pinned-memory pool, which cuts the number of cudaHostAlloc calls.
- Further optimize int8 and int8x8x16 performance on small cores such as A7 / A53 / A55.
- Add Winograd F(7,3), bringing a 10%-30% speedup on some networks (currently disabled by default due to precision issues).
- Optimize CUDA device memory allocator performance under a single stream.
Bug Fixes
- Fix MegEngine cross compilation.
- Fix the format not being set correctly when fastrun searches Winograd algorithms for conv NCHW88.
- Fix out-of-bounds memory access in some ARM kernels.
- Fix cpuinfo compilation under uclibc.
- Fix some backends falling back to the naive algorithm for certain special shapes that were left unoptimized (by rewriting nchwxx → dimshuffle + nchw in an optpass).
- Fix Cambricon compilation.
- Fix models failing to dump because the x86 im2col algorithm does not support asymmetric quantization.
- Fix a crash when fastrun writes its cache, caused by incorrect snprintf usage.
- Fix test cases failing to run due to stack overflow on some platforms (EV300).
- Fix conflicts caused by repeated registration of multi-machine oprs with the same group and the same name.
MegEngine v0.6.0
New Features
- Add matidx support to the NHWC warp perspective operator (see the sketch after this list).
- Add remap operator support for CUDA.
- Support compiling whl packages for iOS.
- MegEngine quantized models support TensorCore acceleration.
- Add replica_mode to Parameter to specify whether synchronization is required.
- Add local_grad parameter to collective_comm operators.
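A rough sketch of a batched warp perspective call; the functional name warp_perspective and the mat_idx parameter are assumptions here (the release note above concerns the NHWC variant), so treat this as illustrative only:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

img = mge.tensor(np.random.rand(1, 3, 32, 32).astype("float32"))
# Two 3x3 transform matrices applied to the same source image.
mats = mge.tensor(np.tile(np.eye(3, dtype="float32"), (2, 1, 1)))
# mat_idx (assumed parameter) maps each matrix to a batch index of the input,
# so the number of matrices may differ from the input batch size.
out = F.warp_perspective(img, mats, (16, 16), mat_idx=mge.tensor([0, 0]))
print(out.shape)  # (2, 3, 16, 16)
```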
Optimization
- Continue to optimize NCHW44 performance on CPU, improving production models by 5%-30%.
- Add more midout support to further reduce binary size.
Bug Fixes
- Fix slow execution of megbrain compiled with vs2015.
- Fix an occasional memory allocation problem on the CPU side where free < total.
- Fix a performance problem caused by the GCC compiler being unable to inline small functions on ARM Linux.
- Fix a computation error in the CUDA warp perspective operator when batch * img_size exceeds INT_MAX.
- Fix a CUDA elemwise calculation error in the int8 broadcast case (does not affect current CUDA NCHW4 models).
- Fix the indexing logic of the psroi_pooling operator.
- Fix several JIT gradient errors.
- Fix a compilation problem under GCC7.
- Fix some converter bugs of NCHW → NCHWxx.
- Fix gradient computation of reduce and gather.
- Fix the FBS model format failing to load models containing multiple graphs (does not affect the internal MDL model format).
- Fix a possible undefined reference issue with the warp perspective operator when midout is enabled.
- Fix sensitive words introduced by enabling exceptions.
- Fix incorrect contiguous id generation caused by out-of-order categories in annotations.
- Fix incorrect destruction behavior of MGE_PLASMA_STORE_MANAGER when multiple dataloader instances exist in a process.
- Fix loading of quantized int8 pkl models.
- Fix the docstring of nn.flatten. @ChaiMind
Thanks to our Contributors
- Many thanks to @ChaiMind for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.5.1
New Features
- Insert nchw->nchw4 pass before tensorcore pass.
- NCHW→NCHW4 pass supports ops such as Pooling/WarpPerspective/Resize.
- Pretrained int8 models can now be loaded and then dumped.
Bug Fix
- Fix MGE_PLASMA_STORE_MANAGER being destroyed incorrectly when multiple dataloader instances exist in a process.
- Allow FakeQuantize and Observer to use different qmin for weights and activations, avoiding numerical overflow in extreme cases.
- Fix implementation mistakes in mgb.opr.deformable_psroi_pooling.
- Fix CUDA int8 NCHW4 support for channels fewer than 4.
- Fix a typo in the network building docs. @ztjryg4
Thanks to our Contributors
- Many thanks to @ztjryg4 for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.5.0
New Features
- Enable CUDA algos for NCHW quantized data.
- Update conv1x1 to support gemv.
- NCHW4 layout is now supported on CPU (X86, ARM).
- Optimized nchw44-dot layout is available for the Armv8.2-a+dotprod instruction set.
- nchw44 is incorporated to optimize float-typed computation, including direct convolution, channel-wise convolution, hybrid-layout convolution, winograd, and mk4-matmul, along with algorithm optimization of pooling and elemwise.
- Reorganize graph optimization so that common conversions are supported in both the runtime and dump phases.
- Synchronized BN statistics are now available for multi-device training.
- PackAllReducePass is introduced into graph optimization for multi-device training; it packs parameters that need AllReduce, reducing the number of inter-device communications.
- The Calibration quantization training interface is now available.
- QAT quantization training updates: ConvBn can now perform the composed operation of BN and fake quantization/observer; quantization is enabled for Conv and Linear; quantize_qat can skip Modules as needed.
- API adjustments: F.eye is moved from functional/nn.py to core/tensor_factory.py; F.add_axis and F.remove_axis now accept only int axes and no longer allow list axes (see the sketch after this list).
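A small sketch of the stricter axis arguments, using the 0.5-era names F.add_axis / F.remove_axis (later versions rename these; the exact error raised for a list axis is not shown):

```python
import numpy as np
import megengine as mge
import megengine.functional as F

x = mge.tensor(np.zeros((2, 3), dtype="float32"))
y = F.add_axis(x, 0)        # OK: int axis, result shape (1, 2, 3)
z = F.remove_axis(y, 0)     # OK: int axis, back to (2, 3)
# F.add_axis(x, [0, 1])     # no longer accepted: list axes are rejected
```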
Bug Fix
- The HSwish activation function is supported in the FuseConvBiasWithZ pass, and QFUSE_ADD_H_SWISH is folded into the conv bias operator to enhance performance.
- Fix the CUDA 'invalid parameter' error raised by cuda-TopK when batch exceeds 65535, which violates the grid y-dimension limit.
- Drop the path restriction of libcuda.so in cuda-stub.
- Fix a performance problem caused by conv1x1 mistakenly using the is_prefered method of its base class.
- Fix ConvDirectUnrollBuffer, where data fetched when loading src could unexpectedly become 0 (inserting printf statements or removing the loop unrolling optimization avoided the issue).
- Fix an issue where an endpoint would be replaced twice when paramfuse was processing midconst.
- Fix a ReorderArithChainPass bug in gopt present since 8.3.0 (inclusive).
- Fix cond op not supporting empty shapes.
- Fix SetMeshIndexing when indexing with multiple axes.
- Fix a typo in assert locator.device < sd.MAX_NR_DEVICE in CompNode. @zjd1988
- Fix typos in voc and objects365.
- Fix an incorrect class name in voc.
- Fix default_comp_graph usage of Tensor.
- Fix a graph optimization failure when saved_tensors in Function cannot be copied in a static graph.
- Fix the API documentation of scatter to avoid errors on GPU.
- Fix issues of unused var ins.
- Fix the non-str key exception in Module fields.
- Fix scale and zero_point being unexpectedly updated in eval mode for models trained with QAT.
- Disable image algorithms on all Mali-series machines.
Thanks to our Contributors
- Many thanks to @zjd1988 for the submitted PR; we look forward to more developers joining us to build MegEngine together!
MegEngine v0.4.0
New Features
- Add cmake cross build CI
- Add inference cross build support
- Add compnode multithread in Python
- Support copying DeviceTensorND from CPU to CUDA
- Add a profiler for dynamic graphs
- Expose sublinear related parameters at the MGE API level
- Add objects365 dataset
- VOC dataset supports detection
- Add arange, conv_transpose2d, isnan, isinf, normalize oprs (see the sketch after this list)
- Support negative axis in math.py
- Add lower and upper bounds for uniform sampling
- Add a negative slope attribute for the LeakyReLU module
- Add user-defined keys to pack params
- Add quantization interface
- Support NCHW44 layout for ARM, with optimized operators that are faster than NCHW
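A quick sketch exercising a few of the new oprs; the exact signatures (the arange argument order and the LeakyReLU slope argument) are assumptions for the 0.4-era API:

```python
import numpy as np
import megengine as mge
import megengine.functional as F
import megengine.module as M

x = F.arange(0, 6, 1)                                          # new arange opr
nan_mask = F.isnan(mge.tensor(np.array([np.nan, 1.0], dtype="float32")))
leaky = M.LeakyReLU(0.1)                                       # new negative-slope attribute
y = leaky(mge.tensor(np.array([-1.0, 2.0], dtype="float32")))  # -> [-0.1, 2.0]
```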
Bug Fix
- Fix cublas matmul on sm60
- Fix im2col thread safety problem
- Fix gcc5 -O0 compiler error
- Return -1 when create_mm_server fails
- Fix the grad function of reduce mean
- Fix boundary handling in F.clamp
- Fix a NaN bug in the COCO dataset
- Fix batch_norm2d and linspace docs
- Add more warnings in load_state_dict
- Release GPU memory of old parameters
- F.where supports empty x or y
- Fix one_hot: no default value for num_classes
- Fix wrong class name in voc
- Fix the momentum equation in the batch norm doc
Thanks to our Contributors
Many thanks for the valuable feedback and suggestions from developers: clhne, cydiachen, daa233, DaHaiHuha, junjie18, KaiyuYue, mgno32, xiaoguai0992. Some of it is addressed in this release; the rest will follow in upcoming releases.
We would also like to thank everyone who asked questions.
MegEngine v0.3.4
Enhancement
- Expose sublinear related parameters at the MGE API level
- Add an example using the modified sublinear API
Bug Fix
- Fix the multi-machine macro and add test_distributed
MegEngine v0.3.2
Enhancement
- Speed up parameter initialization by using MegEngine's random distribution generator function instead of NumPy.
Bug Fix
- Support scalar inputs in element-wise functions
- Remove internal code from the load_and_run README.md and add the missing example file xornet.py
- Move the collator into data_collection_process, so no additional collate logic is needed in divide mode
- Remove the strict attribute inside _ParallelDataLoaderIter, and throw an exception directly when any worker raises one
- Replace internal random number generation from np.random.RandomState with the megengine.random.rng module to ensure reproducibility of data supply
- Fix ImageNet extraction error
- Fix DataLoader multi-process error under Python 3.7 (TypeError: can't pickle weakref objects)
- Fix typo InternalGraphGenerator
- Fix docstrings for batchnorm, loss, transform, and tensor
MegEngine v0.3.1: Hello World!
New Features:
- An auto-differential numerical framework for Tensors, with Imperative and Tracing modes.
- Various Tensor-based operators.
- Unified code for both Imperative and Tracing modes via the @jit.trace decorator.
- Module-based APIs to build neural networks, with save/load methods for parameters (see the sketch after this list).
- Common mathematical functions in the megengine.functional package.
- Basic modules in the megengine.module package.
- High-performance operators for X86 and CUDA:
- Support for instruction sets such as SSE4, VNNI, etc.
- Broad support for NVIDIA GPUs.
- Automatic kernel selection by profiling.
- A DataLoader for loading and preprocessing model data.
- A hub module for loading online models and pre-trained models.
- trace.dump() for model serialization, plus sample C++ code for loading a model and running inference.
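A minimal sketch of the Module-based workflow with weight persistence, assuming the megengine.save / megengine.load helpers and 0.3-era module names:

```python
import megengine as mge
import megengine.functional as F
import megengine.module as M

class TinyNet(M.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = M.Linear(4, 8)
        self.fc2 = M.Linear(8, 2)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

net = TinyNet()
mge.save(net.state_dict(), "tiny.pkl")      # persist weights
net.load_state_dict(mge.load("tiny.pkl"))   # restore them later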
Experimental Features:
- Import PyTorch Modules.
Known Issues:
- Memory usage and performance need to be optimized for Imperative mode.
- Memory consumption might be high in the Imperative mode.
- Dynamically created Tensors in Imperative mode cannot be released automatically for now; the user has to call the set_value() method to reuse Tensors and avoid growing memory usage.
- Some operators, such as PyTorchModule and megengine.Function, will keep allocating more memory.
- Tracing related issues.
- Traced functions cannot be further differentiated with backward().
- Complex nesting of jit.trace and jit.sideeffect may result in undefined behavior.
- Performance issues.
- The step() of the Adam optimizer is slow.
- Random initialization for parameters is slow.
Next Plans:
- Improving memory consumption and speed of the Imperative mode.
- Supporting more devices, e.g., ARM.
- More operators.
- Providing more docs, CI, build tools, and other supporting infrastructure.