v0.1.4
What's Changed
- Update benchmark.yml by @oulgen in #570
- Update benchmark.yml by @oulgen in #571
- [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
- Add rms norm and cross entropy by @oulgen in #568
- Update benchmark_dispatch.yml by @oulgen in #573
- Update linters by @oulgen in #569
- Print config for PassManager::run triton errors by @jansel in #565
- Error when invalid loop reduction number config is generated by @oulgen in #572
- Add `skipIfLowVRAM` or `use_default_config=False` to specific unit tests to enable local testing by @yf225 in #574 (sketched after this list)
- Fix bug with block_size smaller than minimum by @jansel in #575
- Better shape errors for mismatched tile sizes by @jansel in #566
- Print warning if block_size is specified in interpret mode. by @choijon5 in #576
- Run all shapes for benchmarks by @oulgen in #578
- [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
- [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
- [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
- Do not benchmark twice by @oulgen in #583
- Add missing functions to docs by @jansel in #586
- hl.atomic_add: support 1D tensor as index by @yf225 in #587 (see the atomics sketch after this list)
- Add atomic and/or/min/max/cas/xchg by @jansel in #589
- Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
- Add link to github to docs by @jansel in #591
- Support layernorm without bias by @mengluy0125 in #585
- Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
- Add layer_norm backward kernels by @yf225 in #588
- Fix tf32 warning by @jansel in #592
- [Benchmark] geglu example and test by @Sibylau in #582
- Print default config when running with it by @oulgen in #599
- [Benchmark] swiglu example and test by @Sibylau in #584
- Login to Docker from the workflows by @huydhn in #601
- Add rms_norm backward kernels by @mengluy0125 in #597
- Revert "Login to Docker from the workflows" by @oulgen in #604
- Fix static shape typo by @oulgen in #609
- Add small dim size (<16) support to hl.dot and torch.addmm; always prefer using `tl.dot(acc=...)` for addmm / baddbmm by @yf225 in #564
- Fix rms_norm and layer_norm by @mengluy0125 in #603
- [Benchmark] jsd kernel and test by @Sibylau in #611
- Refactor autotune error handling by @jansel in #595
- Possible fix for CI failures by @jansel in #617
- [Benchmark] Welford kernel and example by @karthickai in #614
- [Benchmark] kl_div kernel and test by @Sibylau in #615
- Ignore TServiceRouterException errors while autotuning by @jansel in #618
- [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
- Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
- Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
- Add more kernels to benchmarking by @oulgen in #632
- Reorder benchmarks by @oulgen in #633
- [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
- Support using block size var outside of hl.tile loop by @yf225 in #619
- [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
- Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
- Always clear inductor cache before benchmark by @yf225 in #608
- Make hl.specialize work on sequences by @jansel in #636 (sketched after this list)
- Better error for passing Tile to hl.tile by @jansel in #640
- [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
- int4_gemm: remove use_default_config=True by @yf225 in #639
- [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
- Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
- Add `HELION_AUTOTUNE_RANDOM_SEED` env var and `autotune_random_seed` setting by @yf225 in #644 (sketched after this list)
- Bump linter by @oulgen in #647
- Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
- Fix lint related to welford and also local_cache by @yf225 in #646
- Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
- PT Sphinx Theme Test by @sekyondaMeta in #600
- Print `static_shapes` settings value along with config for accurate repro by @yf225 in #649
- [Benchmark] gather_gemv kernel and test by @Sibylau in #635
- Add HELION_SKIP_CACHE env var by @jansel in #653 (see the env-var sketch after this list)
- [lint] Remove UP038 reference by @jansel in #650
- Fix `register_block_size` codegen by @yf225 in #659
- Raise better error when `hl.atomic_*` is used on device tensor by @yf225 in #658
- [Autotune] Filter bad config with accuracy check by @yf225 in #655
- Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
- Log autotune random seed for easier repro by @yf225 in #661
- Fix misaligned address error for matmul by @yf225 in #662
- skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
- rms_norm: get weight from function args by @yf225 in #664
- skip full autotune if configs are provided by @xuanzhang816 in #670
- [example] fused_linear_jsd by @v0i0 in #494
- Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
- No background image by @sekyondaMeta in #663
- Remove github link from index.md by @oulgen in #675
- [Autotune] Allow skipping Triton compilation error by @yf225 in #679
- [Benchmark CI] Run one kernel per GPU to maximize successful kernel reporting by @yf225 in #681
- Fix missing block size constexpr assignment in host code by @yf225 in #678
- [CI] Fix missing setuptools by @yf225 in #680
- faster rms norm backwards kernel by @v0i0 in #624
- [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
- [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
- Increase tolerance for _validate_against_baseline by @jansel in #691
- [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
- Print bad default config if compute baseline fails by @yf225 in #688
- Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692 (see the env-var sketch after this list)
- rms norm: improve fwd perf by @v0i0 in #669
- Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
- [Autotune] Skip Triton shared memory OOM by @yf225 in #684
- [Benchmark CI] Customizable num_inputs for each kernel by @yf225 in #699
- Run autotune with TRITON_LOCAL_BUILD=1 by @jansel in #695
- Add test for no options found in autotuner by @jansel in #693
- [Benchmark CI] fix layer_norm mapping bug by @yf225 in #701
- [Benchmark CI] Change DEFAULT_NUM_INPUTS to MAX_NUM_INPUTS by @yf225 in #702
- [Benchmark CI] Allow customized mapping into tritonbench impls by @yf225 in #700
- [Benchmark CI] Add `--list-kernels-for-benchmark-ci` to benchmark runner by @yf225 in #703
- Increase test timeout to 60 by @jansel in #697
- [Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels by @yf225 in #705
- Revert "[Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels" by @yf225 in #706
- Revert "[Benchmark CI] Add
--list-kernels-for-benchmark-ci
to benchmark runner" by @yf225 in #707 - [Benchmark CI] Run
rms_norm-bwd
andlayer_norm-bwd
kernels by @yf225 in #708 - [Benchmark CI] Run fewer inputs for layer_norm-bwd to avoid job timeout by @yf225 in #709
- [Benchmark CI] Set tolerance values that match autotuner setting by @yf225 in #710
- [Benchmark CI] Skip last input shape for rms_norm-bwd by @yf225 in #712
- Rebenchmark configs to avoid noise by @jansel in #654
- Increase default num_generations to 40 by @jansel in #677
- Add PatternSearch autotuning algorithm by @jansel in #696
- [Benchmark CI] Simplify h100 display name by @oulgen in #713
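Several changes above add or rely on user-facing knobs; the sketches below illustrate them. They are minimal examples inferred from the PR titles, not verified against the released API. For #574-style local testing, a kernel can skip autotuning entirely via use_default_config (assumed decorator usage):

```python
import torch
import helion
import helion.language as hl

# use_default_config=True runs Helion's default config instead of autotuning,
# which keeps local test runs fast; #574's tests pass use_default_config=False
# where autotuned behavior is the thing under test.
@helion.kernel(use_default_config=True)
def add_one(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size(0)):
        out[tile] = x[tile] + 1
    return out
```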
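For #587 and #589, a hedged histogram sketch: the 1D-tensor index form of hl.atomic_add comes from the PR title, and the bracketed-index convention is an assumption:

```python
import torch
import helion
import helion.language as hl

@helion.kernel(use_default_config=True)
def histogram(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    counts = torch.zeros(num_bins, dtype=torch.int32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # #587: the index may itself be a 1D tensor, scattering one increment
        # per element of indices[tile] into counts.
        hl.atomic_add(counts, [indices[tile]], 1)
    return counts
```

#589 extends the same family with atomic and/or/min/max/cas/xchg, which should follow the same call shape.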
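For #636, a sketch of hl.specialize over a sequence; that it returns the specialized values as a tuple is an assumption extrapolated from the scalar form:

```python
import torch
import helion
import helion.language as hl

@helion.kernel(use_default_config=True)
def double(x: torch.Tensor) -> torch.Tensor:
    # Previously sizes were specialized one at a time (hl.specialize(x.size(0)));
    # #636 allows passing the whole size sequence so each dimension is baked in
    # as a compile-time constant.
    m, n = hl.specialize(x.size())
    out = torch.empty_like(x)
    for tile_m, tile_n in hl.tile([m, n]):
        out[tile_m, tile_n] = x[tile_m, tile_n] * 2
    return out
```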
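For #644, the autotuner seed can be pinned process-wide or per kernel; both names come from the PR title, the kwarg form is assumed:

```python
import os

# Process-wide: set before kernels are compiled so autotuning is reproducible.
os.environ["HELION_AUTOTUNE_RANDOM_SEED"] = "42"

import torch
import helion
import helion.language as hl

# Per-kernel: the equivalent setting on the decorator.
@helion.kernel(autotune_random_seed=42)
def scale(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size(0)):
        out[tile] = x[tile] * 2
    return out
```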
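For #653 and #692, two debugging escape hatches; the variable names are taken straight from the PR titles, their exact semantics are assumptions:

```python
import os

# #653: skip Helion's kernel cache so every run recompiles from scratch.
os.environ["HELION_SKIP_CACHE"] = "1"

# #692: turn off the autotuner accuracy check from #655, e.g. when it
# rejects configs for intentionally nondeterministic kernels.
os.environ["HELION_AUTOTUNE_ACCURACY_CHECK"] = "0"
```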
New Contributors
- @choijon5 made their first contribution in #576
- @mengluy0125 made their first contribution in #585
- @huydhn made their first contribution in #601
- @v0i0 made their first contribution in #494
Full Changelog: v0.1.3...v0.1.4