
v0.1.4

Released by @oulgen on 29 Sep at 16:17 · 103 commits to main since this release · 0428d5d

What's Changed

  • Update benchmark.yml by @oulgen in #570
  • Update benchmark.yml by @oulgen in #571
  • [Benchmark] Use custom kernel metric mapping list to accommodate inconsistent naming by @oulgen in #567
  • Add rms norm and cross entropy by @oulgen in #568
  • Update benchmark_dispatch.yml by @oulgen in #573
  • Update linters by @oulgen in #569
  • Print config for PassManager::run triton errors by @jansel in #565
  • Error when invalid loop reduction number config is generated by @oulgen in #572
  • Add skipIfLowVRAM or use_default_config=False to specific unit tests to enable local testing by @yf225 in #574
  • Fix bug with block_size smaller than minimum by @jansel in #575
  • Better shape errors for mismatched tile sizes by @jansel in #566
  • Print warning if block_size is specified in interpret mode by @choijon5 in #576
  • Run all shapes for benchmarks by @oulgen in #578
  • [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
  • [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
  • [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
  • Do not benchmark twice by @oulgen in #583
  • Add missing functions to docs by @jansel in #586
  • hl.atomic_add: support 1D tensor as index by @yf225 in #587
  • Add atomic and/or/min/max/cas/xchg by @jansel in #589 (a hedged usage sketch follows this list)
  • Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
  • Add link to github to docs by @jansel in #591
  • Support layernorm without bias by @mengluy0125 in #585
  • Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
  • Add layer_norm backward kernels by @yf225 in #588
  • Fix tf32 warning by @jansel in #592
  • [Benchmark] geglu example and test by @Sibylau in #582
  • Print default config when running with it by @oulgen in #599
  • [Benchmark] swiglu example and test by @Sibylau in #584
  • Login to Docker from the workflows by @huydhn in #601
  • Add rms_norm backward kernels by @mengluy0125 in #597
  • Revert "Login to Docker from the workflows" by @oulgen in #604
  • Fix static shape typo by @oulgen in #609
  • Add small dim size (<16) support to hl.dot and torch.addmm; Always prefer using tl.dot(acc=...) for addmm / baddbmm by @yf225 in #564
  • Fix rms_norm and layer_norm by @mengluy0125 in #603
  • [Benchmark] jsd kernel and test by @Sibylau in #611
  • Refactor autotune error handling by @jansel in #595
  • Possible fix for CI failures by @jansel in #617
  • [Benchmark] Welford kernel and example by @karthickai in #614
  • [Benchmark] kl_div kernel and test by @Sibylau in #615
  • Ignore TServiceRouterException errors while autotuning by @jansel in #618
  • [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
  • Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
  • Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
  • Add more kernels to benchmarking by @oulgen in #632
  • Reorder benchmarks by @oulgen in #633
  • [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
  • Support using block size var outside of hl.tile loop by @yf225 in #619
  • [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
  • Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
  • Always clear inductor cache before benchmark by @yf225 in #608
  • Make hl.specialize work on sequences by @jansel in #636
  • Better error for passing Tile to hl.tile by @jansel in #640
  • [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
  • int4_gemm: remove use_default_config=True by @yf225 in #639
  • [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
  • Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
  • Add HELION_AUTOTUNE_RANDOM_SEED env var and autotune_random_seed setting by @yf225 in #644 (see the sketch after this list)
  • Bump linter by @oulgen in #647
  • Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
  • Fix lint related to welford and also local_cache by @yf225 in #646
  • Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
  • PT Sphinx Theme Test by @sekyondaMeta in #600
  • Print static_shapes settings value along with config for accurate repro by @yf225 in #649
  • [Benchmark] gather_gemv kernel and test by @Sibylau in #635
  • Add HELION_SKIP_CACHE env by @jansel in #653
  • [lint] Remove UP038 reference by @jansel in #650
  • Fix register_block_size codegen by @yf225 in #659
  • Raise better error when hl.atomic_* is used on device tensor by @yf225 in #658
  • [Autotune] Filter bad config with accuracy check by @yf225 in #655
  • Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
  • Log autotune random seed for easier repro by @yf225 in #661
  • Fix misaligned address error for matmul by @yf225 in #662
  • skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
  • rms_norm: get weight from function args by @yf225 in #664
  • skip full autotune if configs are provided by @xuanzhang816 in #670
  • [example] fused_linear_jsd by @v0i0 in #494
  • Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
  • No background image by @sekyondaMeta in #663
  • Remove github link from index.md by @oulgen in #675
  • [Autotune] Allow skipping Triton compilation error by @yf225 in #679
  • [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
  • Fix missing block size constexpr assignment in host code by @yf225 in #678
  • [CI] Fix missing setuptools by @yf225 in #680
  • Faster rms_norm backward kernel by @v0i0 in #624
  • [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
  • [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
  • Increase tolerance for _validate_against_baseline by @jansel in #691
  • [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
  • Print bad default config if compute baseline fails by @yf225 in #688
  • Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
  • rms norm: improve fwd perf by @v0i0 in #669
  • Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
  • [Autotune] Skip Triton shared memory OOM by @yf225 in #684
  • [Benchmark CI] Customizable num_inputs for each kernel by @yf225 in #699
  • Run autotune with TRITON_LOCAL_BUILD=1 by @jansel in #695
  • Add test for no options found in autotuner by @jansel in #693
  • [Benchmark CI] fix layer_norm mapping bug by @yf225 in #701
  • [Benchmark CI] Change DEFAULT_NUM_INPUTS to MAX_NUM_INPUTS by @yf225 in #702
  • [Benchmark CI] Allow customized mapping into tritonbench impls by @yf225 in #700
  • [Benchmark CI] Add --list-kernels-for-benchmark-ci to benchmark runner by @yf225 in #703
  • Increase test timeout to 60 by @jansel in #697
  • [Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels by @yf225 in #705
  • Revert "[Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels" by @yf225 in #706
  • Revert "[Benchmark CI] Add --list-kernels-for-benchmark-ci to benchmark runner" by @yf225 in #707
  • [Benchmark CI] Run rms_norm-bwd and layer_norm-bwd kernels by @yf225 in #708
  • [Benchmark CI] Run fewer inputs for layer_norm-bwd to avoid job timeout by @yf225 in #709
  • [Benchmark CI] Set tolerance values that match autotuner setting by @yf225 in #710
  • [Benchmark CI] Skip last input shape for rms_norm-bwd by @yf225 in #712
  • Rebenchmark configs to avoid noise by @jansel in #654
  • Increase default num_generations to 40 by @jansel in #677
  • Add PatternSearch autotuning algorithm by @jansel in #696
  • [Benchmark CI] Simplify h100 display name by @oulgen in #713
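
Two of the entries above extend Helion's atomics: #587 allows a 1D tensor as the index to hl.atomic_add, and #589 adds and/or/min/max/cas/xchg. Below is a minimal sketch of the tensor-index form, assuming an hl.atomic_add(target, index, value) signature; the bincount kernel and the use_default_config=True flag are illustrative choices, not code from these PRs.

```python
import torch
import helion
import helion.language as hl

# Illustrative sketch only: a histogram built from device-side atomic adds.
# Assumes hl.atomic_add(target, [index], value) accepts a 1D tensor index
# (per #587); the kernel itself is not from this release.
@helion.kernel(use_default_config=True)
def bincount(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    out = torch.zeros(num_bins, dtype=torch.int32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # indices[tile] is a 1D tensor of bin ids for this tile of elements.
        hl.atomic_add(out, [indices[tile]], 1)
    return out

idx = torch.randint(0, 128, (4096,), device="cuda")
counts = bincount(idx, 128)
```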
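Likewise for reproducible autotuning (#644): the snippet below exercises both knobs named in the PR title, the HELION_AUTOTUNE_RANDOM_SEED environment variable and the autotune_random_seed setting. The add kernel is just a stand-in, and treating the setting as a @helion.kernel keyword argument is an assumption.

```python
import os

# Pin the autotuner's RNG process-wide via the environment (name from #644)...
os.environ["HELION_AUTOTUNE_RANDOM_SEED"] = "42"

import torch
import helion
import helion.language as hl

# ...or pin it per kernel via the corresponding setting (assumed here to be
# a keyword argument on the decorator).
@helion.kernel(autotune_random_seed=42)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```

The other environment toggles in this release, HELION_SKIP_CACHE (#653), HELION_DEBUG_DTYPE_ASSERTS=1 (#590), and HELION_AUTOTUNE_ACCURACY_CHECK=0 (#692), are set the same way.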

Full Changelog: v0.1.3...v0.1.4