-
torchtitan Public
Forked from pytorch/torchtitan
A native PyTorch Library for large model training
Python BSD 3-Clause "New" or "Revised" License Updated Feb 18, 2025
-
torchft Public
Forked from meta-pytorch/torchft
PyTorch per step fault tolerance (actively under development)
Python Other Updated Dec 24, 2024
-
nvidia-resiliency-ext Public
Forked from NVIDIA/nvidia-resiliency-ext
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…
Python Other Updated Dec 18, 2024
-
Megatron-LM Public
Forked from NVIDIA/Megatron-LM
Ongoing research training transformer models at scale
Python Other Updated Nov 23, 2024
-
NeMo Public
Forked from NVIDIA-NeMo/NeMo
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Python Apache License 2.0 Updated Nov 23, 2024
-
litgpt Public
Forked from Lightning-AI/litgpt
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Python Apache License 2.0 Updated Sep 17, 2024
-
pytorch-lightning Public
Forked from Lightning-AI/pytorch-lightning
Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
Python Apache License 2.0 Updated Sep 16, 2024
-
veScale Public
Forked from volcengine/veScale
A PyTorch Native LLM Training Framework
Python Apache License 2.0 Updated Aug 25, 2024
-
ai-on-gke Public
Forked from GoogleCloudPlatform/ai-on-gke
AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kubernetes Engine
HCL Apache License 2.0 Updated Aug 24, 2024
-
maxtext Public
Forked from AI-Hypercomputer/maxtext
A simple, performant and scalable Jax LLM!
Python Apache License 2.0 Updated Apr 2, 2024
-
xpk Public
Forked from AI-Hypercomputer/xpk
xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool to help Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Python Apache License 2.0 Updated Feb 17, 2024
-
TinyLlama Public
Forked from jzhang38/TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Python Apache License 2.0 Updated Feb 14, 2024
-
tpu-tools Public
Forked from tensorflow/tpu
Reference models and tools for Cloud TPUs.
Jupyter Notebook Apache License 2.0 Updated Aug 29, 2023
-
orbax Public
Forked from google/orbax
Orbax provides common utility libraries for JAX users.
Python Apache License 2.0 Updated Aug 8, 2023
-
ray Public
Forked from ray-project/ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for accelerating ML workloads.
Python Apache License 2.0 Updated Apr 29, 2023
-
diffusers Public
Forked from huggingface/diffusers
🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
Python Apache License 2.0 Updated Apr 8, 2023
-
serving Public
Forked from tensorflow/serving
A flexible, high-performance serving system for machine learning models
C++ Apache License 2.0 Updated Nov 8, 2022
-
jax Public
Forked from jax-ml/jax
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
Python Apache License 2.0 Updated Nov 4, 2022
-
ml-testing-accelerators Public
Forked from GoogleCloudPlatform/ml-testing-accelerators
Testing framework for Deep Learning models (TensorFlow and PyTorch) on Google Cloud hardware accelerators (TPU and GPU)
Jsonnet Apache License 2.0 Updated Nov 4, 2022
-
jaxformer Public
Forked from salesforce/jaxformer
Minimal library to train LLMs on TPU in JAX with pjit().
Python BSD 3-Clause "New" or "Revised" License Updated Oct 27, 2022
-
UCSD_BigData Public
Forked from yoavfreund/UCSD_BigData
A repository for scripts and notebooks for the UCSD big data course
Python Updated Jun 14, 2014