SCUT · Guangzhou (UTC +08:00)
Google Scholar: https://scholar.google.com/citations?user=dW7AgfgAAAAJ&hl=zh-CN
Stars
A latent text-to-image diffusion model
The repository provides code for running inference with the SegmentAnything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
12 Lessons to Get Started Building AI Agents
《开源大模型食用指南》 (A Practical Guide to Open-Source LLMs): a tutorial tailored for Chinese beginners on quickly fine-tuning (full-parameter/LoRA) and deploying open-source LLMs and multimodal large models (MLLMs), both Chinese and international, in a Linux environment
A simple screen parsing tool towards pure vision based GUI agent
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
High-Resolution Image Synthesis with Latent Diffusion Models
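Latent diffusion models run the standard DDPM noising/denoising process in a VAE latent space rather than in pixel space. A minimal numpy sketch of the forward (noising) step, with an illustrative linear schedule — all shapes and schedule values here are assumptions for demonstration, not taken from the repository:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product \bar{alpha}_t

def q_sample(x0, t, eps):
    # Closed-form forward process:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = rng.normal(size=(4, 8, 8))   # a toy stand-in for a VAE latent
eps = rng.normal(size=x0.shape)   # Gaussian noise

# Early steps barely perturb the latent; by the final step it is almost
# pure noise, which is what the denoiser learns to invert.
x_early = q_sample(x0, 10, eps)
x_late = q_sample(x0, T - 1, eps)
```

The denoising network (omitted here) is trained to predict `eps` from `x_t` and `t`, which is what makes sampling by iterative denoising possible.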
QLoRA: Efficient Finetuning of Quantized LLMs
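QLoRA fine-tunes a quantized base model through LoRA adapters. A hedged numpy sketch of the low-rank update at the heart of LoRA — the dimensions, rank, and scaling below are illustrative, and the quantization half of QLoRA is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # illustrative sizes; r << d

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def adapted_forward(x):
    # y = W x + (alpha / r) * B (A x): frozen path plus scaled low-rank path
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)

full_params = d_out * d_in          # what full fine-tuning would train
lora_params = r * (d_out + d_in)    # what LoRA trains instead
```

Because `B` starts at zero, the adapter is an exact no-op before training, and only `r * (d_out + d_in)` parameters receive gradients instead of `d_out * d_in`.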
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
MiniCPM4 & MiniCPM4.1: Ultra-Efficient LLMs on End Devices, achieving a 3x+ generation speedup on reasoning tasks
Reference PyTorch implementation and models for DINOv3
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
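CLIP-style cross-modal retrieval reduces to nearest-neighbor search over normalized embeddings. A sketch of that retrieval step — the actual image/text encoders are assumed, and the embeddings below are random stand-ins for their outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend these 5 vectors came from an image encoder (dim 32).
image_embs = l2_normalize(rng.normal(size=(5, 32)))
# Simulate matched captions by lightly perturbing each image embedding.
text_embs = l2_normalize(image_embs + 0.05 * rng.normal(size=(5, 32)))

# Retrieval = argmax cosine similarity (a dot product of unit vectors).
sims = text_embs @ image_embs.T
best_image_per_text = sims.argmax(axis=1)  # each caption finds an image
```

Real deployments replace the brute-force `argmax` with an approximate nearest-neighbor index, but the similarity computation is the same.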
This is the official code for MobileSAM project that makes SAM lightweight for mobile applications and beyond!
AIInfra (AI infrastructure) covers the AI systems stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.
Align Anything: Training All-modality Model with Feedback
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Chinese NLP solutions (large models, data, models, training, inference)
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Train a 1B LLM from scratch on 1T tokens, as a personal project
This repository provides training & testing code, the dataset, detection & recognition annotations, an evaluation script, an annotation tool, and a leaderboard.
Run Segment Anything Model 2 on a live video stream
Turning a CLIP Model into a Scene Text Detector (CVPR2023) | Turning a CLIP Model into a Scene Text Spotter (TPAMI)