User profiles for Xiuye Gu
Xiuye Gu, Google. Verified email at google.com. Cited by 3313
Open-vocabulary object detection via vision and language knowledge distillation
We aim at advancing open-vocabulary object detection, which detects objects described by
arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly …
Scaling open-vocabulary image segmentation with image-level labels
We design an open-vocabulary image segmentation model to organize an image into meaningful
regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite attaining …
Photorealistic video generation with diffusion models
We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to jointly …
VideoPoet: A large language model for zero-shot video generation
We present VideoPoet, a language model capable of synthesizing high-quality video, with
matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-…
F-VLM: Open-vocabulary object detection upon frozen vision and language models
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen
Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by …
DaTaSeg: Taming a universal multi-dataset multi-task segmentation model
Observing the close relationship among panoptic, semantic and instance segmentation
tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg. …
HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds
We present a novel deep neural network architecture for end-to-end scene flow estimation
that directly operates on large-scale 3D point clouds. Inspired by Bilateral Convolutional …
Language Model Beats Diffusion--Tokenizer is Key to Visual Generation
While Large Language Models (LLMs) are the dominant models for generative tasks in
language, they do not perform as well as diffusion models on image and video generation. To …
Pixel-aligned language model
Large language models have achieved great success in recent years, as have their variants in
vision. Existing vision-language models can describe images in natural language, answer …
Password-conditioned anonymization and deanonymization with face identity transformers
Cameras are prevalent in our daily lives, and enable many useful systems built upon computer
vision technologies such as smart cameras and home robots for service applications. …