Google Scholar

User profiles for Xiuye Gu

Xiuye Gu

Google

Verified email at google.com

Cited by 3313

Open-vocabulary object detection via vision and language knowledge distillation

X Gu, TY Lin, W Kuo, Y Cui - arXiv preprint arXiv:2104.13921, 2021 - arxiv.org

We aim at advancing open-vocabulary object detection, which detects objects described by
arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly …

Cited by 1095

Scaling open-vocabulary image segmentation with image-level labels

G Ghiasi, X Gu, Y Cui, TY Lin - European conference on computer vision, 2022 - Springer

We design an open-vocabulary image segmentation model to organize an image into meaningful
regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite attaining …

Cited by 519

Photorealistic video generation with diffusion models

A Gupta, L Yu, K Sohn, X Gu, M Hahn, FF Li… - … on Computer Vision, 2024 - Springer

We present WALT, a diffusion transformer for photorealistic video generation from text
prompts. Our approach has two key design decisions. First, we use a causal encoder to jointly …

Cited by 192

VideoPoet: A large language model for zero-shot video generation

D Kondratyuk, L Yu, X Gu, J Lezama, J Huang… - arXiv preprint arXiv …, 2023 - arxiv.org

We present VideoPoet, a language model capable of synthesizing high-quality video, with
matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-…

Cited by 279

F-VLM: Open-vocabulary object detection upon frozen vision and language models

W Kuo, Y Cui, X Gu, AJ Piergiovanni… - arXiv preprint arXiv …, 2022 - arxiv.org

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen
Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by …

Cited by 233

DaTaSeg: Taming a universal multi-dataset multi-task segmentation model

X Gu, Y Cui, J Huang, A Rashwan… - Advances in …, 2023 - proceedings.neurips.cc

Observing the close relationship among panoptic, semantic and instance segmentation
tasks, we propose to train a universal multi-dataset multi-task segmentation model: DaTaSeg. …

Cited by 33

HPLFlowNet: Hierarchical permutohedral lattice FlowNet for scene flow estimation on large-scale point clouds

X Gu, Y Wang, C Wu, YJ Lee… - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com

We present a novel deep neural network architecture for end-to-end scene flow estimation
that directly operates on large-scale 3D point clouds. Inspired by Bilateral Convolutional …

Cited by 270

Language Model Beats Diffusion--Tokenizer is Key to Visual Generation

…, D Minnen, Y Cheng, V Birodkar, A Gupta, X Gu… - arXiv preprint arXiv …, 2023 - arxiv.org

While Large Language Models (LLMs) are the dominant models for generative tasks in
language, they do not perform as well as diffusion models on image and video generation. To …

Cited by 325

Pixel-aligned language model

J Xu, X Zhou, S Yan, X Gu, A Arnab… - Proceedings of the …, 2024 - openaccess.thecvf.com

Large language models have achieved great success in recent years, so as their variants in
vision. Existing vision-language models can describe images in natural languages, answer …

Cited by 27

Password-conditioned anonymization and deanonymization with face identity transformers

X Gu, W Luo, MS Ryoo, YJ Lee - European conference on computer vision, 2020 - Springer

Cameras are prevalent in our daily lives, and enable many useful systems built upon computer
vision technologies such as smart cameras and home robots for service applications. …

Cited by 66
