Teaching machines to describe visual images is one of the longest-standing challenges in the field of Machine Learning. This thesis tackles the problem of describing images via natural language: how to build machine learning models that read visual images, describe their content, and answer relevant questions. From an application perspective, strong image captioning systems can contribute to applications such as visual question answering, dialogue systems, and vision-based robotics. As a long-term goal, if we can build such systems (e.g., GPT-4V and beyond), they would be a crucial step towards Artificial General Intelligence: computers that can perceive and explore the world as humans do.
This thesis focuses on neural models: building vision understanding and language generation models with deep neural networks. It consists of three main parts.
First, we introduce concept bottleneck models, a class of models that build concept layers for visual understanding. We present our work on learning a concise concept space, along with follow-up applications in medical imaging that improve robustness.
In the second part of this thesis, we investigate how to build practical image captioning systems based on different neural text generation architectures, from LSTMs to transformers and pre-trained language models. In particular, we cover four tasks: 1) how to describe the visual difference between two images; 2) how to write medical reports from chest X-rays to assist doctors; 3) how to generate personalized explanations for recommender systems; and 4) how to augment text generation with visual imagination produced by vision diffusion models.
In the third part, we discuss recent advances, future directions, and open questions in this field, focusing on datasets, models, and applications. We also introduce some of our ongoing attempts in these directions: for example, how to navigate phone screens and complete mobile tasks with GPT-4V.
In summary, my research contributes to the field of vision and language, specifically visual understanding via natural language, spanning data curation, algorithm design, model training, and a variety of downstream applications.