Abstract: The objective of reinforcement learning is to find the optimal policy that maximizes rewards in the long run. In this talk I will cover three types of RL algorithms: 1. Policy gradient; 2. Actor-Critic; 3. Q-learning. Concepts will be explained with illustrations, and papers from OpenAI will be shared.
[View Slides]
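As a quick illustration of the first family above, here is a minimal REINFORCE-style policy-gradient sketch in PyTorch; the environment is replaced by random states and rewards, and all sizes are placeholders rather than anything from the talk.

```python
# Minimal REINFORCE sketch (illustrative only): ascend the gradient of
# log pi(a|s) weighted by the discounted return to maximize expected reward.
import torch
import torch.nn as nn

torch.manual_seed(0)

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # toy 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

# One fake episode: random states and rewards stand in for an environment.
states = torch.randn(10, 4)
rewards = torch.rand(10)

dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()
log_probs = dist.log_prob(actions)

# Discounted returns G_t = r_t + gamma * G_{t+1}
returns = torch.zeros(10)
running = 0.0
for t in reversed(range(10)):
    running = rewards[t] + gamma * running
    returns[t] = running

loss = -(log_probs * returns).sum()   # negative, so gradient descent performs ascent on return
optimizer.zero_grad()
loss.backward()
optimizer.step()
```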
Abstract: Video feature learning for action recognition is a challenging task that has been extensively studied in the research community. How to properly exploit motion and temporal information is key to the design of the models. In this talk, I will review some well-known CNN/LSTM-based networks designed for action recognition, including multi-stream CNNs, 3D convolution and its variants, and non-local neural networks.
[View Slides]
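For a concrete feel of the 3D-convolution idea mentioned above, here is a minimal PyTorch sketch; the clip shape and channel counts are illustrative only, not taken from any specific model in the talk.

```python
# Minimal sketch of a 3D convolution over a video clip, the core operation
# behind C3D/I3D-style action recognition models; shapes are illustrative.
import torch
import torch.nn as nn

clip = torch.randn(2, 3, 16, 112, 112)                # (batch, RGB, frames, height, width)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)   # convolves jointly over space and time
features = conv3d(clip)                               # -> (2, 64, 16, 112, 112)
print(features.shape)
```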
Abstract: The target of image captioning is to generate a syntactically and semantically correct sentence that describes the main content of a given image. Compared with early image captioners, which are rule/template-based, modern captioning models have achieved striking advances through three key techniques: the encoder-decoder pipeline, the attention mechanism, and RL-based training objectives. However, these image captioners lack the ability of commonsense reasoning, an important inductive bias owned by humans. To exploit such language inductive bias, the Scene Graph Auto-Encoder (SGAE) is proposed to generate more descriptive captions.
[View Slides]
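As a rough illustration of the attention technique inside the encoder-decoder pipeline, the sketch below computes a soft attention over pre-extracted region features from a decoder hidden state; all dimensions are assumed placeholders, and this is not the SGAE model itself.

```python
# Minimal sketch of one attention step in an encoder-decoder captioner:
# the decoder state attends over image region features; sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

regions = torch.randn(1, 36, 2048)    # 36 region features, e.g. bottom-up-attention style
hidden = torch.randn(1, 512)          # current decoder hidden state

W_r = nn.Linear(2048, 512)
W_h = nn.Linear(512, 512)
w = nn.Linear(512, 1)

scores = w(torch.tanh(W_r(regions) + W_h(hidden).unsqueeze(1)))   # (1, 36, 1)
alpha = F.softmax(scores, dim=1)                                   # attention weights over regions
context = (alpha * regions).sum(dim=1)                             # (1, 2048) attended visual context
```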
Abstract: Visual dialog is a challenging vision-language task, which requires the agent to answer multi-round questions about an image. It typically needs to address two major problems: (1) how to answer visually-grounded questions, which is the core challenge in visual question answering (VQA); (2) how to infer the co-reference between questions and the dialog history. An example of visual co-reference: pronouns (e.g., `they') in the question (e.g., `Are they on or off?') are linked with nouns (e.g., `lamps') appearing in the dialog history (e.g., `How many lamps are there?') and the object grounded in the image. In this work, to resolve visual co-reference for visual dialog, we propose a novel attention mechanism called Recursive Visual Attention (RvA). Specifically, our dialog agent browses the dialog history until it has sufficient confidence in the visual co-reference resolution, and refines the visual attention recursively. Quantitative and qualitative experimental results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the proposed RvA not only outperforms state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations.
[View Slides]
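The following is only a toy sketch of the recursion idea described above, not the paper's RvA implementation: step back through the dialog history until a (hypothetical) confidence score says the current question can be grounded on its own, then reuse that round's visual attention.

```python
# Toy sketch of recursive attention over the dialog history (illustrative only).
import torch

def recursive_attention(question_feat, history_feats, visual_attns,
                        confidence_fn, threshold=0.5, t=-1):
    """Return a visual attention map for the current question.

    question_feat: feature of the question examined at this recursion step
    history_feats: features of all questions up to the current round
    visual_attns:  visual attention maps computed at each round
    confidence_fn: hypothetical helper scoring how self-contained a question is
    """
    if confidence_fn(question_feat) > threshold or t == -len(history_feats):
        return visual_attns[t]          # confident (or history exhausted): ground at this round
    # not confident (e.g. the question starts with a pronoun): recurse one round back
    return recursive_attention(history_feats[t - 1], history_feats, visual_attns,
                               confidence_fn, threshold, t - 1)

# Toy usage with random features and a random confidence scorer.
feats = [torch.randn(8) for _ in range(4)]
attns = [torch.softmax(torch.randn(36), dim=0) for _ in range(4)]
conf = lambda q: torch.sigmoid(q.mean()).item()
attn = recursive_attention(feats[-1], feats, attns, conf)
```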
Abstract: Instance segmentation is a long-standing problem in computer vision and a basic component of many applications, such as autonomous driving. Current instance segmentation methods based on deep neural networks can be categorized into two types, depending on whether the method approaches the problem starting from detection modules or from segmentation modules. In this presentation, I will give an introduction to these two kinds of methods and further cover some ideas about panoptic segmentation.
[View Slides]
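As a rough illustration of the detection-first family, the sketch below crops a detected box with RoIAlign and predicts a mask inside it, in the spirit of Mask R-CNN; the feature map, box, and mask head are all illustrative placeholders.

```python
# Toy detection-first sketch: crop one detected box from a feature map with
# RoIAlign, then predict a binary mask inside it; all sizes are illustrative.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)        # backbone features for one image
boxes = torch.tensor([[0., 4., 4., 20., 20.]])   # (batch_idx, x1, y1, x2, y2) in feature-map coords

roi = roi_align(feature_map, boxes, output_size=(14, 14))   # (1, 256, 14, 14) cropped region
mask_head = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
                          nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
                          nn.Conv2d(256, 1, 1))
mask_logits = mask_head(roi)                     # (1, 1, 28, 28) per-instance mask inside the box
```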
Abstract: Visual relationships represent the various visible and detectable interactions between object pairs. Reasoning about those relationships, formalized as the Visual Relation Detection (VRD) task, can serve as an intermediate building block for higher-level tasks such as image captioning, visual question answering, and image-text matching. The purpose of this seminar is to review the task's challenges, available datasets, and some state-of-the-art methods.
[View Slides]
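As a minimal illustration of the VRD formulation, the sketch below scores predicates for one subject-object pair; the feature sizes and classifier are assumed placeholders (70 predicates follows the original VRD dataset).

```python
# Toy sketch of the VRD task: for an ordered object pair, predict a predicate
# (e.g. "ride", "next to") to form a <subject, predicate, object> triplet.
import torch
import torch.nn as nn

num_predicates = 70                               # predicate vocabulary size in the VRD dataset
subj = torch.randn(1, 2048)                       # subject region feature (illustrative)
obj = torch.randn(1, 2048)                        # object region feature (illustrative)

predicate_clf = nn.Sequential(nn.Linear(2 * 2048, 512), nn.ReLU(),
                              nn.Linear(512, num_predicates))
logits = predicate_clf(torch.cat([subj, obj], dim=1))   # scores over candidate predicates
```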
Abstract: The appearance of AlexNet reignited the popularity of deep learning, and many tasks in computer vision have obtained significant improvements based on such basic CNN architectures pretrained on ImageNet. However, even now in 2019, many researchers are still only familiar with ResNet (the winner of the 2015 ILSVRC challenge). In this presentation, I will review some famous CNN architectures and analyze the philosophy behind their designs, to help us design better CNN architectures for our own specific tasks.
[View Slides]
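As one example of the design philosophies to be discussed, here is a minimal residual block in the spirit of ResNet; channel sizes are illustrative.

```python
# Minimal ResNet-style residual block: the identity shortcut lets gradients
# flow through very deep stacks; channel and spatial sizes are illustrative.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity shortcut

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)
```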
Abstract: Visual Question Answering is an important step from low-level cognition tasks, like visual recognition/detection and sentence analysis, towards general artificial intelligence. Although current VQA systems are still not perfect, the task motivates the community to think about how to build a bridge between visual information and textual information, the two most important sources of information that humans can absorb.
[View Slides]
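As a rough sketch of how such a bridge is usually built, the classic joint-embedding VQA baseline encodes the image and the question separately, fuses them, and classifies over a fixed answer vocabulary; every size below is an assumed placeholder.

```python
# Minimal joint-embedding VQA baseline sketch (illustrative only).
import torch
import torch.nn as nn

img_feat = torch.randn(1, 2048)                       # pooled CNN image feature
question = torch.randint(0, 1000, (1, 14))            # token ids of a padded question

embed = nn.Embedding(1000, 300)
lstm = nn.LSTM(300, 512, batch_first=True)
_, (q_feat, _) = lstm(embed(question))                # last hidden state as question feature

fusion = nn.Sequential(nn.Linear(2048 + 512, 1024), nn.ReLU(), nn.Linear(1024, 3000))
logits = fusion(torch.cat([img_feat, q_feat.squeeze(0)], dim=1))   # scores over 3000 answers
```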
Abstract: Visual Dialog is one of the prototype tasks introduced in recent years, and can be viewed as multi-round VQA. It aims to give a proper answer based on the visual and textual contents of the dialog. In this seminar, I will give a brief introduction to its definition, datasets, metrics, and methods.
[View Slides]
Abstract: Semantic image segmentation is the most fine-grained of the three core tasks in computer vision; it aims to assign a correct semantic label to every pixel in an image. In this seminar, I will explain the definition of semantic segmentation and introduce the corresponding datasets. I will also analyze and summarize the mainstream semantic segmentation models. Finally, I will report my progress on this task. Your comments and criticism are greatly welcomed.
[View Slides]
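As a minimal illustration of "a label for every pixel", the FCN-style sketch below uses a tiny backbone, a 1x1 classifier, and bilinear upsampling; the backbone and class count are placeholders.

```python
# Minimal FCN-style semantic segmentation sketch: downsample, classify each
# location, upsample back to input resolution, and take a per-pixel argmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21                      # e.g. PASCAL VOC (illustrative)
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
classifier = nn.Conv2d(128, num_classes, kernel_size=1)

image = torch.randn(1, 3, 224, 224)
logits = classifier(backbone(image))                                   # (1, 21, 56, 56)
logits = F.interpolate(logits, size=image.shape[-2:], mode='bilinear', align_corners=False)
prediction = logits.argmax(dim=1)                                      # (1, 224, 224) label per pixel
```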
Abstract: Visual grounding is the task of localizing an object in an image based on a natural-language query. It has attracted a lot of attention in recent years. In this seminar, I'll introduce visual grounding, including a) the task definition, b) the datasets, c) a series of mainstream papers, and d) my recent works. Welcome to join, and please feel free to ask any questions.
[View Slides]
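As a rough sketch of one mainstream formulation, grounding can be cast as ranking candidate regions against the query in a shared embedding space; the encoders and sizes below are assumed placeholders.

```python
# Toy grounding-as-ranking sketch: embed the query and each candidate region
# into a shared space and pick the best-matching box; all sizes illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

regions = torch.randn(36, 2048)            # candidate region features
boxes = torch.rand(36, 4)                  # their bounding boxes
query = torch.randint(0, 1000, (1, 6))     # token ids of the referring expression

q_encoder = nn.Sequential(nn.Embedding(1000, 300), nn.Flatten(), nn.Linear(6 * 300, 512))
v_proj = nn.Linear(2048, 512)

q_emb = F.normalize(q_encoder(query), dim=-1)          # (1, 512) query embedding
v_emb = F.normalize(v_proj(regions), dim=-1)           # (36, 512) region embeddings
scores = v_emb @ q_emb.t()                             # cosine similarity per region
best_box = boxes[scores.argmax()]                      # grounded box for the query
```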
Abstract: Visual reasoning aims to answer questions about complicated interactions between visual objects. Existing models can be divided into two categories: holistic approaches and modular approaches. I will introduce typical works from these two categories.
[View Slides]
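To give a flavor of the modular approaches, the toy sketch below composes a few simplified, hypothetical modules (Find, Relate, Answer) over object features; it is not any specific published model.

```python
# Toy sketch of the modular idea: parse the question into a small program
# (Find -> Relate -> Answer) and execute it as a composition of modules.
import torch
import torch.nn as nn

objects = torch.randn(10, 256)                       # features of detected objects (illustrative)

find = nn.Linear(256, 1)                             # Find("red object"): score each object
relate = nn.Linear(512, 1)                           # Relate("left of"): score objects given the found one
answer = nn.Linear(256, 2)                           # Answer: yes/no classifier

attn = torch.softmax(find(objects), dim=0)                     # soft attention over objects
found = (attn * objects).sum(dim=0, keepdim=True)              # (1, 256) attended feature
pair = torch.cat([objects, found.expand(10, 256)], dim=1)      # each object paired with the found one
attn2 = torch.softmax(relate(pair), dim=0)                     # re-attend relative to the found object
logits = answer((attn2 * objects).sum(dim=0, keepdim=True))    # final answer scores
```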
The code is based on Ruotian Luo's implementation of image captioning at https://github.com/ruotianluo/self-critical.pytorch. We use the visual features provided by the paper "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering" at https://github.com/peteanderson80/bottom-up-attention. If you like this code, please consider citing their corresponding papers and my CVPR paper.
[View Github]
The code is implemented in PyTorch. We separate the visual question answering and scene graph generation code into two repositories on GitHub. The VQA code is directly modified from the project Cyanogenoid/vqa-counting, and the SGG code is directly modified from the project rowanz/neural-motifs. If you like this work, please cite their corresponding papers and my CVPR paper.
[View VQA Code] [View SGG Code]