PhD Proposal: Applications of Deep Neural Networks in Vision and NLP problems

Talk
Varun Manjunatha
Time: 05.30.2017 12:00 to 13:30
Location: AVW 3450

Recent advances in machine learning, especially on problems in Computer Vision and Natural Language Processing, have involved deep neural networks trained with enormous amounts of data. The first frontier for deep networks was uni-modal classification and detection problems, directed largely towards "intelligent robotics" and surveillance applications, while the next wave involves deploying deep networks on more creative tasks and common-sense reasoning. In this proposal, we explore three directions in deep learning research involving novel tasks that deep networks can be expected to perform, as well as the amount of data required to train them: a) Can a deep neural network learn and infer from bi-modal Vision+NLP datasets, such as comic books? b) In natural language tasks involving deep networks, does the order of the words in a sentence matter? c) How much data is required for a deep network to learn generic features?

In the first part, I shall explore whether deep networks can predict the next panel in bi-modal comic book data. In comics, most movements in time and space are hidden in the “gutters” between panels. While computers can now describe what is explicitly depicted in natural images, in this work we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. Various deep neural architectures under-perform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.
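
To make the panel-prediction setting concrete, the following is a minimal sketch (in PyTorch) of a text-cloze-style scorer: it encodes the dialogue of preceding panels and ranks candidate next-panel texts. The module names, dimensions, and dot-product scoring are illustrative assumptions, not the architecture studied in the proposal.

import torch
import torch.nn as nn

class TextClozeScorer(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.context_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.candidate_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, rnn, token_ids):
        # token_ids: (batch, seq_len) -> final hidden state (batch, hidden_dim)
        _, (h, _) = rnn(self.embed(token_ids))
        return h[-1]

    def forward(self, context_ids, candidate_ids):
        # context_ids: (batch, ctx_len); candidate_ids: (batch, n_cands, cand_len)
        ctx = self.encode(self.context_rnn, context_ids)            # (batch, hidden)
        b, n, l = candidate_ids.shape
        cands = self.encode(self.candidate_rnn,
                            candidate_ids.view(b * n, l)).view(b, n, -1)
        # Score each candidate text by similarity to the context representation.
        return torch.einsum('bh,bnh->bn', ctx, cands)               # (batch, n_cands)

# Toy usage: rank 3 candidate next-panel texts for each of 2 comic contexts.
scorer = TextClozeScorer(vocab_size=5000)
context = torch.randint(0, 5000, (2, 30))
candidates = torch.randint(0, 5000, (2, 3, 15))
scores = scorer(context, candidates)   # higher score = more plausible next text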

In the second part, I shall explore the role of word order in NLP tasks. For many NLP tasks, ordered models, which explicitly encode word order information, do not significantly outperform unordered (bag-of-words) models. One potential explanation is that the tasks themselves do not require word order to solve. To test whether this explanation is valid, we perform several time-controlled human experiments with scrambled language inputs. We compare human accuracies to those of both ordered and unordered neural models. Our results contradict the initial hypothesis, suggesting instead that humans may be less robust to word order variation than computers.
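
The ordered/unordered distinction can be illustrated with two toy classifiers: an unordered model that averages word embeddings (and is therefore invariant to scrambling) and an ordered model that reads the sentence with an LSTM. Both models and their dimensions are hypothetical, chosen only to make the contrast explicit.

import torch
import torch.nn as nn

class UnorderedClassifier(nn.Module):
    """Bag-of-words: averaging embeddings discards all word-order information."""
    def __init__(self, vocab_size, embed_dim=100, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, n_classes)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        return self.out(self.embed(token_ids).mean(dim=1))

class OrderedClassifier(nn.Module):
    """Sequence model: the LSTM's final state depends on word order."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, token_ids):
        _, (h, _) = self.rnn(self.embed(token_ids))
        return self.out(h[-1])

# Scrambling the words changes the ordered model's input representation,
# but the unordered model's prediction is unaffected.
tokens = torch.randint(0, 1000, (1, 8))
scrambled = tokens[:, torch.randperm(8)]
bow, lstm = UnorderedClassifier(1000), OrderedClassifier(1000)
assert torch.allclose(bow(tokens), bow(scrambled))   # order-invariant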

In the third part, I shall explore the role of classes in training generic fc-features in deep CNNs. In recent years, it has become common practice to extract fully-connected-layer (fc) features learned while performing image classification on a source dataset, such as ImageNet, and apply them to a wide range of other tasks. The general usefulness of large training datasets for feature extraction is not yet well understood and raises a number of questions. For example, in the context of transfer learning, what is the role of a specific class in the source dataset, and how is the transferability of fc features affected when they are trained using various subsets of the classes in the source dataset? In this work, we address the question of how to select an optimal subset of classes, subject to a budget constraint, that is more likely to generate good features for other tasks.
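
As a hedged illustration of the transfer setup in question, the sketch below freezes a network pretrained on a source dataset, extracts its fc-layer features, and trains only a linear classifier on the target task. The choice of torchvision's VGG-16 and the 10-class target head are assumptions for illustration; the proposal's class-subset selection is not reproduced here.

import torch
import torch.nn as nn
from torchvision import models

def build_fc_feature_extractor():
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    # Keep everything up to the second fc layer; drop the 1000-way ImageNet
    # classifier so the network outputs a 4096-d fc feature vector.
    vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
    for p in vgg.parameters():
        p.requires_grad = False       # source features stay fixed
    vgg.eval()
    return vgg

extractor = build_fc_feature_extractor()
target_head = nn.Linear(4096, 10)     # e.g. a 10-class target task

images = torch.randn(4, 3, 224, 224)  # a toy target-task batch
with torch.no_grad():
    features = extractor(images)      # (4, 4096) generic fc features
logits = target_head(features)        # only this layer is trained on the target task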

Avenues for future work in this proposal are: a) Can a machine learn to correctly order a sequence of 'n' panels in a comic book while maintaining narrative coherence, as judged by humans? b) Can a machine learn to generate dialogue between characters in a panel that is consistent with the artwork in that panel? c) To train a CNN, can we propose a much smaller subset of the ImageNet dataset that results in "equally powerful" fully-connected features as a network trained on the entire dataset?

Examining Committee:

Chair: Dr. Larry Davis

Dept rep: Dr. V.S. Subrahmanian

Member: Dr. David Jacobs