Language and Vision Workshop

A workshop on

language and vision

at CVPR 2017

21 July 2017
8:45am to 6:00pm
Hawaii Convention Center

This workshop is co-organized by the Center for Brains, Minds, and Machines.

See the 2015 version of the workshop


8:50 Introduction
9:00 Lawson Wong, Stephanie Tellex
Interpreting Human-Robot Instructions: Grounding natural language (and perhaps perception) to robot actions
9:30 Dhruv Batra
Visual Dialog: Towards AI Agents That Can See, Talk, and Act
10:00 David Hogg
Visual and language concept discovery
10:30 Morning Break
11:00 Alan Yuille
11:30 Ev Fedorenko
Human language as a Code for Thought
12:00 Poster highlights
12:30 Lunch
2:00 Poster session
3:00 Ted Gibson
Color naming across languages reflects language use
3:30 Devi Parikh
Towards Theory of AI's Mind
4:00 Afternoon Break
4:30 Song-Chun Zhu
5:00 Tao Mei
Video-To-Text Corpus
5:20 Andrei Barbu
Robots that communicate
5:30 Vision-language panel
6:00 Closing remarks


Despite a recent surge of interest, the interaction between language and vision remains largely unexplored. The topic is particularly relevant to the vision community because humans routinely perform tasks that involve both modalities, and we do so largely without noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you are seeing, you are performing a task that bridges a linguistic and a visual representation. The importance of vision-language interaction is also visible in the many approaches that cross domains, such as the popularity of image grammars. More concretely, we have recently seen renewed interest in one-shot learning for object and event models. Humans go further: our linguistic abilities let us perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "a horse-like animal with black and white stripes" without ever having seen one.

Furthermore, integrating language with vision could expand the horizons and tasks of the vision community. Image- and video-to-text tasks have grown significantly, but many other potential applications of such integration – answering questions, dialogue systems, and grounded language acquisition – remain largely unexplored. Beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent, global understanding of a scene. This process is facilitated by our ability to inform perception with high-level knowledge, which provides resilience against errors in low-level perception. Language also offers a framework for learning about the world: it can describe many phenomena succinctly, thereby helping to filter out irrelevant details.

Topics covered (non-exhaustive):

  • language as a mechanism to structure and reason about visual perception,
  • language as a learning bias to aid vision in both machines and humans,
  • novel tasks which combine language and vision,
  • dialogue as a means of sharing knowledge about visual perception,
  • stories as a means of abstraction,
  • transfer learning across language and vision,
  • understanding the relationship between language and vision in humans,
  • reasoning visually about language problems,
  • visual captioning, dialogue, and question-answering,
  • visual synthesis from language,
  • sequence learning towards bridging vision and language,
  • joint video and language alignment and parsing, and
  • video sentiment analysis.

The workshop will also include presentations related to the MSR Video to Language Challenge. This challenge aims to foster the development of new techniques for video understanding, in particular video captioning, with the goal of automatically generating a complete, natural, and salient sentence describing a video.

We are calling for 2 to 4 page extended abstracts to be showcased at a poster session along with short talk spotlights. Abstracts are not archival and will not be included in the Proceedings of CVPR 2017. In the interest of fostering a freer exchange of ideas, we welcome both novel and previously published work.

We are also accepting full submissions, which will not be included in the Proceedings of CVPR 2017; at the authors' option, we will provide a link to the relevant arXiv submission.


  • Andrei Barbu
    Research Scientist
  • Tao Mei
    Senior Researcher
    Microsoft Research, China
  • Siddharth Narayanaswamy
    Postdoctoral Scholar
    University of Oxford
  • Puneet Kumar Dokania
    Postdoctoral Associate
    University of Oxford
  • Quanshi Zhang
    Postdoctoral Researcher
    University of California, Los Angeles (UCLA)
  • Nishant Shukla
    Graduate Research Assistant
    University of California, Los Angeles (UCLA)
  • Jiebo Luo
    University of Rochester
  • Rahul Sukthankar
    Adjunct Research Professor
    Google Research and CMU