Thursday, 11 June 2015
Room 201; 8:50am to 5:30pm
Boston, Massachusetts
This workshop is co-organized by the Center for Brains, Minds, and Machines.
Invited speakers
- Fei-Fei Li
- Song-Chun Zhu
- Tomaso Poggio
- Linda Smith
- Tony Cohn
- Jeffrey M. Siskind
- Stefanie Tellex
- Jason J. Corso
- Patrick H. Winston
- Joyce Chai
- Kristen Grauman
Schedule
| Time | Session |
|---|---|
| 8:50 | Introduction |
| 9:00 | Song-Chun Zhu |
| 9:30 | Linda Smith |
| 10:00 | Morning Break |
| 10:15 | Kristen Grauman |
| 10:45 | Jason J. Corso |
| 11:15 | Stefanie Tellex |
| 11:45 | Image Annotation Challenge Summary |
| 12:15 | Lunch |
| 1:00 | Poster session in the main ballroom |
| 2:00 | Jeffrey Mark Siskind |
| 2:30 | Joyce Chai |
| 3:00 | Patrick Winston |
| 3:30 | Afternoon Break |
| 3:45 | Tony Cohn |
| 4:15 | Fei-Fei Li |
| 4:45 | Tomaso Poggio |
| 5:15 | Discussion and Wrap Up |
Accepted submissions
We accepted 14 submissions as posters. Note that this workshop is non-archival; we link to arXiv versions when the authors requested that we do so.
- Tell and Predict: Kernel Classifier Prediction for Unseen Visual Classes from Unstructured Text Descriptions
- Long-term Recurrent Convolutional Networks for Visual Description
- Holistic Scene Understanding via Multiple Structured Hypotheses from Perception Modules
- Language and Robots: An Extensible Language Interface for Robot Interaction
- Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database for Automated Image Interpretation
- VQA: Visual Question Answering
- Beyond “Single Snippet-Single Sentence” Video Description
- Multimodal Stacked Denoising Autoencoders
- Zero-Shot Recognition with Unreliable Attributes
- Vision and Language, Helping a Robot to Reason about its Environment
- Extending The Guesser Based Model: Adding Absolute Location and Relative Attributes to Referring Expressions
- Viralets: Learning from Viral Videos to Identify Semantic Highlight in Personal Videos
- Semantic Fusion of FMV and Chat Data for Activity Recognition
- Sequence to Sequence Video to Text
Call
The interaction between language and vision, despite gaining traction recently, is still largely unexplored. This topic is particularly relevant to the vision community because humans routinely perform tasks that involve both modalities, largely without even noticing. Every time you ask for an object, ask someone to imagine a scene, or describe what you're seeing, you are performing a task that bridges a linguistic and a visual representation. The importance of vision-language interaction can also be seen in the many approaches that cross domains, such as the popularity of image grammars. More concretely, we have recently seen renewed interest in one-shot learning for object and event models. Humans go further than this using their linguistic abilities: we perform zero-shot learning without seeing a single example. You can recognize a picture of a zebra after hearing the description "horse-like animal with black and white stripes" without ever having seen one.
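A minimal sketch of how such a linguistic description can drive zero-shot recognition, assuming a simple attribute-based formulation with a hypothetical bank of attribute detectors (the attribute names, class descriptions, and scores below are illustrative only, not part of the workshop material):

```python
import numpy as np

# Hypothetical attribute vocabulary; in practice these would come from
# pretrained attribute classifiers.
ATTRIBUTES = ["horse-like", "black", "white", "striped"]

# Class descriptions expressed purely in language (attribute presence)
# for classes the vision system has never seen an image of.
UNSEEN_CLASSES = {
    "zebra": np.array([1, 1, 1, 1]),
    "panda": np.array([0, 1, 1, 0]),
}

def zero_shot_classify(attribute_scores):
    """Pick the unseen class whose linguistic description best matches
    the attribute scores predicted for an image (cosine similarity)."""
    best_class, best_score = None, -np.inf
    for name, description in UNSEEN_CLASSES.items():
        score = attribute_scores @ description / (
            np.linalg.norm(attribute_scores) * np.linalg.norm(description))
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Scores a (hypothetical) attribute detector might output for a zebra photo.
print(zero_shot_classify(np.array([0.9, 0.8, 0.85, 0.95])))  # -> "zebra"
```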
Furthermore, integrating language with vision opens up the possibility of expanding the horizons and tasks of the vision community. We have seen significant growth in image- and video-to-text tasks, but many other potential applications of such integration, such as answering questions, dialog systems, and grounded language acquisition, remain unexplored. Going beyond such novel tasks, language can make a deeper contribution to vision: it provides a prism through which to understand the world. A major difference between human and machine vision is that humans form a coherent and global understanding of a scene. This process is facilitated by our ability to shape perception with high-level knowledge, which provides resilience in the face of errors from low-level perception. Language also provides a framework through which one can learn about the world: it can describe many phenomena succinctly, thereby helping to filter out irrelevant details.
Topics covered:
- language as a mechanism to structure and reason about visual perception,
- language as a learning bias to aid vision in both machines and humans,
- novel tasks which combine language and vision,
- dialog as a means of sharing knowledge about visual perception,
- stories as a means of abstraction,
- transfer learning across language and vision,
- understanding the relationship between language and vision in humans,
- reasoning visually about language problems, and
- joint video and language parsing.
The workshop will also include a challenge related to the 4th edition of the Scalable Concept Image Annotation Challenge, one of the tasks of ImageCLEF. The Scalable Concept Image Annotation task aims to develop techniques that allow computers to reliably describe images, localize the different concepts depicted in the images, and generate a description of the scene. The task directly related to this workshop is Generation of Textual Descriptions of Images.
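As a purely illustrative sketch of what the textual-description task asks for, assuming a toy set of detected concept labels and rough locations (nothing here reflects the actual challenge data format or any baseline), a template-based description generator might look like:

```python
from typing import List, Tuple

def describe(concepts: List[Tuple[str, str]]) -> str:
    """Turn (label, region) pairs, e.g. ("dog", "left"), into a sentence."""
    if not concepts:
        return "An image with no recognized concepts."
    parts = [f"a {label} on the {region}" for label, region in concepts]
    return "An image showing " + ", ".join(parts) + "."

print(describe([("dog", "left"), ("bicycle", "right")]))
# -> "An image showing a dog on the left, a bicycle on the right."
```

Challenge entries are of course expected to go well beyond such templates.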
We are calling for 1- to 2-page extended abstracts to be showcased at a poster session. Abstracts are not archival and will not be included in the Proceedings of CVPR 2015. We welcome both novel and previously published work.
Contributions to the Generation of Textual Descriptions challenge will also be
showcased at the poster session, and a summary of the results will be presented
at the workshop.
Organizers
- Andrei Barbu, Postdoctoral Associate, MIT
- Georgios Evangelopoulos, Postdoctoral Fellow, Istituto Italiano di Tecnologia and MIT
- Daniel Harari, Postdoctoral Associate, MIT
- Krystian Mikolajczyk, Reader in Robot Vision, University of Surrey
- Siddharth Narayanaswamy, Postdoctoral Scholar, Stanford University
- Caiming Xiong, Postdoctoral Associate, UCLA
- Yibiao Zhao, PhD student, UCLA