GUIDANCE - Debugging Computer Vision Models via Controlled Cross-modal Generation

While content is increasingly available in mixed modalities such as text, audio, images and video, most deep learning efforts have focused on monomodal approaches that deal primarily with either text or images. The need to process mixed content is particularly pressing in creative contexts, where the combination of text, audio and visual content can enable new possibilities.

The convergence of deep learning with Computer Vision and Natural Language Processing has enabled not only effective understanding and retrieval, but also generation of textual and visual information. What is still missing, however, is a unified methodology that seamlessly integrates the different modalities through multimodal processing of visual, audio and textual data. This project will make a quantum leap by investigating and developing innovative cross-modal neural models that can manipulate and transform different types of data seamlessly, enabling:

1. cross-modal processing of textual, audio and visual input to create efficient and reusable representations in a shared space (a minimal sketch of such a shared space follows this list);
2. cross-modal understanding of textual, audio and visual content for retrieval of digital data;
3. cross-modal generation, e.g. producing images/video from textual/audio content and vice versa, including mixed content.

At the core of the project lies a new unifying paradigm that seeks synergies between supervised neural networks (going beyond current convolutional autoencoders, GANs, Transformer-based networks, capsule networks and graph-based networks) and symbolic representations, such as those obtained from multilingual lexical-semantic knowledge graphs.
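
To make the idea of a shared cross-modal representation space concrete, the following is a minimal, illustrative sketch (not the project's actual architecture) of how image and text features can be projected into a common space and aligned with a symmetric contrastive loss, so that cross-modal retrieval reduces to nearest-neighbour search. The module names, feature dimensions and the choice of PyTorch are assumptions made for the sake of the example.

```python
# Illustrative sketch only: aligning image and text features in a shared
# embedding space with a symmetric contrastive (InfoNCE-style) loss.
# All names and dimensions below are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, shared_dim=512):
        super().__init__()
        # Projection heads map modality-specific features into one space.
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.temperature = nn.Parameter(torch.tensor(0.07))

    def forward(self, image_feats, text_feats):
        # L2-normalise so that similarity becomes a cosine score.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, temperature):
    # Pairwise similarities between all images and texts in the batch.
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric loss: match each image to its text and each text to its image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage with random stand-in features (real encoders would supply these):
model = SharedSpaceModel()
image_feats = torch.randn(8, 2048)   # e.g. pooled CNN features
text_feats = torch.randn(8, 768)     # e.g. Transformer sentence embeddings
img, txt = model(image_feats, text_feats)
loss = contrastive_loss(img, txt, model.temperature)
# Cross-modal retrieval then amounts to nearest-neighbour search in this space.
```

In the same spirit, an audio projection head could be added alongside the image and text ones, and a downstream generation model could condition on the resulting shared embeddings; the project itself targets richer architectures than this sketch, including hybrid neural-symbolic ones.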