Dialnet


Abstract of Image and video object segmentation in low supervision scenarios

Míriam Bellver Bueno

  • Computer vision plays a key role in Artificial Intelligence because of the rich semantic information contained in pixels and the ubiquity of cameras nowadays. Multimedia content is on the rise, as social networks have a strong impact on our society and access to the internet becomes more widespread. This context allows the gathering of large datasets, which have fostered great advances in computer vision thanks to deep neural networks. These models can effectively exploit large amounts of data to reach a high expressive power. Since the advent of ImageNet, a large dataset for image classification, most computer vision tasks have benefited from deep neural networks. Among the different tasks in computer vision, locating objects in images and videos is a central one, with many applications in autonomous driving, surveillance, image and video editing, medical diagnosis, and biometrics, among others. Localization of objects can be expressed as bounding boxes around the target objects, or as accurate pixel-level masks that delineate the instances. The latter is a more challenging task, but fundamental for applications where the edges of objects need to be determined. The main task addressed in this thesis is instance segmentation, which consists in providing, given an image or video, a pixel-level mask for each instance of certain semantic object classes.

    In order to train a segmentation model, current solutions rely on large amounts of pixel-wise annotations, which demand significant human effort to collect. Furthermore, expert knowledge is needed to gather certain annotations, such as labels for medical images. In consequence, there is great interest in systems that work with less demanding forms of supervision, such as weakly or semi-supervised pipelines. Moreover, in some segmentation tasks, human effort is needed not only for training the models, but also at inference. In semi-automatic systems, user input may be required as guidance to start the system. One example is the task of one-shot Video Object Segmentation (osVOS), which expects the end-user to provide a pixel-level mask for each object to be tracked in the first frame of the video. The model must then predict the segmentation mask of the tracked objects for the remaining frames. These initialization cues are crucial for high accuracy, but they are arduous to obtain. An alternative is to use models that rely on weaker, more user-friendly input signals.

    This thesis explores different supervision scenarios for the instance segmentation task, distinguishing between supervision during training and at inference, and focusing on low-supervision setups. In the first part of the thesis we present a novel recurrent architecture for video object segmentation that is end-to-end trainable in a fully-supervised setup and requires no post-processing step, i.e., the output of the model directly solves the addressed task. The second part of the thesis aims at lowering the annotation cost, in terms of labeling time, needed to train image segmentation models. We explore semi-supervised pipelines and report results when a very limited annotation budget is available. The third part of the dissertation attempts to alleviate the supervision required by semi-automatic systems at inference time. In particular, we focus on semi-supervised video object segmentation, which typically requires a binary mask of each instance to be tracked as initialization. In contrast, we present a model for language-guided video object segmentation, which identifies the object to segment with a natural language expression. We study current benchmarks, propose a novel categorization of referring expressions for video, and identify the main challenges posed by the video task.
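The one-shot video object segmentation setup discussed in the abstract can be pictured as a mask-propagation loop: the user supplies a pixel-level mask for the first frame, and the model predicts masks for every subsequent frame. The snippet below is only a minimal illustration of that protocol, not the thesis model; `toy_predict` is a hypothetical stand-in for a trained segmentation network.

```python
import numpy as np

def propagate_masks(frames, first_frame_mask, predict_mask):
    """One-shot VOS protocol sketch: given the user-provided mask for the
    first frame, predict a mask for each remaining frame, conditioning
    each prediction on the previous frame's mask."""
    masks = [first_frame_mask]
    prev_mask = first_frame_mask
    for frame in frames[1:]:
        prev_mask = predict_mask(frame, prev_mask)
        masks.append(prev_mask)
    return masks

def toy_predict(frame, prev_mask):
    # Hypothetical placeholder for a real segmentation network:
    # keeps previously-masked pixels that are bright in the current frame.
    return (frame > 0.5) & prev_mask

# Three identical toy 4x4 "frames" and a 2x2 initial object mask.
frames = [np.full((4, 4), 0.8) for _ in range(3)]
init_mask = np.zeros((4, 4), dtype=bool)
init_mask[1:3, 1:3] = True

masks = propagate_masks(frames, init_mask, toy_predict)
print(len(masks), int(masks[-1].sum()))  # one mask per frame; object persists
```

A weaker input signal, as explored in the third part of the thesis, would replace `first_frame_mask` with, for example, a natural language expression identifying the object.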

