Explora I+D+i UPV


LIGHTWEIGHT VISION ENCODER DECODER FOR MULTI-MODAL TRANSCRIPTION PROBLEMS

University Research Centre (Centro Propio de Investigación) Pattern Recognition and Human Language Technology

Start year

2024

Funding body

CONSELLERIA DE EDUCACION, UNIVERSIDADES Y EMPLEO

Project type

COMPETITIVE RESEARCH PROJECTS (INV. COMPETITIVA PROYECTOS)

Principal investigator

Paredes Palacios Roberto

Abstract

This project focuses on innovating Vision Encoder-Decoder (VED) models in artificial intelligence, specifically targeting their size and efficiency challenges. VED models, which bridge visual perception and language processing, are transformed by replacing their transformer-based decoders with MLP-Mixer layers and by training with the Connectionist Temporal Classification (CTC) loss. This approach aims to reduce model complexity and computational demands, making VED models more efficient for tasks such as image captioning and text recognition. In addition, the integration of reinforcement learning as a training method, using classifiers as reward functions, represents a significant shift from traditional training regimes. This novel strategy is expected to enhance the models' ability to generate high-quality, domain-specific text.

The project's methodology is structured into seven Work Packages (WPs), covering a literature review, fundamental research on VED model innovation, advanced training techniques, and specific applications: line-based and full-page handwritten text recognition, medical report generation from X-ray images, and lip reading. Each WP is designed to explore thoroughly the potential and limitations of the new VED approach. The main advantage of this approach is its balance between computational efficiency and the capability to accurately process complex visual-textual data, which is crucial for practical, powerful AI solutions in real-world scenarios.
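To make the architectural idea concrete, the following is a minimal sketch, not the project's actual implementation, of the decoder side described above: an MLP-Mixer block applied to encoder features, producing per-timestep character logits of the kind a CTC loss consumes during training. All sizes, parameter names, and the ReLU-for-GELU simplification are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the channel (last) dimension."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def mlp(x, w1, w2):
    """Two-layer MLP; ReLU stands in for GELU to keep the sketch short."""
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, p):
    """One MLP-Mixer block: token mixing, then channel mixing, with residuals."""
    # Token mixing: transpose so the MLP mixes information across patch positions.
    x = x + mlp(layer_norm(x).T, p["tok_w1"], p["tok_w2"]).T
    # Channel mixing: the MLP acts on each position's feature vector.
    x = x + mlp(layer_norm(x), p["ch_w1"], p["ch_w2"])
    return x

rng = np.random.default_rng(0)
T, D, H, C = 32, 64, 128, 80  # hypothetical: patches, channels, hidden dim, charset size
params = {
    "tok_w1": rng.standard_normal((T, H)) * 0.02,
    "tok_w2": rng.standard_normal((H, T)) * 0.02,
    "ch_w1":  rng.standard_normal((D, H)) * 0.02,
    "ch_w2":  rng.standard_normal((H, D)) * 0.02,
}
feats = rng.standard_normal((T, D))        # stand-in for the vision encoder's output
head = rng.standard_normal((D, C)) * 0.02  # linear projection to character logits
logits = mixer_block(feats, params) @ head
print(logits.shape)  # (32, 80): one character distribution per timestep
```

During training, these per-timestep logits (after a log-softmax) would feed a CTC loss, which aligns them to the target transcription without requiring per-frame labels; no attention or autoregressive decoding is involved, which is the source of the claimed efficiency gain.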