Explora R+D+I UPV


LIGHTWEIGHT VISION ENCODER DECODER FOR MULTI-MODAL TRANSCRIPTION PROBLEMS

Centro Propio de Investigación Pattern Recognition and Human Language Technology

Start year

2024

Funding body

CONSELLERIA DE EDUCACION, UNIVERSIDADES Y EMPLEO

Project type

INV. COMPETITIVA PROYECTOS

Principal investigator

Paredes Palacios Roberto

Abstract

This project focuses on innovating Vision Encoder-Decoder (VED) models in artificial intelligence, specifically targeting their size and efficiency challenges. VED models, which bridge visual perception and language processing, are redesigned by replacing their transformer-based decoders with MLP-Mixer layers and training with a Connectionist Temporal Classification (CTC) loss. This approach aims to reduce model complexity and computational demands, making VED models more efficient for tasks such as image captioning and text recognition. Additionally, the integration of reinforcement learning as a training method, using classifiers as reward functions, represents a significant shift from traditional training schemes. This novel strategy is expected to enhance the model's ability to generate high-quality, domain-specific text. The project methodology is structured into seven Work Packages (WPs) covering a literature review, fundamental research on VED model innovation, advanced training techniques, and specific applications. These applications include line-based and full-page handwritten text recognition, medical report generation from X-ray images, and lip reading. Each WP is designed to thoroughly explore the potential and limitations of the new VED approach. The main advantage of this approach is its balance between computational efficiency and the capability to accurately process complex visual-textual data, which is crucial for practical, powerful AI solutions in real-world scenarios.
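The core architectural idea — a non-autoregressive decoder built from MLP-Mixer layers that maps encoder features to per-frame character logits, suitable for a CTC loss — can be sketched roughly as follows. This is an illustrative NumPy sketch, not the project's actual implementation; all dimensions, parameter names, and the single-block depth are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # normalize each feature vector (last axis) to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def mlp(x, w1, w2):
    # two-layer MLP; ReLU here for brevity (the MLP-Mixer paper uses GELU)
    return np.maximum(x @ w1, 0.0) @ w2

def mixer_block(x, p):
    # token mixing: transpose so the MLP acts across sequence positions
    y = layer_norm(x)
    x = x + mlp(y.T, p["tok_w1"], p["tok_w2"]).T
    # channel mixing: the MLP acts across the feature dimension
    y = layer_norm(x)
    x = x + mlp(y, p["ch_w1"], p["ch_w2"])
    return x

# illustrative sizes: 32 frames, 64-dim features, 80-character alphabet
seq_len, d_model, d_hidden, n_chars = 32, 64, 128, 80

params = {
    "tok_w1": rng.normal(0, 0.02, (seq_len, d_hidden)),
    "tok_w2": rng.normal(0, 0.02, (d_hidden, seq_len)),
    "ch_w1":  rng.normal(0, 0.02, (d_model, d_hidden)),
    "ch_w2":  rng.normal(0, 0.02, (d_hidden, d_model)),
}

# stand-in for the vision encoder's output: one feature vector per frame
features = rng.normal(0, 1, (seq_len, d_model))

decoded = mixer_block(features, params)
# per-frame character logits; n_chars + 1 reserves the CTC blank symbol
w_out = rng.normal(0, 0.02, (d_model, n_chars + 1))
logits = decoded @ w_out
print(logits.shape)  # (32, 81)
```

Because every frame's logits are produced in one parallel pass, decoding avoids the step-by-step generation of a transformer decoder; the CTC loss then aligns the frame-level predictions to the target transcription without needing explicit segmentation.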