Abstract
This project addresses the size and efficiency challenges of Vision Encoder-Decoder (VED) models in artificial intelligence. VED models, which bridge visual perception and language processing, are reworked here by replacing their transformer-based decoders with MLP-Mixer layers and training them with the Connectionist Temporal Classification (CTC) loss. This approach aims to reduce model complexity and computational demands, making VED models more efficient for tasks such as image captioning and text recognition. In addition, the project adopts reinforcement learning as a training method, using classifiers as reward functions, which represents a significant shift from traditional training methods. This strategy is expected to enhance the models' ability to generate high-quality, domain-specific text.
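To make the architectural proposal concrete, the following is a minimal sketch, in PyTorch, of how a non-autoregressive MLP-Mixer decoder can be wired to a CTC loss. The layer sizes, the number of blocks, and the random stand-in for encoder features are illustrative assumptions, not the project's actual implementation.

```python
# Minimal sketch of a VED-style recognizer with an MLP-Mixer decoder trained
# with CTC loss (PyTorch). All sizes and the random "encoder output" below
# are illustrative assumptions only.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: a token-mixing MLP followed by a channel-mixing MLP."""
    def __init__(self, num_tokens: int, dim: int, hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(  # mixes information across sequence positions
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(  # mixes information across feature channels
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)      # (batch, dim, tokens) for token mixing
        x = x + self.token_mlp(y).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

class MixerCTCDecoder(nn.Module):
    """Replaces an autoregressive transformer decoder: the whole output
    sequence is predicted in one parallel pass and aligned via CTC."""
    def __init__(self, num_tokens=128, dim=256, hidden=512, depth=4, vocab=100):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(num_tokens, dim, hidden)
                                      for _ in range(depth)])
        self.head = nn.Linear(dim, vocab + 1)  # +1 for the CTC blank symbol

    def forward(self, feats):                  # feats: (batch, tokens, dim)
        return self.head(self.blocks(feats))   # (batch, tokens, vocab + 1)

# Toy training step on random data to show the CTC wiring.
decoder = MixerCTCDecoder()
feats = torch.randn(2, 128, 256)                     # pretend encoder output
log_probs = decoder(feats).log_softmax(-1).transpose(0, 1)  # CTC expects (T, B, C)
targets = torch.randint(1, 101, (2, 20))             # label indices; 0 is blank
input_lens = torch.full((2,), 128, dtype=torch.long)
target_lens = torch.full((2,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```

Because the Mixer decoder emits all timesteps in parallel, CTC supplies the alignment between the fixed-length output grid and the variable-length transcript, which is what removes the autoregressive decoding loop of a transformer decoder.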
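The reinforcement-learning idea can likewise be sketched as a REINFORCE-style update in which a frozen classifier scores sampled output sequences and that score serves as the reward. The toy decoder, the classifier architecture, the reward definition, and the mean baseline below are all assumptions made purely for illustration.

```python
# Hedged sketch of RL training with a classifier as reward function:
# sequences sampled from the decoder are scored by a frozen classifier,
# and the score drives a REINFORCE update. Everything here (toy decoder,
# reward model, baseline) is an illustrative assumption.
import torch
import torch.nn as nn

vocab, steps = 100, 16
decoder = nn.Linear(32, vocab)                 # stand-in for the VED decoder head
classifier = nn.Sequential(nn.Embedding(vocab, 32), nn.Flatten(),
                           nn.Linear(32 * steps, 1), nn.Sigmoid())  # frozen reward model
for p in classifier.parameters():
    p.requires_grad_(False)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

feats = torch.randn(4, steps, 32)              # pretend encoder features
logits = decoder(feats)                        # (batch, steps, vocab)
dist = torch.distributions.Categorical(logits=logits)
samples = dist.sample()                        # sampled output sequences
seq_log_prob = dist.log_prob(samples).sum(-1)  # log-likelihood per sequence
with torch.no_grad():
    reward = classifier(samples).squeeze(-1)   # classifier score as reward signal
    baseline = reward.mean()                   # simple variance-reduction baseline
loss = -((reward - baseline) * seq_log_prob).mean()  # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The appeal of this setup is that the reward need not be differentiable with respect to the generated tokens, so any domain classifier can shape generation toward high-quality, domain-specific text.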
The project methodology is structured into seven Work Packages (WPs) covering a literature review, fundamental research on the VED model innovations, advanced training techniques, and specific applications. These applications include line-based and full-page handwritten text recognition, medical report generation from X-ray images, and lip reading. Each WP is designed to explore the potential and the limitations of the new VED approach in depth. The main advantage of this approach is its balance between computational efficiency and the capability to process complex visual-textual data accurately, which is crucial for practical, high-performing AI solutions in real-world scenarios.