Semantic Entity Alignment and Non-Corresponding Reasoning for Text-to-Image Person Re-identification
Abstract: With the rapid development of intelligent surveillance technology, the massive amount of multimodal data (e.g., videos, images, and text) has imposed higher demands on efficient information ...
Abstract: Text-based Visual Question Answering (TextVQA) focuses on answering questions about the scene text in images. Most works in this field use transformer-based models to model the ...