Abstract: One fundamental task of multimodal models is to translate referred image regions to human preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results