Abstract: Extending large image-text pre-trained models (e.g., CLIP) to video understanding has seen significant progress. To enable CLIP to perceive dynamic information in ...