A hierarchical interaction multimodal model for feature fusion based on RoBERTa-Keyword-ViT