Multimodal grid features and cell pointers for scene text visual question answering | Publication