Listen as you wish: Fusion of audio and text for cross-modal event detection in smart cities | Publication