Multi-Modality

PDiT: Interleaving Perception and Decision-making Transformers for Deep Reinforcement Learning

Designing better deep networks and better reinforcement learning (RL) algorithms are both important for deep RL. This work studies the former. Specifically, the Perception and Decision-making Interleaving Transformer (PDiT) network is proposed, which …

Relation-Aware Distribution Representation Network for Person Clustering With Multiple Modalities

Person clustering with multi-modal clues, including faces, bodies, and voices, is critical for various tasks, such as movie parsing and identity-based movie editing. Related methods such as multi-view clustering mainly project multi-modal features …