We introduce EDGE, a powerful method for editable dance generation that is capable of creating realistic, physically plausible dances while remaining faithful to arbitrary input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning, motion in-betweening, and dance continuation. We compare EDGE to the recent methods Bailando and FACT, and find that human raters strongly prefer dances generated by EDGE.
EDGE uses music embeddings from the powerful Jukebox model to gain a broad understanding of music and create high-quality dances even for in-the-wild music samples.
EDGE uses a frozen Jukebox model to encode input music into embeddings. A conditional diffusion model learns to map the music embedding into a series of 5-second dance clips. At inference time, temporal constraints are applied to batches of multiple clips to enforce temporal consistency before stitching them into a single dance of arbitrary length.
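The stitching step can be illustrated with a small sketch. This is not the paper's implementation; it assumes each clip is an array of pose frames, that consecutive clips share `overlap` frames that were constrained to agree during sampling, and that a simple linear crossfade over the shared frames is an acceptable way to merge them (the function name `stitch_clips` and all shapes are illustrative):

```python
import numpy as np

def stitch_clips(clips, overlap):
    """Concatenate fixed-length motion clips into one long sequence.

    clips: list of (T, D) arrays of pose features; consecutive clips are
    assumed to (approximately) agree on their `overlap` shared frames.
    A linear crossfade blends the shared frames so the output is smooth.
    """
    out = [clips[0]]
    for nxt in clips[1:]:
        prev = out[-1]
        # Crossfade weights ramp from 0 (keep prev) to 1 (keep nxt).
        w = np.linspace(0.0, 1.0, overlap)[:, None]
        blended = (1 - w) * prev[-overlap:] + w * nxt[:overlap]
        out[-1] = prev[:-overlap]          # drop prev's shared tail
        out.append(blended)                # insert the blended frames
        out.append(nxt[overlap:])          # keep nxt's unshared remainder
    return np.concatenate(out, axis=0)
```

With three 10-frame clips and a 4-frame overlap, the result has 3 × 10 − 2 × 4 = 22 frames.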
EDGE avoids unintentional foot sliding and is trained with physical realism in mind.
Dance is full of complex, intentional, sliding foot-ground contact. EDGE learns when feet should and shouldn't slide using our new Contact Consistency Loss, which significantly improves physical realism while keeping sliding intact.
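The idea behind a contact-consistency-style loss can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: it assumes the model predicts both foot joint positions and per-frame contact probabilities, and penalizes foot velocity only on frames the model itself marks as in-contact, so deliberate slides (low contact probability) go unpenalized (the function name and tensor shapes are assumptions):

```python
import torch

def contact_consistency_loss(foot_positions, contact_logits):
    """Penalize foot motion during frames the model predicts as in-contact.

    foot_positions: (B, T, F, 3) predicted 3D positions of F foot joints.
    contact_logits: (B, T, F) predicted contact scores (pre-sigmoid).
    """
    # Finite-difference velocity of each foot joint: (B, T-1, F, 3).
    vel = foot_positions[:, 1:] - foot_positions[:, :-1]
    # Contact probability for each transition: (B, T-1, F).
    contact = torch.sigmoid(contact_logits)[:, 1:]
    # High loss only when a foot moves while predicted to be planted.
    return (vel.norm(dim=-1) * contact).mean()
```

Because the contact weight is learned jointly with the motion, the model can "opt out" of the penalty on frames where sliding is intentional, which is the intuition the caption describes.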
Human raters strongly prefer dances generated by EDGE over those of previous work.