EDGE: Editable Dance Generation from Music

CVPR 2023

Jonathan Tseng
Rodrigo Castellon
Stanford University
C. Karen Liu

We introduce EDGE, a powerful method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to arbitrary input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning, motion in-betweening, and dance continuation. We compare EDGE to recent methods Bailando and FACT, and find that human raters strongly prefer dances generated by EDGE.

Dances generated by EDGE
Pictured: 100 uncurated dance samples from EDGE conditioned on unseen music.

EDGE generates choreographies from music

EDGE uses music embeddings from the powerful Jukebox model to gain a broad understanding of music and create high-quality dances even for in-the-wild music samples.

The EDGE Model, Explained

Pictured: Although EDGE is trained on 5-second dance clips, it can generate dances of any length by imposing temporal constraints on batches of sequences. In the pictured example, EDGE constrains the first half of each sequence to match the second half of the previous one.

EDGE uses a frozen Jukebox model to encode input music into embeddings. A conditional diffusion model learns to map the music embedding into a series of 5-second dance clips. At inference time, temporal constraints are applied to batches of multiple clips to enforce temporal consistency before stitching them into a single dance of arbitrary length.
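To make the stitching concrete, here is a minimal sketch (in PyTorch, not the official EDGE code) of batched long-form sampling: at every denoising step, the first half of each clip is overwritten with the second half of the previous clip, and the overlapping halves are dropped when concatenating. The `model.denoise_step` and `model.pose_dim` names are assumed, illustrative interfaces.

```python
import torch

@torch.no_grad()
def batched_long_form_sample(model, music_batch, timesteps, half):
    """Sketch of long-form generation via temporal constraints.
    music_batch: (num_clips, frames, music_dim) Jukebox-embedding slices,
                 one row per 5-second clip.
    Returns a stitched dance of shape (total_frames, pose_dim).
    NOTE: model.denoise_step(x, t, cond) and model.pose_dim are assumed,
    illustrative interfaces, not the official EDGE API."""
    num_clips, frames, _ = music_batch.shape
    x = torch.randn(num_clips, frames, model.pose_dim)
    for t in reversed(range(timesteps)):
        # one reverse-diffusion step on the whole batch of clips
        x = model.denoise_step(x, t, music_batch)
        # temporal constraint: clip i's first half := clip i-1's second half
        x[1:, :half] = x[:-1, half:]
    # keep clip 0 whole, then only the new (second) half of every later clip
    return torch.cat([x[0]] + [clip[half:] for clip in x[1:]], dim=0)
```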

Joint-Wise Constraint: Generate lower body from upper body

Editable Synthesis

EDGE supports arbitrary spatial and temporal constraints. These can be used to power many end-user applications (see the sketch after this list), including:

  1. Arbitrarily long dances, by enforcing temporal continuity between batches of multiple sequences.
  2. Dances subject to joint-wise constraints, e.g. lower-body generation given upper-body motion, or vice versa.
  3. Motion In-Betweening: Dances that start and end with prespecified motions.
  4. Dance Continuation: Dances that start with a prespecified motion.
  5. And many more!
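As a rough illustration of how such constraints could be expressed, the sketch below builds a binary keep/generate mask over frames and pose dimensions. The mode names, argument names, and mask convention are hypothetical, not EDGE's actual interface.

```python
import torch

def make_edit_mask(frames, pose_dim, mode, upper_body_dims=None,
                   prefix_frames=0, suffix_frames=0):
    """Illustrative sketch: build a (frames, pose_dim) mask where 1 means
    "hold this value fixed to the reference motion" and 0 means "let the
    model generate it". All argument names are hypothetical.
    upper_body_dims: indices of pose dimensions belonging to the upper body."""
    mask = torch.zeros(frames, pose_dim)
    if mode == "lower_from_upper":       # joint-wise: keep upper body, generate lower
        mask[:, upper_body_dims] = 1.0
    elif mode == "in_between":           # keep prespecified start and end motions
        mask[:prefix_frames] = 1.0
        mask[frames - suffix_frames:] = 1.0
    elif mode == "continuation":         # keep only a prespecified starting motion
        mask[:prefix_frames] = 1.0
    return mask
```

During sampling, the masked entries would be re-imposed from the reference motion at each denoising step, in the same way the temporal constraint is applied in the long-form sketch above.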

Physical Plausibility

EDGE avoids unintentional foot sliding and is trained with physical realism in mind.

Dance is full of complex, intentional, sliding foot-ground contact. EDGE learns when feet should and shouldn't slide using our new Contact Consistency Loss, which significantly improves physical realism while leaving intentional sliding intact.
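A minimal sketch of what a contact-consistency-style objective could look like, assuming model-predicted contact probabilities and foot positions obtained from forward kinematics (illustrative, not the paper's verbatim loss):

```python
import torch

def contact_consistency_loss(foot_pos, contact_prob):
    """Penalize foot motion only on frames the model itself predicts as
    in contact, so intentional slides (low predicted contact) are untouched.
    foot_pos:     (frames, n_feet, 3) foot positions from forward kinematics
    contact_prob: (frames - 1, n_feet) model-predicted contact probabilities
    Both inputs and the weighting scheme are illustrative assumptions."""
    foot_vel = foot_pos[1:] - foot_pos[:-1]              # per-frame foot displacement
    weighted = foot_vel * contact_prob.unsqueeze(-1)     # downweight non-contact frames
    return (weighted ** 2).mean()
```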


Human raters strongly prefer dances generated by EDGE over those of previous work.

This website draws heavy design inspiration from the excellent Imagen site.