Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
(F5-TTS + SmoothCache)


This page is for project demonstration purposes only.

Caching Strategies

Prompt:
Text to Generate: I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring.

Cache schedule legend: 🟧 = computed Attn layer, 🟩 = computed FFN layer, ⬜ = cached layer.

Approach | Inference Time (s)
No Cache | 6.71
Original Cache Schedule | 5.09
Cache Attn Only | 5.64
Unified Schedule (Attn Base) | 5.15
Cache FFN Only | 6.24
Unified Schedule (FFN Base) | 4.64
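To make the comparison between strategies easier to read, the inference times in the table above can be turned into speedups relative to the No Cache run. The snippet below is only a small arithmetic sketch; the dictionary and variable names are illustrative, and the numbers are copied from the table.

```python
# Inference times in seconds, copied from the Caching Strategies table.
times = {
    "No Cache": 6.71,
    "Original Cache Schedule": 5.09,
    "Cache Attn Only": 5.64,
    "Unified Schedule (Attn Base)": 5.15,
    "Cache FFN Only": 6.24,
    "Unified Schedule (FFN Base)": 4.64,
}

baseline = times["No Cache"]

# Speedup = uncached time / cached time, so values above 1.0 are faster.
speedups = {name: baseline / t for name, t in times.items()}

# Print the approaches from fastest to slowest.
for name, s in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {times[name]:5.2f} s  {s:4.2f}x")
```

These figures come from a single utterance, so they indicate a trend rather than a benchmark.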

Caching Thresholds

Prompt and text from the demo page of Seed-TTS.

Text | 32 NFE (No Cache) | 32 NFE (α = 0.15) | 32 NFE (α = 0.25)
I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. | 6.71 s | 5.22 s | 4.14 s
Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes, even our lifestyle or belief system. | 9.85 s | 8.73 s | 7.58 s
Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please, consider returning with me. We can work out a plan, but only if you're willing to listen. | 10.41 s | 9.26 s | 7.62 s

Text | 16 NFE (No Cache) | 16 NFE (α = 0.30) | 16 NFE (α = 0.50)
I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. | 3.41 s | 2.65 s | 1.92 s
Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes, even our lifestyle or belief system. | 5.22 s | 4.54 s | 3.64 s
Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please, consider returning with me. We can work out a plan, but only if you're willing to listen. | 5.15 s | 4.51 s | 3.64 s
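The results above show that a larger threshold α gives a shorter inference time. One way to picture why is sketched below, under stated assumptions: each layer has a per-step relative change in its output, and a step is served from the cache while the change accumulated since the last computed step stays below α. The function name, the accumulation rule, and the error values are all hypothetical and made up for illustration; they are not measurements from F5-TTS and are not taken from the SmoothCache implementation.

```python
from typing import List


def schedule_from_errors(errors: List[float], alpha: float) -> List[bool]:
    """Return one flag per step: True = compute the layer, False = reuse the cache.

    errors[i] is a hypothetical relative change of the layer output between
    step i and step i + 1. The first step is always computed because nothing
    has been cached yet.
    """
    schedule = [True]
    accumulated = 0.0  # change accumulated since the last computed step
    for err in errors:
        accumulated += err
        if accumulated < alpha:
            schedule.append(False)  # small drift: reuse the cached output
        else:
            schedule.append(True)   # drift too large: recompute and reset
            accumulated = 0.0
    return schedule


# Made-up error values, for illustration only.
example_errors = [0.05, 0.05, 0.2, 0.1, 0.02]

for alpha in (0.0, 0.15, 0.35):
    flags = schedule_from_errors(example_errors, alpha)
    computed = sum(flags)
    print(f"alpha={alpha:.2f}: computed {computed} of {len(flags)} steps")
```

Under this rule, raising α can only keep or reduce the number of computed steps for this example, which matches the direction of the timings in the tables; the quality cost of the skipped steps has to be judged from the audio.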

Ablation of Caching Steps

Prompt and text chosen for the user study.

Text | 32 NFE (α = 0.15) | 24 NFE (No Cache)
The album received many good reviews and entered the charts at high positions. | 1.98 s | 1.88 s
The area was swirling in dust so intense that it hid the moon from view. | 1.62 s | 1.57 s
Real schools, secondary schools giving a general practical education. | 1.80 s | 1.62 s

Text | 16 NFE (α = 0.30) | 12 NFE (No Cache)
He enters the hotel room but finds that everyone already escaped. | 1.05 s | 0.98 s
He did find it, soon after dawn, and not far from the sand pits. | 0.92 s | 0.84 s
The beam was bent down, perpendicular to the magnetic field. | 1.08 s | 1.04 s
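The ablation pairs a cached run with an uncached run that uses fewer function evaluations, so the useful number is how close the two inference times are. The sketch below computes the ratio of cached to uncached time for each utterance listed above; the variable names are illustrative and the timings are copied from this section.

```python
# (cached time, uncached time) in seconds for each utterance, in page order.
pairs = {
    "32 NFE (alpha = 0.15) vs 24 NFE (No Cache)": [
        (1.98, 1.88),
        (1.62, 1.57),
        (1.80, 1.62),
    ],
    "16 NFE (alpha = 0.30) vs 12 NFE (No Cache)": [
        (1.05, 0.98),
        (0.92, 0.84),
        (1.08, 1.04),
    ],
}

# Ratio above 1.0 means the cached run with more steps was slower.
ratios = {
    label: [cached / uncached for cached, uncached in rows]
    for label, rows in pairs.items()
}

for label, rs in ratios.items():
    mean_ratio = sum(rs) / len(rs)
    print(f"{label}: mean cached/uncached time ratio = {mean_ratio:.3f}")
```

With only three short utterances per setting, these ratios describe this demo rather than a general result.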