Spatial Text-to-Speech

Spatial TTS (Text-to-Speech) aims to produce high-quality speech enriched with spatial cues, thereby enhancing immersion and realism in AR/VR.

Please wear headphones to listen.

Demo

Text: 我情愿像甫乐一样死了一了百了而且革离信任你 (English gloss: "I would rather die like Fu Le and be done with it; besides, Ge Li trusts you.")

Phoneme Sequence: uo, q, ing, van, x, iang, f, u, l, e, i, iang, s, i, l, e, <SP>, i, l, iao, b, ai, l, iao, <SP>, er, q, ie, g, e, l, i, x, in, r, en, <SP>, n, i, <SP> (<SP> marks a silence segment; <AP> marks a breath sound)

Spatial Prompt: [DYNAMIC] Moves from front-right to further front-right.

GT

Mono + SP

CosyVoice + SP

F5-TTS + SP

ISDrama(speech)

Text: 乔伊别说了爸爸他有点不舒服 (English gloss: "Joey, stop talking. Dad isn't feeling very well.")

Phoneme Sequence: q, iao, i, <SP>, b, ie, sh, uo, l, e, <SP>, b, a, b, a, t, a, iou, d, ian, b, u, sh, u, f, u, <SP> (<SP> marks a silence segment; <AP> marks a breath sound)

Spatial Prompt: [Static] Source locates at right up quadrant and pauses in right up quadrant.

GT

Mono + SP

CosyVoice + SP

F5-TTS + SP

ISDrama(speech)

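Text: 不必介意我用二楼卧室的电话打一下说罢走上楼 (English gloss: "Don't mind me, I'll make a call from the phone in the second-floor bedroom," he said, and went upstairs.)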

Phoneme Sequence: b, u, b, i, j, ie, i, <SP>, uo, iong, er, l, ou, uo, sh, i, d, e, d, ian, h, ua, d, a, i, <SP>, x, ia, <SP>, sh, uo, b, a, z, ou, sh, ang, <SP>, l, ou, <SP> (<SP> marks a silence segment; <AP> marks a breath sound)

Spatial Prompt: [Static] Source locates at left-front up quadrant, and pauses in left-front up quadrant.

GT

Mono + SP

CosyVoice + SP

F5-TTS + SP

ISDrama(speech)

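Text: 阿米这两天来我已经跑过三十五个地方了那些家伙大部分连门都不让我进 (English gloss: "Ami, over the past two days I have already been to thirty-five places, and most of those guys wouldn't even let me in the door.")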

Phoneme Sequence: a, m, i, <SP>, zh, e, l, iang, t, ian, l, ai, <SP>, uo, i, j, ing, p, ao, g, uo, s, an, sh, i, u, g, e, d, i, f, ang, l, e, <SP>, n, ei, <SP>, x, ie, j, ia, h, uo, d, a, b, u, f, en, l, ian, m, en, <SP>, d, ou, b, u, r, ang, uo, j, in, <SP> (<SP> marks a silence segment; <AP> marks a breath sound)

Spatial Prompt: [DYNAMIC] Moves from behind-right to front-right, passing through center right, then pauses.

GT

Mono + SP

CosyVoice + SP

F5-TTS + SP

ISDrama(speech)
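
Reading the samples programmatically

The phoneme sequences and spatial prompts listed above follow a simple textual layout: comma-separated phoneme tokens with <SP> and <AP> markers, and a bracketed motion tag followed by a free-text description. The sketch below shows one way such strings could be parsed for inspection. It is only an illustration based on the layout visible on this page; the function names (parse_phonemes, split_on_markers, parse_spatial_prompt) are hypothetical and are not part of any released toolkit or of the models compared here.

```python
import re
from typing import List, Tuple

# Marker tokens as described in the demo notes: <SP> = silence, <AP> = breath.
MARKERS = {"<SP>", "<AP>"}


def parse_phonemes(sequence: str) -> List[str]:
    """Split a comma-separated phoneme string into a list of tokens."""
    return [tok.strip() for tok in sequence.split(",") if tok.strip()]


def split_on_markers(tokens: List[str]) -> List[List[str]]:
    """Group phonemes into segments, breaking at silence/breath markers."""
    segments: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok in MARKERS:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(tok)
    if current:
        segments.append(current)
    return segments


def parse_spatial_prompt(prompt: str) -> Tuple[str, str]:
    """Return (motion tag in upper case, description) from a prompt string.

    Assumes the '[TAG] description' layout seen on this page; the tag is
    upper-cased because the page shows both '[DYNAMIC]' and '[Static]'.
    """
    match = re.match(r"\s*\[(\w+)\]\s*(.*)", prompt)
    if match is None:
        raise ValueError("prompt does not start with a [TAG]")
    return match.group(1).upper(), match.group(2)


if __name__ == "__main__":
    tokens = parse_phonemes("q, iao, i, <SP>, b, ie, sh, uo, l, e, <SP>")
    print(split_on_markers(tokens))
    print(parse_spatial_prompt("[DYNAMIC] Moves from front-right to further front-right."))
```

Upper-casing the tag is a design choice made here so that "[Static]" and "[DYNAMIC]" prompts can be compared uniformly; whether the underlying models treat the tag case-sensitively is not stated on this page.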