Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on direct combinations of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability.
To enable open-vocabulary InstructTTS, we constructed the OV-Speech dataset. Built upon the 476.8-hour ContextSpeech corpus, OV-Speech is produced through a five-stage pipeline:

1. Contextual Information Extraction: extract each utterance's surrounding context from the source novels.
2. Open-Vocabulary Instruction Generation: use an LLM to mimic a director's instructions.
3. Consistency Filtering: ensure alignment between instructions and audio.
4. Reasoning Process Annotation: bridge high-level instructions with low-level acoustic features.
5. Paralinguistic-Aware Transcription Annotation: enrich transcriptions with paralinguistic tags.
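To make the pipeline concrete, the sketch below lays out the five stages as a single Python function. This is a minimal sketch under stated assumptions: every callable it accepts (the context extractor, the LLM-backed instruction writer, the consistency scorer, the reasoning and paralinguistic annotators) and the 0.8 filtering threshold are hypothetical placeholders, not the actual OV-Speech implementation.

```python
# Structural sketch of the five-stage OV-Speech construction pipeline.
# All callables and the threshold below are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class OVSpeechSample:
    audio_path: str                # speech clip from the ContextSpeech corpus
    transcription: str             # original transcription
    context: str = ""              # stage 1 output
    instruction: str = ""          # stage 2 output
    reasoning: str = ""            # stage 4 output
    rich_transcription: str = ""   # stage 5 output


def build_ov_speech(
    raw: Iterable[OVSpeechSample],
    extract_context: Callable[[OVSpeechSample], str],   # stage 1
    write_instruction: Callable[[str], str],            # stage 2 (LLM)
    consistency_score: Callable[[str, str], float],     # stage 3 scorer
    annotate_reasoning: Callable[[str, str], str],      # stage 4 (LLM)
    tag_paralinguistics: Callable[[str, str], str],     # stage 5 (LLM)
    threshold: float = 0.8,                             # assumed cutoff
) -> List[OVSpeechSample]:
    kept = []
    for s in raw:
        # 1. Contextual Information Extraction from the source novel.
        s.context = extract_context(s)
        # 2. Open-Vocabulary Instruction Generation: an LLM mimics a
        #    director describing scene, persona, and delivery.
        s.instruction = write_instruction(s.context)
        # 3. Consistency Filtering: drop pairs whose instruction does not
        #    match the audio's actual speaking style.
        if consistency_score(s.instruction, s.audio_path) < threshold:
            continue
        # 4. Reasoning Process Annotation: link the high-level instruction
        #    to low-level acoustic features (e.g., emotion, pitch, tempo).
        s.reasoning = annotate_reasoning(s.instruction, s.audio_path)
        # 5. Paralinguistic-Aware Transcription Annotation: enrich the text
        #    with paralinguistic tags.
        s.rich_transcription = tag_paralinguistics(s.transcription, s.audio_path)
        kept.append(s)
    return kept
```

Passing the stages in as callables keeps the sketch self-contained; in practice each stage would be backed by an LLM prompt or an audio-text consistency model.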
Our proposed framework, OV-InstructTTS-TEP, first generates a textual reasoning process from the open-vocabulary instruction, then produces an interleaved sequence of enriched text and audio tokens for synthesis. This reasoning-driven approach is key to interpreting complex instructions and generating expressive speech.
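The two-stage inference described above can be summarized in the hedged sketch below. The `lm` and `codec` objects, their `complete`, `complete_tokens`, and `decode` methods, and control tokens such as `<reason>` and `<speech>` are assumptions for illustration only, not the actual OV-InstructTTS-TEP interfaces.

```python
# Minimal sketch of reasoning-driven synthesis (assumed interfaces, not the
# actual OV-InstructTTS-TEP implementation).
def synthesize(instruction: str, transcription: str, lm, codec):
    # Stage 1: generate a textual reasoning process that maps the
    # open-vocabulary instruction to emotional, acoustic, and
    # paralinguistic attributes.
    reasoning = lm.complete(
        f"<instruction>{instruction}</instruction><reason>",
        stop="</reason>",
    )
    # Stage 2: conditioned on instruction, reasoning, and transcription,
    # autoregressively emit an interleaved stream of enriched text tokens
    # and discrete audio tokens.
    stream = lm.complete_tokens(
        f"<instruction>{instruction}</instruction>"
        f"<reason>{reasoning}</reason>"
        f"<text>{transcription}</text><speech>",
        stop="</speech>",
    )
    # Keep only the audio tokens (assumed to be flagged by the tokenizer)
    # and decode them into a waveform with the codec decoder.
    audio_tokens = [tok.id for tok in stream if tok.kind == "audio"]
    return codec.decode(audio_tokens)
```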
The samples below compare the Ground Truth recording with speech synthesized by CosyVoice2, GPT4o, Higgs-Audio-V2, Step-Audio-2-mini, and OV-InstructTTS-TEP (Ours). Each sample lists its open-vocabulary instruction and transcription, with English translations in parentheses.

Sample 1
Instruction: "在法则震颤的神庙前,一个高傲冷漠的强大存在,用俯视众生的姿态宣告:" (Before a temple where the very laws tremble, a proud, aloof, and powerful being proclaims, looking down upon all living things:)
Transcription: "不好意思,本大王从来不记无关人等,还请做一下自我介绍。" (Sorry, but this king never remembers irrelevant people. Please introduce yourself.)

Sample 2
Instruction: "在硝烟弥漫的冰面战场,面对铁骑溃败与伤亡过半的现实,刚愎自用的将领咬牙怒吼着命令决死冲锋:" (On a smoke-shrouded battlefield of ice, facing the rout of his cavalry and casualties past half his force, a stubborn, self-willed general grits his teeth and roars the order for a do-or-die charge:)
Transcription: "萧尚书,如今我等切不可瞻前顾后。唯一的生机就是决死一搏,将这些逆贼踏平!" (Minister Xiao, we cannot afford to hesitate now. Our only chance of survival is to stake everything on one strike and trample these rebels flat!)

Sample 3
Instruction: "在悬浮光球与倒挂和尚的压迫空间中,面对周楚豪的质问和乾武国代表的挑衅,一个自负炼丹师说:" (In an oppressive space of floating orbs of light and monks hanging upside down, facing Zhou Chuhao's questioning and the Qianwu Kingdom delegate's provocation, a conceited alchemist says:)
Transcription: "周楚豪,你不要欺人太甚!" (Zhou Chuhao, do not push me too far!)

Sample 4
Instruction: "在血腥与佛光交织的神域战场废墟中,一个秉持佛门正统、态度强硬的僧侣代表说:" (Amid the ruins of a divine-realm battlefield where blood and Buddhist light intertwine, a hard-line monk delegate who upholds Buddhist orthodoxy says:)
Transcription: "苏大王,你所谓的尸体在哪里?" (King Su, where is this corpse you speak of?)

Sample 5
Instruction: "昏暗烛光下冷酷权谋家封野站在染血衣袖前,用爱欲包裹利刃的语气说:" (Under dim candlelight, the cold schemer Feng Ye stands before a blood-stained sleeve and says, in a tone that wraps a blade in tenderness:)
Transcription: "我是你的夫君。当我二人为敌时,你该帮谁?" (I am your husband. When the two of us become enemies, whose side should you take?)

Sample 6
Instruction: "在昏暗的卧室里,药草气息弥漫,压抑的沉默中,一位刚直清正、深谋远虑的父亲说:" (In a dim bedroom filled with the scent of medicinal herbs, amid an oppressive silence, an upright and far-sighted father says:)
Transcription: "那小殿下也非池中之物,希望他不是记仇的人吧。" (That young prince is no ordinary man either; I only hope he is not one to hold grudges.)

Sample 7
Instruction: "在弥漫着灵力波动的大王山武馆门前,面对'地级下品'标识时,一个自负跋扈的天级武馆长老不屑地冷笑:" (At the gate of the Dawang Mountain martial hall, where ripples of spiritual power fill the air, facing the "low-grade Earth rank" plaque, an arrogant, domineering elder of a Heaven-rank hall sneers in disdain:)
Transcription: "我是来踢馆的!你是不是对踢馆有什么误解?" (I am here to challenge this hall! Do you have some misunderstanding about what that means?)

Sample 8
Instruction: "在弥漫着丹香与白色丹云的炼丹房内,一个对炼丹术极端敬仰的狂信徒激动地喊道:" (Inside an alchemy chamber filled with the fragrance of pills and white pill-clouds, a fanatic who utterly reveres the art of alchemy shouts excitedly:)
Transcription: "大王!如果我没有猜错,您一定就是万中无一的炼丹天才!" (My king! If I am not mistaken, you must be a one-in-ten-thousand alchemy genius!)

Sample 9
Instruction: "在烛火摇曳的中军帐内,帷帐低垂营造压迫感,一个习惯用理性克制情感的谋士,带着被现实所伤的隐忍说:" (Inside the central command tent, candlelight flickering and low-hanging curtains pressing in, a strategist accustomed to reining in emotion with reason says, with the restraint of one wounded by reality:)
Transcription: "你我之间,谈不上恨。" (Between you and me, there is nothing that could be called hatred.)

Sample 10
Instruction: "在昏暗压抑的冥王领地内,面对挑衅的苏宇和哭泣的莫妮卡,一个极度自负且言语带嘲讽的人试图掩饰动摇,用威胁语气说:" (In the dark, oppressive domain of the Underworld King, facing the provocative Su Yu and the weeping Monica, an extremely arrogant and mocking speaker tries to hide his wavering and says in a threatening tone:)
Transcription: "难怪敢这么猖狂,原来有所依仗!不过单凭肉身能翻起什么浪来?" (No wonder you dare to be so brazen; so you have something to rely on! But what waves can a mere physical body stir up?)