OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech

Abstract

Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on direct combinations of audio-related labels or diverse rephrasings of them, which makes it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability.

Dataset and Framework

To enable open-vocabulary InstructTTS, we constructed the OV-Speech dataset on top of the 476.8-hour ContextSpeech corpus using a five-stage pipeline: (1) Contextual Information Extraction from novels; (2) Open-Vocabulary Instruction Generation, in which an LLM mimics a director's instructions; (3) Consistency Filtering to ensure that instructions and audio are aligned; (4) Reasoning Process Annotation to bridge high-level instructions and low-level acoustic features; and (5) Paralinguistic-Aware Transcription Annotation to enrich the transcription with paralinguistic tags.
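As a rough illustration of how the five stages could be chained, the following minimal Python sketch passes each utterance through stage callables in order and drops samples that fail the consistency filter. Everything here is an assumption made for illustration: the `Sample` fields, the `build_dataset` function, the score threshold, and the `[shout]` tag format are hypothetical and are not taken from the OV-Speech release.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One utterance moving through the pipeline (field names are hypothetical)."""
    audio_id: str
    transcription: str
    context: str = ""        # filled by stage 1
    instruction: str = ""    # filled by stage 2
    reasoning: str = ""      # filled by stage 4
    enriched_text: str = ""  # filled by stage 5


Stage = Callable[[Sample], str]


def build_dataset(
    samples: Iterable[Sample],
    extract_context: Stage,
    generate_instruction: Stage,
    consistency_score: Callable[[Sample], float],
    annotate_reasoning: Stage,
    annotate_paralinguistics: Stage,
    threshold: float = 0.5,  # assumed cut-off, not a value from the source
) -> List[Sample]:
    """Run the five stages in order and keep only consistent samples."""
    kept: List[Sample] = []
    for s in samples:
        s.context = extract_context(s)                 # 1. context from the novel
        s.instruction = generate_instruction(s)        # 2. LLM-written instruction
        if consistency_score(s) < threshold:           # 3. consistency filtering
            continue
        s.reasoning = annotate_reasoning(s)            # 4. reasoning annotation
        s.enriched_text = annotate_paralinguistics(s)  # 5. paralinguistic tags
        kept.append(s)
    return kept


def demo() -> List[Sample]:
    """Toy run with stub stages standing in for the real models."""
    samples = [Sample("a", "Don't push me too far!"), Sample("b", "Hello.")]
    return build_dataset(
        samples,
        extract_context=lambda s: f"scene for {s.audio_id}",
        generate_instruction=lambda s: f"In the {s.context}, a speaker says:",
        consistency_score=lambda s: 0.9 if s.audio_id == "a" else 0.1,
        annotate_reasoning=lambda s: "angry -> louder, faster",
        annotate_paralinguistics=lambda s: "[shout] " + s.transcription,
    )
```

Passing the stages in as callables keeps the sketch independent of any particular LLM or scoring model; in the toy run, sample "b" is discarded at stage 3 because its stub score falls below the threshold.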

Our proposed framework, OV-InstructTTS-TEP, first generates a textual reasoning process from the open-vocabulary instruction, then produces an interleaved sequence of enriched text and audio tokens for synthesis. This reasoning-driven approach is key to interpreting complex instructions and generating expressive speech.
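To make the idea of an interleaved output sequence concrete, here is a minimal Python sketch that merges a stream of text token ids with a stream of audio token ids in alternating chunks, and then splits them apart again. The chunk sizes, the `("text", id)` / `("audio", id)` tagging, and the function names are assumptions for illustration only; the source states that the sequence is interleaved but does not specify the layout.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, int]  # ("text" or "audio", token id); tagging scheme is assumed


def interleave(
    text_ids: Sequence[int],
    audio_ids: Sequence[int],
    text_chunk: int = 2,   # assumed number of text tokens per chunk
    audio_chunk: int = 3,  # assumed number of audio tokens per chunk
) -> List[Token]:
    """Alternate chunks of text and audio tokens until both streams are used up."""
    if text_chunk <= 0 or audio_chunk <= 0:
        raise ValueError("chunk sizes must be positive")
    out: List[Token] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_ids):
        out.extend(("text", tok) for tok in text_ids[t:t + text_chunk])
        t += text_chunk
        out.extend(("audio", tok) for tok in audio_ids[a:a + audio_chunk])
        a += audio_chunk
    return out


def split_streams(seq: Sequence[Token]) -> Tuple[List[int], List[int]]:
    """Recover the separate text and audio streams from an interleaved sequence."""
    text = [tok for kind, tok in seq if kind == "text"]
    audio = [tok for kind, tok in seq if kind == "audio"]
    return text, audio


mixed = interleave([1, 2, 3], [10, 11, 12, 13])
text_part, audio_part = split_streams(mixed)
```

In a full system, the audio ids recovered by `split_streams` would be passed to an audio decoder, while the text ids carry the enriched transcription; both of those components are outside the scope of this sketch.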

Figure: Framework architecture of OV-InstructTTS-TEP.

Demo

Systems compared: Ground Truth, CosyVoice2, GPT4o, Higgs-Audio-V2, Step-Audio-2-mini, OV-InstructTTS-TEP (Ours)
Sample 1

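Instruction: "在法则震颤的神庙前,一个高傲冷漠的强大存在,用俯视众生的姿态宣告:" (English: "Before a temple where the very laws of the world tremble, a proud, aloof, and powerful being proclaims, looking down on all living things:")

Transcription: "不好意思,本大王从来不记无关人等,还请做一下自我介绍。" (English: "Sorry, but this great king never remembers irrelevant people. Kindly introduce yourself.")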

Sample 2

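Instruction: "在硝烟弥漫的冰面战场,面对铁骑溃败与伤亡过半的现实,刚愎自用的将领咬牙怒吼着命令决死冲锋:" (English: "On a smoke-filled battlefield on the ice, facing the reality of routed cavalry and more than half of the troops lost, a headstrong general grits his teeth and roars the order for a do-or-die charge:")

Transcription: "萧尚书,如今我等切不可瞻前顾后。唯一的生机就是决死一搏,将这些逆贼踏平!" (English: "Minister Xiao, we must not hesitate now. Our only chance of survival is to fight to the death and trample these traitors flat!")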

Sample 3

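Instruction: "在悬浮光球与倒挂和尚的压迫空间中,面对周楚豪的质问和乾武国代表的挑衅,一个自负炼丹师说:" (English: "In an oppressive space filled with floating orbs of light and monks hanging upside down, facing Zhou Chuhao's interrogation and the provocation of the Qianwu Kingdom's representative, a conceited alchemist says:")

Transcription: "周楚豪,你不要欺人太甚!" (English: "Zhou Chuhao, don't push me too far!")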

Sample 4

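Instruction: "在血腥与佛光交织的神域战场废墟中,一个秉持佛门正统、态度强硬的僧侣代表说:" (English: "Amid the ruins of a divine-realm battlefield where bloodshed and Buddha's light intertwine, a hard-line monk representative who upholds Buddhist orthodoxy says:")

Transcription: "苏大王,你所谓的尸体在哪里?" (English: "Great King Su, where is this corpse you speak of?")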

Sample 5

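Instruction: "昏暗烛光下冷酷权谋家封野站在染血衣袖前,用爱欲包裹利刃的语气说:" (English: "In dim candlelight, the cold-blooded schemer Feng Ye stands before a bloodstained sleeve and speaks in a tone that wraps a blade in desire:")

Transcription: "我是你的夫君。 当我二人为敌时,你该帮谁?" (English: "I am your husband. When the two of us are enemies, whose side should you take?")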

Sample 6

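Instruction: "在昏暗的卧室里,药草气息弥漫,压抑的沉默中,一位刚直清正、深谋远虑的父亲说:" (English: "In a dim bedroom filled with the scent of medicinal herbs, amid an oppressive silence, an upright, principled, and far-sighted father says:")

Transcription: "那小殿下也非池中之物,希望他不是记仇的人吧。" (English: "That young prince is no ordinary man either; let us hope he is not one to hold a grudge.")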

Sample 7

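Instruction: "在弥漫着灵力波动的大王山武馆门前,面对'地级下品'标识时,一个自负跋扈的天级武馆长老不屑地冷笑:" (English: "At the gate of the Dawang Mountain martial arts hall, where waves of spiritual energy ripple through the air, facing its 'Earth rank, lower grade' sign, an arrogant and domineering elder from a Heaven-rank martial arts hall sneers with disdain:")

Transcription: "我是来踢馆的!你是不是对踢馆有什么误解?" (English: "I'm here to challenge your hall! Do you have some misunderstanding about what that means?")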

Sample 8

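Instruction: "在弥漫着丹香与白色丹云的炼丹房内,一个对炼丹术极端敬仰的狂信徒激动地喊道:" (English: "In an alchemy room filled with the fragrance of elixirs and white pill clouds, a zealot with extreme reverence for alchemy shouts excitedly:")

Transcription: "大王!如果我没有猜错,您一定就是万中无一的炼丹天才!" (English: "Great King! If I'm not mistaken, you must be a one-in-ten-thousand alchemy genius!")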

Sample 9

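Instruction: "在烛火摇曳的中军帐内,帷帐低垂营造压迫感,一个习惯用理性克制情感的谋士,带着被现实所伤的隐忍说:" (English: "Inside the command tent, where candle flames flicker and low-hanging drapes create a sense of oppression, a strategist accustomed to restraining emotion with reason says, with the forbearance of one wounded by reality:")

Transcription: "你我之间,谈不上恨。" (English: "Between you and me, there is nothing that could be called hatred.")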

Sample 10

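Instruction: "在昏暗压抑的冥王领地内,面对挑衅的苏宇和哭泣的莫妮卡,一个极度自负且言语带嘲讽的人试图掩饰动摇,用威胁语气说:" (English: "In the dim, oppressive domain of the Underworld King, facing the defiant Su Yu and the weeping Monica, an extremely conceited, sarcastic man tries to hide that he is shaken and says in a threatening tone:")

Transcription: "难怪敢这么猖狂,原来有所依仗!不过单凭肉身能翻起什么浪来?" (English: "No wonder you dare to be so brazen; you have something to rely on! But what trouble can you stir up with your physical body alone?")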