OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech


Abstract

Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on direct combinations of audio-related labels, or on rephrasings of those labels, which makes it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects the high-level instruction to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from an open-vocabulary instruction before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next generation of user-friendly InstructTTS systems with stronger generalization and real-world applicability.

Dataset and Framework

To enable open-vocabulary InstructTTS, we constructed the OV-Speech dataset. It is built upon the 476.8-hour ContextSpeech corpus through a five-stage pipeline:

1. Contextual Information Extraction from novels.
2. Open-Vocabulary Instruction Generation, using an LLM to mimic a director's instructions.
3. Consistency Filtering, to ensure alignment between instructions and audio.
4. Reasoning Process Annotation, to bridge high-level instructions with low-level acoustic features.
5. Paralinguistic-Aware Transcription Annotation, to enrich the transcription with paralinguistic tags.
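As a rough illustration of how such a five-stage pipeline could be wired together, the sketch below chains five stage functions and lets any stage drop a sample by returning `None`. It is not the authors' implementation. The `Sample` fields, the function names, and the `TOY_SCORES` table are hypothetical stand-ins, and every stage body is a placeholder for the real extraction, LLM, filtering, and annotation steps.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    """One utterance moving through the pipeline (all field names are hypothetical)."""
    audio_id: str
    transcription: str
    context: str = ""
    instruction: str = ""
    reasoning: str = ""
    enriched_transcription: str = ""


# A stage returns the updated sample, or None to drop it from the dataset.
Stage = Callable[[Sample], Optional[Sample]]

# Hypothetical instruction/audio consistency scores, used only by the toy filter.
TOY_SCORES = {"utt1": 0.9, "utt2": 0.2}


def extract_context(s: Sample) -> Optional[Sample]:
    # Stage 1 placeholder: a real system would pull context from the source novel.
    s.context = f"<context of {s.audio_id}>"
    return s


def generate_instruction(s: Sample) -> Optional[Sample]:
    # Stage 2 placeholder: a real system would prompt an LLM here.
    s.instruction = f"<instruction from {s.context}>"
    return s


def consistency_filter(s: Sample) -> Optional[Sample]:
    # Stage 3 placeholder: keep the sample only if instruction and audio agree.
    # The 0.5 threshold is an assumption for this toy example.
    return s if TOY_SCORES.get(s.audio_id, 0.0) >= 0.5 else None


def annotate_reasoning(s: Sample) -> Optional[Sample]:
    # Stage 4 placeholder: link the instruction to acoustic features.
    s.reasoning = f"<reasoning for {s.instruction}>"
    return s


def annotate_paralinguistics(s: Sample) -> Optional[Sample]:
    # Stage 5 placeholder: a real annotator would insert paralinguistic tags.
    s.enriched_transcription = s.transcription
    return s


def run_pipeline(samples: Iterable[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply the stages in order and keep samples that survive all of them."""
    kept = []
    for sample in samples:
        current: Optional[Sample] = sample
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            kept.append(current)
    return kept


STAGES = [extract_context, generate_instruction, consistency_filter,
          annotate_reasoning, annotate_paralinguistics]
kept = run_pipeline([Sample("utt1", "hello"), Sample("utt2", "world")], STAGES)
print([s.audio_id for s in kept])  # ['utt1']
```

Placing the filter before the two annotation stages means that samples rejected for inconsistency never incur annotation cost, which matches the stage order listed above.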

Our proposed framework, OV-InstructTTS-TEP, first generates a textual reasoning process from the open-vocabulary instruction, then produces an interleaved sequence of enriched text and audio tokens for synthesis. This reasoning-driven approach is key to interpreting complex instructions and generating expressive speech.
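To make the output layout concrete, the sketch below builds a toy target sequence: reasoning tokens first, followed by text and audio tokens interleaved in fixed-size chunks. This page does not state the chunk sizes or the token vocabulary, so the `n_text=2` and `n_audio=3` defaults, the modality labels, and the function names are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, int]  # (modality label, token id); labels are hypothetical


def interleave(text_ids: List[int], audio_ids: List[int],
               n_text: int = 2, n_audio: int = 3) -> List[Token]:
    """Alternate chunks of text and audio tokens until both lists are used up."""
    out: List[Token] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_ids):
        out.extend(("text", i) for i in text_ids[t:t + n_text])
        t += n_text
        out.extend(("audio", i) for i in audio_ids[a:a + n_audio])
        a += n_audio
    return out


def build_target(reasoning_ids: List[int], text_ids: List[int],
                 audio_ids: List[int]) -> List[Token]:
    """Reasoning tokens come first, then the interleaved text/audio stream."""
    prefix: List[Token] = [("reasoning", i) for i in reasoning_ids]
    return prefix + interleave(text_ids, audio_ids)


def split_modalities(seq: List[Token]) -> Dict[str, List[int]]:
    """Recover the per-modality streams, e.g. to pass audio tokens to a decoder."""
    out: Dict[str, List[int]] = {"reasoning": [], "text": [], "audio": []}
    for modality, tok in seq:
        out[modality].append(tok)
    return out


seq = build_target([100, 101], [1, 2, 3], [10, 11, 12, 13, 14, 15, 16])
parts = split_modalities(seq)
print(seq[:7])
print(parts["audio"])
```

Because each stream keeps its internal order, splitting the sequence by modality recovers the original reasoning, text, and audio token lists exactly.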

Figure: Framework architecture of OV-InstructTTS-TEP.

Demo

Systems compared on each sample: Ground Truth, CosyVoice2, GPT4o, Higgs-Audio-V2, Step-Audio-2-mini, OV-InstructTTS-TEP (Ours)
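Sample 1 Instruction: "Before a temple where the very laws of the world tremble, a proud, aloof, and powerful being declares, looking down on all living things:" Transcription: "My apologies, but this king never bothers to remember irrelevant people. Kindly introduce yourself."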

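Sample 2 Instruction: "On a smoke-shrouded battlefield of ice, facing the rout of his armored cavalry and the loss of more than half his troops, a headstrong general grits his teeth and roars the order for a do-or-die charge:" Transcription: "Minister Xiao, we must not hesitate now. Our only chance of survival is to stake everything on one desperate fight and trample these traitors flat!"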

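Sample 3 Instruction: "In an oppressive space of floating orbs of light and monks hanging upside down, facing Zhou Chuhao's interrogation and the provocation of the Qianwu Kingdom's representative, a conceited alchemist says:" Transcription: "Zhou Chuhao, do not push me too far!"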

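Sample 4 Instruction: "Amid the ruins of a divine-realm battlefield where bloodshed mingles with Buddha's radiance, a hard-line monk representing Buddhist orthodoxy says:" Transcription: "King Su, where is this corpse you speak of?"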

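Sample 5 Instruction: "In dim candlelight, the cold-blooded schemer Feng Ye stands before a bloodstained sleeve and speaks in a tone that wraps a blade in desire:" Transcription: "I am your husband. When the two of us are enemies, whose side should you take?"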

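Sample 6 Instruction: "In a dim bedroom filled with the scent of medicinal herbs, amid an oppressive silence, an upright, principled, and far-sighted father says:" Transcription: "That young highness is destined for greater things as well; let us hope he is not one to hold a grudge."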

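Sample 7 Instruction: "At the gate of the Dawang Mountain Martial Hall, where the air ripples with spiritual energy, facing its 'Earth-rank, low-grade' sign, an arrogant and domineering elder of a Heaven-rank martial hall sneers with contempt:" Transcription: "I am here to challenge your hall! Do you have some misunderstanding about what that means?"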

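Sample 8 Instruction: "In an alchemy chamber filled with the fragrance of elixirs and white pill clouds, a zealot with boundless reverence for alchemy cries out in excitement:" Transcription: "Your Majesty! If I am not mistaken, you must be a one-in-ten-thousand alchemy genius!"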

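Sample 9 Instruction: "Inside the command tent, where candle flames flicker and low-hanging drapes create an oppressive air, a strategist accustomed to restraining emotion with reason says, with the forbearance of one wounded by reality:" Transcription: "Between you and me, there is nothing that could be called hatred."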

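Sample 10 Instruction: "In the dim, oppressive domain of the Underworld King, facing the defiant Su Yu and the weeping Monica, an extremely conceited man with a mocking tongue tries to hide that he is shaken and says in a threatening tone:" Transcription: "No wonder you dare to be so brazen; you have something to rely on! But what waves can you stir up with a physical body alone?"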