Spring AI (8): Speech Transcription and TTS

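With images out of the way, it's time for audio. There are two APIs here:

Transcription API: converts speech to text, i.e., generates subtitles from audio. It uses the whisper model.
Text-To-Speech (TTS) API: generates speech from text.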

Transcription

Straight to the code; the steps are the same as in the earlier parts.

private final OpenAiAudioTranscriptionModel openAiAudioTranscriptionModel;

    /**
     * Speech transcription
     * @param file the audio file
     * @return the transcribed text
     */
    @PostMapping(value = "/transcriptions")
    public String transcriptions(@RequestPart("file") MultipartFile file) {

        var transcriptionOptions = OpenAiAudioTranscriptionOptions.builder()
                .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.TEXT)
                .withTemperature(0f)
                .build();

        AudioTranscriptionPrompt transcriptionRequest = new AudioTranscriptionPrompt(file.getResource(), transcriptionOptions);
        AudioTranscriptionResponse response = openAiAudioTranscriptionModel.call(transcriptionRequest);
        return response.getResult().getOutput();
    }

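The most important parameter is ResponseFormat, which sets the output format: text, json, srt, and so on. srt is the widely used subtitle file format. See the official documentation for the remaining options.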

The one thing to watch out for: the audio file has to be passed in as a Resource.

If you go through a relay (proxy) API, check that it supports this model before testing.
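To make the srt response format a bit more concrete, here is a minimal Java sketch that parses SRT text (the kind of string a transcription with the srt format returns) into cues. `SrtSketch` and `Cue` are hypothetical names for illustration, not part of Spring AI, and the parser assumes well-formed input.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Spring AI class): parses SRT subtitle text into cues. */
final class SrtSketch {

    /** One subtitle cue: sequence number, start/end offsets, and the spoken text. */
    record Cue(int index, Duration start, Duration end, String text) {}

    /** Parses an SRT timestamp of the form "HH:MM:SS,mmm" into a Duration. */
    static Duration parseTimestamp(String ts) {
        String[] parts = ts.trim().split("[:,]");
        return Duration.ofHours(Long.parseLong(parts[0]))
                .plusMinutes(Long.parseLong(parts[1]))
                .plusSeconds(Long.parseLong(parts[2]))
                .plusMillis(Long.parseLong(parts[3]));
    }

    /** Splits the SRT content on blank lines and reads each block as one cue. */
    static List<Cue> parse(String srt) {
        List<Cue> cues = new ArrayList<>();
        String normalized = srt.replace("\r\n", "\n").strip();
        for (String block : normalized.split("\n{2,}")) {
            // Line 1: index, line 2: "start --> end", rest: the text (may span lines).
            String[] lines = block.strip().split("\n", 3);
            if (lines.length < 3) {
                continue; // skip malformed blocks
            }
            String[] range = lines[1].split("-->");
            cues.add(new Cue(Integer.parseInt(lines[0].trim()),
                    parseTimestamp(range[0]),
                    parseTimestamp(range[1]),
                    lines[2].strip()));
        }
        return cues;
    }
}
```

Calling `SrtSketch.parse(...)` on the controller's return value would give a list of timed cues that can be stored or re-rendered; for plain text output none of this is needed.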

TTS

TTS can return its result in two ways: as a regular response or as a stream. I'll skip the regular one and only cover streaming. The code:

private final OpenAiAudioSpeechModel openAiAudioSpeechModel;

    /**
     * TTS as a real-time stream
     * @param message the text to synthesize
     * @return SseEmitter
     */
    @GetMapping(value = "/tts", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public SseEmitter tts(@RequestParam String message) {
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
                .withVoice(OpenAiAudioApi.SpeechRequest.Voice.ALLOY)
                .withSpeed(1.0f)
                .withResponseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
                .withModel(OpenAiAudioApi.TtsModel.TTS_1_HD.value)
                .build();

        SpeechPrompt speechPrompt = new SpeechPrompt(message, speechOptions);

        String uuid = UUID.randomUUID().toString();
        SseEmitter emitter = SseEmitterUtils.connect(uuid);

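        Flux<SpeechResponse> responseStream = openAiAudioSpeechModel.stream(speechPrompt);
        responseStream.subscribe(
                response -> {
                    // Each element carries one chunk of encoded audio bytes.
                    byte[] output = response.getResult().getOutput();
                    // SSE is text-based, so the binary chunk is sent as Base64.
                    String base64Audio = Base64.getEncoder().encodeToString(output);
                    SseEmitterUtils.sendMessage(uuid, base64Audio);
                },
                emitter::completeWithError, // report upstream failures to the client
                emitter::complete);         // close the SSE stream once all audio is sent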
        return emitter;
    }

The parameters are explained below:

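Parameter	Description
Voice	The narrator's voice
Speed	Speed of the synthesized speech. The OpenAI API documents a range of 0.25 to 4.0, with 1.0 as the default
ResponseFormat	Output audio format. Supported formats are mp3, opus, aac, flac, wav, and pcm.
Model	The model, either TTS_1 or TTS_1_HD; HD produces better quality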

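Note that, at the time of writing, only the first four output formats are actually available in Spring AI; the last two (wav and pcm) are missing.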

This matters quite a bit: PCM can be decoded directly in the browser, which suits streaming, whereas MP3 has to go through a decoder first.
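To illustrate why raw PCM is so convenient for streaming, here is a minimal sketch of the decoding step, written in Java to match the server code (in the browser the same arithmetic would run in JavaScript). It assumes 16-bit signed little-endian mono samples, which is how OpenAI describes its pcm output; `PcmSketch` and its methods are hypothetical names, and the sample rate is a parameter because it is an assumption too.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: decodes raw PCM chunks without any codec. */
final class PcmSketch {

    /** Converts 16-bit signed little-endian PCM bytes to samples in [-1.0, 1.0). */
    static float[] decodePcm16Le(byte[] chunk) {
        ByteBuffer buffer = ByteBuffer.wrap(chunk).order(ByteOrder.LITTLE_ENDIAN);
        float[] samples = new float[chunk.length / 2]; // a trailing odd byte is ignored
        for (int i = 0; i < samples.length; i++) {
            samples[i] = buffer.getShort() / 32768f;
        }
        return samples;
    }

    /** Playback time in milliseconds of mono 16-bit PCM at the given sample rate. */
    static long durationMillis(int byteCount, int sampleRateHz) {
        return (byteCount / 2) * 1000L / sampleRateHz;
    }
}
```

Each chunk decodes on its own, with no header or frame boundaries to track, which is what makes the format easy to play as it arrives. An MP3 chunk, by contrast, has to be fed through a decoder that understands frame boundaries.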

As you can see, the TTS result is a byte[] array. It is converted to Base64 before being returned and is then sent to the front end over SSE.
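The byte[] to Base64 to SSE path described above can be sketched without Spring as follows. This is only an illustration of the wire format under stated assumptions: `SseAudioFrameSketch` and its methods are hypothetical, and the real framing is done by `SseEmitter`, not by hand.

```java
import java.util.Base64;

/** Hypothetical helper: shows how one audio chunk travels as an SSE event. */
final class SseAudioFrameSketch {

    /** Encodes one binary audio chunk as a single SSE "data:" event. */
    static String toSseEvent(byte[] audioChunk) {
        // The basic encoder emits no line breaks, so the payload fits on one data line.
        String base64 = Base64.getEncoder().encodeToString(audioChunk);
        return "data:" + base64 + "\n\n"; // a blank line terminates the event
    }

    /** Reverses toSseEvent: strips the framing and decodes the Base64 payload. */
    static byte[] fromSseEvent(String event) {
        String payload = event.strip().substring("data:".length()).strip();
        return Base64.getDecoder().decode(payload);
    }

    /** Length of the Base64 text for n raw bytes (padding included). */
    static int encodedLength(int rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }
}
```

The cost of this design is size: every 3 bytes of audio become 4 characters of text, so the stream is roughly a third larger than the raw audio. That is the price of using a text-only transport such as SSE.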

The front end still has to decode the audio, so I simply had Claude write a test page:

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>Real-Time Streaming MP3 TTS Player</title>
</head>
<body>
<h1>Real-Time Streaming MP3 TTS Player</h1>
<input type="text" id="textInput" placeholder="Enter the text to convert">
<button onclick="startStreaming()">Start playback</button>
<audio id="audioPlayer" controls></audio>

<script>
    let mediaSource;
    let sourceBuffer;
    let audioQueue = [];
    let isPlaying = false;

    function startStreaming() {
        const text = document.getElementById('textInput').value;
        const encodedText = encodeURIComponent(text);
        const eventSource = new EventSource(`http://127.0.0.1:8868/audio/tts?message=${encodedText}`);

        const audio = document.getElementById('audioPlayer');
        mediaSource = new MediaSource();
        audio.src = URL.createObjectURL(mediaSource);

        mediaSource.addEventListener('sourceopen', function() {
            sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
            sourceBuffer.addEventListener('updateend', playNextChunk);
            // Flush any chunks that arrived before the source buffer existed.
            playNextChunk();
        });

        audio.play();

        eventSource.onopen = function(event) {
            console.log('Connection opened');
        };

        eventSource.onmessage = function(event) {
            const audioChunk = base64ToArrayBuffer(event.data);
            audioQueue.push(audioChunk);
            if (!isPlaying) {
                playNextChunk();
            }
        };

        eventSource.onerror = function(error) {
            console.error('Error:', error);
            if (eventSource.readyState === EventSource.CLOSED) {
                console.log('Connection closed');
            }
            eventSource.close();
        };
    }

    function base64ToArrayBuffer(base64) {
        const binaryString = window.atob(base64);
        const len = binaryString.length;
        const bytes = new Uint8Array(len);
        for (let i = 0; i < len; i++) {
            bytes[i] = binaryString.charCodeAt(i);
        }
        return bytes.buffer;
    }

    function playNextChunk() {
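        // sourceBuffer is undefined until 'sourceopen' fires, so guard against early chunks.
        if (audioQueue.length > 0 && sourceBuffer && !sourceBuffer.updating) {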
            isPlaying = true;
            const chunk = audioQueue.shift();
            sourceBuffer.appendBuffer(chunk);
        } else {
            isPlaying = false;
        }
    }
</script>
</body>
</html>