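This article introduces speech-to-text services.

Fast speech to text

To turn speech into text quickly, or to tell multiple speakers apart, you can use fast transcription.

It is an API endpoint: upload an audio file and the service transcribes the speech into text directly.

Take aboutSpeechSdk.wav, the official sample audio provided by Microsoft, as an example. The clip is 52 s long.

With the language specified (no automatic language identification), transcription takes 3 s to 4 s; without a specified language (automatic language identification required), it takes 4 s to 5 s.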

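The endpoint supports:

  1. Recognition with specified locales, such as en-US, zh-CN, es-ES, and so on. For mixed-language audio, for example Chinese and English spoken together, you can pass several locales at once.
  2. Automatic language identification: if no locale is passed, automatic identification is triggered. Processing is slightly slower in this mode.
  3. Separating multiple speakers, which suits meetings and multi-person conversations.
  4. For two-channel audio, transcribing only one of the channels. By default both channels are transcribed.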
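To make the four options above concrete, here is a minimal Go sketch that builds the JSON options object for a fast transcription request. The field names (locales, diarization with enabled and maxSpeakers, channels) are assumptions recalled from the fast transcription REST documentation rather than something stated in this article, and buildDefinition is a hypothetical helper; verify both against the API version you use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// diarization mirrors the assumed "diarization" object (speaker separation).
type diarization struct {
	Enabled     bool `json:"enabled"`
	MaxSpeakers int  `json:"maxSpeakers"`
}

// definition mirrors the assumed options object sent with the audio file.
type definition struct {
	Locales     []string     `json:"locales,omitempty"`
	Diarization *diarization `json:"diarization,omitempty"`
	Channels    []int        `json:"channels,omitempty"`
}

// buildDefinition (hypothetical helper) returns the options as a JSON string.
// Empty locales leaves the language to automatic identification,
// maxSpeakers <= 0 leaves speaker separation off,
// and nil channels keeps the default of transcribing every channel.
func buildDefinition(locales []string, maxSpeakers int, channels []int) (string, error) {
	def := definition{Locales: locales, Channels: channels}
	if maxSpeakers > 0 {
		def.Diarization = &diarization{Enabled: true, MaxSpeakers: maxSpeakers}
	}
	b, err := json.Marshal(def)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Mixed Chinese/English audio, two speakers, first channel only.
	mixed, err := buildDefinition([]string{"zh-CN", "en-US"}, 2, []int{0})
	if err != nil {
		panic(err)
	}
	fmt.Println(mixed)

	// No locale: rely on automatic language identification.
	auto, err := buildDefinition(nil, 0, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(auto)
}
```

As far as I know, this string is sent as one form field next to the audio file in a multipart request; the exact form field names are not covered by this sketch.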

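If you have a lot of audio to transcribe and do not need the results in real time, you can try batch transcription.

Just put the audio files on an object storage service (OSS), then pass the access links to the API, and the service processes them automatically.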
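As a rough sketch of "pass the access links to the API", the Go snippet below builds the JSON body for creating a batch transcription job. The field names contentUrls, locale, and displayName are assumptions based on the batch transcription REST documentation, buildBatchBody is a hypothetical helper, and the example.com URLs are placeholders; none of these come from this article.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// batchRequest mirrors the assumed JSON body of a "create transcription" call.
type batchRequest struct {
	ContentURLs []string `json:"contentUrls"` // publicly readable (or signed) links to the audio files
	Locale      string   `json:"locale"`      // language of the audio
	DisplayName string   `json:"displayName"` // free-form job name
}

// buildBatchBody (hypothetical helper) serializes the job description.
func buildBatchBody(urls []string, locale, name string) (string, error) {
	b, err := json.Marshal(batchRequest{ContentURLs: urls, Locale: locale, DisplayName: name})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	body, err := buildBatchBody(
		[]string{"https://example.com/a.wav", "https://example.com/b.wav"},
		"en-US",
		"demo batch",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

The job then runs asynchronously, so a real client would poll for completion and download the result files afterwards.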

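Pronunciation assessment / speech to text

Speech to text / pronunciation assessment for short audio (up to 60 s)

If you need a brief pronunciation assessment for a short audio clip in addition to the transcript, you can use the rest-speech-to-text-short endpoint.

The SDK uses WebSocket under the hood when it runs pronunciation assessment, whereas this endpoint uses a plain HTTP request.

The endpoint supports chunked transfer. Taking the 52 s official Microsoft sample aboutSpeechSdk.wav as an example, the result comes back in about 10 s without chunked transfer and in about 8 s with it.

Below is sample request code in Go, using Microsoft's official "what's the weather like" audio as an example.

The first version does not use chunked transfer: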

package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

type PronunciationAssessmentParams struct {
	ReferenceText string `json:"ReferenceText"`           // Required. Reference text for the audio. Pass an empty string if you do not want to specify one.
	GradingSystem string `json:"GradingSystem,omitempty"` // Optional. Score scale: FivePoint is a floating-point score from 0 to 5, HundredMark from 0 to 100. Default: FivePoint.
	Granularity   string `json:"Granularity,omitempty"`   // Optional. Evaluation granularity. Phoneme is the finest and reports full-text, word, and phoneme scores; Word reports full-text and word scores; FullText is the coarsest and reports only the full-text score. Default: Phoneme.
	Dimension     string `json:"Dimension,omitempty"`     // Optional. Output scope. Basic reports only the accuracy score; Comprehensive also reports full-text fluency and completeness scores plus word-level error types. Default: Basic.
	EnableMiscue  string `json:"EnableMiscue,omitempty"`  // Optional. Miscue detection: the spoken words are compared with the reference text to find errors. True enables it; the default, False, disables it.
	ScenarioId    string `json:"ScenarioId,omitempty"`    // Optional. GUID of a custom scoring system.
}

// getPronunciationAssessmentParams serializes the parameters to JSON and
// Base64-encodes them for use in the Pronunciation-Assessment header.
func getPronunciationAssessmentParams(params *PronunciationAssessmentParams) string {
	pronAssessmentParamsBytes, err := json.Marshal(params)
	if err != nil {
		log.Panicf("Error marshaling JSON: %s", err.Error())
	}
	return base64.StdEncoding.EncodeToString(pronAssessmentParamsBytes)
}

func main() {
	// Set the subscription key and service region.
	subscriptionKey := "<subscription key>"
	region := "<service region>"

	// Read the audio file.
	audioData, err := os.ReadFile("<path to WAV audio file>")
	if err != nil {
		log.Panicf("read wav file error: %s", err.Error())
	}

	// Build the request URL.
	u := url.URL{}
	u.Scheme = "https"
	u.Host = fmt.Sprintf("%s.stt.speech.microsoft.com", region)
	u.Path = "/speech/recognition/conversation/cognitiveservices/v1"

	// Assemble the query parameters.
	queryValues := url.Values{}
	queryValues.Add("language", "en-US")   // Required. Recognition language.
	queryValues.Add("format", "detailed")  // Optional. Result format. simple returns only RecognitionStatus, DisplayText, Offset, and Duration; detailed additionally returns four different representations of the display text.
	queryValues.Add("profanity", "masked") // Optional. How profanity is handled. masked replaces it with asterisks, removed deletes it, raw leaves it untouched. Default: masked.
	u.RawQuery = queryValues.Encode()

	// Create the HTTP request.
	req, err := http.NewRequest("POST", u.String(), bytes.NewReader(audioData))
	if err != nil {
		log.Panicf("Error creating request: %s", err.Error())
	}

	// Set the request headers.
	//    Resource key for the Speech service. If you are worried about leaking the key, pass a temporary access token in the Authorization header instead. How to get a token: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-to-text-short#how-to-get-an-access-token
	req.Header.Set("Ocp-Apim-Subscription-Key", subscriptionKey)
	//    Optional. Enables pronunciation assessment scores. Parameter details: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/rest-speech-to-text-short#pronunciation-assessment-parameters
	req.Header.Set("Pronunciation-Assessment", getPronunciationAssessmentParams(&PronunciationAssessmentParams{
		ReferenceText: "", // no reference text
		GradingSystem: "HundredMark",
		Granularity:   "Phoneme",
		Dimension:     "Comprehensive",
		EnableMiscue:  "True",
	}))
	//    Required. Format and codec of the uploaded audio.
	req.Header.Set("Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000")
	//    Ask for a JSON response.
	req.Header.Set("Accept", "application/json")

	// Send the request.
	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		log.Panicf("Error sending request: %s", err.Error())
	}
	defer resp.Body.Close()

	// Check the response status code.
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		log.Panicf("Response status %s, body %s", resp.Status, string(body))
	}

	// Read the response body.
	responseData, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Panicf("Error reading response: %s", err.Error())
	}

	// Print the response body.
	log.Println(string(responseData))
}

The second version uses chunked transfer:

package main

import (
	"bufio"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"os"
)

type PronunciationAssessmentParams struct {
	ReferenceText string `json:"ReferenceText"`           // Required. Reference text for the audio.
	GradingSystem string `json:"GradingSystem,omitempty"` // Optional. Score scale: FivePoint is a floating-point score from 0 to 5, HundredMark from 0 to 100. Default: FivePoint.
	Granularity   string `json:"Granularity,omitempty"`   // Optional. Evaluation granularity. Phoneme is the finest and reports full-text, word, and phoneme scores; Word reports full-text and word scores; FullText is the coarsest and reports only the full-text score. Default: Phoneme.
	Dimension     string `json:"Dimension,omitempty"`     // Optional. Output scope. Basic reports only the accuracy score; Comprehensive also reports full-text fluency and completeness scores plus word-level error types. Default: Basic.
	EnableMiscue  string `json:"EnableMiscue,omitempty"`  // Optional. Miscue detection: the spoken words are compared with the reference text to find errors. True enables it; the default, False, disables it.
	ScenarioId    string `json:"ScenarioId,omitempty"`    // Optional. GUID of a custom scoring system.
}

// getPronunciationAssessmentParams serializes the parameters to JSON and
// Base64-encodes them for use in the Pronunciation-Assessment header.
func getPronunciationAssessmentParams(params *PronunciationAssessmentParams) string {
	pronAssessmentParamsBytes, err := json.Marshal(params)
	if err != nil {
		log.Panicf("Error marshaling JSON: %s", err.Error())
	}
	return base64.StdEncoding.EncodeToString(pronAssessmentParamsBytes)
}

func main() {
	// Set the subscription key and service region.
	subscriptionKey := "<subscription key>"
	region := "<service region>"

	// Open the audio file.
	audioFile, err := os.Open("<path to WAV audio file>")
	if err != nil {
		log.Panicf("cannot open audio file: %s", err.Error())
	}
	defer audioFile.Close()

	// Build the request URL.
	u := url.URL{
		Scheme: "https",
		Host:   fmt.Sprintf("%s.stt.speech.microsoft.com", region),
		Path:   "/speech/recognition/conversation/cognitiveservices/v1",
	}

	// Assemble the query parameters.
	queryValues := url.Values{}
	queryValues.Add("language", "en-US")   // Required. Recognition language.
	queryValues.Add("format", "detailed")  // Optional. Result format: simple or detailed.
	queryValues.Add("profanity", "masked") // Optional. How profanity is handled. masked replaces it with asterisks.
	u.RawQuery = queryValues.Encode()

	// Create a pipe; the request body reads from pr while a goroutine writes to pw.
	pr, pw := io.Pipe()

	// Create the HTTP request.
	req, err := http.NewRequest("POST", u.String(), pr)
	if err != nil {
		log.Panicf("cannot create request: %s", err.Error())
	}

	// Set the request headers.
	req.Header.Set("Ocp-Apim-Subscription-Key", subscriptionKey)
	req.Header.Set("Pronunciation-Assessment", getPronunciationAssessmentParams(&PronunciationAssessmentParams{
		ReferenceText: "",
		GradingSystem: "HundredMark",
		Granularity:   "Phoneme",
		Dimension:     "Comprehensive",
		EnableMiscue:  "True",
	}))
	req.Header.Set("Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000")
	req.Header.Set("Accept", "application/json")
	// net/http ignores a Transfer-Encoding value placed in Header. A pipe body has
	// unknown length and is sent chunked anyway; the field states the intent explicitly.
	req.TransferEncoding = []string{"chunked"}
	req.Header.Set("Expect", "100-continue")

	// Create the HTTP client.
	client := &http.Client{}

	// Write the audio data asynchronously in a goroutine.
	go func() {
		defer pw.Close()
		bufferedWriter := bufio.NewWriter(pw)
		if _, err := io.Copy(bufferedWriter, audioFile); err != nil {
			log.Printf("failed to write audio data: %s", err.Error())
			return
		}
		if err := bufferedWriter.Flush(); err != nil {
			log.Printf("failed to flush audio data: %s", err.Error())
		}
	}()

	// Send the request.
	resp, err := client.Do(req)
	if err != nil {
		log.Panicf("request failed: %s", err.Error())
	}
	defer resp.Body.Close()

	// Check the response status code.
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		log.Panicf("request failed, status code: %d, body: %s", resp.StatusCode, string(body))
	}

	// Read the response body.
	responseData, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Panicf("failed to read response: %s", err.Error())
	}

	// Print the response body.
	log.Println(string(responseData))
}

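Below is the JSON returned by the request (the // comments are explanatory annotations, not part of the actual response).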

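{
  "RecognitionStatus" : "Success", // Recognition status of the request. Success means recognition succeeded.
  "Offset" : 500000, // Time at which the speech starts in the audio stream, in units of 100 nanoseconds. Divide by 10000 to convert to milliseconds.
  "Duration" : 13700000, // Duration of the speech in the audio stream, in units of 100 nanoseconds. Divide by 10000 to convert to milliseconds.
  "DisplayText" : "What's the weather like?",
  "NBest" : [ {
    "Confidence" : 0.94718826, // Confidence score of the result, from 0 (unreliable) to 1 (very reliable). As a rule of thumb, a score below 0.8 usually signals a problematic result that should not be trusted.
    "Lexical" : "what's the weather like", // The recognized words in order, without punctuation.
    "ITN" : "what's the weather like", // Inverse-text-normalized (ITN) form of the recognized text, with numbers, abbreviations, and similar transformations applied; suitable for further processing.
    "MaskedITN" : "what's the weather like", // The ITN text with profanity masking applied.
    "Display" : "What's the weather like?", // Display form of the result, including punctuation; suitable for showing directly.
    "AccuracyScore" : 100.0, // Pronunciation accuracy score.
    "FluencyScore" : 100.0, // Fluency score.
    "CompletenessScore" : 100.0, // Completeness score.
    "PronScore" : 100.0, // Overall pronunciation score, a weighted aggregate of AccuracyScore, FluencyScore, and CompletenessScore.
    "Words" : [ { // List of words the result was split into.
      "Word" : "what's", // The word. Note: a "word" from Microsoft's segmentation may actually be a phrase.
      "Offset" : 500000,
      "Duration" : 3900000,
      "Confidence" : 0.0,
      "AccuracyScore" : 100.0, // Pronunciation accuracy score.
      "ErrorType" : "None", // Error type relative to the reference text: None means no error for this word, Omission means the word was left out, Insertion means an extra word was inserted, Mispronunciation means it was mispronounced.
      "Syllables" : [ { // Syllable details.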
        "Syllable" : "wahts",
        "Grapheme" : "what's",
        "Offset" : 500000,
        "Duration" : 3900000,
        "AccuracyScore" : 100.0
      } ],
      "Phonemes" : [ { // Phoneme details.
        "Phoneme" : "w",
        "Offset" : 500000,
        "Duration" : 1700000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "ah",
        "Offset" : 2300000,
        "Duration" : 700000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "t",
        "Offset" : 3100000,
        "Duration" : 700000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "s",
        "Offset" : 3900000,
        "Duration" : 500000,
        "AccuracyScore" : 100.0
      } ]
    }, {
      "Word" : "the",
      "Offset" : 4500000,
      "Duration" : 1300000,
      "Confidence" : 0.0,
      "AccuracyScore" : 100.0,
      "ErrorType" : "None",
      "Syllables" : [ {
        "Syllable" : "dhah",
        "Grapheme" : "the",
        "Offset" : 4500000,
        "Duration" : 1300000,
        "AccuracyScore" : 100.0
      } ],
      "Phonemes" : [ {
        "Phoneme" : "dh",
        "Offset" : 4500000,
        "Duration" : 300000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "ah",
        "Offset" : 4900000,
        "Duration" : 900000,
        "AccuracyScore" : 100.0
      } ]
    }, {
      "Word" : "weather",
      "Offset" : 5900000,
      "Duration" : 2900000,
      "Confidence" : 0.0,
      "AccuracyScore" : 100.0,
      "ErrorType" : "None",
      "Syllables" : [ {
        "Syllable" : "weh",
        "Grapheme" : "weath",
        "Offset" : 5900000,
        "Duration" : 1100000,
        "AccuracyScore" : 100.0
      }, {
        "Syllable" : "dhaxr",
        "Grapheme" : "er",
        "Offset" : 7100000,
        "Duration" : 1700000,
        "AccuracyScore" : 100.0
      } ],
      "Phonemes" : [ {
        "Phoneme" : "w",
        "Offset" : 5900000,
        "Duration" : 300000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "eh",
        "Offset" : 6300000,
        "Duration" : 700000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "dh",
        "Offset" : 7100000,
        "Duration" : 500000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "ax",
        "Offset" : 7700000,
        "Duration" : 500000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "r",
        "Offset" : 8300000,
        "Duration" : 500000,
        "AccuracyScore" : 100.0
      } ]
    }, {
      "Word" : "like",
      "Offset" : 8900000,
      "Duration" : 5300000,
      "Confidence" : 0.0,
      "AccuracyScore" : 100.0,
      "ErrorType" : "None",
      "Syllables" : [ {
        "Syllable" : "layk",
        "Grapheme" : "like",
        "Offset" : 8900000,
        "Duration" : 5300000,
        "AccuracyScore" : 100.0
      } ],
      "Phonemes" : [ {
        "Phoneme" : "l",
        "Offset" : 8900000,
        "Duration" : 900000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "ay",
        "Offset" : 9900000,
        "Duration" : 1100000,
        "AccuracyScore" : 100.0
      }, {
        "Phoneme" : "k",
        "Offset" : 11100000,
        "Duration" : 3100000,
        "AccuracyScore" : 100.0
      } ]
    } ]
  } ]
}
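The annotations above mention two things a client usually has to do: convert Offset and Duration from 100-nanosecond units, and distrust results with low Confidence. The Go sketch below shows one way to do both. The struct covers only a subset of the fields, the helper names ticksToDuration and bestCandidate are hypothetical, taking the first NBest entry as the best one is an assumption, and the 0.8 threshold is the rule of thumb from the annotation rather than an official limit.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ticksToDuration converts a count of 100-nanosecond units to a time.Duration.
func ticksToDuration(ticks int64) time.Duration {
	return time.Duration(ticks) * 100 * time.Nanosecond
}

type word struct {
	Word          string  `json:"Word"`
	Offset        int64   `json:"Offset"`
	Duration      int64   `json:"Duration"`
	AccuracyScore float64 `json:"AccuracyScore"`
	ErrorType     string  `json:"ErrorType"`
}

type candidate struct {
	Confidence float64 `json:"Confidence"`
	Display    string  `json:"Display"`
	PronScore  float64 `json:"PronScore"`
	Words      []word  `json:"Words"`
}

type response struct {
	RecognitionStatus string      `json:"RecognitionStatus"`
	NBest             []candidate `json:"NBest"`
}

// minConfidence is the rule-of-thumb threshold from the annotation above.
const minConfidence = 0.8

// bestCandidate parses the response body and returns the first NBest entry,
// or an error if recognition failed or the confidence is too low.
func bestCandidate(body []byte) (*candidate, error) {
	var r response
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, err
	}
	if r.RecognitionStatus != "Success" {
		return nil, fmt.Errorf("recognition status %q", r.RecognitionStatus)
	}
	if len(r.NBest) == 0 {
		return nil, fmt.Errorf("response has no NBest entries")
	}
	c := &r.NBest[0]
	if c.Confidence < minConfidence {
		return nil, fmt.Errorf("confidence %.2f is below %.2f", c.Confidence, minConfidence)
	}
	return c, nil
}

func main() {
	// A trimmed copy of the response shown above, without the annotations.
	body := []byte(`{"RecognitionStatus":"Success","Offset":500000,"Duration":13700000,
		"NBest":[{"Confidence":0.94718826,"Display":"What's the weather like?","PronScore":100.0,
		"Words":[{"Word":"what's","Offset":500000,"Duration":3900000,"AccuracyScore":100.0,"ErrorType":"None"}]}]}`)

	c, err := bestCandidate(body)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(c.Display)
	for _, w := range c.Words {
		fmt.Printf("%s starts at %v and lasts %v\n", w.Word, ticksToDuration(w.Offset), ticksToDuration(w.Duration))
	}
}
```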

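This endpoint covers roughly 80% of the data returned by the pronunciation assessment SDK.

Compared with pronunciation assessment through the SDK, it lacks some of the more subjective prosody and content assessments, such as:

  1. ProsodyScore: the prosody score of the whole utterance.
  2. TopicScore: the level of understanding of and engagement with the topic, which shows how effectively the speaker expresses thoughts and ideas and engages with the topic.
  3. Vocabulary and expression, such as VocabularyScore (how effectively the speaker uses words in a given context, how appropriate the words are, and how complex the vocabulary is) and GrammarScore (grammar correctness and variety of sentence patterns; grammatical errors are evaluated jointly from lexical accuracy, grammatical accuracy, and diversity of sentence structures).
  4. Judgments about linking and pausing between words, such as UnexpectedBreak (the word should have been linked with the previous one, but there was an unexpected pause before it) and MissingBreak (a pause was expected before the word but was missing; common at boundaries between sense groups).
  5. Word stress, that is, stress-related judgments.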

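SDK speech to text / pronunciation assessment

Judging how well someone speaks English, however, is not only a matter of pronunciation accuracy, fluency, and completeness; it also involves more subjective, prosody-related assessments. Given a topic, for example: how well the speech fits the topic, the range of vocabulary, the command of grammar, and whether the delivery sounds natural (linking, pausing, and stress).

The Go Speech SDK does not provide pronunciation assessment, so to save development and maintenance effort, Python, Node.js, or Java is the recommended choice.

Below is sample code written in Node.js + TypeScript: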

import fs from "fs";
import * as sdk from "microsoft-cognitiveservices-speech-sdk";

async function start(): Promise<void> {
    // Set the subscription key and service region.
    const subscriptionKey = "<subscription key>";
    const region = "<service region>";

    // Read the file and create the audioConfig. Note that only WAV files can be used.
    const audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("<path to WAV audio file>"));

    // Create the speechConfig.
    const speechConfig = sdk.SpeechConfig.fromSubscription(subscriptionKey, region);
    //    Locale of the speech; defaults to en-US if not specified.
    speechConfig.speechRecognitionLanguage = "en-US";

    // Create the pronunciation assessment config; the same parameters are used in the Go code above.
    const pronunciationAssessmentConfig = new sdk.PronunciationAssessmentConfig(
        "", // ReferenceText. Reference text for the audio. If empty, prosody-related scores are available; if not empty, they are not.
        sdk.PronunciationAssessmentGradingSystem.HundredMark, // GradingSystem. Score scale: FivePoint is a floating-point score from 0 to 5, HundredMark from 0 to 100.
        sdk.PronunciationAssessmentGranularity.Phoneme, // Granularity. Phoneme is the finest and reports full-text, word, and phoneme scores; Word reports full-text and word scores; FullText is the coarsest and reports only the full-text score.
        true // EnableMiscue. Miscue detection: the spoken words are compared with the reference text to find errors. true enables it, false disables it.
    );

    //     Switches for prosody and content assessment. Note: if ReferenceText above is not an empty string, these have no effect even when enabled.
    //         enableContentAssessmentWithTopic supplies a description of the topic, which helps the assessment understand the specific topic being discussed. Available for the en-US locale only.
    //         enableProsodyAssessment turns on prosody assessment of the pronunciation.
    pronunciationAssessmentConfig.enableContentAssessmentWithTopic("greeting"); // The official sample uses greeting; you can also write free talk, or have an LLM such as ChatGPT generate a short topic description.
    pronunciationAssessmentConfig.enableProsodyAssessment = true;

    //     The next two parameters concern phonemes.
    //        phonemeAlphabet can be SAPI or IPA. IPA is recommended.
    //            About IPA: https://en.wikipedia.org/wiki/IPA
    //            About SAPI: https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ee431828(v=vs.85)#american-english-phoneme-table
    //            In short, for the word Hello, SAPI gives /h/ /eh/ /l/ /ow/ and IPA gives /h/ /ɛ/ /l/ /oʊ/.
    //        nbestPhonemeCount controls how many candidate phonemes are returned. For the phoneme /ɛ/, candidates might also include /ə/ and /æ/, and each candidate is given a score.
    //            Put simply: the expected phoneme is /ɛ/, but the phoneme actually spoken may be /æ/, so /ɛ/ might score only 32 while /æ/ scores 67.
    //            In other words, it tells you which phoneme the actual pronunciation sounds most like. Details: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-pronunciation-assessment?pivots=programming-language-python#assess-spoken-phonemes
    //            A value of 3 to 5 is enough for nbestPhonemeCount; set it to 0 if you do not need candidates.
    pronunciationAssessmentConfig.phonemeAlphabet = "IPA";
    pronunciationAssessmentConfig.nbestPhonemeCount = 5;

    // Create the recognizer from speechConfig and audioConfig, then apply the assessment config to it.
    const speechRecognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig);
    pronunciationAssessmentConfig.applyTo(speechRecognizer);

    // Fired when the session starts.
    speechRecognizer.sessionStarted = (_, event) => {
        console.log(`speechRecognizer sessionStarted SESSION ID: ${event.sessionId}`);
    };

    // Fired when the start of speech is detected.
    speechRecognizer.speechStartDetected = (_, event) => {
        console.log(`speechRecognizer speechStartDetected event: ${JSON.stringify(event)}`);
    };

    // Streaming results; these can be used to implement streaming speech to text.
    //     recognizing returns intermediate text while processing is under way.
    speechRecognizer.recognizing = (_, event: sdk.SpeechRecognitionEventArgs) => {
        console.log(`speechRecognizer recognizing text: ${event.result.text}`);
    };

    // Streaming results; these can be used to implement streaming speech to text.
    //     recognized returns the final text once the utterance is complete.
    speechRecognizer.recognized = (_, event: sdk.SpeechRecognitionEventArgs) => {
        console.log(`speechRecognizer recognized text: ${event.result.text}`);
    };

    let successfulResult: sdk.SpeechRecognitionResult;
    // Wait for the recognition result.
    try {
        successfulResult = await new Promise<sdk.SpeechRecognitionResult>((resolve, reject) => {
            speechRecognizer.recognizeOnceAsync(
                (result) => {
                    console.log(`speechRecognizer recognizeOnceAsync`);
                    resolve(result);
                },
                (error) => {
                    reject(error);
                }
            );
        });
    } catch (err) {
        console.error(`speechRecognizer recognizeOnceAsync error: ${JSON.stringify(err)}`);
        // Close the recognizer on the error path too, so the connection is not left open.
        speechRecognizer.close();
        throw err;
    }

    // Parse the recognition result.
    const assessmentResult = sdk.PronunciationAssessmentResult.fromResult(successfulResult);

    // The extra step here is needed because the top level of assessmentResult.detailResult has no Duration or Offset fields.
    const result = JSON.parse(JSON.stringify(assessmentResult.detailResult));
    result["Duration"] = successfulResult.duration;
    result["Offset"] = successfulResult.offset;

    console.log(`result: ${JSON.stringify(result, null, 2)}`);

    // Finally, close the WebSocket connection.
    speechRecognizer.close();
}

start().catch((err) => {
    console.error("pronunciation assessment failed:", err);
    process.exitCode = 1;
});
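The nbestPhonemeCount comments in the code above describe how candidate phonemes reveal which sound was actually produced. As a rough illustration of how a client might use that list, here is a small Go sketch (Go is used for all added examples in this article). The names phonemeScore and likelySpoken are hypothetical and not part of any SDK; the scores 32 and 67 are the illustrative numbers from the comment.

```go
package main

import "fmt"

// phonemeScore is one candidate phoneme with its score.
type phonemeScore struct {
	Phoneme string
	Score   float64
}

// likelySpoken returns the highest-scoring candidate and reports whether it
// differs from the expected phoneme. With no candidates it reports no mismatch.
func likelySpoken(expected string, candidates []phonemeScore) (phonemeScore, bool) {
	if len(candidates) == 0 {
		return phonemeScore{}, false
	}
	best := candidates[0]
	for _, c := range candidates[1:] {
		if c.Score > best.Score {
			best = c
		}
	}
	return best, best.Phoneme != expected
}

func main() {
	// Expected /ɛ/, but /æ/ scores higher, so the speaker most likely said /æ/.
	best, mismatch := likelySpoken("ɛ", []phonemeScore{
		{Phoneme: "ɛ", Score: 32},
		{Phoneme: "æ", Score: 67},
	})
	fmt.Println(best.Phoneme, best.Score, mismatch)
}
```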

Taking Microsoft's official "talk for a few seconds" audio as an example, the request prints the following output:

speechRecognizer sessionStarted SESSION ID: 2NU6BZ9SB6Q36T1DGK4X7FZ2B5JLG3VP
speechRecognizer speechStartDetected event: {"privSessionId":"2NU6BZ9SB6Q36T1DGK4X7FZ2B5JLG3VP","privOffset":7800000}
speechRecognizer recognizing text: i'll talk
speechRecognizer recognizing text: i'll talk for a
speechRecognizer recognizing text: i'll talk for a few
speechRecognizer recognizing text: i'll talk for a few seconds
speechRecognizer recognizing text: i'll talk for a few seconds so
speechRecognizer recognizing text: i'll talk for a few seconds so you can
speechRecognizer recognizing text: i'll talk for a few seconds so you can recognize
speechRecognizer recognizing text: i'll talk for a few seconds so you can recognize my voice
speechRecognizer recognizing text: i'll talk for a few seconds so you can recognize my voice in the
speechRecognizer recognizing text: i'll talk for a few seconds so you can recognize my voice in the future
speechRecognizer recognized text: I'll talk for a few seconds so you can recognize my voice in the future.
speechRecognizer recognizeOnceAsync
# The full result is shown further below; [object] is used as a placeholder here.
result: [object]

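The output shows that recognizing streams the transcribed text, but each event carries the cumulative text so far rather than only the newly added part.

recognized, by contrast, outputs the complete text, with the whole text polished.

In practice, using recognizing + recognized for streaming text output is not really recommended, because, as the log shows, text returned earlier may be adjusted later, for example i'll becoming I'll.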
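If you do want to display the text as it streams, the client has to cope with both behaviors: cumulative hypotheses and later revisions. The Go sketch below shows one way to do that, under the assumption that events arrive as plain strings; delta is a hypothetical helper, not an SDK function. It returns only the appended text when the new hypothesis extends the previous one, and signals a full redraw otherwise.

```go
package main

import (
	"fmt"
	"strings"
)

// delta compares the previous cumulative text with the current one.
// If cur merely extends prev, it returns the appended part and true.
// Otherwise earlier text was revised; it returns cur and false so the
// caller can replace everything shown so far.
func delta(prev, cur string) (string, bool) {
	if strings.HasPrefix(cur, prev) {
		return cur[len(prev):], true
	}
	return cur, false
}

func main() {
	// Three recognizing events and the final recognized event from the log above.
	events := []string{
		"i'll talk",
		"i'll talk for a",
		"i'll talk for a few",
		"I'll talk for a few seconds so you can recognize my voice in the future.",
	}
	shown := ""
	for _, e := range events {
		text, appended := delta(shown, e)
		if appended {
			fmt.Printf("append: %q\n", text)
		} else {
			fmt.Printf("redraw: %q\n", text)
		}
		shown = e
	}
}
```

The final recognized event triggers a redraw here because its capitalization differs from the earlier hypotheses, which is exactly the adjustment described above.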

The returned result:

{
  "Confidence": 0.9434535,
  "Lexical": "i'll talk for a few seconds so you can recognize my voice in the future",
  "ITN": "i'll talk for a few seconds so you can recognize my voice in the future",
  "MaskedITN": "i'll talk for a few seconds so you can recognize my voice in the future",
  "Display": "I'll talk for a few seconds so you can recognize my voice in the future.",
  "PronunciationAssessment": {
    "AccuracyScore": 100,
    "FluencyScore": 100,
    "ProsodyScore": 89.5,
    "CompletenessScore": 100,
    "PronScore": 93.7
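  },
  "ContentAssessment": { // Content assessment scores. These are present even when Granularity is set to FullText.
    "GrammarScore": 0, // Grammar correctness and variety of sentence patterns.
    "VocabularyScore": 0, // How effectively the speaker uses words in the given context, how appropriate the words are, and how complex the vocabulary is.
    "TopicScore": 0 // How well the speech fits the topic.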
  },
  "Words": [
    {
      "Word": "i'll",
      "Offset": 7800000,
      "Duration": 3000000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "aɪl",
          "Grapheme": "i'll",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 7800000,
          "Duration": 3000000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "aɪ",
          "PronunciationAssessment": { // These are the candidate phonemes mentioned in the code above.
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "aɪ",
                "Score": 100
              },
              {
                "Phoneme": "h",
                "Score": 46
              },
              {
                "Phoneme": "æ",
                "Score": 28
              },
              {
                "Phoneme": "ɔ",
                "Score": 25
              },
              {
                "Phoneme": "oʊ",
                "Score": 24
              }
            ]
          },
          "Offset": 7800000,
          "Duration": 2000000
        },
        {
          "Phoneme": "l",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "l",
                "Score": 100
              },
              {
                "Phoneme": "t",
                "Score": 54
              },
              {
                "Phoneme": "aɪ",
                "Score": 4
              },
              {
                "Phoneme": "d",
                "Score": 3
              },
              {
                "Phoneme": "aʊ",
                "Score": 3
              }
            ]
          },
          "Offset": 9900000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "talk",
      "Offset": 10900000,
      "Duration": 3700000,
      "PronunciationAssessment": { // Word-level assessment, including the prosody feedback.
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "tɔk",
          "Grapheme": "talk",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 10900000,
          "Duration": 3700000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "t",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [ 
              {
                "Phoneme": "t",
                "Score": 100
              },
              {
                "Phoneme": "ɔ",
                "Score": 19
              },
              {
                "Phoneme": "d",
                "Score": 7
              },
              {
                "Phoneme": "k",
                "Score": 3
              },
              {
                "Phoneme": "l",
                "Score": 2
              }
            ]
          },
          "Offset": 10900000,
          "Duration": 700000
        },
        {
          "Phoneme": "ɔ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ɔ",
                "Score": 100
              },
              {
                "Phoneme": "k",
                "Score": 27
              },
              {
                "Phoneme": "ɑ",
                "Score": 26
              },
              {
                "Phoneme": "t",
                "Score": 9
              },
              {
                "Phoneme": "ʌ",
                "Score": 2
              }
            ]
          },
          "Offset": 11700000,
          "Duration": 1500000
        },
        {
          "Phoneme": "k",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "k",
                "Score": 100
              },
              {
                "Phoneme": "ɔ",
                "Score": 12
              },
              {
                "Phoneme": "f",
                "Score": 8
              },
              {
                "Phoneme": "t",
                "Score": 4
              },
              {
                "Phoneme": "ɡ",
                "Score": 2
              }
            ]
          },
          "Offset": 13300000,
          "Duration": 1300000
        }
      ]
    },
    {
      "Word": "for",
      "Offset": 14700000,
      "Duration": 2500000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "fər",
          "Grapheme": "for",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 14700000,
          "Duration": 2500000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "f",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "f",
                "Score": 100
              },
              {
                "Phoneme": "k",
                "Score": 41
              },
              {
                "Phoneme": "s",
                "Score": 3
              },
              {
                "Phoneme": "ə",
                "Score": 1
              },
              {
                "Phoneme": "t",
                "Score": 1
              }
            ]
          },
          "Offset": 14700000,
          "Duration": 700000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "r",
                "Score": 49
              },
              {
                "Phoneme": "f",
                "Score": 5
              },
              {
                "Phoneme": "ɝ",
                "Score": 2
              },
              {
                "Phoneme": "ɔ",
                "Score": 0
              }
            ]
          },
          "Offset": 15500000,
          "Duration": 500000
        },
        {
          "Phoneme": "r",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "r",
                "Score": 100
              },
              {
                "Phoneme": "ə",
                "Score": 57
              },
              {
                "Phoneme": "w",
                "Score": 2
              },
              {
                "Phoneme": "ɔ",
                "Score": 2
              },
              {
                "Phoneme": "u",
                "Score": 2
              }
            ]
          },
          "Offset": 16100000,
          "Duration": 1100000
        }
      ]
    },
    {
      "Word": "a",
      "Offset": 17300000,
      "Duration": 900000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "ə",
          "Grapheme": "a",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 17300000,
          "Duration": 900000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "ɪ",
                "Score": 3
              },
              {
                "Phoneme": "f",
                "Score": 3
              },
              {
                "Phoneme": "r",
                "Score": 1
              },
              {
                "Phoneme": "ʌ",
                "Score": 0
              }
            ]
          },
          "Offset": 17300000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "few",
      "Offset": 18300000,
      "Duration": 2500000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "fju",
          "Grapheme": "few",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 18300000,
          "Duration": 2500000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "f",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "f",
                "Score": 100
              },
              {
                "Phoneme": "ə",
                "Score": 47
              },
              {
                "Phoneme": "j",
                "Score": 32
              },
              {
                "Phoneme": "h",
                "Score": 3
              },
              {
                "Phoneme": "v",
                "Score": 0
              }
            ]
          },
          "Offset": 18300000,
          "Duration": 900000
        },
        {
          "Phoneme": "j",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "j",
                "Score": 100
              },
              {
                "Phoneme": "u",
                "Score": 22
              },
              {
                "Phoneme": "f",
                "Score": 2
              },
              {
                "Phoneme": "h",
                "Score": 1
              },
              {
                "Phoneme": "θ",
                "Score": 0
              }
            ]
          },
          "Offset": 19300000,
          "Duration": 600000
        },
        {
          "Phoneme": "u",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "u",
                "Score": 100
              },
              {
                "Phoneme": "j",
                "Score": 58
              },
              {
                "Phoneme": "s",
                "Score": 7
              },
              {
                "Phoneme": "i",
                "Score": 1
              },
              {
                "Phoneme": "r",
                "Score": 0
              }
            ]
          },
          "Offset": 20000000,
          "Duration": 800000
        }
      ]
    },
    {
      "Word": "seconds",
      "Offset": 20900000,
      "Duration": 7900000,
      "PronunciationAssessment": {
        "AccuracyScore": 98,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "sɛ",
          "Grapheme": "sec",
          "PronunciationAssessment": {
            "AccuracyScore": 98
          },
          "Offset": 20900000,
          "Duration": 2300000
        },
        {
          "Syllable": "kəndz",
          "Grapheme": "onds",
          "PronunciationAssessment": {
            "AccuracyScore": 98
          },
          "Offset": 23300000,
          "Duration": 5500000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "s",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "s",
                "Score": 100
              },
              {
                "Phoneme": "t",
                "Score": 6
              },
              {
                "Phoneme": "u",
                "Score": 4
              },
              {
                "Phoneme": "ɛ",
                "Score": 3
              },
              {
                "Phoneme": "θ",
                "Score": 1
              }
            ]
          },
          "Offset": 20900000,
          "Duration": 1100000
        },
        {
          "Phoneme": "ɛ",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "ɛ",
                "Score": 100
              },
              {
                "Phoneme": "s",
                "Score": 20
              },
              {
                "Phoneme": "k",
                "Score": 17
              },
              {
                "Phoneme": "æ",
                "Score": 3
              },
              {
                "Phoneme": "ʌ",
                "Score": 1
              }
            ]
          },
          "Offset": 22100000,
          "Duration": 1100000
        },
        {
          "Phoneme": "k",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "k",
                "Score": 100
              },
              {
                "Phoneme": "ɛ",
                "Score": 12
              },
              {
                "Phoneme": "ə",
                "Score": 3
              },
              {
                "Phoneme": "d",
                "Score": 0
              },
              {
                "Phoneme": "s",
                "Score": 0
              }
            ]
          },
          "Offset": 23300000,
          "Duration": 900000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "n",
                "Score": 2
              },
              {
                "Phoneme": "k",
                "Score": 1
              },
              {
                "Phoneme": "ɛ",
                "Score": 1
              },
              {
                "Phoneme": "ɪ",
                "Score": 0
              }
            ]
          },
          "Offset": 24300000,
          "Duration": 500000
        },
        {
          "Phoneme": "n",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "n",
                "Score": 100
              },
              {
                "Phoneme": "d",
                "Score": 84
              },
              {
                "Phoneme": "ə",
                "Score": 75
              },
              {
                "Phoneme": "k",
                "Score": 2
              },
              {
                "Phoneme": "ɪ",
                "Score": 2
              }
            ]
          },
          "Offset": 24900000,
          "Duration": 300000
        },
        {
          "Phoneme": "d",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "d",
                "Score": 100
              },
              {
                "Phoneme": "z",
                "Score": 42
              },
              {
                "Phoneme": "t",
                "Score": 26
              },
              {
                "Phoneme": "n",
                "Score": 7
              },
              {
                "Phoneme": "θ",
                "Score": 1
              }
            ]
          },
          "Offset": 25300000,
          "Duration": 900000
        },
        {
          "Phoneme": "z",
          "PronunciationAssessment": {
            "AccuracyScore": 98,
            "NBestPhonemes": [
              {
                "Phoneme": "z",
                "Score": 100
              },
              {
                "Phoneme": "s",
                "Score": 39
              },
              {
                "Phoneme": "d",
                "Score": 21
              },
              {
                "Phoneme": "t",
                "Score": 9
              },
              {
                "Phoneme": "oʊ",
                "Score": 4
              }
            ]
          },
          "Offset": 26300000,
          "Duration": 2500000
        }
      ]
    },
    {
      "Word": "so",
      "Offset": 29100000,
      "Duration": 3100000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 0.072243385
              },
              "MissingBreak": {
                "Confidence": 0.96387833
              },
              "BreakLength": 200000
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "soʊ",
          "Grapheme": "so",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 29100000,
          "Duration": 3100000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "s",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "s",
                "Score": 100
              },
              {
                "Phoneme": "t",
                "Score": 23
              },
              {
                "Phoneme": "oʊ",
                "Score": 5
              },
              {
                "Phoneme": "ɪ",
                "Score": 3
              },
              {
                "Phoneme": "d",
                "Score": 3
              }
            ]
          },
          "Offset": 29100000,
          "Duration": 1700000
        },
        {
          "Phoneme": "oʊ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "oʊ",
                "Score": 100
              },
              {
                "Phoneme": "j",
                "Score": 16
              },
              {
                "Phoneme": "l",
                "Score": 4
              },
              {
                "Phoneme": "u",
                "Score": 1
              },
              {
                "Phoneme": "i",
                "Score": 1
              }
            ]
          },
          "Offset": 30900000,
          "Duration": 1300000
        }
      ]
    },
    {
      "Word": "you",
      "Offset": 32300000,
      "Duration": 1300000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "ju",
          "Grapheme": "you",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 32300000,
          "Duration": 1300000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "j",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "j",
                "Score": 100
              },
              {
                "Phoneme": "i",
                "Score": 2
              },
              {
                "Phoneme": "ʊ",
                "Score": 1
              },
              {
                "Phoneme": "u",
                "Score": 1
              },
              {
                "Phoneme": "oʊ",
                "Score": 1
              }
            ]
          },
          "Offset": 32300000,
          "Duration": 300000
        },
        {
          "Phoneme": "u",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "u",
                "Score": 100
              },
              {
                "Phoneme": "j",
                "Score": 70
              },
              {
                "Phoneme": "ʊ",
                "Score": 52
              },
              {
                "Phoneme": "ə",
                "Score": 1
              },
              {
                "Phoneme": "i",
                "Score": 1
              }
            ]
          },
          "Offset": 32700000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "can",
      "Offset": 33700000,
      "Duration": 2300000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "kən",
          "Grapheme": "can",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 33700000,
          "Duration": 2300000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "k",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "k",
                "Score": 100
              },
              {
                "Phoneme": "u",
                "Score": 22
              },
              {
                "Phoneme": "ʊ",
                "Score": 21
              },
              {
                "Phoneme": "d",
                "Score": 5
              },
              {
                "Phoneme": "ə",
                "Score": 1
              }
            ]
          },
          "Offset": 33700000,
          "Duration": 700000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "ʊ",
                "Score": 10
              },
              {
                "Phoneme": "k",
                "Score": 8
              },
              {
                "Phoneme": "n",
                "Score": 3
              },
              {
                "Phoneme": "ɪ",
                "Score": 0
              }
            ]
          },
          "Offset": 34500000,
          "Duration": 500000
        },
        {
          "Phoneme": "n",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "n",
                "Score": 100
              },
              {
                "Phoneme": "r",
                "Score": 8
              },
              {
                "Phoneme": "ə",
                "Score": 4
              },
              {
                "Phoneme": "d",
                "Score": 1
              },
              {
                "Phoneme": "t",
                "Score": 0
              }
            ]
          },
          "Offset": 35100000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "recognize",
      "Offset": 36100000,
      "Duration": 7800000,
      "PronunciationAssessment": {
        "AccuracyScore": 98,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "rɛ",
          "Grapheme": "rec",
          "PronunciationAssessment": {
            "AccuracyScore": 92
          },
          "Offset": 36100000,
          "Duration": 1300000
        },
        {
          "Syllable": "kə",
          "Grapheme": "og",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 37500000,
          "Duration": 1500000
        },
        {
          "Syllable": "ɡnaɪz",
          "Grapheme": "nize",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 39100000,
          "Duration": 4800000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "r",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "r",
                "Score": 100
              },
              {
                "Phoneme": "n",
                "Score": 8
              },
              {
                "Phoneme": "ɛ",
                "Score": 0
              },
              {
                "Phoneme": "d",
                "Score": 0
              },
              {
                "Phoneme": "t",
                "Score": 0
              }
            ]
          },
          "Offset": 36100000,
          "Duration": 700000
        },
        {
          "Phoneme": "ɛ",
          "PronunciationAssessment": {
            "AccuracyScore": 82,
            "NBestPhonemes": [
              {
                "Phoneme": "ɛ",
                "Score": 100
              },
              {
                "Phoneme": "r",
                "Score": 26
              },
              {
                "Phoneme": "i",
                "Score": 0
              },
              {
                "Phoneme": "k",
                "Score": 0
              },
              {
                "Phoneme": "ʌ",
                "Score": 0
              }
            ]
          },
          "Offset": 36900000,
          "Duration": 500000
        },
        {
          "Phoneme": "k",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "k",
                "Score": 100
              },
              {
                "Phoneme": "ɛ",
                "Score": 47
              },
              {
                "Phoneme": "ə",
                "Score": 2
              },
              {
                "Phoneme": "i",
                "Score": 0
              },
              {
                "Phoneme": "p",
                "Score": 0
              }
            ]
          },
          "Offset": 37500000,
          "Duration": 700000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "k",
                "Score": 44
              },
              {
                "Phoneme": "ɡ",
                "Score": 2
              },
              {
                "Phoneme": "ɪ",
                "Score": 0
              },
              {
                "Phoneme": "i",
                "Score": 0
              }
            ]
          },
          "Offset": 38300000,
          "Duration": 700000
        },
        {
          "Phoneme": "ɡ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ɡ",
                "Score": 100
              },
              {
                "Phoneme": "n",
                "Score": 21
              },
              {
                "Phoneme": "ə",
                "Score": 20
              },
              {
                "Phoneme": "k",
                "Score": 1
              },
              {
                "Phoneme": "d",
                "Score": 0
              }
            ]
          },
          "Offset": 39100000,
          "Duration": 500000
        },
        {
          "Phoneme": "n",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "n",
                "Score": 100
              },
              {
                "Phoneme": "ɡ",
                "Score": 40
              },
              {
                "Phoneme": "aɪ",
                "Score": 6
              },
              {
                "Phoneme": "ə",
                "Score": 0
              },
              {
                "Phoneme": "ɪ",
                "Score": 0
              }
            ]
          },
          "Offset": 39700000,
          "Duration": 1300000
        },
        {
          "Phoneme": "aɪ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "aɪ",
                "Score": 100
              },
              {
                "Phoneme": "n",
                "Score": 27
              },
              {
                "Phoneme": "z",
                "Score": 24
              },
              {
                "Phoneme": "ɪ",
                "Score": 3
              },
              {
                "Phoneme": "ə",
                "Score": 3
              }
            ]
          },
          "Offset": 41100000,
          "Duration": 1500000
        },
        {
          "Phoneme": "z",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "z",
                "Score": 100
              },
              {
                "Phoneme": "m",
                "Score": 32
              },
              {
                "Phoneme": "d",
                "Score": 7
              },
              {
                "Phoneme": "s",
                "Score": 5
              },
              {
                "Phoneme": "ə",
                "Score": 4
              }
            ]
          },
          "Offset": 42700000,
          "Duration": 1200000
        }
      ]
    },
    {
      "Word": "my",
      "Offset": 44000000,
      "Duration": 2200000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "maɪ",
          "Grapheme": "my",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 44000000,
          "Duration": 2200000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "m",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "m",
                "Score": 100
              },
              {
                "Phoneme": "z",
                "Score": 44
              },
              {
                "Phoneme": "d",
                "Score": 6
              },
              {
                "Phoneme": "t",
                "Score": 5
              },
              {
                "Phoneme": "ə",
                "Score": 3
              }
            ]
          },
          "Offset": 44000000,
          "Duration": 1000000
        },
        {
          "Phoneme": "aɪ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "aɪ",
                "Score": 100
              },
              {
                "Phoneme": "m",
                "Score": 64
              },
              {
                "Phoneme": "v",
                "Score": 27
              },
              {
                "Phoneme": "eɪ",
                "Score": 0
              },
              {
                "Phoneme": "ɪ",
                "Score": 0
              }
            ]
          },
          "Offset": 45100000,
          "Duration": 1100000
        }
      ]
    },
    {
      "Word": "voice",
      "Offset": 46300000,
      "Duration": 4300000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "vɔɪs",
          "Grapheme": "voice",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 46300000,
          "Duration": 4300000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "v",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "v",
                "Score": 100
              },
              {
                "Phoneme": "aɪ",
                "Score": 2
              },
              {
                "Phoneme": "ə",
                "Score": 1
              },
              {
                "Phoneme": "b",
                "Score": 0
              },
              {
                "Phoneme": "r",
                "Score": 0
              }
            ]
          },
          "Offset": 46300000,
          "Duration": 1100000
        },
        {
          "Phoneme": "ɔɪ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ɔɪ",
                "Score": 100
              },
              {
                "Phoneme": "v",
                "Score": 43
              },
              {
                "Phoneme": "s",
                "Score": 4
              },
              {
                "Phoneme": "aɪ",
                "Score": 1
              },
              {
                "Phoneme": "eɪ",
                "Score": 0
              }
            ]
          },
          "Offset": 47500000,
          "Duration": 2100000
        },
        {
          "Phoneme": "s",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "s",
                "Score": 100
              },
              {
                "Phoneme": "ɪ",
                "Score": 42
              },
              {
                "Phoneme": "ɔɪ",
                "Score": 13
              },
              {
                "Phoneme": "z",
                "Score": 3
              },
              {
                "Phoneme": "ɛ",
                "Score": 1
              }
            ]
          },
          "Offset": 49700000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "in",
      "Offset": 50700000,
      "Duration": 2300000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "ɪn",
          "Grapheme": "in",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 50700000,
          "Duration": 2300000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "ɪ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ɪ",
                "Score": 100
              },
              {
                "Phoneme": "s",
                "Score": 7
              },
              {
                "Phoneme": "ə",
                "Score": 1
              },
              {
                "Phoneme": "ɛ",
                "Score": 1
              },
              {
                "Phoneme": "n",
                "Score": 0
              }
            ]
          },
          "Offset": 50700000,
          "Duration": 1300000
        },
        {
          "Phoneme": "n",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "n",
                "Score": 100
              },
              {
                "Phoneme": "ɪ",
                "Score": 61
              },
              {
                "Phoneme": "ð",
                "Score": 0
              },
              {
                "Phoneme": "ə",
                "Score": 0
              },
              {
                "Phoneme": "ɛ",
                "Score": 0
              }
            ]
          },
          "Offset": 52100000,
          "Duration": 900000
        }
      ]
    },
    {
      "Word": "the",
      "Offset": 53100000,
      "Duration": 1100000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "ðə",
          "Grapheme": "the",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 53100000,
          "Duration": 1100000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "ð",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ð",
                "Score": 100
              },
              {
                "Phoneme": "n",
                "Score": 19
              },
              {
                "Phoneme": "ə",
                "Score": 14
              },
              {
                "Phoneme": "ʌ",
                "Score": 1
              },
              {
                "Phoneme": "ɛ",
                "Score": 0
              }
            ]
          },
          "Offset": 53100000,
          "Duration": 400000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "ð",
                "Score": 72
              },
              {
                "Phoneme": "f",
                "Score": 7
              },
              {
                "Phoneme": "ʌ",
                "Score": 1
              },
              {
                "Phoneme": "r",
                "Score": 1
              }
            ]
          },
          "Offset": 53600000,
          "Duration": 600000
        }
      ]
    },
    {
      "Word": "future",
      "Offset": 54300000,
      "Duration": 6200000,
      "PronunciationAssessment": {
        "AccuracyScore": 100,
        "ErrorType": "None",
        "Feedback": {
          "Prosody": {
            "Break": {
              "ErrorTypes": [
                "None"
              ],
              "UnexpectedBreak": {
                "Confidence": 3.6121673e-8
              },
              "MissingBreak": {
                "Confidence": 1
              },
              "BreakLength": 0
            },
            "Intonation": {
              "ErrorTypes": [],
              "Monotone": {
                "SyllablePitchDeltaConfidence": 0.3209303
              }
            }
          }
        }
      },
      "Syllables": [
        {
          "Syllable": "fju",
          "Grapheme": "fu",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 54300000,
          "Duration": 2300000
        },
        {
          "Syllable": "tʃər",
          "Grapheme": "ture",
          "PronunciationAssessment": {
            "AccuracyScore": 100
          },
          "Offset": 56700000,
          "Duration": 3800000
        }
      ],
      "Phonemes": [
        {
          "Phoneme": "f",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "f",
                "Score": 100
              },
              {
                "Phoneme": "j",
                "Score": 36
              },
              {
                "Phoneme": "ə",
                "Score": 1
              },
              {
                "Phoneme": "r",
                "Score": 0
              },
              {
                "Phoneme": "v",
                "Score": 0
              }
            ]
          },
          "Offset": 54300000,
          "Duration": 700000
        },
        {
          "Phoneme": "j",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "j",
                "Score": 100
              },
              {
                "Phoneme": "u",
                "Score": 13
              },
              {
                "Phoneme": "i",
                "Score": 1
              },
              {
                "Phoneme": "f",
                "Score": 1
              },
              {
                "Phoneme": "ə",
                "Score": 0
              }
            ]
          },
          "Offset": 55100000,
          "Duration": 500000
        },
        {
          "Phoneme": "u",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "u",
                "Score": 100
              },
              {
                "Phoneme": "j",
                "Score": 15
              },
              {
                "Phoneme": "tʃ",
                "Score": 4
              },
              {
                "Phoneme": "d",
                "Score": 2
              },
              {
                "Phoneme": "i",
                "Score": 1
              }
            ]
          },
          "Offset": 55700000,
          "Duration": 900000
        },
        {
          "Phoneme": "tʃ",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "tʃ",
                "Score": 100
              },
              {
                "Phoneme": "ə",
                "Score": 59
              },
              {
                "Phoneme": "u",
                "Score": 29
              },
              {
                "Phoneme": "dʒ",
                "Score": 11
              },
              {
                "Phoneme": "ʒ",
                "Score": 1
              }
            ]
          },
          "Offset": 56700000,
          "Duration": 700000
        },
        {
          "Phoneme": "ə",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "ə",
                "Score": 100
              },
              {
                "Phoneme": "r",
                "Score": 26
              },
              {
                "Phoneme": "ɝ",
                "Score": 7
              },
              {
                "Phoneme": "ɛ",
                "Score": 1
              },
              {
                "Phoneme": "tʃ",
                "Score": 1
              }
            ]
          },
          "Offset": 57500000,
          "Duration": 900000
        },
        {
          "Phoneme": "r",
          "PronunciationAssessment": {
            "AccuracyScore": 100,
            "NBestPhonemes": [
              {
                "Phoneme": "r",
                "Score": 100
              },
              {
                "Phoneme": "ə",
                "Score": 14
              },
              {
                "Phoneme": "oʊ",
                "Score": 7
              },
              {
                "Phoneme": "n",
                "Score": 5
              },
              {
                "Phoneme": "ɝ",
                "Score": 5
              }
            ]
          },
          "Offset": 58500000,
          "Duration": 2000000
        }
      ]
    }
  ],
  "Duration": 52700000,
  "Offset": 7800000
}

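Compared with the result of the HTTP request above, this response additionally contains prosody-related assessment results.

However, a closer look shows that all three scores under ContentAssessment are 0, even though the code explicitly enables prosody assessment and does not set a reference text.

This issue is explained below.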

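Common Issues

1. ContentAssessment scores are 0

When calling the interface you may run into this: the code explicitly enables prosody assessment and sets no reference text, yet all three scores under ContentAssessment are 0.

This happens because, when the content is too short, the service cannot infer the semantics from such limited information and therefore cannot compute the VocabularyScore, GrammarScore, and TopicScore metrics.

To assess spoken content accurately, the audio should be between 15 seconds (equivalent to more than 50 words) and 10 minutes long. To obtain a TopicScore, the spoken audio should contain at least 3 sentences.

For details, see the documentation: conduct-an-unscripted-assessment

The audio tested above is only 7s long with 15 words in total, so the three ContentAssessment scores are naturally unavailable.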

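With the Pronunciation Assessment Fall sample audio provided by Microsoft, the scores are returned.

That audio is 28s long with 80 words in total, which fully meets the requirements in the official documentation.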

{
    "Confidence": 0.89063513,
    "Lexical": "the fall is a time of cozy contemplation for some a sad time for others no longer the far flung ecstasy of summer and preceding the hibernation period of winter it's a time of slowing down to reflect it's no wonder the fall is called the spring of the philosopher this season confronts us with what we've cultivated within the ancients of many traditions honored this sacred time by taking autumns rich external symbolism as a tool for inner rebirth",
    "ITN": "the fall is a time of cozy contemplation for some a sad time for others no longer the far flung ecstasy of summer and preceding the hibernation period of winter it's a time of slowing down to reflect it's no wonder the fall is called the spring of the philosopher this season confronts us with what we've cultivated within the ancients of many traditions honored this sacred time by taking autumns rich external symbolism as a tool for inner rebirth",
    "MaskedITN": "the fall is a time of cozy contemplation for some a sad time for others no longer the far-flung ecstasy of summer and preceding the hibernation period of winter it's a time of slowing down to reflect it's no wonder the fall is called the spring of the philosopher this season confronts us with what we've cultivated within the ancients of many traditions honored this sacred time by taking autumns rich external symbolism as a tool for inner rebirth",
    "Display": "The fall is a time of cozy contemplation for some, a sad time for others. No longer the far-flung ecstasy of summer and preceding the hibernation period of winter, it's a time of slowing down to reflect. It's no wonder the fall is called the Spring of the philosopher. This season confronts us with what we've cultivated within. The ancients of many traditions honored this sacred time by taking autumns rich external symbolism as a tool for inner rebirth.",
    "PronunciationAssessment": Object{...},
    "ContentAssessment": { // all three scores are non-zero
        "GrammarScore": 56,
        "VocabularyScore": 58,
        "TopicScore": 80
    },
    "Words": Array[80],
    "Duration": 281900000,
    "Offset": 1100000
}

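2. How to handle long audio

For long audio, use continuous recognition.

3. Most of the returned phonemes are empty

Another issue encountered earlier: most of the returned phonemes were empty, as in the response below.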

{
    "Confidence": 0.98069525,
    "Lexical": "hello hello hello",
    "ITN": "hello hello hello",
    "MaskedITN": "hello hello hello",
    "Display": "Hello Hello Hello。",
    "PronunciationAssessment": Object{...},
    "ContentAssessment": Object{...},
    "Words": [
        {
            "Word": "hello",
            "Offset": 2000000,
            "Duration": 3800000,
            "PronunciationAssessment": Object{...},
            "Syllables": Array[1],
            "Phonemes": [ // hello has four phonemes, yet the list below has five entries, four of them empty, and most candidate phonemes are empty as well
                {
                    "Phoneme": "", 
                    "PronunciationAssessment": {
                        "AccuracyScore": 88,
                        "NBestPhonemes": [
                            {
                                "Phoneme": "",
                                "Score": 100
                            },
                            {
                                "Phoneme": "",
                                "Score": 2
                            },
                            {
                                "Phoneme": "",
                                "Score": 2
                            },
                            {
                                "Phoneme": "",
                                "Score": 1
                            },
                            {
                                "Phoneme": "k",
                                "Score": 1
                            }
                        ]
                    },
                    "Offset": 2000000,
                    "Duration": 1100000
                },
                {
                    "Phoneme": "",
                    "PronunciationAssessment": {
                        "AccuracyScore": 88,
                        "NBestPhonemes": [
                            {
                                "Phoneme": "",
                                "Score": 100
                            },
                            {
                                "Phoneme": "l",
                                "Score": 69
                            },
                            {
                                "Phoneme": "",
                                "Score": 53
                            },
                            {
                                "Phoneme": "",
                                "Score": 2
                            },
                            {
                                "Phoneme": "",
                                "Score": 1
                            }
                        ]
                    },
                    "Offset": 3200000,
                    "Duration": 600000
                },
                {
                    "Phoneme": "l",
                    "PronunciationAssessment": {
                        "AccuracyScore": 88,
                        "NBestPhonemes": [
                            {
                                "Phoneme": "l",
                                "Score": 100
                            },
                            {
                                "Phoneme": "",
                                "Score": 29
                            },
                            {
                                "Phoneme": "",
                                "Score": 5
                            },
                            {
                                "Phoneme": "",
                                "Score": 3
                            },
                            {
                                "Phoneme": "",
                                "Score": 1
                            }
                        ]
                    },
                    "Offset": 3900000,
                    "Duration": 400000
                },
                {
                    "Phoneme": "",
                    "PronunciationAssessment": {
                        "AccuracyScore": 88,
                        "NBestPhonemes": [
                            {
                                "Phoneme": "",
                                "Score": 100
                            },
                            {
                                "Phoneme": "",
                                "Score": 61
                            },
                            {
                                "Phoneme": "",
                                "Score": 50
                            },
                            {
                                "Phoneme": "",
                                "Score": 4
                            },
                            {
                                "Phoneme": "l",
                                "Score": 1
                            }
                        ]
                    },
                    "Offset": 4400000,
                    "Duration": 700000
                },
                {
                    "Phoneme": "",
                    "PronunciationAssessment": {
                        "AccuracyScore": 88,
                        "NBestPhonemes": [
                            {
                                "Phoneme": "",
                                "Score": 100
                            },
                            {
                                "Phoneme": "",
                                "Score": 25
                            },
                            {
                                "Phoneme": "",
                                "Score": 18
                            },
                            {
                                "Phoneme": "",
                                "Score": 6
                            },
                            {
                                "Phoneme": "",
                                "Score": 2
                            }
                        ]
                    },
                    "Offset": 5200000,
                    "Duration": 600000
                }
            ]
        },
        Object{...},
        Object{...}
    ],
    "Duration": 10800000,
    "Offset": 2000000
}

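This problem arose because, while going through the sample code and documentation, I came across a way, tucked away in an obscure corner, to run pronunciation assessment with multiple languages.

That is, the following two lines create a speechRecognizer instance with multiple languages specified.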

    const autoDetectSourceLanguageConfig = sdk.AutoDetectSourceLanguageConfig.fromLanguages(["zh-CN", "en-US"]);
    const speechRecognizer = sdk.SpeechRecognizer.FromConfig(speechConfig, autoDetectSourceLanguageConfig, audioConfig);

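When Chinese and English are mixed in speech, e.g. 你好,hello, this approach does ensure that the transcription is 你好,hello rather than nihao, hello.

However, this approach is not actually officially supported and may be a previously deprecated one. Microsoft support confirmed that pronunciation assessment accepts only one language, not several.

This limitation is not specific to Microsoft: 声通 and other models do not support it either. That is understandable: with mixed-language speech there is no way to judge whether the phrasing is wrong, or, put differently, mixing Chinese and English is itself a questionable way of speaking.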

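4. Combining multiple interfaces for pronunciation assessment

In real-world development, implementing a single feature often involves calling several interfaces together in order to improve the user experience and fit the application scenario.

Note that the same audio file may be transcribed into different text by different interfaces or with different parameters.

The differences may show up in the following ways:

  1. Inconsistent punctuation and letter case.
  2. Inconsistent parsing of mixed Chinese and English speech, e.g. 你好 vs. nihao.
  3. Missing words, or even one or more missing sentences.

The first two are manageable; the last one is hard to deal with. So far it has occurred most often when using the SDK for pronunciation assessment: without a reference text, or with a coarse granularity, the transcription may lose a few words or a sentence or two.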

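For production efficiency, it is recommended to transcribe the audio with fast transcription first, and then use the SDK with that transcript as the reference text for scoring.

For scenarios that need prosody scores, the SDK can be called separately without a reference text for a full-text-level assessment, which returns the prosody scores.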

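5. Concurrency limits

Concurrency on Microsoft Azure can be adjusted in two ways.

First, when creating the subscription, choosing a paid plan raises the concurrency limit. See the documentation.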

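Second, create multiple subscriptions and distribute requests evenly across them (e.g. with a round-robin queue, hashing, weighting, or counters) so that the concurrency on each subscription key stays within a certain level.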