Unearthing sources of bias: A sociolinguistically-informed analysis of single- and multi-phone ASR errors on a multiethnic corpus

Speaker: Alicia Beckford Wassink

Location: 10 Washington Place

Date: Friday, February 24, 2023

Many industry reports praise contemporary commercial automated speech recognition (ASR) systems, boasting of near-parity between the performance of these systems and that of human transcribers. Wassink et al. (2022) highlighted three enduring problems with contemporary ASR: historical bias, representation bias, and aggregation bias. Over the last 20 years, training corpora have persisted in omitting data for minoritized groups (historical and representation biases). Pronunciation models thereby reproduce racial inequities in tools intended to work for normal users and use cases (aggregation bias). Furthermore, the pronunciation tables that provide phone recognition targets omit sociolinguistic variants that are commonly used in underrepresented speech communities (e.g., consonant cluster simplification, (th)-stopping, the (PIN/PEN) merger).

This talk examines phonetic errors generated by a custom-built ASR system using Microsoft’s Speech developer kit. I focus on error rates associated with 17 sociophonetic variables in a subset of the Pacific Northwest English (PNWE) corpus, including African American, Yakama (Native) American, and ChicanX speech. My previous work found that certain phonetic error types were more frequently observed in these data than in comparable forms produced by our mainstream English-speaking white speakers, resulting in lower-quality transcriptions for all non-white groups. That earlier work focused on word errors that could be traced back to single-phone errors associated with well-studied sociolinguistic variables. However, there tend to be more multi-phone than single-phone errors in the African American materials: multi-phone errors account for 3-4.5% of all errors in the PNWE Caucasian-American speech, but for 37-62% in our African American Language (AAL) materials (the single- versus multi-phone distinction is sketched below). This observation holds for a comparison AAL corpus from another region of the US. I will consider possible causes of multi-phone errors.

Most ASR systems today use a “generative model” of phone classification: in the domain of supervised machine learning (ML), this means that we attempt to infer the likeliest sequence of words, given a linear sequence of acoustic feature vectors. Phone probabilities are adjusted under simultaneous reference to the probabilities of the language model. These two facts about recognition system architectures appear to conspire to create phonetic error rates that are approximately 4-6 times higher for our non-white speech materials than for those of our white speakers.
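
For readers unfamiliar with this architecture, the decoding objective behind the generative framing can be written in its standard noisy-channel form (a textbook formulation, not notation taken from the talk). Given a sequence of acoustic feature vectors X, the recognizer searches for the word sequence

    \hat{W} = \arg\max_{W} P(W \mid X)
            = \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic/pronunciation model}} \; \underbrace{P(W)}_{\text{language model}}

One plausible route from a single unmodeled sociolinguistic variant to a multi-phone error: if a community’s variant is missing from the pronunciation table, P(X | W) is depressed for the intended words, and the language-model prior P(W) can then pull the decoder toward an entirely different word sequence, corrupting several adjacent phones at once.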
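
To make the single- versus multi-phone distinction concrete, here is a minimal, hypothetical sketch of how contiguous error spans in a phone-level alignment might be counted. The alignment routine and the ARPABET example are illustrative assumptions, not the actual analysis pipeline used in this work:

    # Hypothetical sketch: classify errors in a phone-level alignment as
    # single-phone (span of length 1) or multi-phone (span longer than 1).
    # difflib is used here for brevity; the study's own alignment is not shown.
    from difflib import SequenceMatcher

    def error_spans(reference, hypothesis):
        """Lengths of contiguous mismatched spans between two phone lists."""
        matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
        return [max(i2 - i1, j2 - j1)
                for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag != "equal"]

    # (th)-stopping example: "them" /DH EH M/ recognized as "dem" /D EH M/
    ref = ["DH", "EH", "M"]
    hyp = ["D", "EH", "M"]
    spans = error_spans(ref, hyp)
    single_phone = sum(1 for s in spans if s == 1)  # one single-phone error
    multi_phone = sum(1 for s in spans if s > 1)    # no multi-phone errors
    print(spans, single_phone, multi_phone)         # [1] 1 0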