ABSTRACT
This study investigates the acoustic variability of intervocalic alveolar taps in a corpus of spontaneous speech from Madrid, Spain. Substantial variability was documented in this segment, with highly reduced variants constituting roughly half of all tokens during spectrographic inspection. In addition to qualitative documentation, the intensity difference between the tap and surrounding vowels was measured. Changes in this intensity difference were statistically modeled using Bayesian finite mixture models containing lexical and phonetic predictors. Model comparisons indicate that predictive performance improves when two latent categories are assumed, interpreted as two pronunciation variants of the Spanish tap. In interpreting the model, the predictors were more often related to categorical changes in which pronunciation variant was produced than to gradient intensity changes within each tap type. Tap production varied with lexical frequency, speech rate, and phonetic environment. These results underscore the importance of evaluating model fit to the data, and what researchers modeling phonetic variability can gain by moving beyond linear models when those models do not adequately fit the observed data.
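The latent-category idea can be illustrated with a simplified, non-Bayesian stand-in: a two-component Gaussian mixture fit by expectation-maximization over simulated intensity differences. This is a sketch under stated assumptions (synthetic data, maximum-likelihood estimation, no predictors); the study itself used Bayesian mixture models with lexical and phonetic predictors.

```python
import math
import random

def em_two_gaussians(data, iters=200):
    """Fit a two-component Gaussian mixture by expectation-maximization.

    Simplified maximum-likelihood stand-in for the Bayesian mixtures in
    the abstract: each token's intensity difference is treated as coming
    from one of two latent pronunciation variants."""
    lo, hi = min(data), max(data)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]  # spread initial means
    sd = [(hi - lo) / 4 or 1.0] * 2
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each token
        resp = []
        for x in data:
            p = [w[k] * math.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2) / sd[k]
                 for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means, and standard deviations
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sd[k] = math.sqrt(max(var, 1e-9))
    return w, mu, sd

# Hypothetical intensity differences (dB): a reduced and a full variant
random.seed(1)
data = ([random.gauss(3, 1) for _ in range(200)]
        + [random.gauss(12, 2) for _ in range(200)])
w, mu, sd = em_two_gaussians(data)
```

Comparing the log-likelihood of this two-component fit against a single-Gaussian fit mirrors, in miniature, the model comparison the abstract describes.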
Subjects
Speech Acoustics, Speech Perception, Bayes Theorem, Speech, Phonetics, Acoustics

ABSTRACT
Given an orthographic transcription, forced alignment systems automatically determine boundaries between segments in speech, facilitating the use of large corpora. In the present paper, we introduce a neural network-based forced alignment system, the Mason-Alberta Phonetic Segmenter (MAPS). MAPS serves as a testbed for two possible improvements we pursue for forced alignment systems. The first is treating the acoustic model as a tagger, rather than a classifier, motivated by the common understanding that segments are not truly discrete and often overlap. The second is an interpolation technique to allow more precise boundaries than the typical 10 ms limit in modern systems. During testing, all system configurations we trained significantly outperformed the state-of-the-art Montreal Forced Aligner at the 10 ms boundary placement tolerance threshold. The greatest difference achieved was a 28.13% relative performance increase. The Montreal Forced Aligner began to slightly outperform our models at around a 30 ms tolerance. We also reflect on the training process for acoustic modeling in forced alignment, highlighting how the output targets for these models do not match phoneticians' conception of similarity between phones and that reconciling this tension may require rethinking the task and output targets or how speech itself should be segmented.
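The sub-frame idea can be sketched generically: given per-frame boundary scores at a 10 ms step, a parabola fit through the best-scoring frame and its two neighbors places the boundary between frames. This is a common refinement technique and an illustration only; the interpolation actually used in MAPS may differ.

```python
def refine_boundary(scores, frame_ms=10.0):
    """Place a boundary with sub-frame precision by fitting a parabola
    through the best-scoring frame and its two neighbors.

    Generic sketch of sub-frame interpolation; the exact technique used
    in MAPS may differ."""
    i = max(range(len(scores)), key=scores.__getitem__)
    if i == 0 or i == len(scores) - 1:
        return i * frame_ms  # no neighbor on one side; stay on the frame grid
    y0, y1, y2 = scores[i - 1], scores[i], scores[i + 1]
    denom = y0 - 2 * y1 + y2
    offset = 0.0 if denom == 0 else 0.5 * (y0 - y2) / denom  # parabola vertex
    return (i + offset) * frame_ms

# Scores peak at frame 2 but lean toward frame 3, so the refined
# boundary lands between 20 and 30 ms rather than on the 10 ms grid:
t = refine_boundary([0.1, 0.4, 0.9, 0.8, 0.2])
```

The same three-point fit is widely used for sub-bin peak picking in spectral analysis, which is why it is a natural candidate here.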
Subjects
Neural Networks (Computer), Phonetics, Humans, Speech, Speech Acoustics

ABSTRACT
It is well known that children with expressive communication difficulties have the right to communicate, but they should also have the right to do so in whichever language they choose, with a voice that closely matches their age, gender, and dialect. This study aimed to develop naturalistic synthetic child speech matching the vocal identity of three children with expressive communication difficulties, using Tacotron 2, for three under-resourced South African languages, namely South African English (SAE), Afrikaans, and isiXhosa. Due to the scarcity of child speech corpora, two hours of speech data were collected from each of three 11- to 12-year-old children. Two adult models were used to "warm start" the child speech synthesis. To determine the naturalness of the synthetic voices, 124 listeners participated in a mean opinion score survey (Likert scale) and optionally gave qualitative feedback. Despite the limited training data used in this study, we successfully developed a synthesized child voice of adequate quality in each language. This study highlights that, with recent technological advancements, it is possible to develop synthetic child speech that matches the vocal identity of a child with expressive communication difficulties in different under-resourced languages.
ABSTRACT
This study examines the role of frequencies above 8 kHz in the classification of conversational speech fricatives [f, v, θ, ð, s, z, ʃ, ʒ, h] in random forest modeling. Prior research has mostly focused on spectral measures for fricative categorization using frequency information below 8 kHz. The contribution of higher frequencies has received only limited attention, especially for non-laboratory speech. In the present study, we use a corpus of sociolinguistic interview recordings of Western Canadian English sampled at 44.1 and 16 kHz. For both sampling rates, we analyze spectral measures obtained using Fourier analysis and the multitaper method, and we also compare models without and with amplitude measures. Results show that while frequency information above 8 kHz does not improve classification accuracy in random forest analyses, inclusion of such frequencies can affect the relative importance of specific measures: the contribution of center of gravity decreases and that of spectral standard deviation increases at the higher sampling rate. We also find no major differences in classification accuracy between Fourier and multitaper measures. The inclusion of power measures improves model accuracy but does not change the overall importance of spectral measures.
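The two measures whose importance shifts with sampling rate can be computed directly from a power spectrum: center of gravity is the power-weighted mean frequency, and spectral standard deviation is the square root of the power-weighted variance. A minimal sketch over a toy spectrum (real analyses would window the signal and use Fourier or multitaper spectral estimates):

```python
import math

def spectral_moments(freqs, power):
    """Center of gravity (first spectral moment) and spectral standard
    deviation (square root of the second central moment), computed from
    a power spectrum given as parallel frequency/power lists."""
    total = sum(power)
    cog = sum(f * p for f, p in zip(freqs, power)) / total
    var = sum((f - cog) ** 2 * p for f, p in zip(freqs, power)) / total
    return cog, math.sqrt(var)

# Toy spectrum with energy concentrated around 5 kHz
freqs = [1000.0 * k for k in range(1, 11)]   # 1-10 kHz bins
power = [0, 0, 0, 1, 4, 1, 0, 0, 0, 0]
cog, sd = spectral_moments(freqs, power)     # cog = 5000.0 Hz
```

Extending `freqs` above 8 kHz changes both moments whenever there is high-frequency energy, which is the mechanism behind the shifting variable importance the abstract reports.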
Subjects
Communication, Language, Canada, Linguistics, Random Forest Algorithm

ABSTRACT
The papers in this special issue provide a critical look at some historical ideas that have had an influence on research and teaching in the field of speech communication. They also address widely used methodologies and long-standing methodological challenges in the areas of speech perception and speech production. The goal is to reconsider these historical ideas and evaluate whether they call for caution or for replacement with more modern results and methods. The contributions provide respectful historical context to the classic ideas, as well as new original research or discussion that clarifies the limitations of the original ideas.
Subjects
Speech Perception, Speech, Communication

ABSTRACT
The present study compares the production of fricatives in conversational versus read speech in American English. The goal is to examine which parameters contribute to the identification of fricatives across the two speech styles. The study surveys over 162 000 fricative tokens from the Buckeye Corpus [Pitt, Johnson, Hume, Kiesling, and Raymond (2005). Speech Commun. 45, 89-95] and the TIMIT Corpus [Zue and Seneff (1996). Recent Research towards Advanced Man-Machine Interface through Spoken Language (Elsevier, Amsterdam, the Netherlands), pp. 515-525]. A total of 18 different temporal and spectral measures are tested, including segment duration, preceding and following phone duration, spectral moments (at onset, midpoint, and/or offset), spectral peak frequency, etc. Results show that segment duration and midpoint spectral moments make the most prominent contribution to the categorization of fricatives for both speech styles. Spectral measures are more important for conversational speech, whereas duration plays a greater role for read speech. At the same time, the magnitude of the differences across speech styles is often low and many of the observed effects may be attributable to methodological differences across the corpora. Results may indicate that reduction of fricatives in conversational speech is more limited compared to the reduction of other types of speech sounds, such as plosives.
Subjects
Language, Speech Perception, Humans, United States, Speech Acoustics, Phonetics, Speech

ABSTRACT
Phonological neighborhood density has been a common way to quantify lexical competition. It is useful and convenient but has shortcomings that are worth reconsidering. The present study quantifies the effects of lexical competition during spoken word recognition using acoustic distance and acoustic absement rather than phonological neighborhood density. A word's lexical competition is indexed by what is termed its acoustic distinctiveness, taken as its average acoustic absement to all words in the lexicon. A variety of acoustic representations for items in the lexicon are analyzed. Statistical modeling shows that acoustic distinctiveness has an effect trend similar to that of phonological neighborhood density. Additionally, acoustic distinctiveness consistently improves model fit more than phonological neighborhood density, regardless of which kind of acoustic representation is used. However, acoustic distinctiveness does not seem to explain all of the same things as phonological neighborhood density. The different areas that these two predictors explain are discussed, in addition to the potential theoretical implications of the usefulness of acoustic distinctiveness in the models. The paper concludes with some reasons why a researcher may want to use acoustic distinctiveness over phonological neighborhood density in future experiments.
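Acoustic absement can be approximated as the distance accumulated along a dynamic-time-warping alignment of two acoustic tracks, and acoustic distinctiveness as the mean absement from a word to the rest of the lexicon. A toy sketch with one-dimensional feature tracks (an assumption for brevity; real lexicons use multidimensional acoustic representations and frame-wise distances):

```python
def dtw_absement(a, b):
    """Accumulated distance along the optimal dynamic-time-warping
    alignment of two feature tracks; a rough proxy for acoustic
    absement (distance accumulated over time)."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def acoustic_distinctiveness(word, lexicon):
    """Mean absement from one word to every other word in the lexicon."""
    others = [w for w in lexicon if w is not word]
    return sum(dtw_absement(word, w) for w in others) / len(others)

# Toy lexicon of 1-D "acoustic" tracks; the outlier is most distinctive
lex = [[1.0, 1.0, 1.0], [1.0, 1.0, 2.0], [9.0, 9.0, 9.0]]
```

On this toy lexicon, the acoustically isolated track has the highest distinctiveness, capturing the intuition that a word far from its competitors experiences less lexical competition.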
Subjects
Acoustics, Linguistics

ABSTRACT
Children with cerebral palsy (CP) are characterized as difficult to understand because of poor articulation and breathy voice quality. This case series describes the subsystems of the speech mechanism (i.e., respiratory, laryngeal, oroarticulatory) in four children with CP and four matched typically developing children (TDC) during the modulation of vocal loudness. TDC used biomechanically efficient strategies among speech subsystems to increase vocal loudness. Children with CP made fewer breathing adjustments but recruited greater chest wall muscle activity and neuromuscular drive for louder productions. These results inform future clinical research and identify speech treatment targets for children with motor speech disorders.
Subjects
Cerebral Palsy, Dysarthria, Cerebral Palsy/complications, Child, Dysarthria/etiology, Humans, Speech

ABSTRACT
As scientists, we should sample as broadly as possible; however, research on the speech acoustics of the world's languages is biased toward languages of convenience (e.g., English). This special issue seeks to initiate increased publication of acoustic research on the sounds of the world's languages. The special issue contains a sample of 25 under-documented languages. While large relative to previous work (particularly in the Journal of the Acoustical Society of America), the 23 articles in this issue just scratch the surface. To gain a better understanding of the fundamentals of speech communication, it is imperative, as a research community, to make a concerted effort to learn more about how speech sounds are perceived and produced in a wide variety of languages.
Subjects
Phonetics, Speech Perception, Language, Speech, Speech Acoustics

ABSTRACT
Multiple measures of vowel overlap have been proposed that use F1, F2, and duration to calculate the degree of overlap between vowel categories. The present study assesses four of these measures: the spectral overlap assessment metric [SOAM; Wassink (2006). J. Acoust. Soc. Am. 119(4), 2334-2350], the a posteriori probability (APP)-based metric [Morrison (2008). J. Acoust. Soc. Am. 123(1), 37-40], the vowel overlap analysis with convex hulls method [VOACH; Haynes and Taylor, (2014). J. Acoust. Soc. Am. 136(2), 883-891], and the Pillai score as first used for vowel overlap by Hay, Warren, and Drager [(2006). J. Phonetics 34(4), 458-484]. Summaries of the measures are presented, and theoretical critiques of them are performed, concluding that the APP-based metric and Pillai score are theoretically preferable to SOAM and VOACH. The measures are empirically assessed using accuracy and precision criteria with Monte Carlo simulations. The Pillai score demonstrates the best overall performance in these tests. The potential applications of vowel overlap measures to research scenarios are discussed, including comparisons of vowel productions between different social groups, as well as acoustic investigations into vowel formant trajectories.
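Of the four measures, the Pillai score is the easiest to state compactly: it is the Pillai-Bartlett trace from a MANOVA of the formant data on vowel category. A minimal two-category, two-formant sketch with hypothetical F1/F2 values (in practice the score is read off a statistics package's MANOVA output, and duration can be added as a third dimension):

```python
def pillai_score(group_a, group_b):
    """Pillai-Bartlett trace for two vowel categories measured on
    (F1, F2): 0 = complete overlap, values near 1 = full separation.
    Minimal two-group, two-dimensional case for illustration."""
    def mean(pts):
        n = len(pts)
        return [sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n]

    gm = mean(group_a + group_b)
    H = [[0.0, 0.0], [0.0, 0.0]]  # between-category (hypothesis) SSCP
    E = [[0.0, 0.0], [0.0, 0.0]]  # within-category (error) SSCP
    for g in (group_a, group_b):
        m = mean(g)
        d = [m[0] - gm[0], m[1] - gm[1]]
        for i in range(2):
            for j in range(2):
                H[i][j] += len(g) * d[i] * d[j]
        for p in g:
            e = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    E[i][j] += e[i] * e[j]
    # Pillai trace = trace(H * (H + E)^-1), with a hand-rolled 2x2 inverse
    T = [[H[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    Tinv = [[T[1][1] / det, -T[0][1] / det], [-T[1][0] / det, T[0][0] / det]]
    return sum(H[i][k] * Tinv[k][i] for i in range(2) for k in range(2))

# Hypothetical F1/F2 samples (Hz) for two well-separated vowels
front = [(300.0, 2200.0), (310.0, 2190.0), (305.0, 2210.0), (295.0, 2205.0)]
back = [(700.0, 1100.0), (710.0, 1090.0), (705.0, 1110.0), (695.0, 1105.0)]
p_sep = pillai_score(front, back)   # near 1: almost no overlap
# Heavily overlapping categories give a score near 0
a2 = [(500.0, 1500.0), (510.0, 1495.0), (495.0, 1505.0), (505.0, 1490.0)]
b2 = [(502.0, 1498.0), (508.0, 1502.0), (498.0, 1494.0), (504.0, 1506.0)]
p_over = pillai_score(a2, b2)
```

Because the score compares between-category spread to total spread, it is bounded by 0 and 1 in the two-group case, which is what makes it convenient for comparing overlap across speakers or social groups.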
ABSTRACT
The Massive Auditory Lexical Decision (MALD) database is an end-to-end, freely available auditory and production data set for speech and psycholinguistic research, providing time-aligned stimulus recordings for 26,793 words and 9592 pseudowords, and response data for 227,179 auditory lexical decisions from 231 unique monolingual English listeners. In addition to the experimental data, we provide many precompiled listener- and item-level descriptor variables. This data set makes it easy to explore responses, build and test theories, and compare a wide range of models. We present summary statistics and analyses.
Subjects
Decision Making, Adolescent, Adult, Data Collection, Factual Databases, Female, Humans, Language, Male, Psycholinguistics, Speech, Young Adult

ABSTRACT
Spoken language manifests itself as change over time in various acoustic dimensions. While it seems clear that acoustic-phonetic information in the speech signal is key to language processing, little is currently known about which specific types of acoustic information are relatively more informative to listeners. This problem is likely compounded when considering reduced speech: which specific acoustic information do listeners rely on when encountering spoken forms that are highly variable and often include altered or elided segments? This work explores the contributions of spectral shape, f0 contour, target duration, and time-varying intensity to the perception of reduced speech. It extends previous laboratory-speech-based perception studies into the realm of casual speech, and also provides support for the use of an algorithm that quantifies phonetic reduction. The data suggest that the role of spectral shape is extensive, and that its removal degrades signals in a way that severely hinders recognition. Information reflecting f0 contour and target duration both appear to aid the listener somewhat, though their influence seems small compared to that of short-term spectral shape. Finally, information about time-varying intensity aids the listener more than noise-filled gaps, and both aid the listener beyond presentation of acoustic context with duration-matched silence.
Subjects
Acoustic Stimulation/methods, Acoustics, Auditory Perception/physiology, Language, Speech Perception/physiology, Speech, Adolescent, Adult, Female, Humans, Male, Young Adult

ABSTRACT
Lexical tone identification requires a number of secondary cues when main tonal contours are unavailable. In this article, we examine Mandarin native speakers' ability to identify lexical tones by extracting tonal information from sonorant onset pitch (onset contours) on syllable-initial nasals ranging from 50 to 70 ms in duration. In Experiments I and II, we test speakers' ability to identify lexical tones in a second syllable with and without onset contours in isolation (Experiment I) and in a sentential context (Experiment II). The results indicate that speakers can identify lexical tones with short distinctive onset contour patterns; they also indicate that misperception of tones 213 and 24 is common. Furthermore, in Experiment III, we test whether onset contours in a following syllable can be utilized by listeners in tone identification. We find that onset contours in the following syllable also contribute to the identification of the target lexical tones. The conclusions are twofold: (1) Mandarin lexical tones can be identified from onset contours; (2) the tonal domain must be extended to include not just typical cues of tones but also coarticulated tonal patterns.
Subjects
Cues (Psychology), Language, Phonetics, Pitch Perception, Speech Acoustics, Speech Perception, Humans

ABSTRACT
Recent evidence indicates that a word's paradigmatic neighbors affect production. However, these findings have mostly been obtained in careful laboratory settings using words in isolation, and thus ignoring potential effects that may arise from the syntagmatic context, which is typically present in spontaneous speech. The current corpus analysis investigates paradigmatic and syntagmatic effects in Estonian spontaneous speech. Following work on English, we focus on the duration of inflected and uninflected word-final /-s/ in content words, while simultaneously investigating whole words. Our analyses reveal three points. First, we find an effect of realized inflectional paradigm size, such that smaller paradigms actively used by the speakers lead to longer durations. Second, higher conditional probability is associated with shorter word forms and shorter segments. Finally, we do not directly replicate previous work on effects of inflectional status as in English word-final /-s/. Instead, we find that inflectional status interacts with conditional probability. We discuss the results in light of models of speech production and how they account for morphologically complex words and their paradigmatic neighbors.
Subjects
Language, Speech, Humans, Estonia, Probability, Time Factors

ABSTRACT
We present an implementation of DIANA, a computational model of spoken word recognition, to model responses collected in the Massive Auditory Lexical Decision (MALD) project. DIANA is an end-to-end model, including an activation and decision component that takes the acoustic signal as input, activates internal word representations, and outputs lexicality judgments and estimated response latencies. Simulation 1 presents the process of creating acoustic models required by DIANA to analyze novel speech input. Simulation 2 investigates DIANA's performance in determining whether the input signal is a word present in the lexicon or a pseudoword. In Simulation 3, we generate estimates of response latency and correlate them with general tendencies in participant responses in MALD data. We find that DIANA performs fairly well in free word recognition and lexical decision. However, the current approach for estimating response latency provides estimates opposite to those found in behavioral data. We discuss these findings and offer suggestions as to what a contemporary model of spoken word recognition should be able to do.
Subjects
Speech Perception, Speech, Humans, Reaction Time, Computer Simulation, Speech Perception/physiology, Acoustics

ABSTRACT
In conversational speech, phones and entire syllables are often missing. This can make "he's" and "he was" homophonous, realized for example as [ɨz]. Similarly, "you're" and "you were" can both be realized as [jɚ], [ɨ], etc. We investigated what types of information native listeners use to perceive such verb tenses. Possible types included acoustic cues in the phrase (e.g., in "he was"), the rate of the surrounding speech, and syntactic and semantic information in the utterance, such as the presence of time adverbs like "yesterday" or other tensed verbs. We extracted utterances such as "So they're gonna have like a random roommate" and "And he was like, 'What's wrong?!'" from recordings of spontaneous conversations. We presented parts of these utterances to listeners, in either a written or auditory modality, to determine which types of information facilitated listeners' comprehension. Listeners rely primarily on acoustic cues in or near the target words rather than on meaning and syntactic information in the context. While that information also improves comprehension in some conditions, the acoustic cues in the target itself are strong enough to reverse the percept that listeners gain from all other information together. Acoustic cues thus override other information in the comprehension of reduced productions in conversational speech.
ABSTRACT
While known to influence visual lexical processing, the semantic information we associate with words has recently been found to influence auditory lexical processing as well. The present work explored the influence of semantic richness in auditory lexical decision. Study 1 recreated an experiment investigating semantic richness effects in concrete nouns (Goh et al., 2016). In Study 2, we expanded the stimulus set from 442 to 8,626 items, exploring the robustness of effects observed in Study 1 against a larger data set with increased diversity in both word class and other characteristics of interest. We also utilized generalized additive mixed models to investigate potential nonlinear effects. Results indicate that semantic richness effects become more nuanced and detectable when a wider set of items belonging to different parts of speech is examined. Findings are discussed in the context of models of spoken word recognition.
ABSTRACT
Listeners require context to understand the highly reduced words that occur in casual speech. The present study reports four auditory lexical decision experiments in which the role of semantic context in the comprehension of reduced versus unreduced speech was investigated. Experiments 1 and 2 showed semantic priming for combinations of unreduced, but not reduced, primes and low-frequency targets. In Experiment 3, we crossed the reduction of the prime with the reduction of the target. Results showed no semantic priming from reduced primes, regardless of the reduction of the targets. Finally, Experiment 4 showed that reduced and unreduced primes facilitate upcoming low-frequency related words equally if the interstimulus interval is extended. These results suggest that semantically related words need more time to be recognized after reduced primes, but once reduced primes have been fully (semantically) processed, these primes can facilitate the recognition of upcoming words as well as do unreduced primes.
Subjects
Comprehension/physiology, Semantics, Speech Perception/physiology, Humans, Phonetics, Psycholinguistics, Psychological Tests, Reaction Time, Recognition (Psychology)/physiology

ABSTRACT
Variability is perhaps the most notable characteristic of speech, and it is particularly noticeable in spontaneous conversational speech. The current research examines how speakers realize the American English stops /p, k, b, g/ and flaps (ɾ from /t, d/), in casual conversation and in careful speech. Target consonants appear after stressed syllables (e.g., "lobby") or between unstressed syllables (e.g., "humanity"), in one of six segmental/word-boundary environments. This work documents the degree and types of variability listeners encounter and must parse. Findings show greater reduction in connected and spontaneous speech, greater reduction in high frequency phrases (but not within high frequency words), and greater reduction between unstressed syllables than after a stress. Although highly reduced productions of stops and flaps occur often, with approximant-like tokens even in careful speech, reduction does not lead to a large amount of overlap between phonological categories. Approximant-like realizations of expected stops and flaps in some conditions constitute the majority of tokens. This shows that reduced speech is something that listeners encounter, and must perceive, in a large proportion of the speech they hear.
Subjects
Phonation, Speech Acoustics, Speech Intelligibility, Speech Perception, Analysis of Variance, Female, Humans, Male, Computer-Assisted Signal Processing, Sound Spectrography, Speech Production Measurement, Time Factors

ABSTRACT
The present study investigates the informativity of anticipatory coarticulatory acoustic detail about inflectional suffixes in English verbs, performing two experiments in which listeners classified inflectional functions of verbs. Listener response latencies were slower when acoustic detail resulting from anticipatory coarticulation mismatched with the inflectional suffix. The results indicate that listeners actively use coarticulatory phonetic detail to predict the verbs' inflectional function.