1 - 20 of 8,444
1.
J Acoust Soc Am ; 155(5): 3037-3050, 2024 May 01.
Article En | MEDLINE | ID: mdl-38717209

Progress in fin whale research is hindered by debate over whether the two typical call types, type A and type B (characterized by central source frequencies of 17-20 Hz and 20-30 Hz, respectively), originate from a single fin whale or from two individuals. Here, hydroacoustic data are employed to study the type, vocal behavior, and temporal evolution of fin whale calls around southern Wake Island from 2010 to 2022. The analysis shows that (1) type-A and type-B calls come from two individuals, based on the large separation of the two calls' sources obtained through high-precision source localization; (2) type-A fin whales exert a vocal influence on type-B fin whales: type-B calls become paired with type-A calls and occur regularly when type-A fin whales appear, with type-A fin whales always leading the call sequences; and (3) some type-A fin whales stop calling when another type-A fin whale approaches within about 1.6 km. During 2010-2022, type-A calls occurred every year, whereas type-B calls became prevalent only after November 2018. Cultural transmission from type-A to type-B fin whales and/or a population increase of type-B fin whales in the region after November 2018 is proposed.
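The source localization underpinning finding (1) rests on time-difference-of-arrival (TDOA) estimates between hydrophones. A minimal sketch of that idea on a synthetic pulse, using numpy/scipy; this is not the authors' code, and a real localization would combine several sensor pairs:

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

fs = 250  # Hz, a plausible sampling rate for low-frequency hydroacoustic data

def estimate_tdoa(x, y, fs):
    """Return the lag (seconds) at which y best aligns with x."""
    corr = correlate(y, x, mode="full")
    lags = correlation_lags(len(y), len(x), mode="full")
    return lags[np.argmax(corr)] / fs

# Synthetic 20 Hz pulse arriving 0.4 s later at a second sensor.
t = np.arange(0, 2, 1 / fs)
pulse = np.sin(2 * np.pi * 20 * t) * np.exp(-((t - 0.5) ** 2) / 0.01)
delayed = np.roll(pulse, int(0.4 * fs))
print(estimate_tdoa(pulse, delayed, fs))  # ~0.4
```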


Acoustics , Fin Whale , Vocalization, Animal , Animals , Fin Whale/physiology , Sound Spectrography , Time Factors , Islands
2.
J Acoust Soc Am ; 155(5): 3071-3089, 2024 May 01.
Article En | MEDLINE | ID: mdl-38717213

This study investigated how 40 Chinese learners of English as a foreign language (EFL learners) differed from 40 native English speakers in the production of four English tense-lax contrasts, /i-ɪ/, /u-ʊ/, /ɑ-ʌ/, and /æ-ɛ/, by examining the acoustic measurements of duration, the first three formant frequencies, and the slope of the first formant movement (F1 slope). The dynamic formant trajectory was modeled using discrete cosine transform coefficients to demonstrate the time-varying properties of formant trajectories. A discriminant analysis was employed to illustrate the extent to which Chinese EFL learners relied on different acoustic parameters. This study found that: (1) Chinese EFL learners overemphasized durational differences and weakened spectral differences for the /i-ɪ/, /u-ʊ/, and /ɑ-ʌ/ pairs, although they maintained sufficient spectral differences for /æ-ɛ/. In contrast, native English speakers predominantly used spectral differences across all four pairs; (2) in non-low tense-lax contrasts, unlike native English speakers, Chinese EFL learners failed to exhibit different F1 slope values, indicating a non-nativelike tongue-root placement during the articulatory process. The findings underscore the contribution of dynamic spectral patterns to the differentiation between English tense and lax vowels, and reveal the influence of precise articulatory gestures on the realization of the tense-lax contrast.
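For readers unfamiliar with the trajectory modeling mentioned above: the first few DCT coefficients of a formant track compactly encode its mean level, slope, and curvature. A minimal sketch with a fabricated F2 contour (not the authors' data or code):

```python
import numpy as np
from scipy.fft import dct

# Hypothetical F2 trajectory (Hz) sampled at 10 equidistant points in a vowel.
f2 = np.array([1900, 1950, 2010, 2060, 2100, 2120, 2110, 2080, 2040, 2000.0])

coefs = dct(f2, type=2, norm="ortho")[:3]
# coefs[0] ~ mean level, coefs[1] ~ slope, coefs[2] ~ curvature of the contour
print(coefs)
```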


Multilingualism , Phonetics , Speech Acoustics , Humans , Male , Female , Young Adult , Speech Production Measurement , Adult , Language , Acoustics , Learning , Voice Quality , Sound Spectrography , East Asian People
3.
PLoS One ; 19(5): e0300607, 2024.
Article En | MEDLINE | ID: mdl-38787824

Listening to music is a crucial tool for relieving stress and promoting relaxation. However, the limited options available for stress-relief music do not cater to individual preferences, compromising its effectiveness. Traditional methods of curating stress-relief music rely heavily on measuring biological responses, which is time-consuming, expensive, and requires specialized measurement devices. This paper introduces a deep learning approach to this problem, based on convolutional neural networks, that provides a more efficient and economical method for generating large datasets of stress-relief music. These datasets are composed of Mel-scaled spectrograms that include essential sound elements (such as frequency, amplitude, and waveform) that can be directly extracted from the music. The trained model demonstrated a test accuracy of 98.7%, and a clinical study indicated that the model-selected music was as effective as researcher-verified music in terms of stress-relieving capacity. This paper underlines the transformative potential of deep learning in addressing the challenge of limited music options for stress relief. More importantly, the proposed method has profound implications for music therapy because it enables a more personalized approach to stress-relief music selection, offering the potential for enhanced emotional well-being.
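As an illustration of the input representation described here (not the authors' pipeline), a Mel-scaled spectrogram can be computed with librosa; the file name and parameter values below are placeholders:

```python
import librosa
import numpy as np

y, sr = librosa.load("music_clip.wav", sr=22050, duration=30.0)  # placeholder file

# Mel-scaled spectrogram capturing frequency, amplitude, and waveform shape
# in a 2-D form a convolutional network can consume as an "image".
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log-compressed, shape (128, n_frames)
```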


Music Therapy , Music , Neural Networks, Computer , Stress, Psychological , Humans , Music/psychology , Stress, Psychological/therapy , Music Therapy/methods , Deep Learning , Male , Female , Adult , Sound Spectrography/methods , Young Adult
4.
Ter Arkh ; 96(3): 228-232, 2024 Apr 16.
Article Ru | MEDLINE | ID: mdl-38713036

AIM: To evaluate the possibility of using spectral analysis of cough sounds in the diagnosis of the novel coronavirus infection COVID-19. MATERIALS AND METHODS: Spectral tussophonobarography was performed in 218 patients with COVID-19 [48.56% men, 51.44% women, average age 40.2 (32.4; 51.0) years] and in 60 healthy individuals [50% men, 50% women, average age 41.7 (32.2; 53.0) years] with induced cough (inhalation of citric acid solution at a concentration of 20 g/l through a nebulizer). Recordings were made with a contact microphone mounted on a tripod 15-20 cm from the subject's face and were then processed in a computer program that performed spectral analysis of the cough sounds using Fourier transform algorithms. The following parameters were evaluated: duration of the cough act (ms), ratio of low-frequency energy (60-600 Hz) to high-frequency energy (600-6000 Hz), and frequency of maximum cough-sound energy (Hz). RESULTS: Statistical analysis showed that the cough-sound parameters of COVID-19 patients differ from those of healthy individuals. The measured values were substituted into the developed regression equation; rounded to the nearest integer, the result was interpreted as "0" (COVID-19 absent) or "1" (COVID-19 present). CONCLUSION: The technique showed high sensitivity and specificity. In addition, the method is easy to use and does not require expensive equipment, so it can be applied in practice for timely diagnosis of COVID-19.
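The three acoustic parameters are straightforward to compute from a recorded cough. A hedged sketch, since the study's exact windowing and preprocessing are not specified:

```python
import numpy as np

def cough_features(x, fs):
    """Duration (ms), low/high band energy ratio, and peak-energy frequency (Hz)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    low = spec[(freqs >= 60) & (freqs < 600)].sum()      # 60-600 Hz energy
    high = spec[(freqs >= 600) & (freqs <= 6000)].sum()  # 600-6000 Hz energy
    f_peak = freqs[np.argmax(spec)]                      # frequency of max energy
    return 1000 * len(x) / fs, low / high, f_peak
```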


COVID-19 , Cough , SARS-CoV-2 , Humans , Cough/diagnosis , Cough/etiology , Cough/physiopathology , COVID-19/diagnosis , Female , Male , Adult , Middle Aged , Sound Spectrography/methods
5.
PeerJ ; 12: e17320, 2024.
Article En | MEDLINE | ID: mdl-38766489

Vocal complexity is central to many evolutionary hypotheses about animal communication. Yet, quantifying and comparing complexity remains a challenge, particularly when vocal types are highly graded. Male Bornean orangutans (Pongo pygmaeus wurmbii) produce complex and variable "long call" vocalizations comprising multiple sound types that vary within and among individuals. Previous studies described six distinct call (or pulse) types within these complex vocalizations, but none quantified their discreteness or the ability of human observers to reliably classify them. We studied the long calls of 13 individuals to: (1) evaluate and quantify the reliability of audio-visual classification by three well-trained observers, (2) distinguish among call types using supervised classification and unsupervised clustering, and (3) compare the performance of different feature sets. Using 46 acoustic features, we used machine learning (i.e., support vector machines, affinity propagation, and fuzzy c-means) to identify call types and assess their discreteness. We additionally used Uniform Manifold Approximation and Projection (UMAP) to visualize the separation of pulses using both extracted features and spectrogram representations. Supervised approaches showed low inter-observer reliability and poor classification accuracy, indicating that pulse types were not discrete. We propose an updated pulse classification approach that is highly reproducible across observers and exhibits strong classification accuracy using support vector machines. Although the low number of call types suggests long calls are fairly simple, the continuous gradation of sounds seems to greatly boost the complexity of this system. This work responds to calls for more quantitative research to define call types and quantify gradedness in animal vocal systems and highlights the need for a more comprehensive framework for studying vocal complexity vis-à-vis graded repertoires.
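A minimal sketch of the supervised classification and UMAP visualization described above, with fabricated placeholder data standing in for the 46 acoustic features and observer labels (the authors' feature sets and tuning are not reproduced):

```python
import numpy as np
import umap  # from the umap-learn package
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 46))      # stand-in for per-pulse acoustic features
labels = rng.integers(0, 4, size=500)      # stand-in for observer-assigned pulse types

X = StandardScaler().fit_transform(features)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, stratify=labels, random_state=0)

clf = SVC(kernel="rbf").fit(X_tr, y_tr)    # supervised pulse-type classifier
print("held-out accuracy:", clf.score(X_te, y_te))

embedding = umap.UMAP(n_components=2).fit_transform(X)  # 2-D view of pulse separation
```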


Vocalization, Animal , Animals , Vocalization, Animal/physiology , Male , Pongo pygmaeus/physiology , Reproducibility of Results , Machine Learning , Acoustics , Sound Spectrography , Borneo
6.
Physiol Behav ; 281: 114581, 2024 Jul 01.
Article En | MEDLINE | ID: mdl-38734358

Bird song is a crucial feature for mate choice and reproduction. Song can potentially communicate information related to the quality of the mate, through song complexity, structure or finer changes in syllable characteristics. It has been shown in zebra finches that those characteristics can be affected by various factors including motivation, hormone levels or extreme temperature. However, although the literature on zebra finch song is substantial, some factors have been neglected. In this paper, we recorded male zebra finches in two breeding contexts (before and after pairing) and in two ambient temperature conditions (stable and variable) to see how those factors could influence song production. We found strong differences between the two breeding contexts: compared to their song before pairing, males that were paired had lower song rate, syllable consistency, frequency and entropy, while surprisingly the amplitude of their syllables increased. Temperature variability had an impact on the extent of these differences, but did not directly affect the song parameters that we measured. Our results describe for the first time how breeding status and temperature variability can affect zebra finch song, and give some new insights into the subtleties of the acoustic communication of this model species.
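Of the syllable measures listed above, entropy is the least self-explanatory; one common operationalization in birdsong work is Wiener (spectral) entropy, sketched here via librosa's spectral flatness on a synthetic tonal stand-in for a syllable (this is an assumption about the measure, not the authors' code):

```python
import numpy as np
import librosa

sr = 44100
t = np.arange(0, 0.2, 1 / sr)
syllable = np.sin(2 * np.pi * 3000 * t)  # stand-in for a tonal syllable

flatness = librosa.feature.spectral_flatness(y=syllable)  # 0 (tonal) .. 1 (noisy)
wiener_entropy = np.log(flatness + 1e-12).mean()          # log-flatness, <= 0
print(wiener_entropy)  # strongly negative for a pure tone
```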


Finches , Sexual Behavior, Animal , Temperature , Vocalization, Animal , Animals , Male , Finches/physiology , Vocalization, Animal/physiology , Sexual Behavior, Animal/physiology , Sound Spectrography , Female
7.
Nat Commun ; 15(1): 3617, 2024 May 07.
Article En | MEDLINE | ID: mdl-38714699

Sperm whales (Physeter macrocephalus) are highly social mammals that communicate using sequences of clicks called codas. While a subset of codas have been shown to encode information about caller identity, almost everything else about the sperm whale communication system, including its structure and information-carrying capacity, remains unknown. We show that codas exhibit contextual and combinatorial structure. First, we report previously undescribed features of codas that are sensitive to the conversational context in which they occur, and systematically controlled and imitated across whales. We call these rubato and ornamentation. Second, we show that codas form a combinatorial coding system in which rubato and ornamentation combine with two context-independent features we call rhythm and tempo to produce a large inventory of distinguishable codas. Sperm whale vocalisations are more expressive and structured than previously believed, and built from a repertoire comprising nearly an order of magnitude more distinguishable codas. These results show that context-sensitive and combinatorial vocalisation can appear in organisms with divergent evolutionary lineages and vocal apparatus.
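A hedged sketch of the two context-independent coda features named above, computed from click onset times within one coda; taking "tempo" as overall coda duration and "rhythm" as the duration-normalized inter-click intervals is an interpretation of the paper's description, and the click times are fabricated:

```python
import numpy as np

clicks = np.array([0.00, 0.21, 0.42, 0.63, 0.75])  # fabricated click onsets (s)

icis = np.diff(clicks)            # inter-click intervals
tempo = clicks[-1] - clicks[0]    # total coda duration (s)
rhythm = icis / tempo             # normalized interval pattern, sums to 1
print(tempo, rhythm)
```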


Sperm Whale , Vocalization, Animal , Animals , Vocalization, Animal/physiology , Sperm Whale/physiology , Sperm Whale/anatomy & histology , Male , Female , Sound Spectrography
8.
Stud Health Technol Inform ; 314: 151-152, 2024 May 23.
Article En | MEDLINE | ID: mdl-38785022

This study proposes an innovative application of the Goertzel Algorithm (GA) for the processing of vocal signals in dysphonia evaluation. Compared to the Fast Fourier Transform (FFT), the gold-standard analysis technique in this context, the GA demonstrates higher efficiency in terms of processing time and memory usage, and it also shows improved discrimination between healthy and pathological conditions. This suggests that GA-based approaches could enhance the reliability and efficiency of vocal signal analysis, supporting physicians in dysphonia research and clinical monitoring.
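The efficiency claim is easy to see in code: the Goertzel recursion evaluates a single DFT bin in O(n) time with two state variables, so probing a handful of clinically relevant frequencies avoids computing a full FFT. A minimal sketch (not the study's implementation):

```python
import math

def goertzel_power(samples, fs, f_target):
    """Power of `samples` at frequency `f_target` (Hz) via the Goertzel recursion."""
    n = len(samples)
    k = round(n * f_target / fs)           # nearest DFT bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2   # second-order recursion
        s_prev2, s_prev = s_prev, s
    return s_prev2**2 + s_prev**2 - coeff * s_prev * s_prev2

tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]
print(goertzel_power(tone, 8000, 440))     # large, vs. near zero at e.g. 1000 Hz
```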


Algorithms , Dysphonia , Humans , Dysphonia/diagnosis , Signal Processing, Computer-Assisted , Sound Spectrography/methods , Reproducibility of Results , Fourier Analysis , Female , Male
9.
J Acoust Soc Am ; 155(4): 2803-2816, 2024 Apr 01.
Article En | MEDLINE | ID: mdl-38662608

Urban expansion has increased pollution, including both physical (e.g., exhaust, litter) and sensory (e.g., anthropogenic noise) components. Urban avian species tend to increase the frequency and/or amplitude of songs to reduce masking by low-frequency noise. Nevertheless, song propagation to the receiver can also be constrained by the environment. We know relatively little about how this propagation may be altered across species that (1) vary in song complexity and (2) inhabit areas along an urbanization gradient. We investigated differences in song amplitude, attenuation, and active space, or the maximum distance a receiver can detect a signal, in two human-commensal species: the house sparrow (Passer domesticus) and house finch (Haemorhous mexicanus). We described urbanization both discretely and quantitatively to investigate the habitat characteristics most responsible for propagation changes. We found mixed support for our hypothesis of urban-specific degradation of songs. Urban songs propagated with higher amplitude; however, song fidelity was species-specific, and active space was reduced for urban house finch songs. Taken together, our results suggest that urban environments may constrain the propagation of vocal signals in species-specific manners. Ultimately, this has implications for the ability of urban birds to communicate with potential mates or kin.
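Active space, defined above as the maximum detection distance, can be approximated from source level, background noise, and the receiver's detection threshold under a simple spherical-spreading assumption. A back-of-envelope sketch with made-up numbers (the study's propagation measurements were empirical, not modeled this way):

```python
def active_space(source_db, noise_db, crit_ratio_db, ref_m=1.0):
    """Max distance (m) at which SL - 20*log10(r/ref) still exceeds noise + CR."""
    excess = source_db - (noise_db + crit_ratio_db)  # dB available for spreading loss
    return ref_m * 10 ** (excess / 20)

print(active_space(source_db=95, noise_db=45, crit_ratio_db=20))  # ~31.6 m
```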


Finches , Species Specificity , Urbanization , Vocalization, Animal , Animals , Vocalization, Animal/physiology , Finches/physiology , Sparrows/physiology , Noise , Sound Spectrography , Ecosystem , Humans , Perceptual Masking/physiology , Male
10.
PLoS One ; 19(4): e0299250, 2024.
Article En | MEDLINE | ID: mdl-38635752

Passive acoustic monitoring has improved our understanding of vocalizing organisms in remote habitats and during all weather conditions. Many vocally active species are highly mobile, and their populations overlap. However, distinct vocalizations allow the tracking and discrimination of individuals or populations. Using signature whistles, the individually distinct calls of bottlenose dolphins, we calculated a minimum abundance of individuals, characterized and compared signature whistles from five locations, and determined reoccurrences of individuals throughout the Mid-Atlantic Bight and Chesapeake Bay, USA. We identified 1,888 signature whistles; their duration, number of extrema, and start, end, and minimum frequencies varied significantly by site. All of these characteristics were deemed important for determining the site from which a whistle originated. Given the distinct whistle characteristics and the lack of spatial mixing of the dolphins detected at the Offshore site, we suspect that these dolphins belong to a different population than those at the Coastal and Bay sites. Signature whistles were also found to be shorter when ambient sound levels were higher. Using only the passively recorded vocalizations of this marine top predator, we obtained information about its population and how it is affected by ambient sound levels, which will increase as offshore wind energy is developed. In this rapidly developing area, these calls offer critical management insights for this protected species.
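A hypothetical sketch of the five whistle-contour measurements reported above, given a traced contour of peak frequencies per frame (the contour values and frame step are fabricated):

```python
import numpy as np

def whistle_features(contour, hop_s):
    """Duration, extrema count, and start/end/min frequencies of a whistle trace."""
    d = np.diff(contour)
    extrema = np.sum(np.diff(np.sign(d)) != 0)  # slope sign changes
    return {
        "duration_s": len(contour) * hop_s,
        "n_extrema": int(extrema),
        "start_hz": contour[0],
        "end_hz": contour[-1],
        "min_hz": contour.min(),
    }

contour = np.array([8200, 9400, 11000, 10100, 9000, 9800.0])  # fabricated trace (Hz)
print(whistle_features(contour, hop_s=0.01))
```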


Bottle-Nosed Dolphin , Vocalization, Animal , Animals , Sound Spectrography , Ecosystem
11.
J Acoust Soc Am ; 155(4): 2724-2727, 2024 Apr 01.
Article En | MEDLINE | ID: mdl-38656337

The auditory sensitivity of a small songbird, the red-cheeked cordon bleu, was measured using the standard methods of animal psychophysics. Hearing in cordon bleus is similar to that of other small passerines, with best hearing in the frequency region from 2 to 4 kHz and sensitivity declining at a rate of about 10 dB/octave below 2 kHz and about 35 dB/octave as frequency increases from 4 to 9 kHz. While critical ratios are similar to those of other songbirds, the long-term average power spectrum of cordon bleu song falls above the frequency region of best hearing in this species.


Acoustic Stimulation , Auditory Threshold , Hearing , Songbirds , Vocalization, Animal , Animals , Vocalization, Animal/physiology , Hearing/physiology , Songbirds/physiology , Male , Psychoacoustics , Sound Spectrography , Female
12.
J Acoust Soc Am ; 155(4): 2627-2635, 2024 Apr 01.
Article En | MEDLINE | ID: mdl-38629884

Passive acoustic monitoring (PAM) is an optimal method for detecting and monitoring cetaceans as they frequently produce sound while underwater. Cue counting, counting acoustic cues of deep-diving cetaceans instead of animals, is an alternative method for density estimation, but requires an average cue production rate to convert cue density to animal density. Limited information about click rates exists for sperm whales in the central North Pacific Ocean. In the absence of acoustic tag data, we used towed hydrophone array data to calculate the first sperm whale click rates from this region and examined their variability based on click type, location, distance of whales from the array, and group size estimated by visual observers. Our findings show click type to be the most important variable, with groups that include codas yielding the highest click rates. We also found a positive relationship between group size and click detection rates that may be useful for acoustic predictions of group size in future studies. Echolocation clicks detected using PAM methods are often the only indicator of deep-diving cetacean presence. Understanding the factors affecting their click rates provides important information for acoustic density estimation.
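The conversion at the heart of cue counting is a single division; the sketch below uses fabricated numbers and, for brevity, ignores the detection-probability correction a real density estimate would include:

```python
cue_density = 1200.0    # detected clicks per km^2 per hour (hypothetical)
cue_rate = 7200.0       # average clicks per animal per hour (hypothetical)

animal_density = cue_density / cue_rate  # animals per km^2
print(animal_density)   # 0.1666...
```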


Echolocation , Sperm Whale , Animals , Vocalization, Animal , Acoustics , Whales , Sound Spectrography
13.
Sci Rep ; 14(1): 6062, 2024 03 13.
Article En | MEDLINE | ID: mdl-38480760

With the large increase in human marine activity, our seas have become populated with vessels that can be overheard from distances of up to 20 km. Prior investigations showed that such a dense presence of vessels impacts the behaviour of marine animals, and in particular dolphins. While previous explorations searched for linear changes in the features of dolphin whistles, in this work we examine non-linear responses of bottlenose dolphins (Tursiops truncatus) to the presence of vessels. We explored the response of dolphins to vessels by continuously recording acoustic data using two long-term acoustic recorders deployed near a shipping lane and a dolphin habitat in Eilat, Israel. Using deep learning methods, we detected 50,000 whistles, which were clustered to associate whistle traces and to characterize their features, discriminating dolphin vocalizations by both structure and quantity. Using a non-linear classifier, the whistles were categorized into two classes representing the presence or absence of a nearby vessel. Although our database shows no linear observable change in whistle features, we obtained true positive and true negative rates exceeding 90% accuracy on separate, left-out test sets. We argue that this success in classification serves as a statistical proof of a non-linear response of dolphins to the presence of vessels.
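As a stand-in for the (unspecified) non-linear classifier, the sketch below trains a random forest on fabricated per-whistle feature vectors and reports the true-positive and true-negative rates quoted above; none of this reproduces the study's actual model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))             # stand-in whistle feature vectors
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # stand-in vessel-presence labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print("TPR:", tp / (tp + fn), "TNR:", tn / (tn + fp))  # the paper reports > 0.9
```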


Bottle-Nosed Dolphin , Vocalization, Animal , Animals , Humans , Vocalization, Animal/physiology , Bottle-Nosed Dolphin/physiology , Acoustics , Oceans and Seas , Ships , Sound Spectrography
14.
J Acoust Soc Am ; 155(3): 2050-2064, 2024 Mar 01.
Article En | MEDLINE | ID: mdl-38477612

The study of humpback whale song using passive acoustic monitoring devices requires bioacousticians to manually review hours of audio recordings to annotate the signals. To vastly reduce manual annotation time through automation, a machine learning model was developed. Convolutional neural networks have made major advances in the previous decade, leading to a wide range of applications, including the detection of frequency-modulated vocalizations by cetaceans. A large dataset of over 60,000 audio segments of 4 s length was collected from the North Atlantic and used to fine-tune an existing model for humpback whale song detection in the North Pacific [see Allen, Harvey, Harrell, Jansen, Merkens, Wall, Cattiau, and Oleson (2021). Front. Mar. Sci. 8, 607321]. Furthermore, different data augmentation techniques (time-shift, noise augmentation, and masking) were used to artificially increase the variability within the training set. Retraining and augmentation yield F-score values of 0.88 on a context-window basis and 0.89 on an hourly basis, with false positive rates of 0.05 on a context-window basis and 0.01 on an hourly basis. If necessary, usage and retraining of the existing model is made convenient by a framework (AcoDet, acoustic detector) built during this project. Combining the tools provided by this framework could save researchers hours of manual annotation time and, thus, accelerate their research.
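The three augmentation techniques named above are standard; the sketch below shows one plausible waveform-level implementation for a 4 s mono segment `y` at sample rate `sr` (parameter values are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def time_shift(y, max_frac=0.2):
    """Circularly shift the segment by up to max_frac of its length."""
    k = int(max_frac * len(y))
    return np.roll(y, rng.integers(-k, k))

def add_noise(y, snr_db=15.0):
    """Add Gaussian noise at a target signal-to-noise ratio."""
    noise = rng.standard_normal(len(y))
    scale = np.sqrt((y**2).mean() / (10 ** (snr_db / 10) * (noise**2).mean()))
    return y + scale * noise

def mask(y, sr, max_s=0.5):
    """Zero out a random span of up to max_s seconds."""
    span = int(max_s * sr)
    start = rng.integers(0, len(y) - span)
    out = y.copy()
    out[start:start + span] = 0.0
    return out
```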


Humpback Whale , Animals , Vocalization, Animal , Sound Spectrography , Time Factors , Seasons , Acoustics
15.
J Acoust Soc Am ; 155(2): 1253-1263, 2024 02 01.
Article En | MEDLINE | ID: mdl-38341748

The reassigned spectrogram (RS) has emerged as the most accurate way to infer vocal tract resonances from the acoustic signal [Shadle, Nam, and Whalen (2016). "Comparing measurement errors for formants in synthetic and natural vowels," J. Acoust. Soc. Am. 139(2), 713-727]. To date, validating its accuracy has depended on formant synthesis for ground truth values of these resonances. Synthesis is easily controlled, but it has many intrinsic assumptions that do not necessarily accurately realize the acoustics in the way that physical resonances would. Here, we show that physical models of the vocal tract with derivable resonance values allow a separate approach to the ground truth, with a different range of limitations. Our three-dimensional printed vocal tract models were excited by white noise, allowing an accurate determination of the resonance frequencies. Then, sources with a range of fundamental frequencies were implemented, allowing a direct assessment of whether RS avoided the systematic bias towards the nearest strong harmonic to which other analysis techniques are prone. RS was indeed accurate at fundamental frequencies up to 300 Hz; above that, accuracy was somewhat reduced. Future directions include testing mechanical models with the dimensions of children's vocal tracts and making RS more broadly useful by automating the detection of resonances.
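For readers who want to try reassignment, librosa ships an implementation; the sketch below applies it to a synthetic two-component signal and does not reproduce the paper's analysis settings:

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(0, 0.5, 1 / sr)
y = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 720 * t)

# Each STFT bin is reassigned to its instantaneous frequency/time estimate,
# sharpening energy around the true spectral components.
freqs, times, mags = librosa.reassigned_spectrogram(y=y, sr=sr, n_fft=1024)
strong = mags > np.nanmedian(mags)  # keep the more energetic bins
print(np.nanpercentile(freqs[strong], [5, 50, 95]))  # energy near 120 and 720 Hz
```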


Voice , Child , Humans , Acoustics , Speech Acoustics , Vibration , Sound Spectrography
16.
J Acoust Soc Am ; 155(2): 1437-1450, 2024 02 01.
Article En | MEDLINE | ID: mdl-38364047

Odontocetes produce clicks for echolocation and communication. Most odontocetes are thought to produce either broadband (BB) or narrowband high-frequency (NBHF) clicks. Here, we show that the click repertoire of Hector's dolphin (Cephalorhynchus hectori) comprises highly stereotypical NBHF clicks and far more variable broadband clicks, with some that are intermediate between these two categories. Both NBHF and broadband clicks were made in trains, buzzes, and burst-pulses. Most clicks within click trains were typical NBHF clicks, which had a median centroid frequency of 130.3 kHz (median -10 dB bandwidth = 29.8 kHz). Some, however, while having only marginally lower centroid frequency (median = 123.8 kHz), had significant energy below 100 kHz and approximately double the bandwidth (median -10 dB bandwidth = 69.8 kHz); we refer to these as broadband. Broadband clicks in buzzes and burst-pulses had lower median centroid frequencies (120.7 and 121.8 kHz, respectively) compared to NBHF buzzes and burst-pulses (129.5 and 130.3 kHz, respectively). Source levels of NBHF clicks, estimated by using a drone to measure ranges from a single hydrophone and by computing time-of-arrival differences at a vertical hydrophone array, ranged from 116 to 171 dB re 1 µPa at 1 m, whereas source levels of broadband clicks, obtained from array data only, ranged from 138 to 184 dB re 1 µPa at 1 m. Our findings challenge the grouping of toothed whales as either NBHF or broadband species.
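The two descriptors used to separate NBHF from broadband clicks can be computed from a windowed click waveform; a hedged sketch with a synthetic click (the authors' exact windowing is not reproduced):

```python
import numpy as np

def click_descriptors(x, fs):
    """Spectral centroid frequency (Hz) and -10 dB bandwidth (Hz) of a click."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    centroid = (freqs * spec).sum() / spec.sum()
    within = freqs[spec >= spec.max() / 10]  # bins within -10 dB of the peak
    return centroid, within.max() - within.min()

fs = 500_000
t = np.arange(-5e-4, 5e-4, 1 / fs)
click = np.cos(2 * np.pi * 130_000 * t) * np.exp(-(t / 1e-4) ** 2)  # NBHF-like
print(click_descriptors(click, fs))  # centroid near 130 kHz
```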


Dolphins , Echolocation , Animals , Acoustics , Vocalization, Animal , Sound Spectrography
17.
J Acoust Soc Am ; 155(1): 274-283, 2024 01 01.
Article En | MEDLINE | ID: mdl-38215217

Echolocating bats and dolphins use biosonar to determine target range, but differences in range discrimination thresholds have been reported for the two species. Whether these differences represent a true difference in their sensory system capability is unknown. Here, the dolphin's range discrimination threshold as a function of absolute range and echo phase was investigated. Using phantom echoes, the dolphins were trained to echo-inspect two simulated targets and indicate the closer target by pressing a paddle. One target was presented at a time, requiring the dolphin to hold the initial range in memory while comparing it to the second target. Range was simulated by manipulating echo delay while the received echo levels, relative to the dolphins' clicks, were held constant. Range discrimination thresholds were determined at seven different ranges from 1.75 to 20 m. In contrast to bats, range discrimination thresholds increased from 4 to 75 cm across the distances tested. To investigate the acoustic features used more directly, discrimination thresholds were determined when the echo was given a random phase shift (±180°). Results for the constant-phase versus the random-phase echo were quantitatively similar, suggesting that dolphins used the envelope of the echo waveform to determine the difference in range.
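The phantom-echo manipulation reduces to the two-way travel-time relation between delay and simulated range; a minimal sketch with a nominal seawater sound speed:

```python
SOUND_SPEED = 1500.0  # m/s, nominal value for seawater

def delay_for_range(r_m):
    return 2 * r_m / SOUND_SPEED  # s, two-way echo delay for a target at r_m

def range_for_delay(t_s):
    return SOUND_SPEED * t_s / 2  # m, simulated range for a given echo delay

print(delay_for_range(20.0))  # ~0.0267 s for a 20 m phantom target
```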


Bottle-Nosed Dolphin , Chiroptera , Echolocation , Animals , Acoustics , Sound Spectrography
18.
J Acoust Soc Am ; 155(1): 396-404, 2024 01 01.
Article En | MEDLINE | ID: mdl-38240666

When they are exposed to loud fatiguing sounds in the oceans, marine mammals are susceptible to hearing damage in the form of temporary hearing threshold shifts (TTSs) or permanent hearing threshold shifts. We compared the level-dependent and frequency-dependent susceptibility to TTSs in harbor seals and harbor porpoises, species with different hearing sensitivities in the low- and high-frequency regions. Both species were exposed to 100% duty cycle one-sixth-octave noise bands at frequencies that covered their entire hearing range. In the case of the 6.5 kHz exposure for the harbor seals, a pure tone (continuous wave) was used. TTS was quantified as a function of sound pressure level (SPL) half an octave above the center frequency of the fatiguing sound. The species have different audiograms, but their frequency-specific susceptibility to TTS was more similar. The hearing frequency range in which both species were most susceptible to TTS was 22.5-50 kHz. Furthermore, the frequency ranges were characterized by having similar critical levels (defined as the SPL of the fatiguing sound above which the magnitude of TTS induced as a function of SPL increases more strongly). This standardized between-species comparison indicates that the audiogram is not a good predictor of frequency-dependent susceptibility to TTS.


Phoca , Phocoena , Animals , Acoustic Stimulation , Auditory Fatigue , Sound Spectrography , Recovery of Function , Hearing , Auditory Threshold
19.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4234-4245, 2024 Jun.
Article En | MEDLINE | ID: mdl-38241115

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Questions naturally arise as to whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines to judge it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules to enhance the capacity of the prior from text and reduce the complexity of the posterior from speech, including phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that our proposed NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test giving p >> 0.05, which demonstrates no statistically significant difference from human recordings for the first time.
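The significance criterion described here is a paired Wilcoxon signed-rank test on per-sentence ratings; a sketch with fabricated scores using scipy's `wilcoxon`:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
human = rng.normal(4.5, 0.3, 50)          # hypothetical per-sentence ratings
synth = human + rng.normal(0.0, 0.1, 50)  # a near-identical synthetic system

stat, p = wilcoxon(synth, human)
print(p > 0.05)  # True -> no statistically significant difference detected
```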


Algorithms , Humans , Signal Processing, Computer-Assisted , Speech/physiology , Natural Language Processing , Databases, Factual , Sound Spectrography/methods
20.
Behav Res Methods ; 56(3): 2114-2134, 2024 Mar.
Article En | MEDLINE | ID: mdl-37253958

The use of voice recordings in both research and industry practice has increased dramatically in recent years-from diagnosing a COVID-19 infection based on patients' self-recorded voice samples to predicting customer emotions during a service center call. Crowdsourced audio data collection in participants' natural environment using their own recording device has opened up new avenues for researchers and practitioners to conduct research at scale across a broad range of disciplines. The current research examines whether fundamental properties of the human voice are reliably and validly captured through common consumer-grade audio-recording devices in current medical, behavioral science, business, and computer science research. Specifically, this work provides evidence from a tightly controlled laboratory experiment analyzing 1800 voice samples and subsequent simulations that recording devices with high proximity to a speaker (such as a headset or a lavalier microphone) lead to inflated measures of amplitude compared to a benchmark studio-quality microphone while recording devices with lower proximity to a speaker (such as a laptop or a smartphone in front of the speaker) systematically reduce measures of amplitude and can lead to biased measures of the speaker's true fundamental frequency. We further demonstrate through simulation studies that these differences can lead to biased and ultimately invalid conclusions in, for example, an emotion detection task. Finally, we outline a set of recording guidelines to ensure reliable and valid voice recordings and offer initial evidence for a machine-learning approach to bias correction in the case of distorted speech signals.
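A practical way to probe the fundamental-frequency bias reported here is to compare F0 estimates of the same utterance recorded on different devices; a sketch using librosa's pyin tracker, with placeholder file names:

```python
import librosa
import numpy as np

def median_f0(path):
    """Median F0 (Hz) over voiced frames of a recording."""
    y, sr = librosa.load(path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    return np.nanmedian(f0[voiced])

# Placeholder paths: the same utterance captured by two devices.
print(median_f0("headset.wav"), median_f0("laptop.wav"))
```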


Voice Quality , Voice , Humans , Sound Spectrography , Smartphone , Microcomputers
...