Results 1 - 4 of 4
1.
Entropy (Basel) ; 25(2), 2023 Feb 18.
Article in English | MEDLINE | ID: mdl-36832741

ABSTRACT

Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while leaving the rest unchanged. Research in neural VC has accomplished considerable breakthroughs, with the capacity to falsify a voice identity using a small amount of data and a highly realistic rendering. This paper goes beyond voice identity manipulation and presents an original neural architecture that allows the manipulation of voice attributes (e.g., gender and age). The proposed architecture is inspired by the fader network, transferring the same ideas to voice manipulation. The information conveyed by the speech signal is disentangled into interpretable voice attributes by minimizing an adversarial loss, which makes the encoded information mutually independent while preserving the capacity to generate a speech signal from the disentangled codes. During inference for voice conversion, the disentangled voice attributes can be manipulated and the speech signal can be generated accordingly. For experimental evaluation, the proposed method is applied to the task of voice gender conversion using the freely available VCTK dataset. Quantitative measurements of mutual information between the variables of speaker identity and speaker gender show that the proposed architecture can learn a gender-independent representation of speakers. Additional measurements of speaker recognition indicate that speaker identity can be recognized accurately from the gender-independent representation. Finally, a subjective experiment conducted on the task of voice gender manipulation shows that the proposed architecture can convert voice gender with very high efficiency and good naturalness.
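The mutual information measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes both variables are discrete labels (for example, a quantized speaker code and a gender label) and uses a simple plug-in estimator. The abstract does not state which estimator the authors used, so the function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples.

    Illustrative only: the paper's actual estimator is not given in the abstract.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical toy data: a code that fully reveals gender vs. one that does not.
    gender = ["f", "f", "m", "m"]
    leaky_code = [0, 0, 1, 1]
    independent_code = [0, 1, 0, 1]
    print(mutual_information(leaky_code, gender))
    print(mutual_information(independent_code, gender))
```

A value near zero is what a gender-independent representation would be expected to show under this toy measure, while a code that fully determines a balanced binary label yields one bit.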

2.
R Soc Open Sci ; 11(1): 231713, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38204786

ABSTRACT

Vocal communication is widespread in animals, with vocal repertoires of varying complexity. The social complexity hypothesis predicts that species may need high vocal complexity to deal with complex social organization (e.g. a variety of different inter-individual relations). We quantified the vocal complexity of two geographically distant captive colonies of rooks, a corvid species with complex social organization and cognitive performance, but understudied vocal abilities. We quantified the diversity and gradation of their repertoire, as well as the inter-individual similarity at the vocal unit level. We found that males produced call units with lower diversity and gradation than females, while song units did not differ between sexes. Surprisingly, while females produced highly similar call repertoires, even between colonies, each individual male produced a call repertoire almost completely different from that of any other individual. These findings question the way male rooks communicate with their social partners. We suggest that each male may actively seek to remain vocally distinct, which could be an asset in their frequently changing social environment. We conclude that inter-individual similarity, an understudied aspect of vocal repertoires, should also be considered as a measure of vocal complexity.

3.
Front Artif Intell ; 6: 1142997, 2023.
Article in English | MEDLINE | ID: mdl-37377638

ABSTRACT

Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as being pervasive: while speaking, it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of a speaker whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two input modalities (mel spectrogram and text semantics). The second goal is to condition the source speaker's predicted gestures on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer of speakers unseen during training without re-training the model. Our system consists of two main components: (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data (mel spectrogram, pose, and text) and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of the input modalities (text and mel spectrogram) of a source speaker, conditioned on the speaker style embedding. We show that our model is able to synthesize gestures of a source speaker given the two input modalities and to transfer the knowledge of target speaker style variability learned by the speaker style encoder to the gesture generation task in a zero-shot setup, indicating that the model has learned a high-quality speaker representation. We conduct objective and subjective evaluations to validate our approach and compare it with baselines.

4.
Environ Sci Pollut Res Int ; 30(32): 78959-78972, 2023 Jul.
Article in English | MEDLINE | ID: mdl-37278892

ABSTRACT

Contaminated sites pose a serious threat to the ecological environment and human health. Because the pollution data of some contaminated sites contain multiple peaks and their distribution shows strong spatial heterogeneity and skewness, the accuracy of spatial interpolation prediction is low. This study proposes a method for investigating highly skewed contaminated sites, which uses Thiessen polygons coupled with geostatistics and deterministic interpolation to optimize the spatial prediction and sampling strategy of sites. An industrial site in Luohe is used as an example to validate the proposed method. The results indicate that using 40 × 40 m as the minimum initial sampling unit can obtain data that are representative of the regional pollution situation. Evaluation indexes reveal that the ordinary kriging (OK) method for interpolation prediction accuracy and the radial basis function_inverse distance weighted (RBF_IMQ) method for pollution scope prediction provide the best results, which can effectively improve the spatial prediction accuracy of pollution in the study area. Each accuracy indicator is enhanced by 20-70% after supplementing 11 sampling points in the suspect region, and the identification of the pollution scope approaches 95%. This method offers a novel approach for investigating highly skewed contaminated sites, which can optimize the spatial prediction accuracy of pollution and reduce economic costs.
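As a rough illustration of the two ideas named in the abstract, the sketch below shows a Thiessen-style assignment (each location takes the value of its nearest sample) and plain inverse distance weighting as a stand-in for a deterministic interpolator. It is not the study's ordinary kriging or RBF_IMQ procedure. The function names, the `power` parameter, and the sample values are all hypothetical; only the 40 m spacing echoes the abstract.

```python
import math


def thiessen_value(samples, target):
    """Return the value of the sample nearest to target (Thiessen polygon rule)."""
    nearest = min(
        samples,
        key=lambda s: math.hypot(s[0][0] - target[0], s[0][1] - target[1]),
    )
    return nearest[1]


def idw_predict(samples, target, power=2.0):
    """Inverse distance weighted prediction at target.

    samples: list of ((x, y), value) pairs; power is an assumed default.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a sample point
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Hypothetical concentrations on a 40 m grid (units unspecified).
    samples = [((0.0, 0.0), 10.0), ((40.0, 0.0), 30.0), ((0.0, 40.0), 50.0)]
    print(thiessen_value(samples, (10.0, 5.0)))
    print(idw_predict(samples, (20.0, 0.0)))
```

The nearest-sample rule gives a piecewise-constant surface, which is one reason a skewed site with isolated peaks may need supplementary sampling points before a smoother interpolator is applied.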


Subjects
Soil Pollutants, Humans, Soil Pollutants/analysis, Environmental Monitoring/methods, Environmental Pollution, Environment, Soil, Spatial Analysis