ABSTRACT
Federated learning (FL) has emerged as a powerful machine learning technique that enables training models from decentralized data sources. However, the decentralized nature of FL makes it vulnerable to adversarial attacks. In this survey, we provide a comprehensive overview of the impact of malicious attacks on FL, covering aspects such as attack budget, visibility, and generalizability. Previous surveys have primarily focused on the various types of attacks and defenses but have not considered the impact of these attacks in terms of their budget, visibility, and generalizability. This survey aims to fill that gap by identifying FL attacks that combine low budget, low visibility, and high impact. Additionally, we address recent advances in adversarial defenses for FL and highlight the challenges in securing FL. The contribution of this survey is threefold: first, it provides a comprehensive and up-to-date overview of the current state of FL attacks and defenses; second, it highlights the critical importance of considering the impact, budget, and visibility of FL attacks; finally, it presents ten case studies and potential future directions for improving the security and privacy of FL systems.
ABSTRACT
Generating multi-sentence descriptions for video is considered one of the most complex tasks in computer vision and natural language understanding due to the intricate nature of video-text data. With recent advances in deep learning, multi-sentence video description has made impressive progress. However, learning rich temporal context representations of visual sequences and modelling long-term dependencies in natural language descriptions remain challenging. Towards this goal, we propose an Attentive Atrous Pyramid network and Memory Incorporated Transformer (AAP-MIT) for multi-sentence video description. The proposed AAP-MIT learns an effective representation of the visual scene by distilling the most informative and discriminative spatio-temporal features of the video at multiple granularities and generates highly summarized descriptions. Specifically, AAP-MIT consists of three major components: i) a temporal pyramid network, which builds a temporal feature hierarchy at multiple scales by convolving local features along the temporal dimension, ii) a temporal correlation attention, which learns the relations among temporal video segments, and iii) a memory-incorporated transformer, which augments the language transformer with a new memory block to generate highly descriptive natural language sentences. Finally, extensive experiments on the ActivityNet Captions and YouCookII datasets demonstrate the substantial superiority of AAP-MIT over existing approaches.
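The abstract describes the three components only at a high level; the sketch below is one illustrative way such a pipeline could be wired together, not the authors' implementation. It assumes PyTorch, uses standard multi-head attention as a stand-in for the paper's temporal correlation attention, realises the "memory block" as learnable memory slots prepended to the decoder context, and picks hypothetical layer sizes.

```python
# Minimal sketch of AAP-MIT-style components (assumptions: PyTorch,
# hypothetical sizes, simplified attention and memory mechanisms).
import torch
import torch.nn as nn


class TemporalPyramid(nn.Module):
    """Multi-scale temporal hierarchy via dilated (atrous) 1-D convolutions
    over frame-level features; each dilation gives a different receptive field."""

    def __init__(self, dim: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )

    def forward(self, x):                 # x: (batch, time, dim)
        x = x.transpose(1, 2)             # -> (batch, dim, time) for Conv1d
        pyramid = [branch(x) for branch in self.branches]
        # stack scales: (batch, scales, dim, time) -> (batch, scales, time, dim)
        return torch.stack(pyramid, dim=1).transpose(2, 3)


class TemporalCorrelationAttention(nn.Module):
    """Relates temporal segments across scales; plain multi-head self-attention
    is used here as a stand-in for the paper's temporal correlation attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pyramid):           # pyramid: (batch, scales, time, dim)
        b, s, t, d = pyramid.shape
        seq = pyramid.reshape(b, s * t, d)        # flatten scales into one sequence
        fused, _ = self.attn(seq, seq, seq)
        return fused                               # (batch, scales*time, dim)


class MemoryIncorporatedDecoder(nn.Module):
    """Transformer decoder whose cross-attention context is prefixed with
    learnable memory slots -- one simple reading of a 'memory block'."""

    def __init__(self, dim: int, vocab: int, mem_slots: int = 8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(mem_slots, dim))
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, visual):    # tokens: (batch, len), visual: (batch, seq, dim)
        mem = self.memory.unsqueeze(0).expand(visual.size(0), -1, -1)
        context = torch.cat([mem, visual], dim=1)  # prepend memory slots to visual context
        hidden = self.decoder(self.embed(tokens), context)  # causal mask omitted for brevity
        return self.out(hidden)                    # per-token vocabulary logits
```

In this reading, the dilated convolution branches supply the multi-granularity temporal features, the attention module fuses segments across scales, and the decoder conditions sentence generation on both the fused visual sequence and the learned memory slots; the feature dimension must be divisible by the number of attention heads.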