ABSTRACT
In this paper, a series of CL-20-based explosive formulations containing Al/PTFE reactive materials is designed, and their explosion behavior is studied with a self-designed closed explosion test device. The quasi-static pressure (QSP) and peak temperature of the explosive reaction are measured for different mass percentages of Al/PTFE and different charge structures. The composition and morphology of the solid residue after the explosion are analyzed, demonstrating the feasibility of using Al/PTFE in explosives and providing theoretical support for the design of aluminized explosives in this system. The results show that a high content of Al/PTFE reactive material can be successfully detonated by CL-20. Using CL-20 as the central explosive column enables pure Al/PTFE to react, but reduces the QSP by about 25%. The 75/25 mass ratio gives the highest QSP, reaching 0.289, 0.310, 0.270 and 0.218 MPa. The three samples in G2# exhibit the highest equilibrium temperatures, with G2#A, G2#B and G2#C reaching 868.2 °C, 942.0 °C and 626.2 °C, respectively. Regardless of the charge structure, the equilibrium temperatures after explosion at Al/PTFE ratios of 75/25 and 70/30 are higher than at 60/40; at a 60/40 ratio, the equilibrium temperature decreases by nearly 20%. XRD reveals that the solid residue mainly comprises Al, α-Al2O3 and γ-Al2O3. No carbon is found in the solid products, indicating that carbon exists mainly in the gaseous state after the explosion.
ABSTRACT
Today's VQA models still tend to capture superficial linguistic correlations in the training set and fail to generalize to test sets with different QA distributions. To reduce these language biases, recent VQA works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominant performance on diagnostic benchmarks for out-of-distribution testing. However, due to their complex model design, ensemble-based methods are unable to equip themselves with two indispensable characteristics of an ideal VQA model: 1) Visual-explainable: the model should rely on the right visual regions when making decisions. 2) Question-sensitive: the model should be sensitive to linguistic variations in questions. To this end, we propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy. After training with CSST, VQA models are forced to focus on all critical objects and words, which significantly improves both their visual-explainable and question-sensitive abilities. Specifically, CSST consists of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST). CSS generates counterfactual samples by carefully masking critical objects in images or words in questions and assigning pseudo ground-truth answers. CST not only trains the VQA models with both complementary samples to predict their respective ground-truth answers, but also urges the VQA models to further distinguish the original samples from their superficially similar counterfactual counterparts. To facilitate CST training, we propose two variants of supervised contrastive loss for VQA and design an effective positive and negative sample selection mechanism based on CSS. Extensive experiments show the effectiveness of CSST. In particular, by building on top of the LMH+SAR model (Clark et al., 2019; Si et al., 2021), we achieve record-breaking performance on all out-of-distribution benchmarks (e.g., VQA-CP v2, VQA-CP v1, and GQA-OOD).
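The abstract only names the two CSST components, so the following is a minimal, hypothetical PyTorch sketch rather than the released code: the function names, the attribution scores, and the exact loss form are assumptions. It illustrates how CSS-style masking with a pseudo ground truth and one supervised contrastive variant could be wired up.

```python
# Hypothetical sketch of CSS masking and a supervised contrastive term (not the authors' code).
import torch
import torch.nn.functional as F

def synthesize_counterfactual(features, scores, answer_dist, k=3):
    """Mask the top-k most critical objects (or words) and build a pseudo
    ground-truth in which the original answer is suppressed.

    features:    (N, D) object/word features of one sample
    scores:      (N,)   assumed attribution score of each element w.r.t. the answer
    answer_dist: (A,)   original soft ground-truth answer distribution
    """
    topk = scores.topk(k).indices                 # most critical elements
    mask = torch.ones(features.size(0), 1)
    mask[topk] = 0.0                              # remove the critical evidence
    cf_features = features * mask                 # counterfactual input
    cf_answer = answer_dist.clone()
    cf_answer[answer_dist.argmax()] = 0.0         # original answer no longer valid
    cf_answer = cf_answer / cf_answer.sum().clamp(min=1e-6)
    return cf_features, cf_answer

def supervised_contrastive(anchor, positives, negatives, tau=0.1):
    """One possible form of the contrastive term: pull the anchor towards
    positives (original-like samples) and away from its counterfactuals."""
    a = F.normalize(anchor, dim=-1)               # (D,)
    pos = F.normalize(positives, dim=-1)          # (P, D)
    neg = F.normalize(negatives, dim=-1)          # (Q, D)
    logits = torch.cat([pos @ a, neg @ a]) / tau  # (P+Q,)
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    log_prob = logits - torch.logsumexp(logits, dim=0)
    return -(labels * log_prob).sum() / labels.sum()
```

In this reading, the counterfactual sample produced by `synthesize_counterfactual` would serve as a negative for its own original sample inside `supervised_contrastive`, which is one way to realize the "distinguish original from superficially similar counterfactual" objective described above.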
ABSTRACT
We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task since it requires not only the localization of objects but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Specifically, our framework exploits the reciprocal relation between the referent and context, i.e., either of them influences the estimation of the posterior distribution of the other, thereby greatly reducing the search space of the context. In addition to reciprocity, our framework considers the semantic information of the context, i.e., the referring expression can be reproduced from the estimated context. We also extend the model to the unsupervised setting, where no annotation of the referent is available. Extensive experiments on various benchmarks show consistent improvements over state-of-the-art methods in both supervised and unsupervised settings.
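As a rough illustration of the reciprocal referent-context estimation described above, here is a hypothetical PyTorch sketch; the module names, scoring functions, and dimensions are assumptions, not the paper's implementation. Each candidate referent induces an approximate posterior over context regions, and the attended context is then used to re-score that referent.

```python
# Hypothetical sketch of reciprocal referent-context scoring (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalContextSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.ctx_scorer = nn.Linear(3 * dim, 1)   # approximates q(context | referent, expression)
        self.ref_scorer = nn.Linear(3 * dim, 1)   # approximates p(referent | context, expression)

    def forward(self, regions, expr):
        """regions: (N, D) region features; expr: (D,) expression embedding."""
        N, D = regions.shape
        e = expr.expand(N, D)
        scores = []
        for i in range(N):                        # treat region i as the candidate referent
            r = regions[i].expand(N, D)
            # posterior over context regions given this candidate referent
            q_ctx = F.softmax(self.ctx_scorer(torch.cat([regions, r, e], -1)).squeeze(-1), dim=0)
            ctx = q_ctx @ regions                 # expected context feature
            # re-score the candidate referent conditioned on its estimated context
            scores.append(self.ref_scorer(torch.cat([regions[i], ctx, expr], -1)))
        return F.softmax(torch.stack(scores).squeeze(-1), dim=0)  # grounding distribution
```

The reduction of the context search space shows up here as a soft attention `q_ctx` over regions instead of an explicit enumeration of region subsets; the expression-reconstruction term mentioned in the abstract is omitted from this sketch.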
ABSTRACT
Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts, and 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, a novel two-branch deep neural network architecture is proposed, which comprises a very deep main network branch and a companion feature-fusion branch designed to fuse the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as an additional input to complement the image input. To address the second issue, we introduce a label-quantity prediction auxiliary task alongside the main label prediction task to explicitly estimate the optimal number of labels for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets, and the results show that our method significantly outperforms the state of the art.
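The following is a minimal, hypothetical sketch of the two ideas named in the abstract, assuming a generic backbone and arbitrary layer sizes; it is not the authors' architecture, and the noisy-tag input branch is omitted. It shows a main branch producing features at several scales, a companion fusion branch, and an auxiliary head that predicts how many labels to output.

```python
# Hypothetical sketch: multi-scale fusion + label-quantity prediction (sizes are assumptions).
import torch
import torch.nn as nn

class MultiScaleAnnotator(nn.Module):
    def __init__(self, num_labels, max_labels=10, dim=256):
        super().__init__()
        # stand-in for a very deep main branch yielding features at three scales
        self.main = nn.ModuleList([
            nn.Sequential(nn.LazyConv2d(dim, 3, stride=2, padding=1), nn.ReLU())
            for _ in range(3)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fusion = nn.Linear(3 * dim, dim)         # companion feature-fusion branch
        self.label_head = nn.Linear(dim, num_labels)  # main task: which labels
        self.count_head = nn.Linear(dim, max_labels)  # auxiliary task: how many labels

    def forward(self, x):
        feats = []
        for block in self.main:
            x = block(x)
            feats.append(self.pool(x).flatten(1))     # collect multi-scale features
        fused = torch.relu(self.fusion(torch.cat(feats, dim=1)))
        return self.label_head(fused), self.count_head(fused)

model = MultiScaleAnnotator(num_labels=1000)
label_logits, count_logits = model(torch.randn(2, 3, 224, 224))
k = int(count_logits[0].argmax()) + 1                 # predicted label quantity for sample 0
annotations = label_logits[0].topk(k).indices         # annotate sample 0 with its top-k labels
```

Under this reading, the auxiliary count head lets the annotator emit a per-image number of labels at inference time instead of a fixed top-k cutoff, which is the point of the second contribution described above.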