Objective
Celadon is not only a dazzling pearl among the cultural treasures of the Chinese nation but also a cultural messenger in exchanges between China and other countries. It carries rich historical and cultural connotations and demonstrates outstanding artistic value; its elegant forms and lustrous glaze make it an exemplary representative of traditional Chinese craft aesthetics. The production of celadon embodies the wisdom and creativity of ancient craftsmen and is an important carrier for the inheritance of fine traditional Chinese culture. In the context of cultural digitization, constructing a cross-modal knowledge graph of celadon is a key technology for promoting the protection and inheritance of celadon culture. In this process, matching the same entity across modalities, that is, aligning the features of equivalent entities in different modalities, is crucial. However, the inherent structural differences between cross-modal data make the alignment task challenging. Traditional methods that rely on manually annotated data can ensure alignment accuracy to some extent, but they suffer from low efficiency and high cost. In addition, coarse-grained annotations can hardly satisfy the fine-grained concept and entity recognition requirements of cross-modal knowledge graph construction. Vision-language pretraining (VLP) models can effectively capture cross-modal semantic associations by learning rich cross-modal representations from large-scale unlabeled image-text pairs. The strong cross-modal understanding ability of a VLP model can therefore provide precise semantic associations and fine-grained entity recognition for aligning entities of different modalities during graph construction. Accordingly, a cross-modal entity alignment method based on a VLP model, which maps multiple image features, is proposed to maximize the degree of matching between celadon images and texts.
Method
The proposed cross-modal entity alignment method, which maps multiple image features, initializes both the image encoder and the text encoder from a publicly available VLP model, and the encoder parameters remain frozen during training. The method consists of four parts. First, on the basis of the visual characteristics of celadon images, local features describing contour, texture, and color are extracted. Then, a gated multifusion unit is introduced to adaptively assign weights to the image features, and the extracted local image features are combined into a reliable fused feature. Furthermore, a multilayer fully connected mapper is designed to learn, through multiple layers of nonlinear transformations, a mapping from the fused feature to an appropriate intermediate representation space, which guides the text encoder to generate text features that match the image features more closely.
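To make these two components concrete, the following PyTorch-style sketch illustrates one plausible form of a gated multifusion unit and a multilayer fully connected mapper. The module names, layer sizes, and softmax gating are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class GatedMultiFusion(nn.Module):
    """Adaptively weights several local image features (e.g., contour,
    texture, and color descriptors) and fuses them into one vector.
    The softmax gating and layer sizes are assumptions."""

    def __init__(self, dim: int, num_features: int):
        super().__init__()
        self.gate = nn.Linear(dim * num_features, num_features)

    def forward(self, feats):
        # feats: list of (batch, dim) feature tensors from the frozen encoder
        stacked = torch.stack(feats, dim=1)                    # (batch, k, dim)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)    # (batch, dim)


class FullyConnectedMapper(nn.Module):
    """Maps the fused image feature into an intermediate representation
    used to condition the text side (hidden width is an assumption)."""

    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.net(fused)
```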
Finally, the model is trained and optimized with the information noise contrastive estimation (InfoNCE) loss: cosine similarities between cross-modal features are computed so that the similarity of positive sample pairs is maximized and that of negative sample pairs is minimized, thereby establishing the connection between image features and text features.
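A minimal sketch of this training objective is given below, assuming a symmetric InfoNCE formulation over a batch of matched image-text pairs; the function name and temperature value are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of matched image-text pairs."""
    img = F.normalize(img_feats, dim=-1)   # unit vectors, so dot product = cosine
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Diagonal entries are the positive pairs; all other entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```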
Result
The proposed method was compared with four recent benchmark methods: contrastive VLP in Chinese (CN-CLIP), context optimization (CoOp), conditional context optimization (CoCoOp), and mapping pictures to words (Pic2Word). The quantitative evaluation metrics are the recall rates R@1, R@5, and R@10, together with the mean recall (MR). The experiments were conducted on the ChinaWare dataset, on which all methods were trained, and a table comparing the recall metrics of all methods is provided. In terms of MR, the proposed method outperforms zero-shot CN-CLIP ViT-B/16 by 3.2% on the text-to-image alignment task and by 7.5% on the image-to-text task. CoOp focuses only on text features, and the proposed method outperforms it by 11.4% and 12.1% on the two tasks, respectively. CoCoOp additionally takes image features into account on the basis of CoOp, yet the proposed method still outperforms it by 8.4% and 9.5%, respectively. Pic2Word likewise relies on the original image feature alone and does not fully exploit other local image features, and the proposed method outperforms it by 5.8% and 5.6%, respectively.
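For reference, the sketch below shows one common way to compute R@K and mean recall for cross-modal retrieval from a query-candidate similarity matrix; the function and variable names are illustrative and not part of the paper.

```python
import torch


def recall_at_k(similarity: torch.Tensor, ks=(1, 5, 10)):
    """similarity[i, j]: score between query i and candidate j, assuming the
    ground-truth match of query i is candidate i (a common benchmark setup)."""
    ranks = similarity.argsort(dim=1, descending=True)        # (Q, C)
    targets = torch.arange(similarity.size(0)).unsqueeze(1)   # (Q, 1)
    # Position of the ground-truth candidate in each query's ranked list.
    positions = (ranks == targets).nonzero()[:, 1]
    recalls = {f"R@{k}": (positions < k).float().mean().item() * 100 for k in ks}
    recalls["MR"] = sum(recalls.values()) / len(ks)
    return recalls
```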
Conclusion
The cross-modal entity alignment method proposed in this study fully exploits an effective intermediate representation of the image features to reconstruct the text features without changing the parameters of the VLP model, thereby improving the cross-modal recognition accuracy for the details of celadon. The experimental results show that the method outperforms several state-of-the-art methods and improves alignment performance. Ultimately, a celadon cross-modal knowledge graph with 8 949 nodes and 18 211 relationships was constructed by applying ontology modeling, data mining, and the proposed cross-modal entity alignment method.