Cross-media retrieval by intra-media and inter-media correlation mining
Xiaohua Zhai • Yuxin Peng • Jianguo Xiao
Multimedia Systems (2013) 19:395–406
Abstract With the rapid development of multimedia content on the Internet, cross-media retrieval has become a key problem in both research and application. Cross-media retrieval returns results with the same semantics as the query, but of different media types. For instance, given a query image of Moraine Lake, besides retrieving images of Moraine Lake, a cross-media retrieval system can also retrieve related media content of different media types such as text descriptions. As a result, measuring content similarity between different media is a challenging problem. In this paper, we propose a novel cross-media similarity measure. It considers both intra-media and inter-media correlation, which are ignored by existing works. Intra-media correlation focuses on semantic category information within each media, while inter-media correlation focuses on positive and negative correlations between different media types. Both of them are very important and their adaptive fusion can complement each other. To mine the intra-media correlation, we propose a heterogeneous similarity measure with nearest neighbors (HSNN). The heterogeneous similarity is obtained by computing the probability of two media objects belonging to the same semantic category. To mine the inter-media correlation, we propose a cross-media correlation propagation (CMCP) approach to simultaneously deal with positive and negative correlation between media objects of different media types, while existing works focus solely on the positive correlation. Negative correlation is very important because it provides effective exclusive information. The correlations are modeled as must-link constraints and cannot-link constraints, respectively. Furthermore, our approach is able to propagate the correlation between heterogeneous modalities. Finally, both HSNN and CMCP are flexible, so that any traditional similarity measure can be incorporated.
An effective ranking model is learned by further fusion of multiple similarity measures through AdaRank for cross-media retrieval. The experimental results on two datasets show the effectiveness of our proposed approach, compared with state-of-the-art methods.
Keywords Cross-media retrieval · Heterogeneous similarity measure · Intra-media correlation · Inter-media correlation
1 Introduction
With the rapid growth of multimedia content on the Internet, multimedia retrieval [12] has been widely studied for decades. In the early years, research focused on keyword-based multimedia retrieval, such as keyword-based image retrieval, audio retrieval and video retrieval. Most of these methods are based on the surrounding text of media objects in the Web page. However, the surrounding text does not directly describe the multimedia content itself, which limits the retrieval accuracy. To solve this problem, much research effort has been devoted to content-based multimedia retrieval, such as content-based image retrieval [3, 4], audio retrieval [22] and video retrieval [6, 18, 21]. However, the prevailing methods generally focus on single-media retrieval, where the retrieved results and the user query are of the same media type. In practice, users demand all of the multimedia content that they are interested in. As a result, these methods cannot measure content similarity among media objects of different media types. Although some methods have been proposed to model image data complemented with text annotations [5, 9], they are restricted to exploiting keywords associated with visible objects. In particular, these methods can only support keyword-based image retrieval. They cannot support content-based cross-media retrieval among all media types, such as using an image to retrieve relevant textual data and audio data, which should be the goal of multimedia retrieval.
Consequently, the new topic of cross-media retrieval becomes increasingly important and attracts considerable research attention. In cross-media retrieval, users can query media objects of all media types by submitting any type of media object. By submitting either one or multiple media objects, we can obtain all of the related media objects of different media types. For instance, given a query image of Moraine Lake, besides retrieving the images of Moraine Lake, a cross-media retrieval system can also retrieve the related media content of different media types such as text descriptions. This is an interesting, yet difficult problem. On one hand, users demand such cross-media retrieval systems to search results across various media [15]. On the other hand, due to the ‘‘semantic gap’’ between low-level features and human understanding, the similarity measure among objects of the same media type has always been a difficult problem. As a result, measuring content similarity among objects of different media types is an even greater challenge. Compared with traditional single-media retrieval methods, cross-media retrieval requires cross-media correlation modeling, so that users can query whatever they want by submitting whatever they have [26].
In this paper, we propose a novel cross-media similarity measure. It considers both intra-media and inter-media correlation, which are ignored by previous work [20, 26, 28, 29]. Intra-media correlation focuses on semantic category information within each media, while inter-media correlation focuses on positive and negative correlation between different media types. Both of them are very important and their adaptive fusion can complement each other.
To mine the intra-media correlation, we propose a heterogeneous similarity measure with nearest neighbors (HSNN). Unlike traditional similarity measures, which are limited to a homogeneous feature space, HSNN can compute the similarity between media objects of different media types. The heterogeneous similarity is obtained by computing the probability of two media objects belonging to the same semantic category, regardless of which category it is. The probability is obtained by analyzing the homogeneous nearest neighbors of each media object.
To mine the inter-media correlation, we propose a cross-media correlation propagation (CMCP) approach to simultaneously deal with positive and negative correlation between media objects of different media types, while existing works focus solely on the positive correlation. Negative correlation is very important because it provides effective exclusive information. Figure 1 gives an example of such constraints. In this example, the images and texts come from the ‘‘music’’ and ‘‘sport’’ categories, respectively. A text description about ‘‘sport’’ may have strong positive correlation with an image of players on a playground, and negative correlation with an image in which two people sit before a piano. The correlations are modeled as must-link constraints and cannot-link constraints, respectively. We obtain both kinds of pairwise constraints from the category labels of the media objects. Since traditional constraint propagation techniques [14, 16, 17] are designed for single-modality data and cannot be directly applied to cross-media retrieval, our proposed CMCP provides an efficient solution by dividing the cross-media correlation propagation problem into two sets of independent label propagation subproblems. Each set of subproblems propagates the initial correlation along the corresponding media modality. By combining the propagation along each modality, the semantic correlation can be propagated throughout the whole dataset. The resulting correlation on the target objects then naturally meets the requirement of the cross-media retrieval problem, in which the retrieval results have the same semantics as the query but can be of a different media type. Furthermore, our approach is able to propagate the correlation between heterogeneous modalities.
Finally, both HSNN and CMCP are flexible, so that any traditional similarity measure can be incorporated. An effective ranking model is learned by further fusion of multiple similarity measures through AdaRank for cross- media retrieval. The experimental results on the two datasets show the effectiveness of our proposed approach, compared with state-of-the-art methods.
The rest of this paper is organized as follows. In Sect. 2, we review related work on cross-media retrieval. We then introduce our framework in Sect. 3. Section 4 presents the proposed HSNN similarity measure and Sect. 5 presents the CMCP similarity measure. Section 6 presents the AdaRank fusion algorithm. Experiments are given in Sect. 7. Finally, we conclude this paper in Sect. 8.
2 Related work
The key challenge in cross-media retrieval is to measure the content similarity among different media types. Existing approaches to cross-media retrieval basically focus on two aspects: feature representation and similarity measure.
As for the feature representation, existing methods aim to project the original heterogeneous features into a joint feature space [1, 13, 20]. Once different media types are projected into the same feature space, cross-media retrieval is reduced to a classical retrieval problem. As a statistical tool, canonical correlation analysis (CCA) [8] can be used to analyze the correlation between two multivariate random vectors. It is thus a natural candidate for learning the subspace that maximizes the correlation between two sets of heterogeneous data. Kidron et al. [11] applied CCA to localize visual events associated with sound sources. Bredin and Chollet [2] applied CCA to the task of audiovisual-based talking-face biometric verification. Kernel canonical correlation analysis (KCCA) has also been of interest and has been applied to the fusion of text and image for spectral clustering [1]. Alternatively, Li et al. [13] introduced a cross-modal factor analysis (CFA) approach to evaluate the association between the two modalities and demonstrated its superior performance with respect to CCA. Unlike the CCA approach, which finds transformation matrices that maximize the correlation between two subsets of features, the CFA method adopts a criterion of minimizing the Frobenius norm between pairwise data in the transformed domain. However, both CCA and CFA focus only on modeling the correlation between pairwise data, and do not take the semantic category into account. Category information is helpful for learning the joint representation. More recently, a notable work [20] investigates two hypotheses for cross-media retrieval: (1) there is benefit in explicitly modeling the correlation between text and image modalities, and (2) this modeling is more effective with higher-level semantics. The correlation between the two modalities is learned with CCA, and higher-level semantics are achieved by representing text and image at a more general semantic level.
However, due to the gap between low-level features of media content and human understanding, the high-level representations are not accurate enough. This inaccuracy is then propagated to the subsequent matching step and decreases the performance of cross-media retrieval.
As for the similarity measure, existing methods aim to design a heterogeneous similarity measure, by which we can directly compute the similarity or ranking value between heterogeneous features without an explicit joint representation [10, 26, 27, 31]. The methods of [26, 27, 31] explore the co-occurrence information in different media types; that is, if different media objects co-exist in a multimedia document, then they share the same semantics. However, these methods rely heavily on the co-occurrence of the query object with multimedia documents in the dataset. When the query object is outside the dataset, the performance decreases dramatically, so user feedback is required to maintain it. Jia et al. [10] propose a new probabilistic model that learns a set of shared topics for images and text. The model can be seen as a Markov random field of topic models, which connects the media objects based on their similarity. The encoded relations between different modalities can be directly applied to cross-media retrieval.
3 Cross-media retrieval framework
In this section, we present the proposed heterogeneous similarity measure for cross-media retrieval. As in [20], we restrict the discussion to multimedia documents containing images and texts, but the fundamental ideas are applicable to any combination of media types. The goal is to retrieve text articles in response to image queries and vice versa.
The multimedia dataset is denoted as D = {I, T}, in which I and T denote images and texts, respectively: I = {I1, ..., In}, T = {T1, ..., Tm}, where n is the number of images and m is the number of texts. Each image or text in the training set is assigned a category label, while images and texts in the testing set remain unlabeled. Images and texts are represented as feature vectors Ii ∈ R^I and Ti ∈ R^T, respectively. A bag-of-words (BOW) model and a topic model are utilized to represent the images and texts, respectively. The goal of the cross-media retrieval task is described as follows: given an image (text) query Iq ∈ R^I (Tq ∈ R^T) in the testing set, return the closest match in the text (image) space R^T (R^I) in the testing set.
Then, we introduce the framework of our proposed algorithm. As shown in Fig. 2, our algorithm mainly con- sists of the following three parts.
• Homogeneous similarity: We first calculate the homogeneous similarity between media objects of the same media type. In this stage, traditional similarity measures such as histogram intersection, chi-square, etc. can be incorporated.
• Heterogeneous similarity: Based on the homogeneous similarity, we further compute the similarity by mining the intra-media correlation and inter-media correlation, which is the main contribution of this paper. We introduce the proposed HSNN and CMCP algorithms in Sects. 4 and 5, respectively.
• AdaRank fusion: Finally, both HSNN and CMCP are flexible, so that any traditional similarity measure can be incorporated. An effective ranking model is learned by further fusion of multiple similarity measures through AdaRank for cross-media retrieval. We introduce the AdaRank fusion in Sect. 6.
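The homogeneous-similarity stage can use any standard histogram measure. A minimal sketch of two of the measures named above (the function names and the exponential mapping from chi-square distance to similarity are our own choices, not the paper's):

```python
import numpy as np

def histogram_intersection(h1, h2):
    # Similarity of two (L1-normalized) histograms: sum of element-wise minima.
    return float(np.minimum(h1, h2).sum())

def chi_square_similarity(h1, h2, eps=1e-10):
    # Chi-square distance between histograms, mapped to a similarity in (0, 1].
    d = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return float(np.exp(-d))
```

Identical histograms score highest under both measures, which makes them directly usable as the homogeneous similarities that HSNN and CMCP build on.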
Our proposed HSNN and CMCP can naturally support multiple media types. Take {M1, ..., Mm} as m kinds of media types. I and T can be any two different media types among {M1, ..., Mm} after we extract the corresponding feature vectors of each media type, so we can easily compute the similarity among any m kinds of media types. Furthermore, we can also jointly model multiple media types by the following divide-and-conquer algorithm. First, we divide the m media types into m′ = ⌈m/2⌉ combinations {M1, M2}, ..., {Mm−1, Mm}. For each combination {Mi, Mi+1}, we compute the correlation F between them. Then, each combination {Mi, Mi+1} is treated as a new single media type. Based on this, we can further divide the combinations into ⌈m′/2⌉ new combinations, recursing until we obtain the similarity among all the media types.
4 Heterogeneous similarity measure with nearest neighbors
Normally, kNN classification is used to classify a data point Ii (Ti) based on the closest training examples in the feature space R^I (R^T). Ii (Ti) is classified by a majority vote of its nearest neighbors, with the object assigned to the class most common among its k nearest neighbors. As for cross-media retrieval, given two heterogeneous media objects Ii and Tj, we have to predict whether they belong to the same semantic category, regardless of which category it is. As a toy example, the proposed similarity measure is illustrated in Fig. 3. To achieve this goal, we compute the marginal probability of assigning Ii and Tj to the same semantic category with nearest neighbors, which equals:
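The printed equation did not survive extraction. A plausible reconstruction consistent with the surrounding description (the notation below is ours, not necessarily the paper's exact form) estimates each object's category posterior from its similarity-weighted k nearest homogeneous neighbors and then marginalizes over categories:

```latex
P(c \mid I_i) = \frac{\sum_{I_k \in N_k(I_i),\; C(I_k) = c} \operatorname{sim}(I_i, I_k)}
                     {\sum_{I_k \in N_k(I_i)} \operatorname{sim}(I_i, I_k)},
\qquad
\operatorname{Sim}(I_i, T_j) = \sum_{c} P(c \mid I_i)\, P(c \mid T_j)
\quad (1)
```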
where sim(Ii, Ik) is a traditional similarity measure between two homogeneous data points.
The range of heterogeneous similarity of Eq. 1 is [0,1], where 0 is achieved when none of the nearest neighbors of two media objects belong to the same category and 1 is achieved when all of the nearest neighbors belong to only one category.
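The HSNN computation described above can be sketched as follows, with our own assumptions about details the text leaves open (similarity-weighted neighbor votes; all function and variable names are ours, not the paper's):

```python
import numpy as np

def category_probs(x, train_feats, train_labels, k, sim):
    """Estimate P(category | x) from x's k most similar labeled neighbors,
    weighting each neighbor's vote by its homogeneous similarity to x."""
    sims = np.array([sim(x, f) for f in train_feats])
    nn = np.argsort(-sims)[:k]               # indices of the k nearest neighbors
    probs = {}
    for i in nn:
        probs[train_labels[i]] = probs.get(train_labels[i], 0.0) + sims[i]
    total = sum(probs.values()) or 1.0
    return {c: v / total for c, v in probs.items()}

def hsnn_similarity(img, txt, img_train, img_labels, txt_train, txt_labels,
                    k, sim_img, sim_txt):
    """Heterogeneous similarity: probability that img and txt belong to the
    same semantic category, marginalized over all categories."""
    p_img = category_probs(img, img_train, img_labels, k, sim_img)
    p_txt = category_probs(txt, txt_train, txt_labels, k, sim_txt)
    return sum(p * p_txt.get(c, 0.0) for c, p in p_img.items())
```

The result lies in [0, 1]: it is 0 when the two objects' neighborhoods share no category and 1 when both neighborhoods agree on a single category, matching the range stated above.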
5 Cross-media correlation propagation
In this section, we present the proposed cross-media correlation propagation algorithm for cross-media retrieval.
Given a dataset D = {I, T}, our goal is to exploit the semantic correlation between heterogeneous objects for cross-media queries. First, as shown in the left part of Fig. 4, we construct the semantic correlation matrix Y = {Yij}_{m×n}, where m and n are the numbers of texts and images in the dataset. The element Yij stands for the pairwise constraint between the ith text and the jth image. The definition of Y is given as follows:
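The printed definition was lost in extraction; based on the description that follows, it has the form (our reconstruction):

```latex
Y_{ij} =
\begin{cases}
 +1 & \text{if } C(T_i) = C(I_j), \\
 -1 & \text{if } C(T_i) \neq C(I_j), \\
 \phantom{+}0 & \text{if either object is unlabeled,}
\end{cases}
```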
where C(Ti) and C(Ij) represent the category labels of the ith text and the jth image, respectively. Only the images and texts in the training set are assigned category labels. Yij = 1 means the ith text and the jth image have positive correlation, which is referred to as a must-link constraint. Yij = -1 means the ith text and the jth image have negative correlation, which is referred to as a cannot-link constraint. Yij = 0 means we do not have any information about the categories of the ith text and the jth image. In Fig. 4, black filled circles indicate the already known correlations and empty circles indicate the unknown semantic correlations. For ease of observation, we place the labeled media objects in the top left region and the unlabeled media objects in the bottom right region. Consequently, the correlation propagation problem is reduced to filling in the empty-circle elements according to the already known filled circles.
We make further observations on Y column by column. The j-th column Y·j corresponds to the semantic correlation between the j-th image and all the texts, and its elements actually provide the initial configuration of a two-class semi-supervised learning problem. The ‘‘positive class’’ (Yij = 1) consists of texts with the same semantics as the j-th image and the ‘‘negative class’’ (Yij = -1) consists of texts with different semantics from the j-th image.
This two-class semi-supervised learning problem can be solved by the label propagation technique [30]. We call this kind of propagation text propagation; that is, semantic correlation is propagated according to text similarity.
However, some columns contain neither positive nor negative labels, which means we do not know any correlation about these media objects and all the corresponding elements are zero. In this case, we cannot propagate the semantic correlation information. As shown in Fig. 4, the elements in the bottom right region of Y cannot be obtained readily by text propagation. Similar to text propagation, image propagation can also be performed row by row. The propagation directions of text propagation and image propagation are orthogonal to each other. As a result, by combining text propagation and image propagation, the semantic correlation on the labeled objects can be successfully propagated to the unlabeled objects.
In practice, the larger Yij is, the more probable it is that the ith text and the jth image have the same semantics. Obviously, the semantic correlation matrix naturally meets the requirement of cross-media retrieval: once the correlation is obtained, cross-media retrieval can be accomplished based on it. The propagation problem can be solved efficiently through semi-supervised learning based on k-nearest-neighbor graphs. We summarize our CMCP algorithm as follows:
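The algorithm box itself did not survive extraction. As a rough sketch of the alternating-propagation idea only (not the paper's exact update rules: the row-normalized kNN similarity graphs W_text and W_img, the retention parameter alpha, and the update form below are our assumptions):

```python
import numpy as np

def cmcp(Y, W_text, W_img, alpha=0.6, iters=50):
    """Propagate the m x n correlation matrix Y along both modalities.

    W_text (m x m) spreads each column of Y over similar texts (text
    propagation); W_img (n x n) spreads each row over similar images
    (image propagation). Each step blends the propagated values with the
    initial constraints Y so that labeled entries keep their influence."""
    F = Y.astype(float)
    for _ in range(iters):
        F = alpha * (W_text @ F) + (1 - alpha) * Y   # text propagation
        F = alpha * (F @ W_img.T) + (1 - alpha) * Y  # image propagation
    return F
```

Entries that start as zero (the empty circles in Fig. 4) acquire values from correlated labeled pairs, so the final F can be used directly as a cross-media ranking score.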
6 Learning the ranking model through AdaRank
Any traditional similarity measure can be incorporated into our HSNN and CMCP similarity measures, such as normalized correlation, histogram intersection, chi-square distance and so on. Each similarity measure captures a certain aspect of the relationship between two media objects and can be regarded as a weak ranker. Learning to combine multiple weak rankers is also a key component of a retrieval task. In this paper, the ranking model for cross-media retrieval is learned through AdaRank [25], which is a listwise approach [24] to learning to rank. Unlike AdaBoost, AdaRank can train a ranking model that directly optimizes information retrieval performance measures such as MAP (mean average precision) with respect to the training data, while the loss function in AdaBoost is specific to binary classification. Furthermore, AdaRank optimizes a loss function based on query lists, while other learning-to-rank algorithms attempt to minimize a loss function based on instance pairs. We therefore adopt AdaRank to learn the ranking model for the cross-media retrieval task. In the learning stage, a number of image (text) queries and their corresponding retrieved texts (images) are given. The relevance of the retrieved texts (images) with respect to the image (text) queries is also provided. The training set can be represented as L = {(qi, oi, yi)}, where qi is a query, oi = {oi1, oi2, ..., oi,n(qi)} is the list of objects with a different modality from the query, yi = {yi1, yi2, ..., yi,n(qi)} is the corresponding list of labels, and n(qi) denotes the size of the object list oi and the label list yi.
The objective of learning is to construct a ranking function which achieves the best results in ranking the training data. The AdaRank algorithm is summarized in Algorithm 1. AdaRank runs T rounds and at each round selects the weak ranker ht (t = 1, ..., T) with the lowest weighted error. Finally, it outputs a ranking model f by linearly combining the weak rankers.
Initially, AdaRank assigns equal weights to the training queries. In each round of iteration, the training queries are re-weighted, emphasizing the queries that are not ranked well by ft, the ranking model created in the tth round. As a result, the learning at the next round will focus on those hard queries.
The effectiveness of a ranking model is usually evaluated with performance measures such as MAP (mean average precision). The loss function measuring the weighted error of weak ranker ht is defined as follows:
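The printed formula was lost in extraction; given the description that follows and the AdaRank formulation of [25], a plausible form (our reconstruction, with E taking values in [0, 1]) is:

```latex
\mathcal{L}(h_t) = \sum_{i} w_{t,i} \left( 1 - E\bigl(\pi(q_i, o_i, h_t),\, y_i\bigr) \right)
```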
where wt,i is the weight of query i in the tth round and π(qi, oi, ht) is the permutation produced for query qi on the object list oi by weak ranker ht. E measures the consistency between π and the labels yi; the performance measure adopted here is mean average precision (MAP). For each query, average precision is the average of the precisions computed at the ranks where recall changes, and mean average precision is the mean of this value over a set of queries. It is widely used in the image retrieval literature [19].
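Putting Sect. 6 together, a compact sketch of an AdaRank-style training loop with MAP as the measure E (a simplified reading of [25], not the authors' exact implementation; the data layout and all names are ours):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP of a ranked list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def adarank(queries, rounds=10):
    """queries: list of (scores, labels); scores is a (num_rankers x n) array
    of weak-ranker scores for one query's object list, labels is 0/1 relevance.
    Returns a weight vector over the weak rankers (a linear ranking model)."""
    m, R = len(queries), queries[0][0].shape[0]
    w = np.full(m, 1.0 / m)                 # uniform query weights
    alpha = np.zeros(R)
    for _ in range(rounds):
        # Per-query MAP of every weak ranker.
        perf = np.array([[average_precision(labels[np.argsort(-scores[j])])
                          for scores, labels in queries] for j in range(R)])
        k = int(np.argmax(perf @ w))        # best weak ranker this round
        alpha[k] += 0.5 * np.log((w @ perf[k] + 1e-12) /
                                 (w @ (1.0 - perf[k]) + 1e-12))
        # Re-weight queries: emphasize those the combined model ranks poorly.
        combined = np.array([average_precision(
                                 labels[np.argsort(-(alpha @ scores))])
                             for scores, labels in queries])
        w = np.exp(-combined)
        w /= w.sum()
    return alpha
```

At retrieval time, the learned model scores a query's object list as alpha @ scores and ranks by that value.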
7 Experiments
In this section, we compare the proposed approach with the state-of-the-art methods for cross-media retrieval.
7.1 Dataset description
Since cross-media retrieval is a relatively new topic, there are few publicly available cross-media datasets. A notable one is the Wikipedia dataset [20], which is drawn from Wikipedia's ‘‘featured articles’’. To evaluate the performance of cross-media retrieval methods more objectively, methods should be evaluated on more datasets than are currently publicly available. We therefore construct a new Wiki-Flickr dataset, which contains more categories and more media objects. We evaluate the proposed approach on these two datasets.
The first is the Wikipedia dataset [20], which is drawn from Wikipedia's ‘‘featured articles’’. This is a continually updated collection of 2,700 articles that have been selected and reviewed by Wikipedia's editors since 2009. Each article is accompanied by one or more images from Wikimedia Commons. Both the text and the images are assigned a category label by Wikipedia. There are 29 categories in total. Since some of the categories are very scarce, the ten most populated categories are preserved in this dataset. Each article is split into several sections according to its section headings, and the accompanying images are assigned to the sections according to their positions in the article. The final dataset contains a total of 2,866 documents, which are text–image pairs annotated with a label from the vocabulary of ten semantic categories. The dataset is randomly split into a training set of 2,173 documents and a test set of 693 documents.
The second is our Wiki-Flickr dataset, which consists of 3,000 texts and 3,000 images. We generate 3,000 documents in a manner similar to the Wikipedia dataset. All of the texts are crawled from Wikipedia articles. The images consist of two parts: 600 images from Wikipedia articles and 2,400 images from the photo-sharing website Flickr.
This dataset is organized into 12 categories, each with 250 documents. The dataset is randomly split into a training set of 2,000 documents and a test set of 1,000 documents. We randomly select six examples of different media types from three categories of the Wikipedia dataset and the Wiki-Flickr dataset, respectively, as shown in Fig. 5.
Two cross-media retrieval tasks are considered: retrieve the texts using an image query and retrieve the images using a text query. For the former, each image in the test set is used as a query and the result is the ranking of all the texts in the test set. For the latter, each text is used as a query and the result is the ranking of all the images in the test set. The precision–recall (PR) curves and mean average precision (MAP) are taken as performance measures.
A bag-of-words (BOW) model and a topic model are utilized to represent the images and texts, respectively. Each image is represented as a histogram over a 128-codeword SIFT codebook and each text is represented as a histogram over a 10-topic LDA text model. For fair comparison, all of the cross-media retrieval methods compared in the experiments adopt the same features and training data. We set k = 30 for the k nearest neighbors, and αI = αT = 0.6 in Eq. (9) for the cross-media correlation propagation algorithm.
7.2 Performance of our proposed heterogeneous similarity measure
Our first set of experiments examines the performance of the heterogeneous similarity measures HSNN and CMCP in Table 1. In this table, two methods from [20] are compared: learning a homogeneous subspace for the modalities through canonical correlation analysis (CCA), and learning a high-level semantic representation for each object after mapping the original heterogeneous features into a homogeneous subspace through CCA (SMN + CCA). Both CCA and SMN seek a homogeneous feature representation for media objects of different modalities. Once the media objects are represented as homogeneous feature vectors, comparison can be performed with commonly used homogeneous similarity measures. Since our proposed HSNN and CMCP can incorporate any kind of similarity measure, we objectively evaluate the MAP scores with multiple similarity measures, including normalized correlation, histogram intersection and chi-square. We can conclude that HSNN and CMCP outperform the previous methods [20] under all three commonly used similarity measures. The contributions of HSNN and CMCP are clearly seen. On one hand, the intra-media correlation focuses on the structure of each media type; on the other hand, the inter-media correlation reflects the semantic correlation between different media types. Both of them are effective and complementary to each other. In addition, we find that CMCP always outperforms HSNN, which suggests that inter-media correlation is more effective than intra-media correlation.
7.3 Performance of learning the ranking model through AdaRank
Next, we examine the performance of the ranking model trained through AdaRank in Table 2. We denote our AdaRank fusion approach as an adaptive heterogeneous similarity measure (AHSM). Each similarity measure is regarded as a weak ranker. To make the weak rankers more diverse, we additionally take the square root of the original features to form new features, motivated by previous work [7, 23] in the face recognition realm, where the square root of the original features has been shown to be effective. In this paper, 12 weak rankers are combined in total. We found that the square-root version of the features outperforms the original features in most cases. Furthermore, AdaRank further improves the performance by linearly combining multiple weak rankers.
7.4 Comparison with state-of-the-art methods
Table 3 shows the performance of our proposed heterogeneous similarity measure, compared with the state-of-the-art methods discussed in Sect. 7.2. Here, random means ranking the images/texts randomly. Figure 6 shows the PR curves of all of the above methods. It can be seen that our adaptive heterogeneous similarity measure (AHSM) attains higher precision at most levels of recall. The reason may be that AHSM jointly models intra-media correlation and inter-media correlation, which are both very important, and their adaptive fusion can complement each other.
8 Conclusion
In this paper, we have proposed a novel cross-media similarity measure which considers both intra-media and inter-media correlation. Intra-media correlation focuses on semantic category information within each media, while inter-media correlation focuses on positive correlation and negative correlation between different media types. Both of them are very important and their adaptive fusion can complement each other. The intra-media correlation is modeled through a heterogeneous similarity measure with nearest neighbors (HSNN). The inter-media correlation is modeled through a cross-media correlation propagation (CMCP) approach, which can simultaneously deal with positive and negative correlation between media objects of different media types. Negative correlation is very important because it provides effective exclusive information. The correlations are modeled as must-link constraints and cannot-link constraints, respectively. Furthermore, our approach is able to propagate the correlation between heterogeneous modalities. Finally, both HSNN and CMCP are flexible, so that any traditional similarity measure can be incorporated. An effective ranking model is learned by further fusion of multiple similarity measures through AdaRank for cross-media retrieval.
In the future, on one hand, we will jointly model other media types such as audio and video; on the other hand, the correlation between categories can also be explored.
Acknowledgments This work was supported by National Natural Science Foundation of China under Grant 61073084, Beijing Natural Science Foundation of China under Grant 4122035, National Hi-Tech Research and Development Program (863 Program) of China under Grant 2012AA012503, National Development and Reform Commission High-tech Program of China under Grant [2010]3044, and National Key Technology Research and Development Program of China under Grants 2012BAH07B01 and 2012BAH18B03.
References
1. Blaschko, M., Lampert, C.: Correlational spectral clustering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2007)
2. Bredin, H., Chollet, G.: Audio-visual speech synchrony measure for talking-face identity verification. International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2007)
3. Clinchant, S., Ah-Pine, J., Csurka, G.: Semantic combination of textual and visual information in multimedia retrieval. In: ACM International Conference on Multimedia Retrieval (2011)
4. Escalante, H., Hernández, C., Sucar, L., Montes, M.: Late fusion of heterogeneous methods for multimedia image retrieval. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (2008)
5. Grangier, D., Bengio, S.: A discriminative kernel-based model to rank images from text queries. IEEE Trans. Pattern Anal. Mach. Intell. 30(8), 1371–1384 (2008)
6. Greenspan, H., Goldberger, J., Mayer, A.: Probabilistic space-time video modeling via piecewise GMM. IEEE Trans. Pattern Anal. Mach. Intell. 26(3), 384–396 (2004)
7. Guillaumin, M., Verbeek, J., Schmid, C.: Is that you? Metric learning approaches for face identification. IEEE International Conference on Computer Vision (ICCV) (2009)
8. Hotelling, H.: Relations between two sets of variates. Biometrika 28(3–4), 321–377 (1936)
9. Jeon, J., Lavrenko, V., Manmatha, R.: Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th Annual International ACM SIGIR Conference (2003)
10. Jia, Y., Salzmann, M., Darrell, T.: Learning cross-modality similarity for multinomial data. IEEE International Conference on Computer Vision (ICCV) (2011)
11. Kidron, E., Schechner, Y., Elad, M.: Pixels that sound. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
12. Lew, M., Sebe, N., Djeraba, C., Jain, R.: Content-based multimedia information retrieval: state of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl. 2, 1–19 (2006)
13. Li, D., Dimitrova, N., Li, M., Sethi, I.K.: Multimedia content processing through cross-modal association. In: Proceedings of the ACM International Conference on Multimedia, pp. 604–611 (2003)
14. Li, Z., Liu, J., Tang, X.: Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In: ICML, pp. 576–583 (2008)
15. Liu, J., Xu, C., Lu, H.: Cross-media retrieval: state-of-the-art and open issues. Int. J. Multimedia Intell. Secur. 1(1), 33–52 (2010)
16. Lu, Z., Carreira-Perpiñán, M.: Constrained spectral clustering through affinity propagation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2008)
17. Lu, Z., Ip, H.: Constrained spectral clustering via exhaustive and efficient constraint propagation. In: Proceedings of the European Conference on Computer Vision (2010)
18. Peng, Y., Ngo, C.: Clip-based similarity measure for query-dependent clip retrieval and video summarization. IEEE Trans. Circuits Syst. Video Technol. 16(5), 612–627 (2006)
19. Rasiwasia, N., Moreno, P., Vasconcelos, N.: Bridging the gap: query by semantic example. IEEE Trans. Multimedia 9(5), 923–938 (2007)
20. Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. ACM International Conference on Multimedia (2010)
21. Shen, J., Cheng, Z.: Personalized video similarity measure. Multimedia Syst. 17(5), 421–433 (2011)
22. Typke, R., Wiering, F., Veltkamp, R.: A survey of music information retrieval systems. In: Proceedings of ISMIR (2005)
23. Wolf, L., Hassner, T., Taigman, Y.: Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE Trans. Pattern Anal. Mach. Intell. 33(10), 1978–1990 (2011)
24. Xia, F., Liu, T., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory and algorithm. In: Proceedings of the 25th International Conference on Machine Learning (2008)
25. Xu, J., Li, H.: AdaRank: a boosting algorithm for information retrieval. The 30th Annual International ACM SIGIR Conference (2007)
26. Yang, Y., Xu, D., Nie, F., Luo, J., Zhuang, Y.: Ranking with local regression and global alignment for cross media retrieval. ACM International Conference on Multimedia, pp. 175–184 (2009)
27. Yang, Y., Zhuang, Y., Wu, F., Pan, Y.: Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans. Multimedia 10(3), 437–446 (2008)
28. Zhai, X., Peng, Y., Xiao, J.: Cross-modality correlation propagation for cross-media retrieval. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2012)
29. Zhai, X., Peng, Y., Xiao, J.: Effective heterogeneous similarity measure with nearest neighbors for cross-media retrieval. In: International Conference on MultiMedia Modeling (MMM) (2012)
30. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local and global consistency. Advances in Neural Information Processing Systems (NIPS) (2003)
31. Zhuang, Y., Yang, Y., Wu, F.: Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans. Multimedia 10(2), 221–229 (2008)