Improvement of multimodal images classification based on DSMT using visual saliency model fusion with SVM

Multimodal images carry information that can be complementary or redundant; modeling and combining this information overcomes many of the problems attached to unimodal classification. Although this kind of classification gives acceptable results, it still does not reach the level of human visual perception, which classifies an observed scene with ease thanks to the powerful mechanisms of the human brain.


Introduction
Nowadays, multimodal imaging has gained increasing importance in computer vision applications, and significant efforts have been put into developing methods for different tasks, such as registration [1] [2] [3] [4], data fusion [5], representation learning [6], classification [7] and so on. In the classification task, a unimodal image presents various problems such as noisy, incomplete or distorted data, which often lead to misclassification. These limitations are overcome by using multimodal images, which are acquired from multiple sensors for the same object or scene. Each image, or modality, provides information that can be redundant with another modality, because the same area or scene is observed by a different sensor, or complementary to it, owing to the diversity of sensor technologies and of their physical interaction mechanisms. Using this set of images together brings a real benefit, since a given problem is addressed with several sources of information, and fusing these data yields a better-quality classification.
However, these data are affected by imperfections such as conflict, ignorance and uncertainty, which must be handled by a dedicated formalism since they reflect an aspect of reality. Several formalisms exist for this purpose, such as probability theory [8], fuzzy theory [9], the belief function formalism [10] and the Dezert-Smarandache formalism [11] [12]. In this work, we rely on the latter, which is the most recent; it was introduced to deal with highly conflicting and uncertain data thanks to its rich modeling and to the combination operators (PCR5 and PCR6) that it integrates.
In the classification task, belief function theory is widely exploited [13] [14] [15] [16]. DSmT, also called plausible and paradoxical reasoning, has shown its efficiency in many applications. It was applied to supervised classification in multi-source remote sensing [17] by integrating contextual information obtained from an ICM classifier with constraints and temporal information in a hybrid DSmT process with an adaptive decision rule; the authors also proposed a new decision rule based on the DSmP transformation for change detection [18]. In [19], the authors present an effective use of DSmT for multiclass classification by combining two SVM OAA (One-Against-All) implementations using the PCR6 combination rule. A new method based on DSmT fusion of the attribute type information obtained from a Ground Moving Target Indicator and an imagery sensor, for tracking and classification, is presented in [20]. Multidate fusion has been proposed in [21] [22] for the short-term prediction of winter land cover. DSmT is also used in medical case retrieval [23], where the authors fuse heterogeneous features from several sensors for inclusion in CBR systems.
According to our study of the state of the art, the works reviewed above do not exploit perceptual attention, although the human brain classifies scenes with great ease. Our approach benefits from this ability by integrating, through DSmT, a visual perception model with the spectral and dense SURF features obtained from SVM classification, which significantly improves the classification.
The paper is organized as follows. After a brief presentation of the mathematical background of the DSmT formalism in section 2, we present the overall system of the proposed method in section 3. Data and experiments are then given in section 4 in order to evaluate the performance of our approach on real image datasets. A conclusion is given in section 5.

Mathematical Background of DSmT
Dezert-Smarandache theory was proposed jointly by Jean Dezert and Florentin Smarandache [24] as an attempt to overcome the limitations of belief functions in handling highly uncertain and conflicting information. The theory can be described as follows. We denote by $\Theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ the discernment space of the N-class classification problem, and by $D^\Theta$ the hyper-power-set [25], that is, the set of subsets of Θ closed under the union and the intersection of classes, so that if $X, Y \in D^\Theta$, then $X \cup Y \in D^\Theta$ and $X \cap Y \in D^\Theta$. Each source assigns its belief mass to the elements of $D^\Theta$ through the generalized basic belief assignment (gbba) $m(\cdot): D^\Theta \to [0, 1]$, which satisfies the following properties:

$$m(\emptyset) = 0 \quad \text{and} \quad \sum_{X \in D^\Theta} m(X) = 1 \qquad (1)$$

where ∅ is the null set. The size of the hyper-power-set is a real limitation of DSmT when N > 6 (N being the number of classes) in the free model [26], which corresponds to the full hyper-power-set without any constraint. In contrast, the hybrid model [26] allows integrating exclusivity and refinement constraints, and therefore reduces the size of $D^\Theta$.
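To give a feeling for how fast the free-model hyper-power-set grows, the following minimal sketch counts its elements for small N. It is only an illustration under stated assumptions: each class is encoded as a set of Venn-diagram regions (stored as bits of an integer), and the function name `hyper_power_set_size` is hypothetical, not taken from the paper.

```python
def hyper_power_set_size(n):
    """Count the elements of the free-model hyper-power-set for n classes.

    Each class theta_i is encoded as the set of Venn regions it covers;
    a region is a non-empty subset of classes, numbered 1 .. 2**n - 1.
    """
    generators = set()
    for i in range(n):
        bits = 0
        for region in range(1, 2 ** n):
            if region & (1 << i):          # region lies inside theta_i
                bits |= 1 << (region - 1)
        generators.add(bits)

    # Close the set of generators under union (|) and intersection (&).
    elements = set(generators)
    frontier = set(generators)
    while frontier:
        new = set()
        for a in frontier:
            for b in elements:
                for c in (a | b, a & b):
                    if c not in elements:
                        new.add(c)
        elements |= new
        frontier = new

    # The empty set is never produced by the closure, so add it explicitly.
    return len(elements) + 1


for n in range(1, 5):
    print(n, hyper_power_set_size(n))
```

For N = 1 to 4 this prints 2, 5, 19 and 167. The brute-force closure becomes impractical for larger N, which is consistent with the motivation for the hybrid model given above.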
The generalized masses obtained from the different sources are then combined, and a new mass distribution is assigned to the elements of $D^\Theta$. The combination step is the kernel of the fusion process, and each formalism proposes several combination operators. The combination operators available in the DSmT formalism are described in detail in [27]; the most used are the Smets rule, the (normalized) Dempster-Shafer operator, the Yager operator, the Zhang operator, the DSmH rule, the Dubois and Prade rule, the PCR5 operator for two sources and the PCR6 operator for more than two sources. To deal with the large number of sources used in this work and with the highly uncertain and conflicting information they provide, we rely on the PCR6 combination rule.
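The generalized belief functions, namely the credibility, noted Bel(.) or Cr(.), the plausibility, noted Pl(.), and the DSmP transformation, are derived from the basic mass function and are defined from $G^\Theta$ to [0, 1], respectively, as:

$$\mathrm{Bel}(X) = \sum_{Y \in G^\Theta,\; Y \subseteq X} m(Y)$$

$$\mathrm{Pl}(X) = \sum_{Y \in G^\Theta,\; Y \cap X \neq \emptyset} m(Y)$$

$$\mathrm{DSmP}_\epsilon(X) = \sum_{Y \in G^\Theta} \frac{\sum_{Z \subseteq X \cap Y,\; C(Z) = 1} m(Z) + \epsilon\, C(X \cap Y)}{\sum_{Z \subseteq Y,\; C(Z) = 1} m(Z) + \epsilon\, C(Y)}\; m(Y) \qquad (6)$$

where $G^\Theta$ denotes either the full $D^\Theta$ or the $D^\Theta$ reduced by constraints, depending on the model used (free or hybrid), $\epsilon \geq 0$ is an adjustment parameter, and $C(X \cap Y)$ and $C(Y)$ are respectively the cardinalities of $X \cap Y$ and $Y$.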
The last step of the DSmT process is the final decision, which is a real challenge in many applications. Since this work aims at improving classification, we have to decide to which simple class, also called singleton class, each pixel belongs. There are two ways to do so: the decision can be based on the maximum of the generalized basic belief mass (gbba), or on one of the generalized belief functions already computed, as follows:


- Maximum of credibility Cr(.), which is widely used in many applications [28] and is considered a pessimistic decision.
- Maximum of plausibility Pl(.), which is considered an optimistic decision.
- Maximum of DSmP, which is a compromise between the two decisions above, since it relies on a probabilistic transformation P(.) that lies in the interval [Cr(.), Pl(.)].
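The three decision rules can be compared on a toy example. The sketch below is only illustrative and rests on simplifying assumptions: classes are treated as exclusive (so focal elements are plain sets of class labels and the cardinality is the number of labels), the mass values are invented, and the function names are hypothetical.

```python
def bel(m, x):
    """Credibility: total mass of the non-empty focal elements included in x."""
    return sum(v for y, v in m.items() if y and y <= x)


def pl(m, x):
    """Plausibility: total mass of the focal elements that intersect x."""
    return sum(v for y, v in m.items() if y & x)


def dsmp(m, x, eps=0.001):
    """DSmP transformation of x; eps > 0 keeps the denominator non-zero."""
    total = 0.0
    for y, mass_y in m.items():
        if not y:
            continue
        common = x & y
        # Masses of the singletons contained in x & y, and in y.
        num = sum(m.get(frozenset([t]), 0.0) for t in common) + eps * len(common)
        den = sum(m.get(frozenset([t]), 0.0) for t in y) + eps * len(y)
        total += mass_y * num / den
    return total


def decide(m, classes, score):
    """Return the singleton class that maximizes the given score function."""
    return max(classes, key=lambda c: score(m, frozenset([c])))


# Invented masses on two classes and on their union (partial ignorance).
m = {
    frozenset(["WO"]): 0.5,
    frozenset(["WR"]): 0.2,
    frozenset(["WO", "WR"]): 0.3,
}
classes = ["WO", "WR"]
for rule in (bel, pl, dsmp):
    print(rule.__name__, decide(m, classes, rule))
```

On this example the DSmP value of each singleton falls between its credibility and its plausibility, which illustrates why the maximum of DSmP is described as a compromise decision.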

Pre-processing
Generally, the pre-processing that precedes classification aims to eliminate the imperfections that taint the information, through operations such as filtering or gradient computation. However, in classification based on uncertainty theories, these imperfections are preserved, modeled and combined to help make a decision.
Registration is the pre-processing most commonly used in the fusion process. It aims at establishing the correspondence between two or more images of a scene obtained from one or several sensors, potentially at different spatial positions and scales, by estimating optimal spatial and radiometric transformations between the images.
In the case of multimodal images, registration is difficult because of the significant differences between images [29] [30]. In a previous work [1], we proposed an original methodology that addresses registration with multimodal imaging inputs: we exploit the scale- and rotation-invariant SURF descriptors to identify and describe the interest points, and we introduce, in the matching step, a relevance filtering based on both SURF distance and orientation features.

Feature Extraction
Feature extraction is a pivotal step in the classification process. It aims to highlight the relevant features that correspond to the various classes, and an appropriate choice of extracted features improves the performance of the classification step. Spectral, spatial and perceptual features are extracted in this work.

Spectral Information
Spectral information is used by a large number of classification methods. In this work, we extract the spectral values of each pixel as a vector of attributes and convert them to the CIELAB color space, which correlates better with human color perception.
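As an illustration of the color conversion, here is a minimal sketch of a per-pixel conversion to CIELAB. It assumes 8-bit sRGB input and a D65 reference white, neither of which is stated in the paper, and the function name `srgb_to_lab` is hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 white point assumed)."""

    def to_linear(c):
        # Undo the sRGB gamma encoding.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        d = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > d ** 3 else t / (3 * d * d) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


print(srgb_to_lab(255, 255, 255))  # close to (100, 0, 0)
print(srgb_to_lab(40, 120, 60))    # a green pixel: negative a*, positive b*
```

The three returned values (L*, a*, b*) would form the spectral attribute vector of the pixel in this sketch.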

Dense SURF Description
Speeded-Up Robust Features (SURF), proposed by Herbert Bay et al. [31], is a spatial descriptor that originally consists of two phases: detection and description of keypoints. In a previous work [32], we proposed to skip the detection phase and to apply the description phase to every pixel of the image. First, each pixel is assigned a dominant orientation, computed by combining the Haar wavelet responses within a circular neighborhood around it. Then a square window around the pixel is divided into 4 × 4 subregions. In each subregion, pixel-wise Haar wavelet responses are computed and summed up, which yields a 64-element descriptor.
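The structure of the 64-element descriptor can be sketched as follows. This is a deliberately simplified illustration, not the descriptor of [32]: central differences stand in for the Haar wavelet responses, the dominant orientation and the Gaussian weighting are omitted, and the window size (`half`) and the function name are assumptions.

```python
def dense_descriptor(img, row, col, half=10):
    """SURF-like 64-element descriptor for pixel (row, col) of a 2-D list."""
    h, w = len(img), len(img[0])

    def px(r, c):
        # Clamp coordinates so that windows near the border stay valid.
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    step = (2 * half) // 4  # side length of each of the 4 x 4 subregions
    desc = []
    for sr in range(4):
        for sc in range(4):
            sdx = sdy = sadx = sady = 0.0
            r0 = row - half + sr * step
            c0 = col - half + sc * step
            for r in range(r0, r0 + step):
                for c in range(c0, c0 + step):
                    dx = px(r, c + 1) - px(r, c - 1)  # horizontal response
                    dy = px(r + 1, c) - px(r - 1, c)  # vertical response
                    sdx += dx
                    sdy += dy
                    sadx += abs(dx)
                    sady += abs(dy)
            # Four sums per subregion: 16 subregions x 4 = 64 values.
            desc += [sdx, sdy, sadx, sady]

    norm = sum(v * v for v in desc) ** 0.5
    return [v / norm for v in desc] if norm > 0 else desc


# Horizontal ramp: only the horizontal sums are non-zero.
ramp = [[float(c) for c in range(40)] for _ in range(40)]
d = dense_descriptor(ramp, 20, 20)
print(len(d), d[:4])
```

Calling the function for every pixel gives one descriptor per pixel, which is the "dense" aspect of the approach described above.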

Saliency Information
Based on a comparative analysis of saliency detection on our multimodal data [33], we extract the saliency features using the method proposed by Rahtu et al. [34]. This method uses the local contrast of features such as illuminance and color, mapped to a feature space F that is divided into disjoint bins. A saliency measure is computed by applying a sliding window divided into an inner window K and a border B, under the hypothesis $H_1$ that points in K are salient and the hypothesis $H_0$ that points in B are not. The measure is defined as a conditional probability and computed through Bayes' formula as

$$S_0(x) = P(H_1 \mid F(x)) = \frac{P(F(x) \mid H_1)\, p_0}{P(F(x) \mid H_1)\, p_0 + P(F(x) \mid H_0)\, (1 - p_0)}$$

with $0 < p_0 < 1$ the prior probability of $H_1$. A regularized saliency measure is then introduced to make it more robust to noise.
The motivation for integrating saliency information in the fusion process is that visual perception usually succeeds easily in classifying any object or scene.
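The Bayesian saliency measure can be sketched for a single window position. The example below is a minimal illustration, assuming that the conditional probabilities are estimated by normalized histograms of the feature bins in the inner window and in the border, and that `p0` is the prior probability of the "salient" hypothesis; the function name and the sample data are hypothetical, and the regularization step is not included.

```python
from collections import Counter


def saliency_measure(inner_bins, border_bins, p0=0.5):
    """Return {bin: P(salient | bin)} for one window position.

    inner_bins / border_bins: feature-bin indices of the pixels in the
    inner window and in the border (both must be non-empty).
    """
    hist_inner = Counter(inner_bins)
    hist_border = Counter(border_bins)
    n_inner, n_border = len(inner_bins), len(border_bins)

    scores = {}
    for b in set(hist_inner) | set(hist_border):
        p_inner = hist_inner[b] / n_inner     # estimate of P(bin | salient)
        p_border = hist_border[b] / n_border  # estimate of P(bin | not salient)
        scores[b] = p_inner * p0 / (p_inner * p0 + p_border * (1.0 - p0))
    return scores


# Invented bins: bin 1 occurs only inside, bin 3 only in the border.
scores = saliency_measure([1, 1, 1, 2], [2, 2, 2, 3], p0=0.5)
print(scores)
```

Bins that occur mostly in the inner window receive a score close to 1, and bins that occur mostly in the border receive a score close to 0.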

SVM Pre-Classification
Support vector machine (SVM) is a supervised classification method introduced by Vapnik [35] [36], widely used in classification applications thanks to its ability to deal with high-dimensional data. Basically, it is designed for binary classification and finds an optimal hyperplane that separates two linearly separable classes. When the classes are not linearly separable, the feature space is mapped to a higher-dimensional feature space where they become separable, using a kernel function that must fulfill Mercer's conditions. The most used kernel is the Radial Basis Function (RBF), for which the decision function is expressed as follows:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i\, y_i\, K(x_i, x) + b\right)$$

where $\alpha_i$ are the Lagrange multipliers, $y_i \in \{-1, +1\}$ are the labels of the training samples $x_i$, b is the bias, and the associated kernel function is

$$K(x_i, x) = \exp\left(-\gamma\, \lVert x_i - x \rVert^2\right), \quad \gamma > 0.$$

In the case of a multiclass problem, two main approaches have been proposed: the One-Versus-Rest approach, in which N binary classifiers are constructed for an N-class classification, and the One-Versus-One approach, in which N(N−1)/2 binary classifiers are applied, one on each pair of classes.
In order to generate the probabilities required by DSmT, we perform a pre-classification [32] that combines the spectral information (section 3.2.1) and the dense SURF information (section 3.2.2), using an SVM classifier with an RBF kernel to handle the non-linear, high-dimensional data of our multimodal dataset, and the One-Versus-Rest approach to deal with the incomplete information provided by the different modalities.
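The RBF decision function and the number of binary classifiers required by each multiclass strategy can be illustrated with a short sketch. The support vectors, multipliers, bias and the value of `gamma` below are invented for the example, and training (finding the multipliers) is not shown.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - z||^2) between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """Value whose sign gives the class predicted by a trained binary SVM."""
    return sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + bias


def n_binary_classifiers(n_classes):
    """Number of binary SVMs needed by each multiclass strategy."""
    return {
        "one_vs_rest": n_classes,
        "one_vs_one": n_classes * (n_classes - 1) // 2,
    }


# Invented model with one support vector per class.
svs = [[0.0, 0.0], [2.0, 0.0]]
labels = [+1, -1]
alphas = [1.0, 1.0]
bias = 0.0

print(decision_value([0.1, 0.0], svs, labels, alphas, bias))  # positive side
print(n_binary_classifiers(4))  # four classes, as in the experiments
```

With the four classes used in the experiments, One-Versus-Rest needs 4 binary classifiers and One-Versus-One needs 6.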

Mass function estimation
Mass function estimation is a crucial step of the fusion process, because it is where imperfections such as uncertainty, imprecision and paradox are introduced. The most common way to generate these masses is to use the probabilities produced by a pre-classification. The SVM classification of the images generates the matrices of the probabilities $P(\theta_i \mid x)$ of pixels x belonging to the singleton classes of the frame of discernment $\Theta = \{\theta_1, \theta_2, \dots, \theta_N\}$; the same holds for the saliency map generated using the method proposed in [34]. Each source (modality or saliency map), noted $S_k$ ($k = 1, \dots, s$), gives the probability of belonging to one or two classes and to their complementary classes, the latter representing the mass of the partial ignorance. Following [19], the gbba of each source is built from these probabilities (equation 10) and divided by a normalization term, which ensures that the masses of each source sum to 1.
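One plausible way to turn class probabilities into a gbba is sketched below. It is an assumption made for illustration and not necessarily the construction of equation 10: the classes emphasized by a source keep their probability as singleton mass, the probabilities of all other classes are lumped onto the complementary set (partial ignorance), and the result is normalized. The function name and the probability values are hypothetical.

```python
def estimate_gbba(probabilities, emphasized, frame):
    """Build a normalized mass function for one source.

    probabilities: {class: P(class | pixel)} from the pre-classification
    emphasized:    classes that this source discriminates well
    frame:         all classes of the frame of discernment
    """
    raw = {frozenset([c]): probabilities[c] for c in emphasized}

    # Mass of the partial ignorance: everything the source does not emphasize.
    rest = frozenset(frame) - frozenset(emphasized)
    if rest:
        raw[rest] = sum(probabilities[c] for c in rest)

    z = sum(raw.values())  # normalization term
    if z == 0:
        raise ValueError("all probabilities are zero")
    return {focal: mass / z for focal, mass in raw.items()}


frame = ["WO", "WR", "GO", "GR"]
probs = {"WO": 0.7, "WR": 0.1, "GO": 0.1, "GR": 0.1}  # invented values
masses = estimate_gbba(probs, emphasized=["WO"], frame=frame)
print(masses)
```

In this example the source assigns 0.7 to the singleton WO and the remaining 0.3 to the set {WR, GO, GR}.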

Combination of masses and decision
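The estimated masses must be combined with an appropriate rule that handles the conflict generated by the different sources. In this work, we use the PCR6 rule [37] in the combination step because it performed better on our datasets than all the other combination rules cited in section 2. Considering s > 2 independent sources, the combined masses $m_{PCR6}(\cdot)$ are computed as follows:

$$m_{PCR6}(X) = m_{12 \dots s}(X) + \sum_{i=1}^{s} m_i(X)^2 \sum_{\substack{(Y_{\sigma_i(1)}, \dots, Y_{\sigma_i(s-1)}) \in (G^\Theta)^{s-1} \\ \bigcap_{k=1}^{s-1} Y_{\sigma_i(k)} \cap X = \emptyset}} \frac{\prod_{j=1}^{s-1} m_{\sigma_i(j)}(Y_{\sigma_i(j)})}{m_i(X) + \sum_{j=1}^{s-1} m_{\sigma_i(j)}(Y_{\sigma_i(j)})} \qquad (11)$$

where $\sigma_i$ counts from 1 to s avoiding i,

$$\sigma_i(j) = j \ \text{if } j < i, \qquad \sigma_i(j) = j + 1 \ \text{if } j \geq i,$$

and where the mass $m_{12 \dots s}(X) \equiv m_{\cap}(X)$ corresponds to the conjunctive consensus on X between the s > 2 sources.
Once the combination step is completed, we compute the generalized belief functions and use the DSmP probabilistic transformation, which converts the combined masses into a probability measure using Eq. (6), to make the final decision.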
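The PCR6 combination can be sketched by enumerating every tuple of focal elements, one per source. The example assumes exclusive classes (focal elements are sets of labels and conflicting tuples have an empty intersection), uses invented masses, and the function name `pcr6` is hypothetical. Each conflicting product is redistributed to the focal elements involved, proportionally to the mass each source gave them.

```python
from itertools import product


def pcr6(sources):
    """Combine a list of mass functions {frozenset: mass} with PCR6."""
    combined = {}
    for combo in product(*(m.items() for m in sources)):
        sets = [focal for focal, _ in combo]
        masses = [mass for _, mass in combo]

        joint = 1.0
        for mass in masses:
            joint *= mass
        if joint == 0.0:
            continue

        inter = frozenset.intersection(*sets)
        if inter:
            # Agreement: conjunctive consensus.
            combined[inter] = combined.get(inter, 0.0) + joint
        else:
            # Conflict: give the product back to the elements involved,
            # proportionally to their individual masses.
            total = sum(masses)
            for focal, mass in zip(sets, masses):
                combined[focal] = combined.get(focal, 0.0) + joint * mass / total
    return combined


A, B = frozenset(["A"]), frozenset(["B"])
m1 = {A: 0.6, B: 0.4}
m2 = {A: 0.3, B: 0.7}
fused = pcr6([m1, m2])
print(fused)
```

The combined masses sum to 1 without any normalization step, because the whole conflicting mass is redistributed. The enumeration is exponential in the number of sources, so this sketch is only suitable for small examples.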

Data
Large sets of multimodal images acquired on wall paintings of the Germolles palace are used to demonstrate the proposed method. This palace was offered by the Duke of Burgundy, Philip, to his wife Margaret of Flanders in 1380, and it is the only remaining castle of the Dukes of Burgundy that is so well preserved. Its wall paintings were restored between 1989 and 1991; however, there are no conservation reports on the restoration applied. In order to distinguish original from restored areas, the conservator of Germolles used multimodal imaging, which has the advantage of being a fast and relatively inexpensive solution for examining large areas of wall paintings. This technical photography consists of recording a set of images with a commercial digital camera that has been modified by removing the thermal filter normally positioned in front of the CCD. In this way, it is possible to record images of reflected visible light (Vis), reflected infrared light (IRr), reflected ultraviolet light (UVr) and UV-induced fluorescence (UVf). This set of images provides information about the optical behaviour of the surface under the different types of light, and therefore helps to distinguish the original portions of the wall paintings from recent repainting.
For illustration purposes, we select an area of the south wall of Margaret's dressing room, shown in Figure 1. This area presents a large white letter P (for Philip), a motif that covers the walls, which are painted in green. It is represented by four modalities: VIS, UVF, UVR and IRR. Each modality measures 3744 × 5616 pixels. The IRR modality shows very well the non-original parts of the green surface. The UV-induced fluorescence modality shows a relatively strong fluorescence corresponding to the remains of an old, original paint layer over the white. The UVR image helps to separate repainted from original areas over the white of the letter P.

Experiments
The adopted methodology can be divided into four steps, as illustrated in Figure 2. It starts with the pre-processing, which aligns each image with the VIS image used as the reference image.

Figure 2 A representative illustration of the workflow
In the second step, four classes are identified: White original (WO), White repainted (WR), Green original (GO) and Green repainted (GR). Spectral and dense SURF information is then extracted and used jointly as the input of the SVM classifier with the RBF kernel. In parallel, saliency information is extracted using the method proposed in [34]; the resulting maps are shown in Figure 3.

Figure 3 Saliency maps
The third step is the pre-classification: the SVM classifier is applied to the images in order to recover the probability matrices of pixels belonging to the classes. Each modality highlights the presence of one or two classes. The UV-induced fluorescence modality shows a relatively strong fluorescence corresponding to the remains of an old painted layer of the white (WO), which reaches an accuracy of 92% using SVM; the UVR modality also emphasizes the WO class, with a classification accuracy of 98%. Infrared light shows very well the original and repainted parts of the green surface and reaches an accuracy of 94% [32]. The resulting maps are presented in Figure 4.

Figure 4 Multimodal SVM Classification
The VIS modality reaches an accuracy of 98% when classifying the two classes GO and GR, whereas this accuracy drops when classifying four classes because of the increased conflict. The classified image is presented in Figure 5.
The last step is the fusion process, which starts with the definition of the frame of discernment $\Theta = \{WO, WR, GO, GR\}$. The information obtained from the SVM classification and from the saliency maps provides constraints that can be taken into account to reflect the real situation and to reduce the hyper-power-set $D^\Theta$, for example by setting the intersection of two classes that cannot overlap to ∅.
Then the mass functions associated with the emphasized class and its complement in each modality are computed using equation 10. The PCR6 combination rule is used to combine the calculated masses according to equation 11 and, as a final task, the decision is taken using the maximum of DSmP.
The final classified map provided by DSmT only is given in Figure 6, and the final classified map obtained using DSmT-Salience is shown in Figure 7. The results improve with the integration of the perceptual model in the DSmT process: visual analysis of the classification maps shows that, for the WR and WO classes, the result of the proposed method agrees much better with the ground truth and appears closer to reality than the result obtained using DSmT only, while the map obtained from a unimodal image is degraded in terms of smoothness and connectivity between classes.
To evaluate the performance of the methods and compare their results, we use the Overall Accuracy (OA), which is the percentage of correctly classified pixels, and the Mean Error Rate (MER), which is the percentage of misclassified pixels. Table 1 summarizes the results obtained with the different methods. The proposed method produces the best overall accuracy, 95.39%, compared with 91.46% for the DSmT classification and 86.43% for the SVM classification. In terms of error rate, the proposed method gives the lowest MER, 4.61%, compared with 8.53% and 12.60% for the DSmT classification and the SVM classification respectively.
In conclusion, the use of DSmT with the PCR6 combination rule provides a better result thanks to its effectiveness in managing the conflicting information provided by the different sources, and shows a significant classification improvement over the unimodal SVM classification. Moreover, the integration of saliency information in the fusion process brings a real benefit, which we attribute to the powerful mechanisms of the human brain in classification tasks.
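The two evaluation measures follow directly from their definitions as percentages of correctly classified and misclassified pixels. The sketch below is a minimal illustration with invented labels; the function names are hypothetical, and the error rate is simply computed as the share of mismatching pixels, which is an assumption about how the MER was obtained.

```python
def overall_accuracy(predicted, reference):
    """Percentage of pixels whose predicted class matches the ground truth."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


def error_rate(predicted, reference):
    """Percentage of pixels whose predicted class differs from the ground truth."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    wrong = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * wrong / len(reference)


# Invented labels for four pixels.
reference = ["WO", "WR", "GO", "GO"]
predicted = ["WO", "WR", "GO", "GR"]
print(overall_accuracy(predicted, reference), error_rate(predicted, reference))
```

In practice the label lists would be the flattened classification map and the flattened ground-truth map, restricted to the pixels for which ground truth is available.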

Conclusion
In this paper, we have proposed a new method for multimodal image classification. As a first step, we extract spatial (dense SURF), spectral and saliency information. The extracted spatial and spectral information is combined and passed to the SVM classifier for the pre-classification step. The SVM classification results obtained from each modality are then fused using DSmT; the joint use of DSmT and SVM provides better performance than the unimodal SVM classification. In the second step, the extracted saliency information is modeled and combined with the SVM classification results using a DSmT process based on the PCR6 combination rule and the DSmP decision rule. The proposed method yields the best performance in terms of accuracy and error rate compared with the DSmT-SVM classification and the unimodal SVM classification.

Acknowledgement
The authors thank the Château de Germolles managers for providing data and expertise and the COST Action TD1201 "Colour and Space in Cultural Heritage (COSCH)" (www.cosch.info) for supporting this case study. The authors also thank the PHC Toubkal/16/31: 34676YA program for the financial support.

[2] J. Huang, S. You and J. Zhao, "Multimodal image matching using self similarity," Applied Imagery Pattern