I Introduction
Learning good feature representation from unlabeled data is the key to make progress in recognition and classification tasks, and has attracted great attention and interest from both academia and industry recently. A representative method for this is the deep learning (DL) approach
[1]with its goal to learn multiple layers of abstract representations from data. Among others, one typical DL method is the so called convolutional neural network (ConvNet), which consists of multiple trainable stages stacked on top of each other, followed by a supervised classifier
[2] [3]. Many variations of ConvNet network have been proposed as well for different vision tasks [4][5][6][7] [8] with great success.In these methods layers of representation are usually obtained by greedily training one layer at a time on the lower level [5] [9] [3]
, using an unsupervised learning algorithm. Hence the performance of singlelayer learning has a big effect on the final representation. Neural network based singlelayer methods, such as autoencoder
[10]and RBM (Restricted Boltzmann Machine,
[11]), are widely used for this but they usually have many parameters to adjust, which is very timeconsuming in practice.That motivates more simple and more efficient methods for singlelayer feature learning. Among others Kmeans clustering algorithm is a commonly used unsupervised learning method, which maps the input data into a feature representation simply by associating each data point to its nearest cluster center. There is only one parameter involved in the Kmeans based method, i.e., the number of clusters, hence the model is very easy to use in practice. Coates et al. [12]
shows that the Kmeans based feature learning network is capable to achieve superior performance compared to sparse autoencoder, sparse RBM and GMM (Guassian Mixture Model). However, the Kmeans based feature representation may be too terse, and does not take the nonuniform distribution of cluster size into account  Intuitively, clusters containing more data are likely to be part of the features with higher influential power, compared to the smaller ones.
In this paper, we proposed a SVDD (Support Vector Data Description,
[13], [14], [15]) based method to address these issues. The key idea of our method is to use SVDD to measure the density of each cluster resulted from Kmeans clustering, based on which more robust feature representation is built. Actually the Kmeans algorithm lacks a robust definition of the size of its clusters, since the nearest center principle is not robust against the noise or outliers commonly encountered in real world applications. We advocate that the SVDD could be a good way to address this issue. Actually SVDD is a widely used tool to find a minimal closed spherical boundary to describe the data belonging to the target class and therefore, given a cluster of data, we are expecting SVDD to generate a ball containing the normal data except outliers. Performing this procedure on all the clusters of Kmeans, we will finally obtain
SVDD balls on which our representation can be built. In addition, to take the cluster size into account, we use the distance from the data to each ball’s surface instead of the center as the feature.One possible problem of this method, however, may come from the instability of SVDD’s center, due to the fact that its position is mainly determined by the support vectors on the boundary and the noise in the data may deviate the center far from the mode (c.f., Fig. 3 (left)). Hence the resulting SVDD ball may not be consistent with the data’s distribution when used for feature representation. To address this issue, we add a new constraint to the original SVDD objective function to make the model align better with the data. In addition, we show that our modified SVDD can be solved very efficiently as a linear programming problem, instead of as a quadratic one. Usually we need to compute hundreds of clusters, and a linear programming solution can thus save us large amounts of time. The proposed method is further extended by adopting a set of receptive fields with different sizes to capture multiscale information ranging from detailed edgelike features to partlevel features. A preliminary version of this work appeared in [16]
, and the feasibility and effectiveness of the proposed CSVDDbased method (called CSVDDNet) is verified extensively on several object recognition and image retrieval benchmarks with competitive performance.
The remaining parts of this paper are organized as follows: In Section II, preliminaries are provided regarding unsupervised feature learning representation, then we detail our improved feature learning method in Section III. In Section IV, we investigate the performance of our method empirically over several popular datasets. We conclude this paper in Section V.
Ii Unsupervised Feature Learning
The goal of unsupervised feature learning is to automatically discover useful hidden patterns/features in large datasets without relying on a supervisory signal, and those learnt patterns can be utilized to create representations that facilitate subsequent supervised learning (e.g., object classification). Compared to supervised learning, unsupervised learning has its unique characteristics and advantages. Among others, one of the most important one is that it can be used to learn consistent patterns from unlabelled data, which are often free and easy to obtain. Such patterns distinguish from noise since by definition noise can be thought of as random variations presented in the data. This implies many potential applications of unsupervised learning, e.g., to transfer knowledge from one domain to another related domain, to regularize the behavior of a supervised algorithm, and to represent the data in a compact but effective manner. Due to these reasons, unsupervised learning are regarded as the future of deep learning
[17].There are many kinds of unsupervised learning methods in computer vision, such as Bag of Words (BoW)
[18], Vector of Linearly Aggregated Descriptors (VLAD) [19], Fisher vector (FV) [20], and so on. A typical pipeline for unsupervised feature learning includes three steps. The first step is to train a set of local filters from the unlabeled training data. This is usually done by running Kmeans (for BoW, VLAD) or GMM (for FV) on lots of local patches sampled from the dataset and then using the centers of clusters as filter bank. The second step is to partition a given image into patches and encode them into a set of feature vectors using the learnt filter bank. These feature vectors are finally combined and normalized as the feature representation for the input image. In what follows we give a brief review on these methods.Bag of Words and Its Variants The simple and basic unsupervised feature learning method is the BoW model. In this model local filters are usually the centers of clusters from Kmeans. These filters are looked as bins, which serves to pool the local patches nearest to them. This can be regarded as a ”hard voting” method:
(1) 
where is the value that a patch was encoded as with the kth filter . In BoW, we simply count the number of patches in each bin to get a histogram representation. Thus it is a very coarse way to encode the information of an input image.
Alternatively, VLAD [19] and FV [20] encode each data point with a vector instead of a simple count number as in BoW, which effectively improve the richness and robustness of the feature representation. Particularly, FV captures the first and second order difference between an input and the centres of a GMM, denoted as ,
(2) 
Then the FV coding for an local patch is a vector of . VLAD [19] is a simplified version of FV, with the difference signal between a patch and a filter defined as . As in FV, these difference signals are concatenated into a dim vector for feature representation. Obviously, both FV and VLAD encode much richer information than that of BoW, hence being more discriminative in subsequent tasks such as object classification.
Coates et al.’s Method To the best of our knowledge, the work of [12] is the first “deep” unsupervised learning method that are based on the Kmeans method, hence having close connection with the aforementioned BoW, VLAD and FV methods. Particularly, after learning a filter bank, instead of using it as basin of attraction like in BoW or as references for calculating difference vectors, it is utilized to generate a series of feature maps, one for each filter. This has at least two potential advantages: 1) compared to VLAD and FV, the encoded information is even more rich; 2) the feature maps preserve the spatial information well and hence the whole procedure could be repeated, leading to a deep unsupervised learning architecture.
Furthermore, to deal with the problem of “hard coding” in Kmeans, the following “triangle” encoding is proposed [12]:
(3) 
where , and is the mean of the elements of
. This activation function outputs
for the feature that has an above average distance to the centroid . This model leads to a less sparse representation (roughly half of the features could be set to be). Note that this “triangle” encoding strategy essentially allows us to learn a distributed representation using the simple Kmeans method instead of more complicated networkbased methods (e.g., autoencoder and RBM), hence saving much time in training. Coastes et al.
[12] shows that this strategy actually leads to comparable performance to, if not better than, those based on network methods.However, this method does not take the characteristics of each cluster into consideration. Actually, the number of data point in each cluster is usually different, so is the distribution of data points in each cluster. We believe that these differences would make a difference in feature representation as well. Unfortunately the aforementioned Kmeans feature mapping scheme completely ignores these and only uses the position of center for feature encoding. As shown in Fig. 1, although the data point has the same distance to the centers and of two clusters, it should be assigned a different score on than on since the former cluster is much bigger than the latter. In practice such unequal clusters are not uncommon and the Kmeans method by itself can not reliably grasp the size of its clusters due to the existence of outliers. To this end, we propose a SVDD based method to describe the density and distribution of each cluster and use this for more robust feature representation.
Iii The Proposed Method
In this section, after presenting an overview of the proposed method, we give the details of our CenteredSVDD method for feature encoding, and compare it with the Kmeans “triangle” encoding method. Then we describe our SIFTbased postpooling layer and discuss how to extend the method to extract multiscale information.
Iiia Overview of the Proposed Method
A typical singlelayer network contains several components: an input image is first mapped into a set of feature maps using filter banks (or dictionary), which are then subjected to a pooling/subsampling operation to condense the information contained in the feature maps. Finally, the pooled feature maps are concatenated to a feature vector, which serves as the representation for the subsequent classification/cluster tasks. There are several design options in this procedure, where the size of filter bank and that of the pooling grids are the major tradeoff one has to make.
Generally speaking, bigger filter banks help each sample find its nearby representative points more accurately but at the cost of yielding a highdimensional representation, hence a crude pooling/subsampling is needed to reduce the dimensionality. Overall this type of architecture emphasizes more on the global aspects of the samples than on the local ones (e.g., local texture, local shape, etc.). Actually, Coates et al. show that this kind of network is able to yield state of the art results on several challenging datasets [12]. On the other hand, other works use smaller filter banks but highlight the importance of detailed local information in constructing the representation, usually based on some complicated feature encoding strategy, as done in PCANet [21] or Fisher Vector [22].
In this work, we follow the second design choice, based on the consideration that the learned representation should preserve enough local spatial information for the subsequent processing. Compared to [12], we use an improved feature encoding method named CSVDD (detailed in the next section) and adopt the architecture of relatively small dictionary. Different to [21] or [22], we learn filter banks for feature encoding but add a SIFTbased postpooling processing procedure onto the network, which essentially projects the responses of a pooling operation into a more compact and robust representation space.
IiiB Using SVDD Ball to Cover Unequal Clusters
Assume that a dataset contains data objects, {, and a ball is described by its center and the radius . The goal of SVDD (Support Vector Data Description, [13]) is to find a closed spherical boundary around the given data points. In order to avoid the influence of outliers, SVDD actually faces the tradeoff between two conflicting goals, i.e., minimizing the radius while covering as many data points as possible. This can be formulated as the following objective,
(4) 
where the slack variable represents the penalty related with the deviation of the th training data point outside the ball, and is a user defined parameter controlling the degree of regularization imposed on the objective. With the KKT conditions, we have , i.e., the center of the ball is a linear combination of the data . The dual function of Eq.( 4) is
(5) 
where and are Lagrangian multipliers. By solving the quadratic programming problem we can get the center and the radius .
The SVDD method can be understood as a type of oneclass SVM and its boundary is solely determined by support vectors points. SVDD allows us to summarize a group of data points in a nice and robust way. Hence it is natural to use SVDD ball to model each cluster from Kmeans, thereby combining the strength of both models. In particular, for a given data point we first compute its distance to the surface of each SVDD ball , and then use the following modified “triangle” encoding method for feature representation (c.f., E.q.( 3)),
(6) 
where is the distance from the point to the surface of the th SVDD ball, while is the average of the values .
Shown in Fig. 2 for a data point , respectively are the centroids of two SVDD balls with being the radius. Since the distances from to and are equal, will be assigned the same scores on the two ball with the Kmeans scheme (c.f., E.q.( 3)). However, if we take the density and size of the clusters into accounts, the score from should be higher in our method.
IiiC The CSVDD Model
Although SVDD ball provides a robust way to describe the cluster of data, one unwelcome property of the ball is that it may not align well with the distribution of data points in that cluster. As illustrated in Fig. 3 (left), although the SVDD ball covers the cluster
well, its center is biased to the region with low density. This should be avoided since it actually gives suboptimal estimates on the distribution of the cluster of data.
To address this issue, inspired by the observation that the centers of Kmeans are always located at the corresponding mode of their local density, we propose to shift the SVDD ball to the centroid of the data such that it may fit better with the distribution of the data in a cluster. Our new objective function is then formulated as, ^{1}^{1}1We choose the squared L2 norm distance as a convenient for optimization. There are also other robust distance such as nonsquared L2 norm distance [23].
(7) 
and its Lagrange function is as follows,
(8) 
where and are the corresponding Lagrange multipliers. According to KKT Conditions, we have,
(9) 
(10) 
Taking Eq.(9) and Eq.(10) into the Lagrange function (8) we get that
Recalling that , one has the following dual function,
(11) 
This can be reformulated as
(12) 
where . This objective function is linear to , and thus can be solved efficiently with a linear programming algorithm.
Since the model is centered towards the mode of the distribution of the data points in a cluster, we named our method as CSVDD (centeredSVDD). Fig. 3 shows the difference between SVDD and CSVDD, where the left is from SVDD and the right from CSVDD. We can see that our new model aligns better with the density of the data points, as expected. It is also worth mentioning that the normalization parameter plays an important role in our model  a larger value would allow more noise to enter the ball, while , the CSVDD model actually reduces to the naive singlecluster Kmeans. More discussions on setting this value empirically will be given in Section IV.
After the model is trained, we use the modified “triangle” encoding (E.q. 6) for feature encoding, with almost the same computational complexity with its Kmeans counterpart.
IiiD Kmeans Encoding vs. CSVDD Encoding
To this end, it will be useful to take a brief discussion on the difference of two kinds of feature maps, i.e., Kmeansbased “triangle” encoding (E.q. 3) and our CSVDDbased one ^{2}^{2}2Hereinafter we will call them respectively “Kmeans encoding” and “CSVDD encoding” for short without confusion.. For this a pilot experiment is conducted. Particularly, we learn a very small dictionary containing only five atoms using five face images, by clustering ZCAwhitened patches randomly sampled from the faces, and then take these for feature encoding. Fig.4 illustrates the face images used for dictionary learning (top) and the five learnt atoms (leftmost). The feature maps of face images encoded by the Kmeans encoding method and those by the CSVDD encoding method are respectively shown in Fig.4 (a) and Fig.4 (b), where each row is corresponding to one dictionary atom next to it and each column corresponding to one face.
By comparing the feature maps shown in Fig.4 (a) and Fig.4 (b), one can see that the CSVDDbased ones contain more detailed information than the Kmeans feature maps for the first three atoms, while the responses of the last two atoms are largely suppressed by our method (c.f., last two rows of Fig.4 (b)). To further understand this phenomenon, we plot the entropy of each atom (by treating them as a small image patch) in Fig.5 (c). The figure shows that the entropy of the last two atoms is much smaller than that of the first three ones, which indicates that the local appearance patterns captured by these last two atoms are much simpler than those by the first three. Hence these two atoms will tend to be widely used by many faces, resulting in reduced discriminative capability in distinguishing different subjects. In this sense, it will be useful to suppress their responses (c.f., the last two rows of Fig.4 (b)).
It is also useful to inspect the distribution of local facial patches attracted by these atoms. Fig.5 (a) gives the results. It can be seen that this distribution is not uniform and the number of local patches attracted by the fourth atom is significantly larger than those by other atoms. As a result, for Kmeans encoding method, the feature maps yielded by this atom show much more rich details than others (see the fourth row of Fig.4 (a)), potentially indicating that it could play more important roles than others in the subsequent classification task. However, as explained above, since this atom actually contains much less information than the first three atoms (low entropy and being a “common word”), it is really not good to overemphasize its importance in feature encoding.
This drawback of Kmeans feature mapping is largely bypassed by our CSVDDbased scheme. As shown Fig.5 (b), the fourth atom actually represents a very small cluster. In fact, the radius of CSVDD ball corresponding to the more informative atom tends to be large, and one major advantage of our CSVDDbased strategy is that it is capable to exploit this characteristic of dictionary atoms for more effective feature encoding, as shown in the first three rows of Fig.4 (b). This partially explains the superior performance of the proposed CSVDD method compared to its Kmeans counterpart (c.f., experimental results in Section IV).
IiiE Encoding Feature Maps with SIFT Representation
Traditional unsupervised methods like Bag of Words (BoW) model [18] usually generate a global feature representation by simply histogramming over local codings, while ignoring spatial relationship between local patches.
One problem of preserving spatial information in feature representation is due to the huge dimension of feature maps. Suppose that the size of a receptive field is , and the size of an input image is . After densely extracting patches and encoding them, we would obtain feature maps, one for each filter, with each of size (). Particularly, for small images with , and a small dictionary with size and with size of its filter , the resulting dimension of feature maps is nearly
, which is too large for many applications. One can use such methods as average pooling or max pooling to reduce the size of feature maps. For
sized pooling blocks, the size of a feature map can be reduced to . In the above example if we set , the dimension of each map becomes , which is still too big when concatenating maps. However, if we choose a bigger pooling window, the more spatial information will be lost.In this paper we proposed a variant of SIFTrepresentation to address the above issues. SIFT is a widely used descriptor in computer vision and is helpful to suppress the noise and improve the invariant properties of the final feature representation. The way we get SIFT representation is not in general way which extracts a 128bit SIFTdescriptors densely. This will also cause to a very high dimensionality. For example, if we extract 128 dimensional SIFTdescriptors densely in feature maps with the size of in pixel, the dimension of the obtained representation vector will be as high as over M (). To address this issue, we first divide each feature map into blocks and then only extract an 8bit gradient histogram from each block in the same way as SIFT does. This results in a feature representation with dimension of for each map (Such as if m=3, then the dim is only 72bit). In this way we significantly reduce the dimensionality while preserving rich information for the subsequent task.
IiiF Multiscale Receptive Field Voting
Next we extend our method to exploit multiscale information for better feature learning. A multiscale method is a way to describe the objects of interest in different sizes of context. This would be useful since patches of a fixed size can seldom characterize an object well  actually they can only capture local appearance information limited in that size. For example, if the size is very small, information about edges could be captured but the information on how to combine these into more meaningful patterns such as motifs, parts, poselets, and object, is lost, while information about these entities at different levels is valuable in that they are not only discriminative by itself but complementary to each other as well. Most popular manually designed feature descriptors, such as SIFT or HoG, address this problem to some extend by pooling image gradients into edgletslike features, but it is still unclear, for example, how to assemble edglets into motifs using these methods. Convolutional neural network provides a simple and comprehensive solution to this issue by automatically learn hierarchies of features ranging from edglets to objects. However, during this procedure, information on where those highlevel patterns are found becomes more and more ambiguous.
Because our CSVDDNet is a singlelayer network, it is difficult to learn multiscale information in a hierarchical way. Instead, we take a naive way to obtain multiscale information by using receptive fields of different sizes. In particular, we fetch patches with squares in size from training images and use these to train dictionary atoms with corresponding size through Kmeans. Fig.6 shows some examples of atoms we learnt on a face dataset. One can see that these feature extractors are similar to those learnt using a typical ConvNet. Specifically, with the increasing window size, the learnt features become more understandable  for example, as shown in Fig.6 (c), using a receptive field with size of on face images of , we successfully learned facial parts such as the eyes, the mouth, and so on, while a smaller receptive field gives us some oriented filters, as shown in Fig.6 (a). At each scale we train several networks with different pooling window. One advantage of this method is that it is very efficient to learn and is effective in capture salient features in a multiscale context. However, it will not tell us how the bigger patterns are explained by smaller ones  such information would be useful from a generative angle.
To use the learnt multiscale information for classification, we train a separate classifier on the output layer of the corresponding network (view) according to different receptive sizes and different pooling sizes, then combine them under a boosting framework. Particularly, assume that the total number of categories is , and we have scales (with different number of pooling sizes for each scale), then we have to learn output nodes. These nodes are corresponding to multiclass classifiers. Let us denote the parameter of the th classifier ( is the dimension of feature representation) as , where is the weight vector for the th category. We first train these parameters using a series of oneversusrest SVM classifiers, and then normalize the outputs of each classifier using a soft max function,
(13) 
Finally, the normalized predictions are combined to make the final decision,
(14) 
where is the output vector of the th classifier, and the corresponding combination coefficients are trained using the following objective,
(15) 
This is the same type of oneversusrest SVM mentioned before.
Iv Experiments AND Analysis
To evaluate the performance of the proposed CSVDDNet, we conduct extensive experiments on four datasets including two object classification datasets(STL10 [12], MINST [2]) and two image retrieval datasets(Holiday [24], INRIA Copydays [25]).
Iva Experiment Settings
All the images undergo whitening preprocessing before feeding them into the network. The whitening operation linearly transforms the data such that their covariance matrix becomes unit sphere, hence justifying the Euclidean distance we use in the Kmeans clustering procedure.
Unless otherwise noted, the parameter settings listed in Table.I apply to all experiments. The influence of some important parameters, such as the number of filters, will be investigated in more detail in the subsequent sections. For single scale network the receptive field is set to be by default across all the datasets, as recommended in [12], while in multiscale version, we use receptive fields in three scales, as shown in Table.I.
For CSVDD ball there is a regularization parameter to set. This parameter allows us to control the amount of noise we are willing to tolerant to. As can be seen from E.q.1, a small value encourages a tight ball. We set by default for most datasets except for those with too noisy background are set to 0.005. Furthermore, the centers in CSVDD are set as the same as those in kmeans, so that we can safely ignore the effect of the initialization of kmeans.
Throughout the experiments, we use Coates’ Kmeans “triangle” encoding method [12] (c.f., Section LABEL:subsec_softcoding) as baseline (denoted as ‘Kmeans’), while its direct counterpart method by simply replacing “triangle” encoding with CSVDD encoding is denoted as ‘CSVDD’. Furthermore, we denote the proposed single layer network as ‘CSVDDNet’, and its multiscale version as ‘MSRV + CSVDDNet’. In addition, we reevaluate the baseline method [12] within the proposed network by replacing its component of CSVDD with the Kmeansbased encoding, denoted as ‘KmeansNet’.
Parameter  value 

#clusters  500 
Size of receptive field  
size of average pooling  
of CSVDD  
*default setting for the nonmultiscale network. 
IvB Analysis of the Proposed Method
First of all we conduct extensive experiments on the STL10 dataset to investigate the behavior of the proposed method. The STL10 is a large image dataset popularly used to evaluate algorithms of unsupervised feature learning or selftaught learning. Besides 100,000 unlabeled images, it contains 13,000 labeled images from 10 object classes, among which 5,000 images are partitioned for training while the remaining 8,000 images for testing. All the images are color images with pixels in size. There are 10 predefined overlapped folds of training images, with 1000 images in each fold. In each fold, a classifier is trained on a set of 1000 training images, and tested on all 8000 testing images. In consistence with [12], we report the average accuracy across 10 folds. For unsupervised feature learning we randomly select 20,000 unlabeled data. The size of spatial pooling is , hence the size of feature maps fed for SIFT representation is . For multiscale receptive voting we use 2 scale ( and ), on each of which we perform spatial pooling in 5 sizes ranging from to .
Do we really need a large number of local features?
By the number of features, we mean the number of filters
used for feature extraction, which is equal to the number of dictionary atoms. One of the major conclusions of Coates et al.’s series of controlled experiments on single layer unsupervised feature learning network
[12] is that compared to the choice of particular learning algorithm, the parameters that define the feature extraction pipeline, especially the number of features, have much more deep impact on the performance. Using a Kmeans network with 4000 features, for example, they are able to achieve surprisingly good performance on several benchmark datasets  even better than those with much deeper architectures such as Deep Boltzmann Machine [26] and Sparse Autoencoder [12].However, one drawback accompanying this large dictionary is that a very crude pooling size has to be adopted (e.g., over feature maps) to condense the resulting feature maps, otherwise the dimensionality of the final feature representation could be prohibitively high. For example, a pooling over 4000 feature maps with in size would lead to a total number of features over 3.8M. Hence the first question we investigate is that whether such a large number of features are really needed all the time?
Fig.7 gives the performance curves according to varying number of features with different methods on the STL10 dataset. Besides the aforementioned methods, in this figure we also give the results of random dictionary (i.e, local dictionary atoms are obtained randomly without being fine tuned by kmeans, denoted as “Random” ) and of the combination of random dictionary and SIFT representation (denoted as ’RandomNet’).
It can be seen that with the increasing number of features, the performance of both Kmeans and CSVDD methods rises, which is consistent with the results by Coates et al. [12]
. One possible explanation is that since both Kmeans encoding and CSVDD encoding use the learnt dictionary to extract nonlinear features, more dictionary atoms help to disentangle factors of variations in images. In our opinion the capability to learn a large number of atoms at relatively low computational cost is one of the major advantages of Kmeans based methods for unsupervised feature learning over other algorithms such as Gaussian Mixture Model (GMM), sparse coding, and RBM. For example, it is difficult for a GMM to learn a dictionary with over 800 atoms
[12].On the other hand, a too large dictionary can increase the redundancy and decrease the efficiency. Hence it is desirable to reduce the number of features while not hurting the performance too much. Fig.7 shows that our CSVDD encoding method consistently works better than the Kmeans encoding at different number of features, and combining CSVDD encoding and SIFTbased representation dramatically reduces the needs for large dictionary without scarifying the performance. Actually, Table.II and fig.7 show that using our CSVDD encoding and the SIFT feature representation, the dictionary size reduces by 10 times (from 4,800 [27] to 500) while the performance improves by 12% (from 53.80% [27] to 65.92%).
As for the random dictionary (denoted as ’Random’ and ’RandomNet’ in Fig.7), it is interesting to see that when the number of atoms is small, random atoms perform much worse than those finetuned by kmeans. But as the size of dictionary increases, the performance difference between the random dictionary and Kmeans dictionary begins to reduce. For example, at 500 features, using random atoms gives a performance of 54.77%, slightly worse than that of kmeans (56.63%), and the performance of RandomNet (62.45%) is also close to that of KmeansNet (63.07%). However the performance of both random methods is all much lower than that of the CSVDD based methods.
Effect of the pooling size
To investigate the effect of different pooling sizes on the performance using the proposed method, we conduct a series of experiments on the STL10 dataset. Particularly, for a 96 96 original image, we use a receptive field of 5 5 in pixel for feature extraction and obtain a layer of feature maps with 92 92. The pooling blocks are set to be such that the size of final feature maps after pooling is . We vary from 1 1 to 31 31 and record the yielded accuracy. Fig.8 gives the results under different settings. We can see from the figure that generally for the one layer Kmeansbased network we need bigger block sizes for improved translation invariance, but adding a robust SIFT encoding layer after pooling effectively reduces the needs for large pooling size while obtaining better performance. One possible reason is that this tends to characterize more detailed information of the objects to be represented.
Effect of the multiscale receptive field voting
Fig.9 gives the detailed accuracy of 10 representations using 2 sizes of receptive fields and 5 sizes of pooling blocks. One can see that different representation leads to different prediction accuracy but combining them leads to better performance. This shows that the representations captured with different receptive fields and pooling sizes are complementary to each other.
Contribution of components
To illustrate the contributions of the individual stages of the proposed method (i.e., CSVDDbased encoding, SIFTrepresentation and multiscale voting), we conduct a series of experiments on the STL10 dataset by removing each of the three main stages in turn while leaving the remaining stages in place (the comparison is thus against our full method). Fig.10 gives the results. In general each stage is beneficial and (not shown) the results are cumulative over the stages, but the SIFT stage seems to contribute most to the performance improvement. This suggests that taking spatial information into global representation is of importance.
IvC Object Classification
Algorithm  Accuracy() 

Selective Receptive Fields (3 Layers) [27] (2011) 
60.10 1.0 
Trans. Invariant RBM (TIRBM) [28] (2012)  58.70 
Simulated visual fixation ConvNet [29] (2012)  61.00 
Discriminative SumProd. Net (DSPN) [30] (2012)  62.30 1.0 
Hierarchical Matching Pursuit (HMP) [31] (2013)  64.50 1.0 
Deep Feedforward Networks [32] (2014)  68.00 0.55 
BoW(K = 4800 D = 4800)  51.50 0.6 
VLAD(K = 512 D = 40960)  57.60 0.6 
FV (K = 256 D = 40960)  59.10 0.8 
Kmeans (K = 4800 D = 19200) [27] (2011)  53.80 1.6 
CSVDD (K = 4800 D = 19200)  54.60 1.5 
KmeansNet (K = 500 D = 36000)  63.07 0.6 
CSVDDNet (K = 500 D = 36000)  65.92 0.6 
MSRV+KmeansNet  64.96 0.4 
MSRV+CSVDDNet  68.23 0.5 
Algorithm  Error() 

Deep Boltzmann Machines [26] (2009)  0.95 
Convolutional Deep Belief Networks [33] (2009) 
0.82 
Multicolumn deep neural networks [34] (2012)  0.23 
Network in Network [35] (2013)  0.47 
Maxout Networks [36] (2013)  0.45 
Regularization of neural networks [37] (2013)  0.21 
PCANet [21] (2014)  0.62 
DeeplySupervised Nets [38] (2014)  0.39 
Kmeans (1600 features)  1.01 
CSVDD (1600 features)  0.99 
KmeansNet (400 features)  0.45 
CSVDDNet (400 features)  0.43 
MSRV+KmeansNet  0.36 
MSRV+CSVDDNet  0.35 
STL10 dataset Table.II gives our results on the STL10 dataset. The major challenges of this dataset lie in that its images are captured in the wild with cluttered background, objects in various scales and poses. As before, we compared our method with several feature learning methods with state of the art performance. One can see that our one scale CSVDD network obtains 65.92% accuracy, using a filtering dictionary of 500 atoms, outperforms several other feature encoding methods, such as Bag of Words (BoW), Vector of Linearly Agregated Descriptors (VLAD), Fisher vector (FV) and other unsupervised deep learning methods (e.g, Trans. Invariant RBM (TIRBM) [28], Selective Receptive Fields (SRF) [27], and Discriminative SumProduct Networks (DSPN) [30]). This also indicates that spatial information preserving using SIFT is indeed useful in unsupervised feature learning. Also note that replacing the proposed CSVDD encoding with Kmeans encoding leads to nearly 3.0% performance loss, while fusing the multiscale information gives us about 2.3% improvement in accuracy, exceeding the current best performer [32] on this challenging dataset.
MINST dataset
The MNIST is one of the most popular datasets in pattern recognition. It consists of grey valued images of handwritten digits between 0 and 9. It has a training set of 60,000 examples, and a test set of 10,000 examples, all of which have been sizenormalized and centered in a fixedsize image with 28
28 in pixel. In training we use a dictionary with 400 atoms for feature mapping, and after pooling/subsampling we break each feature map into 9 blocks to extract SIFT features. For multiscale receptive voting, we use 3 types of receptive fields: , and . Combined these with two settings for pooling sizes (i.e., and , respectively), 6 different views/representations can be obtained for each image in this dataset.Table.III gives our experimental results on the MINST dataset. It is wellknown that deep learning has achieved great success on this task of digit recognition. For example, only 95 among 10,000 test digits are misclassified by the Deep Boltzmann Machines [26], while Convolutional Deep Belief Networks [33] and Maxout Networks [36] respectively reduce this number to 82 and 45. Our simple single layer network (MSRV+CSVDDnet) achieves an error as low as 0.35, which is highly competitive to other complex methods using deep architecture. Fig.11 shows all the 35 misclassified digits by our method, and one can see that these misclassified digits are very confusing even for human beings. Compared to the original Kmeans network [12], the proposed method reduces the error rate by 65%, with much smaller number of filters. This reveals that at least on this dataset with clean background, it is very beneficial to focus more on the representation of the details of the image, rather than emphasizing too much on its global aspects using a large number of filters and a large pooling size.
IvD Image Retrieval
Algorithm  K  D  Holidays (mAP %)  
best  D’ = 2048  D’ = 512  D’ = 128  D’ = 64  D’ = 32  
BoW [19] (2012)  20000  20000  45.2  41.80  44.9  45.2  44.4  41.8 
FV [19] (2012)  256  16384  62.6  62.6  57.0  53.8  50.6  48.6 
VLAD [19] (2012)  256  16384  62.1  62.1  56.7  54.2  51.3  48.1 
VLAD+adapt+innorm [39] (2013)  256  16384  64.6  —  —  62.5  —  — 
Kmeans  3200  12800  55.2  54.5  54.9  51.6  48.3  44.5 
CSVDD  3200  12800  57.4  56.8  57.0  53.1  50.5  46.8 
KmeansNet  256  8192  62.5  59.8  62.5  61.3  55.5  49.5 
CSVDDNet  256  8192  66.0  63.7  65.8  64.7  59.3  52.1 
MSRV+KmeansNet  256  81928  66.5  65.0  66.3  65.3  58.3  51.5 
MSRV+CSVDDNet  256  81928  70.2  68.6  69.8  68.5  62.5  53.8 
Algorithm  K  D  crop 50%  transformations  
best  D’ = 128  best  D’ = 128  
BoW [19] (2012)  20k  20k  100.0  100.0  54.3  29.6 
FV [19] (2012)  64  4096  98.7  92.7  59.6  41.2 
VLAD [19] (2012)  64  4096  97.7  94.2  59.2  42.7 
Kmeans  3200  12800  95.2  91.5  47.6  32.80 
CSVDD  3200  12800  97.4  93.8  52.2  36.60 
KmeansNet  256  8192  96.8  94.3  55.4  37.80 
CSVDDNet  256  8192  100.0  98.1  62.2  52.0 
MSRV+KmeansNet  256  81926  99.7  97.9  58.2  41.8 
MSRV+CSVDDNet  256  81926  100.0  100.0  65.6  55.3 
Holiday dataset INRIA Holiday dataset consists of 1491 images from personal holiday photos. There are 500 queries, most of which have 12 ground truth images. mAP (mean average precision) is employed to measure the retrieval accuracy. We resize all the images to 96 96. In training we use a dictionary with 256 atoms for feature mapping, and after pooling/subsampling we break each feature map into 4 blocks to extract SIFT features. Thus the dimension of final representation is 8196. And we also run PCA for dimensionality reduction as [19]. For multiscale receptive voting, we use 2 types of receptive fields: and . Combined these with four settings for pooling sizes (i.e., , , and , respectively), 8 different views/representations can be obtained for each image in this dataset. Note that in image retrieval task, we can not train classifiers so that we just concatenate all the views’ representations to combine multiscale information. In retrieval stage we use Euclidean distance in nearest neighbor searching as in [19] and [39], facilitating a fair comparison between various feature representation methods on this task.
Table.IV gives our experimental results on this dataset. We compare our method with BOW, VLAD, FV under different dimension ( reduced through PCA). BoW takes a 20k sized filter bank but has the lowest mAP (45.2%). Replacing BoW with Kmeans triangle encoding improves mAP by 10% (55.2%), but still needs a large filter bank of 3.2K.
Previous stateofart unsupervised feature learning methods, i.e., VLAD and FV [19], can achieve a high mAP of 62.1% 62.6% respectively. And both of them only take a small set of filters of size 256. In [39] Arandjelovic combines VLAD with adaptive filter bank and a new normalization to achieve an accuracy of 64.6%. Our proposed CSVDDNet can get a mAP of 66.0% with 256 filters as well. It outperforms VLAD by 3.9% and VLAD+adapt+innorm by 1.4%. Even if we reduce its dimension to smaller sizes with PCA, it consistently achieves the best performance among the compared ones.
Also note that replacing Kmeans encoding with CSVDD encoding results in significant improvement (from Kmeans 55.2% to CSVDD 57.4%, and from KmeansNet 62.5% to CSVDDNet 66.0%). When concatenating multiscale representation from 8 views, we are able to achieve the highest mAP of 70.2%, without using any supervision information.
Copydays dataset INRIA Copydays dataset was designed to evaluate nearduplicate detection [19]. The dataset contains 157 original images. To obtain query images relevant in a copy detection scenario, each image of the dataset has been transformed with three different types of transformation: image resizing, cropping (Here we use only the queries with the cropping parameter fixed to 50%), strong transformations (print and scan, occlusion, change in contrast, perspective effect, blur, etc). There is in total 229 transformed images, each of which has only a single matching image in the database. All images are resized to 75 75. We use 2 types of receptive fields and , together with three pooling sizes (i.e., , and respectively), which result 6 different views. To challenge ourself in this experiments we also merge the database with 10k web images as [19] does.
It is a large scale retrieval task. Table.V gives our experimental results on this dataset. We can see that in the 50% cropped circumstance, our CSVDDNet with only 256 filters can be robust enough to achieve a mAP of 100% as BoW with 20k filters. When reducing its dimension to 128 bits, it still performs the second best. In the strong transformation setting, our CSVDDNet achieves a mAP of 62.2% which outperforms VLAD (59.2%) and FV (59.6%) by nearly 3%.
Furthermore, one can see that CSVDD encoding allows our CSVDDNet improve upon KmeansNet by 6.8% in terms of mAP. When reduced to 128 bits, our CSVDDNet achieves a mAP of 52% under the difficult cases of strong transformation, which outperforms other compared methods by more than 10%, while our multiscale version improves the mAP by 3%.
V Conclusion
In this paper, we propose a simple onelayer neural network termed CSVDDNet for unsupervised feature learning. One of the major advantages of the proposed method is that it allows effective feature representation for many applications, such as object classification and image retrieval, by exploiting unlabeled data which are often cheap and readily available. We show that when properly combined with the SIFT descriptors, such representation could be made even more efficient and discriminant. Extensive experiments on several challenging object classification datasets and image retrieval datasts demonstrate that the proposed method significantly outperforms previous state of the art unsupervised feature learning methods such as Bag of Word, VLAD [19], and FV [20].
Additionally, we show that for feature representation, a very big dictionary is not necessary, as one could accumulate rich information in each feature map and preserve them with compact encoding (e.g, using the proposed method). This significantly reduces the computational cost. Last but not least, we show that one can use multiscale information to further improve the performance without training many layers of networks  after all training several shallow networks is much easier than training a deep one.
Acknowledgements
The work was financed by the National Science Foundation
of China (61073112), the National Science Foundation of Jiangsu Province (BK2012793), and the Doctoral Fund of Ministry of Education of China (20123218110033).
References
 [1] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” arXiv preprint arXiv:1206.5538, 2012.
 [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [3] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multistage architecture for object recognition?” in Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009, pp. 2146–2153.
 [4] J. Bruna and S. Mallat, “Invariant scattering convolution networks,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1872–1886, 2013.
 [5] Q. V. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. S. Corrado, J. Dean, and A. Y. Ng, “Building highlevel features using large scale unsupervised learning,” arXiv preprint arXiv:1112.6209, 2011.
 [6] G. Carneiro, J. C. Nascimento, and A. Freitas, “The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivativebased search methods,” Image Processing, IEEE Transactions on, vol. 21, no. 3, pp. 968–982, 2012.
 [7] P. P. San, S. H. Ling, Nuryani, and H. Nguyen, “Evolvable roughblockbased neural network and its biomedical application to hypoglycemia detection system.” Cybernetics IEEE Transactions on, vol. 44, no. 8, pp. 1338 – 1349, 2014.
 [8] L. Shuai and Y. Li, “Nonlinearly activated neural network for solving timevarying complex sylvester equation,” Cybernetics IEEE Transactions on, vol. 44, no. 8, pp. 1397 – 1407, 2014.
 [9] A. Agarwal and B. Triggs, “Hyperfeatures–multilevel local coding for visual recognition,” in Proc. Ninth European Conf. Computer Vision, 2006. Springer, 2006, pp. 30–43.
 [10] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.

[11]
M. A. Cueto, J. Morton, and B. Sturmfels, “Geometry of the restricted
boltzmann machine,”
Algebraic Methods in Statistics and Probability,(eds. M. Viana and H. Wynn), AMS, Contemporary Mathematics
, vol. 516, pp. 135–153, 2010.  [12] A. Coates, H. Lee, and A. Y. Ng, “An analysis of singlelayer networks in unsupervised feature learning,” Ann Arbor, vol. 1001, p. 48109, 2010.
 [13] D. M. Tax and R. P. Duin, “Support vector data description,” Machine learning, vol. 54, no. 1, pp. 45–66, 2004.
 [14] J. Xu, J. Yao, and L. Ni, “Fault detection based on svdd and cluster algorithm,” in Electronics, Communications and Control (ICECC), 2011 International Conference on. IEEE, 2011, pp. 2050–2052.

[15]
A. Banerjee and P. Burlina, “Efficient particle filtering via sparse kernel density estimation,”
Image Processing, IEEE Transactions on, vol. 19, no. 9, pp. 2480–2490, 2010.  [16] D. Wang and X. Tan, “Centering svdd for unsupervised feature representation in object classification,” in Neural Information Processing. Springer, 2013, pp. 376–383.
 [17] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [18] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in Workshop on statistical learning in computer vision, ECCV, vol. 1, no. 122. Prague, 2004, pp. 1–2.
 [19] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid, “Aggregating local image descriptors into compact codes,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 9, pp. 1704–1716, 2012.
 [20] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, no. 3, pp. 222–245, 2013.
 [21] T.H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “Pcanet: A simple deep learning baseline for image classification?” arXiv preprint arXiv:1404.3606, 2014.
 [22] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: an evaluation of recent feature encoding methods,” 2011.

[23]
F. Nie, J. Yuan, and H. Huang, “Optimal mean robust principal component analysis,” in
Proceedings of the 31st International Conference on Machine Learning (ICML14), 2014, pp. 1062–1070.  [24] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometry consistency for large scale image searchextended version,” 2008.
 [25] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, “Evaluation of gist descriptors for webscale image search,” in Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 2009, p. 19.

[26]
R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in
International Conference on Artificial Intelligence and Statistics
, 2009, pp. 448–455.  [27] A. Coates and A. Y. Ng, “Selecting receptive fields in deep networks.” in NIPS, vol. 5, 2011, p. 8.
 [28] K. Sohn and H. Lee, “Learning invariant representations with local transformations,” arXiv preprint arXiv:1206.6418, 2012.
 [29] W. Y. Zou, A. Y. Ng, S. Zhu, and K. Yu, “Deep learning of invariant features via simulated fixations in video.” in NIPS, 2012, pp. 3212–3220.
 [30] R. Gens and P. Domingos, “Discriminative learning of sumproduct networks.” in NIPS, 2012, pp. 3248–3256.
 [31] L. Bo, X. Ren, and D. Fox, “Unsupervised feature learning for rgbd based object recognition,” in Experimental Robotics. Springer, 2013, pp. 387–402.
 [32] B. Miclut, “Committees of deep feedforward networks trained with few data,” in Pattern Recognition. Springer, 2014, pp. 736–742.
 [33] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 609–616.
 [34] D. Ciresan, U. Meier, and J. Schmidhuber, “Multicolumn deep neural networks for image classification,” in Proc. IEEE Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3642–3649.
 [35] M. Lin, Q. Chen, and S. Yan, “Network in network,” CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
 [36] I. J. Goodfellow, D. WardeFarley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
 [37] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in Proceedings of the 30th International Conference on Machine Learning (ICML13), 2013, pp. 1058–1066.
 [38] C.Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeplysupervised nets,” arXiv preprint arXiv:1409.5185, 2014.
 [39] R. Arandjelovic and A. Zisserman, “All about vlad,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE, 2013, pp. 1578–1585.
Comments
There are no comments yet.