Sunday 1 May 2016

Cosine Similarity


Description:
              Cosine similarity is a metric that captures how close two entities are, based on the angle between their vector representations rather than the raw differences between their corresponding components. It is intuitively the opposite of a distance metric such as Euclidean distance or Manhattan distance: those metrics measure how different two vectors are, while cosine similarity measures how similar they are. Entities with high similarity yield high values when plugged into the cosine similarity function.
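
As a quick illustration of this distinction (a minimal sketch with made-up vectors), take two vectors that point in the same direction but differ greatly in magnitude: their Euclidean distance is large, yet their cosine similarity is a perfect 1.

import math

def euclidean_distance(a, b):
    # Straight-line distance: grows with any difference in magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based similarity: depends only on direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

a = [1, 2, 3]
b = [10, 20, 30]                   # same direction, 10x the magnitude
print(euclidean_distance(a, b))    # ~33.67: "far apart" by distance
print(cosine_similarity(a, b))     # 1.0: identical by direction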

              The feature that makes cosine similarity so popular, despite being a crude, basic similarity measure, is that it is independent of the number of dimensions present. If the vectors are two-dimensional, say x and y, cosine similarity is computed over those two dimensions. If the vectors have more than two dimensions, cosine similarity works exactly the same way, just iterating over each dimension. Moreover, cosine similarity also works over vectors with different numbers of dimensions. For example, if one vector has x and y as its two dimensions and the other has only x, cosine similarity treats y as 0 for the latter and proceeds normally, as the sketch below shows. This is mathematically sound: an n-dimensional vector has no component along the (n+1)th axis, and hence has value 0 in the (n+1)th dimension.
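
Here is a minimal sketch of that behaviour, assuming vectors are stored as dictionaries mapping dimension names to values (a representation chosen purely for illustration); any dimension missing from one vector is treated as 0:

import math

def cosine_similarity_sparse(a, b):
    # A dimension absent from one vector contributes 0 to the dot
    # product, so only the dimensions both vectors share matter there.
    shared = set(a) & set(b)
    dot = sum(a[dim] * b[dim] for dim in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

u = {"x": 3, "y": 4}   # two-dimensional
v = {"x": 5}           # one-dimensional: y is implicitly 0
print(cosine_similarity_sparse(u, v))  # 0.6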

The Formula:

                         \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \, \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }
                                                     Formula courtesy: Wikipedia

Where A_i and B_i are the i-th components of vectors A and B respectively, and n is the number of dimensions.
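
As a small worked example (with made-up vectors), take A = (1, 2, 3) and B = (4, 5, 6):

\cos(\theta) = \frac{(1)(4) + (2)(5) + (3)(6)}{\sqrt{1^2 + 2^2 + 3^2} \, \sqrt{4^2 + 5^2 + 6^2}} = \frac{32}{\sqrt{14} \, \sqrt{77}} \approx 0.9746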

Range and Analysis:
                     The range of cosine similarity, like that of the cosine function, is [-1, +1] inclusive. A value of -1 means the two entities point in exactly opposite directions. A value of +1 means they point in exactly the same direction. A value of 0 means they are orthogonal ("perpendicular") to one another, i.e. they share no directional component at all.
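
A minimal sketch of the three boundary cases, using NumPy for the dot product and norms:

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

e = np.array([1.0, 0.0])
print(cos_sim(e, np.array([2.0, 0.0])))    #  1.0: same direction
print(cos_sim(e, np.array([0.0, 1.0])))    #  0.0: orthogonal
print(cos_sim(e, np.array([-1.0, 0.0])))   # -1.0: opposite direction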


Uses:
                  Cosine similarity is widely used in data mining to determine the similarity between two vectors. I recently used this metric in a machine learning project to calculate similarity between users, where each user had age, latitude and longitude (of their address) as component dimensions. The goal was to predict book ratings: if user U has rated a book B as 5, and user V is similar to U, then user V is predicted to rate B as 5 as well. For interested readers, I also employed an uncertainty-based modification of this scheme: if user V is very similar to user U (who gave B a rating of 5) and even more similar to user W (who gave B a rating of 8), then it is less certain that V will rate around 5 than around 8. In other words, V is more likely to give a rating close to 8 than to 5.
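
The post does not include the project's code, so here is one plausible sketch of the idea (the feature vectors and ratings below are invented, and the similarity-weighted average is just one reasonable reading of the "uncertainty" modification): weight each neighbour's rating by their cosine similarity to the target user.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def predict_rating(target, neighbours):
    # neighbours: list of (feature_vector, rating_for_book_B) pairs.
    # More similar users pull the prediction harder toward their rating.
    weights = [cosine_similarity(target, feats) for feats, _ in neighbours]
    ratings = [rating for _, rating in neighbours]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)

# Abstract 2-D feature vectors, made up so the angles actually differ.
V = (1.0, 0.0)
U = ((1.0, 1.0), 5)   # similarity to V ~ 0.71, rated B as 5
W = ((1.0, 0.1), 8)   # similarity to V ~ 0.995, rated B as 8
print(predict_rating(V, [U, W]))  # ~6.75: pulled toward W's rating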

Conclusion:
                      Other similarity metrics, such as the Jaccard index and the Tanimoto score, also exist, but they are rooted in set theory: they are designed to deal with sets rather than vectors or raw data. Arguably, Jaccard and Tanimoto can be applied to vectors as well, but the inherent operations they perform are intersection and union (AND and OR in Tanimoto's case). Hence, I recommend cosine similarity (either in its original form or modified to suit the problem) for raw vector data.
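
For contrast, a minimal sketch of the Jaccard index on two made-up sets, built from set intersection and union rather than from vector components:

def jaccard_index(a, b):
    # |A intersect B| / |A union B|: the fraction of all observed
    # items that the two sets share.
    return len(a & b) / len(a | b)

liked_by_u = {"book1", "book2", "book3"}
liked_by_v = {"book2", "book3", "book4"}
print(jaccard_index(liked_by_u, liked_by_v))  # 0.5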
