Cosine Distance (Big Data Analytics / Machine Learning)
Cosine Similarity – Understanding the math and how it works
Cosine similarity is a metric used to measure how similar two documents are, irrespective of their size.
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.
Cosine similarity is advantageous because two similar documents can be far apart by Euclidean distance (simply because one is much longer than the other) and still point in nearly the same direction.
“The smaller the angle, the higher the cosine similarity.”
Introduction
A commonly used approach to match similar documents is based on counting the maximum number of common words between the documents.
But this approach has an inherent flaw: as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
The cosine similarity helps overcome this fundamental flaw in the ‘count-the-common-words’ or Euclidean distance approach.
What is Cosine Similarity and why is it advantageous?
Cosine similarity is a metric used to determine how similar two documents are, irrespective of their size.
Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors I am talking about are arrays containing the word counts of two documents.
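As a minimal sketch of the idea above, the cosine of the angle between two vectors u and v is their dot product divided by the product of their lengths: cos(theta) = (u . v) / (|u| |v|). The helper names `word_count_vectors` and `cosine_similarity` and the two toy sentences are assumptions made up for illustration, and the whitespace tokenisation is deliberately naive.

```python
import math
from collections import Counter


def word_count_vectors(doc_a, doc_b):
    """Return two aligned word-count lists over the combined vocabulary."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # One dimension per distinct word that appears in either document.
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy documents, made up for illustration only.
doc_1 = "cricket is a bat and ball game"
doc_2 = "cricket is a popular game"

vec_1, vec_2 = word_count_vectors(doc_1, doc_2)
similarity = cosine_similarity(vec_1, vec_2)
print(round(similarity, 4))
```

Because word counts are never negative, the similarity of two such vectors falls between 0 (no words in common) and 1 (same direction).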
As a similarity metric, how does cosine similarity differ from the number of common words?
When plotted on a multi-dimensional space, where each dimension corresponds to a word in the document, the cosine similarity captures the orientation (the angle) of the documents and not the magnitude. If you want the magnitude, compute the Euclidean distance instead.
Cosine similarity is advantageous because two similar documents can be far apart by Euclidean distance purely because of their size (for example, the word ‘cricket’ appears 50 times in one document and 10 times in another) and still have a small angle between them. The smaller the angle, the higher the similarity.
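The effect of document size can be sketched with two count vectors that have the same proportions but different lengths. Only the 50-versus-10 count for ‘cricket’ comes from the text; the vocabulary, the other counts, and the function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical counts over the vocabulary ["cricket", "bat", "score"].
short_doc = [10, 2, 4]
long_doc = [50, 10, 20]  # same word proportions, five times longer

dist = euclidean_distance(short_doc, long_doc)
sim = cosine_similarity(short_doc, long_doc)

print("Euclidean distance:", round(dist, 2))
print("Cosine similarity: ", round(sim, 4))
```

The Euclidean distance is large because one vector is five times longer than the other, while the cosine similarity is 1 (up to floating-point rounding) because both vectors point in the same direction.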
Cosine Similarity Example
Let’s suppose you have three documents about two star cricket players: Sachin Tendulkar and Dhoni. Two of the documents, (A) and (B), are the Wikipedia pages on the respective players, and the third document (C) is a smaller snippet from Dhoni’s Wikipedia page.
Since documents B and C come from the same page, we would expect the cosine distance (1 minus the cosine similarity) between document A and document B to be larger than the cosine distance between document B and document C, even though C is much shorter than B.
Example 1: Find the cosine distance between two documents represented by the following two vectors:
A=[3,8,7,5,2,9],
B=[10,8,6,6,4,5]
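Answer: The given data is as follows:
A=[3,8,7,5,2,9],
B=[10,8,6,6,4,5]
similarity(A,B) = (A·B) / (|A| × |B|)
A·B = 3×10 + 8×8 + 7×6 + 5×6 + 2×4 + 9×5 = 219
|A| = √(9+64+49+25+4+81) = √232 ≈ 15.2315
|B| = √(100+64+36+36+16+25) = √277 ≈ 16.6433
similarity(A,B) = 219 / (15.2315 × 16.6433) ≈ 0.8639
Cosine Distance = 1 - similarity(A,B)
= 1 - 0.8639 = 0.1361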
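The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are chosen here for illustration.

```python
import math

A = [3, 8, 7, 5, 2, 9]
B = [10, 8, 6, 6, 4, 5]

dot = sum(a * b for a, b in zip(A, B))      # 219
norm_a = math.sqrt(sum(a * a for a in A))   # sqrt(232)
norm_b = math.sqrt(sum(b * b for b in B))   # sqrt(277)

similarity = dot / (norm_a * norm_b)
distance = 1 - similarity

print(round(similarity, 4), round(distance, 4))  # 0.8639 0.1361
```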