Sunday 8 May 2016

Translate Numbers to Their English Phrases


Problem Statement:
                    Write a program that takes an integer between 0 and 99,999 as input and outputs its English phrase.
For example :

             Input: 77890
             Output: Seventy Seven Thousand, Eight Hundred, and Ninety

Breaking down the problem:
              Such programs are commonly used in NLP to communicate with a user. They rarely need complex data structures, nor are the programs complicated in themselves; they are mostly a chain of if-else conditions that handle particular cases. What is tested here is the programmer's sharpness: he/she should handle all the possible inputs.

              If one observes the above output, the commas and the word "and" are not associated with any digit in the number. However, according to the problem statement, they are very important. Such elements of the program, which need to be added for the sake of making the output "look better" or "closer to English", are often the main hurdles.
             
              Hence, there are two main sub-problems here. The first is to interpret the digits and convert them into their English words (taking care of their place in the number, like the units place, tens place, etc.), and the second is to take into account all the commas and the word "and" along with their placement.

Approach:
              The main hint here is the range of inputs - from 0 to 99,999. If one observes carefully, the last two digits and the first two digits always make one phrase in English. In the above input, the last two digits "90" form Ninety and the first two digits "77" form Seventy Seven (along with Thousand - their place value, but that's easy to predict). Hence, these will be considered in pairs. The third digit, i.e. the digit in the hundreds place, is always alone. There is always a "Seven Hundred" or a "Three Hundred". There is never a "Seventy Three Hundred". Hence, logically, this digit is to be considered alone.
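              As a quick illustration of this pairing, a minimal sketch in Python (divmod does the splitting):

                  n = 77890
                  thousands_pair, rest = divmod(n, 1000)         # 77 and 890
                  hundreds_digit, last_pair = divmod(rest, 100)  # 8 and 90
                  print(thousands_pair, hundreds_digit, last_pair)  # 77 8 90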

               Our first step should be to teach our program the English words, starting with the basics. The program needs to know that 0 becomes "Zero", 1 becomes "One", and so on. It also needs to know that 2 in the tens place (or in the higher position whenever pairs are considered by the above logic) becomes "Twenty", 3 in a similar position becomes "Thirty", and so on.

                English is a funny language. It is widely regarded as the toughest language for a program to learn, because of its sudden and seemingly illogical variations. For example, 11 has to become "Eleven" and not "Onety One". There is no reason why it cannot be "Onety One" for the sake of symmetry. But such is life, and hence our program now has to learn the phrases for the numbers from 11-19.
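                These word lists can be written down directly. A minimal sketch in Python (the name "unitsPlace" matches the array mentioned later in this post; "teens" and "tensPlace" are my own):

                    # Words for single digits (index 0 is revisited later in this post)
                    unitsPlace = ["Zero", "One", "Two", "Three", "Four",
                                  "Five", "Six", "Seven", "Eight", "Nine"]

                    # Irregular phrases for 10-19; index 0 corresponds to 10
                    teens = ["Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
                             "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]

                    # Words for the higher digit of a pair; pairs 10-19 use "teens" instead
                    tensPlace = ["", "", "Twenty", "Thirty", "Forty",
                                 "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]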

Trivia: The oldest Indian language, Sanskrit, is widely regarded as the easiest language for a program to learn owing to its symmetry.

                The next step is to implement the above approach. This is easy enough to do with a nested if-else tree. However, identifying the boundary conditions and all the cases and accommodating them was a pain to the author himself. For example, it is easy to overlook a small flaw in the above approach: if we take digits in pairs and look up words for them in an array, numbers with 0 as the lower digit become difficult. For example, 77 becomes "Seventy Seven", but 70 is just "Seventy" and not "Seventy Zero". Hence, we need to make the 0th index of our array "unitsPlace" an empty string. However, another issue now arises - how do we handle an input that is itself just 0? This can be handled by adding yet another condition that checks if the input is 0; if it is, the program simply prints "Zero" and exits. Any other 0s that might come will be translated to the empty string. Both of these fixes appear in the full sketch below.

                  The final step handles the commas and the word "and": the program must check that the phrase is not left hanging with ", and" as its last words. There needs to be some word after "and", or else the output is wrong.
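                  Putting the whole approach together, here is one possible implementation (a sketch in Python; the helper pair() is my own, and the comma placement follows the example output at the top of this post):

                      def translate(n):
                          """Translate an integer 0-99,999 into its English phrase."""
                          if n == 0:
                              return "Zero"  # special case: bare 0 is the only use of "Zero"

                          # Index 0 is the empty string, as discussed above
                          unitsPlace = ["", "One", "Two", "Three", "Four",
                                        "Five", "Six", "Seven", "Eight", "Nine"]
                          teens = ["Ten", "Eleven", "Twelve", "Thirteen", "Fourteen",
                                   "Fifteen", "Sixteen", "Seventeen", "Eighteen", "Nineteen"]
                          tensPlace = ["", "", "Twenty", "Thirty", "Forty",
                                       "Fifty", "Sixty", "Seventy", "Eighty", "Ninety"]

                          def pair(p):
                              # English phrase for a two-digit pair 0-99 ("" for 0)
                              if 10 <= p <= 19:
                                  return teens[p - 10]
                              return (tensPlace[p // 10] + " " + unitsPlace[p % 10]).strip()

                          thousands, rest = divmod(n, 1000)
                          hundreds, last = divmod(rest, 100)

                          parts = []
                          if thousands:
                              parts.append(pair(thousands) + " Thousand")
                          if hundreds:
                              parts.append(unitsPlace[hundreds] + " Hundred")
                          if last:
                              parts.append(pair(last))

                          # "and" goes before the final pair only when something precedes
                          # it, so the phrase is never left hanging with ", and" at the end
                          if len(parts) > 1 and last:
                              parts[-1] = "and " + parts[-1]
                          return ", ".join(parts)

                      print(translate(77890))  # Seventy Seven Thousand, Eight Hundred, and Ninety
                      print(translate(70))     # Seventy
                      print(translate(0))      # Zero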

Sunday 1 May 2016

Cosine Similarity


Description:
              Cosine similarity is a metric that gives an idea of how close any two entities are, based on the difference between their corresponding dimensions. It is intuitively the opposite of a distance metric like, say, Euclidean distance or Manhattan distance: while such metrics calculate how different two vectors are, cosine similarity measures how similar two vectors are. Entities with high similarity yield high values when plugged into the cosine similarity function.

              The important feature that makes cosine similarity so popular, in spite of being a crude, basic form of similarity measure, is that it is independent of the number of dimensions present. If the vectors are two-dimensional, say x and y, cosine similarity can be determined over those two dimensions. If the vectors have more than two dimensions, cosine similarity still works exactly the same way, just traversing over each dimension. Moreover, cosine similarity works over vectors with a different number of dimensions as well. For example, if one vector has x and y as its two dimensions, and the other has only x, cosine similarity considers y to be 0 for the latter and works normally. This is practically correct, since an n-dimensional vector is not present in the (n+1)th dimension and hence has a value of 0 in that dimension.

The Formula:

                         \[ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \]
                                                     Formula courtesy: Wikipedia

where A_i and B_i are the i-th components of vectors A and B respectively.
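
Translated directly into code, the formula looks like this (a minimal Python sketch; the zip_longest padding implements the missing-dimension-is-0 behaviour described above):

                  import math
                  from itertools import zip_longest

                  def cosine_similarity(a, b):
                      """Cosine similarity of two vectors; a missing dimension counts as 0."""
                      dot = sum(x * y for x, y in zip_longest(a, b, fillvalue=0.0))
                      norm_a = math.sqrt(sum(x * x for x in a))
                      norm_b = math.sqrt(sum(x * x for x in b))
                      return dot / (norm_a * norm_b)

                  print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 - same direction
                  print(cosine_similarity([1, 2], [1]))           # second vector padded with a 0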

Range and Analysis:
                     The range of cosine similarity, like that of the cosine function, is [-1, +1] inclusive. A value of -1 means that the two entities are completely opposite to one another. A value of +1 means they are exactly similar. A value of 0 means that they are "perpendicular" (orthogonal) to one another, i.e. they share no component along any common direction.
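                     The three landmark values are easy to verify. An equivalent sketch using numpy, for illustration:

                         import numpy as np

                         def cos_sim(a, b):
                             a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
                             return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

                         print(cos_sim([1, 2], [2, 4]))     #  1.0 - exactly similar (same direction)
                         print(cos_sim([1, 0], [0, 1]))     #  0.0 - "perpendicular" (orthogonal)
                         print(cos_sim([1, 2], [-1, -2]))   # -1.0 - completely opposite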


Uses:
                  Cosine similarity is widely used in data mining techniques to determine the similarity between two vectors. I recently used this metric in a machine learning project to calculate similarity between users. Each user had Age, Latitude and Longitude of their address as their component dimensions. This was done so that we could predict book ratings for users. For example, if user U has rated a book B as 5, and user V is similar to U, then user V will also rate the book B as 5. For interested readers, I also employed an uncertainty-based modification of the above case: if user V is very similar to user U (who has given B a rating of 5) and even more similar to, say, user W (who has given B a rating of 8), then it is more uncertain that V will give a rating around 5 than around 8. In other words, V is more likely to give a rating close to 8 than to 5.
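                  A simplified sketch of that use case (all numbers here are made up for illustration, and borrowing the rating of the single most similar user is just one way to apply the idea, not necessarily the exact scheme from the project):

                      import math

                      def cosine_similarity(a, b):
                          dot = sum(x * y for x, y in zip(a, b))
                          return dot / (math.sqrt(sum(x * x for x in a)) *
                                        math.sqrt(sum(x * x for x in b)))

                      # Hypothetical users: (Age, Latitude, Longitude)
                      users = {"U": (25, 18.52, 73.85), "W": (32, 19.07, 72.87)}
                      ratings = {"U": 5, "W": 8}   # known ratings for book B
                      v = (30, 19.00, 72.90)       # user V, whose rating for B is unknown

                      # Predict V's rating by borrowing from the most similar user
                      most_similar = max(users, key=lambda u: cosine_similarity(v, users[u]))
                      print(most_similar, ratings[most_similar])  # W 8 in this toy data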

Conclusion:
                      Other similarity metrics like the Jaccard index and the Tanimoto score also exist, but they are mostly rooted in set theory; that is, they are designed to deal with sets rather than vectors or raw data. Arguably, Jaccard and Tanimoto can be applied to any vector as well, but the inherent operations that need to be performed are intersection and union (AND and OR for Tanimoto). Hence, we recommend cosine similarity (either the original or a modification suited to the problem) for raw vector data.