Sunday 8 May 2016

Translate Numbers to Their English Phrases


Problem Statement:
                    Write a program that takes an integer between 0-99,999 as input and output its English phrase.
For example :

             Input: 77890
             Output: Seventy Seven Thousand, Eight Hundred, and Ninety

Breaking down the problem:
              Such programs are commonly used in NLP to communicate with a user. They rarely use complex data structures or the programs are rarely complicated by themselves. They mostly are a chain of if-else conditions that handle particular cases. However, the sharpness of the programmer is tested here in the sense that he/she should handle all the possible inputs.

              If one observes above output, the commas and the word "and" are not associated with any digit in the number. However, according to the problem statement, they are very important. Such elements of the program, which need to be added for the sake of making the output "look better" or "closer to English" are often the main hurdles.
             
              Hence, there are 2 main sub-problems here. The first is to interpret the digits and convert them into their English words (taking care of their place in the number like the unnits place, tens place, etc.) and second is to take into account all the commas and the word "and" along with their placement.

Approach:
              The main hint here is the range of inputs - from 0 to 99,999. If one observes carefully, the last two digits and the first two digits always make one phrase in English. In the above input, the last digits "90" form Ninety and the first two digits "77" form Seventy Seven (along with thousand - their place value, but thats easy to predict). Hence, these will be considered in pairs. The third digit i.e. the digit in the hundred's place, is always alone. There is always a "Seven Hundred" or a "Three Hundred". There is never a "Seventy Three Hundred". Hence, logically, this is to be considered alone.

               Our first step should be to teach our program the English words, starting with the basics. The program needs to know that 0 becomes "Zero" and 1 becomes "One" and so on. Additionally, it also needs to know that 2 in tens place (or in the higher position whenever pairs are considered by the above logic) becomes "Twenty", 3 in similar position becomes "Thirty" and so on.

                English is a funny language. It is widely regarded as the toughest language for a program to learn, because of its sudden and seemingly illogical variations. For example, 11 has to become "Eleven" and not "Onety One". There is no reason why it cannot be "Onety One" for the sake of symmetry. But, such is life, and hence our program has to now learn about the phrases for the numbers from 11-19.

Trivia: The oldest Indian language, Sanskrit, is widely regarded as the easiest language for a program to learn owing to its symmetry.

                The next step is to implement the above approach. This is easy enough to do with a nested if-else tree. However, identifying the boundary conditions and all the cases and accomodating them was a pain to the author himself. For example, it is easy to overlook a small flaw in the above approach. If we take digits in pairs and check for words for them in an array, it becomes difficult for numbers with 0 as the lower number. For example, 77 becomes "Seventy Seven" by 70 is just "Seventy" and not "Seventy Zero". Hence, we need to make the 0th index of our array "unitsPlace" as an empty string. However, another issue now arises - how to handle if the input itself is just 0? This can be handles by adding yet another condition which checks if the input is 0, if it is - it simply prints "Zero" and exits. Any other 0s that might come will be translated to the null string.

                  The final step for ", and" is added to check if a phrase is not left hanging with those words as its last. There needs to be some word after "and" or else the program has some error.

Sunday 1 May 2016

Cosine Similarity


Description:
              Cosine similarity is a metric to get an idea of how close any 2 entities are based on the difference between their corresponding dimensions. It is intuitively opposite to a distance calculation metric like, say, Euclidean Distance or Manhattan Distance. While such metrics calculate how different two vectors are, cosine similarity measures how similar two vectors are. Entities with high similarity will yield high results when plugged in to the cosine similarity function.

              The important feature that makes cosine similarity so popular inspite of being a crude, basic form of similarity measure is that it is independent of the number of dimensions present. If the vectors are two-dimensional, say x and y, cosine similarity can be determined over those two dimensions. If the vectors have more than two-dimensions, cosine similarity still works exactly the same way, just traversing over each dimension. Moreover, cosine similarity will work over vectors with different number of dimensions as well. For example, if one vector has x and y as its two dimensions, and other has only x, cosine similarity considers y to be 0 for the latter and works normally. This is practically correct, since any n-dimensional vector will not be present in the (n+1)th dimension and hence will have a value 0 in the (n+1)th dimension.

The Formula:

                         \text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }
                                                     Image courtesy: Wikipedia

Where Ai and Bi are the dimensional components of A and B respectively

Range and Analysis:
                     The range of cosine similarity, like that of cosine function, is [-1, +1] inclusive. A value of -1 means that the two entities are completely opposite of one another. A value of +1 means they are exactly similar. A value of 0 means that they are "perpendicular" to one another i.e. they exist in different planes altogether.


Uses:
                  Cosine similarity is widely used in data mining techniques to determine the similarity between two vectors. I recently used this metric in a machine learning project to calculate similarity between users. Each user had Age, Latitude and Longitude of their address as their component dimensions. This was done so that we can predict book ratings for users. For example, if user U has rated a book B as 5, and user V is similar to U, then user V will also rate the book B as 5. For interested users, I also employed uncertainty principle. This is just a modification to the above case. using uncertainty, if user V is very similar to user U (who has given B 5 rating) and very very similar to, say user W (who has given B 8 rating), then we can say that it is more uncertain that V will give a rating around 5 than around 8. This also means that V is more probable to give a rating close to 8 than 5.

Conclusion:
                      Other similarity metrics like Jaccard index and Tanimoto score also exist, but they are mostly rooted in set theory. Meaning, they are designed to deal with sets more than vectors or raw data. Arguably, Jaccard and Tanimoto can be applied to any vector as well, but the inherent operations that need to be performed are intersection and union (AND and OR for Tanimoto). Hence, we recommend cosine similarity (either original or modified as per the problem) for raw vector data.

Friday 15 April 2016

Triplets of Ones

Problem Statement:

There is a String s which has a few (maybe none) '1's in it. Get all triplets of such '1's which are equidistant to each other. Output the triplet with the maximum distance 'k' between them.

Solution:

We first need to find the maximum value of distance 'k' between triplets of '1's present in the String. Obviously, the maximum possible value of k will be half the string length. So we will iterate over the value of k from s.length()/2 down to 1. We then iterate over the string to check for '1's and checking whether the characters at distances of k and 2k are also '1's.

Clearly, there are 2 variables whose iterations will determine the time complexity of the problem viz. 'k' and the string itself which needs to be traversed. Thus, one solution could be to decide on a value of 'k' in the outer loop and then traverse the string to see if this 'k' contains triplets of one.

Algorithm1{
              for k = string.length/2 down to 1 :
                   for i = 0 up to string.length-2k :
                       if string[i]=string[i+k]=string[i+2k]=1 :
                             output k;

}

Another solution could be to decide on an index in the string and then traverse over the string using different values of 'k'.

Algorithm2{

              for i = 0 upto string.length-2 :
                     for k = string.length/2 down to 1 :
                          if i+2k < string.length AND string[i]=string[i+k]=string[i+2k]=1 :
                               maxK = k; break;
              output maxK;
}

Both the solutions are asymptotically equally fast and have a time complexity of O(n2) in the average case.

However, if one observes carefully, the former method might yield faster results in cases where the answer is closer to the length of the string. This is because 'k' is decremented steadily in the former algorithm and is oputputted as soon as the condition is satisfied. This results in lesser conditional checks and in turn lesser array elements accesses leading to lesser running time.This fact, however, will not be reflected in the asymptotic performance of the algorithm, since asymptotically, it will remain O(n2) owing to the nested for loops.

Below is the Java code for the former algorithm:
 

Thursday 31 March 2016

Mapping in Java

Mapping:
                 The basic idea behind mapping is to establish a relation between two or more seemingly unrelated set of data structures. A common example of mapping is when a word string is mapped to its integer equivalent based on the ASCII values of it's characters. At a first glance, it might make little sense to develop a relationship between a string and an integer. However, mapping is very useful in programming and is, in fact, very common. 
                The main attraction to mapping is the reduced search time. While a linear search requires O(n) in the average case, search in mapping takes O(n) in worst case, only if the hashing function is not carefully chosen.


A mapping overview - buckets (ranges/numbers) mapped to keys (Strings)
(Image courtesy - Wikipedia)

Types of Mapping and Data Structures in Java:

1) One-to-one Map - java.util.hashmap:
                    HashMap is the most commonly used data structure that maps an object with a key. Both the object and the key can be of any data type, provided that the data type extends Object. Thus, primitive data types like int, float, double, etc. are not used as object and keys. Their corresponding wrapper classes like Integer, Float, Double, etc., however, can be used since they all extend Object.
                    HashMap maps any object using a key. Both the object and the key can be determined by the programmer, provided that the key is unique. Every key in the HashMap maps to one object only. Thus, this is a one-to-one mapping kind of data structure.
                     Already hashed objects can be retrieved from a key by the Map.get() method. Map.put() assigns an object to a key.

2) One-to-one bidirectional Map - org.apache.commons.collections.BidiMap:
                      One shortcoming of java.util.HashMap is that one cannot retrieve a key from its value. This is logical in the sense that the value is derived from the key, and hence the value should depend on the key and not the other way around. If, we try to search a key from a value, we may need to traverse the entire HashMap, which takes O(n) and defeats the purpose of a HashMap.
                       However, there may be times when we need to establish a bidirectional relationship between 2 objects whilst preserving the fast searching of a HashMap. Fortunately, BidiMap becomes useful in such cases. Note that in the case of BidiMap, the words "key" and "value" do not hold their semantic value. Since, both depend on each other, neither is a key. Both are simply values, strictly speaking.
                       BidiMap gives a convenient function getKey() which returns the key for the value argument. Both the key and value can be anything that extends the Object class. Again, "key" and "values" should not be taken literally.


3) One-to-many Map using java.util.hashmap:
                         Although generally used for one-to-one mapping, we can implement one-to-many mapping using java.util.hashmap. As mentioned earlier, the value can be anything that extends Object, we can use a List, or an ArrayList, or any other Collection for that matter as value, thus making the map one-to-many.
                          One key will thus map to one list of values. These values may now be arranged in any order as possible. If there is a specific order in the values, then an ArrayList is preferred, since it provides arbitrary access via indexes. In any case, the primary point of consideration here should be that the values must be arranged in a way that the main advantage of using a Map i.e. O(1) average search time should not be compromised to a large extent.

Traversing through a Map in Java:
                           There are a few ways to traverse through a map in Java, but the one we suggest is using an Iterator via a Map.Entry of the map. After assigning an iterator to a map, it traverses through the map via its next() method. A Map.Entry will create an Entry object from which the key and the value of the Entry can be easily obtained via the getKey() and getValue() methods. Both HashMap and BiDiMap can be traversed in the same way.