How algorithms that correct your spelling also help find cures for diseases.
Autocorrection is used everywhere. You use them in your phones, tablets, computers, word processors, even on Medium.
Sometimes it can give us a helping hand correcting us as we type.. but sometimes it can lead to some hilarious conversations
So how does Autocorrection work ?
Let’s look at an example where user has typed:
Property damage due to nold.
When the word “mold” was mistyped as “nold”
If we break the process of autocorrection into 4 simple steps
- Identify misspelled word
- Find candidate strings n edits away
- Filter candidates
- Calculate word probabilities
Step 1: Identify misspelled word
look for the word in dictionary / vocabulary. If word is not found in vocabulary then it’s probably an error. We cannot find the word “nold” in our vocabulary thus this must be an error.
Step2: Find strings n edits away
Edit is an operation performed on a string to change it.
Step 3: Filter candidates
In this step, you want to take all the words generated above and then only keep the actual words that make sense and that you can find in your vocabulary.
Step 4: Calculate word probabilities
Next we need to calculate the word probabilities, we can use the following formula:
Example: If our corpus is
I am happy because I am in office.
Then we would calculate the probabilities as:
So the probability of work am is P(am) = 2 / 8 = 0.25
Minimum Edit Distance
Minimum edit distance allows you to:
- Evaluate similarity between two strings
- Find the minimum number of edits between two strings
- Implement spelling correction, document similarity, machine translation, genome sequencing, and more
- Insert (add a letter) ‘to’ -> ‘top’, ‘two’ …
- Delete (remove a letter) ‘hat’ ->‘ha’, ‘at’, ‘ht’
- Replace (change 1 letter to another) ‘jaw’ -> ‘jar’, ‘paw’, …
If each operation has cost of 1
- Distance between nold and mold strings is 1
- Distance between nold and fnol strings is 2
If substitutions cost 2
- Distance between nold and mold strings is 2
- Distance between nold and fnol strings is 2
Levenstein Distance is a example of Minimum Edit Distance measure where substitution has cost 2.
Searching for a path (sequence of edits) from the start string to the final string.
- Initial state: the word we’re transforming
- Operators: insert, delete, substitute
- Goal state: the word we’re trying to get to
- Path cost: what we want to minimise: the number of edits
To compute the minimum edit distance, we start with a source word and transform it into the target word.
To go from # → # you need a cost of 0. From n→# you get 1, because that is the cost of a delete. n→m is 2 because that is the minimum cost one could use to get from n to m.
You can keep going this way by populating one element at a time, But the space of all edit sequences is huge. As your strings get larger it gets much harder to calculate the minimum edit distance. Hence you will now learn about the minimum edit distance algorithm!
- We can’t afford to navigate naïvely
- Lots of distinct paths wind up at the same state.
- We don’t have to keep track of all of them
- Just the shortest path to each of those revisited states.
Dynamic programming for calculating Minimum Edit Distance:
You first solve the smallest subproblem first and then reusing that result you solve the next biggest subproblem, saving that result, reusing it again, and so on
Other distance metrics:
There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,
- Damerau–Levenshtein distance allows the transposition of two adjacent characters alongside insertion, deletion, substitution;
- longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
- Hamming distance allows only substitution, hence, it only applies to strings of the same length.
- the Jaro distance allows only transposition.
Weighted Edit distance:
You can improve the algorithm by adding a weighted edit distance. Why would we add weights to the computation? Not all corrections are equally likely.
- Spell Correction: some letters are more likely to be mistyped than others
- Biology: certain kinds of deletions or insertions are more likely than others
The corrections on a AZERTY key pad will be different than a QWERTY key pad.
Below is the substitution matrix for an AZERTY keypad:
NLP in computational biology
Natural language processing algorithms such as minimum edit distance are also applied in computational biology.
A genetic sequence is a string formed from a four-letter alphabet [ Adenosine (A), Thymidine (T), Guanosine (G), Cytidine (C ) ]of biological macromolecules referred to together as the DNA bases.
Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes.
Among the most common changes are the insertion, substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations.
As a result of these point mutations, the same gene sequenced from closely related organisms will have slight differences.
Researches need to align genomic sequences to:
- Compare genes or regions from different species
- Assemble fragments to sequence DNA
- Compare individuals to look for mutations
In NLP we we generally are looking to minimise distance, and in computational biology we are generally looking to maximise similarity
As an example, lets say biologists have found the following sequence of a gene in a previously unstudied organism.
What is the function of the protein that this gene encodes? You could immediately begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism.
Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage.
We compare the above genetic sequence with one which has already been sequenced and whose function is well understood.
If the two genetic sequences are similar enough, we might expect them to have similar functions. We want to quantify if these two are “similar enough”.
Applying minimum edit distance algorithm, we can find how closely matching the two sequences.