How algorithms that correct your spelling also help find cures for diseases.

Autocorrection is used everywhere. You use them in your phones, tablets, computers, word processors, even on Medium.

Sometimes it can give us a helping hand correcting us as we type.. but sometimes it can lead to some hilarious conversations

So how does Autocorrection work ?

Let’s look at an example where user has typed:

Property damage due to nold.

When the word “moldwas mistyped as “nold

If we break the process of autocorrection into 4 simple steps

Step 1: Identify misspelled word

look for the word in dictionary / vocabulary. If word is not found in vocabulary then it’s probably an error. We cannot find the word “nold” in our vocabulary thus this must be an error.

Step2: Find strings n edits away

Edit is an operation performed on a string to change it.

Step 3: Filter candidates

In this step, you want to take all the words generated above and then only keep the actual words that make sense and that you can find in your vocabulary.

Step 4: Calculate word probabilities

Next we need to calculate the word probabilities, we can use the following formula:

Example: If our corpus is

I am happy because I am in office.

Then we would calculate the probabilities as:

So the probability of work am is P(am) = 2 / 8 = 0.25

Minimum Edit Distance

Minimum edit distance allows you to:

Operations

If each operation has cost of 1

If substitutions cost 2

Levenstein Distance is a example of Minimum Edit Distance measure where substitution has cost 2.

Algorithm

Searching for a path (sequence of edits) from the start string to the final string.

Computational complexity:

To compute the minimum edit distance, we start with a source word and transform it into the target word.

To go from # → # you need a cost of 0. From n→# you get 1, because that is the cost of a delete. n→m is 2 because that is the minimum cost one could use to get from n to m.

You can keep going this way by populating one element at a time, But the space of all edit sequences is huge. As your strings get larger it gets much harder to calculate the minimum edit distance. Hence you will now learn about the minimum edit distance algorithm!

Dynamic programming for calculating Minimum Edit Distance:

You first solve the smallest subproblem first and then reusing that result you solve the next biggest subproblem, saving that result, reusing it again, and so on

Other distance metrics:

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

Weighted Edit distance:

You can improve the algorithm by adding a weighted edit distance. Why would we add weights to the computation? Not all corrections are equally likely.

The corrections on a AZERTY key pad will be different than a QWERTY key pad.

Below is the substitution matrix for an AZERTY keypad:

NLP in computational biology

Natural language processing algorithms such as minimum edit distance are also applied in computational biology.

A genetic sequence is a string formed from a four-letter alphabet [ Adenosine (A), Thymidine (T), Guanosine (G), Cytidine (C ) ]of biological macromolecules referred to together as the DNA bases.

Each new cell produced by your body receives a copy of the genome. This copying process, as well as natural wear and tear, introduces a small number of changes into the sequences of many genes.

Among the most common changes are the insertion, substitution of one base for another and the deletion of a substring of bases; such changes are generally referred to as point mutations.

As a result of these point mutations, the same gene sequenced from closely related organisms will have slight differences.

Researches need to align genomic sequences to:

In NLP we we generally are looking to minimise distance, and in computational biology we are generally looking to maximise similarity

As an example, lets say biologists have found the following sequence of a gene in a previously unstudied organism.

What is the function of the protein that this gene encodes? You could immediately begin a series of uninformed experiments in the lab to determine what role this gene plays. However, there is a good chance that it is a variant of a known gene in a previously studied organism.

Since biologists and computer scientists have laboriously determined (and published) the genetic sequence of many organisms (including humans), you would like to leverage this information to your advantage.

We compare the above genetic sequence with one which has already been sequenced and whose function is well understood.

If the two genetic sequences are similar enough, we might expect them to have similar functions. We want to quantify if these two are “similar enough”.

Applying minimum edit distance algorithm, we can find how closely matching the two sequences.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
sid dhuri

I am data scientist by trade. I love to write about data science, marketing and economics. I am also the founder Orox.ai a marketing ai and automation platform.