ABOUT
Concepts
How can presage predict what text the user is going to enter next?
The approach relies on information theory: natural language is modelled as a redundant source of information. The key idea is to treat natural language as a set of redundant information sources and to let various predictive methods exploit that redundancy in order to generate predictions.
Information sources can be classified as statistical, syntactic and semantic.
Statistical sources
If one were asked to guess what word follows the text fragment:

The quick brown fox jumps over the lazy ...

one would most likely reply "dog".

The quick brown fox jumps over the lazy dog

is in fact a widely known pangram (a sentence which uses every letter of the alphabet at least once).

One would pick the word "dog" on the grounds that, given the history string "The quick brown fox jumps over the lazy ...", "dog" is the word that most frequently follows it (the frequentist approach to statistical information). Likewise, one would pick "over" or "on" when given the fragment "The quick brown fox jumps ...", because "on" and "over" are the words that most frequently follow the verb "jump".
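
The following Python sketch illustrates this frequentist idea in its simplest form. It is only an illustration, not presage's implementation: the corpus, the function names and the bigram-only model are made up for the example.

    from collections import Counter

    def train_bigram_counts(words):
        """Count, for each word, which words immediately follow it in the corpus."""
        followers = {}
        for current, nxt in zip(words, words[1:]):
            followers.setdefault(current, Counter())[nxt] += 1
        return followers

    def predict_next(followers, last_word, n=3):
        """Return the n words that most frequently followed last_word."""
        return [word for word, _ in followers.get(last_word, Counter()).most_common(n)]

    # Tiny made-up corpus, for illustration only.
    corpus = "the fox jumps over the dog and the fox jumps on the log".split()
    counts = train_bigram_counts(corpus)
    print(predict_next(counts, "jumps"))   # e.g. ['over', 'on']
    print(predict_next(counts, "the"))     # e.g. ['fox', 'dog', 'log']

A real statistical predictor would use longer n-grams and a far larger corpus, but the principle is the same: the history selects the continuations that have been observed most often.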
Syntactic sources
Continuing the previous example, one would pick the words "over" or "on" when given the context:

The quick brown fox jumps o...

(note the final letter "o").

A syntactically aware predictive plugin could exploit the fact that "to jump" is an intransitive verb typically followed by a preposition or adverb. Both "on" and "over" fit that pattern and begin with the letter "o", which makes them strong candidates for a prediction.
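
A minimal sketch of that kind of syntactic filtering is shown below; the tiny hand-written lexicon and the tag names are hypothetical stand-ins for a real part-of-speech resource.

    # Hypothetical toy lexicon mapping candidate words to coarse part-of-speech tags.
    LEXICON = {
        "over": "PREP_ADV",
        "on": "PREP_ADV",
        "ox": "NOUN",
        "old": "ADJ",
    }

    def syntactic_candidates(candidates, expected_tag, prefix):
        """Keep only candidates whose tag matches what the grammar expects
        and whose spelling completes the prefix typed so far."""
        return [w for w in candidates
                if LEXICON.get(w) == expected_tag and w.startswith(prefix)]

    # After an intransitive verb such as "jumps", expect a preposition or adverb
    # starting with the letter already typed ("o").
    print(syntactic_candidates(LEXICON, "PREP_ADV", "o"))   # ['over', 'on']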
Semantic sources
Supposing one had been writing about the adventures of a cunning fox and a loyal dog for a while, one would suggest that "dog" is the most likely object to be jumped over by the fox:

The quick brown fox jumps over the lazy d...

Knowledge of the context and (however limited) comprehension of it provide additional information that increases predictive accuracy.
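
The sketch below illustrates how a semantic source might use that kind of comprehension to re-rank candidates; the relatedness table is made up and stands in for a real semantic model.

    # Made-up semantic relatedness scores between recent topic words and candidates.
    RELATEDNESS = {
        ("fox", "dog"): 0.9,
        ("fox", "door"): 0.1,
        ("fox", "dot"): 0.05,
    }

    def semantic_rerank(candidates, topic_words):
        """Order candidates by how related they are to the words the text
        has recently been about."""
        def score(candidate):
            return sum(RELATEDNESS.get((topic, candidate), 0.0) for topic in topic_words)
        return sorted(candidates, key=score, reverse=True)

    # Candidates that complete "d...", re-ranked given that the text is about a fox.
    print(semantic_rerank(["door", "dot", "dog"], ["fox"]))   # ['dog', 'door', 'dot']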
The resulting language model is powerful, flexible and extensible.
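
As a purely illustrative picture of how such a combined model could rank its suggestions, the following sketch merges hypothetical scores from a statistical, a syntactic and a semantic source; the scores and the simple summing scheme are assumptions made for the example.

    from collections import defaultdict

    def combine(prediction_lists):
        """Merge scored suggestions from several independent sources
        by summing the scores each source assigns to a word."""
        combined = defaultdict(float)
        for predictions in prediction_lists:
            for word, score in predictions:
                combined[word] += score
        return sorted(combined.items(), key=lambda item: item[1], reverse=True)

    # Hypothetical scores from a statistical, a syntactic and a semantic source.
    statistical = [("dog", 0.6), ("door", 0.3)]
    syntactic   = [("dog", 0.5), ("dot", 0.2)]
    semantic    = [("dog", 0.9)]
    print(combine([statistical, syntactic, semantic]))
    # "dog" ranks highest, having support from all three sources.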