From TheBestLinks.com
An N-gram is a contiguous sequence of n letters taken from a given string after removing all spaces. For example, the 3-grams that can be generated from "good morning" are "goo", "ood", "odm", "dmo", "mor" and so forth.
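The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name char_ngrams is an assumption chosen for this example, and it strips all whitespace rather than only space characters.

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text`, in order, after removing whitespace.

    `char_ngrams` is a hypothetical helper name used only for this sketch.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Remove all whitespace, so "good morning" becomes "goodmorning".
    compact = "".join(text.split())
    # Slide a window of width n across the compacted string.
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("good morning", 3))
```

A string shorter than n yields an empty list, since no window of width n fits.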
By converting a string to N-grams, it can be embedded in a vector space, which allows it to be compared to other strings efficiently. For example, if we convert strings containing only letters of the English alphabet into 3-grams, we get a 26³-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Note that this representation loses information about the string. For example, the strings "abcba" and "bcbab" give rise to exactly the same 2-grams ("ab", "bc", "cb" and "ba", each occurring once), so they map to the same vector. However, we know empirically that if two strings of real text have a similar vectorial representation (for example, a small cosine distance) then they are likely to be similar.
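As a rough sketch of this embedding, the vector can be stored sparsely as a mapping from each N-gram to its count, so that the 26³ dimensions never need to be materialized. The helper names ngram_counts and cosine_similarity are assumptions made for this example; cosine distance is taken here as one minus the cosine similarity.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n):
    """Sparse N-gram count vector of `text` (whitespace removed first)."""
    compact = "".join(text.split())
    return Counter(compact[i:i + n] for i in range(len(compact) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only N-grams present in `a` can contribute to the dot product;
    # a Counter returns 0 for keys it does not contain.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen for this sketch: empty vectors match nothing.
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # The two strings from the text share the same 2-gram vector.
    print(ngram_counts("abcba", 2) == ngram_counts("bcbab", 2))
    print(1.0 - cosine_similarity(ngram_counts("good morning", 3),
                                  ngram_counts("good evening", 3)))
```

The first comparison shows the information loss described above: two different strings become indistinguishable once reduced to their 2-gram counts.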
N-grams are commonly used to design kernels that allow machine learning algorithms such as support vector machines to learn from string data. They can also be used to find likely candidates for the correct spelling of a misspelled word.
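One way the spelling use case might look is sketched below: each vocabulary word is ranked by the cosine similarity of its 2-gram counts to those of the misspelled word. The function names, the choice of 2-grams, and the tiny word list are all assumptions for illustration; a real spell checker would use a full dictionary and typically an index rather than a linear scan.

```python
from collections import Counter
from math import sqrt


def bigram_counts(word):
    """2-gram count vector of a single word (hypothetical helper)."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def suggest(misspelled, vocabulary, limit=3):
    """Return up to `limit` vocabulary words sharing 2-grams with `misspelled`."""
    target = bigram_counts(misspelled)
    scored = [(similarity(target, bigram_counts(word)), word) for word in vocabulary]
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for score, word in scored[:limit] if score > 0]


if __name__ == "__main__":
    # Illustrative word list only, not a real dictionary.
    words = ["morning", "mourning", "evening", "night"]
    print(suggest("mornign", words, limit=2))
```

The dot product computed inside similarity is also the basic form such a string kernel can take: an inner product between two N-gram count vectors.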