An Arabic spell-checker content provider

I wanted a sensible content provider, and spell checking is an excellent example. To implement spell-checking in your app, you'd add a couple of queries on the spell-checker content provider.

Perhaps your code, using the content provider, would say something like:

Cursor x = getContent().query("content://edu.bzu.spellcheck/checkspell", new String[]{"Word", "Gloss", "POS"}, "تالب",null,"");
to get back gloss and part-of-speech information for an attempted spelling of a word. Or it might ask:
Cursor x = getContent().query("content://edu.bzu.spellcheck/nearspell/1", new String[]{"Word", "Gloss", "POS"}, "تالب",null,"");
To get back a list of words which are at most one character substitution, deletion, or insertion away.

Obviously, there'd be some code in the app to offer the user a chance to act on the list. For example, in MS Word, words which seem to be misspelled are displayed with a wavy red line under them. To get access to why the wavy red line was printed, the user right-clicks on the word, and a pop-up offers the chance to substitute one of the nearby choices, or add the word to the dictionary so that it won't come up again.

But by putting the spell-check code in a content provider, you are offering every app the chance to use spell-checking without having the keep a copy of the spelling database in private memory.

More about content providers

The Android purpose of this project is for you to write a ContentProvider. ...

More about spell-checking

One way to check spelling is to have a database of all possible correctly spelled words, and look up the word. The coding for this technique is simple, but providing the data is a bit of an issue. There may also be response time issues.

In order to provide possible correct spellings for a misspelled word, one technique is to look at words which are increasingly dissimilar. For example, if we want to know what word was intended for the English word "carot", we look at the possible correct words which can be found by deleting, inserting, or substituting a character for one of those in the word. These include (at least-- I haven't written this program) "tarot", "carrot", "caret", "carom". If checking for correct spellings is fast, we can check each of the 6*26 possible insertions and each of the 5*26 substitutions or deletions and report which of those is a correct word; going on to two or three changes means checking thousands instead of hundreds of possibilities, but is still feasible; the problem for short words is that at a distance of three there are too many correctly spelled words for the answer to be interesting -- it's too much work to read.

To reduce the number of possible answers, you can use various heuristics. For example, you could check for wrong vowels, or letters particularly likely to be confused, like "k" for "c" in English. One common scheme is to assume a typing error, so that letters which are adjacent on the keyboard can be confused (by wrong placement of the fingers.)

For long misspelled words, there are likely to be only a few possible correct words, even after checking thousands of nearby alternatives, so it might be enough to just display them.

English lists of correct words

Now you can download lists of English words from the web. In the past, one way to come up with such lists was to start with a stem, and add affixes. So noun form leads to verb inform, which leads to informed informing informs informer and information

Arabic correct words

When I am reading a newspaper, and come to an Arabic word I don't know, I use a program based on the work of Tim Buckwalter.

He made lists of prefixes, stems, and suffixes, along with compatibility tables. So we can write بالبيت which begins with the prefix بال has the stem بيت and no suffix, or we can write ببيتها which begins with the prefix ب and has the suffix ها . Because the stem بيت is a noun, it can take either the prefix ب or بال and either the empty suffix or the suffix ها , but the prefix ﺏﺎﻟ can't go with the suffix ﻩﺍ .

In order to find out if a word is described by the tables, you go through it, trying all possible splits into prefix, stem, suffix. Including silly splits, an n character word has n+1 possible places to split, but Buckwalter requires that the stem have at least one character, so that the second split can't be in the same spot as the first. So there are n(n+1)/2 places to split, and we check each split to see if the prefix, stem, and suffix are in the table, and if so, whether they are compatible.

Buckwalter's tables
dictPrefixes A list of prefixes, with categories and glosses
dictStems A list of stems, with categories and glosses
dictSuffixes A list of suffixes, with categories and glosses tableAB covers prefix-stem compatibility tableAC covers prefix-suffix compatibility tableBC covers stem-suffix compatibility

In order to make sense of Buckwalter's tables, you need to know that he wrote them in a pre-Unicode age, so he used a transliteration scheme in which each character in Arabic was one Ascii character. For example, the Arabic word ببيغها is transliterated as bbythA. You'd probably be able to make out the code from this hint, but I've provided this handy file, Buckwalter.java which contains some methods for turning sensible Arabic Unicode strings into Buckwalter transliteration and vice versa.

I'm also providing, without really checking it out, a class from a lookup program I wrote in 2012. I don't remember if this class actually worked, and I can't find the code it was based on anymore. I do remember that I had trouble shoehorning it into an Android app, and I rewrote it extensively, but I don't want to give you the result, which definitely works, because it gets rather twisted. And without further ado, here's Dict.java"

You're supposed to use it by intializing a Dict object dict and then getting a dict.LookupIterator object. I think the code was something like this:

Dict dict = new Dict(getResources()); // that was to get at the asset files for (int i=0; i<6; i++){// all six tables... maybe need a bigger number System.out.println(dict.incrInit());// print read length of each table as read } Dict.LookupIterator dli = new Dict.LookupIterator("blbyt"); while (dli.hasNext){ Lookup next = dli.next(); System.out.println(next.pr); // print out the prefix, stem and suffix System.out.println(next.st); System.out.println(next.sf);
The initialization code is so silly because, as it turned out, the app bombed while starting because the hash tables got too big. When I rewrote the code I pre-computed an index file, instead. It's big, but since I controlled the datastructure, it doesn't get out of hand. I now know that the problem with the hash table is that the minimum size of a Froyo Android object is 64 bytes. A hash table has an entry object, a key, and a result object for every entry, plus the space for the table itself, which turns out to be trivial compared to the minimum of 192 bytes per entry. And I now know that reason my app bombed is that the maximum size of a Froyo Android app is 25MB. I think there are only about 100,000 table entries, so it should have almost fit, but it didn't, and I got carried away on the re-write.

The code has some hints about what I finally did; I copied each asset file into a single array, and used ints as addresses for the beginnings of each row. I created index files off-line, which tell for each table where the beginnings of the record are, and I think I do a binary search on the index file to find the key entry. The index files are also allocated as a single array of bytes, so they don't have slack in them either. I COULD have created a hash table off-line, which would have been cooler and faster, or I could have implemented my own version of hash tables which didn't use objects. In retrospect, that might have been better, but as it turned out almost all of my bugs were due to using Key objects that weren't really objects, but structs which used ints to point into the String array. I didn't have too much trouble from the fact that those pointers were to strings which were terminated by either a comma or a newline, but I had a lot of trouble from the fact that I didn't allocate fresh Key objects, but carefully re-used them. Which probably wasn't necessary. Optimize no more than you need... Anyhow, I don't think I had made those changes in the code you're looking at.

Buckwalter's original program was in PERL, and it is about a page of code. Probably if you ignore my code completely, you can redo it in Java in five pages or less.

Arabic incorrect words

About 95 percent of the words in newspaper articles are found with Buckwalter's tables, not counting names and obvious foreign words. As far as I know, all the words found are correct spellings.

That still leaves you with many missing words, especially from areas seldom mentioned in the newpaper. The names of common foods are an example of words you probably won't find. Still, perhaps 95% is enough to keep your spell-checking app from being completely frustrating, and you can have the app add new words as you go. I don't try to add to the Buckwalter tables on the fly, because the categories are too complicated -- they include nouns starting with lam and verbs ending in yaa, and about two hundred more possibilities -- but it shouldn't be a big deal to keep an "extra words" database. In fact, Android has a content provider for exactly that purpose, which is used by the Android SpellCheckerService.

Probably the big problem with looking for nearly words in Arabic is that there are more of them. Arabic words are shorter on average than English words (no vowels to type, for one thing) so this problem is more severe in Arabic than in English. And since the Buckwalter tables include a lot of words, many of them may be unfamiliar. My solution to this problem would be to look at the English gloss, but that seems to me to be a very lame solution for an Arabic spell-checker. Conceivably you could use google translate to generate alternatives; for many words google will give you a list of synonyms. Those lists are called synsets, and if you can find an Arabic wordnet, like http://awnbrowser.sourceforge.net/ synsets are its main job...

Good luck!