A novel string distance metric for ranking Persian respelling suggestions


Spelling errors in digital documents are often caused by operational and cognitive mistakes, or by incomplete knowledge of the language in which the documents are written. Computer-assisted solutions can help to detect such errors and suggest replacements. In this paper, we present a new string distance metric for the Persian language that ranks respelling suggestions for a misspelled Persian word. The metric considers the effects of keyboard layout on typographical spelling errors as well as the homomorphic and homophonic aspects of words for orthographical misspellings. We also consider misspellings caused by disregarded diacritics. Since the proposed string distance metric is custom-designed for the Persian language, we present the spelling aspects of the Persian language such as homomorphs, homophones, and diacritics. We then present our statistical analysis of a set of large Persian corpora to identify the causes and the types of Persian spelling errors. We show that the proposed string distance metric achieves a higher mean average precision and a higher mean reciprocal rank in ranking respelling candidates of Persian misspellings than other metrics such as the Hamming, Levenshtein, Damerau–Levenshtein, Wagner–Fischer, and Jaro–Winkler metrics.
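The general idea of letting keyboard layout influence a string distance can be illustrated with a minimal sketch. This is not the metric proposed in the paper, whose definition is not given in the abstract. It is a plain Levenshtein-style dynamic program in which substituting a key for one of its keyboard neighbours costs less than an arbitrary substitution. The names `weighted_edit_distance`, `rank_candidates`, `reciprocal_rank`, the `near_cost` value of 0.5, and the Latin-letter `TOY_NEIGHBORS` table are all assumptions made for illustration; a real system would need an adjacency table built from an actual Persian keyboard layout.

```python
from typing import Dict, List, Set

# Hypothetical adjacency table (assumption): a few Latin keys for readability.
TOY_NEIGHBORS: Dict[str, Set[str]] = {
    "s": {"a", "d", "w", "x"},
    "a": {"s", "q", "z"},
}


def weighted_edit_distance(source: str, target: str,
                           neighbors: Dict[str, Set[str]],
                           near_cost: float = 0.5) -> float:
    """Edit distance where replacing a typed key with a keyboard
    neighbour costs near_cost instead of 1.0 (illustrative only)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = float(i)          # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = float(j)          # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            typed, wanted = source[i - 1], target[j - 1]
            if typed == wanted:
                sub = 0.0
            elif wanted in neighbors.get(typed, set()):
                sub = near_cost        # likely a slip to an adjacent key
            else:
                sub = 1.0
            dist[i][j] = min(dist[i - 1][j] + 1.0,       # deletion
                             dist[i][j - 1] + 1.0,       # insertion
                             dist[i - 1][j - 1] + sub)   # substitution
    return dist[-1][-1]


def rank_candidates(misspelling: str, candidates: List[str],
                    neighbors: Dict[str, Set[str]]) -> List[str]:
    """Order respelling candidates from closest to farthest."""
    return sorted(candidates,
                  key=lambda c: weighted_edit_distance(misspelling, c, neighbors))


def reciprocal_rank(ranked: List[str], intended: str) -> float:
    """1 / rank of the intended word, or 0.0 if it is not in the list."""
    return 1.0 / (ranked.index(intended) + 1) if intended in ranked else 0.0


ranked = rank_candidates("csr", ["cur", "car"], TOY_NEIGHBORS)
print(ranked, reciprocal_rank(ranked, "car"))
```

Under unweighted Levenshtein distance both candidates are one substitution away from "csr" and tie; the adjacency discount breaks the tie in favour of "car". Averaging `reciprocal_rank` over many test misspellings gives the mean reciprocal rank mentioned in the abstract.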

Abramovici, S. 1983. Errors in proofreading: evidence for syntactic control of letter processing. Memory and Cognition 11 (3): 258–61.
Alberga, C. 1967. String similarity and misspellings. Communications of the ACM 10 (5): 302–13.
AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., and Oroumchian, F. 2009. Hamshahri: a standard Persian text collection. Knowledge-Based Systems 22 (5): 382–7.
Angell, R., Freund, G., and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Information Processing and Management 19 (4): 255–61.
Berkel, B. V., and Smedt, K. D. 1988. Triphone analysis: a combined method for the correction of orthographical and typographical errors. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, TX, USA.
Bledsoe, W. W., and Browning, I. 1959. Pattern recognition and reading by machine. In Proceedings of the Eastern Joint IRE-AIEE-ACM Computer Conference, Boston, MA, USA.
Brill, E., and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong.
Brown, A. 1988. A Singaporean corpus of misspellings: analysis and implications. Journal of the Simplified Spelling Society 3: 4–10.
Comeau, C., and Wilbur, W. J. 2004. Non-word identification or spell checking without a dictionary. Journal of the American Society for Information Science and Technology 55 (2): 169–77.
Damerau, F. J. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM 7 (3): 171–6.
Davis, C. O. 1922. What is a misspelling? In Cattell, J. McKeen, Ryan, W. Carson JR., Walters, Raymond (eds.), School and Society, vol. 15, pp. 117–48. New York, USA: The Science Press.
Davrondjon, G., and Janowski, T. 2002. Developing a Spell-Checker for Tajik Using RAISE, Lecture Notes in Computer Science vol. 2495, pp. 401–5. Berlin, Germany: Springer-Verlag.
De Heer, T. 1982. The application of the concept of homeosemy to natural language information retrieval. Information Processing and Management 18 (5): 229–36.
Eastman, C. M., and McLean, D. S. 1981. On the need for parsing ill-formed input. Computational Linguistics 7 (4): 257.
Golding, A. R. 1996. A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the 3rd Workshop on Very Large Corpora, Cambridge, MA, USA.
Golding, A. R., and Roth, D. 1999. A winnow-based approach to context-sensitive spelling correction. Machine Learning 34 (1–3): 107–30.
Hamming, R. 1950. Error detecting and error correcting codes. Bell System Technical Journal 29 (2): 147–60.
Hanson, A., Riseman, E., and Fisher, E. 1976. Context in word recognition. Pattern Recognition 8 (1): 35–45.
Heath, T. 1956. The Thirteen Books of Euclid's Elements, vol. 1. New York, USA: Dover.
Hirschberg, D. 1975. A linear space algorithm for computing maximal common subsequences. Communications of the ACM 18 (6): 341–4.
Hodge, V. J., and Austin, J. 2001. An evaluation of phonetic spell checkers. Technical Report ycs 338, Department of Computer Science, University of York, York, UK.
Hodge, V. J., and Austin, J. 2003. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering 15 (5): 1073–81.
Holmes, D., and McCabe, M. C. 2002. Improving precision and recall for soundex retrieval. In Proceedings of the International Conference on Information Technology: Coding and Computing, Las Vegas, NV, USA, pp. 22–6.
Holmes, V. M., and Malone, N. 2004. Adult spelling strategies. Reading and Writing 17 (6): 537–66.
Jaro, M. A. 1989. Advances in record linking methodology as applied to the 1985 census of Tampa Florida. Journal of the American Statistical Association 84 (406): 414–20.
Jaro, M. A. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine 14 (5–7): 491–8.
Kantor, P. B., and Voorhees, E. M. 2000. The trec-5 confusion track: comparing retrieval methods for scanned text. Information Retrieval 2 (2–3): 165–76.
Keen, E. M. 1971. Evaluation Parameters. Englewood Cliffs, NJ, USA: Prentice-Hall, pp. 74–111.
Korpela, A. J. 2006. Unicode Explained. Sebastopol, CA, USA: O'Reilly Media.
Kukich, K. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys 24 (4): 377–439.
Lazard, G. 2012. A Grammar of Contemporary Persian. Persian Studies Series. Costa Mesa, CA, USA: Mazda Publishers.
Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10: 707–10.
Mahootian, S., and Gebhardt, L. 1997. Persian. London, UK: Routledge.
Masek, W. 1980. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20 (1): 18–31.
Means, L. G. 1988. Cn yur cmputr raed ths? In Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, TX, USA.
Megerdoomian, K. 2000. Unification-based Persian morphology. In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico City, Mexico.
Megerdoomian, K. 2004. Finite-state morphological analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland.
Min, K., Wilson, W. H., and Moon, Y.-J. 2000. Typographical and orthographical spelling error correction. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.
Mitton, R. 1987. Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing and Management 23 (5): 495–505.
Mitton, R. 2009. Ordering the suggestions of a spell checker without using context. Natural Language Engineering 15 (2): 173–92.
Naseem, T., and Hussain, S. 2007. A novel approach for ranking spelling error corrections for Urdu. Language Resources and Evaluation 41: 117–28.
Odell, M. K., and Russell, R. C. 1918. US Patents Nos. 1261167 (1918) and 1435663 (1922). Washington, DC: US Patent and Trademark Office.
Ola, O. 1973. Types of orthographic error. Scandinavian Journal of Educational Research 17 (1): 95–115.
Peterson, J. L. 1980a. Computer programs for detecting and correcting spelling errors. Communications of the ACM 23 (12): 676–87.
Peterson, J. L. 1980b. Computer Programs for Spelling Correction. New York, USA: Springer-Verlag.
Peterson, J. L. 1986. A note on undetected typing errors. Communications of the ACM 29 (7): 633–7.
Pollock, J., and Zamora, A. 1983. Collection and characterization of spelling errors in scientific and scholarly text. Journal of the American Society for Information Science 34 (1): 51–8.
Pollock, J. J., and Zamora, A. 1984. Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27 (4): 358–68.
Rasooli, M., Kashefi, O., and Minaei, B. 2011. Effect of adaptive spell checking in Persian. In Proceedings of the 7th International Language Processing and Knowledge Engineering (NLP-KE), Tokushima, Japan.
Ristad, E. S., and Yianilos, P. N. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 20 (5): 522–32.
Salthouse, T. A. 1986. Perceptual, cognitive, and motoric aspects of transcription typing. Psychological Bulletin 99 (3): 303–19.
Shaalan, K., Allam, A., and Gomah, A. 2003. Towards automatic spell checking for Arabic. In Proceedings of the Conference on Language Engineering, Cairo, Egypt.
Shamsfard, M., Jafari, H. S., and Ilbeygi, M. 2010. Step-1: a set of fundamental tools for Persian text processing. In Proceedings of the International Conference on Language Resources and Evaluation, Valletta, Malta.
Stauffer, R. 1949. Research in spelling and handwriting. Review of Educational Research 19 (2): 118–24.
Sterling, C. M. 1983. Spelling errors in context. British Journal of Psychology 74 (3): 353–64.
Toutanova, K., and Moore, R. C. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.
Ukkonen, E. 1985. Algorithms for approximate string matching. Information and Control 64 (1–3): 100–18.
Ullman, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Computer Journal 20 (2): 141–7.
Wagner, R. A., and Fischer, M. J. 1974. The string-to-string correction problem. Journal of the ACM 21 (1): 168–73.
Winkler, W. 1999. The State of Record Linkage and Current Research Problems. Statistics of Income Division, Internal Revenue Service Publication R, 4. Washington, DC: Bureau of the Census.
Winkler, W., and Thibaudeau, Y. 1991. An application of the Fellegi-Sunter model of record linkage to the 1990 US decennial census. Research Report RR91/09, US Bureau of the Census, Washington, DC.
Worthy, J., and Viise, N. M. 1996. Morphological, phonological, and orthographic differences between the spelling of normally achieving children and basic literacy adults. Reading and Writing 8 (2): 139–59.
Yannakoudakis, E., and Fawthrop, D. 1983. The rules of spelling errors. Information Processing and Management 19 (2): 87–99.
Yianilos, P. N. 1983. A dedicated comparator matches symbol strings fast and intelligently. Electronics Magazine (McGraw-Hill), December, pp. 113–7.
Zobel, J., and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland.
Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering