A novel string distance metric for ranking Persian respelling suggestions

OMID KASHEFI; MOHSEN SHARIFI; BEHROOZ MINAIE

doi:10.1017/S1351324912000186

A novel string distance metric for ranking Persian respelling suggestions

Published online by Cambridge University Press: 24 July 2012

OMID KASHEFI ,

MOHSEN SHARIFI and

BEHROOZ MINAIE

Show author details

OMID KASHEFI: Affiliation:
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran e-mails: kashefi@ieee.org, kashefi@iust.ac.ir, b_minaei@iust.ac.ir
MOHSEN SHARIFI*: Affiliation:
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran e-mails: kashefi@ieee.org, kashefi@iust.ac.ir, b_minaei@iust.ac.ir
BEHROOZ MINAIE: Affiliation:
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran e-mails: kashefi@ieee.org, kashefi@iust.ac.ir, b_minaei@iust.ac.ir
*: *Corresponding author: Email msharifi@iust.ac.ir

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

Spelling errors in digital documents are often caused by operational and cognitive mistakes, or by the lack of full knowledge about the language of the written documents. Computer-assisted solutions can help to detect and suggest replacements. In this paper, we present a new string distance metric for the Persian language to rank respelling suggestions of a misspelled Persian word by considering the effects of keyboard layout on typographical spelling errors as well as the homomorphic and homophonic aspects of words for orthographical misspellings. We also consider the misspellings caused by disregarded diacritics. Since the proposed string distance metric is custom-designed for the Persian language, we present the spelling aspects of the Persian language such as homomorphs, homophones, and diacritics. We then present our statistical analysis of a set of large Persian corpora to identify the causes and the types of Persian spelling errors. We show that the proposed string distance metric has a higher mean average precision and a higher mean reciprocal rank in ranking respelling candidates of Persian misspellings in comparison with other metrics such as the Hamming, Levenshtein, Damerau–Levenshtein, Wagner–Fischer, and Jaro–Winkler metrics.

Information

Type: Articles
Information: Natural Language Engineering , Volume 19 , Issue 2 , April 2013 , pp. 259 - 284

DOI: https://doi.org/10.1017/S1351324912000186 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2012

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

Article purchase

Temporarily unavailable

References

Abramovici, S. 1983. Errors in proofreading: evidence for syntactic control of letter processing. Memory and Cognition 11 (3): 258–61.CrossRef Google Scholar PubMed

Alberga, C. 1967. String similarity and misspellings. Communications of the ACM 10 (5): 302–13.CrossRef Google Scholar

AleAhmad, A., Amiri, H., Darrudi, E., Rahgozar, M., and Oroumchian, F. 2009. Hamshahri: a standard Persian text collection. Knowledge-Based System 22 (5): 382–7.CrossRef Google Scholar

Angell, R., Freund, G. and Willett, P. 1983. Automatic spelling correction using a trigram similarity measure. Information Processing and Management 19 (4): 255–61.CrossRef Google Scholar

Berkel, B. V. and Smedt, K. D. 1988. Triphone analysis: a combined method for the correction of orthographical and typographical errors. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, TX, USA.Google Scholar

Bledsoe, W. W. and Browning, I. 1959. Pattern recognition and reading by machine. In Proceedings of the Eastern Joint IRE-AIEE-ACM Computer Conference, Boston, MA, USA.Google Scholar

Brill, E. and Moore, R. C. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong.Google Scholar

Brown, A. 1988. A Singaporean corpus of misspellings: analysis and implications. Journal of the Simplified Spelling Society 3: 4–10.Google Scholar

Comeau, C., and Wilbur, W. J. 2004. Non-word identification or spell checking without a dictionary. Journal of the American Society for Information Science and Technology 55 (2): 169–77.CrossRef Google Scholar

Damerau, F. J. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM 7 (3): 171–6.CrossRef Google Scholar

Davis, C. O. 1922. What is a misspelling? In Cattell, J. McKeen, Ryan, W. Carson JR., Walters, Raymond (eds.), School and Society, vol. 15, pp. 117–148. New York, USA: The Science Press.Google Scholar

Davrondjon, G., and Janowski, T. 2002. Developing a Spell-Checker for Tajik Using RAISE, Lecture Notes in Computer Science vol. 2495, pp. 401–5. Berlin, Germany: Springer-Verlag.Google Scholar

De Heer, T. 1982. The application of the concept of homeosemy to natural language information retrieval. Information Processing and Management 18 (5): 229–36.CrossRef Google Scholar

Eastman, C. M., and McLean, D. S. 1981. On the need for parsing ill-formed input. Computational Linguistics 7 (4): 257.Google Scholar

Golding, A. R. 1996. A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the 3rd Workshop on Very Large Corpora, Cambridge, MA, USA.Google Scholar

Golding, A. R., and Roth, D. 1999. A winnow-based approach to context-sensitive spelling correction. Machine Learning 34 (1–3): 107–30.CrossRef Google Scholar

Hamming, R. 1950. Error detecting and error correcting codes. Bell System Technical Journal 29 (2): 147–60.CrossRef Google Scholar

Hanson, A., Riseman, E., and Fisher, E. 1976. Context in word recognition. Pattern Recognition 8 (1): 35–45.CrossRef Google Scholar

Heath, T. 1956. The Thirteen Books of Euclids Elements, vol. 1. New York, USA: Doner.Google Scholar

Hirschberg, D. 1975. A linear space algorithm for computing maximal common subsequences. Communications of the ACM 18 (6): 341–4.CrossRef Google Scholar

Hodge, V. J., and Austin, J. 2001. An evaluation of phonetic spell checkers. Technical Report ycs 338, Department of Computer Science, University of York, York, UK.Google Scholar

Hodge, V. J., and Austin, J. 2003. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transaction on Knowledge and Data Engineering 15 (5): 1073–81.CrossRef Google Scholar

Holmes, D., and McCabe, M. C. 2002. Improving precision and recall for soundex retrieval. In Proceedings of the International Conference on Information Technology: Coding and Computing, Las Vegas, NV, USA, pp. 22–6.CrossRef Google Scholar

Holmes, V. M., and Malone, N. 2004. Adult spelling strategies. Reading and Writing 17 (6): 537–66.CrossRef Google Scholar

Jaro, M. A. 1989. Advances in record linking methodology as applied to the 1985 census of Tampa Florida. Journal of the American Statistical Society 84 (406): 414–20.CrossRef Google Scholar

Jaro, M. A. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine 14 (5–7): 491–8.CrossRef Google Scholar PubMed

Kantor, P. B., and Voorhees, E. M. 2000. The trec-5 confusion track: comparing retrieval methods for scanned text. Information Retrieval 2 (2–3): 165–76.Google Scholar

Keen, E. M. 1971. Evaluation Parameters. Englewood Cliffs, NJ, USA: Prentice-Hall, pp 74–111.Google Scholar

Korpela, A. J. 2006. Unicode Explained. Sebastopol, CA, USA: O'Reilly Media.Google Scholar

Kukich, K. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys 24: 378– 39.CrossRef Google Scholar

Lazard, G. 2012. A Grammar of Contemporary Persian. Persian Studies Series. Costa Mesa, CA, USA: Mazda Publishers.Google Scholar

Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10: 707–10.Google Scholar

Mahootian, S., and Gebhardt, L. 1997. Persian. London, UK: Routledge.Google Scholar

Masek, W. 1980. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20 (1): 18–31.CrossRef Google Scholar

Means, L. G. 1988. Cn yur cmputr raed ths? In Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, TX, USA.Google Scholar

Megerdoomian, K. 2000. Unification-based Persian morphology. In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLing), Mexico City, Mexico.Google Scholar

Megerdoomian, K. 2004. Finite-state morphological analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland.Google Scholar

Min, K., Wilson, W. H. and Moon, Y.-J. 2000. Typographical and orthographical spelling error correction. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.Google Scholar

Mitton, R. 1987. Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing and Management 23 (5): 495–505.CrossRef Google Scholar

Mitton, R. 2009. Ordering the suggestions of a spell checker without using context. Natural Language Engineering 15 (2): 173–92.CrossRef Google Scholar

Naseem, T., and Hussain, S. 2007. A novel approach for ranking spelling error corrections for Urdu. Language Resources and Evaluation 41: 117–28.CrossRef Google Scholar

Odell, M. K., and Russell, R. C. 1918. US Patents Nos. 1261167 (1918) and 1435663 (1922). Washington, DC: US Patent and Trademark Office.Google Scholar

Ola, O. 1973. Types of orthographic error. Scandinavian Journal of Educational Research 17 (1): 95–115.Google Scholar

Peterson, J. L. 1980a. Computer programs for detecting and correcting spelling errors. Communications of the ACM 23 (12): 676–87.CrossRef Google Scholar

Peterson, J. L. 1980b. Computer Programs for Spelling Correction. New York, USA: Springer-Verlag.Google Scholar

Peterson, J. L. 1986. A note on undetected typing errors. Communications of the ACM 29 (7): 633–7.CrossRef Google Scholar

Pollock, J., and Zamora, A. 1983. Collection and characterization of spelling errors in scientific and scholarly text. Journal of the American Society for Information Science, 34 (1): 51–8.CrossRef Google Scholar

Pollock, J. J., and Zamora, A. 1984. Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27 (4): 358–68.CrossRef Google Scholar

Rasooli, M., Kashefi, O., and Minaei, B. 2011. Effect of adaptive spell checking in persian. In Proceedings of the 7th International Language Processing and Knowledge Engineering (NLP-KE), Tokushima, Japan.Google Scholar

Ristad, E. S., and Yianilos, P. N. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 20 (5): 522–32.CrossRef Google Scholar

Salthouse, T. A. 1986. Perceptual, cognitive, and motoric aspects of transcription typing. Psychological Bulletin 99 (3): 303–19.CrossRef Google Scholar PubMed

Shaalan, K., Allam, A., and Gomah, A. 2003. Towards automatic spell checking for arabic. In Proceedings of the Conference on Language Engineering, Cairo, Egypt.Google Scholar

Shamsfard, M., Jafari, H. S., and Ilbeygi, M. 2010. Step-1: a set of fundamental tools for Persian text processing. In Proceedings of the International Conference on Language Resources and Evaluation, Valletta, Malta.Google Scholar

Stauffer, R. 1949. Research in spelling and handwriting. Review of Educational Research 19 (2): 118–24.Google Scholar

Sterling, C. M. 1983. Spelling errors in context. British Journal of Psychology 74 (3): 353–64.CrossRef Google Scholar PubMed

Toutanova, K., and Moore, R. C. 2002. Pronunciation modeling for improved spelling correction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA.Google Scholar

Ukkonen, E. 1985. Algorithms for approximate string matching. Information Control 64 (1–3): 100–18.CrossRef Google Scholar

Ullman, J. 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Computer Journal 20 (2): 141–7.CrossRef Google Scholar

Wagner, R. A., and Fischer, M. J. 1974. The string-to-string correction problem. Journal of the ACM 21 (1): 168–73.CrossRef Google Scholar

Winkler, W. 1999. The State of Record Linkage and Current Research Problems. Statistics of Income Division, Internal Revenue Service Publication R, 4. Washington, DC: Bureau of the Census.Google Scholar

Winkler, W., and Thibaudeau, Y. 1991. An application of the Fellegi-Sunter model of record linkage to the 1990 US decennial census. Research Report RR91/09, US Bureau of the Census, Washington, DC.Google Scholar

Worthy, J., and Viise, N. M. 1996. Morphological, phonological, and orthographic differences between the spelling of normally achieving children and basic literacy adults. Reading and Writing 8 (2): 139–59.CrossRef Google Scholar

Yannakoudakis, E., and Fawthrop, D. 1983. The rules of spelling errors. Information Processing and Management 19 (2): 87–99.CrossRef Google Scholar

Yianilos, P. N. 1983. A dedicated comparator matches symbol strings fast and intelligentlydec. Electronics Magazine (McGraw-Hill), pp. 113–7.Google Scholar

Zobel, J., and Dart, P. 1996. Phonetic string matching: lessons from information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland.Google Scholar

Article contents

A novel string distance metric for ranking Persian respelling suggestions

Abstract

Information

Access options

Article purchase

Temporarily unavailable

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests