
A novel string distance metric for ranking Persian respelling suggestions

OMID KASHEFI, MOHSEN SHARIFI and BEHROOZ MINAIE
Abstract

Spelling errors in digital documents are often caused by operational and cognitive mistakes, or by incomplete knowledge of the language in which the documents are written. Computer-assisted solutions can help to detect such errors and suggest replacements. In this paper, we present a new string distance metric for the Persian language that ranks respelling suggestions for a misspelled Persian word. The metric considers the effect of keyboard layout on typographical errors, the homomorphic and homophonic aspects of words for orthographical misspellings, and misspellings caused by disregarded diacritics. Because the proposed metric is custom-designed for Persian, we first present the relevant spelling aspects of the language, such as homomorphs, homophones, and diacritics. We then present our statistical analysis of a set of large Persian corpora to identify the causes and types of Persian spelling errors. We show that the proposed metric achieves a higher mean average precision and a higher mean reciprocal rank in ranking respelling candidates for Persian misspellings than other metrics such as the Hamming, Levenshtein, Damerau–Levenshtein, Wagner–Fischer, and Jaro–Winkler metrics.
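To make the ranking idea concrete, the following is a minimal sketch and not the metric proposed in the paper. It assumes a Levenshtein-style dynamic programme in which substituting two neighbouring keys costs less than substituting distant ones, and it scores a ranking with the reciprocal rank of the correct word. The adjacency table `ADJACENT`, the cost values 0.5 and 1.0, and all function names are hypothetical and chosen for illustration; the table covers a few Latin keys only, whereas a real system would need a Persian keyboard layout along with the homophone, homomorph, and diacritic handling described above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical adjacency map for a few Latin-keyboard keys (illustration only).
ADJACENT: Dict[str, Set[str]] = {
    "a": {"q", "w", "s", "z"},
    "s": {"a", "w", "e", "d", "x", "z"},
    "d": {"s", "e", "r", "f", "c", "x"},
}


def substitution_cost(a: str, b: str) -> float:
    """Return 0 for equal characters, 0.5 for neighbouring keys, 1 otherwise."""
    if a == b:
        return 0.0
    if b in ADJACENT.get(a, set()) or a in ADJACENT.get(b, set()):
        return 0.5  # assumed discount for a likely slip of the finger
    return 1.0


def weighted_edit_distance(source: str, target: str) -> float:
    """Levenshtein-style distance with a keyboard-aware substitution cost."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, sc in enumerate(source, start=1):
        curr = [float(i)]
        for j, tc in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # delete sc
                curr[j - 1] + 1.0,                        # insert tc
                prev[j - 1] + substitution_cost(sc, tc),  # substitute
            ))
        prev = curr
    return prev[-1]


def rank_candidates(misspelling: str, candidates: List[str]) -> List[str]:
    """Sort candidates by distance to the misspelling; ties break alphabetically."""
    return sorted(candidates,
                  key=lambda c: (weighted_edit_distance(misspelling, c), c))


def reciprocal_rank(ranked: List[str], correct: str) -> float:
    """Return 1/position of the correct word, or 0 if it is not in the ranking."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(cases: List[Tuple[str, List[str], str]]) -> float:
    """Average the reciprocal rank over (misspelling, candidates, correct) cases."""
    if not cases:
        return 0.0
    total = sum(reciprocal_rank(rank_candidates(m, cands), correct)
                for m, cands, correct in cases)
    return total / len(cases)
```

Under these assumed costs, `rank_candidates("sat", ["mat", "dat"])` places `"dat"` first, because `s` and `d` are neighbouring keys in the table while `s` and `m` are not.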

*Corresponding author. Email: msharifi@iust.ac.ir
Linked references

S. Abramovici 1983. Errors in proofreading: evidence for syntactic control of letter processing. Memory and Cognition 11 (3): 258–61.

C. Alberga 1967. String similarity and misspellings. Communications of the ACM 10 (5): 302–13.

A. AleAhmad, H. Amiri, E. Darrudi, M. Rahgozar, and F. Oroumchian 2009. Hamshahri: a standard Persian text collection. Knowledge-Based Systems 22 (5): 382–7.

R. Angell, G. Freund, and P. Willett 1983. Automatic spelling correction using a trigram similarity measure. Information Processing and Management 19 (4): 255–61.

B. V. Berkel and K. D. Smedt 1988. Triphone analysis: a combined method for the correction of orthographical and typographical errors. In Proceedings of the Second Conference on Applied Natural Language Processing, Austin, TX, USA.

C. Comeau, and W. J. Wilbur 2004. Non-word identification or spell checking without a dictionary. Journal of the American Society for Information Science and Technology 55 (2): 169–77.

F. J. Damerau 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM 7 (3): 171–6.

T. De Heer 1982. The application of the concept of homeosemy to natural language information retrieval. Information Processing and Management 18 (5): 229–36.

A. R. Golding, and D. Roth 1999. A winnow-based approach to context-sensitive spelling correction. Machine Learning 34 (1–3): 107–30.

R. Hamming 1950. Error detecting and error correcting codes. Bell System Technical Journal 29 (2): 147–60.

A. Hanson, E. Riseman, and E. Fisher 1976. Context in word recognition. Pattern Recognition 8 (1): 35–45.

D. Hirschberg 1975. A linear space algorithm for computing maximal common subsequences. Communications of the ACM 18 (6): 341–4.

V. J. Hodge, and J. Austin 2003. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering 15 (5): 1073–81.

V. M. Holmes, and N. Malone 2004. Adult spelling strategies. Reading and Writing 17 (6): 537–66.

M. A. Jaro 1989. Advances in record linking methodology as applied to the 1985 census of Tampa Florida. Journal of the American Statistical Association 84 (406): 414–20.

M. A. Jaro 1995. Probabilistic linkage of large public health data files. Statistics in Medicine 14 (5–7): 491–8.

P. B. Kantor, and E. M. Voorhees 2000. The TREC-5 confusion track: comparing retrieval methods for scanned text. Information Retrieval 2 (2–3): 165–76.

K. Kukich 1992. Techniques for automatically correcting words in text. ACM Computing Surveys 24 (4): 377–439.

S. Mahootian, and L. Gebhardt 1997. Persian. London, UK: Routledge.

W. Masek 1980. A faster algorithm computing string edit distances. Journal of Computer and System Sciences 20 (1): 18–31.

L. G. Means 1988. Cn yur cmputr raed ths? In Proceedings of the 2nd Conference on Applied Natural Language Processing, Austin, TX, USA.

K. Megerdoomian 2004. Finite-state morphological analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-Based Languages, Geneva, Switzerland.

R. Mitton 1987. Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing and Management 23 (5): 495–505.

T. Naseem, and S. Hussain 2007. A novel approach for ranking spelling error corrections for Urdu. Language Resources and Evaluation 41: 117–28.

J. L. Peterson 1980a. Computer programs for detecting and correcting spelling errors. Communications of the ACM 23 (12): 676–87.

J. L. Peterson 1980b. Computer Programs for Spelling Correction. New York, USA: Springer-Verlag.

J. L. Peterson 1986. A note on undetected typing errors. Communications of the ACM 29 (7): 633–7.

J. Pollock, and A. Zamora 1983. Collection and characterization of spelling errors in scientific and scholarly text. Journal of the American Society for Information Science 34 (1): 51–8.

J. J. Pollock, and A. Zamora 1984. Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27 (4): 358–68.

E. S. Ristad, and P. N. Yianilos 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 20 (5): 522–32.

T. A. Salthouse 1986. Perceptual, cognitive, and motoric aspects of transcription typing. Psychological Bulletin 99 (3): 303–19.

R. Stauffer 1949. Research in spelling and handwriting. Review of Educational Research 19 (2): 118–24.

C. M. Sterling 1983. Spelling errors in context. British Journal of Psychology 74 (3): 353–64.

E. Ukkonen 1985. Algorithms for approximate string matching. Information and Control 64 (1–3): 100–18.

J. Ullman 1977. A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Computer Journal 20 (2): 141–7.

R. A. Wagner, and M. J. Fischer 1974. The string-to-string correction problem. Journal of the ACM 21 (1): 168–73.

J. Worthy, and N. M. Viise 1996. Morphological, phonological, and orthographic differences between the spelling of normally achieving children and basic literacy adults. Reading and Writing 8 (2): 139–59.

E. Yannakoudakis, and D. Fawthrop 1983. The rules of spelling errors. Information Processing and Management 19 (2): 87–99.

Natural Language Engineering
  • ISSN: 1351-3249
  • EISSN: 1469-8110
  • URL: /core/journals/natural-language-engineering