SoundexGR: An algorithm for phonetic matching for the Greek language

Published online by Cambridge University Press: 04 February 2022

Antrei Kavros
Affiliation:
Computer Science Department, University of Crete, Heraklion, Crete, Greece
Yannis Tzitzikas*
Affiliation:
Computer Science Department, University of Crete, Heraklion, Crete, Greece; Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH), Heraklion, Crete, Greece
*Corresponding author. E-mail: tzitzik@ics.forth.gr

Abstract

Text often contains typos, which can negatively affect various Information Retrieval and Natural Language Processing tasks. Although there is a wide variety of choices for tackling this issue in the English language, this is not the case for other languages. For the Greek language, most of the existing phonetic algorithms provide rather insufficient support. For this reason, in this paper, we introduce an algorithm for phonetic matching designed for the Greek language: we start from the original Soundex and we redesign and extend it to accommodate the phonetic rules of Greek, ending up with a family of algorithms that we call ${\tt Soundex}_{GR}$. We then report experimental results showing how the algorithm behaves in different scenarios, and we compare various parameter settings of the algorithm to reveal the trade-off between precision and recall in datasets with different kinds of errors. We also compare against matching based on stemming, full phonemic transcription, and edit distance; the results show that ${\tt Soundex}_{GR}$ performs better (indicatively, it achieves an F-Score over 95% in collections of similar-sounding words). The simplicity, efficiency, and effectiveness of the proposed algorithm make it applicable and adaptable to a wide range of tasks.
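For readers unfamiliar with the starting point mentioned above, the following is a minimal Python sketch of the classic English (American) Soundex. It is not the ${\tt Soundex}_{GR}$ algorithm of this paper, whose Greek-specific rules and buckets are given in the article's tables; the function name `soundex` and the `length` parameter are illustrative assumptions.

```python
# Classic English Soundex digit classes. Vowels, H, W and Y carry no digit.
_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word: str, length: int = 4) -> str:
    """Return the classic Soundex code of `word` (ASCII letters only)."""
    # Keep only A-Z; any other character (including Greek letters) is dropped.
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    code = [first]
    prev = _CODES.get(first, "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        # Emit a digit only if it differs from the previous letter's digit.
        if digit and digit != prev:
            code.append(digit)
        # H and W do not separate equal digits; vowels (and Y) do.
        if c not in "HW":
            prev = digit
    # Truncate or zero-pad to the requested code length.
    return "".join(code)[:length].ljust(length, "0")


# Two words are considered a phonetic match when their codes are equal.
print(soundex("Robert"), soundex("Rupert"))
```

Matching by code equality is what makes the approach cheap: codes can be precomputed and indexed, so lookup does not require pairwise string comparison. The `length` parameter mirrors the idea of varying the code length, which the abstract relates to the trade-off between precision and recall.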

Information

Type
Article
Creative Commons
Creative Commons License - CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Table 1. Consonants Replacement in the Soundex

Table 2. Phonetic rules

Table 3. ${\tt Soundex}_{GR}$ buckets

Table 4. Loud consonants in Greek

Table 5. Silent consonants in Greek

Table 6. Examples of ${\tt Soundex}_{GR}$ code generation, through different stages

Table 7. Consonants Replacement in the ${\tt{Soundex}}^{naive}_{GR}$

Figure 1. An overview of the datasets used for evaluation purposes.

Table 8. Indicative good examples for both ${\tt{Soundex}}^{naive}_{GR}$ and ${\tt Soundex}_{GR}$

Table 9. Indicative examples where ${\tt{Soundex}}^{naive}_{GR}$ fails while ${\tt Soundex}_{GR}$ succeeds

Figure 2. Precision levels for each collection.

Figure 3. Recall levels for each collection.

Figure 4. F-measure levels for each collection.

Table 10. Average F-Score (over Dataset A, Dataset B, Dataset C, Dataset D) for different lengths of ${\tt Soundex}_{GR}$

Figure 5. Precision, Recall, and F-Score evaluation metrics on Dataset A (top left), Dataset B (top right), Dataset C (bottom left), and Dataset D (bottom right) for ${\tt Soundex}_{GR}$ code lengths 1 to 10.

Figure 6. Precision levels for each collection (also for stemming).

Figure 7. Recall levels for each collection (also for stemming).

Figure 8. F-measure levels for each collection (also for stemming).

Figure 9. Frequency of ${\tt Soundex}_{GR}$ codes (left), and lemmas of the stemmer (right) over the dictionary.

Table 11. Most frequent ${\tt Soundex}_{GR}$ codes

Table 12. Most frequent stems

Figure 10. Indicative examples of full phonemic transcription.

Figure 11. Excerpt from ${\tt Dataset\ D}^{ext}$.

Table 13. Evaluating 10 matching methods over ${\tt Dataset\ D}^{ext}$

Figure 12. Average F-Score (top), Precision (middle), and Recall (bottom) as a function of code length (left Y-axis, blue dots) and dataset size (X-axis) of ${\tt Soundex}_{GR}$ in Dataset A, Dataset B, Dataset C, and Dataset D.

Figure 13. Excerpt from ${\tt Dataset\ E}_{1.4K-7.6K}$.

Table 14. Evaluating 10 methods over ${\tt Dataset\ E}_{1.4K-7.6K}$

Table 15. Evaluating 10 methods over ${\tt Dataset\ F}_{2.8K-15.2K}$

Table 16. Evaluating 10 methods over ${\tt Dataset\ G}_{5.7K-30.4K}$

Figure 14. Recall (top), Precision (middle), and F-Score (bottom) as a function of code length (left Y-axis, blue dots) and dataset size (X-axis) of ${\tt Soundex}_{GR}$ in ${\tt Dataset\ H}$.

Figure 15. A synopsis of the main evaluation results.

Figure 16. A tool for visual inspection of the produced codes, approximate matching, and others.

Figure 17. Suggestions for the misspelled word based on code length = 6.

Figure 18. Demonstrating approximate matching methods.