[Trennmuster] FOSDEM and spell checking
Pander
pander at users.sourceforge.net
Mon Jan 25 17:06:25 CET 2016
Hi all,
Which of you will be at FOSDEM this weekend? Would be nice to meet up or
go for a beer. I will be giving a talk there in the Coding for Language
Communities devroom and will be in Brussels from Thursday on.
While gathering some data on spell checking, I ran the word lists from
this project through the spell checkers on the latest stable Ubuntu,
just to see how they fare. Below is a list of what Hunspell and Aspell
make of them with their respective de_DE dictionaries.
This is in no way criticism: the 84% that Hunspell reaches on 460,000
words that were not used to build the dictionary is very high. I am
curious about how we can use this project to improve spell checking.
Apparently, the source of the German dictionaries is
https://www.j3e.de/ispell/igerman98/ Is its maintainer involved in this
project too? In what way can we help him improve the dictionaries?
Myself, I am working on the Dutch equivalent of all this with my
colleagues at OpenTaal. At the moment we are working on a validation
framework for spell checking and a system for fine-tuning the compound
mechanism in Hunspell. Since Dutch and German have similar requirements,
it would be nice to collaborate on this if possible.
I found one bug in the spell checking with Hunspell. For some reason the
engine gets configured to ISO8859-1 when checking for de_DE, while UTF-8
is expected. I could work around it by excluding words containing Œ or
Š, which essentially skips only the following words that cannot be
encoded in Latin-1:
Œuvre
Œuvres
Škoda
Škodas
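To illustrate the workaround, here is a minimal sketch in Python. It is not the actual tool; the function name split_by_encoding and the two extra sample words (Straße, Über) are made up for this example. It separates the words that can be encoded in ISO-8859-1 from those that cannot:

```python
def split_by_encoding(words, encoding="iso-8859-1"):
    """Separate words that can be encoded in `encoding` from those that cannot."""
    encodable, skipped = [], []
    for word in words:
        try:
            word.encode(encoding)
        except UnicodeEncodeError:
            # The word contains a character outside the target encoding.
            skipped.append(word)
        else:
            encodable.append(word)
    return encodable, skipped


# Hypothetical sample: the four skipped words plus two ordinary German words.
words = ["Œuvre", "Œuvres", "Škoda", "Škodas", "Straße", "Über"]
encodable, skipped = split_by_encoding(words)
print(skipped)
```

Words with ß or umlauts pass, since those characters exist in Latin-1; only Œ and Š fall outside it.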
The software I made to process word lists in different spell checkers
with different dictionaries is somewhere on GitHub but far from
presentable. Would you be interested in my including the word lists from
this trennmuster project in its reporting by default? Are there any
other word lists you would recommend validating against?
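As a rough idea of what such a report involves, here is a minimal sketch in Python. It is not the code on GitHub; the function report, the is_correct callback and the tiny stand-in dictionary are assumptions made for this example. A real run would plug in a call to Hunspell or Aspell where the stand-in is used:

```python
def report(wordlist_name, words, is_correct):
    """Count accepted and rejected words and compute the acceptance rate."""
    total = len(words)
    correct = sum(1 for word in words if is_correct(word))
    incorrect = total - correct
    # Efficiency is the share of words the checker accepts, in percent.
    efficiency = 100.0 * correct / total if total else 0.0
    return {
        "wordlist": wordlist_name,
        "total": total,
        "correct": correct,
        "incorrect": incorrect,
        "efficiency": f"{efficiency:.1f}%",
    }


# Hypothetical stand-in for a real checker: a tiny set of accepted words.
known = {"Haus", "Baum"}
result = report("example.txt", ["Haus", "Baum", "Hauss"], known.__contains__)
for key, value in result.items():
    print(key, value)
```

Keeping the checker behind a plain callback means the same counting code can serve several engines and dictionaries.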
Regards,
Pander

checker     hunspell
library     1.3.3-3ubuntu3
language    de_DE
dictionary  20131206-6
  wordlist    trennmuster-wortliste.txt
  total       459806
  correct     387874
  incorrect   71932
  efficiency  84.4%
  wordlist    trennmuster-arzneiwirkstoffnamen.txt
  total       7722
  correct     742
  incorrect   6980
  efficiency  9.6%
  wordlist    trennmuster-arzneiwirkstoffnamen-supplement.txt
  total       2626
  correct     15
  incorrect   2611
  efficiency  0.6%

checker     aspell
library     0.60.7~20110707-3
language    de_DE
dictionary  20131206-6
  wordlist    trennmuster-wortliste.txt
  total       459810
  correct     315803
  incorrect   144007
  efficiency  68.7%
  wordlist    trennmuster-arzneiwirkstoffnamen.txt
  total       7722
  correct     260
  incorrect   7462
  efficiency  3.4%
  wordlist    trennmuster-arzneiwirkstoffnamen-supplement.txt
  total       2626
  correct     3
  incorrect   2623
  efficiency  0.1%