[Trennmuster] FOSDEM and spell checking

Pander pander at users.sourceforge.net
Mon Jan 25 17:06:25 CET 2016


Hi all,

Which of you will be at FOSDEM this weekend? It would be nice to meet up
or go for a beer. I will be giving a talk there in the Coding for Language
Communities devroom and will be in Brussels from Thursday on.

While gathering some data on spell checking, I processed the word lists
from this project to see how they fare with the spell checkers on the
latest stable Ubuntu release. Below is a report of what Hunspell and
Aspell make of them with their respective de_DE dictionaries.

This is in no way criticism: 84% on 460,000 words that were not used to
build the dictionary is very high. I am curious about how we can use
this project to improve spell checking. Apparently, the source of the
German dictionaries is https://www.j3e.de/ispell/igerman98/ Is its
maintainer involved in this project too? In what way can we help him
improve the dictionaries?

Myself, I am working on the Dutch equivalent of all this with my
colleagues at OpenTaal. At the moment we are working on a validation
framework for spell checking and a system for fine-tuning the compound
mechanism in Hunspell. Since Dutch and German have similar requirements,
it would be nice to collaborate on this if possible.

I found one bug in the spell checking with Hunspell. For some reason the
engine gets configured to ISO 8859-1 when checking de_DE, while UTF-8
is expected. I was able to work around it by excluding words containing
Œ or Š, which means skipping only the following words that cannot be
encoded in ISO 8859-1 (Latin-1):
  Œuvre
  Œuvres
  Škoda
  Škodas
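
The exclusion can be expressed as a simple encodability test. This is
only a minimal sketch of the workaround, not the code I actually ran;
the function names latin1_encodable and split_by_encoding are made up
for illustration.

```python
def latin1_encodable(word: str) -> bool:
    """Return True if every character of `word` exists in ISO 8859-1."""
    try:
        word.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


def split_by_encoding(words):
    """Split words into (checkable, skipped) lists, preserving order.

    Hypothetical helper: `skipped` holds the words that would have to be
    excluded while the checker is configured for ISO 8859-1.
    """
    checkable, skipped = [], []
    for word in words:
        (checkable if latin1_encodable(word) else skipped).append(word)
    return checkable, skipped
```

Words with umlauts or ß pass this test, since those characters are part
of ISO 8859-1; only characters such as Œ and Š fall outside it.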

The software I wrote to run word lists through different spell checkers
with different dictionaries is somewhere on GitHub but far from
presentable. Would you be interested in my including the word lists from
this trennmuster project in its reporting by default? Are there any other
word lists you would recommend validating against?
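
To make the reported figures concrete, here is a rough sketch of how
such a per-wordlist score could be computed. It assumes that efficiency
is simply correct divided by total, expressed as a percentage rounded to
one decimal, which agrees with the numbers in the report. The names
score_wordlist and efficiency_percent are illustrative, and is_correct
is a stand-in for whatever call the spell checker in question provides.

```python
def efficiency_percent(correct: int, total: int) -> float:
    """Share of accepted words as a percentage, rounded to one decimal."""
    if total == 0:
        return 0.0
    return round(100.0 * correct / total, 1)


def score_wordlist(words, is_correct):
    """Count how many words a checker accepts.

    `is_correct` is any callable taking a word and returning True or
    False; it is a placeholder for a real spell checker binding.
    """
    words = list(words)
    total = len(words)
    correct = sum(1 for word in words if is_correct(word))
    return {
        "total": total,
        "correct": correct,
        "incorrect": total - correct,
        "efficiency": efficiency_percent(correct, total),
    }
```

For example, efficiency_percent(387874, 459806) gives 84.4, the Hunspell
figure for trennmuster-wortliste.txt.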

Regards,

Pander



checker    hunspell
library    1.3.3-3ubuntu3
language   de_DE
dictionary 20131206-6

  wordlist   trennmuster-wortliste.txt
  total      459806
  correct    387874
  incorrect  71932
  efficiency 84.4%

  wordlist   trennmuster-arzneiwirkstoffnamen.txt
  total      7722
  correct    742
  incorrect  6980
  efficiency 9.6%

  wordlist   trennmuster-arzneiwirkstoffnamen-supplement.txt
  total      2626
  correct    15
  incorrect  2611
  efficiency 0.6%



checker    aspell
library    0.60.7~20110707-3
language   de_DE
dictionary 20131206-6

  wordlist   trennmuster-wortliste.txt
  total      459810
  correct    315803
  incorrect  144007
  efficiency 68.7%

  wordlist   trennmuster-arzneiwirkstoffnamen.txt
  total      7722
  correct    260
  incorrect  7462
  efficiency 3.4%

  wordlist   trennmuster-arzneiwirkstoffnamen-supplement.txt
  total      2626
  correct    3
  incorrect  2623
  efficiency 0.1%



