[Trennmuster] Hyphenation pattern for compounds with joining hyphen
Keno Wehr
wehr at abgol.de
Sun Jul 17 00:08:11 CEST 2022
On 16.07.22 at 16:33, Pander wrote:
> I am working on updating the hyphenation patterns for Dutch for TeX,
> libhyphen, etc. The last time they were updated was end of 1996 after
> a spelling change. In 2006, there was again a spelling change. The old
> patterns do not support it and are missing a lot of new words. Our
> collection for the word list and spelling checker has grown by at least
> 100,000 words since then. We also have more hyphenation patterns, but
> the last time somebody worked on them experimentally was in 2007, and
> that work is unfortunately undocumented and abandoned.
>
> I have known the Trennmuster project for years and have made some minor
> contributions in the past. Also, on the advice of the creator of the
> Frisian hyphenation patterns, the approach of Trennmuster would be
> good for the new hyphenation patterns for Dutch, as we have identical
> transcription changes when hyphenating.
>
> Unlike German, Dutch has many compounds with a joining hyphen, e.g.
> zwart-witfotografie, toe-eigenen, normalisatie-instellingen and
> opera-uitvoeringen. These hyphens have priority for hyphenation over
> all other hyphenation locations. The preferred hyphenation is
>
> ..... ..... normalisatie-
> instellingen
>
> but when absolutely needed, the following are also possible
>
> ..... ..... ..... norma-
> lisatie-instellingen
>
> .... normalisatie-instel-
> lingen
Thank you for explaining.
I fear what you need is not easy to achieve. The Liang patterns used by
TeX and other programs only determine where the hyphenation points are;
they do not prioritize them. So the default algorithm treats all
hyphenation points as equally good candidates for a line break.
Prioritizing hyphenation points can only be achieved by changing the
algorithm.
It might surprise you that all the detailed marks of hyphenation points
and their quality in the Trennmuster word list have only been of
theoretical interest so far. When hyphenation patterns like the ones
currently used by TeX are generated, all hyphenation points are treated
the same, e.g. “Ver<si-che-rungs==kauf=mann” is converted to just
“Ver-si-che-rungs-kauf-mann” and all the effort to differentiate the
hyphenation points is lost.
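To illustrate the loss of information, here is a minimal Python sketch of such a flattening step. It is not the project's actual conversion script; the function name is made up, and it only handles the marks that occur in the example above ("<", "=", "==" and "-").

```python
import re


def flatten_marks(marked_word):
    """Collapse weighted hyphenation marks into plain hyphens.

    Assumption: only the marks '<', '=', '==' and '-' occur, as in the
    example from this message. Other marks are left untouched.
    """
    # Any run of '<', '=' or '-' becomes a single undifferentiated hyphen.
    return re.sub(r"[<=-]+", "-", marked_word)


print(flatten_marks("Ver<si-che-rungs==kauf=mann"))
```

After this step, the compound boundary "==" can no longer be told apart from an ordinary syllable boundary.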
As I was unhappy with the situation, I wrote a LuaLaTeX package
providing a weighted hyphenation algorithm:
https://codeberg.org/wehr/autotype
The package is in an early state and not yet on CTAN. It uses Liang
patterns, but three sets of them, each determining hyphenation points of
a different priority.
Section 3 of the manual might be of interest:
https://codeberg.org/wehr/autotype/src/branch/main/doc/autotype-de.pdf
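The general idea of weighting hyphenation points can be sketched in Python as follows. This is not the algorithm of the autotype package, only a toy illustration. The mapping from marks to priorities (MARK_PRIORITY) and both function names are assumptions made for this example, and the strict "highest priority wins" rule is just one of several conceivable policies.

```python
import re

# Hypothetical priorities, chosen for illustration only.
MARK_PRIORITY = {"==": 3, "=": 2, "<": 2, "-": 1}


def weighted_points(marked_word):
    """Return (position, priority) pairs for a marked word.

    'position' counts the letters before the hyphenation point.
    """
    points = []
    pos = 0
    # The capturing group keeps the marks in the result of re.split.
    for token in re.split(r"(==|=|<|-)", marked_word):
        if token in MARK_PRIORITY:
            points.append((pos, MARK_PRIORITY[token]))
        else:
            pos += len(token)
    return points


def best_break(points, max_len):
    """Pick the highest-priority point whose first part, plus the
    hyphen character, fits into max_len characters.

    Ties are resolved in favour of the later position.
    Returns None if no hyphenation point fits.
    """
    fitting = [(prio, pos) for pos, prio in points if pos + 1 <= max_len]
    return max(fitting)[1] if fitting else None


points = weighted_points("Ver<si-che-rungs==kauf=mann")
print(points)
print(best_break(points, 20))
```

With enough room, the sketch breaks at the compound boundary "Versicherungs-"; it only falls back to lower-priority points when that position does not fit.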
> I did not find any examples in the wordlist file, so I searched for an
> example German word that fits my problem and was not yet in wordlist,
> hence Alanin-Aminotransferase with possible hyphenations being
>
> ..... ..... Alanin-
> Aminotransferase
>
> ..... ..... ..... Ala-
> nin-Aminotransferase
>
> ..... Alanin-Amino-
> transferase
>
> What would the hyphenation pattern according to the wordlist file be?
>
> Alanin-Aminotransferase;A·la-nin{-/==}A·mi-no<trans<fer=a-se
>
> reasoning that the first part of the {/} is the non-hyphenated writing.
No. It’s a good guess, but we do not include words with explicit hyphens
in the list, on the assumption that the parts of such words are treated
as separate words.
> That was my line of thought. With your info below, it would be?
>
> Alanin;A·la-nin
> Aminotransferase;A·mi-no<trans<fer=a-se
Yes, something like this (more likely “A·mi-no<trans<fe-ra-se”, as “ase”
is not a real word).
> and then use in TeX
>
> Alanin"=Aminotransferase
>
> Unfortunately, regular Dutch has a lot of these words so I need to
> have them in the pattern file (for LibreOffice, Firefox, Chrome) and
> cannot use Babel's "= for that. Looking forward to your thoughts on
> this. (Hoping Trennmuster will support this edge case.)
I would advise including the parts of such words separately, as stated
above. I suppose most programs except TeX will hyphenate words with
explicit hyphens by applying the patterns to the individual word parts,
but I might be wrong.
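A rough Python sketch of that behaviour, under the assumption that a program splits the word at explicit hyphens and hyphenates each part on its own. The lookup table TOY_HYPHENATION stands in for a real pattern-based hyphenator and only contains the break points from the example layouts quoted above; all names are hypothetical.

```python
# Toy stand-in for pattern-based hyphenation of a single word part.
TOY_HYPHENATION = {
    "Alanin": "Ala-nin",
    "Aminotransferase": "Amino-transferase",
}


def hyphenate_part(part):
    """Return the part with '-' at its (toy) hyphenation points."""
    return TOY_HYPHENATION.get(part, part)


def break_options(word):
    """List possible line breaks as (first_line, rest, is_explicit).

    The word is split at explicit hyphens; each part is hyphenated
    separately, and the explicit hyphens themselves are flagged so
    that a caller could prefer them.
    """
    parts = word.split("-")
    options = []
    for i, part in enumerate(parts):
        before = "".join(p + "-" for p in parts[:i])
        after = "".join("-" + p for p in parts[i + 1:])
        syllables = hyphenate_part(part).split("-")
        # Break points inside this part.
        for k in range(1, len(syllables)):
            head = "".join(syllables[:k])
            tail = "".join(syllables[k:])
            options.append((before + head + "-", tail + after, False))
        # Break at the explicit hyphen following this part.
        if i < len(parts) - 1:
            options.append((before + part + "-", "-".join(parts[i + 1:]), True))
    return options


for option in break_options("Alanin-Aminotransferase"):
    print(option)
```

The three options correspond to the three layouts in the quoted example, with the break at the explicit hyphen flagged separately.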
The autotype package mentioned above prioritizes line breaks at explicit
hyphens, but this approach is only possible with LuaTeX, not with other
programs (except by changing the whole hyphenation algorithm).
Is your Dutch word list publicly available?
Please let us know if we can help in any way.
Kind regards,
Keno