[Trennmuster] Hyphenation pattern for compounds with joining hyphen

Sun Jul 17 00:08:11 CEST 2022

On 16.07.22 at 16:33, Pander wrote:
> I am working on updating the hyphenation patterns for Dutch for TeX,
> libhyphen, etc. They were last updated at the end of 1996, after
> a spelling change. In 2006 there was another spelling change. The old
> patterns do not support it and are missing a lot of new words. Our
> collection for the word list and spelling checker has grown by at least
> 100,000 words since then. We also have more hyphenation patterns, but
> the last time somebody worked on this was in 2007, experimentally, and
> that work is unfortunately undocumented and abandoned.
>
> I have known the Trennmuster project for years and have made some minor
> contributions in the past. Also on the advice of the creator of the
> Frisian hyphenation patterns, the Trennmuster approach would be
> good for the new Dutch hyphenation patterns, as we have identical
> transcription changes when hyphenating.
>
> Unlike German, Dutch has many compounds with a joining hyphen, e.g.
> zwart-witfotografie, toe-eigenen, normalisatie-instellingen and
> opera-uitvoeringen. These hyphens take priority over all other
> hyphenation locations. The preferred hyphenation is
>
> ..... ..... normalisatie-
> instellingen
>
> but when absolutely needed, the following are also possible
>
> ..... ..... ..... norma-
> lisatie-instellingen
>
> .... normalisatie-instel-
> lingen

Thank you for explaining.
I fear what you need is not easy to achieve. The Liang patterns used by
TeX and other programs only determine the hyphenation points, but do not
prioritize them. So the default algorithm treats all hyphenation points
as equally good.
Prioritizing hyphenation points can only be achieved by changing the
algorithm.

It might surprise you that all the detailed marks of hyphenation points 
and their quality in the Trennmuster word list have only been of 
theoretical interest so far. When hyphenation patterns like the ones 
currently used by TeX are generated, all hyphenation points are treated 
the same, e.g. “Ver<si-che-rungs==kauf=mann” is converted to just 
“Ver-si-che-rungs-kauf-mann” and all the effort to differentiate the 
hyphenation points is lost.
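
A minimal sketch of that flattening step, assuming only the three
markers that appear in the example above ("<", "=" and "-"). The name
flatten is made up for illustration, and the real conversion scripts in
the Trennmuster repository may work differently and handle more markers:

```python
import re

# Runs of the markers seen in the example: "<", "=" (also "==") and "-".
# Assumption: no other markers are handled in this sketch.
_MARKS = re.compile(r"[<=\-]+")


def flatten(marked: str) -> str:
    """Replace every run of weighted markers with a plain hyphen."""
    return _MARKS.sub("-", marked)


print(flatten("Ver<si-che-rungs==kauf=mann"))  # Ver-si-che-rungs-kauf-mann
```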

As I was unhappy with the situation, I wrote a LuaLaTeX package 
providing a weighted hyphenation algorithm: 
https://codeberg.org/wehr/autotype
The package is in an early state and not yet on CTAN. It uses Liang 
patterns, but three sets of them determining hyphenation points of three 
different priorities.
Section 3 of the manual might be of interest: 
https://codeberg.org/wehr/autotype/src/branch/main/doc/autotype-de.pdf
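
To illustrate the general idea of weighted hyphenation points (this is
not the code of the autotype package, and not TeX's paragraph-wide line
breaking either), here is a greedy sketch in Python. The function name,
the priority numbers and the "room" parameter are assumptions made up
for the example; the candidate breaks are the ones from your
normalisatie-instellingen example:

```python
from typing import List, Optional, Tuple

# (characters on the line up to and including the hyphen, priority).
# Assumption: a higher priority number means a better break.
Candidate = Tuple[int, int]


def choose_break(candidates: List[Candidate], room: int) -> Optional[Candidate]:
    """Pick the highest-priority break that fits into `room` characters.

    Among breaks of equal priority, the one filling the line most is taken.
    Returns None if no candidate fits.
    """
    fitting = [c for c in candidates if c[0] <= room]
    if not fitting:
        return None
    return max(fitting, key=lambda c: (c[1], c[0]))


# normalisatie-instellingen: "norma-" (6), "normalisatie-" (13),
# "normalisatie-instel-" (20). The explicit hyphen gets priority 3.
breaks = [(6, 1), (13, 3), (20, 1)]

print(choose_break(breaks, 22))  # (13, 3)
print(choose_break(breaks, 10))  # (6, 1)
print(choose_break(breaks, 5))   # None
```

With enough room the break at the explicit hyphen wins even though a
later break would fill the line better; the low-priority breaks are only
used when nothing else fits.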

> I did not find any examples in the wordlist file, so I searched for an 
> example German word that fits my problem and was not yet in wordlist, 
> hence Alanin-Aminotransferase with possible hyphenations being
>
> ..... ..... Alanin-
> Aminotransferase
>
> ..... ..... ..... Ala-
> nin-Aminotransferase
>
> ..... Alanin-Amino-
> transferase
>
> What would the hyphenation pattern according to the wordlist file be?
>
> Alanin-Aminotransferase;A·la-nin{-/==}A·mi-no<trans<fer=a-se
>
> reasoning that the first part of the {/} is the non-hyphenated writing.

No, it’s a good guess, but we do not include words with explicit hyphens
on the list, presuming that the parts of such words are treated as
separate words.

> That was my line of thought. With your info below, it would be?
>
> Alanin;A·la-nin
> Aminotransferase;A·mi-no<trans<fer=a-se

Yes, something like this (more likely “A·mi-no<trans<fe-ra-se” as “ase”
is not a real word).

> and then use in TeX
>
> Alanin"=Aminotransferase
>
> Unfortunately, regular Dutch has a lot of these words so I need to 
> have them in the pattern file (for LibreOffice, Firefox, Chrome) and 
> cannot use Babel's "= for that. Looking forward to your thoughts on 
> this. (Hoping Trennmuster will support this edge case.)

I would advise including the parts of such words separately, as stated
above. I suppose most programs except TeX will hyphenate words with
explicit hyphens by applying the patterns to the individual word parts,
but I might be wrong.
The autotype package mentioned above prioritizes line breaks at explicit 
hyphens, but this approach is only possible with LuaTeX, not with other 
programs (except by changing the whole hyphenation algorithm).
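
What I mean by applying the patterns to the single word parts can be
sketched as follows. The lookup table is a hypothetical stand-in for a
real pattern-based hyphenator and contains only the break points shown
in this thread; whether a given program actually behaves like this is
an assumption on my side:

```python
from typing import Callable, List

# Hypothetical stand-in for a pattern-based hyphenator: a lookup table
# holding only the break points shown in this thread.
SAMPLE_BREAKS = {
    "Alanin": ["Ala", "nin"],
    "Aminotransferase": ["Amino", "transferase"],
}


def lookup(part: str) -> List[str]:
    """Return the fragments of a word part; unknown parts stay unbroken."""
    return SAMPLE_BREAKS.get(part, [part])


def hyphenate_compound(
    word: str, hyphenate_part: Callable[[str], List[str]]
) -> List[List[str]]:
    """Split at explicit hyphens, then hyphenate each part independently."""
    return [hyphenate_part(part) for part in word.split("-")]


print(hyphenate_compound("Alanin-Aminotransferase", lookup))
# [['Ala', 'nin'], ['Amino', 'transferase']]
```

The outer list keeps the explicit hyphens apart from the pattern-based
breaks, so a program could still prefer them if its line breaking
supports that.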

Is your Dutch word list publicly available?

Please let us know if we can help in any way.

Kind regards,
Keno