[Trennmuster] long-s hyphenation patterns
Guenter Milde
milde at users.sf.net
Fri Jun 13 23:38:47 CEST 2014
Dear hyphenators,
On 11.06.14, Mojca Miklavec wrote:
> On Wed, Jun 11, 2014 at 10:21 PM, Werner LEMBERG wrote:
> > As mentioned in another mail, it's straightforward to have unified
> > patterns, provided a script generates the proper entries.
Patterns with ſ and patterns without it do not interfere with each other.
This is similar to spelling variants like Abbe/Abbé. However, as about 1/3
of German words contain a "Binnen-S" (190000 out of 470000 words in our
list), "unified" UTF-8 pattern sets would be about 1/3 larger.
OTOH, separate pattern sets would add one more dimension of variants:
* language variants
    'de-1901': # "traditional" (following the 1901 spelling reform)
    'de-1996': # reformed orthography (1996)
    'de-x-GROSS': # without ß (Switzerland or ALL CAPS), general
    'de-1901-x-GROSS': # without ß (Switzerland or ALL CAPS), "traditional"
    'de-CH-1996': # = 'de-1996-x-GROSS', without ß (Switzerland or ALL CAPS), "reformed"
    'de-CH-1901': # without ß (Switzerland), "traditional" ("süssauer")
* hyphenation variants (planned)
    permissive ((almost) all valid hyphenation points)
    conservative (no confusing hyphenations)
    major (only hyphenate at major morpheme boundaries)
    ... ?
* S-spelling (de-Latn, de-Latf)
This would result in about 6 x 3 x 2 = 36 different pattern sets just for
German.
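The count can be checked with a small sketch that enumerates the combinations. The language tags and hyphenation variants are taken from the list above; the two S-spelling labels ("round-s", "long-s") are placeholder names chosen for this illustration, and the open-ended "... ?" entry is not counted.

```python
from itertools import product

# Language variants as listed above.
language_variants = [
    "de-1901", "de-1996", "de-x-GROSS",
    "de-1901-x-GROSS", "de-CH-1996", "de-CH-1901",
]
# Planned hyphenation variants as listed above.
hyphenation_variants = ["permissive", "conservative", "major"]
# Placeholder labels for the S-spelling dimension (assumption, not official tags).
s_spellings = ["round-s", "long-s"]

# Every combination would need its own pattern set.
pattern_sets = list(product(language_variants, hyphenation_variants, s_spellings))
print(len(pattern_sets))  # 6 * 3 * 2 = 36
```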
However, I am not convinced that we need patterns with long-S at all:
* Similar to upper/lower case words, there are very few instances where
  distinguishing between ſ and s would allow for disambiguating the make-up
  of a word (and hence its hyphenation). These are the two rarely used words
    Wachstube;Wach[=s/s=]tu-be # Wachstube/Wachſtube
    Gastrasse;Gas[=t/t=]ras-se # Gastrasse/Gaſtraſſe
  In all other 190 000 cases where the spelling for typesetting in
  blackletter fonts is different, this does not change the hyphenation.
  I.e. the patterns would become 1/3 larger without carrying any additional
  information.
* In 8-bit TeX, all input conventions use "s" (round s) for both "ſ" and "s"
  and employ the ligature mechanism of TeX fonts to map s at the end of a
  word to "s" and in the middle of a word to "ſ".
  Pseudo-ligatures for round s in the middle of a compound (Ausſteuer, ...)
  vary between the various encodings of 8-bit blackletter TeX fonts.
  The most advanced font encoding (LF2), implemented by the "frakturx"
  package, provides the input convention «s"|», i.e. s followed by an
  explicit hyphenation point as defined by Babel's german.ldf
  («Ausſteuer» would be written «Aus"|steuer»). This convention is
  compatible with the existing hyphenation patterns, so there is no need for
  additional ſ-patterns with 8-bit TeX.
* In Unicode-aware TeX, pattern sets including long-ſ variants would allow
  for hyphenation of texts with correct S-spelling for typesetting in
  blackletter fonts (by now there are some Unicode blackletter fonts with
  the various s characters at the correct positions).
  However, LuaTeX provides the hooks for a modification of the hyphenation
  algorithm in a package so that any long ſ is mapped to round s prior to
  the lookup of hyphenation points (similar to downcasing capital letters).¹
¹ This mapping could possibly be achieved by defining the lc/uc values,
but I am not sure about negative side-effects on \MakeUppercase and
\MakeLowercase.
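The idea of mapping ſ to s before the lookup can be sketched outside of TeX. The following Python snippet is only an illustration of the principle, not LuaTeX code: `TOY_BREAKS` is a hypothetical stand-in for a real pattern-based hyphenator, and `fold_long_s` and `hyphenate` are names made up for this example. Because ſ and s are both single characters, break positions found for the folded word apply unchanged to the original spelling.

```python
LONG_S = "\u017f"  # ſ, LATIN SMALL LETTER LONG S

def fold_long_s(word):
    """Map every long ſ to round s; the word length stays the same."""
    return word.replace(LONG_S, "s")

# Hypothetical lookup table standing in for pattern-based hyphenation:
# folded, lower-cased word -> character offsets of the break points.
TOY_BREAKS = {"aussteuer": [3, 7]}  # Aus-steu-er

def hyphenate(word, breaks=TOY_BREAKS):
    """Look up break points for the folded word, apply them to the original."""
    key = fold_long_s(word).lower()
    parts, start = [], 0
    for pos in breaks.get(key, []):
        parts.append(word[start:pos])
        start = pos
    parts.append(word[start:])
    return "-".join(parts)

print(hyphenate("Ausſteuer"))  # Aus-ſteu-er
print(hyphenate("Aussteuer"))  # Aus-steu-er
```

With such a mapping, both spellings share one entry, at the price that pairs like Wachſtube/Wachstube fold to the same key and can no longer be told apart.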
I am in favour of keeping the German hyphenation patterns free of ſ-spelling
variants unless someone convinces me that a 1/3 increase of the size of
the pattern set for a rarely used orthography variant does not impair the
"normal" use of (Lua)TeX.
Günter