[Trennmuster] Please review hyphenation file format

Mojca Miklavec mojca.miklavec.lists at gmail.com
Thu Apr 19 11:36:00 CEST 2012


Thank you very much for this document.

Let me first explain. I skimmed through the document and listed
everything that wasn't crystal clear to me on the first reading.
This means: yes, I would probably understand some points better if I
read that part of the document several times, but it might help to
point out where the content is not self-evident for somebody who reads
it for the first time, so that there is still room for improvement.

1.) I don't understand the following:
    smalltalk;...
Do the three dots mean that no hyphenation is allowed? Why not simply
    smalltalk;smalltalk
then?

At first I wrote:
"It is not clear from the specification whether compound words are
hyphenated if not explicitly mentioned. For example, let's say that
"déjà" could be hyphenated as "dé-jà" and "vugevoel" could be
hyphenated as "vuge-voel". Or, if "vitamin" can be hyphenated as
"vi-ta-min", would "c-vitamin" be hyphenated as "c[-]vi-ta-min" if it
is not explicitly on the list? Is it allowed to hyphenate the parts of
compounds then? I'm asking because, given the large number of possible
compounds, one would never be able to list all of them."
... but then I realized that this is explained. It just wasn't clear
enough to me when I first read it. The dots are still confusing.

2.) "Hyphenation needs at the beginning and ending of words at least
two characters. Therefor words need to have at least four characters"

That is not true for all languages.

3.) I don't understand the following specification:

% valid words but without hyphenation
%door;...
%four;...

Does one use three dots to denote that there is no hyphenation? If
yes, that should be explicitly mentioned. If not, the right part
should probably be empty? Why are the two words commented out in the
first place?

4.) From the description alone, I don't understand the specification
of alternative hyphenation.

5.) I find the following confusing:

German noun:
  Abdrücken (Ab-drücken/Abdrük-ken) prints [plural noun]
hyphenation pattern:
  Abdrücken;Ab·drü{ck/k-k}en

The hyphenation pattern could be more clearly separated from the
explanation. One option could be

Abdrücken;Ab·drü{ck/k-k}en % Abdrücken -> Ab-drücken or Abdrük-ken
(eng. prints [plural noun])

or simply putting the explanation outside of the yellow boxes. It is
confusing to figure out what exactly is input and what is explanation
(you could have two types of verbatim if needed, for example in two
colours, or simply put the explanation outside of verbatim).
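
To make my reading of the notation concrete, here is a small Python
sketch of how I understand Ab·drü{ck/k-k}en. It assumes that "·" marks
an ordinary break point and that {a/b} means "a if the word is not
broken here, b if it is, with the '-' inside b marking the break".
That interpretation and the function name break_variants are my
assumptions, not something taken from the document.

```python
import re

def break_variants(pattern):
    """List every way of breaking the word at exactly one position.

    Assumed notation (not confirmed by the specification):
      '·'      ordinary break point
      '{a/b}'  'a' when unbroken, 'b' when broken; '-' in 'b' is the break
    """
    # Split the pattern into plain text, '·' markers and '{a/b}' groups.
    tokens = re.findall(r"\{[^{}]*\}|·|[^{}·]+", pattern)
    break_positions = [i for i, tok in enumerate(tokens)
                       if tok == "·" or tok.startswith("{")]
    variants = []
    for chosen in break_positions:
        parts = []
        for i, tok in enumerate(tokens):
            if tok == "·":
                # An ordinary break point becomes a hyphen only when chosen.
                parts.append("-" if i == chosen else "")
            elif tok.startswith("{"):
                unbroken, broken = tok[1:-1].split("/", 1)
                parts.append(broken if i == chosen else unbroken)
            else:
                parts.append(tok)
        variants.append("".join(parts))
    return variants

print(break_variants("Ab·drü{ck/k-k}en"))
```

Under these assumptions the sketch yields the two forms given in the
example, Ab-drücken and Abdrük-ken. If the intended meaning of the
braces is different, that is exactly the part the description should
spell out.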

6.) It is not clear to me what happens if heteronyms become more
complex. That is: it is not clear to me if long and complex
hyphenation of heteronyms will still be achievable/possible to
express. Imagine very long heteronyms where hyphenation differs in two
places. Personally, I would split heteronyms, so that each of the two
words could still be hyphenated in an arbitrarily complex way. You
could split them with =2= if needed, or simply make more lines (with
some prefix to be explicit that it wasn't a mistake to print that word
twice).

7.) It wasn't clear to me from a quick glance what exactly emergency
hyphenation means.

8.) I miss examples of hyphenation priority, and the difference
between 16-18 is not clear to me.



Feature requests:
1.) Some time ago I mentioned that in some languages like mine, the
letters a, á, à, ä, ... are all treated as the same letter. The two
variants of the apostrophe in TeX, for example, should be treated the
same way. So "míza" and "miza" are two identical words. And so is
"mi<combining acute>za". In Serbian it is even worse, for example
džezva = <dž>ezva (where dž is/may be a digraph). And multiple
combining characters are also a great deal of fun.

1a) The document doesn't even remotely mention that "problem".
1b) I think that precomposed and decomposed Unicode characters should
be at least mentioned. They might not have a lot to do with the input
format, but they definitely have a lot to do with hyphenation once
patterns are generated.
1c) Some letters in some languages cannot be expressed with a single
Unicode character. Polytonic Greek comes to mind. After running
patgen, the patterns would probably disable hyphenation between a base
letter and a combining accent anyway, but I still find it important to
"explain to the computer" what exactly is considered one character.
Say, if there was no ä in Unicode, the hyphenation algorithm should
still treat <a>+<combining diaeresis> as a single character, not as two.
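
To illustrate points 1.) to 1c), here is a short Python sketch using
the standard unicodedata module. It shows that the precomposed and the
combining spelling of "míza" are different code point sequences until
they are normalized. It also shows one possible way (my assumption,
not something the document proposes) to fold accented letters onto
their base letter and to keep a base letter together with its
combining marks. The helper names fold_accents and clusters are
hypothetical.

```python
import unicodedata

precomposed = "m\u00edza"    # 'í' as one code point (U+00ED)
combining = "mi\u0301za"     # 'i' followed by U+0301 COMBINING ACUTE ACCENT

def fold_accents(word):
    """Decompose to NFD and drop combining marks, so 'míza' becomes 'miza'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clusters(word):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in word:
        if result and unicodedata.combining(ch):
            result[-1] += ch      # attach the mark to the preceding base letter
        else:
            result.append(ch)
    return result

print(precomposed == combining)                                 # raw comparison
print(unicodedata.normalize("NFC", combining) == precomposed)   # after NFC
print(fold_accents(precomposed), fold_accents(combining))
print(len(combining), len(clusters(combining)))
```

This is only a sketch: the grouping is much simpler than full Unicode
text segmentation, folding via NFD does not cover letters that have no
decomposition, and digraphs such as dž would still need a
language-specific rule.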

(I have some more, but that is for later.)

Mojca


