[Trennmuster] Hyphenation patterns

Pander pander at users.sourceforge.net
Wed Mar 14 18:17:38 CET 2012


On 2012-03-14 17:51, Werner LEMBERG wrote:
> 
>>> BTW, does your grammar support multiple hyphenation marks at one
>>> place, that is, things like `==.'?
>>
>> No.  Should it?  When do you use it?  Shouldn't it be ... in that
>> case?
> 
> The idea is that it makes sense to add more information to the word
> list.  In particular, it requires only a small amount of additional
> tagging so that the list can be used to decide whether `ſ' or `s' must
> be used if the text is typeset in Fraktur.
> 
> Example: At the border of components in a compound word, no `ſ' is
> used.  To make this work reliably, such components are tagged with,
> say, `='.  However, it might still happen that this hyphenation point
> is not desired, so we need `=.' to mark this.  In a previous mail I've
> already mentioned that `.' is actually a shorthand for `-.' (using the
> current tagging characters of the German word list).
> 
> So `=' and `.' belong to different classes (grammatical
> vs. aesthetical), transporting different information.

OK, no problem with that. I would like to avoid defining codes that
are aliases, if possible.
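
To make the alias concern concrete, here is a minimal sketch, assuming
the tagging described above: `-' and `=' are the grammatical marks and
a bare `.' is shorthand for `-.'. The function name and the sample
words are hypothetical and not part of any existing tool. It expands
the shorthand so that every break point has exactly one spelling in
the stored list.

```python
import re

def expand_shorthand(pattern):
    """Rewrite a bare '.' as '-.' so each break has one canonical form.

    Assumption: a '.' that already follows '-' or '=' is an explicit
    "undesired" flag on that mark and is left unchanged.
    """
    return re.sub(r"(?<![-=])\.", "-.", pattern)

# Hypothetical sample patterns, only to show the behaviour.
for sample in ("foo.bar", "foo-.bar", "foo=.bar"):
    print(sample, "->", expand_shorthand(sample))
```

With such a normalization step the word list itself would never need
to contain both `.' and `-.'.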

>>>   Bra-bant   % Noord-Brabant
>>
>> I have seen the nine comments in wortlist. These are only for the
>> editors, correct?
> 
> No, these comments are for everyone, denoting special cases,
> exceptions, etc.
> 
>> The thing is that most of these words in Dutch are not fully
>> qualified words of their own. :( We have some very ugly combinations
>> and a lot of variations.
> 
> This is of no importance to patgen.
> 
>> Sometimes they are loanwords or foreign expressions that do not
>> exist outside a compound.  They need a hyphenation pattern in the
>> compound.  This applies especially to foreign words that would be
>> hyphenated incorrectly by Dutch rules (rules extracted from Dutch
>> words), or that happen to be written identically to another Dutch
>> word with, of course, a different hyphenation.  Therefore we need to
>> be able to provide a pattern explicitly.
> 
> Hmm.  Let's assume the word
> 
>  foobar-baz
> 
> exists in Dutch, and `foobar' is not a stand-alone word.  Why is it
> problematic to have two entries `foobar' and `baz' in the word list
> (with some comments to explain the origin)?  Even if there might exist
> valid Dutch words `fooba' and `foobarr' with a different hyphenation,
> everything can coexist in the word list.

One objection is that we store one hyphenation pattern per word. Another
is that for "déjà-vugevoel" we would get "vugevoel", which makes no
sense, since it is "gevoel" that is an existing word.

> Or does the case exist that, for example, you hyphenate `foo-bar#baz',
> but a standalone `foobar' would be hyphenated as `foob-ar'?  I really
> doubt that.
> 
>> At the moment we have about 6000 words that contain a hyphen in the
>> normal form and some are used very often.  These can be broken down
>> into several categories.  I will give you some examples.
> 
> Thanks.  Could you please analyze whether there are conflicts in the
> hyphenation of the components, that is, whether there are cases
> `foo-bar' vs. `foob-ar' as outlined above?

I know that for compounds with loanwords (which we use a lot, especially
in the computer industry) it will cause problems, because spellings from
different languages get mixed.

>> Another one is
>>
>>   zwart-wittelevisie;zwart#wit==te-le-vi-sie
>>
>> which means black-and-white television; 'wittelevisie' is also not a
>> Dutch word.  Here == has higher priority than #.
> 
> Hmm.  Looking this up with google, I can only find
> 
>   zwart-witte televisie
> 
> The word `zwart-wittetelevisie' has *not a single* hit!  Is this
> really a correct entry?

Indeed, that is a not-so-good example from our list. However,
  signaal-ruisverhouding (signal to noise ratio)
and the other examples I used do exist.
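
For entries of this kind, the priority between marks can be sketched
as follows. This is only an illustration, assuming the `word;pattern'
layout of the entry quoted above. Only "== outranks #" comes from this
thread; putting `-' last, and the function names, are assumptions.

```python
# Assumed ranking, strongest first. Only "== before #" is taken from
# the discussion; the position of "-" is a guess for illustration.
PRIORITY = ["==", "#", "-"]

def parse_entry(line):
    """Split a 'word;pattern' line into its two fields."""
    word, pattern = line.split(";", 1)
    return word, pattern

def split_at_strongest(pattern):
    """Split the pattern at the highest-ranked mark it contains."""
    for mark in PRIORITY:
        parts = pattern.split(mark)
        if len(parts) > 1:
            return parts
    return [pattern]

word, pattern = parse_entry("zwart-wittelevisie;zwart#wit==te-le-vi-sie")
print(split_at_strongest(pattern))
```

Applied to the quoted entry, the first split happens at `==', so
`zwart#wit' stays together as one component and is only split at `#'
in a second step.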

>> I think ~ is a good one.
> 
> I like it too, but it is probably a bit too similar to `-'.

We have exhausted the subject and ourselves pretty thoroughly, I think.
For now, let's keep it at ~. Perhaps a better way to solve this will
present itself later on.

>     Werner



