[Trennmuster] Fwd: Re: Review hyphenation pattern definitions

Thu Jan 30 17:19:16 CET 2014

-------- Original Message --------
Subject: Re: Review hyphenation pattern definitions
Date: Thu, 30 Jan 2014 16:47:24 +0100
From: Németh László <nemeth.lacko at gmail.com>
To: Sander van Geloven <sander.vangeloven at opentaal.org>
CC: Bob van de Loo <bob.vandeloo at opentaal.org>, Bart Knubben
[...]

Hi,

For your information, LibreOffice 4.3 will support libhyphen-based
non-standard hyphenation at soft hyphens, as well as non-standard
hyphenation based on user-defined dictionaries. A screenshot:
https://bugs.freedesktop.org/attachment.cgi?id=93061

Issue: https://bugs.freedesktop.org/show_bug.cgi?id=44314

LibreOffice does not yet support character changes after the break
point, so the user-defined dictionary uses only a simplified syntax for
non-standard hyphenation:

as[z]szon[y]nyal
cafee[2é]tje
ko=nya=kos=meg[y]gye=zik
paral·[1]lel
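
The four entries above can be read mechanically. The following is a
minimal Python sketch, not LibreOffice's actual parser. It assumes that
"=" marks an ordinary break point, that a digit inside the brackets
counts the characters removed before the break, and that any letters
inside the brackets are inserted before the hyphen. The name
parse_entry is hypothetical.

```python
import re

# One break marker: either "=" or a bracketed annotation "[<digits><letters>]".
TOKEN = re.compile(r"=|\[([0-9]*)([^\]]*)\]")

def parse_entry(entry):
    """Return (plain_word, breaks) for one dictionary entry.

    breaks is a list of (head, tail) pairs: the text shown before the
    hyphen and the text shown after it when that break is taken.
    Assumed reading: a digit removes that many characters before the
    break, letters are added before the hyphen.
    """
    parts = []     # pieces of the plain word between markers
    marks = []     # (position in plain word, chars removed, chars added)
    pos = 0
    for m in TOKEN.finditer(entry):
        parts.append(entry[pos:m.start()])
        pos = m.end()
        at = sum(len(p) for p in parts)
        if m.group(0) == "=":
            marks.append((at, 0, ""))
        else:
            marks.append((at, int(m.group(1) or 0), m.group(2)))
    parts.append(entry[pos:])
    plain = "".join(parts)
    breaks = [(plain[:at - n] + extra, plain[at:]) for at, n, extra in marks]
    return plain, breaks

if __name__ == "__main__":
    for e in ["as[z]szon[y]nyal", "cafee[2é]tje", "paral·[1]lel"]:
        print(parse_entry(e))
```

Under these assumptions, cafee[2é]tje yields the plain word "cafeetje"
with the single break "café" / "tje".
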

Now it's possible to add partial Dutch non-standard hyphenation
support to LibreOffice in a new, maybe easier way, with a
user-defined dictionary. The format of the user-defined dictionary is
a simple text file sorted with LC_ALL=C, with a header:

============== ~/.config/lodev/4/user/wordbook/standard.dic =================
OOoUserDict1
lang: <none>
type: positive
---
as[z]szon[y]nyal
cafee[2é]tje
ko=nya=kos=meg[y]gye=zik
paral·[1]lel
================================
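
A file in this layout can be generated from a word list. The sketch
below is an illustration in Python rather than an official tool. It
assumes the file is UTF-8 encoded, so that LC_ALL=C ordering
corresponds to sorting by raw bytes. The function name
render_user_dictionary is hypothetical; the header lines are copied
from the listing above.

```python
# Header lines exactly as shown in the example file above.
HEADER = ["OOoUserDict1", "lang: <none>", "type: positive", "---"]

def render_user_dictionary(entries):
    """Return the text of a user dictionary containing the given entries.

    Entries are de-duplicated and sorted by their UTF-8 bytes, which is
    the order `LC_ALL=C sort` produces for UTF-8 input (assumed encoding).
    """
    body = sorted(set(entries), key=lambda e: e.encode("utf-8"))
    return "\n".join(HEADER + body) + "\n"

if __name__ == "__main__":
    words = [
        "paral·[1]lel",
        "ko=nya=kos=meg[y]gye=zik",
        "cafee[2é]tje",
        "as[z]szon[y]nyal",
    ]
    print(render_user_dictionary(words), end="")
```

The returned text could then be written to a wordbook file such as the
standard.dic path shown above; the exact location depends on the
LibreOffice profile in use.
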

Best regards,
László

2014-01-16 Németh László <nemeth.lacko at gmail.com>:
> Hi,
>
> 2014/1/15 Sander van Geloven <sander.vangeloven at opentaal.org>:
>> Hi László,
>>
>> Thanks for your feedback and proposal. The draft which I will present
>> for discussion at CLIN24 on Friday is due too soon to be made compatible
>> with it.
>> However, I would like to study it carefully and work on that part before
>> I consider my draft finished. The aim is to find a standard with which
>> we all can work and which we can extend later on.
>>
>> I actually hope to get questions and feedback at CLIN24 to further
>> improve the draft. Meanwhile, I am curious what people from
>> Trennmuster think about your proposal. As a quick first impression, I
>> find the number for removal of characters not very verbose while
>> editing, but I see your point.
>>
>> What do you think of using ~ instead of - and replacing it with the
>> (soft) hyphen of your choice in the hyphenation algorithm? '-' is part
>> of Dutch orthography and occurs in places where we do and do not want
>> hyphenation. That would make your proposed definition ambiguous, as it
>> is unclear whether - is part of ExtLetter (from the word) or an
>> OptSplitExt (from the hyphenation point definition).
>
> I'm sorry, I had not yet mentioned some relevant information (see
> http://en.wikipedia.org/wiki/Unicode_control_characters#Language_tags):
> "hyphen" means U+E002D in my proposal, so it is a hidden (removed)
> control character. There is no ambiguity here, because the extended
> soft hyphen doesn't repeat characters of the original word; it contains
> only extra letters (as control characters). For example (I will add
> this to the proposal):
>
> Druc[1k-/]ker will be encoded as
>
> Druc&#xE005B;&#xE0031;&#xE006B;&#xE002D;&#xE002F;&#xE005D;ker in the
> Unicode text.
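
The encoding in the example above maps each ASCII character of the
bracketed annotation to the Unicode tag character at code point
0xE0000 plus its ASCII value. A minimal Python sketch of that mapping
follows, assuming annotations contain only printable ASCII; the
function names are hypothetical and not part of the proposal.

```python
import re

TAG_OFFSET = 0xE0000  # tag characters mirror ASCII at U+E0020..U+E007E

def to_tag_chars(text):
    """Map printable ASCII characters to their Unicode tag counterparts."""
    for c in text:
        if not 0x20 <= ord(c) <= 0x7E:
            raise ValueError("only printable ASCII can be encoded: %r" % c)
    return "".join(chr(TAG_OFFSET + ord(c)) for c in text)

def encode_extended_soft_hyphen(marked_word):
    """Replace each "[...]" annotation, brackets included, with tag characters."""
    return re.sub(r"\[[^\]]*\]", lambda m: to_tag_chars(m.group(0)), marked_word)

if __name__ == "__main__":
    encoded = encode_extended_soft_hyphen("Druc[1k-/]ker")
    # Show the code points so the invisible tag characters can be inspected.
    print(" ".join("U+%04X" % ord(c) for c in encoded))
```
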
>
> Best regards,
> László
>
>
>
>
>>
>> Best regards,
>>
>> Sander
>>
>> On 15-01-14 18:10, Németh László wrote:
>>> Hi Sander, Hi All,
>>>
>>> Many thanks for your mail and proposal. A suggestion: maybe
>>> "homograph" is a better term than "heteronym" (see
>>> http://en.wikipedia.org/wiki/Heteronym_%28linguistics%29).
>>>
>>> Unfortunately, I have a serious problem with the implementations of
>>> automatic hyphenation, so I wrote the attached Unicode proposal
>>> instead of carefully reviewing your work. I believe the extended
>>> soft hyphen could force correct implementations of non-standard
>>> hyphenation in the different applications. I will also be very glad
>>> to receive your suggestions or corrections.
>>>
>>> Another comment: a "hyphenation" dictionary has to support other
>>> similar tasks, too. I am currently trying to extend Graphite fonts to
>>> handle ligatures in German texts (root + morpheme and compound
>>> boundary detection). Syllable recognition is also an important task
>>> (TTS, speech recognition; there is a big ambiguity in Hungarian here,
>>> thanks to the simplified double consonants ccs, nny, ggy, ssz...).
>>> Maybe the solution is a big monolithic linguistic tool, or an
>>> algorithm that uses different tools for this purpose. The other method
>>> (which is also necessary in all cases) is manual hyphenation/user
>>> exception dictionaries, also using the proposed extension of Unicode
>>> soft hyphens.
>>>
>>> Best regards,
>>> László
>>>
>>> 2014/1/10 Sander van Geloven <sander.vangeloven at opentaal.org>:
>>>> Hi all,
>>>>
>>>> Friday 17th I will be giving a poster presentation on behalf of OpenTaal
>>>> at the congress Computational Linguistics in The Netherlands, Leiden,
>>>> CLIN24, http://clin24.inl.nl/ on hyphenation pattern definitions.
>>>>
>>>> I would like to ask you all if you would like to take some time this
>>>> weekend or Monday or Tuesday to review my work. During the last months
>>>> and Christmas holidays I spent a lot of time on this. Even though I
>>>> started on time and took some days off to write it, it has all become
>>>> a bit of a last-minute job. :S Sorry for that, but I hope you can find
>>>> the time to assist in some fast-paced iterations to finalise a *first*
>>>> proposal for an international standard on this, which I would like to
>>>> submit as an RFC to the IETF.
>>>>
>>>> For those who haven't been involved in this up to now, we are in need of
>>>> improved hyphenation functionality. With the arrival of smartphones and
>>>> tablets, hyphenation in web browsers is now supported but stuck with
>>>> versions mostly from 1996, many of which are outdated. Furthermore,
>>>> the use of hyphenation in TeX, LibreOffice and other applications
>>>> needs to be raised to the next level, as more dynamic hyphenation is
>>>> required.
>>>>
>>>> Already I have gotten a lot of support and had meaningful discussions
>>>> with people from the Trennmuster project
>>>>   http://projekte.dante.de/Trennmuster/WebHome
>>>> who are a major driving force in this. Thank you guys for this.
>>>>
>>>> Excuse me for sending you a personal email. I have used only publicly
>>>> available email addresses. The reason I am sending it this way is that
>>>> I would like to have your initial feedback before publication on
>>>> GitHub. Also, I do not want to cross-post to five different mailing
>>>> lists. If you really do not want to be involved in this review, simply
>>>> reply with unsubscribe.
>>>>
>>>> Otherwise I will send you only a few short notifications up to the 15th
>>>> on where to find the latest version of the to be proposed standard
>>>> together with a diff of the changes. After publication, all
>>>> communication will be via the normal mailing lists.
>>>>
>>>> Thank you very much for contributing to improved hyphenation. I am
>>>> looking forward to your additions and corrections.
>>>>
>>>> Regards,
>>>>
>>>> Sander
>>>> --
>>>> Sander van Geloven
>>>> ICT-architect Stichting OpenTaal
>>>> sander.vangeloven at opentaal.org
>>>> http://twitter.com/opentaal
>>>> http://fb.com/opentaal
>>>> http://opentaal.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Abstract
>>>>
>>>> This document describes a standard for hyphenation definitions enabling
>>>> the generation of prioritised and dynamic hyphenation patterns.
>>>>
>>>> In the early nineteen-eighties, automatic hyphenation of lexical items
>>>> was made possible by a hyphenator using language-specific
>>>> hyphenation patterns. These patterns are generated by the hyphenation
>>>> software community from hyphenated word lists.
>>>>
>>>> The initial design was based on the English orthography and limited
>>>> character encoding. Support for extended encodings was added in the
>>>> 1990s mostly for Western languages. However, the hyphenated word list
>>>> format remained rather unchanged. This complicated the support of
>>>> specific morphological or phonological structures, requiring hyphenation
>>>> priority in compounds or dynamic hyphenation resulting in altered spelling.
>>>>
>>>> Although over 70 languages are supported now, hyphenation is suboptimal
>>>> and impossible for languages relying on a universal character encoding.
>>>> This limited method of hyphenation has been catering to digital
>>>> typesetting for over three decades. Unfortunately, recently implemented
>>>> hyphenation in layout engines for web page rendering is built upon the
>>>> same outdated technology.
>>>>
>>>> An improved hyphenator and extended hyphenation patterns are necessary
>>>> to overcome current limitations and support a wider range of languages.
>>>> To achieve this, the software community needs a standard format for
>>>> hyphenation definitions in universal human-readable hyphenated word
>>>> lists. A context-free grammar was developed with unambiguous and
>>>> fine-grained control allowing enhanced hyphenation. All
>>>> language-specific cases are illustrated with examples and lexicological
>>>> theory.
>>>>
>>>> Our standard for hyphenation definitions enables improved automatic
>>>> hyphenation for printed media and web documents.
>>
>>
>> --
>> Sander van Geloven
>> ICT-architect Stichting OpenTaal
>> sander.vangeloven at opentaal.org
>> http://twitter.com/opentaal
>> http://fb.com/opentaal
>> http://opentaal.org
>>