[Trennmuster] Quality control on wortliste
Pander
pander at users.sourceforge.net
Thu Jun 6 11:35:28 CEST 2013
>
>> The resulting CSV files will show the use of some /exotic/ characters.
>> Perhaps these need to be investigated and fixed.
>
> ??? Since we generate hyphenation patterns directly from this list,
> all used characters are OK and properly listed in the file
> `daten/german.tr' (as needed by patgen). There's nothing exotic!
> It's a specialty of German to write foreign words unchanged (if
> possible), that is, with all diacritics as found in the original
> language.
>
>
> Werner
>
I understand that german.tr allows only certain characters.
Nevertheless, I think it is also worth reviewing what is used as input.
Did you look at the results? Below is, first, a histogram of all the
characters found in the words (before the semicolon):
909327,e
479494,n
436784,r
401366,s
398535,t
340012,i
278986,a
225150,l
224140,h
211572,u
188589,g
156675,d
155527,c
135186,m
129610,o
115302,b
108845,Ã
103139,f
87894,k
71826,z
68764,p
46326,w
41993,v
41671,€
37049,Œ
26539,S
18736,A
14795,B
14493,Ÿ
14421,K
14058,¶
12770,M
12514,G
12099,F
10551,P
9567,E
9227,V
9002,W
8395,T
8186,H
8169,L
7831,R
7795,y
7648,D
6779,I
6212,C
4944,N
4808,Z
4611,x
3728,X
3365,j
3262,U
2680,O
1928,q
1830,J
873,œ
550,Q
516,ä
313,–
235,„
221,ö
164,ü
112,©
47,ß
45,Y
9,±
7,Ö
5,§
5,š
3,¡
3,
3,¢
3,ª
3,³
2,Ä
2,«
2,
2,Ž
1,®
1,ú
1,·
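For anyone who wants to reproduce a word histogram like the one above, here is a minimal sketch in Python. It is not the script that produced these numbers. It assumes that each line holds the word before the first semicolon and that the lines have already been decoded as UTF-8. The function names and the two sample entries are made up for illustration.

```python
from collections import Counter


def word_char_histogram(lines):
    """Count the characters of the word field (text before the first ';')."""
    counts = Counter()
    for line in lines:
        word = line.rstrip("\n").split(";", 1)[0]
        counts.update(word)  # one increment per character of the word
    return counts


def format_histogram(counts):
    """Return 'count,char' rows, most frequent first, ties ordered by character."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{count},{char}" for char, count in ordered]


if __name__ == "__main__":
    # Made-up sample entries, only to show the expected input shape.
    sample = ["Äpfel;Äp-fel", "Apfel;Ap-fel"]
    for row in format_histogram(word_char_histogram(sample)):
        print(row)
```

For a real run, the list could be opened with `open(path, encoding="utf-8")` and passed in directly. Decoding matters here: if bytes are counted instead of characters, a letter such as ä is tallied as two separate entries, which may be what some of the odd rows above (for example Ã) reflect.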
Here is the histogram of all the characters found in the pattern
definitions after the semicolon (spacers such as -2-, -3-, etc. are
omitted):
1042653,e
725069,-
549271,n
516236,s
494093,r
470505,t
450829,·
416813,Â
396720,i
307670,a
254348,h
250965,l
247892,u
215329,g
204847,=
184015,d
182019,c
153985,m
152820,|
142563,o
127466,b
125324,Ã
116252,f
116211,k
83248,z
74384,p
53190,w
47857,v
47749,€
44251,Œ
29353,S
21019,A
16108,¶
16051,B
15690,K
15513,Ÿ
14041,M
13625,G
13230,F
11645,P
10442,E
9824,V
9597,W
9339,y
9048,T
8898,L
8868,/
8828,{
8828,}
8771,H
8472,R
8290,D
7274,I
6370,C
5308,N
5241,Z
5063,x
3728,X
3664,j
3464,U
2913,O
2066,q
2045,J
1773,.
940,œ
582,Q
564,ä
363,
344,–
263,„
226,ö
168,ü
118,©
91,[
91,]
72,#
48,ß
48,Y
35,<
22,Ä
20,+
11,«
10,
9,±
8,Å
8,Ö
6,
6,:
6,Î
5,§
5,(
5,š
5,_
5,)
4,¿
4,Ï
3,“
3,³
3,
3,¡
3,¢
3,ª
2,»
2,Ž
2,*
2,
2,á
1,,
1,4
1,8
1,€
1,œ
1,‚
1,®
1,º
1,Ÿ
And a histogram of all the characters that occur in the pattern
definitions but not in the words (these should be the reserved characters
for hyphenation pattern definitions):
725069,-
416813,Â
204847,=
152820,|
8868,/
8828,{
8828,}
1773,.
363,
91,[
91,]
72,#
35,<
20,+
10,
8,Å
6,
6,:
6,Î
5,(
5,_
5,)
4,¿
4,Ï
3,“
2,»
2,*
2,á
1,,
1,4
1,8
1,€
1,œ
1,‚
1,º
1,Ÿ
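The third histogram is a difference between the first two: characters counted in the pattern definitions that never occur in any word. A minimal sketch of that step in Python follows. It is again an illustration, not the script used for the numbers above. It assumes that spacers have the form of a number between two hyphens, that further semicolons only separate fields, and that the input is already decoded text. All names and the sample entry are hypothetical.

```python
import re
from collections import Counter

# Assumed form of the spacers mentioned above (-2-, -3-, ...).
SPACER = re.compile(r"-\d+-")


def split_entry(line):
    """Split a line into the word and its pattern text, spacers removed."""
    word, _, patterns = line.rstrip("\n").partition(";")
    patterns = SPACER.sub("", patterns).replace(";", "")
    return word, patterns


def pattern_only_histogram(lines):
    """Count pattern characters that never appear in any word field."""
    word_counts = Counter()
    pattern_counts = Counter()
    for line in lines:
        word, patterns = split_entry(line)
        word_counts.update(word)
        pattern_counts.update(patterns)
    return Counter(
        {char: count for char, count in pattern_counts.items()
         if char not in word_counts}
    )


if __name__ == "__main__":
    # Made-up sample entry, only to show the expected input shape.
    for char, count in pattern_only_histogram(["Apfel;-2-;Ap-fel;Ap-fel"]).most_common():
        print(f"{count},{char}")
```

Anything in the result that is not a known hyphenation marker is a candidate for a closer look.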