Remove first letters that have an overlapping prefix.
author Brian Wolff <bawolff+wn@gmail.com>
Sun, 24 Mar 2013 03:09:43 +0000 (00:09 -0300)
committer Gerrit Code Review <gerrit@wikimedia.org>
Mon, 8 Apr 2013 22:52:40 +0000 (22:52 +0000)
commit 3d70637a420effac8b0488a30523ced011ff92be
tree 489f2432f1c650ce9ac3b16bee02e742212add93
parent 98efeb589dbe562de317a0a287c0e055f55a17e9

First letters are supposed to be primary collation elements.
However, we do not want expansions to be treated as
first letters. For example, thorn ("þ") expands to "th",
which is not the same as any other first letter (since
"t" !== "th"). If þ were a first letter, the word "the"
and, even worse, the word "too" would be sorted under it,
which is wrong.
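
The problem and the pruning idea can be sketched roughly as follows. This is a minimal illustration and not the code in Collation.php: toy_sort_key is a hypothetical stand-in for a real collator (it only assumes that þ expands to "th"), and bucket_for and prune_overlapping are made-up names. The sketch assumes a word is filed under the first letter with the largest sort key that is still less than or equal to the word's own key.

```python
from bisect import bisect_right


def toy_sort_key(text):
    # Hypothetical stand-in for a real collator's sort key.
    # Assumption: thorn expands to "th"; everything else maps to itself.
    return text.lower().replace("þ", "th")


def bucket_for(word, first_letters):
    """Return the first letter with the largest key <= the word's key."""
    ordered = sorted(first_letters, key=toy_sort_key)
    keys = [toy_sort_key(letter) for letter in ordered]
    index = bisect_right(keys, toy_sort_key(word)) - 1
    return ordered[index] if index >= 0 else None


def prune_overlapping(first_letters):
    """Drop letters whose key starts with the key of an already kept letter."""
    kept = []
    for letter in sorted(first_letters, key=toy_sort_key):
        key = toy_sort_key(letter)
        if any(key.startswith(toy_sort_key(other)) for other in kept):
            continue  # e.g. "th" (from þ) overlaps the prefix "t"
        kept.append(letter)
    return kept


letters = ["s", "t", "þ", "u"]
print(bucket_for("too", letters))                     # filed under þ
print(prune_overlapping(letters))                     # þ is dropped
print(bucket_for("too", prune_overlapping(letters)))  # filed under t
```

With þ in the list, both "the" and "too" land under þ because their keys sort after "th" and before "u". After pruning, they fall back under "t".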

Looking for feedback on whether this all sounds sane. I have
tested it: it got rid of the contractions without removing
any letter it wasn't supposed to.

Once this is merged, we could get rid of all the
-<langcode> entries. The other firstLetter array
entries for tailorings could be merged into
generateCollationData.php too, since incorrect
entries would be pruned automatically, which
would probably make the logic in Collation.php
simpler.

Bug: 43740
Change-Id: I4bd3d39ec2938a53e2c6728adc48ee6cf9778d74
includes/Collation.php
tests/phpunit/includes/CollationTest.php [new file with mode: 0644]