Re: Unicode case conversion
FROM : Glenn Andreas
DATE : Thu Nov 25 17:44:06 2004

At 9:37 PM -0700 11/24/04, Robbie Haertel wrote:
>Levenshtein edit distance of a Mayan language.  Have to compare each
>character one-by-one.  The old Spanish priest often writes b, u, and w
>as 'V', but this is one of the few cases (there are a few others) I
>want to change the case.  I'm already necessarily comparing
>character-by-character due to the algorithm, so it isn't a problem.  I
>can already guarantee that there will be no fancy characters other
>than "option-3" (the English pound symbol).  It may seem like just
>checking for 'V' is an option, but it is more complicated than that.
>
>There are some carbon functions, I believe, but I don't know anything
>about carbon.  Also, I think there are some functions for wide
>characters, but I don't think it is the same thing.
>
>Thanks,
>robbie



Interesting.

If the only fancy character (i.e., non-ASCII) is the English pound
symbol, you can just use the regular old C-style tolower(), since
you'll only have ASCII letters.  If you have to worry about things like
accented characters (and I'm assuming even old Spanish would have them)
you'll have to go further.
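
For the ASCII-only case, a minimal sketch could look like the following. It assumes the text is held in a plain 8-bit std::string and that the program stays in the default "C" locale; the helper names asciiLowerChar and asciiLowerString are made up for illustration.

```cpp
#include <cctype>
#include <string>

// Lowercase one ASCII character. In the default "C" locale only A-Z are
// changed; every other byte (including the pound sign) passes through.
inline char asciiLowerChar(char ch) {
    // Cast to unsigned char first: handing std::tolower a negative value
    // (other than EOF) is undefined behavior.
    return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}

// Lowercase a whole string, character by character.
inline std::string asciiLowerString(const std::string &s) {
    std::string out;
    out.reserve(s.size());
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        out.push_back(asciiLowerChar(s[i]));
    }
    return out;
}
```

Since the edit-distance loop already walks the strings one character at a time, calling asciiLowerChar on each side just before the comparison would be enough; asciiLowerString is only a convenience.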

What I'd do is take advantage of Obj-C++ and create a map to cache
the results of converting via NSString's "lowercaseString" method where
the result is a single character, something like:

   // needs #include <map> and Foundation, compiled as Obj-C++ (.mm)
   std::map<unichar, unichar> lowerMap;
   NSCharacterSet *upperSet =
       [NSCharacterSet uppercaseLetterCharacterSet];
   for (unsigned i=0;i<[str length];i++) {
       ...
       unichar c = [str characterAtIndex: i];
       std::map<unichar, unichar>::iterator m = lowerMap.find(c);
       if (m != lowerMap.end()) {
           c = m->second; // get what we mapped into
       } else {
           if ([upperSet characterIsMember: c]) {
               NSString *lowerStr =
                   [[str substringWithRange: NSMakeRange(i,1)]
                       lowercaseString];
               if ([lowerStr length] == 1) {
                   // converted to a single character
                   unichar lowerC = [lowerStr characterAtIndex: 0];
                   lowerMap[c] = lowerC;
                   c = lowerC;
               } else {
                   // lower case version isn't a single character,
                   // keep as uppercase
                   lowerMap[c] = c;
               }
           } else {
               // not an uppercase, leave as is, or do something else
               // (but enter into map for the next time we see it)
               lowerMap[c] = c;
           }
       }
       // c is now in your "canonical form"
       ...
   }
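
The caching pattern in that loop can also be tried out without Foundation. The following is only a sketch in plain C++: unichar_t, slowLower, slowCalls and CanonicalMap are hypothetical names, and slowLower is an ASCII-only stand-in for the NSString round trip, not the real conversion.

```cpp
#include <map>

typedef char16_t unichar_t;  // stand-in for Foundation's 16-bit unichar

static int slowCalls = 0;    // counts how often the slow path is taken

// Hypothetical stand-in for the expensive NSString conversion. It only
// knows ASCII and returns 0 when there is no single-character lowercase.
static unichar_t slowLower(unichar_t c) {
    ++slowCalls;
    if (c >= u'A' && c <= u'Z') {
        return static_cast<unichar_t>(c + (u'a' - u'A'));
    }
    return 0;
}

class CanonicalMap {
public:
    // Return the canonical form of c, running the slow conversion at
    // most once per distinct character.
    unichar_t canonical(unichar_t c) {
        std::map<unichar_t, unichar_t>::iterator m = cache_.find(c);
        if (m != cache_.end()) {
            return m->second;            // seen before, reuse the result
        }
        unichar_t lower = slowLower(c);
        unichar_t result = (lower != 0) ? lower : c;
        cache_[c] = result;              // remember it, even if unchanged
        return result;
    }

private:
    std::map<unichar_t, unichar_t> cache_;
};
```

Calling canonical(u'V') twice on the same CanonicalMap runs slowLower only once; the second call is answered from the map, which is the whole point of the cache.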

The other advantage of this is that you can perform other
canonicalization - say this was Latin and you wanted to canonicalize
I/J and U/V, you could just add before the for loop:
   lowerMap['V'] = 'u';
   lowerMap['v'] = 'u';
   lowerMap['I'] = 'j';
   lowerMap['i'] = 'j';


You can fill in that map with as many special cases as you want as
well (so you could do "for (char i='A';i<='Z';i++) lowerMap[i] = i;"
before putting in the special case handling for "V", and then
every other ASCII capital will stay uppercase).
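
As a rough sketch of that seeding order in plain C++: the identity entries go in first and the special case second, so the later assignment wins. The names CaseMap, makeSeededMap and canonical are invented here, and mapping 'V' to 'u' is only an assumed example of a special case, borrowed from the Latin illustration.

```cpp
#include <map>

typedef std::map<char16_t, char16_t> CaseMap;

// Build a map in which A-Z stay uppercase except for the special case.
inline CaseMap makeSeededMap() {
    CaseMap m;
    for (char16_t ch = u'A'; ch <= u'Z'; ++ch) {
        m[ch] = ch;              // identity: keep the capital as it is
    }
    // Assumed special case, entered last so it overrides the identity.
    m[u'V'] = u'u';
    m[u'v'] = u'u';
    return m;
}

// Look a character up. Anything that was not seeded is returned
// unchanged here; the full loop would fall back to the lowercase
// conversion and cache its result instead.
inline char16_t canonical(const CaseMap &m, char16_t c) {
    CaseMap::const_iterator it = m.find(c);
    return (it != m.end()) ? it->second : c;
}
```

With this map both 'V' and 'v' come out as 'u', while 'B' stays 'B' because its identity entry is found before any lowercasing would be attempted.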


--
Glenn Andreas                      <email_removed>
<http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know
