FROM : Glenn Andreas
DATE : Thu Nov 25 17:44:06 2004
At 9:37 PM -0700 11/24/04, Robbie Haertel wrote:
>Levenshtein edit distance of a Mayan language. Have to compare each
>character one-by-one. The old Spanish priest often writes b, u, and w
>as 'V', but this is one of the few cases (there are a few others) I
>want to change the case. I'm already necessarily comparing
>character-by-character due to the algorithm, so it isn't a problem. I
>can already guarantee that there will be no fancy characters other
>than "option-3" (the English pound symbol). It may seem like just
>checking for 'V' is an option, but it is more complicated than that.
>
>There are some carbon functions, I believe, but I don't know anything
>about carbon. Also, I think there are some functions for wide
>characters, but I don't think it is the same thing.
>
>Thanks,
>robbie
Interesting.
If the only fancy (i.e., non-ASCII) character is the English pound
symbol, you can just use the regular old C tolower(), since all of
your letters will be ASCII. If you have to worry about things like
accented characters (and I'm assuming even old Spanish would have
them), you'll have to go further.
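For the ASCII-only case, a minimal sketch might look like the following. The helper names asciiLower and canonicalAscii are made up for illustration, and the sketch assumes the text is held in a single-byte encoding where the pound symbol is a byte above 0x7F, which it passes through untouched.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: lowercase a single byte only if it is ASCII.
// Bytes >= 0x80 (e.g. a pound symbol in a single-byte encoding) are
// returned unchanged, regardless of the current C locale.
inline char asciiLower(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    // The unsigned char cast avoids undefined behavior in std::tolower
    // on platforms where plain char is signed.
    return (u < 0x80) ? static_cast<char>(std::tolower(u)) : c;
}

// Hypothetical helper: apply asciiLower to every byte of a string.
inline std::string canonicalAscii(const std::string& s) {
    std::string out(s);
    for (std::string::size_type i = 0; i < out.size(); ++i) {
        out[i] = asciiLower(out[i]);
    }
    return out;
}
```

Since the edit-distance loop already walks the strings character by character, calling asciiLower on each character as it is compared would work just as well as converting the whole string up front.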
What I'd do is take advantage of Obj-C++ and create a map to cache
the results of converting via NSString's "lowercaseString" method
where the result is a single character, something like:
    // needs #include <map> at the top of the .mm file
    std::map<unichar, unichar> lowerMap;
    NSCharacterSet *upperSet = [NSCharacterSet uppercaseLetterCharacterSet];
    for (unsigned i = 0; i < [str length]; i++) {
        ...
        unichar c = [str characterAtIndex: i];
        std::map<unichar, unichar>::iterator m = lowerMap.find(c);
        if (m != lowerMap.end()) {
            c = m->second; // get what we mapped into
        } else {
            if ([upperSet characterIsMember: c]) {
                NSString *lowerStr = [[str substringWithRange: NSMakeRange(i,1)]
                                         lowercaseString];
                if ([lowerStr length] == 1) { // converted to a single character
                    unichar lowerC = [lowerStr characterAtIndex: 0];
                    lowerMap[c] = lowerC;
                    c = lowerC;
                } else {
                    // lower case version isn't a single character,
                    // keep as uppercase
                    lowerMap[c] = c;
                }
            } else {
                // not an uppercase, leave as is, or do something else
                lowerMap[c] = c; // but enter into map for the next time we see it
            }
        }
        // c is now in your "canonical form"
        ...
    }
The other advantage of this is that you can perform other
canonicalization. Say this was Latin and you wanted to canonicalize
I/J and U/V: you could just add this before the for loop:
    lowerMap['V'] = 'u';
    lowerMap['v'] = 'u';
    lowerMap['I'] = 'j';
    lowerMap['i'] = 'j';
You can fill in that map with as many special cases as you want as
well. For example, you could run
"for (char i='A'; i<='Z'; i++) lowerMap[i] = i;" before putting in the
special case handling for "V", and then everything else will stay as
uppercase.
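The pre-seeding idea can be tried outside Cocoa with a plain C++ sketch like the one below. Everything here is an illustrative assumption rather than part of the original snippet: Unit (a char16_t stand-in for unichar), the Canonicalizer class, makeLatinCanonicalizer, and fallbackLower, which lowers ASCII only and stands in for the place where the Cocoa version would consult lowercaseString.

```cpp
#include <map>
#include <string>

typedef char16_t Unit;  // assumed stand-in for Cocoa's 16-bit unichar

// Hypothetical fallback for characters with no map entry yet.
// ASCII-only here; the Cocoa version would ask NSString instead.
static Unit fallbackLower(Unit c) {
    return (c >= u'A' && c <= u'Z') ? static_cast<Unit>(c - u'A' + u'a') : c;
}

class Canonicalizer {
public:
    // Seed a special case. Seeded entries win because the map is
    // always checked before the fallback is consulted.
    void seed(Unit from, Unit to) { map_[from] = to; }

    // Look up c, computing and caching the fallback on a miss.
    Unit canonical(Unit c) {
        std::map<Unit, Unit>::iterator m = map_.find(c);
        if (m != map_.end()) return m->second;
        Unit result = fallbackLower(c);
        map_[c] = result;  // remember for the next time we see it
        return result;
    }

    // Canonicalize every code unit of a string.
    std::u16string canonicalString(const std::u16string& s) {
        std::u16string out;
        out.reserve(s.size());
        for (std::u16string::size_type i = 0; i < s.size(); ++i) {
            out.push_back(canonical(s[i]));
        }
        return out;
    }

private:
    std::map<Unit, Unit> map_;
};

// The Latin I/J and U/V example: seed before any lookups happen.
inline Canonicalizer makeLatinCanonicalizer() {
    Canonicalizer c;
    c.seed(u'V', u'u');
    c.seed(u'v', u'u');
    c.seed(u'I', u'j');
    c.seed(u'i', u'j');
    return c;
}
```

Because the seeded entries are found by the same lookup that serves cached results, the special cases cost nothing extra per character once the map is set up.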
--
Glenn Andreas <email_removed>
<http://www.gandreas.com/> oh my!
Mad, Bad, and Dangerous to Know
Related mails:

| Author | Date |
|---|---|
| Robbie Haertel | Nov 25, 03:04 |
| Kevin Ballard | Nov 25, 04:55 |
| Glenn Andreas | Nov 25, 05:01 |
| Robbie Haertel | Nov 25, 05:03 |
| Robbie Haertel | Nov 25, 05:37 |
| Frederick Cheung | Nov 25, 09:41 |
| Glenn Andreas | Nov 25, 17:44 |