FROM : Aki Inoue
DATE : Wed Feb 06 23:19:01 2008
John,
Right now AppKit uses UCFindTextBreak under the covers, and that
Unicode Utilities function is, in turn, implemented on top of ICU.
> - When double-clicking, any Kanji or hiragana is treated as its own
> "word." e.g. if I double click on "東京大学院生" I get a single
> kanji selected, or if I double click on "こんにちは" I get a
> single kana selected. Any consecutive katakana do count as one word,
> so "マウスクリック" or "オフィスレーディー" are
> fully selected on double-click, even though the IME actually
> considers them to be two separate words.
This is the default word-breaking behavior for non-Japanese locales.
With the Japanese locale (which you can select in the Word Break popup
of the International preferences), UCFindTextBreak uses a slightly
more Japanese-friendly algorithm.
We have plans to enhance the AppKit user experience by integrating
CFStringTokenizer in the future.
Thanks for your input,
Aki
On 2008/02/06, at 9:25, John Stiles wrote:
> Well, in practical terms, it looks like AppKit doesn't try very hard
> for Japanese:
>
> - When double-clicking, any Kanji or hiragana is treated as its own
> "word." e.g. if I double click on "東京大学院生" I get a single
> kanji selected, or if I double click on "こんにちは" I get a
> single kana selected. Any consecutive katakana do count as one word,
> so "マウスクリック" or "オフィスレーディー" are
> fully selected on double-click, even though the IME actually
> considers them to be two separate words.
>
> - Search results with "Full word" enabled seem to follow this same
> pattern, as nonsensical as it may seem. I can't find the full word
> "東京" in "東京大学院生", but I can find
> "東" just fine. I can't find "マウス", but I can
> find "マウスクリック".
>
> So while your code may not work for all languages, AppKit does not
> seem to do a better job. If anything I'd call AppKit's behavior
> pretty broken.
>
> This is all using Leopard 10.5.1, by the way.
>
> I might look into UCFindTextBreak since it seems to exist all the
> way back to 10.0 and will probably work just fine for me. In
> actuality I don't think I need to support Asian languages that well
> anyway--in my case, this is for editing code/scripts, not
> freeform text. While Asian input is probably going to be supported,
> it should only be used in comments or string literals, and I don't
> expect that users will have high expectations for searching within
> it. (And it looks like the bar is set appropriately low for me
> anyway ;) )
>
>
>
> Mike Wright wrote:
>> On Feb. 5, 2008, at 22:30 , Deborah Goldsmith wrote:
>>
>>> This doesn't work for all languages. What constitutes a "word" is
>>> rather more complex than this. In Thai, for a particularly
>>> egregious example, you can't find word boundaries without looking
>>> up the words in a dictionary.
>>
>> Whole-word searching seems pretty unnecessary (and virtually
>> impossible) for Japanese and Chinese. Does it really make sense for
>> Thai? There are lots of skeptics regarding "word" even being a
>> valid concept in reference to Chinese languages. San Duanmu (The
>> Phonology of Standard Chinese) makes a good case for the concept in
>> Mandarin, at least, but I can't see any way that it could be used
>> as a basis for whole-word text searches. And long strings of
>> hiragana in Japanese seem to require human intuition to find word
>> breaks. (And better intuition than mine.)
>>
>>> On Tiger, you can use either the double-click API in Cocoa, or
>>> UCFindTextBreak, to find word boundaries. On Leopard or later, use
>>> CFStringTokenizer. All of them will do the right thing for word
>>> boundaries in every language we support.
>>
>> Is there a list somewhere of the supported languages? (I assume you
>> mean supported by those APIs, and that writing systems like
>> Japanese and Chinese that don't include some set of word delimiters
>> are excluded. And Thai and other Brahmi-derived scripts?)
>>
>> Did you happen to see my response to Douglas Davidson the next day
>> (Jan 30, 2008 Message-ID: <<email_removed>>)? Here's a restatement of it:
>>
>> From my perspective, the problem is that the "whole words" to be
>> searched for are not always words in any linguistic sense. Judging
>> by the TextEdit Find panel, the double-click API doesn't seem to be
>> capable of treating "a:" as a word, but it's just the kind of thing
>> that I might want to perform a whole-word search for, trying to
>> find something like " a: " or "\na:\n" in a mishmash of strings
>> like "5a:--w7".
>>
>> And, as John Stiles pointed out somewhere, the Text Edit Find panel
>> can't find something like "way home" doing a Full Word search in
>> text containing: I don't know the way home. Go away homewrecker.
>> Look at the way Homer ran.
>>
>> My method can find the desired target text in both of those cases--
>> at least in English, and presumably in other Roman-based scripts.
>> And, it's pretty easy to change the set of word delimiters.
>>
>> So, maybe "whole phrase" is a more accurate term than "whole word",
>> but it's the kind of behavior that I expect--and that I think my
>> customers expect. (My customers aren't real big on providing
>> feedback as long as they're happy, but I figure no news is good
>> news.)
>>
>> Will UCFindTextBreak do any better in this kind of case? Or
>> CFStringTokenizer? (Although, given my customer base, I don't
>> expect to use any Leopard-only APIs for a long time.)
>>
>> Mike Wright
>> http://www.idata3.com/
>> http://www.raccoonbend.com/
>>
>>
>>> Deborah Goldsmith
>>> Apple Inc.
>>> <email_removed>
>>>
>>> On Jan 29, 2008, at 12:28 PM, Mike Wright wrote:
>>>
>>>> On Jan 29, 2008, at 10:12:21 -0800, John Stiles <<email_removed>> wrote:
>>>>
>>>>> I'm trying to find a substring in an NSString. But I want to find
>>>>> whole words (e.g. like in the Find panel when you choose "Full
>>>>> word" from the popup, rather than "Contains" or "Starts With").
>>>>>
>>>>> Unless I'm missing something, it looks like NSString's
>>>>> -rangeOfString:options:range:locale: doesn't have an option for
>>>>> finding whole words.
>>>>>
>>>>> How does the Find panel do it, then? Am I going to have to "roll my
>>>>> own" code for string searching? That sounds error-prone to me; I'd
>>>>> much rather have the OS do it.
>>>>
>>>> Here's a Tiger approach that's worked pretty well for me (or, at
>>>> least, no non-English-using customers have complained--so far).
>>>>
>>>> NSString *fieldContent;  // the string I'm searching in
>>>> NSString *targetString;  // the string to be found
>>>> NSRange hitRange;        // the range of targetString found within fieldContent
>>>> NSRange testRange;       // in the beginning, this covers all of fieldContent
>>>> BOOL caseSensitive;      // specified by the user
>>>> BOOL isWholeWord = NO;   // this is used in two sequential tests
>>>>
>>>> // set up the search mask
>>>> unsigned searchMask = NSLiteralSearch;
>>>> if (!caseSensitive)
>>>>     searchMask |= NSCaseInsensitiveSearch;
>>>>
>>>> // set up the character set for words
>>>> NSCharacterSet *wordCharacterSet = [NSCharacterSet alphanumericCharacterSet];
>>>>
>>>> // look for targetString in fieldContent
>>>> hitRange = [fieldContent rangeOfString:targetString
>>>>                                options:searchMask
>>>>                                  range:testRange];
>>>>
>>>> // if we found something, do the whole-word test
>>>> if (hitRange.location != NSNotFound)
>>>> {
>>>>     // test the beginning of targetString
>>>>     isWholeWord = ((hitRange.location == 0) ||
>>>>         (![wordCharacterSet characterIsMember:
>>>>             [fieldContent characterAtIndex:(hitRange.location - 1)]]));
>>>>
>>>>     // if the beginning is okay, test the end of targetString
>>>>     if (isWholeWord)
>>>>     {
>>>>         unsigned nextCharPosition = hitRange.location + hitRange.length;
>>>>         isWholeWord = ((nextCharPosition == [fieldContent length]) ||
>>>>             (![wordCharacterSet characterIsMember:
>>>>                 [fieldContent characterAtIndex:nextCharPosition]]));
>>>>     }
>>>> }
>>>>
>>>> Finally:
>>>>
>>>> if (isWholeWord)
>>>> {
>>>>     // show it to the user
>>>> }
>>>>
>>>> Hope this helps. (And, since it's not just copied from my own
>>>> code, I hope it doesn't contain any serious errors.)
>>>>
>>>> Regards,
>>>> Mike Wright
>>>> http://www.idata3.com/
>>>> http://www.raccoonbend.com/
>>>> _______________________________________________
>>>>
>>>> Cocoa-dev mailing list (<email_removed>)
>>>>
>>>> Please do not post admin requests or moderator comments to the
>>>> list.
>>>> Contact the moderators at cocoa-dev-admins(at)lists.apple.com
>>>>
>>>> Help/Unsubscribe/Update your Subscription:
>>>> http://lists.apple.com/mailman/options/cocoa-dev/<email_removed>
>>>>
>>>> This email sent to <email_removed>
>>>
>>>
>>
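
The whole-word test in Mike Wright's snippet quoted above, as posted, examines only the first hit returned by the search; if that hit fails the boundary check, any later occurrence goes unexamined. Below is a minimal sketch of the same boundary test wrapped in a loop, written in plain C so that it stands alone (C also compiles unchanged inside an Objective-C file). It rests on several assumptions: find_whole_word and is_word_char are hypothetical names, strstr stands in for the NSString search, isalnum stands in for alphanumericCharacterSet, and the text is treated as ASCII bytes, so it says nothing about Japanese, Chinese, or Thai.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Nonzero when c counts as a "word" character. isalnum() is only a
   stand-in for +[NSCharacterSet alphanumericCharacterSet]; ASCII only. */
static int is_word_char(char c) {
    return isalnum((unsigned char)c);
}

/* Hypothetical helper: returns the byte offset of the first occurrence of
   target in text that has no word character on either side, or -1 if
   there is none. */
static long find_whole_word(const char *text, const char *target) {
    size_t text_len = strlen(text);
    size_t target_len = strlen(target);
    const char *cursor = text;
    const char *hit;

    if (target_len == 0)
        return -1;

    while ((hit = strstr(cursor, target)) != NULL) {
        size_t start = (size_t)(hit - text);
        size_t end = start + target_len;

        /* Same two tests as the quoted snippet: the character before the
           hit and the character after it must not be word characters. */
        int starts_clean = (start == 0) || !is_word_char(text[start - 1]);
        int ends_clean = (end == text_len) || !is_word_char(text[end]);

        if (starts_clean && ends_clean)
            return (long)start;

        /* Rejected: resume the search one byte past this hit's start. */
        cursor = hit + 1;
    }
    return -1;
}
```

Under these assumptions, find_whole_word("Go away homewrecker. I know the way home.", "way home") skips the hit inside "away homewrecker" and returns 32, the offset of the second occurrence, while searching "5a:--w7" for "a:" returns -1.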







