NSTextView, Mail, and umlauts / combining marks / diacriticals

  • Hi All!

    We have the following issue: Unicode has two representations for the
    German umlaut ä.
    On the one hand, the precomposed form
    U+00E4
    and on the other hand, the decomposed form
    U+0061 U+0308

    The Cocoa text system usually displays them identically, except that the
    latter counts as two characters. In Mail, however, it is also displayed
    as two characters: ä
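
    A minimal sketch of the two representations, written in Python rather
    than Cocoa (the thread itself contains no code, so the language and the
    helper name code_points are assumptions of this example). It only
    illustrates that the two forms differ in length yet are canonically
    equivalent under Unicode normalization:

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS


def code_points(s):
    """Return the code points of s in U+XXXX notation (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in s]


# The strings differ code point by code point...
print(code_points(precomposed), len(precomposed))
print(code_points(decomposed), len(decomposed))

# ...but normalization converts one form into the other.
nfc = unicodedata.normalize("NFC", decomposed)   # compose
nfd = unicodedata.normalize("NFD", precomposed)  # decompose
print(nfc == precomposed, nfd == decomposed)
```

    An editor that wants to show every code point has to avoid normalizing
    the text it stores, since either normalization form hides the
    distinction.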

    Since we are writing our own text editor, we want to display the
    separate Unicode code points as such, and see an a and a " in that case.
    The display in Mail suggests there is a switch for getting that
    behaviour. Does anybody know what I have to turn on or subclass to make
    this work?

    Best,
      dom

    P.S.: On a side note, if you want some fun, place many U+0308s
    after a vowel and you get many, many "s above it. And if you want more
    fun, turn on the spell checker and boom goes the spell checker...

    --
    Dominik Wagner          Mail: <dom...>
    TheCodingMonkeys        http://www.codingmonkeys.de/
    Blog - DasGenie: !Scrap http://scrap.dasgenie.com/
  • Hi Dominik,

    I guess I didn't clearly understand what you'd like to do:
    display ä as a followed by ¨, whether or not it is a composite
    character, or convert whatever the text editor receives into a
    composite character?

    In both cases, I think you'd need to rely on cases or an array of
    decompositions.

    For the latter case, I'd take a look at input methods (NSTextInput
    or something like that) to translate what the user types into
    something else.

    Regards.

  • On 18.10.2007, at 13:59, Half Activist wrote:

    > Hi Dominik,
    >
    > I guess I didn't clearly understand what you'd like to do:
    > display ä as a followed by ¨, whether or not it is a composite
    > character, or convert whatever the text editor receives into a
    > composite character?

    What our customers want is to be able to clearly see and edit every
    Unicode code point. So they see a a" if it has two code points and
    a ä if it has one. On top of that, we'd like to be able to edit the
    code points individually: delete the " separately from the a, as well
    as add or remove them.

    This might also be an optional thing to turn on in the view menu.

    > In both cases, I think you'd need to rely on cases or an array of
    > decompositions.
    >
    > For the latter case, I'd take a look at input methods
    > (NSTextInput or something like that) to translate what the user
    > types into something else.

    I'm not so much interested in the input side as in the representation
    side.

    best,
      dom

  • > What our customers want is to be able to clearly see and edit every
    > Unicode code point. So they see a a" if it has two code points
    > and a ä if it has one.

    Would it help to insert e.g. a "Zero Width No-Break Space" (U+FEFF)
    between every code point?

    Regards,
    Mani
  • On Thursday, October 18, 2007, at 07:17AM, "Manfred Schwind" <lists...> wrote:
    > Would it help to insert e.g. a "Zero Width No-Break Space" (U+FEFF)
    > between every code point?

    I would not recommend doing that.  ZWNBSP also happens to be used for the UTF-16 BOM.  There is another character to use instead should you need to go this route (see the unicode.org "BOM" FAQ).

    Having said that, a proper solution I think would be to step down to lower-level text APIs and not use the higher-level ones.

    --
    Rick Sharp
    Instant Interactive(tm)
  • On 18.10.2007, at 15:21, Ricky Sharp wrote:

    > Having said that, a proper solution I think would be to step down
    > to lower-level text APIs and not use the higher-level ones.

    Since I don't want to do that, and have seen that Mail actually does
    separate the a and the " in its plain-text mode: Doug or Aki,
    can you tell me how this is achieved?

    Best,
      dom

  • Maybe you should try asking on WebKit dev, since you say Mail does
    what you want to achieve, and Mail.app uses a WebView for composing and
    displaying content.

    Regards

  • Dominik,

    The Mail (or WebView) behavior is not a feature, but rather a
    limitation of the rendering engine 8-).
    Since XML content (including HTML) is recommended to be maximally
    precomposed, WebKit prefers not to bother with accent composition in
    certain cases, to gain performance.

    There are several ways to achieve what you're looking for here.

    If what your users are requesting really is the ability to edit accent
    marks, you could use the -deleteBackwardByDecomposingPreviousCharacter:
    method (it's bound to Control-Delete by default).
    This variant of the -deleteBackward: method treats accent characters
    individually.
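
    To make the idea concrete, here is a rough sketch in Python of what
    "deleting by decomposing" could mean at the string level. It is not the
    AppKit implementation, and the function name below is hypothetical; the
    sketch simply decomposes the final character canonically and drops its
    last code point:

```python
import unicodedata


def delete_backward_by_decomposing(text):
    """Drop the last code point of text after canonically decomposing
    its final character (hypothetical helper, not AppKit code)."""
    if not text:
        return text
    head, last = text[:-1], text[-1]
    # NFD turns U+00E4 into U+0061 U+0308; a lone combining mark or a
    # plain letter is left unchanged by the decomposition.
    parts = unicodedata.normalize("NFD", last)
    return head + parts[:-1]


print(delete_backward_by_decomposing("\u00e4") == "a")
print(delete_backward_by_decomposing("a\u0308") == "a")
```

    Under this assumption both spellings of ä lose only the diaeresis,
    whereas an ordinary backward delete would typically remove the whole
    composed character.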

    If you really need to display the base character and accents
    separately, you could subclass NSTextStorage and dynamically switch
    the content of the string it returns.  For example, you could insert a
    space between the base character and the accent characters when
    displaying them separately.  Conventionally, Unicode suggests the space
    character (U+0020) as the spacer.
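
    The string transformation behind that suggestion can be sketched as
    follows, again in Python and outside of any NSTextStorage subclass. The
    function name separate_marks is an assumption of this example, and
    combining marks are detected here via the Unicode general category
    (Mn, Mc, Me), which is one possible choice among several:

```python
import unicodedata

SPACER = "\u0020"  # the spacer character named in the post above


def separate_marks(text, spacer=SPACER):
    """Return a display string with spacer inserted before every
    combining mark, leaving precomposed characters untouched."""
    out = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            out.append(spacer)
        out.append(ch)
    return "".join(out)


print(separate_marks("a\u0308") == "a \u0308")   # decomposed: mark split off
print(separate_marks("\u00e4") == "\u00e4")      # precomposed: unchanged
```

    Note that the text is deliberately not normalized first, so a
    precomposed ä stays a single character. A real text storage subclass
    would also have to map character indexes between the stored string and
    the displayed one, which this sketch does not attempt.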

    A slightly more involved approach is to unmark combining marks so
    that the rendering engine treats them as normal glyphs.
    NSGlyphGenerator marks combining marks as NSGlyphInscribeOverstrike,
    indicating the need for rendering-time composition.
    You could have a custom NSGlyphGenerator subclass that filters the
    flag out.  One gotcha here is that some fonts use a negative glyph
    advancement by default for combining marks.  You might have to tweak
    the glyph position in these cases.

    Aki
