Re: [PATCH] console UTF-8 fixes

Egmont Koblinger wrote:

- If a certain (otherwise valid UTF-8) character is not found in the glyph
  table, the current code does one of these two (depending on other
  circumstances):

  - Either it displays the replacement character U+FFFD, falling back to a
    simple question mark. Note that the Unicode replacement character U+FFFD
    is to be used for invalid sequences. However, it shouldn't necessarily
    be used when replacing a valid but undisplayable character. Think of
    Pango for example that renders these as four hex digits inside a square.
    To be able to visually distinguish between illegal sequences and legal
    but undisplayable characters, I think U+FFFD or the question mark are
    bad choices. In fact, any symbol that may normally occur in the text is
    a bad choice if it is displayed simply. Hence I chose to display an
    inverted dot.


I strongly disagree. First of all, you're changing the semantics of a 13-year-old API. The semantics of the Linux console are that by specifying U+FFFD SUBSTITUTION GLYPH in your Unicode table, you have specified the fallback glyph.

What's worse, you've hard-coded the uses of specific visual representations. That is completely unacceptable.

  - Another possible thing the current code may do (for latin1-compatible
    characters) is to simply display the glyph loaded in that position.
    Suppose I have loaded a latin2 font. In latin2, 0xFB is a "u with
    double accent". An application prints U+00FB, which is a "u with
    circumflex". Since this glyph is not present in latin2, it cannot be
    printed with the current font. Still, the current code falls back to
    printing the glyph from the 0xFB position of the glyph table. Hence my
    app asked to print "u with circumflex" but a "u with double accent"
    appears on the screen. This is totally contrary to the goals of Unicode
    and shouldn't ever happen.

When does that happen?  That is clearly a bug.
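
The mismatch in the quoted example can be checked with a short sketch. It uses Python's ISO 8859-2 codec purely as a stand-in for the glyph order of a latin2 font (an assumption for illustration; the console's font tables are not Python codecs), and the helper names are hypothetical.

```python
import unicodedata

def latin2_char_at(position):
    """Character that ISO 8859-2 (latin2) assigns to a byte position."""
    return bytes([position]).decode("iso8859_2")

def in_latin2(ch):
    """True if latin2 has any position for this character."""
    try:
        ch.encode("iso8859_2")
        return True
    except UnicodeEncodeError:
        return False

requested = "\u00fb"          # what the application asked for
shown = latin2_char_at(0xFB)  # what sits at position 0xFB in latin2

print(unicodedata.name(requested))
print(unicodedata.name(shown))
print(in_latin2(requested))
```

If position 0xFB is used as a fallback for U+00FB, the character drawn is a different letter from the one requested, and latin2 has no position at all for the requested one.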

- The replacement character for invalid UTF-8 sequences is U+FFFD, falling
  back to a question mark. I've changed the fallback version to an inverted
  question mark. This way it's more similar to the common glyph of U+FFFD,
  and it's more obvious to the user that it's not a literal question mark
  but rather some erroneous situation.

Brilliant. You've picked a fallback glyph which is unlikely to exist in all fonts. The whole point of falling back to ? is that it's an ASCII character, which means that if the font designer failed to designate a fallback glyph -- which is an error!!! -- there is at least some hope of conveying the error back to the user.
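
A minimal sketch of the fallback order described here, assuming the font's Unicode table can be modeled as a dict from code point to glyph slot. The names `lookup_glyph` and `unimap` are hypothetical and do not come from the kernel source.

```python
REPLACEMENT = 0xFFFD    # fallback glyph designated by the font's table
QUESTION_MARK = 0x3F    # ASCII '?', the last resort

def lookup_glyph(unimap, codepoint):
    """Return the glyph slot for codepoint, trying U+FFFD and then '?'.

    unimap is a hypothetical dict {code point: glyph slot}.
    Returns None only if the font maps none of the three candidates.
    """
    for candidate in (codepoint, REPLACEMENT, QUESTION_MARK):
        slot = unimap.get(candidate)
        if slot is not None:   # slot 0 is a valid slot, so test against None
            return slot
    return None

# A font whose designer did not designate U+FFFD still reports the error via '?'.
print(lookup_glyph({0x3F: 63}, 0x4E2D))
```

The last step depends only on an ASCII character, which is the point being argued: it is the one fallback that nearly every font can be expected to provide.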

- Overlong sequences are not caught currently; they're displayed as if they
  were valid representations. This may even have security implications.

- Lone continuation bytes (section 3.1 of the UTF-8 stress test) are
  currently displayed as some "random" glyphs rather than the replacement
  character.

- Incomplete sequences (sections 3.2 and 3.3) emit no replacement character,
  but rather cause the subsequent valid character to be displayed multiple
  times(!).

These are valid issues.
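
For illustration, a sketch of a decoder that handles the three cases above: overlong sequences, lone continuation bytes, and incomplete sequences. It is not the patch under discussion. It assumes the RFC 3629 limits (at most four bytes, no surrogates), and emitting one U+FFFD per malformed sequence is a choice made for this sketch.

```python
REPLACEMENT = 0xFFFD

def decode_utf8_strict(data):
    """Decode bytes into a list of code points, emitting U+FFFD for bad input."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(b)
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF7:
            need, cp, minimum = 3, b & 0x07, 0x10000
        else:
            # Lone continuation byte (0x80..0xBF) or invalid lead (0xF8..0xFF).
            out.append(REPLACEMENT)
            i += 1
            continue
        # Consume up to `need` continuation bytes.
        j = i + 1
        while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i - 1 < need:
            # Incomplete sequence: one replacement, then resume at the byte
            # that broke the sequence so it is decoded exactly once.
            out.append(REPLACEMENT)
        elif cp < minimum or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            # Overlong encoding, UTF-16 surrogate, or out of range.
            out.append(REPLACEMENT)
        else:
            out.append(cp)
        i = j
    return out

print([hex(c) for c in decode_utf8_strict(b"\xc0\xaf")])   # overlong '/'
print([hex(c) for c in decode_utf8_strict(b"\xe2\x82A")])  # incomplete, then 'A'
```

The `minimum` check is what rejects overlong forms: a two-byte sequence must encode at least U+0080, a three-byte sequence at least U+0800, and a four-byte sequence at least U+10000.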

- There's no concept of double-width characters. It's way beyond the scope
  of my patch to try to display them, but at least I think it's important
  for the cursor to jump two positions when printing such characters, since
  this is what applications (such as text editors) expect. If the cursor
  didn't jump two positions, applications would suffer from displaying and
  refreshing problems, and editing some English letters that are preceded by
  some CJK characters in the same line would become a nightmare. With my patch an
  inverted dot followed by an inverted space is displayed for double-width
  characters so it's quite easy to see that they are tied together.

To be able to do CJK you need something like Kon anyway. This feels like bloat.
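
To make the cursor-advance point concrete, here is a rough sketch using the East Asian Width property from Python's `unicodedata` module. Treating the classes "W" and "F" as two cells is a simplifying assumption for this sketch, not necessarily the rule the patch uses, and `cell_width` and `advance_cursor` are hypothetical names.

```python
import unicodedata

def cell_width(ch):
    """Cells a character occupies: 2 for Wide/Fullwidth, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def advance_cursor(column, text):
    """Column the cursor ends up in after printing text from `column`."""
    for ch in text:
        column += cell_width(ch)
    return column

# Two CJK characters followed by two ASCII letters.
print(advance_cursor(0, "\u4e2d\u6587ab"))
```

An application that assumes this width model and a console that advances one cell per character will disagree about the cursor position after the first wide character.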

- There's no concept of zero-width characters (such as combining accents)
  either. Yet again it's beyond the scope of my patch to properly handle
  them. Instead of the current behavior (write a replacement character) I
  just ignore them so that full-screen applications can keep track of the
  cursor position correctly.

There is a concept of combining sequences. For anything else, I suspect it's better to let the user know that something bad is happening.
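
As a sketch of what "zero-width" means for cursor tracking, the following counts columns while skipping combining marks. It assumes the Unicode combining class and the Mn/Me general categories are a reasonable approximation of zero width; the helper names are hypothetical.

```python
import unicodedata

def is_zero_width(ch):
    """True for combining marks, which attach to the preceding character."""
    return (unicodedata.combining(ch) != 0
            or unicodedata.category(ch) in ("Mn", "Me"))

def columns_used(text):
    """Columns a full-screen application would expect text to occupy."""
    return sum(0 if is_zero_width(ch) else 1 for ch in text)

# 'e' + COMBINING ACUTE ACCENT + 'x' occupies two columns, not three.
print(columns_used("e\u0301x"))
```

Whether the console silently drops such a character or draws a replacement glyph in its own cell determines whether the application's column count still matches the screen.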

- I believe (at least I do hope) that my code is cleaner, more
  straightforward, easier to understand, and is slightly better documented
  than the current version. The current code doesn't separate UTF-8 decoding
  and glyph displaying parts. I clearly separated them. First I perform
  UTF-8 decoding (this emits U+FFFD for invalid sequences), then check for
  the width of the resulting character, change it to U+FFFD if it's
  unprintable (e.g. a UTF-16 surrogate), and finally comes the part that
  does its best in displaying the character on the screen.
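
The separation described in this item could be sketched roughly as below. This is not the patch's code. Python's built-in decoder with `errors="replace"` stands in for the decoding stage, the glyph table is modeled as a dict, and every name is hypothetical.

```python
REPLACEMENT = 0xFFFD

def sanitize(cp):
    """Middle stage: turn code points that can never be shown into U+FFFD."""
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:   # surrogate or out of range
        return REPLACEMENT
    return cp

def display_pipeline(data, unimap):
    """Decode, sanitize, then map each code point to a glyph slot."""
    text = data.decode("utf-8", errors="replace")   # stand-in for stage 1
    slots = []
    for ch in text:
        cp = sanitize(ord(ch))                      # stage 2
        # Stage 3: look the code point up, falling back to the U+FFFD slot.
        slots.append(unimap.get(cp, unimap.get(REPLACEMENT)))
    return slots

# 'A' maps to slot 1; the invalid byte 0xFF ends up in the U+FFFD slot.
print(display_pipeline(b"A\xff", {0x41: 1, 0xFFFD: 0}))
```

Keeping the stages apart means the display step only ever sees code points, so its fallback policy can be changed without touching the decoder.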

I hope you like it. :)

Please see above comments.

	-hpa