[BlueOnyx:15083] Re: YUM updates for 5106R/5107R/5108R

Michael Stauber mstauber at blueonyx.it
Fri Apr 4 13:57:53 -05 2014


Hi Hisao,

> I checked on my BlueQuartz 5200R, the result as same as you tested.
> And, this file is written in UTF-8.
> The result of vavationMsg looks like corrupted, cceclient doesn’t support multibyte
> to display, because of why.

Yeah, it is doing some encoding and that also depends on the charset
that the submitted text was initially in.

I did some tests, too. I installed an old 5108R from two years ago and
did not YUM update it. I changed the language to Japanese and created a
site with a dozen users that had Japanese names, Japanese comments and
Japanese vacation messages.

Then I fully YUM updated it and the display was indeed as broken as in
the screenshots that Eiji posted in [BlueOnyx:15081].

I then examined the CODB object of one of the Users. In the GUI I had
entered his name as this:

ベルタ

In CODB it looked like this:

102 DATA fullName = "\245\331\245\353\245\277"

With the GUI now being in UTF-8 (even for Japanese), I changed his
"fullName" back to "ベルタ" and saved this user again. It then got stored
in CODB as this:

102 DATA fullName = "\343\203\231\343\203\253\343\202\277"

So we can assume this:

CODB data when submitted as EUC-JP:
102 DATA fullName = "\245\331\245\353\245\277"

Same CODB data when submitted as UTF-8:
102 DATA fullName = "\343\203\231\343\203\253\343\202\277"
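
As a quick cross-check of that assumption, here is a minimal Python sketch
that turns the two CODB values back into raw bytes and decodes them. The
helper name codb_unescape is hypothetical and only handles the \ooo escapes
seen above; it is not the CCE client code.

```python
import re

def codb_unescape(value: str) -> bytes:
    """Turn backslash-octal escapes (three octal digits) into raw bytes."""
    # Hypothetical helper: assumes the value is plain ASCII plus \ooo escapes.
    return re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),
        value.encode("ascii"),
    )

old = codb_unescape(r"\245\331\245\353\245\277")
new = codb_unescape(r"\343\203\231\343\203\253\343\202\277")

print(old.decode("euc_jp"))  # ベルタ
print(new.decode("utf-8"))   # ベルタ
```

Both values decode to the same name, each under its own charset.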

To me it isn't entirely clear where, why or how CODB does the
transformation. I don't understand the C code well enough. But when I
look at the Perl client module CCE.pm
(http://devel.blueonyx.it/trac/browser/BlueOnyx/utils/cce/client/perl/CCE.pm)
it appears that the sub _escape does the encoding and the sub unescape
does the decoding. We can assume that the Perl module does the same
procedure as the PHP library of the same purpose.

If that's the case, then each non-ASCII byte of the value appears to be
stored as a backslash followed by its three-digit octal value.
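
To illustrate the idea, here is a rough Python sketch of such an escape
step. It is an assumption about the general technique, not a port of
CCE.pm: the function name octal_escape and the choice to also escape the
double quote and the backslash are made up for this example.

```python
def octal_escape(raw: bytes) -> str:
    # Hypothetical stand-in for an escape routine; NOT the CCE.pm code.
    out = []
    for b in raw:
        # Keep printable ASCII as-is, except the double quote and backslash.
        if 0x20 <= b < 0x7F and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)  # one three-digit octal group per byte
    return "".join(out)

print(octal_escape("ベルタ".encode("euc_jp")))
print(octal_escape("ベルタ".encode("utf-8")))
```

Feeding it the name in EUC-JP and in UTF-8 reproduces the two strings seen
in CODB.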

As you can see, in UTF-8 the same Japanese text is also longer. It is
one and a half times as long: 6 groups for EUC-JP, 9 for UTF-8.

That is explained by multibyte encoding. Each group is one byte. EUC-JP
needs two bytes for each of these Katakana characters, while UTF-8 needs
three.
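
The byte counts are easy to confirm with Python's built-in codecs (a small
sketch; the variable names are only for illustration):

```python
name = "ベルタ"

for codec in ("euc_jp", "utf-8"):
    encoded = name.encode(codec)
    # Bytes needed for each individual character in this codec.
    per_char = [len(ch.encode(codec)) for ch in name]
    print(codec, len(encoded), per_char)
```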

I just did some math and it looks like this:

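Char:	タ				(Katakana)
=	\343\202\277 			(octal, one group per byte)
=	E3 82 BF			(hex)
=	11100011 10000010 10111111	(binary)

And that explains it. Stripping the UTF-8 marker bits (1110, 10, 10)
leaves 0011 000010 111111, which is 30BF in hex. The Katakana character
"タ" is Unicode code point U+30BF, so these three bytes are its UTF-8
encoding: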

http://www.eva.hi-ho.ne.jp/cgi-bin/user/zxcv/decodeUTF8.cgi?req=url&url=%E3%83%9F%E3%83%A4%E3%83%95%E3%82%B8%E3%83%AA%E3%83%A8%E3%82%A6%E3%82%BF
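
The same arithmetic as a short Python sketch: each octal group is read as
one byte, and the code point is then rebuilt from the payload bits of a
three-byte UTF-8 sequence.

```python
# Each octal group is one byte.
raw = bytes(int(group, 8) for group in ("343", "202", "277"))
print(raw.hex())  # e382bf

# Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)
print(f"U+{code_point:04X}", code_point, chr(code_point))  # U+30BF 12479 タ
```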

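In the EUC-JP table "タ" = A5 BF (hex), which is \245\277 in octal. That
is exactly the last pair in "\245\331\245\353\245\277". The other two
pairs match as well: "ベ" = A5 D9 = \245\331 and "ル" = A5 EB = \245\353.

See: http://fcd3.org/nihongo/euc-jp/index.html

In the Shift-JIS table "タ" = 83 5E (hex), which would be \203\136 in
octal. (The decimal value "12479" is the Unicode code point U+30BF, not
the Shift-JIS code.)

See:
http://www.kreativekorp.com/charset/encoding.php?file=shift-jis.kte&char=835E

So the EUC-JP encoded data is not stored in CODB as Shift-JIS. The old
objects simply contain the raw EUC-JP bytes, escaped in octal.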
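
The comparison can also be checked mechanically. This small Python sketch
prints the bytes of "タ" in each of the three encodings, as hex and as
octal escapes, so they can be held against the CODB value:

```python
ch = "タ"

for codec in ("euc_jp", "shift_jis", "utf-8"):
    raw = ch.encode(codec)
    hex_form = " ".join(f"{b:02X}" for b in raw)
    octal_form = "".join(f"\\{b:03o}" for b in raw)
    print(f"{codec:10} {hex_form:10} {octal_form}")
```

Only the EUC-JP row (A5 BF, \245\277) appears in the old CODB string.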

Anyway: To fix this issue for the moment (pending a more thorough
solution) I did this two-part update, which is now available via YUM:

1.) sausalito-i18n-*:

The Class I18n.php got modified again. The function I18n::Utf8Encode()
now checks if the input string is in EUC-JP. If so, it is converted to
UTF-8. After that check the result is passed through I18n::detectUTF8(),
which might (or might not) run the string through BXEncoding::toUTF8(),
which fixes damaged UTF-8 text.
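
For illustration only, the general idea can be sketched in Python like
this. The function name to_utf8 is hypothetical, and the order of the
checks is an assumption made for this sketch: it tests for valid UTF-8
first and only then tries EUC-JP, which is not necessarily what
I18n::Utf8Encode() does internally.

```python
def to_utf8(raw: bytes) -> bytes:
    # Hypothetical sketch, not the I18n::Utf8Encode() implementation.
    try:
        raw.decode("utf-8")  # already valid UTF-8: leave untouched
        return raw
    except UnicodeDecodeError:
        pass
    try:
        # Assumption: anything that is not UTF-8 is legacy EUC-JP.
        return raw.decode("euc_jp").encode("utf-8")
    except UnicodeDecodeError:
        return raw  # neither charset fits: give up and return as-is

legacy = bytes.fromhex("a5d9a5eba5bf")  # old EUC-JP value of "ベルタ"
fixed = to_utf8(legacy)
print(fixed.decode("utf-8"))            # ベルタ
print(to_utf8(fixed) == fixed)          # True
```

Running already converted text through the function again leaves it
unchanged, so a value is only ever transformed once.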

SVN: http://devel.blueonyx.it/trac/changeset/1397/BlueOnyx/5107R/i18n

2.) base-user.mod:

There are only a few pages where we might run into these problems,
namely the user list, the page where users are edited, the personal
profile and personal email.

The affected input fields are:

- Username
- Comments
- Vacation message text

I modified the GUI pages for these to pass the above CODB data through
I18n::Utf8Encode() for cleaning. If the text *was* EUC-JP, it will be
shown as correct UTF-8. Upon saving it will be stored in CODB as UTF-8.
On subsequent usage of the same pages no further EUC-JP to UTF-8
transformation will be required, as the text is shown (and saved)
correctly by then.

SVN: http://devel.blueonyx.it/trac/changeset/1400/BlueOnyx/ui/base-user.mod

There *might* be other fields in the GUI where this may also be needed,
but right now I can't think of any.

So I think that might do it for now.

-- 
With best regards

Michael Stauber


