[BlueOnyx:15085] Re: YUM updates for 5106R/5107R/5108R

Fri Apr 4 15:22:22 -05 2014

Hi Michael,

I understand the situation now, after your update.
There is no translation function between the GUI and CODB, so after the update
CODB holds two encodings as soon as new users are added.
If the GUI charset is EUC-JP, the object is stored as EUC-JP.
If the GUI charset is UTF-8, the object is stored as UTF-8.

Before : GUI(EUC-JP) -> CODB(EUC-JP)
After  : GUI(UTF-8)  -> CODB(UTF-8)

So we need to handle objects in both EUC-JP and UTF-8, and translate them to
UTF-8 for display.

> 2.) base-user.mod:
> 
> The pages where we might run into these problems are just a few. Namely:
> The user-list, the page where users are edited, personal profile and
> personal email.
> 
> The affected input fields are:
> 
> - Username
> - Comments
> - Vacation message text
> 
> I modified the GUI pages for these to pass the above CODB data through
> I18n::Utf8Encode() for cleaning. If the text *was* EUC-JP, it will be
> shown as correct UTF-8. Upon saving it will be stored in CODB as UTF-8.
> On subsequent usage of the same pages no further EUC-JP to UTF-8
> transformation will be required, as the text is shown (and saved)
> correctly by then.
> 
> SVN: http://devel.blueonyx.it/trac/changeset/1400/BlueOnyx/ui/base-user.mod
> 
> There *might* be other fields in the GUI where this may also be needed,
> but right now I can't think of any.

This is one way to resolve it, but we would need to add translation code to
all modules, because Japanese can also be entered in fields such as
Description on other GUI pages.

The get() function in /usr/sausalito/ui/libPhp/CceClient.php is called by PHP
to get CODB data.
So I think adding the translation code to the get() function would resolve
this issue.
The result of ccephp_get($this->handle, $oid, $namespace); is a
multidimensional array, so we need to translate all of its values to UTF-8.

I have not written the code to translate all values yet.
I believe this is the way to resolve it, but we need to check that it does
not affect other languages.
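
For illustration only, here is a rough Python sketch of that idea. It is not
the actual PHP change to CceClient.php. The names to_utf8_text and
convert_all and the sample object are made up, and it assumes that raw values
arrive as byte strings and that anything that is not valid UTF-8 is legacy
EUC-JP.

```python
def to_utf8_text(raw):
    """Decode one stored value: try UTF-8 first, then fall back to EUC-JP."""
    if not isinstance(raw, bytes):
        return raw  # not a byte string: leave the value untouched
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 data is legacy EUC-JP. errors="replace"
        # keeps the page displayable even if that guess is wrong.
        return raw.decode("euc_jp", errors="replace")


def convert_all(value):
    """Recursively convert every value in a nested dict/list structure."""
    if isinstance(value, dict):
        return {key: convert_all(item) for key, item in value.items()}
    if isinstance(value, list):
        return [convert_all(item) for item in value]
    return to_utf8_text(value)


# Hypothetical object, shaped only loosely like a get() result.
user = {
    "fullName": b"\xa5\xd9\xa5\xeb\xa5\xbf",                  # stored as EUC-JP
    "description": b"\xe3\x83\x99\xe3\x83\xab\xe3\x82\xbf",   # stored as UTF-8
    "aliases": [b"berta"],
}
converted = convert_all(user)
```

Trying UTF-8 first works here because EUC-JP text is very unlikely to be valid
UTF-8. The EUC-JP fallback is exactly the part that would need checking
against other languages, since legacy data in another charset would be
decoded wrongly.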

What do you think, Michael?

Thanks,
Hisao

On Apr 5, 2014, at 3:57 AM, Michael Stauber <mstauber at blueonyx.it> wrote:

> Hi Hisao,
> 
>> I checked on my BlueQuartz 5200R, and the result is the same as in your test.
>> And, this file is written in UTF-8.
>> The vacationMsg result looks corrupted; I think that is because cceclient
>> does not support displaying multibyte characters.
> 
> Yeah, it is doing some encoding and that also depends on the charset
> that the submitted text was initially in.
> 
> I did some tests, too. I installed an old 5108R from two years ago and
> did not YUM update it. I changed the language to Japanese and created a
> site with a dozen users that had Japanese names, Japanese comments and
> Japanese vacation messages.
> 
> Then I fully YUM updated it and the display was indeed as broken as in
> the screenshots that Eiji posted in [BlueOnyx:15081].
> 
> I then examined the CODB object of one of the Users. In the GUI I had
> entered his name as this:
> 
> ベルタ
> 
> In CODB it looked like this:
> 
> 102 DATA fullName = "\245\331\245\353\245\277"
> 
> With the GUI now being in UTF-8 (even for Japanese) I saved this user
> again, after changing his "fullname" back to "ベルタ". It got stored in
> CODB as this:
> 
> 102 DATA fullName = "\343\203\231\343\203\253\343\202\277"
> 
> So we can assume this:
> 
> CODB data when submitted as EUC-JP:
> 102 DATA fullName = "\245\331\245\353\245\277"
> 
> Same CODB data when submitted as UTF-8:
> 102 DATA fullName = "\343\203\231\343\203\253\343\202\277"
> 
> To me it isn't entirely clear where, why or how CODB does the
> transformation. I don't understand the C code well enough. But when I
> look at the Perl client module CCE.pm
> (http://devel.blueonyx.it/trac/browser/BlueOnyx/utils/cce/client/perl/CCE.pm)
> it appears that the sub _escape does the encoding and the sub unescape
> does the decoding. We can assume that the Perl module does the same
> procedure as the PHP library of the same purpose.
> 
> If that's the case, then the encoded values appear to be stored in octal
> format.
> 
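To make the octal reading concrete, here is a small Python sketch that undoes
the escaping for the two values above. It is not the CCE.pm code: the function
name unescape_octal is made up, and it assumes that only backslash plus three
octal digits needs handling and that everything else is plain ASCII.

```python
import re


def unescape_octal(escaped):
    """Turn a string like r"\245\331" into the raw bytes it stands for."""
    def replace(match):
        # One backslash plus three octal digits stands for exactly one byte.
        return bytes([int(match.group(1), 8)])

    # Anything that is not an octal escape is passed through unchanged.
    return re.sub(rb"\\([0-3][0-7]{2})", replace, escaped.encode("ascii"))


old_value = unescape_octal(r"\245\331\245\353\245\277")
new_value = unescape_octal(r"\343\203\231\343\203\253\343\202\277")

old_text = old_value.decode("euc_jp")   # the value saved from the EUC-JP GUI
new_text = new_value.decode("utf-8")    # the value saved from the UTF-8 GUI
```

Both values decode to the same name, so each octal group is simply one byte
of the text in whatever charset the GUI submitted.
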
> As you can see, in UTF-8 the same Japanese text is also longer: 6 groups
> (bytes) for EUC-JP, 9 for UTF-8, so one and a half times as long.
> 
> That is explained by multibyte encoding. EUC-JP needs two bytes for each
> of these characters, while UTF-8 needs three.
> 
> I just did some math and it looks like this:
> 
> Char:	タ				(Katakana)
> =	\343\202\277 			(octal)
> =	E3 82 BF			(hex)
> =	11100011 10000010 10111111	(binary)
> 
> And that explains it. The Katakana character "タ" is U+30BF in Unicode,
> and E3 82 BF is its three-byte UTF-8 encoding:
> 
> http://www.eva.hi-ho.ne.jp/cgi-bin/user/zxcv/decodeUTF8.cgi?req=url&url=%E3%83%9F%E3%83%A4%E3%83%95%E3%82%B8%E3%83%AA%E3%83%A8%E3%82%A6%E3%82%BF
> 
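The arithmetic can be checked with a few lines of Python. This is only a
worked example: each octal group is taken as one byte, and the code point is
recovered by stripping the UTF-8 marker bits from the three bytes. The
variable names are made up for the illustration.

```python
octal_groups = ["343", "202", "277"]   # the three groups for this character
utf8_bytes = bytes(int(group, 8) for group in octal_groups)

hex_form = utf8_bytes.hex(" ").upper()
binary_form = " ".join(f"{byte:08b}" for byte in utf8_bytes)

# A three-byte UTF-8 sequence has the layout 1110xxxx 10xxxxxx 10xxxxxx.
# Masking off the marker bits and joining the x bits gives the code point.
lead, middle, tail = utf8_bytes
code_point = ((lead & 0x0F) << 12) | ((middle & 0x3F) << 6) | (tail & 0x3F)

character = chr(code_point)
```

Here code_point comes out as 0x30BF, which is 12479 in decimal.
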
> In the EUC-JP table "タ" = A5 BF (hex). In octal that is \245\277, which
> matches the last two groups of "\245\331\245\353\245\277".
> 
> See: http://fcd3.org/nihongo/euc-jp/index.html
> 
> In the Shift-JIS table "タ" = 83 5E (hex). The decimal value "12479" is
> the Unicode code point U+30BF, not a Shift-JIS value.
> 
> See:
> http://www.kreativekorp.com/charset/encoding.php?file=shift-jis.kte&char=835E
> 
> So the EUC-JP data does not appear to be stored in CODB as Shift-JIS. The
> bytes are plain EUC-JP, just octal-escaped.
> 
> Anyway: To fix this issue for the moment (pending a more thorough
> solution) I did this two-part update, which is now available via YUM:
> 
> 1.) sausalito-i18n-*:
> 
> The Class I18n.php got modified again. The function I18n::Utf8Encode()
> now checks if the input string is in EUC-JP. If so, it is converted to
> UTF-8. After that check the result is passed through I18n::detectUTF8(),
> which might (or might not) run the string through BXEncoding::toUTF8(),
> which fixes damaged UTF-8 text.
> 
> SVN: http://devel.blueonyx.it/trac/changeset/1397/BlueOnyx/5107R/i18n
> 
> 2.) base-user.mod:
> 
> The pages where we might run into these problems are just a few. Namely:
> The user-list, the page where users are edited, personal profile and
> personal email.
> 
> The affected input fields are:
> 
> - Username
> - Comments
> - Vacation message text
> 
> I modified the GUI pages for these to pass the above CODB data through
> I18n::Utf8Encode() for cleaning. If the text *was* EUC-JP, it will be
> shown as correct UTF-8. Upon saving it will be stored in CODB as UTF-8.
> On subsequent usage of the same pages no further EUC-JP to UTF-8
> transformation will be required, as the text is shown (and saved)
> correctly by then.
> 
> SVN: http://devel.blueonyx.it/trac/changeset/1400/BlueOnyx/ui/base-user.mod
> 
> There *might* be other fields in the GUI where this may also be needed,
> but right now I can't think of any.
> 
> So I think that might do it for now.
> 
> -- 
> With best regards
> 
> Michael Stauber