[BlueOnyx:25088] Re: 5210R i18n Bug??

Michael Stauber mstauber at blueonyx.it
Thu Sep 9 01:38:56 -05 2021


Hi Sasaki,

> I'll keep you posted on what I find out.

I think I found the issue. And much quicker than I thought.

The GUI page UserMod.php uses the function getFullName() for the
"fullName" field and the function getTextBlock() for the "description"
field.

To get a string that contains the "fullName" we use this code:

$factory->getFullName("fullNameField", $User["fullName"]);

Where $User["fullName"] is the data from CODB and it already *should* be
back in the format in which it was originally stored. So in our case: Kanji.

For the "description" we use this to get the description into a string:

$factory->getTextBlock("userDescField", Utf8Encode($User["description"]));

The actual "description" from CODB is in $User["description"], and for
additional safety we pass it through a function called Utf8Encode(),
which we don't do for the "fullName".

But even passing "fullName" through Utf8Encode() as well didn't solve
the issue.
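
For clarity, that test simply mirrored the description call, i.e.
something like this (a quick test only, not the actual fix):

$factory->getFullName("fullNameField", Utf8Encode($User["fullName"]));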

So I looked at the difference between the getFullName() and the
getTextBlock() functions and quickly identified the issue:

getFullName() runs the text of fields with the read/write attribute
(editable GUI fields) through an additional htmlspecialchars() filter.

And getTextBlock() doesn't.

If I remove the additional htmlspecialchars() filter from getFullName(),
the text shows up.

In fact: If I run htmlspecialchars() on a string that contains Kanji,
then the entire string is wiped clean.
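
To show what I mean by "wiped clean", here's a stripped-down example
(the Kanji is just sample data, not the actual CODB content): if the
bytes of the string aren't valid in the character set that
htmlspecialchars() assumes, the whole string comes back empty.

$utf8  = "山田太郎";                                      // sample Kanji, valid UTF-8
$eucjp = mb_convert_encoding($utf8, 'EUC-JP', 'UTF-8');  // same text as EUC-JP bytes

// The EUC-JP bytes are not valid UTF-8, so everything is discarded:
var_dump(htmlspecialchars($eucjp, ENT_COMPAT, 'UTF-8'));  // string(0) ""

// Declaring the character set that matches the bytes keeps the text:
var_dump(htmlspecialchars($eucjp, ENT_COMPAT, 'EUC-JP')); // Kanji intact, HTML-safe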

This is where it gets a bit complicated, and I sometimes have trouble
remembering it myself: the translation texts are stored in UTF-8. In
CODB, text is stored in UTF-8 *or* (if it contains umlauts, accents or
Kanji) in octal.

On read access, anything but Japanese is restored directly to UTF-8,
whereas Japanese is first restored to EUC-JP and then converted to UTF-8.
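
Very roughly, the round trip looks like this (a simplified sketch of the
idea, not the actual CODB code):

// On write, non-ASCII bytes end up as octal escape sequences ...
$stored = preg_replace_callback('/[\x80-\xFF]/', function ($m) {
    return sprintf('\\%03o', ord($m[0]));
}, $raw);

// ... and on read the escapes are turned back into raw bytes:
$bytes = preg_replace_callback('/\\\\([0-7]{3})/', function ($m) {
    return chr(octdec($m[1]));
}, $stored);

// For Japanese those restored bytes are EUC-JP, so there is one extra step:
$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'EUC-JP');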

According to https://www.php.net/manual/en/function.htmlspecialchars.php
the function htmlspecialchars() supports a wide range of character sets
(including EUC-JP), so I did some more tests.

In FormFieldBuilder.php, in the function makeTextField() - which is used
by getFullName() - this is how and where we use htmlspecialchars() for
fields of this type that have the read/write flag:

      case "rw":
        // HTML safe
        $value = htmlspecialchars($value);
      break;

If I change the code to this ...

      case "rw":
        // HTML safe
        $value = htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
      break;

... the Japanese text will show correctly in the "fullName" field. \o/

So that's the issue that causes *this* particular problem:
htmlspecialchars() needs to be told it's working with UTF-8 when it's
handling strings containing Kanji. Otherwise it just goes off the rails
and discards the data.

I realize there are other places in the code where we use
htmlspecialchars() without specifying UTF-8 as the intended character
set, so other form fields in the GUI could have the same issue with
Japanese text.

For that reason I'll do an extensive code review to find all relevant
parts of the code where htmlspecialchars() is used and will adapt them
accordingly.
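
A small wrapper along these lines would be one way to keep the character
set declared in a single place (just a sketch, not necessarily what the
updates will ship):

// Possible helper: always be explicit about UTF-8.
function html_safe($value)
{
    return htmlspecialchars((string) $value, ENT_COMPAT, 'UTF-8');
}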

I'll publish YUM updates that fix this issue within the next 1-2 days
and will let you know once they are available.

Again: Many thanks for reporting this issue!

-- 
With best regards

Michael Stauber


