Summary | Check for ISO-8859-1/Windows-1252 improper charset labeling |
Queue | IMP |
Queue Version | Git master |
Type | Enhancement |
State | Resolved |
Priority | 1. Low |
Owners | Horde Developers (at) , jan (at) horde (dot) org, slusarz (at) horde (dot) org |
Requester | slusarz (at) horde (dot) org |
Created | 08/25/2010 (5430 days ago) |
Due | |
Updated | 10/21/2010 (5373 days ago) |
Assigned | |
Resolved | 10/21/2010 (5373 days ago) |
Milestone | 5 |
Patch | No |
State ⇒ Resolved
Ticket #9201: part might not exist
http://git.horde.org/diff.php/imp/lib/Contents.php?rt=horde-git&r1=74cc881c526c261d9acfc7ccfbaf3a4e7009141e&r2=fad70e02a52d0c9d80172c6c61b28f8765856e48
Ticket #9201: Treat ISO-8859-1 as windows-1252
of mailers do a similar thing. I think that scanning for the unused
8859-1 codepoints is too much overhead.
Ticket #9201: Treat ISO-8859-1 as windows-1252
http://git.horde.org/diff.php/imp/lib/Contents.php?rt=horde-git&r1=a2e63c9945413bc8d0487ded4a6f505ad6d20386&r2=74cc881c526c261d9acfc7ccfbaf3a4e7009141e
tried upgrading PHP?
I tried so many things that I'm not sure what did it for me, but my
guess is the update of php from 5.3.2 to 5.3.3
Can this ticket be related to php bug #50661 ?
I now run php-5.3.3-1.el5.remi on CentOS release 5.4
encoding/loading. It seems to be working perfectly on my system - but
that could be because I am using en_US.UTF-8. It might not be
working properly on, e.g., de or fr locales.
I would suggest playing around with charsets in Horde_Domhtml
(located in the horde/Util package).
Horde_Domhtml is called once. The initial loadHTML() call fails as
the encoding is not auto-determined. It then moves into the forced
loadHTML() call after converting to UTF-8. The charset passed into
the constructor is UTF-8.
Pseudocode:
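// Called with $charset = 'UTF-8'.
public function __construct($text, $charset = null)
{
    $doc = new DOMDocument();
    // First attempt: let libxml determine the encoding on its own.
    $doc->loadHTML($text);
    // $doc->encoding is empty here, so auto-detection failed.
    $this->encoding = $doc->encoding;
    if (!is_null($charset)) {
        if (!$doc->encoding) {
            // Forced reload: convert to UTF-8 and declare it explicitly.
            $doc->loadHTML('<?xml encoding="UTF-8">' .
                Horde_String::convertCharset($text, $charset, 'UTF-8'));
            $this->encoding = 'UTF-8';
        }
    }
}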
1) the plain text is double encoded, i.e. the euro sign is turned
into =C3=A2=C2=82=C2=AC while it's still correct in the html part
(=E2=82=AC)
2) even though it's correct in the mail part, it's not displayed
correctly, i.e. as "â¬"
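The byte sequences quoted above are consistent with the UTF-8 bytes of the euro sign being read as ISO-8859-1 and then converted to UTF-8 a second time. A minimal Python sketch of that arithmetic follows; it is not Horde code, and the helper name `as_qp` is made up for this illustration:

```python
# Sketch: reproduce the "double encoded" euro sign from the report.
euro = "\u20ac"

# Correct UTF-8 encoding of the euro sign: E2 82 AC.
correct = euro.encode("utf-8")

# Mistake: treat those UTF-8 bytes as ISO-8859-1 text and convert to UTF-8 again.
double = correct.decode("iso-8859-1").encode("utf-8")


def as_qp(raw: bytes) -> str:
    """Render every byte in quoted-printable style (hypothetical helper)."""
    return "".join(f"={b:02X}" for b in raw)


print(as_qp(correct))  # =E2=82=AC
print(as_qp(double))   # =C3=A2=C2=82=C2=AC
```

If the intermediate step had used windows-1252 rather than ISO-8859-1, byte 0x82 would map to U+201A and the result would be a different sequence, so the observed bytes suggest an ISO-8859-1 conversion somewhere in the plain-text code path.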
dimp/HTML compose:
----
Subject: Test
Content-Type: multipart/alternative; boundary="=_GSd1CTVMdqosdDl3SYlomXUo"
MIME-Version: 1.0
This message is in MIME format.
--=_GSd1CTVMdqosdDl3SYlomXUo
Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes
Content-Description: Plaintext Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Euro Character: =E2=82=AC =E2=82=AC
--=_GSd1CTVMdqosdDl3SYlomXUo
Content-Type: text/html; charset=UTF-8
Content-Description: HTML Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns=3D"http://www.w3.org/1999/xhtml">
<head>
<!--a75c305b1c0a6022--><title></title>
</head>
<body style=3D"font-family:Arial;font-size:14px">
<p>Euro Character: <span style=3D"font-size: 14px;">=E2=82=AC
=E2=82=AC</span></p>
</body>
</html>
--=_GSd1CTVMdqosdDl3SYlomXUo--
following string: "azerty €"
the HTML part looks OK and renders just fine in Outlook and
Thunderbird, but not in IMP. I get this with IMP/FF: "azerty â¬"
where the text is being mangled on your system.
that could just be fortune that we are not doing any ISO-8859-1
conversions in the codepaths I have tested.
Using IMP:
Compose a new text message and set charset to UTF-8. Send the
following string: "azerty €"
The received message looks fine in IMP, Thunderbird and Outlook.
Compose a new HTML message and set charset to UTF-8. Send the
following string: "azerty €"
The HTML part looks OK and renders just fine in Outlook and
Thunderbird, but not in IMP. I get this with IMP/FF: "azerty â¬"
Looking at the source of the message, the encoded string in the text
part doesn't look the same as in the HTML part:
--=_MgggdEx2InJNFCIGJXUlY0k0
Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes
Content-Description: Plaintext Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
azerty =C3=A2=C2=82=C2=AC
--=_MgggdEx2InJNFCIGJXUlY0k0
Content-Type: text/html; charset=UTF-8
Content-Description: HTML Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns=3D"http://www.w3.org/1999/xhtml">
<head>
<!--a75c305b1c0a6022--><title></title>
</head>
<body style=3D"font-family:Arial;font-size:14px">
<p>azerty =E2=82=AC<br /></p>
</body>
</html>
--=_MgggdEx2InJNFCIGJXUlY0k0--
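To compare the two parts of the message above, one can decode the quoted-printable payloads and look at the raw bytes. The following Python sketch uses only the standard library; `undo_double_utf8` is a hypothetical helper that assumes exactly one extra round of ISO-8859-1 to UTF-8 conversion, which is an inference from the bytes and not something the ticket confirms:

```python
import quopri

# Payload lines copied from the message source above.
text_part = quopri.decodestring(b"azerty =C3=A2=C2=82=C2=AC")
html_part = quopri.decodestring(b"<p>azerty =E2=82=AC<br /></p>")


def undo_double_utf8(raw: bytes) -> str:
    """Reverse one round of 'UTF-8 read as ISO-8859-1, re-encoded as UTF-8'.

    Assumption: the damage is exactly one such round; other kinds of
    damage will raise UnicodeError or give wrong text.
    """
    return raw.decode("utf-8").encode("iso-8859-1").decode("utf-8")


print(html_part.decode("utf-8"))     # HTML part is already valid UTF-8
print(undo_double_utf8(text_part))   # text part needs one round undone
```

Both lines print the euro sign correctly, which supports the reading that only the plain-text part went through an extra conversion when the message was composed.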
State ⇒ Feedback
Ticket #9201: Better to convert things to UTF-8, to prevent lossy conversion.
http://git.horde.org/diff.php/framework/Support/lib/Horde/Support/Domhtml.php?rt=horde-git&r1=6149c84e973f3fb5c61760834c148bae4cbf04b8&r2=699d059d4fa0faeed9273862ce3e19474bb8fd2d
Assigned to Michael Slusarz
Priority ⇒ 1. Low
State ⇒ Accepted
Patch ⇒ No
Milestone ⇒ 5
Summary ⇒ Check for ISO-8859-1/Windows-1252 improper charset labeling
Type ⇒ Enhancement
Queue ⇒ IMP
Not sure if this is something we should do in Horde_String or in IMP.
Possible ideas:
1. Always treat ISO-8859-1 data as windows-1252
2. Look at X-mailer (or equivalent) and if it looks like Outlook, do #1.
3. For ISO-8859-1 parts, check for 0x80 to 0x9F characters and, if
found, change charset representation to windows-1252
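Ideas 1 and 3 can be sketched in a few lines. The Python below is only an illustration of the two strategies, not the change that was committed to IMP; the function names, the `scan` flag and the list of label spellings are assumptions made for this example:

```python
# Label spellings treated as ISO-8859-1 here (illustrative, not exhaustive).
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin1"}


def effective_charset(data: bytes, declared: str, scan: bool = False) -> str:
    """Pick the charset to decode a MIME part with.

    scan=False: idea 1, always treat ISO-8859-1 as windows-1252.
    scan=True:  idea 3, switch only if a byte in 0x80-0x9F is present.
    """
    if declared.lower() not in LATIN1_LABELS:
        return declared
    if not scan:
        return "windows-1252"
    # 0x80-0x9F are C1 control codes in ISO-8859-1 but printable
    # characters (euro sign, curly quotes, ...) in windows-1252.
    has_c1 = any(0x80 <= b <= 0x9F for b in data)
    return "windows-1252" if has_c1 else declared


def decode_part(data: bytes, declared: str, scan: bool = False) -> str:
    # windows-1252 leaves a few bytes (e.g. 0x81) undefined; replace
    # them instead of raising.
    return data.decode(effective_charset(data, declared, scan), errors="replace")


print(decode_part(b"price: \x80 5", "ISO-8859-1"))
```

Since windows-1252 agrees with ISO-8859-1 on every byte outside 0x80 to 0x9F, idea 1 gives the same text as idea 3 for well-labeled parts and avoids the per-message scan.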