[#9201] Check for ISO-8859-1/Windows-1252 improper charset labeling
Summary Check for ISO-8859-1/Windows-1252 improper charset labeling
Queue IMP
Queue Version Git master
Type Enhancement
State Resolved
Priority 1. Low
Owners Horde Developers (at) , jan (at) horde (dot) org, slusarz (at) horde (dot) org
Requester slusarz (at) horde (dot) org
Created 08/25/2010 (5430 days ago)
Due
Updated 10/21/2010 (5373 days ago)
Assigned
Resolved 10/21/2010 (5373 days ago)
Milestone 5
Patch No

History
10/21/2010 06:09:00 AM Michael Slusarz Comment #19
State ⇒ Resolved
Things seem to be working well with this fix.  Resolving ticket.
08/29/2010 12:53:34 AM Michael Slusarz Comment #17
Changes have been made in Git for this ticket:

Ticket #9201: Treat ISO-8859-1 as windows-1252
I've gone ahead and committed this. After a bit of research, it looks 
like a bunch of mailers do a similar thing.  I think that scanning for 
the unused 8859-1 code points is too much overhead.
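A minimal sketch of the committed behavior (decoding anything labeled ISO-8859-1 as windows-1252), written in Python for illustration only. The helper name decode_part and the set of label aliases are assumptions, not Horde code:

```python
# Labels treated as ISO-8859-1 for this sketch (an assumed, non-exhaustive set).
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin1", "latin-1"}


def decode_part(raw: bytes, charset: str) -> str:
    """Decode a message part, treating ISO-8859-1 labels as windows-1252."""
    if charset.strip().lower() in LATIN1_LABELS:
        # windows-1252 matches ISO-8859-1 outside 0x80-0x9F, so correctly
        # labeled text decodes the same way. errors="replace" covers the
        # five bytes that Python's cp1252 codec leaves undefined.
        return raw.decode("cp1252", errors="replace")
    return raw.decode(charset, errors="replace")


# 0x80 is the euro sign in windows-1252 but a C1 control code in ISO-8859-1.
print(decode_part(b"azerty \x80", "ISO-8859-1"))
```

No scan of the body is needed, which is the overhead argument made above; the trade-off is that genuine C1 control codes in ISO-8859-1 data are reinterpreted as printable characters.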
08/29/2010 12:48:22 AM Michael Slusarz Comment #15
> FWIW I sent the message with IMP, not DIMP.
> No difference for me - still works fine.
> Everything is fine now.
Jan - looks like you are the only one still seeing this.  Have you 
tried upgrading PHP?
08/27/2010 12:38:28 PM rsalmon (at) mbpgroup (dot) com Comment #14
> FWIW I sent the message with IMP, not DIMP.
> No difference for me - still works fine.
Everything is fine now.

I tried so many things that I'm not sure what fixed it for me, but my 
guess is the update of PHP from 5.3.2 to 5.3.3.
Could this ticket be related to PHP bug #50661?

I now run php-5.3.3-1.el5.remi on CentOS release 5.4.
08/26/2010 08:43:41 PM Michael Slusarz Comment #13
> FWIW I sent the message with IMP, not DIMP.
No difference for me - still works fine.
08/26/2010 08:31:38 PM Jan Schneider Comment #12
FWIW I sent the message with IMP, not DIMP.
08/26/2010 06:08:47 PM Michael Slusarz Comment #11
> My guess is that there is something weird going on with the DOM 
> encoding/loading.  It seems to be working perfectly on my system - but 
> that could be because I am using en_US.UTF-8.  It might not be 
> working properly on, e.g., de or fr locales.
>
> I would suggest playing around with charsets in Horde_Domhtml 
> (located in the horde/Util package).
For reference... when I view the HTML part in a new window, 
Horde_Domhtml is called once.  The initial loadHTML() call fails as 
the encoding is not auto-determined.  It then moves into the forced 
loadHTML() call after converting to UTF-8.  The charset passed into 
the constructor is UTF-8.

Pseudocode:

// Called as: new Horde_Domhtml($text, 'UTF-8')
public function __construct($text, $charset = null)
{
    $doc = new DOMDocument();
    $doc->loadHTML($text);

    // $doc->encoding is empty: the encoding was not auto-determined
    $this->encoding = $doc->encoding;

    if (!is_null($charset)) {
        if (!$doc->encoding) {
            // Forced reload after converting the text to UTF-8
            $doc->loadHTML('<?xml encoding="UTF-8">' . Horde_String::convertCharset($text, $charset, 'UTF-8'));
            $this->encoding = 'UTF-8';
        }
    }
}
08/26/2010 06:03:10 PM Michael Slusarz Comment #10
My guess is that there is something weird going on with the DOM 
encoding/loading.  It seems to be working perfectly on my system - but 
that could be because I am using en_US.UTF-8.  It might not be working 
properly on, e.g., de or fr locales.

I would suggest playing around with charsets in Horde_Domhtml (located 
in the horde/Util package).
08/26/2010 05:56:17 PM Michael Slusarz Comment #9
> I see exactly the same behavior. And there are actually two errors.
> 1) the plain text is double encoded, i.e. the euro sign is turned 
> into =C3=A2=C2=82=C2=AC while it's still correct in the html part 
> (=E2=82=AC)
> 2) even though it's correct in the mail part, it's not displayed 
> correctly, i.e. as "€"
Still works for me.  Here's what my test message looks like.  Sent via 
dimp/HTML compose:

----

Subject: Test
Content-Type: multipart/alternative; boundary="=_GSd1CTVMdqosdDl3SYlomXUo"
MIME-Version: 1.0

This message is in MIME format.

--=_GSd1CTVMdqosdDl3SYlomXUo
Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes
Content-Description: Plaintext Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Euro Character: =E2=82=AC =E2=82=AC

--=_GSd1CTVMdqosdDl3SYlomXUo
Content-Type: text/html; charset=UTF-8
Content-Description: HTML Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns=3D"http://www.w3.org/1999/xhtml">
<head>
<!--a75c305b1c0a6022--><title></title>
</head>
<body style=3D"font-family:Arial;font-size:14px">
<p>Euro Character: <span style=3D"font-size: 14px;">=E2=82=AC 
=E2=82=AC</span></p>
</body>
</html>
--=_GSd1CTVMdqosdDl3SYlomXUo--
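One way to check whether a stored message is intact is to decode each part outside the webmail client and see whether the euro sign survives. The sketch below is a hedged illustration using Python's standard email package. The shortened message in RAW, the boundary "=_b" and the helper name decoded_parts are made up for the example.

```python
from email import message_from_string

# Shortened, hypothetical stand-in for a test message like the one above.
RAW = (
    "Subject: Test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="=_b"\n'
    "\n"
    "--=_b\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "Euro Character: =E2=82=AC\n"
    "--=_b\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "<p>Euro Character: =E2=82=AC</p>\n"
    "--=_b--\n"
)


def decoded_parts(raw: str) -> dict:
    """Map each leaf part's content type to its decoded text."""
    result = {}
    for part in message_from_string(raw).walk():
        if part.is_multipart():
            continue
        # decode=True undoes the quoted-printable transfer encoding.
        body = part.get_payload(decode=True)
        charset = part.get_content_charset() or "us-ascii"
        result[part.get_content_type()] = body.decode(charset)
    return result


for content_type, text in decoded_parts(RAW).items():
    print(content_type, "->", text.strip())
```

If both parts decode to a real euro sign here but the client still shows garbage, the mangling happens at display time rather than in the stored message.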
08/26/2010 04:53:10 PM Jan Schneider Comment #8
I see exactly the same behavior. And there are actually two errors.
1) the plain text is double encoded, i.e. the euro sign is turned into 
=C3=A2=C2=82=C2=AC while it's still correct in the html part (=E2=82=AC)
2) even though it's correct in the mail part, it's not displayed 
correctly, i.e. as "€"
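The double-encoded bytes reported in 1) are what results when the UTF-8 bytes of the euro sign are misread as ISO-8859-1 and then encoded to UTF-8 a second time. The Python sketch below only reproduces the byte sequences; where in the stack such a conversion happens is not established in this ticket, and the helper qp is a made-up name.

```python
def qp(data: bytes) -> str:
    """Show every octet in quoted-printable style (=XX)."""
    return "".join(f"={b:02X}" for b in data)


euro_utf8 = "\u20ac".encode("utf-8")

# Misread the UTF-8 bytes as ISO-8859-1, then encode as UTF-8 again.
double_encoded = euro_utf8.decode("iso-8859-1").encode("utf-8")

print(qp(euro_utf8))       # =E2=82=AC
print(qp(double_encoded))  # =C3=A2=C2=82=C2=AC

# The damage is reversible as long as nothing else touched the bytes.
repaired = double_encoded.decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(repaired == "\u20ac")  # True
```

A windows-1252 misreading would map 0x82 to U+201A and give different bytes, so the sequence in the report points at ISO-8859-1 specifically.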
08/26/2010 04:19:57 PM Michael Slusarz Comment #7
> Compose a new HTML message and set the charset to UTF-8. Send the 
> following string: "azerty €"
> The HTML part looks OK and renders just fine in Outlook and 
> Thunderbird, but not in IMP. I get this with IMP/FF: "azerty €"
> Actually, no accents are rendered correctly at all.
Nope - works perfectly here.  You are going to have to trace down 
where the text is being mangled on your system.
08/26/2010 07:52:54 AM rsalmon (at) mbpgroup (dot) com Comment #6
Compose a new HTML message and set the charset to UTF-8. Send the 
following string: "azerty €"
The HTML part looks OK and renders just fine in Outlook and 
Thunderbird, but not in IMP. I get this with IMP/FF: "azerty €"
Actually, no accents are rendered correctly at all.
08/26/2010 07:38:01 AM rsalmon (at) mbpgroup (dot) com Comment #5
> This change *seems* to fix the conversion issues for me.  Although 
> that could just be luck, since we may not be doing any ISO-8859-1 
> conversions in the code paths I have tested.
I still have problems with the euro sign.

Using IMP:
Compose a new text message and set the charset to UTF-8. Send the 
following string: "azerty €"
The received message looks fine in IMP, Thunderbird and Outlook.

Compose a new HTML message and set the charset to UTF-8. Send the 
following string: "azerty €"
The HTML part looks OK and renders just fine in Outlook and 
Thunderbird, but not in IMP. I get this with IMP/FF: "azerty €"

Looking at the source of the message, the encoded string in the text 
part differs from the one in the HTML part:

--=_MgggdEx2InJNFCIGJXUlY0k0
Content-Type: text/plain; charset=UTF-8; format=flowed; DelSp=Yes
Content-Description: Plaintext Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

azerty =C3=A2=C2=82=C2=AC


--=_MgggdEx2InJNFCIGJXUlY0k0
Content-Type: text/html; charset=UTF-8
Content-Description: HTML Version
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns=3D"http://www.w3.org/1999/xhtml">
<head>
<!--a75c305b1c0a6022--><title></title>
</head>
<body style=3D"font-family:Arial;font-size:14px">
<p>azerty =E2=82=AC<br /></p>
</body>
</html>
--=_MgggdEx2InJNFCIGJXUlY0k0--



08/25/2010 07:39:15 PM Michael Slusarz Comment #4
State ⇒ Feedback
This change *seems* to fix the conversion issues for me.  Although 
that could just be luck, since we may not be doing any ISO-8859-1 
conversions in the code paths I have tested.
08/25/2010 07:36:58 PM Git Commit Comment #3
08/25/2010 07:34:34 PM Jan Schneider Comment #2
> 3. For ISO-8859-1 parts, check for 0x80 to 0x9F characters and, if 
> found, change charset representation to windows-1252
This sounds like the most stable solution to me.
08/25/2010 07:01:40 PM Michael Slusarz Assigned to Jan Schneider
Assigned to Michael Slusarz
Assigned to Horde Developers
08/25/2010 06:58:45 PM Michael Slusarz Comment #1
Priority ⇒ 1. Low
State ⇒ Accepted
Patch ⇒ No
Milestone ⇒ 5
Summary ⇒ Check for ISO-8859-1/Windows-1252 improper charset labeling
Type ⇒ Enhancement
Queue ⇒ IMP
Placing in IMP queue for now.

Not sure if this is something we should do in Horde_String or in IMP.   
Possible ideas:
1. Always treat ISO-8859-1 data as windows-1252
2. Look at the X-Mailer header (or equivalent) and, if it looks like Outlook, do #1.
3. For ISO-8859-1 parts, check for 0x80 to 0x9F characters and, if 
found, change charset representation to windows-1252
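Idea 3 can be sketched as a scan for bytes in the 0x80 to 0x9F range, which ISO-8859-1 assigns only to rarely used C1 control codes. This is a Python illustration under assumptions, not Horde code: the function name effective_charset is hypothetical, and only the exact label "iso-8859-1" is checked.

```python
import re

# Bytes that are C1 control codes in ISO-8859-1 but mostly printable
# characters (euro sign, curly quotes, dashes) in windows-1252.
C1_RANGE = re.compile(rb"[\x80-\x9f]")


def effective_charset(raw: bytes, charset: str) -> str:
    """Return the charset to decode with, relabeling suspicious parts."""
    if charset.strip().lower() == "iso-8859-1" and C1_RANGE.search(raw):
        return "windows-1252"
    return charset


print(effective_charset(b"price: \x80 5", "ISO-8859-1"))  # windows-1252
print(effective_charset(b"caf\xe9", "ISO-8859-1"))        # ISO-8859-1
```

Compared with idea 1, this reads every ISO-8859-1 part once more before decoding, which is the extra cost of relabeling only when needed.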