
[#9187] compose html2text charset
Summary compose html2text charset
Queue IMP
Queue Version Git master
Type Bug
State Resolved
Priority 1. Low
Owners slusarz (at) horde (dot) org
Requester rsalmon (at) mbpgroup (dot) com
Created 08/19/2010 (5435 days ago)
Due
Updated 08/25/2010 (5429 days ago)
Assigned 08/19/2010 (5435 days ago)
Resolved 08/20/2010 (5434 days ago)
Github Issue Link
Github Pull Request
Milestone
Patch Yes

History
08/25/2010 06:59:07 PM Michael Slusarz Comment #16
> further testing with the euro character :
And all of this is expected, since, for example, the Euro character 
doesn't exist in ISO-8859-1.  See:
http://en.wikipedia.org/wiki/Windows-1252

As Jan mentioned, it looks like something (Outlook?) is attempting to 
pass off windows-1252 as ISO-8859-1.  So there is nothing technically 
wrong with what we are doing (i.e. there is no bug).

That being said... it may be useful to somehow catch iso-8859-1 text 
that looks like windows-1252 and convert as such.  Moving to Ticket 
#9201.
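
The idea of catching iso-8859-1 text that looks like windows-1252 could be sketched roughly as follows. This is an illustration in Python, not Horde code; the function name `decode_mislabeled_latin1` and the "any byte in 0x80-0x9F" heuristic are assumptions made for the example, not something taken from Ticket #9201.

```python
def decode_mislabeled_latin1(raw: bytes, declared: str = "iso-8859-1") -> str:
    """Decode bytes, treating ISO-8859-1 as windows-1252 when C1 bytes appear.

    Hypothetical helper for illustration only.
    """
    normalized = declared.lower().replace("_", "-")
    if normalized in ("iso-8859-1", "latin1", "latin-1"):
        # Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but printable
        # characters (euro sign, curly quotes, ...) in windows-1252.
        if any(0x80 <= b <= 0x9F for b in raw):
            # A few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in
            # Python's cp1252 codec; errors="replace" maps them to U+FFFD.
            return raw.decode("cp1252", errors="replace")
    return raw.decode(declared, errors="replace")


# Byte 0x92 is what Outlook/Word emits for the curly apostrophe.
fixed_quote = decode_mislabeled_latin1(b"d\x92ao\xfbt")
fixed_euro = decode_mislabeled_latin1(b"azerty \x80")
```

Text that contains no bytes in the 0x80-0x9F range decodes identically under both charsets, so the heuristic only changes the result for input that would otherwise produce control characters.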
08/25/2010 02:01:26 PM rsalmon (at) mbpgroup (dot) com Comment #15
further testing with the euro character :

compose new text message (setting charset to UTF-8)
- set subject and body to "azerty €"
-> send and open
   - subject ok
   - body ok : source is "azerty =E2=82=AC", but the euro sign is 
displayed just fine in FF.
   -> reply (compose_html enabled)
     - subject ok
     - body displayed ok
     -> click html2text
       - subject ok
       - body *nok* (euro sign: ?)
   -> reply (compose_html disabled)
     - subject ok
     - body displayed ok

compose new text message (setting charset to ISO-8859-15)
- set subject and body to "azerty €"
-> send and open
   - subject ok
   - body ok
   -> reply (compose_html enabled). Charset has automatically switched 
to UTF-8.
     - subject ok
     - body displayed ok
     -> click html2text
       - subject ok
       - body *nok* (euro sign: ?)
   -> reply (compose_html disabled). Charset has automatically 
switched to UTF-8.
     - subject ok
     - body displayed ok


compose new HTML message (setting charset to UTF-8)
- set subject and body to "azerty €"
-> send and open
   - subject ok
   - body text part *nok* : azerty ?
   - body html part *nok* : displayed azerty ?, but source is "azerty 
=E2=82=AC"

compose new HTML message (setting charset to ISO-8859-15)
- set subject and body to "azerty €"
-> send and open
   - subject ok
   - body text part *nok* : euro sign converted to 'EUR'
   - body html part *nok* : source is "azerty =A4" but "EUR" is 
displayed in FF.
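
The raw sources quoted above ("=E2=82=AC" and "=A4") are quoted-printable byte sequences, and the charset label alone decides which character they stand for. A small Python sketch, using only the standard library and a hypothetical helper name `decode_qp`, shows the three readings:

```python
import quopri


def decode_qp(source: str, charset: str) -> str:
    """Undo quoted-printable, then interpret the bytes in the given charset."""
    raw = quopri.decodestring(source.encode("ascii"))
    return raw.decode(charset)


# UTF-8 encodes the euro sign as the three bytes E2 82 AC.
utf8_body = decode_qp("azerty =E2=82=AC", "utf-8")

# ISO-8859-15 puts the euro sign at byte A4 ...
latin9_body = decode_qp("azerty =A4", "iso-8859-15")

# ... while ISO-8859-1 has the generic currency sign at the same byte.
latin1_body = decode_qp("azerty =A4", "iso-8859-1")
```

If a part is sent as ISO-8859-15 but later read as ISO-8859-1, the euro sign silently turns into another character, which is one way a message can look fine in source and wrong on screen.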



08/25/2010 01:59:46 PM rsalmon (at) mbpgroup (dot) com Comment #14
New Attachment: email[1].eml

The attached email was created using Outlook, which uses Word to create 
emails. So if this happens for the single quote, I guess it will happen 
for other characters too.

Another example of a character that doesn't survive conversion 
between charsets: € (euro sign).
See the new attached message.
- open the email: € is transformed into "EUR"!
- reply to the email (pref compose_html enabled): the euro sign is now a 
question mark.
- click on html to text: all the text is gone.






08/20/2010 08:08:32 PM Michael Slusarz Comment #13
State ⇒ Resolved
> It's also part of windows-1252, my guess is that someone copied this 
> text from MS-Word or anything similar.
In that case... depending on the number of charset conversions, this 
character may display correctly but there can be no guarantee.  As 
suspected, there is nothing left to do in this ticket.
08/20/2010 08:00:56 PM Jan Schneider Comment #12
It's also part of windows-1252, my guess is that someone copied this text 
from MS-Word or anything similar.
08/20/2010 07:57:12 PM Jan Schneider Comment #11
Correct, 0092 is part of the so called extended ascii which is not 
really a charset, let alone 8859-1.
08/20/2010 06:11:46 PM Michael Slusarz Comment #10
> These changes fix things for me.
> If I reply to the message attached to this ticket and switch from 
> html to text,
> I get : "préparer à vendre d?août ;"
> I expect : "préparer à vendre d’août ;"
>
> The single quote becomes a '?' !
Nope - that's actually correct.  It appears that converting from 
ISO-8859-1 -> UTF-8 -> ISO-8859-1 loses that character.  Don't know if 
that's a PHP bug or an issue with Horde_String, but there is nothing 
theoretically wrong with that conversion code.

Do note - that weird quote character (it is NOT the standard single 
quote character from US-ASCII) doesn't display in ANY of the messages 
I receive.  It always appears as bytecode [0092] on my FF screen, for 
example.
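
Why the character cannot survive a trip back to ISO-8859-1 can be shown with a few lines of Python. This is an illustration only and says nothing about what PHP or Horde_String do internally; it assumes the quote reaches the converter as U+2019, as the POST data in comment #9 suggests.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "weird quote"

# windows-1252 has a slot for it: the single byte 0x92.
cp1252_bytes = quote.encode("cp1252")

# Read as ISO-8859-1, that same byte is the invisible C1 control U+0092,
# which matches the [0092] placeholder seen on screen.
as_latin1 = cp1252_bytes.decode("iso-8859-1")

# ISO-8859-1 has no slot for U+2019 at all, so a lossy conversion has to
# substitute something; Python's "replace" handler uses "?".
lossy = "d\u2019ao\u00fbt".encode("iso-8859-1", errors="replace")
```

Under that assumption the "?" is the expected result of a correct conversion into a charset that lacks the character, rather than a defect in the conversion code.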
08/20/2010 07:29:27 AM rsalmon (at) mbpgroup (dot) com Comment #9
> These changes fix things for me.
If I reply to the message attached to this ticket and switch from 
html to text,
I get : "préparer à vendre d?août ;"
I expect : "préparer à vendre d’août ;"

The single quote becomes a '?' !
- here is the JSON response :

/*-secure-{"response":{"text":"ronan@maison.com a 
\u00e9crit\u00a0:\n\n> Bonjour,\n>\n> \u00a0\n>\n> pr\u00e9parer 
\u00e0 vendre d?ao\u00fbt\u00a0;\n"}}*/

- here is the POST
text        <p>  ronan@maison.com a écrit&nbsp;:</p> <blockquote 
style="background-color: rgb(240, 240, 240); border-left: 1px solid 
blue; padding-left: 1em;" type="cite">  <div class="Section1">  <p 
class="MsoNormal">  <font face="Arial" size="3"><span 
style="font-size: 12pt; font-family: 
Arial;">Bonjour,</span></font></p>  <p class="MsoNormal">  <font 
face="Arial" size="3"><span style="font-size: 12pt; font-family: 
Arial;">&nbsp;</span></font></p>  <p class="MsoNormal">  <font 
face="Arial" size="3"><span style="font-size: 12pt; font-family: 
Arial;">préparer à vendre d’août&nbsp;;</span></font></p>  </div> 
</blockquote>
08/20/2010 07:24:04 AM rsalmon (at) mbpgroup (dot) com Comment #8
> These changes fix things for me.
If I reply to the message attached to this ticket and switch from html 
to text,
I get : "préparer à vendre d?août ;"
I expect : "préparer à vendre d’août ;"

The single quote becomes a '?' !



08/19/2010 06:54:17 PM Michael Slusarz Comment #7
These changes fix things for me.
08/19/2010 06:54:00 PM Git Commit Comment #6
Changes have been made in Git for this ticket:

Bug #9187: Fix charset issues when doing Html2text compose conversion.

http://git.horde.org/diff.php/imp/lib/Ui/Compose.php?rt=horde-git&r1=b371414ef2533f1b57355c545afc8b4901c76bfb&r2=cd03906a381a67d4c1c67972e047a875d77eac9d
08/19/2010 05:09:20 PM Michael Slusarz Comment #4
Assigned to Michael Slusarz
State ⇒ Feedback
> When clicking on "Switch to plain text composition", I get no text.
Fixed.
> I have a second issue when switching from HTML to text composition: 
> accents get screwed up. It appears to come from DOMDocument not being 
> able to properly detect the charset.
>
> The following fix does the job for me (inspired by the "User 
> Contributed Notes" at 
> http://www.php.net/manual/en/domdocument.loadhtml.php)
This won't work.  It relies on mb_convert_encoding(), which may not be 
available (it is not required for Horde).

Try my patch.  It has been working with the XSS filter for a bit now 
and seems to do the right thing.
08/19/2010 09:49:12 AM rsalmon (at) mbpgroup (dot) com Comment #1
Priority ⇒ 1. Low
State ⇒ Unconfirmed
New Attachment: email.eml
Patch ⇒ Yes
Milestone ⇒
Queue ⇒ IMP
Summary ⇒ compose html2text charset
Type ⇒ Bug
$_prefs['compose_html']['value'] => 1
$mime_drivers['html']['inline'] => true,
php-5.3.2
Mozilla/5.0 (X11; U; Linux i686; fr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8


When clicking on "Switch to plain text composition", I get no text.

Fix :
imp/lib/Ui/Compose.php:384
--            return $msg . "\n" . $sig;
++            return $data . "\n" . $sig;


I have a second issue when switching from HTML to text composition: 
accents get screwed up. It appears to come from DOMDocument not being able 
to properly detect the charset.

The following fix does the job for me (inspired by the "User Contributed 
Notes" at http://www.php.net/manual/en/domdocument.loadhtml.php)

--- Html2text.php        2010-07-27 10:20:23.000000000 +0200
+++ /var/www/html/horde/libs/Horde/Text/Filter/Html2text.php        2010-08-19 12:38:34.000000000 +0200
@@ -102,16 +102,22 @@
     public function postProcess($text)
     {
         if (extension_loaded('dom')) {
-            $text = Horde_String::convertCharset($text, $this->_params['charset'], 'UTF-8');
+            if ($this->_params['charset'] != 'UTF-8') {
+                $text = Horde_String::convertCharset($text, $this->_params['charset'], 'UTF-8');
+            }

             $old_error = libxml_use_internal_errors(true);
             $doc = new DOMDocument();
-            $doc->loadHTML('<?xml encoding="UTF-8">' . $text);
+            $doc->loadHTML('<?xml encoding="UTF-8">' . mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8"));
             if ($old_error) {
                 libxml_use_internal_errors(false);
             }

-            $text = Horde_String::convertCharset($this->_node($doc, $doc), 'UTF-8', $this->_params['charset']);
+            if ($this->_params['charset'] != 'UTF-8') {
+                $text = Horde_String::convertCharset($this->_node($doc, $doc), 'UTF-8', $this->_params['charset']);
+            } else {
+                $text = $this->_node($doc, $doc);
+            }
         }
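
The entity trick in the proposed patch works by turning every non-ASCII character into a character reference, so the HTML parser never has to guess a charset. A rough Python equivalent of that idea is sketched below. It is not the PHP call itself: `to_ascii_with_char_refs` is a hypothetical name, and Python emits numeric references where PHP's 'HTML-ENTITIES' conversion may emit named ones.

```python
import html


def to_ascii_with_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


original = "<p>pr\u00e9parer \u00e0 vendre d\u2019ao\u00fbt</p>"
escaped = to_ascii_with_char_refs(original)

# Any HTML-aware consumer can restore the characters without a charset hint.
restored = html.unescape(escaped)
```

Because the escaped string is pure ASCII, it reads the same under UTF-8, ISO-8859-1 and windows-1252, which is why this approach sidesteps charset detection entirely.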
