Tickets :: [#4340] Problems with german umlaut

Summary	Problems with german umlaut
Queue	Jonah
Type	Enhancement
State	Resolved
Priority	1. Low
Owners
Requester	s_gatterbauer (at) idlm (dot) net
Created	08/26/2006 (6907 days ago)
Due
Updated	09/11/2006 (6891 days ago)
Assigned
Resolved	09/11/2006 (6891 days ago)
Milestone
Patch	No

09/11/2006 05:12:41 AM	Chuck Hagenbuch	Comment #11 State ⇒ Resolved
Committed, thanks.

09/05/2006 07:27:37 PM	s_gatterbauer (at) idlm (dot) net	Comment #10
What about a commit?

08/29/2006 03:42:51 PM	s_gatterbauer (at) idlm (dot) net	Comment #9 New Attachment: jonah_charset.patch
The attached patch changes lib/Jonah.php to look for the encoding value in the XML declaration and to set the charset from it if no charset is specified in the Content-Type header.
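
The lookup order described in this comment can be sketched roughly as follows. This is not the attached patch: the function name detectFeedCharset(), its parameters, and the lowercase 'utf-8' default are assumptions made for illustration only.

```php
<?php
// Hypothetical sketch of the lookup order; not the code from jonah_charset.patch.
function detectFeedCharset(?string $contentType, string $body, string $default = 'utf-8'): string
{
    // 1. Prefer an explicit charset parameter in the Content-Type header.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }

    // 2. Otherwise look for encoding="..." in the XML declaration at the
    //    start of the feed body (single or double quotes).
    if (preg_match('/^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return strtolower($m[1]);
    }

    // 3. Nothing found: fall back to the default.
    return $default;
}

$feed = '<?xml version="1.0" encoding="ISO-8859-1"?><rss/>';
echo detectFeedCharset('text/xml', $feed), "\n";                // iso-8859-1
echo detectFeedCharset('text/xml; charset=UTF-8', $feed), "\n"; // utf-8
echo detectFeedCharset(null, '<rss/>'), "\n";                   // utf-8
```

The header is checked first so that a charset sent by the web server still wins, which matches the behaviour described earlier in the ticket.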

08/29/2006 08:43:47 AM	s_gatterbauer (at) idlm (dot) net	Comment #8
I love pattern matching. I will look into it this evening.

08/29/2006 08:40:27 AM	Jan Schneider	Comment #7
Why not match the complete <?xml declaration against the complete feed content? Something like (untested): /<\?xml[^>]+encoding=["']?([^"'\s?]+)[^?]*\?>/i
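
A minimal sketch of the pattern-matching idea, assuming a hypothetical helper named encodingFromXmlDeclaration() and an arbitrary 200-byte search window (neither comes from the ticket). It anchors the match at the start of the document, allows an optional UTF-8 byte order mark, and accepts either quote style:

```php
<?php
// Hypothetical helper; the name and the 200-byte window are assumptions.
function encodingFromXmlDeclaration(string $body): ?string
{
    // The XML declaration, if present, sits at the very start of the
    // document, so only a short prefix needs to be inspected.
    $head = substr($body, 0, 200);

    // Optional UTF-8 BOM, then <?xml ... encoding="name" (or 'name').
    // \1 requires the closing quote to match the opening one.
    $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';

    if (preg_match($pattern, $head, $match)) {
        return $match[2];
    }
    return null; // no declaration, or no encoding attribute
}

var_dump(encodingFromXmlDeclaration('<?xml version="1.0" encoding="ISO-8859-1"?><rdf/>')); // string(10) "ISO-8859-1"
var_dump(encodingFromXmlDeclaration("<?xml version='1.0' encoding='utf-8'?><rss/>"));     // string(5) "utf-8"
var_dump(encodingFromXmlDeclaration('<?xml version="1.0"?><rss/>'));                      // NULL
```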

08/29/2006 08:31:00 AM	s_gatterbauer (at) idlm (dot) net	Comment #6
This evening I will try something like this in lib/Jonah.php: look in the first 80 characters of the source file (which should contain the XML declaration, ordered version, encoding, standalone) for the string after encoding=, which should be the charset.

    if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
        $result['charset'] = $match[1];
    + } else {
    +     $head = substr($result['body'], 0, 80);
    +     $t_start = strpos($head, 'encoding=');
    +     if ($t_start !== false) {
    +         $t_start += 10; // skip past encoding=" (10 characters)
    +         $t_stop = strpos($head, '"', $t_start);
    +         $result['charset'] = strtolower(trim(substr($head, $t_start, $t_stop - $t_start)));
    +     }
    }

Not very inventive (I do not know PHP), but it should extract the right thing. Yes, there is a problem with $t_stop if the encoding value is enclosed in single quotes (I will look into it).

I am not sure about preg_match, but the following should also work:

    if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
        $result['charset'] = $match[1];
    + } elseif (preg_match('/.*\s?encoding="?([^"]*)/', substr($result['body'], 0, 80), $match)) {
    +     $result['charset'] = $match[1];
    }

08/29/2006 07:55:19 AM	Jan Schneider	Comment #5
No, unfortunately this doesn't help much: the problem is not conversion, but detecting the correct charset in the first place.

08/29/2006 03:46:39 AM	Chuck Hagenbuch	Comment #4
Remembered this link when thinking about this: http://weierophinney.net/matthew/archives/111-mbstring-comes-to-the-rescue.html. It might be relevant, but it might not.

08/28/2006 08:15:44 PM	s_gatterbauer (at) idlm (dot) net	Comment #3
Thank you. I will try to read the first line of the XML file in lib/Jonah.php if no charset is given in the Content-Type header ($result['charset'] is NULL) and set the charset to the "encoding=" value from within the XML file. I am not familiar with PHP, so it will take a little while.

08/28/2006 11:18:39 AM	Jan Schneider	Comment #2 State ⇒ Feedback
The problem is that PHP's XML parser is not able to properly detect the feed's charset from the "encoding" parameter. *sigh* Thus we have to rely on the charset sent by the feed's web server in the Content-Type HTTP header. If no charset is sent, we fall back to UTF-8, which is what is happening here. Ideas for a better solution are welcome.
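
To illustrate why the UTF-8 fallback breaks on these feeds, here is a small sketch with made-up sample strings (not taken from the feeds in question) and a hypothetical helper isValidUtf8(). An umlaut encoded as ISO-8859-1 is a single byte that does not form a valid UTF-8 sequence, so text labelled as UTF-8 but actually in ISO-8859-1 is malformed as soon as it contains one:

```php
<?php
// Illustrative only: the helper name and sample strings are assumptions.
function isValidUtf8(string $bytes): bool
{
    // With the /u modifier PCRE rejects subjects that are not valid UTF-8,
    // in which case preg_match() returns false rather than 0 or 1.
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "M\xFCller";      // "Müller" in ISO-8859-1: ü is the single byte 0xFC
$utf8   = "M\xC3\xBCller";  // "Müller" in UTF-8: ü is the two bytes 0xC3 0xBC
$ascii  = "Mueller";        // pure ASCII is valid in both encodings

var_dump(isValidUtf8($latin1)); // bool(false)
var_dump(isValidUtf8($utf8));   // bool(true)
var_dump(isValidUtf8($ascii));  // bool(true)
```

This is consistent with the original report that the feeds work once every special character is removed.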

08/26/2006 08:50:18 AM	s_gatterbauer (at) idlm (dot) net	Comment #1 Priority ⇒ 1. Low State ⇒ New Queue ⇒ Jonah Summary ⇒ Problems with german umlaut Type ⇒ Enhancement
With the current HEAD, two of my news channels (http://www.wdr.de/xml/newsticker.rdf and http://www.frag-mutti.de/newsfeed/rss-de.xml) are no longer shown ("No stories are currently available."). The problem seems to be the ISO-8859-1 encoding: if I remove every special character, everything is fine. This worked in HEAD until at least June (it did not look very nice, but the channels were available). UTF-8 encoded files like http://rss.orf.at/oesterreich.xml are displayed fine.