
[#4340] Problems with german umlaut
Summary Problems with german umlaut
Queue Jonah
Type Enhancement
State Resolved
Priority 1. Low
Owners
Requester s_gatterbauer (at) idlm (dot) net
Created 08/26/2006 (6907 days ago)
Due
Updated 09/11/2006 (6891 days ago)
Assigned
Resolved 09/11/2006 (6891 days ago)
Milestone
Patch No

History
09/11/2006 05:12:41 AM Chuck Hagenbuch Comment #11
State ⇒ Resolved
Committed, thanks.
09/05/2006 07:27:37 PM s_gatterbauer (at) idlm (dot) net Comment #10
What about a commit?
08/29/2006 03:42:51 PM s_gatterbauer (at) idlm (dot) net Comment #9
New Attachment: jonah_charset.patch
The attached patch changes lib/Jonah.php to look for the encoding value in 
the XML declaration and to set the charset from it if no charset is 
specified in the Content-Type header.
08/29/2006 08:43:47 AM s_gatterbauer (at) idlm (dot) net Comment #8
I love pattern matching. I will look into it this evening.
08/29/2006 08:40:27 AM Jan Schneider Comment #7
Why not match the complete <?xml tag against the complete feed content? 
Something like (untested):

/<\?xml[^>]+encoding=["']?([^"'\s?]+)[^?]*\?>/i
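
The suggestion above can be sketched as a small standalone helper. This is only an illustration and not the code from the attached patch: the function name sniff_xml_encoding is hypothetical, and the pattern used here escapes the question mark of the closing delimiter so that it matches the end of the declaration.

```php
<?php
// Hypothetical helper (not part of Jonah): return the lowercased value of the
// encoding pseudo-attribute in the XML declaration, or null if there is none.
function sniff_xml_encoding(string $body): ?string
{
    // Accepts double-quoted, single-quoted or unquoted values and requires
    // the declaration to be terminated properly.
    $pattern = '/<\?xml[^>]+encoding=["\']?([^"\'\s?]+)[^?]*\?>/i';
    if (preg_match($pattern, $body, $match)) {
        return strtolower($match[1]);
    }
    return null;
}

echo sniff_xml_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><rdf/>'), "\n"; // iso-8859-1
```

Matching against the whole body follows the comment above; a caller that wants to avoid false positives deeper in the feed could pass only the first few hundred bytes instead.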
08/29/2006 08:31:00 AM s_gatterbauer (at) idlm (dot) net Comment #6
This evening I will try something like this in lib/Jonah.php:

Look in the first 80 characters of the source file (which should contain 
the XML declaration, ordered version - encoding - standalone) for the 
string after encoding= (which should be the charset).

         if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
             $result['charset'] = $match[1];
+        } else {
+            // Offset of 'encoding=' within the first 80 bytes, or false.
+            $t_pos = strpos(substr($result['body'], 0, 80), 'encoding=');
+            if ($t_pos !== false) {
+                // Skip 'encoding=' (9 bytes) plus the opening quote.
+                $t_start = $t_pos + 10;
+                $t_stop  = strpos($result['body'], '"', $t_start);
+                if ($t_stop !== false) {
+                    $result['charset'] = strtolower(trim(substr($result['body'], $t_start, $t_stop - $t_start)));
+                }
+            }
         }

Not very inventive (I do not know PHP), but it should extract the right thing.

Yes, there is a problem with $t_stop if the encoding value is 
enclosed in single quotes (I will look into it).

I am not sure about preg_match, but the following should also work:

         if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
             $result['charset'] = $match[1];
+        } elseif (preg_match('/.*\s?encoding="?([^"]*)/', substr($result['body'], 1, 80), $match)) {
+            $result['charset'] = $match[1];
         }
08/29/2006 07:55:19 AM Jan Schneider Comment #5
No, this doesn't help much unfortunately because conversion is not the 
problem, but detecting the correct charset in the first place.
08/29/2006 03:46:39 AM Chuck Hagenbuch Comment #4
Remembered this link when thinking about this: 
http://weierophinney.net/matthew/archives/111-mbstring-comes-to-the-rescue.html

It might be relevant, but it might not be.
08/28/2006 08:15:44 PM s_gatterbauer (at) idlm (dot) net Comment #3
Thank you. I will try to read the first line of the XML file in 
lib/Jonah.php if no charset is given in the Content-Type header 
($result['charset'] is NULL) and set the charset to the "encoding=" 
value from within the XML file.

I am not familiar with PHP, so it will take a little while.
08/28/2006 11:18:39 AM Jan Schneider Comment #2
State ⇒ Feedback
The problem is that PHP's xml parser is not able to properly detect 
the feed's charset from the "encoding" parameter. *sigh*

Thus we have to rely on the charset being sent by the feed's web 
server in the Content-Type HTTP header. If no charset is sent, we fall 
back to UTF-8 which is happening here.

Ideas for a better solution are welcome.
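
The behaviour described here, taking the charset from the Content-Type header and falling back to UTF-8 when none is sent, might look roughly like the following. This is a sketch under assumptions rather than Jonah's actual code: the function name charset_from_content_type and its $default parameter are hypothetical.

```php
<?php
// Hypothetical helper (not part of Jonah): extract the charset parameter from
// a Content-Type header value, or return $default when it is absent.
function charset_from_content_type(?string $contentType, string $default = 'utf-8'): string
{
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $match)) {
        return strtolower($match[1]);
    }
    // No header, or a header without a charset parameter.
    return $default;
}

echo charset_from_content_type('text/xml; charset=ISO-8859-1'), "\n"; // iso-8859-1
echo charset_from_content_type('application/rss+xml'), "\n";          // utf-8
```

A feed served as ISO-8859-1 without a charset parameter takes the second path, which is the situation reported in this ticket.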
08/26/2006 08:50:18 AM s_gatterbauer (at) idlm (dot) net Comment #1
Priority ⇒ 1. Low
State ⇒ New
Queue ⇒ Jonah
Summary ⇒ Problems with german umlaut
Type ⇒ Enhancement
With the current HEAD, two of my news channels 
(http://www.wdr.de/xml/newsticker.rdf and 
http://www.frag-mutti.de/newsfeed/rss-de.xml) are no longer shown 
("No stories are currently available.").

The problem seems to be the ISO-8859-1 encoding: if I remove every 
special character, everything is fine.

This worked in HEAD until at least June (it did not look very nice, but 
the channels were available).

UTF-8 encoded files like http://rss.orf.at/oesterreich.xml are 
displayed fine.
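
For illustration, an umlaut encoded as ISO-8859-1 is a single byte that is not valid UTF-8, which is consistent with the observation that removing the special characters makes the feeds work again. The sketch below only checks byte validity; the helper name is_valid_utf8 is an assumption, and the check says nothing about how Jonah or PHP's XML parser handle the feed internally.

```php
<?php
// Hypothetical helper: true if $s is well-formed UTF-8.
// With the /u modifier, preg_match() returns false for an invalid UTF-8 subject.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$latin1 = "M\xFCnchen";      // u-umlaut as the single ISO-8859-1 byte 0xFC
$utf8   = "M\xC3\xBCnchen";  // the same word with the UTF-8 sequence C3 BC

var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
```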

