Summary | Problems with german umlaut |
Queue | Jonah |
Type | Enhancement |
State | Resolved |
Priority | 1. Low |
Owners | |
Requester | s_gatterbauer (at) idlm (dot) net |
Created | 08/26/2006 (6907 days ago) |
Due | |
Updated | 09/11/2006 (6891 days ago) |
Assigned | |
Resolved | 09/11/2006 (6891 days ago) |
Milestone | |
Patch | No |
State ⇒ Resolved
New Attachment: jonah_charset.patch
xml declaration and set charset if there is no charset specified in
Content-Type.
Something like (untested):
/<\?xml[^>]+encoding=["']?([^"'\s?]+)[^?]*\?>/i
looking in the first 80 characters of the source file (which should contain
the XML declaration in the order version, encoding, standalone)
for the string after encoding= (which should be the charset).
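A minimal sketch of that idea in PHP, for illustration only. The function name `xml_declared_encoding()` is hypothetical and not part of Jonah, and the pattern is a variation on the regular expression above rather than the code that was committed. It accepts both double and single quotes around the encoding value and inspects only the first 80 bytes of the body.

```php
<?php
// Hypothetical helper, not part of Jonah: return the encoding named in the
// XML declaration at the start of a feed body, or null if there is none.
function xml_declared_encoding(string $body): ?string
{
    // The declaration has to be the first thing in the document, so a
    // short prefix is enough.
    $head = substr($body, 0, 80);

    // (["\']) captures the opening quote and \1 requires the same closing
    // quote, so both encoding="..." and encoding='...' are accepted.
    $pattern = '/^\s*<\?xml\s[^>]*?encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/i';
    if (preg_match($pattern, $head, $match)) {
        return strtolower($match[2]);
    }
    return null;
}
```

The result is lowercased so that it can be compared against a charset taken from the Content-Type header without worrying about case.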
if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
    $result['charset'] = $match[1];
+ } else {
+     $t_start = strpos(substr($result['body'], 1, 60), 'encoding=') + 10;
+     if ($t_start) {
+         $t_stop = strpos(substr($result['body'], $t_start, 20), '"', $t_start);
+         $result['charset'] = strtolower(trim(substr($result['body'], $t_start, $t_stop - $t_start)));
+     }
}
not very inventive (I do not know php), but it should extract the right thing.
yes - there is a problem with $t_stop if the encoding value is
enclosed in single quotes (I will look into it).
I am not sure about preg_match, but the following should also work:
if (preg_match('/.*;\s?charset="?([^"]*)/', $content_type, $match)) {
    $result['charset'] = $match[1];
+ } elseif (preg_match('/.*\s?encoding="?([^"]*)/', substr($result['body'], 1, 80), $match)) {
+     $result['charset'] = $match[1];
}
problem, but detecting the correct charset in the first place.
http://weierophinney.net/matthew/archives/111-mbstring-comes-to-the-rescue.html
might be relevant but might not, too.
lib/Jonah.php if no charset is given in the Content-Type Header
($result['charset'] is NULL) and set the charset to the "encoding="
value from within the xml-file.
I am not familiar with PHP, so it will take a little while.
State ⇒ Feedback
the feed's charset from the "encoding" parameter. *sigh*
Thus we have to rely on the charset being sent by the feed's web
server in the Content-Type HTTP header. If no charset is sent, we fall
back to UTF-8, which is what is happening here.
Ideas for a better solution are welcome.
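The lookup order discussed in this ticket can be sketched as below: first the Content-Type header, then the encoding from the XML declaration (the step suggested in the follow-up comments), and finally the UTF-8 fallback. This is an illustration under assumptions, not the code in lib/Jonah.php. The function name `resolve_charset()` and both of its parameters are hypothetical.

```php
<?php
// Hypothetical helper, not part of Jonah: pick the charset for a feed.
// $contentType is the raw Content-Type header value, or null if none was sent.
// $declaredEncoding is the encoding from the XML declaration, or null.
function resolve_charset(?string $contentType, ?string $declaredEncoding): string
{
    // 1. A charset parameter in the Content-Type header takes precedence.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $match)) {
        return strtolower($match[1]);
    }

    // 2. Otherwise use the encoding named in the XML declaration, if any.
    if ($declaredEncoding !== null && $declaredEncoding !== '') {
        return strtolower($declaredEncoding);
    }

    // 3. Last resort: fall back to UTF-8.
    return 'utf-8';
}
```

With a header of `text/xml` and a declared encoding of `ISO-8859-1`, this sketch returns `iso-8859-1` instead of falling through to UTF-8.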
Priority ⇒ 1. Low
State ⇒ New
Queue ⇒ Jonah
Summary ⇒ Problems with german umlaut
Type ⇒ Enhancement
(http://www.wdr.de/xml/newsticker.rdf and
http://www.frag-mutti.de/newsfeed/rss-de.xml) are not shown anymore
("No stories are currently available.").
The problem seems to be the ISO-8859-1 encoding: if I remove all
special characters, everything is fine.
This worked in HEAD until at least June (it did not look very nice, but the
channels were available).
UTF-8 encoded files like http://rss.orf.at/oesterreich.xml are
displayed fine.
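A small illustration of why such feeds can break when their bytes are treated as UTF-8. It is a sketch only, assumes the iconv extension is available, and uses a made-up sample string rather than data from the feeds above.

```php
<?php
// The word "Grüße" encoded as ISO-8859-1: u-umlaut is the single byte 0xFC
// and sharp s is the single byte 0xDF.
$latin1 = "Gr\xFC\xDFe";

// preg_match() with the /u modifier returns false when the subject is not
// valid UTF-8, so this flag is false for the ISO-8859-1 bytes.
$validBefore = preg_match('//u', $latin1) === 1;

// Assumption: the iconv extension is loaded. Converting from the real
// source charset yields a valid UTF-8 string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
$validAfter = preg_match('//u', $utf8) === 1;

var_dump($validBefore, $validAfter);
```

Pure ASCII text is byte-for-byte identical in both encodings, which is consistent with the observation that the feeds work once the special characters are removed.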