6.0.0-alpha14
6/29/25

[#9617] db_migrate and incorrect charset handling
Summary db_migrate and incorrect charset handling
Queue Horde Framework Packages
Queue Version Git master
Type Bug
State Resolved
Priority 1. Low
Owners jan (at) horde (dot) org, mrubinsk (at) horde (dot) org
Requester leena.heino (at) uta (dot) fi
Created 03/02/2011 (5233 days ago)
Due
Updated 04/06/2011 (5198 days ago)
Assigned 04/04/2011 (5200 days ago)
Resolved 04/06/2011 (5198 days ago)
Github Issue Link
Github Pull Request
Milestone
Patch No

History
04/06/2011 06:38:52 PM Michael Rubinsky Comment #25
State ⇒ Resolved
Reply to this comment
Original problem is fixed, resolving. Testing problems like this is 
impossible due to SQLite issues, but nothing we can do about that.
04/05/2011 02:45:48 PM Chuck Hagenbuch Comment #24 Reply to this comment
Well, sqlite isn't going to scale for a full Horde installation in the 
real world anyway, so while it's a pain, I'm not sure how much of a 
problem it is...
04/04/2011 05:09:53 PM Jan Schneider Comment #23 Reply to this comment
I don't see any short-term solution for that. We MUST expect the 
database to do case-insensitive searches, it's completely insane that 
it doesn't work by default with SQLite. I'm surprised it didn't break 
anything else yet. This makes SQLite pretty useless for any real-world 
usage of Horde.
04/04/2011 05:07:45 PM Git Commit Comment #22 Reply to this comment
Changes have been made in Git for this ticket:

Fix case-insensitive filtering of duplicate tags (Bug #9617).
This simplifies the _checkTags() method a lot too. Unfortunately it
doesn't work at all with SQLite, so unit tests are rather useless.

  3 files changed, 19 insertions(+), 26 deletions(-)
http://git.horde.org/horde-git/-/commit/a90c671771adbbb1aa08576a2b9d13e011ca6790
04/04/2011 05:07:41 PM Git Commit Comment #21 Reply to this comment
Changes have been made in Git for this ticket:

Add failing test for bug #9617.

  1 files changed, 5 insertions(+), 4 deletions(-)
http://git.horde.org/horde-git/-/commit/55802691eafbb6931b93e8c96ee8d2d4fd5b441b
04/04/2011 04:55:18 PM Jan Schneider Comment #20 Reply to this comment
http://www.sqlite.org/faq.html#q18
This makes unit testing this stuff a PITA.
04/04/2011 04:46:58 PM Jan Schneider Comment #19 Reply to this comment
Okay, but this opens a complete new can of worms. The "SELECT ... 
WHERE tag_name IN (...)" won't work in this case, because it is case 
sensitive.
The correct solution would be to delegate the lowercasing to the 
database, but at least for SQLite this doesn't seem to work. "SELECT 
LOWER('TYÖ')" returns "tyÖ" there. It works fine in MySQL though.
04/04/2011 03:37:38 PM leena (dot) heino (at) uta (dot) fi Comment #18 Reply to this comment
Nevermind, I found one. Just to get this straight, the idea is that 
tags TYÖ and työ are considered equal, right?
That is right.
04/04/2011 03:30:26 PM Jan Schneider Comment #17 Reply to this comment
Nevermind, I found one. Just to get this straight, the idea is that 
tags TYÖ and työ are considered equal, right?
04/04/2011 03:15:37 PM Jan Schneider Comment #16
State ⇒ Feedback
Reply to this comment
The test suite runs fine though. Can you provide a patch to 
TaggerTest.php that demonstrates the broken behavior?
04/04/2011 02:33:13 PM Jan Schneider Comment #15
State ⇒ Assigned
Assigned to Jan Schneider
Reply to this comment

[Show Quoted Text - 9 lines]
Which is exactly what we want.
But if we look at php's source code for strtoupper() it works by 
bytes, therefore it will not work correctly with UTF-8 encoded 
strings that contain non ascii characters.
So the manual is plain wrong.

[Show Quoted Text - 18 lines]
Thanks for tracking this down so deep.
04/04/2011 02:24:39 PM leena (dot) heino (at) uta (dot) fi Comment #14 Reply to this comment
PHP's manual suggest that one should not assume that 
strtolower()/strtoupper() work correctly with
multibyte charset like utf-8.
Where does it say that? I don't see any such suggestions in the man pages.
It does not it say it in so many words or at least says it 
ambiguously: "Note that 'alphabetic' is determined by the current 
locale"

But if we look at php's source code for strtoupper() it works by 
bytes, therefore it will not work correctly with UTF-8 encoded strings 
that contain non ascii characters.

Excerpt from ext/standard/string.c:
char *php_strtoupper(char *s, size_t len)
{
         unsigned char *c, *e;

         c = (unsigned char *)s;
         e = (unsigned char *)c+len;

         while (c < e) {
                 *c = toupper(*c);
                 c++;
         }
         return s;
}

The non ascii characters in UTF-8 are multi byte. Therefore using 
php's strtoupper()/strtolower() will not work correctly with UTF-8 
encoded strings with non ascii characters.

03/07/2011 09:19:30 PM Jan Schneider Comment #13 Reply to this comment
PHP's manual suggest that one should not assume that 
strtolower()/strtoupper() work correctly with multibyte charset like 
utf-8.
Where does it say that? I don't see any such suggestions in the man pages.
03/07/2011 09:07:44 PM leena (dot) heino (at) uta (dot) fi Comment #12 Reply to this comment
The original problem still remains. PHP's manual suggest that one 
should not assume that strtolower()/strtoupper() work correctly with 
multibyte charset like utf-8.

Should the code use mb_strtoupper()/mb_strtolower() or Horde::String 
instead of strtolower()/strtoupper()?
03/07/2011 09:02:50 PM leena (dot) heino (at) uta (dot) fi Comment #11 Reply to this comment
Changes have been made in Git for this ticket:

Bug #9617: Fix property name.

  1 files changed, 2 insertions(+), 2 deletions(-)
http://git.horde.org/horde-git/-/commit/7d484c517ddde6a6818845ea1b33be3f20c36c89
03/07/2011 04:23:21 PM Git Commit Comment #10 Reply to this comment
Changes have been made in Git for this ticket:

Bug #9617: Fix property name.

  1 files changed, 2 insertions(+), 2 deletions(-)
http://git.horde.org/horde-git/-/commit/7d484c517ddde6a6818845ea1b33be3f20c36c89
03/05/2011 07:49:14 AM Michael Rubinsky State ⇒ Resolved
 
03/05/2011 07:45:59 AM Git Commit Comment #9 Reply to this comment
Changes have been made in Git for this ticket:

Fix charset handling in tagger
Bug: 9617

  3 files changed, 26 insertions(+), 8 deletions(-)
http://git.horde.org/horde-git/-/commit/5ac680c13a93b32df97fc5bb3c29cb3c4b8e4cbb
03/04/2011 09:52:55 PM Michael Rubinsky Comment #8
Summary ⇒ db_migrate and incorrect charset handling
Reply to this comment
Attached patch seemed to fix it for me.
It should be sufficient to convert from the db charset to utf-8 here. 
However, there was the missing conversion from the database to utf-8 
in the migration script that probably made the extra conversion in 
your patch necessary. This has been fixed.
Another bug has appeared. Output from debug log:
DEBUG: SQL SELECT user_id, user_name FROM `rampage_users` WHERE 
user_name IN ('ntllt')
DEBUG: SQL QUERY FAILED: Duplicate entry 'ntllt' for key 
rampage_users_user_name' INSERT INTO `rampage_users` (user_name) 
VALUES ('ntllt')
Yeah, looks like there is a bunch of charset conversions missing in 
content. Working on it, and updated the title of the ticket to reflect 
the actual problem.
03/04/2011 09:49:52 PM Git Commit Comment #7 Reply to this comment
Changes have been made in Git for this ticket:

Need to convert to utf-8 when reading the category before tagging.
Bug: 9617

  1 files changed, 2 insertions(+), 2 deletions(-)
http://git.horde.org/horde-git/-/commit/3baf0b99425470dfdd77e02de1da4f32bf4851ff
03/04/2011 09:15:24 PM leema (dot) heino (at) uta (dot) fi Comment #6
New Attachment: Tagger.php.patch Download
Reply to this comment
Changes have been made in Git for this ticket:

Need to convert from database's charset before comparing
Bug: 9617

  1 files changed, 1 insertions(+), 1 deletions(-)
http://git.horde.org/horde-git/-/commit/0f36ff69b64c96e3ab91d6f479fa34cb15a451a9
Attached patch seemed to fix it for me.

Another bug has appeared. Output from debug log:
DEBUG: SQL SELECT user_id, user_name FROM `rampage_users` WHERE 
user_name IN ('ntllt')
DEBUG: SQL QUERY FAILED: Duplicate entry 'ntllt' for key 
rampage_users_user_name' INSERT INTO `rampage_users` (user_name) 
VALUES ('ntllt')

03/04/2011 07:36:48 PM Git Commit Comment #5 Reply to this comment
Changes have been made in Git for this ticket:

Need to convert from database's charset before comparing
Bug: 9617

  1 files changed, 1 insertions(+), 1 deletions(-)
http://git.horde.org/horde-git/-/commit/0f36ff69b64c96e3ab91d6f479fa34cb15a451a9
03/04/2011 02:57:31 PM leena (dot) heino (at) uta (dot) fi Comment #4 Reply to this comment
Can you try what I just committed?
I tried, but script fails with the same error.
03/04/2011 02:38:29 PM Michael Rubinsky Comment #3
Assigned to Michael Rubinsky
State ⇒ Feedback
Reply to this comment
Can you try what I just committed?
03/04/2011 02:37:52 PM Git Commit Comment #2 Reply to this comment
Changes have been made in Git for this ticket:

Use Horde_String::lower
Possibly fixes Bug: 9617

  1 files changed, 1 insertions(+), 1 deletions(-)
http://git.horde.org/horde-git/-/commit/ae5714cbb917c43e184f942db0e1b5f3197f679f
03/02/2011 12:02:40 PM leena (dot) heino (at) uta (dot) fi Comment #1
Priority ⇒ 1. Low
State ⇒ Unconfirmed
Patch ⇒ No
Milestone ⇒
Summary ⇒ db_migrate and duplicate tags in rampage
Type ⇒ Bug
Queue ⇒ Horde Framework Packages
Reply to this comment
I been testing data migration from framework3 to h4 using db_migrate.

Either it is mysql which is case insentive or it is the migration 
script, but it seems as if you cannot add tags to rampage_tags if tags 
differ only by their case.

Eg. these tags are consider the same:
TYÖ
työ

If those tags exists in old data then db_migrate will fail with error:
QUERY FAILED: Duplicate entry 'työ' for key 'rampage_tags_tag_name'
INSERT INTO `rampage_tags` (tag_name) VALUES ('työ')

Saved Queries