Support

Admin Tools

#33976 to Change DB Collation - BUG

Posted in ‘Admin Tools for Joomla! 4 & 5’
This is a public ticket

Everybody will be able to see its contents. Do not include usernames, passwords or any other sensitive information.

Environment Information

Joomla! version
n/a
PHP version
n/a
Admin Tools version
n/a

Latest post by nicholas on Friday, 06 November 2020 01:47 CST

joomleb

Hi guys,
- on a DB set as  latin1_swedish_ci https://photos.app.goo.gl/P54qPKZg8QBQNbLe8 
- looking into the tables, some are set as  latin1_swedish_ci  / utf8_general_ci / utf8mb4_unicode_ci / utf8mb4_unicode_ci

First - I change the entire DB collation through phpMyAdmin, following the "Changing the default collation for a database" guide, by selecting:
utf8mb4_unicode_ci + "Change all tables collations" + "Change all tables columns collations"

Second - Admin Tools > Change Database Collation > Choose a collation > UTF-8 Multibyte recommended = as you told me it should mean  utf8mb4_unicode_ci > Apply
The Tables and the DB are changed to utf8mb4_unicode_ci and NOT to utf8mb4_unicode_ci

1 - Please, Am I missing anything or it is a BUG ?

Anyway, I tested it (first and second steps):
- from latin1_swedish_ci to utf8mb4_unicode_ci
- or from latin1_swedish_ci to utf8_generali_ci and then from utf8_general_ci to utf8mb4_unicode_ci (by choosing in second step "Choose a collation > Custom > utf8mb4_unicode_ci" )

But 15 of all the 356 Tables are not moved and are always as utf8_general_ci
2 - Please, Am I missing anything or it is a BUG ? Do you have any workaround for it ?

 

Many Thanks in advance for support

 

nicholas
Akeeba Staff
Manager

Please read the documentation first. Do not make assumptions.

1. No. It is, indeed, utf8mb4_unicode_ci. That's the collation Joomla 3 is using for UTF8 Multibyte collation. Per the documenation:

Using the Change Database Collation change your database's collation to UTF-8 Multibyte. This uses the utf8mb4_unicode_ci collation which tells MySQL to add support for 4-byte characters. On top of that, the System - Admin Tools plugin tells Joomla!'s database driver to add support for 4-byte characters.

Note that the information about the plugin does not apply in your case since Joomla 3.9 is already doing that by itself. This bit referred to old versions of Joomla which did not have the multibyte support out of the box.

2. No, it is NOT a bug. It is a MySQL limitation and bad implementation from the developer of the extension which owns the table. The technical explanation is in the accepted solution of this StackExchange page: https://stackoverflow.com/questions/15157227/mysql-varchar-index-length

MySQL has a limit of 767 bytes for each index. Many extensions use a VARCHAR(255) column which is also indexed without a character limit. This essentially tells MySQL to create an index with the maximum amount of data the column can hold. Plain old UTF-8 is up to 3 bytes per character and the column is set to use 255 characters therefore the index size is 3x255 = 765 bytes which is less than the 767 byte limit.

If this column were to be converted to UTF8MB4 each character can take up to 4 bytes (that's what MB4 means: MutiByte, up to 4 bytes per character). Therefore the index size would need to be 4*255 = 1020 bytes. However this is bigger than MySQL's 767 byte limit for indices. Therefore the table cannot be converted to UTF8MB4.

There are two solutions BUT they must be implemented by the developer of the affected extension:

  1. Set the column to VARCHAR(191). The index in this case is 4*191 = 764 bytes which is under MySQL's limit. This is perfectly fine for most practical cases. Very few title fields need to go that big, even in languages which use a massive amount of characters per word (e.g. German and Greek)
  2. Keep the columns as VARCHAR(255) but set a character limit to the index. This is the recommended solution for performance reasons but its applicability depends on the data shape. For example, a developer may choose to create an index on the column but limit it to 64 characters. This will only be a problem if you expect to have multiple rows with their first 64 characters in common. 

Since the second case is very difficult to determine when creating a mass distributed extension which can be used in contexts the developer can't even imagine the typical solution preferred is converting the column to VARCHAR(191).

Admin Tools WILL deliberately NOT do this conversion itself. Reducing the size of a column without knowing the context this column is used in could end up with data loss or break a third party extension. Therefore we choose to play it safe and simply not try to convert the problem table.

You should really get in contact with the developer of the affected extensions. Joomla has added UTF8MB4 support nearly three years ago. They should already have addressed it.

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

joomleb

Hi Nicholas, 
to clear explication, many thanks to your informations I'm been able to run the right searches. Just to close the circle:

A - So, right now for "Joomla", Please, What is the best solution in your opinion, utf8mb4_unicode_ci (UTF-8 Multibyte) OR utf8mb4_unicode_ci ?

B - I yet contacted the developers to ask the "fixes".

C - Joomla 4 and utf8_unicode_520_ci / utf8mb4_unicode_ci - Do you know if they are talking about , if there will be any news regarding Collations ?

D - I'd like to know your opinion, this guy conclude: "If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL" - What is your opinion about ?

nicholas
Akeeba Staff
Manager

A. None is "best". The differences are very subtle and have to do with how ordering works when your text contains characters with diacritics. For example, u with umlaut sorts separately from u using *_unicode_ci collations but together with u using *_general_ci. For non-Latin alphabets, like Greek, there is no difference. Please read this too: https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci

B. OK

C. See the link in my answer to A. This basically has to do with the fact that a character can be expressed in different ways in Unicode. U with umlaut can be represented as a single, two byte, precomposed character OR as two separate characters, u and the combining umlaut character. For most European, Latin-based languages both forms will be considered as the same character for sorting purposes under all *_unicode_* collations. When you have languages like traditional Chinese, Kanji Japanese etc the old _unicode_ci may consider different ways to write the same character as different characters whereas the newer ones (*_unicode_520_ci, *_0900_ai_ci) won't. The numbers have to do with the Unicode version standard supported.

THIS IS A FRINGE CASE WHICH IS UNLIKELY TO APPLY TO YOU. We are talking about support for Traditional Chinese and very uncommon languages' scripts. The sites you create most likely use European languages based on the Latin, Greek or Cyrillic alphabets and probably text in Arabic, Thai and the main Indian dialects (the names of which unfortunately I cannot remember — my friends from India will be pissed at me). These are all well supported even with the old *_unicode_ci collation. As for the difference between _generic_ci and _unicode_ci it only matters if you want to treat European, Latin-based characters with diacritics as separate lexicographical entities.

So, whatever you use it won't really be a real issue for you. Definitely not a display issue.

D. Ah, I see. Failed developer became journalist and offers his views, which didn't cut it in the real world, as a gospel. Makes perfect sense. Idiots gonna be idiots.

MySQL was quite upfront that they only support three byte UTF-8 and they had explained the technical reasoning (which made perfect sense in the early 00s when the fastest server processor had the same performance envelope as a $50 Android crap phone you can buy at Walmart nowadays). A decade ago they introduced the 4-byte UTF-8 (UTF8MB4) specifically to address the proper 4-byte UTF-8 standard. The reason it got traction with CMS (used predominantly by people NOT using 4-byte text) was publicly to support 4-byte characters like Emoji but the real reason was security. I have explained this in my original PR to Joomla regarding UTF8MB4 support and has to do with how MySQL would try to fail gracefully when encountering 4 byte Unicode characters which could be used to trigger PHP session serialization attacks. 

In the end of the day MySQL is a perfectly fine database if the developers using it understand how it works and how to use it. Conversely, PostgreSQL will bite you in the ass if you are a developer who does not understand how it works.

I would like to add that major services like Twitter and Facebook have been using MySQL, even though nowadays in a heavily modified form.

Furthermore, MySQL does have a number of useful extensions e.g. a NoSQL extension and a JSON parsing extension. I am not aware of these existing for PostgreSQL but I will be the first to admit that I might be wrong because I do not know PostgreSQL anywhere near as well as I know MySQL.

Finally, MySQL is very convenient for exporting whole databases including structure and data. This is awesome for mass distributed CMS which need to be deployed by mere mortals without the assistance of an IT team and dedicated deployment instrumentation. PostgreSQL assumes that you are on a more traditional environment which employs a dedicated database administrator team and your deployments will take place using an IT team and dedicated instrumentation. You can easily export data but not structure. While you can introspect the database (e.g. using the INFORMATION_SCHEMA views) what you export WILL NEVER BE an absolutely accurate reflection of your database structure. That's a major reason why I stopped supporting PostgreSQL in Akeeba Backup. We could export the database but restoring it was NEVER resulting in the exact same structure you had when you backed up and this couldn't be fixed. Nobody has solved that issue because that issue does not exist where PostgreSQL is normally used.

So, if you want to use a mass distributed CMS and be able to move your site between servers without having to pay a dozens-to-hundreds of thousands of dollars per month IT team you should be using MySQL.

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

joomleb

Hi Nicholas,

Even today I learned something new, thanks!

A - Yes, I didn't explain myself well, sorry. And, yes, I had read that article. My question was always intended to know your opinion from a security point of view and in a generic situation (the need to work with many languages, included Chinese) > Because I have always found your reasoning very logical and documented (this help me to learn). I love logic and knowledge.

Again, really Many Thanks

nicholas
Akeeba Staff
Manager

From a security perspective, using any UTF8MB4 collation is adequate.

The attack is a bit complicated to explain but fairly easy to pull off. An attacker sends a specially crafted request parameter which will end up in the database as part of the serialised PHP session. Because of how MySQL reacts to characters it cannot map (4 byte Unicode characters trying to be stored in a 3 byte UTF-8 string) the specially crafted message causes PHP to deserialise a somewhat different, malicious serialised string which can lead to remote code execution. That was the idea behind the Christmas 2015 zero-day attack that affected Joomla, WordPress, Drupal, Typo-3 and other CMS.

Using UTF8MB4 allows mapping the full 32-bit (4 byte) Unicode character space, mitigating the issue. That is to say, there are no unmapped characters which would give an attacker a foothold to corrupt serialised data to their advantage.

Regarding handling of languages: most written languages in current use map to Unicode's Basic Multilingual Plane (BMP) which can be represented with 3 bytes (plain old UTF-8). Anything beyond that plane requires four bytes to represent. Case in point the Emoticons block of the Supplementary Multilingual Plane (SMP), commonly called "Emoji". The SMP is first plane which requires 4 byte characters and Emoji is the most common use for it so that's why 4-byte UTF8 support has been commonly marketed as "Emoji support". So, even if you support Chinese and Japanese sites you are still most likely to only use the BMP which is covered in 3 byte plain old UTF8 in MySQL unless you are dealing with scholars using traditional Chinese / Kanji ideograms which are unlikely to be known to many other people.

So, yeah, UTF8MB4 support is there mainly for security reasons and oh, look, it can also support Emoji so that our users can see a palpable benefit instead of the more nebulous "security reasons" argument.

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

joomleb

Hi Nicholas,
just "Thank you very much", once again you anticipated me.

I received first responses from the 3 developers I wrote to (they are not "newbies" developers):

- AcyMailing 6 by Acyba - by email: "...When we set the format to utf8_general_ci it's because we know it's working on all kinds of MySQL version and CMS (Joomla, WordPress), so if you change that it's not our fault.
Like I said the best solution here is to respect the format for each AcyMailing table.
We won't change the format of the table, this might break everything in some user configuration..."

- HikaShop by Hikari Software - by forum 

- SEF Translate by Ordasoft - by forum

and really the answers scared me, it seems that they totally ignore the utf8mb4 safety question and its importance (which is fundamental for me, as you closed the previous post).

I found your previous post the perfect resume of what I found online. Please, Can I copy and paste it by answering to them ? Do you give me this option /permission ?

PS - you have already scolded me, but sure I do not want to fill you with the responsibility of third parties, making you aware is a form of education that, from my point of view, is the minimum to thank you (I do not want misunderstanding with you)

nicholas
Akeeba Staff
Manager

Regarding the security aspect, I don't blame them for being oblivious to it. It doesn't affect them. It only really affects the core CMS contributors and people who write their own application frameworks. I am both which is why I was the one that originally proposed and implemented the utf8mb4 upgrade in Joomla itself. The current implementation has been greatly improved by Richard Fath and George Wilson when I ran out of available time and good will to keep on updating that code after Joomla 3.5 (it coincided with me getting married and the then-budding Joomla 4 development taking a bad direction, a combination which didn't exactly lend itself to me spending any more time on core Joomla code).

Now, I have two comments to make and please link these third party developers to this reply of mine.

First, I'd like to say that everything I have said here and everything I am about to say is already explained in minute detail in my PR which added backwards compatible UTF8MB4 support to Joomla 3.5.0, submitted 5 ½ years ago on June 2015.

Regarding the practical considerations of UTF8MB4 vs UTF8 support in each extension, it's about which characters can be stored with each collation. When using a utf8_*_ci collation you are essentially only supporting the Unicode Basic Multilingual Plane (BMP). This means that four byte Unicode characters such as Emoji and extended Chinese, Japanese and Korean characters are not supported and instead converted to question marks at best or result in truncated data at worst. Joomla has supported UTF8MB4 for more than five years and people expect these characters to work correctly with it and all of its extensions. There is no good reason why I shouldn't be able to send a mailing that contains an Emoji, make a product description that includes some extended Korean character or have automatic translation fail when using Unicode aliases in Global Configuration for an article whose title uses some less often used traditional Chinese character. 

As to how they can convert their extensions, all they have to do is stop supporting Joomla 3.4 and earlier. I would say that's a reasonable thing to do considering that these versions of Joomla are less than 0.4% of all installed Joomla sites. Joomla 3.5 and later allows you to simply write your SQL queries using the utf8mb4_unicode_ci collation and it will magically and automatically convert that to utf8_unicode_ci collation if the database server and/or PHP MySQL driver used on the site do not support utf8mb4. From a third party developer's perspective you DO NOT have to worry about whether the database server supports utf8mb4 or not. That's Joomla's problem. In fact, Joomla's own SQL code does the same – it uses utf8mb4. It still works on old versions of MySQL and PHP which don't support UTF8MB4. This alone should have gotten these developers thinking instead of waving off utf8mb4 support as infeasible and prompt them to take a look at Joomla's code where they would have easily spotted the  convertUtf8mb4QueryToUtf8() method in the core Joomla database driver code.

So, it's absolutely feasible, it is necessary and there are no more excuses not to do it. If they really want to support Joomla 3.0 to 3.4 (less than 0.4% of Joomla installations) they can of course create two separate packages the only difference between them being the SQL files. This can be trivially automated, so it's not even a valid excuse. If they still refuse to upgrade to utf8mb4 then the problem is with them and only them.

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

joomleb

Hi Nicholas, 
always thanking you for the great explications, as far as I understand, when I run "Change Database Collation" in Admin Tools it is performed on all the DB Tables data (if no conflicts are encountered).

1 - Is there a way to "exclude some tables" when I run it ? OR, Is there a way to select the DB Tables on where to run it ?

2 - From a theoretical point of view, moving from uft8 to utf8mb4 collation is a reversible utf8mb4 to utf8 process ? Or, Is there a high risk of making mistakes?

I'm asking it to you because, right now my DB is quite new / empty, but some Extension's Developer answered (don't answering) in this way (for ex.):
"...A- We don't need to specify anything in the documentation because it's in the code when we create the tables in the file administrator/components/com_acym/tables.sql
At the end of every new table we specify to add the format utf8_general_ci
=> so if you change that it's at your own risk
B- I recommend you to use the format that we specify under each table.
C- We will see if we change the format in the future, but for now there are too many risks regarding the configuration of our users..."

So, I'm have to take a decision if is better for me to move all the Tables collation to utf8mb4_unicode_ci  or if would be better to leave the ones of the mentioned Extensions, like AcyMailing 6 as utf8_general_ci

 

Now, you wrote me: "..when I ran out of available time and good will to keep on updating that code after Joomla 3.5 (it coincided with me getting married and the then-budding Joomla 4 development taking a bad direction, a combination which didn't exactly lend itself to me spending any more time on core Joomla code)..."
Well, I'm not a developer, but I'm on Joomla from its first day (august 2005) as moving from Mambo. I noted something strange on Joomla on last 3 years, something is changed. Developers are very "electric", many extensions of historical developers have ended up on a dead end. Few new entries. Everything seems to have slowed down. Many small BUG reports are simply discarded hastily.
This is starting to worry me because I'm starting now with big Joomla based projects.
Have you experienced the same things or is it just my impression?

nicholas
Akeeba Staff
Manager

1. No.

2. This is way beyond the scope of our support. I will say this and close the ticket. You can go from 3 byte UTF8 to 4 byte UTF8MB4 without data loss. You are expanding the encoding. You CAN NOT go from 4 byte UTF8MB4 to UTF8 without some data loss. You are contracting the encoding. Any 4 byte characters won't be mappable to a 3 byte character encoding therefore they will be lost (they will display as question marks).

Regarding Joomla. People have been complaining about it since its inception 15 years ago. Nothing new here.

Between 2008 and 2013 Joomla was more inflated than it should be because people was using it to create e-commerce sites with VirtueMart, even when they really didn't need a CMS with their e-commerce site. Now we see the same with WooCommerce which inflates WordPress' numbers (and costs far more than Joomla + VirtueMart cost a decade ago, but that's another problem). The people who actually only needed e-commerce, not a CMS to go with it, moved to PrestaShop and similar solutions. I think that's a good thing. Trying to make a CMS into something that it's not was only having bad consequences for Joomla.

The corollary is that there were a lot of smaller extensions teetering on the edge of oblivion, propped up only by Joomla's artificially inflated market share. When Joomla contracted to its natural size – covering the not insignificant and very lucrative niche of bespoke sites without going for resource and capital intensive from-scratch development – the smaller extensions died off.  Other extensions no longer had an appeal. For example, in the late 00s and early 10s it was common to have directories, file download archives and galleries. Nobody does that anymore. Extensions in those niches died. Other extensions were superseded by core Joomla. For example CCKs like K2 are obsolete since sometime between Joomla! 3.4 to 3.8 depending on what you were using them for. That's not a bad thing. It's Joomla and its market maturing.

I would also posit that it's a good thing for another reason. Look at WordPress and its Plugins Directory. WordPress has been nominally expanding in market share, yet plugin developers are left without much in the way of income. Long standing plugins are abandoned, go on a paid-only model or, far more frequently and alarmingly, get scooped up by malfeasants which use them as a backdoor to hack people's sites. I would say that this is far more concerning, especially considering that WordPress itself offers too little features and you need plugins like that to do essential things including but not limited to have a contact form on your site or send email using an SMTP server of your choice – features that every other CMS has had in the core for over a decade if not since their inception. Between that and the push to use Gutenberg, WordPress is placing itself as a combination toy CMS for marketers who want to make throwaway sites easily and a base platform to run a very expensive e-commerce platform (WooCommerce). For everything else Joomla is superior and far more mature.

I do as I preach. My own blog was converted FROM WordPress TO Joomla in April 2020. All new site projects me or my wife start are based on Joomla. Joomla is an excellent all purpose, extensible CMS with what seems to be a future. WordPress, on the other hand, is a dead end. Matt's declared end goal is that WordPress the CMS goes away, plugins are nothing more than JavaScript / React Software-as-a-Service adding functionality to WordPress.com and he gets the lion's share of all the money flowing through this perverted ecosystem. What are the other choices we have left? Using Drupal, Laravel, Luminas or Symfony to create a bespoke site? That's a different market (the $100k to $1+ million site) than what us using FOSS CMS target (the sub $30k site, typically in the sub-$4k area if you're talking about Europe). So it's either Joomla or start working on a career outside web development.

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

Support Information

Working hours: We are open Monday to Friday, 9am to 7pm Cyprus timezone (EET / EEST). Support is provided by the same developers writing the software, all of which live in Europe. You can still file tickets outside of our working hours, but we cannot respond to them until we're back at the office.

Support policy: We would like to kindly inform you that when using our support you have already agreed to the Support Policy which is part of our Terms of Service. Thank you for your understanding and for helping us help you!