Support

Akeeba Backup for Joomla!

#34349 MySQL database backup default character set

Posted in ‘Akeeba Backup for Joomla! 4 & 5’
This is a public ticket

Everybody will be able to see its contents. Do not include usernames, passwords or any other sensitive information.

Environment Information

Joomla! version
n/a
PHP version
n/a
Akeeba Backup version
n/a

Latest post by rbradbury on Friday, 15 January 2021 05:24 CST

rbradbury

 


Description of my issue: 

Can you advise what default character set is used for database backups when using the 'Multiple Database Definitions' facility please?

I'm hoping the answer is UTF8MB4 (noting current Joomla! requirements) but p.161 of the documentation states that "Akeeba Backup stores and parses your data as raw Unicode (UTF-8)" 

I'm not using the RegEx Database Tables Exclusion facility, but the statement there worries me.  The non-Joomla database I'm backing up uses UTF8MB4 which I hope is preserved during backup & restore.

nicholas
Akeeba Staff
Manager

Ah, I see that you are confused about a few things and with a good reason.

Background: character sets, PHP and databases

There are two distinct character sets you need to be aware of:

1. Database storage character set which is a property of the MySQL database, table or column itself. This defines how MySQL stores the database data to disk.

For historical reasons, MySQL's UTF-8 character set was not the canonical UTF-8 character set defined un the Unicode standard (characters consisting of 1 to 4 bytes). It was a botched down version that supported only up to 3 bytes per character. This had to do with memory constraints in the prehistoric days when this decision was made. We're talking about early 00s when computers had 128MB of RAM and anything over 512MB was a beast with plenty of RAM. Of course the problem with that is that a lot of valid Unicode characters could not be represented and MySQL would squash them as the invalid character grapheme which renders as a question mark. MySQL then introduced the UTB8MB4 character set which is the “real deal” UTF-8 character set, i.e. each character can occupy 1 to 4 bytes.

2. PHP character set which is how PHP itself stores data in its memory, expects to read from files and writes to files. This is always the “real deal” UTF-8 encoding that supports characters of 1 to 4 bytes, i.e. equivalent to MySQL's "UTF8MB4". 

PHP can only handle one kind of data, the “real deal” UTF-8 data. When PHP-based software, like Akeeba Backup, reads from the database the following happens. The software interfaces with PHP's MySQL access library (MySQLi or PDOMySQL). The access library interfaces with PHP's low-level MySQL driver (mysqlnd) which uses MySQL's official C library to talk to the database server. The connection character set (the format the database will receive and send back data) is set to “real deal” UTF-8. So, rhe database server reads the data from disk, processes it and returns it as “real deal” UTF-8 data back to mysqlnd which sends it to MySQLi/PDOMySQL which sends it back to Akeeba Backup. What Akeeba Backup got from the database use “real deal” UTF-8 data, i.e Unicode characters with a varying length of 1 to 4 bytes.

Akeeba Backup writes the SQL dump file with a verbatim copy of the data it received. Therefore it will be, by definition, real deal UTF-8 data with characters between 1 and 4 bytes in length. (There's a further note here that PHP always uses Big Endian format for multibyte data even on Little Endian platforms like ARM, e.g. Apple Silicon, which means we don't need to output a Byte Order Mark but that's super technical and besides the point — I only include it in case some poor soul is looking for this information in the future and bumps into this public ticket).

Restoring backup

There are two kinds of backups you can restore: archive files or SQL files.

Archive files are the .jpa, .jps and .zip files you get when you take a full site backup, an "All configured databases" backup or an "Incremental files, full database" backup. In this case the restoration takes place with our database restoration script, ANGIE. ANGIE is smart. It detects your database server and its version and knows when it supports UTF8MB4. If this is the case it will restore the data verbatim, i.e. you will not lose any 4 byte characters such as Emoji, lesser used traditional Chinese characters and less common writing scripts introduced in Unicode version 5 and later. When you are on a very, VERY old database server which does not support UTF8MB4 the 4-byte characters will be replaced by question marks by the database server itself.

SQL files are restored by you using software you provide. What will happen when you restore them depends on BOTH your database server AND the software you are using. If you use something like HeidiSQL, DBeaver or phpMyAdmin and your server supports UTF8MB4 your data will be restored correctly, including 4-byte data. In any other case the 4-byte characters will be converted to question marks by the database server itself.

TL;DR

As long as you are using the latest version of Akeeba Backup to backup and restore your database you don't need to worry about character sets. You will NOT lose any data, as long as the database server you are restoring your site to supports UTF8MB4.

Also, for what it's worth, Joomla only started UTF8MB4 because yours truly insisted on it and provided the original patch which enabled this behaviour. It would be really stupid if my own software didn't correctly support UTF8MB4, don't you think? 😉

Nicholas K. Dionysopoulos

Lead Developer and Director

🇬🇷Greek: native 🇬🇧English: excellent 🇫🇷French: basic • 🕐 My time zone is Europe / Athens
Please keep in mind my timezone and cultural differences when reading my replies. Thank you!

rbradbury

Thanks very much indeed for your comprehensive & reassuring reply Nicholas.  As a long time user of Akeeba Backup Pro I expected you to have covered all the bases and thought this might just be a documentation issue.  I should have known better!!

Support Information

Working hours: We are open Monday to Friday, 9am to 7pm Cyprus timezone (EET / EEST). Support is provided by the same developers writing the software, all of which live in Europe. You can still file tickets outside of our working hours, but we cannot respond to them until we're back at the office.

Support policy: We would like to kindly inform you that when using our support you have already agreed to the Support Policy which is part of our Terms of Service. Thank you for your understanding and for helping us help you!