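Comparing characters with the Collations used in Japanese environments on MySQL 8.0 with Charset utf8mb4

What did I want to do when writing this?

I wanted to take a quick look at the Collations available for Charset utf8mb4 in MySQL 8.0.

Specifically, I want to verify for myself the character comparison and sorting behavior described in
"11.2 Collation" of 「MySQL徹底入門第4版」.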

MySQL徹底入門第4版 MySQL 8.0対応

Authors: yoku0825, 坂井恵, 鶴長鎮一, とみたまさひろ, 深町日出海, 福山裕大, 班石悦夫, 山﨑由章
Published: 2020/07/06
Format: Paperback (softcover)

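Charset and Collation with utf8mb4

The MySQL documentation on Charsets and Collations is here.

MySQL :: MySQL 8.0 Reference Manual :: 10 Character Sets, Collations, Unicode

MySQL can use multiple Charsets (character sets), and you can check which ones are available in your environment with the following.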

mysql> show character set;

Each Charset has Collations, which govern how characters are compared and sorted.

MySQL :: MySQL 8.0 Reference Manual :: 10.2 Character Sets and Collations in MySQL

The available Collations can be checked with the following.

mysql> show collation;

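Charset and Collation can each be specified at the server, database, table, column, and string literal level.

MySQL :: MySQL 8.0 Reference Manual :: 10.3 Specifying Character Sets and Collations

So, which Charset and Collation should you use?

First, the Charset.

You will most likely pick a UTF-8 family Charset that supports Unicode.
Or rather, utf8mb4 (the 4-byte UTF-8 encoding).

MySQL :: MySQL 8.0 Reference Manual :: 10.9 Unicode Support

They are probably not used much these days, but there are also cp932, eucjpms, and others.

MySQL :: MySQL 8.0 Reference Manual :: 10.10.7 Asian Character Sets

This time I'll work with utf8mb4.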

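Next, the Collation.

There are utf8mb4_bin, utf8mb4_general_ci, utf8mb4_0900_as_ci, utf8mb4_ja_0900_as_cs, utf8mb4_ja_0900_as_cs_ks, and so on.
How to read these names is explained in the following.

MySQL :: MySQL 8.0 Reference Manual :: 10.3.1 Collation Naming Conventions

First, a Collation name starts with the Charset it is associated with.

A collation name starts with the name of the character set with which it is associated, generally followed by one or more suffixes indicating other collation characteristics.

If a locale such as ja is included, the Collation is language-specific.

Language-specific collations include a locale code or language name.

The suffixes that follow have these meanings.

_ai … accent-insensitive
_as … accent-sensitive
_ci … case-insensitive
_cs … case-sensitive
_ks … kana-sensitive
_bin … binary

For Japanese, "accent" refers to the difference between unvoiced, voiced (dakuten), and semi-voiced (handakuten) sounds,
and "kana" refers to hiragana versus katakana; the suffixes say whether these are distinguished.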

If the name contains a number, it indicates the version of the Unicode Collation Algorithm (UCA).

utf8mb4_unicode_520_ci … based on Unicode Collation Algorithm 5.2.0
utf8mb4_ja_0900_as_cs … based on Unicode Collation Algorithm 9.0.0

In other words, a Collation with a UCA version in its name follows the Unicode standard.

Putting this together, utf8mb4_ja_0900_as_cs_ks, for example, reads as follows.

A Collation specific to Japanese
Based on UCA 9.0.0
Accent-sensitive
Case-sensitive
Kana-sensitive
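The naming rules above are regular enough to decode mechanically. The following is a minimal sketch of that reading, not anything MySQL provides: `describe_collation` and `SUFFIX_MEANINGS` are names I made up for illustration, and the handling of `general` and `unicode` as plain family names is my own simplification.

```python
import re

# Suffix meanings as listed in the naming conventions above.
SUFFIX_MEANINGS = {
    'ai': 'accent-insensitive',
    'as': 'accent-sensitive',
    'ci': 'case-insensitive',
    'cs': 'case-sensitive',
    'ks': 'kana-sensitive',
    'bin': 'binary',
}


def describe_collation(name: str, charset: str = 'utf8mb4') -> dict:
    """Split a collation name into charset, locale, UCA version and attributes.

    Hypothetical helper: it only follows the naming convention and does not
    ask the server whether the collation actually exists.
    """
    prefix = charset + '_'
    if not name.startswith(prefix):
        raise ValueError(f'{name} does not belong to charset {charset}')

    locale_parts = []
    uca_version = None
    attributes = []

    for part in name[len(prefix):].split('_'):
        if part in SUFFIX_MEANINGS:
            attributes.append(SUFFIX_MEANINGS[part])
        elif re.fullmatch(r'\d{3,4}', part):
            # '0900' -> '9.0.0', '520' -> '5.2.0'
            uca_version = '.'.join(part.lstrip('0'))
        elif part not in ('general', 'unicode'):
            # Anything else is treated as a locale code or language name.
            locale_parts.append(part)

    return {
        'charset': charset,
        'locale': '_'.join(locale_parts) or None,
        'uca_version': uca_version,
        'attributes': attributes,
    }


print(describe_collation('utf8mb4_ja_0900_as_cs_ks'))
```

Feeding it the names returned by `show collation` is one way to get a quick overview of what each Collation claims to distinguish.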

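So what about Collations with no UCA version in the name? Ones like utf8mb4_general_ci follow MySQL's own rules
rather than the UCA. One exception to keep in mind: according to the manual, the unversioned xxx_unicode_ci
Collations (such as utf8mb4_unicode_ci) are based on UCA 4.0.0.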

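_bin means binary, so utf8mb4_bin and utf8mb4_0900_bin both perform binary comparison.

The difference between the two is described here.

For all Unicode collations except the _bin (binary) collations, MySQL performs a table lookup to find a character's collating weight.

For _bin collations except utf8mb4_0900_bin, the weight is based on the code point, possibly with leading zero bytes added.

For utf8mb4_0900_bin, the weight is the utf8mb4 encoding bytes. The sort order is the same as for utf8mb4_bin, but much faster.

Character Collating Weights

So the difference is whether the weight is the code point or the encoded bytes. In addition, utf8mb4_bin is PAD SPACE
while utf8mb4_0900_bin is NO PAD (collation pad attributes are covered below).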
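To make "code point versus bytes" concrete, here is a small Python sketch that reproduces the two kinds of weight outside MySQL. The function names are hypothetical, and the 3-byte width for the code point form is taken from the WEIGHT_STRING output shown later in this post rather than from any specification I checked.

```python
def bin_weight(text: str) -> str:
    """Code point of each character, zero-padded to 3 bytes (utf8mb4_bin style)."""
    return ''.join(f'{ord(ch):06X}' for ch in text)


def bin_0900_weight(text: str) -> str:
    """UTF-8 encoding bytes of the string (utf8mb4_0900_bin style)."""
    return text.encode('utf-8').hex().upper()


for s in ['A', 'あ', '平成', '🍣']:
    print(s, bin_weight(s), bin_0900_weight(s))

# UTF-8 preserves code point order, so both weights sort strings the same way.
sample = ['🍺', 'あ', 'a', 'ア', 'A', '㍻', '①']
assert sorted(sample, key=bin_weight) == sorted(sample, key=bin_0900_weight)
```

The final assertion mirrors the manual's remark that the sort order is the same for both Collations; only the representation of the weight differs.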

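That leaves Collations such as utf8mb4_general_ci and utf8mb4_unicode_ci.

MySQL :: MySQL 8.0 Reference Manual :: 10.10.1 Unicode Character Sets

They are described in the following section.

_general_ci Versus _unicode_ci Collations

xxx_general_ci is apparently faster than xxx_unicode_ci, but also less accurate.

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason is that utf8_unicode_ci supports mappings such as expansions.

That alone doesn't tell me much.

I found the following statement elsewhere.

For BMP characters in general collations (xxx_general_ci), the weight is the code point.

Character Collating Weights

So within the BMP, the comparison is done by code point.

utf8mb4_general_ci … case-insensitive (_ci); otherwise compares by BMP code point. However, characters outside the BMP (U+10000 and above) cannot be distinguished from one another
utf8mb4_unicode_ci … supports Unicode mappings such as expansions

Put this way, utf8mb4_unicode_ci sounds like the better choice, but as the comparison results later in this post show,
it treats the Japanese character pairs I tested as equal, so in practice I doubt I would use it...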

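Collations also have a pad attribute; with NO PAD Collations, spaces at the end of a string are treated as characters.

Collations based on UCA 9.0.0 and higher are faster than collations based on UCA versions prior to 9.0.0. They also have a pad attribute of NO PAD, in contrast to PAD SPACE as used in collations based on UCA versions prior to 9.0.0. For comparison of nonbinary strings, NO PAD collations treat spaces at the end of strings like any other character.

Collation Pad Attributes

In short, it comes down to whether trailing spaces take part in the comparison.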
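As a rough mental model, the two pad attributes can be mimicked in plain Python. This sketch only models the treatment of trailing spaces for equality; it ignores weights, case, and accents entirely, and both function names are my own.

```python
def equal_pad_space(a: str, b: str) -> bool:
    """PAD SPACE: trailing spaces are insignificant when comparing."""
    return a.rstrip(' ') == b.rstrip(' ')


def equal_no_pad(a: str, b: str) -> bool:
    """NO PAD: trailing spaces count like any other character."""
    return a == b


print(equal_pad_space('a  ', 'a'))  # True
print(equal_no_pad('a  ', 'a'))     # False
```

Leading spaces are significant in both models; only the end of the string is treated specially.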

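I have already quoted some of the documentation for it, but the weight of a character under a Collation can be
looked up with the WEIGHT_STRING function.

Character Collating Weights

WEIGHT_STRING

This is the function to use when you want to see how characters are actually being judged.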
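For checking several Collations from a program, one option is to name the Collation inside the query itself instead of switching the connection Collation each time. The sketch below is an assumption-laden alternative, not what the script at the end of this post does: `weight_query` and `fetch_weight` are hypothetical helpers, the cursor is assumed to be a DB-API cursor that uses `%s` placeholders (as MySQL Connector/Python does), and I have not run it against the server used here.

```python
import re

# Collation names cannot be bound as parameters, so validate before formatting.
_COLLATION_NAME = re.compile(r'^[A-Za-z0-9_]+$')


def weight_query(collation: str) -> str:
    """Build a query that returns the hex weight of one bound string."""
    if not _COLLATION_NAME.match(collation):
        raise ValueError(f'unexpected collation name: {collation!r}')
    return (
        'SELECT HEX(WEIGHT_STRING('
        f'CONVERT(%s USING utf8mb4) COLLATE {collation}))'
    )


def fetch_weight(cursor, text: str, collation: str) -> str:
    """Run the query on an open cursor and return the hex weight."""
    cursor.execute(weight_query(collation), (text,))
    return cursor.fetchone()[0]


print(weight_query('utf8mb4_general_ci'))
```

With an open cursor `cur`, a call would look like `fetch_weight(cur, 'あ', 'utf8mb4_general_ci')`.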

The default Collation for utf8mb4

When you choose Charset utf8mb4, the default Collation appears to be utf8mb4_0900_ai_ci.

mysql> show collation where collation like 'utf8mb4%' and `default` = 'Yes';
+--------------------+---------+-----+---------+----------+---------+---------------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+--------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
+--------------------+---------+-----+---------+----------+---------+---------------+
1 row in set (0.00 sec)

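From here on, let's check things by actually running them.

Environment

About the environment used this time.

I'm using MySQL 8.0.24, assumed to be running at 172.17.0.2.

The data for the checks will be generated by a program. I decided to use Python this time.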

$ python3 -V
Python 3.8.5


$ pip3 -V
pip 20.0.2 from /path/to/venv/lib/python3.8/site-packages/pip (python 3.8)

To access MySQL, I use MySQL Connector/Python.

MySQL :: MySQL Connector/Python Developer Guide

$ pip3 install mysql-connector-python==8.0.24

The program itself is listed at the end.

Collations covered this time

These are the Collations I'll cover.

mysql>  show collation
    ->  where
    ->   collation like 'utf8mb4_ja%' or
    ->   collation like 'utf8mb4_0900%' or
    ->   collation like 'utf8mb4%bin' or
    ->   collation like 'utf8mb4%general%' or
    ->   collation like 'utf8mb4%unicode%';
+--------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+--------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci       | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_ci       | utf8mb4 | 305 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_cs       | utf8mb4 | 278 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_bin         | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
| utf8mb4_bin              | utf8mb4 |  46 |         | Yes      |       1 | PAD SPACE     |
| utf8mb4_general_ci       | utf8mb4 |  45 |         | Yes      |       1 | PAD SPACE     |
| utf8mb4_ja_0900_as_cs    | utf8mb4 | 303 |         | Yes      |       0 | NO PAD        |
| utf8mb4_ja_0900_as_cs_ks | utf8mb4 | 304 |         | Yes      |      24 | NO PAD        |
| utf8mb4_unicode_520_ci   | utf8mb4 | 246 |         | Yes      |       8 | PAD SPACE     |
| utf8mb4_unicode_ci       | utf8mb4 | 224 |         | Yes      |       8 | PAD SPACE     |
+--------------------------+---------+-----+---------+----------+---------+---------------+
10 rows in set (0.01 sec)

Comparing a little

With the explanation so far in mind, let's check a few of the Collations.

utf8mb4_0900_ai_ci: accent-insensitive and case-insensitive. It is also NO PAD.

mysql> set names utf8mb4 collate utf8mb4_0900_ai_ci;
Query OK, 0 rows affected (0.00 sec)


mysql> select 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)


mysql> select 'あ' = 'ぁ';
+---------------+
| 'あ' = 'ぁ'   |
+---------------+
|             1 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'は' = 'ば';
+---------------+
| 'は' = 'ば'   |
+---------------+
|             1 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'a  ' = 'A';
+-------------+
| 'a  ' = 'A' |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)


mysql> select '🍣' = '🍺';
+-----------+
| '?' = '?' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)

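A result of 1 means true.

Upper and lower case are not distinguished, and neither are 「あ」 and 「ぁ」 or 「は」 and 「ば」. Trailing spaces were distinguished.
🍣 and 🍺 were distinguished as well.

Next, let's switch to utf8mb4_general_ci.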

mysql> set names utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)


mysql> select 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)


mysql> select 'あ' = 'ぁ';
+---------------+
| 'あ' = 'ぁ'   |
+---------------+
|             0 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'は' = 'ば';
+---------------+
| 'は' = 'ば'   |
+---------------+
|             0 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'a  ' = 'A';
+-------------+
| 'a  ' = 'A' |
+-------------+
|           1 |
+-------------+
1 row in set (0.00 sec)


mysql> select '🍣' = '🍺';
+-----------+
| '?' = '?' |
+-----------+
|         1 |
+-----------+
1 row in set (0.01 sec)

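Upper and lower case are not distinguished. 「あ」 and 「ぁ」, and 「は」 and 「ば」, were distinguished.
Trailing spaces are not distinguished, because this Collation is PAD SPACE. 🍣 and 🍺 ended up being treated as the same character.

One more: utf8mb4_0900_as_cs, which distinguishes both accents and case.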

mysql> set names utf8mb4 collate utf8mb4_0900_as_cs;
Query OK, 0 rows affected (0.00 sec)


mysql> select 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)

mysql> select 'あ' = 'ぁ';
+---------------+
| 'あ' = 'ぁ'   |
+---------------+
|             0 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'は' = 'ば';
+---------------+
| 'は' = 'ば'   |
+---------------+
|             0 |
+---------------+
1 row in set (0.00 sec)


mysql> select 'a  ' = 'A';
+-------------+
| 'a  ' = 'A' |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)


mysql> select '🍣' = '🍺';
+-----------+
| '?' = '?' |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)

Finally, let's look at the weights with the weight_string function.

mysql> set names utf8mb4 collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.00 sec)


mysql> select 'a', hex('a'), hex(weight_string('a'));
+---+----------+-------------------------+
| a | hex('a') | hex(weight_string('a')) |
+---+----------+-------------------------+
| a | 61       | 0041                    |
+---+----------+-------------------------+
1 row in set (0.00 sec)


mysql> select 'A', hex('A'), hex(weight_string('A'));
+---+----------+-------------------------+
| A | hex('A') | hex(weight_string('A')) |
+---+----------+-------------------------+
| A | 41       | 0041                    |
+---+----------+-------------------------+
1 row in set (0.00 sec)

mysql> select 'あ', hex('あ'), hex(weight_string('あ'));
+-----+------------+---------------------------+
| あ  | hex('あ')  | hex(weight_string('あ'))  |
+-----+------------+---------------------------+
| あ  | E38182     | 3042                      |
+-----+------------+---------------------------+
1 row in set (0.00 sec)


mysql> select 'ぁ', hex('ぁ'), hex(weight_string('ぁ'));
+-----+------------+---------------------------+
| ぁ  | hex('ぁ')  | hex(weight_string('ぁ'))  |
+-----+------------+---------------------------+
| ぁ  | E38181     | 3041                      |
+-----+------------+---------------------------+
1 row in set (0.00 sec)


mysql> select '🍣', hex('🍣'), hex(weight_string('🍣'));
+------+----------+-------------------------+
| ?    | hex('?') | hex(weight_string('?')) |
+------+----------+-------------------------+
| 🍣     | F09F8DA3 | FFFD                    |
+------+----------+-------------------------+
1 row in set (0.00 sec)


mysql> select '🍺', hex('🍺'), hex(weight_string('🍺'));
+------+----------+-------------------------+
| ?    | hex('?') | hex(weight_string('?')) |
+------+----------+-------------------------+
| 🍺     | F09F8DBA | FFFD                    |
+------+----------+-------------------------+
1 row in set (0.00 sec)

Seen this way, it becomes clear why some characters are distinguished and others are not. It also looks like
utf8mb4_general_ci turns every character outside the BMP into \ufffd.
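The weights above can be approximated in a few lines of Python, which makes the behavior easier to reason about. This is only a rough emulation under my own assumptions: it uses Python's `str.upper()` in place of MySQL's case folding tables, so it does not reproduce everything utf8mb4_general_ci does (accent folding for Latin letters, for instance, is not modeled). `general_ci_weight` is a hypothetical name.

```python
def general_ci_weight(text: str) -> str:
    """Approximate utf8mb4_general_ci weights as a hex string.

    BMP characters: code point of the upper-cased character (approximation).
    Characters above U+FFFF: all collapse to U+FFFD, as observed above.
    """
    words = []
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            words.append('FFFD')
            continue
        upper = ch.upper()
        # Only fold when upper-casing yields a single BMP character.
        if len(upper) == 1 and ord(upper) <= 0xFFFF:
            code_point = ord(upper)
        words.append(f'{code_point:04X}')
    return ''.join(words)


for s in ['a', 'A', 'あ', 'ぁ', '🍣', '🍺']:
    print(s, general_ci_weight(s))
```

Since both emoji map to the same weight, an equality comparison cannot tell them apart, which matches the `'🍣' = '🍺'` result of 1 seen earlier.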

Comparing more

Now let's widen the set of Collations and compare further.

The following Collations are the targets.

utf8mb4_0900_ai_ci
utf8mb4_0900_as_ci
utf8mb4_0900_as_cs
utf8mb4_ja_0900_as_cs
utf8mb4_ja_0900_as_cs_ks
utf8mb4_bin
utf8mb4_0900_bin
utf8mb4_general_ci
utf8mb4_unicode_ci
utf8mb4_unicode_520_ci

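For these Collations, I want to compute equality and ordering for the following character pairs.
Note: I could not enter the 「令和」 related character described in 「MySQL徹底入門第4版」 in my environment...

A and a
A and Ａ (full-width)
Ａ (full-width) and ａ (full-width)
あ and ぁ
あ and ア
は and ば
ば and ぱ
1 and ①
0 and 〇 (kanji numeral)
平成 and ㍻
🍣 and 🍺

I'll also compute the weight of each of these characters with weight_string.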

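Here are the results.

Character equality.

[Image: character equality comparison table]

utf8mb4_unicode_ci distinguishes none of these pairs. utf8mb4_unicode_520_ci at least distinguishes 🍣 and 🍺.

The table got large so I made it an image, but here it is as Markdown too.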

| Comparison | utf8mb4_0900_ai_ci | utf8mb4_0900_as_ci | utf8mb4_0900_as_cs | utf8mb4_ja_0900_as_cs | utf8mb4_ja_0900_as_cs_ks | utf8mb4_bin | utf8mb4_0900_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| A = a  | ○ | ○ | × | × | × | × | × | ○ | ○ | ○ |
| A = Ａ  | ○ | ○ | × | ○ | ○ | × | × | × | ○ | ○ |
| Ａ = ａ  | ○ | ○ | × | × | × | × | × | ○ | ○ | ○ |
| あ = ぁ  | ○ | ○ | × | × | × | × | × | × | ○ | ○ |
| あ = ア  | ○ | ○ | × | ○ | × | × | × | × | ○ | ○ |
| は = ば  | ○ | × | × | × | × | × | × | × | ○ | ○ |
| ば = ぱ  | ○ | × | × | × | × | × | × | × | ○ | ○ |
| 1 = ①  | ○ | ○ | × | × | × | × | × | × | ○ | ○ |
| 0 = 〇  | ○ | ○ | ○ | ○ | ○ | × | × | × | ○ | ○ |
| 平成 = ㍻  | ○ | ○ | × | × | × | × | × | × | ○ | ○ |
| 🍣 = 🍺  | × | × | × | × | × | × | × | ○ | ○ | × |

Ordering comparison.

[Image: character ordering comparison table]

As Markdown.

| Comparison | utf8mb4_0900_ai_ci | utf8mb4_0900_as_ci | utf8mb4_0900_as_cs | utf8mb4_ja_0900_as_cs | utf8mb4_ja_0900_as_cs_ks | utf8mb4_bin | utf8mb4_0900_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci |
|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| A comp a  | = | = | > | > | > | < | < | = | = | = |
| A comp Ａ  | = | = | < | = | = | < | < | < | = | = |
| Ａ comp ａ  | = | = | > | > | > | < | < | = | = | = |
| あ comp ぁ  | = | = | > | > | > | > | > | > | = | = |
| あ comp ア  | = | = | < | = | < | < | < | < | = | = |
| は comp ば  | = | < | < | < | < | < | < | < | = | = |
| ば comp ぱ  | = | < | < | < | < | < | < | < | = | = |
| 1 comp ①  | = | = | < | < | < | < | < | < | = | = |
| 0 comp 〇  | = | = | = | = | = | < | < | < | = | = |
| 平成 comp ㍻  | = | = | < | < | < | > | > | > | = | = |
| 🍣 comp 🍺  | < | < | < | < | < | < | < | = | = | < |

Last, the character weights.

[Image: character weight table]

This one is hard to read as an image... Here it is as Markdown.

| Character (hex) | utf8mb4_0900_ai_ci (weight) | utf8mb4_0900_as_ci (weight) | utf8mb4_0900_as_cs (weight) | utf8mb4_ja_0900_as_cs (weight) | utf8mb4_ja_0900_as_cs_ks (weight) | utf8mb4_bin (weight) | utf8mb4_0900_bin (weight) | utf8mb4_general_ci (weight) | utf8mb4_unicode_ci (weight) | utf8mb4_unicode_520_ci (weight) |
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
| A(41) | 1C47 | 1C4700000020 | 1C470000002000000008 | 1C470000002000000008 | 1C470000002000000008 | 000041 | 41 | 0041 | 0E33 | 120F |
| a(61) | 1C47 | 1C4700000020 | 1C470000002000000002 | 1C470000002000000002 | 1C470000002000000002 | 000061 | 61 | 0041 | 0E33 | 120F |
| Ａ(EFBCA1) | 1C47 | 1C4700000020 | 1C470000002000000009 | 1C470000002000000008 | 1C470000002000000008 | 00FF21 | EFBCA1 | FF21 | 0E33 | 120F |
| ａ(EFBD81) | 1C47 | 1C4700000020 | 1C470000002000000003 | 1C470000002000000002 | 1C470000002000000002 | 00FF41 | EFBD81 | FF21 | 0E33 | 120F |
| あ(E38182) | 3D5A | 3D5A00000020 | 3D5A000000200000000E | 1FB6000000200000000E | 1FB6000000200000000E00000002 | 003042 | E38182 | 3042 | 1E52 | 2B15 |
| ぁ(E38181) | 3D5A | 3D5A00000020 | 3D5A000000200000000D | 1FB6000000200000000D | 1FB6000000200000000D00000002 | 003041 | E38181 | 3041 | 1E52 | 2B15 |
| ア(E382A2) | 3D5A | 3D5A00000020 | 3D5A0000002000000011 | 1FB6000000200000000E | 1FB6000000200000000E00000008 | 0030A2 | E382A2 | 30A2 | 1E52 | 2B15 |
| は(E381AF) | 3D74 | 3D7400000020 | 3D74000000200000000E | 1FD0000000200000000E | 1FD0000000200000000E00000002 | 00306F | E381AF | 306F | 1E6B | 2B2E |
| ば(E381B0) | 3D74 | 3D74000000200037 | 3D740000002000370000000E0002 | 1FD00000002000370000000E0002 | 1FD00000002000370000000E000200000002 | 003070 | E381B0 | 3070 | 1E6B | 2B2E |
| ぱ(E381B1) | 3D74 | 3D74000000200038 | 3D740000002000380000000E0002 | 1FD00000002000380000000E0002 | 1FD00000002000380000000E000200000002 | 003071 | E381B1 | 3071 | 1E6B | 2B2E |
| 1(31) | 1C3E | 1C3E00000020 | 1C3E0000002000000002 | 1C3E0000002000000002 | 1C3E0000002000000002 | 000031 | 31 | 0031 | 0E2A | 1206 |
| ①(E291A0) | 1C3E | 1C3E00000020 | 1C3E0000002000000006 | 1C3E0000002000000006 | 1C3E0000002000000006 | 002460 | E291A0 | 2460 | 0E2A | 1206 |
| 0(30) | 1C3D | 1C3D00000020 | 1C3D0000002000000002 | 1C3D0000002000000002 | 1C3D0000002000000002 | 000030 | 30 | 0030 | 0E29 | 1205 |
| 〇(E38087) | 1C3D | 1C3D00000020 | 1C3D0000002000000002 | 1C3D0000002000000002 | 1C3D0000002000000002 | 003007 | E38087 | 3007 | 0E29 | 1205 |
| 平成(E5B9B3E68890) | FB40DE73FB40E210 | FB40DE73FB40E210000000200020 | FB40DE73FB40E210000000200020000000020002 | 5E4E5A91000000200020000000020002 | 5E4E5A91000000200020000000020002 | 005E73006210 | E5B9B3E68890 | 5E736210 | FB40DE73FB40E210 | FB40DE73FB40E210 |
| ㍻(E38DBB) | FB40DE73FB40E210 | FB40DE73FB40E210000000200020 | FB40DE73FB40E2100000002000200000001C001C | FB40DE73FB40E2100000002000200000001C001C | FB40DE73FB40E2100000002000200000001C001C | 00337B | E38DBB | 337B | FB40DE73FB40E210 | FB40DE73FB40E210 |
| 🍣(F09F8DA3) | 130C | 130C00000020 | 130C0000002000000002 | 130C0000002000000002 | 130C0000002000000002 | 01F363 | F09F8DA3 | FFFD | FFFD | FBC3F363 |
| 🍺(F09F8DBA) | 1323 | 132300000020 | 13230000002000000002 | 13230000002000000002 | 13230000002000000002 | 01F37A | F09F8DBA | FFFD | FFFD | FBC3F37A |

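Which Collation should you use?

I'm honestly not sure. Is utf8mb4_ja_0900_as_cs_ks the best choice for now?

Or do you accept the trade-offs and go with utf8mb4_bin or utf8mb4_0900_bin?

Bonus

To finish, here is the program that generated the Markdown above.

Running it produces output that includes the Markdown tables.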

$ python3 mysql_collation.py

Changing the Collations compared or the characters used should let you try all sorts of things.

mysql_collation.py

import mysql.connector
from mysql.connector import MySQLConnection
from mysql.connector.cursor import MySQLCursor

connection_configuration: dict = {
    'user': 'kazuhira',
    'password': 'password',
    'host': '172.17.0.2',
    'database': 'practice'
}

collations = [
    'utf8mb4_0900_ai_ci',
    'utf8mb4_0900_as_ci',
    'utf8mb4_0900_as_cs',
    'utf8mb4_ja_0900_as_cs',
    'utf8mb4_ja_0900_as_cs_ks',
    'utf8mb4_bin',
    'utf8mb4_0900_bin',
    'utf8mb4_general_ci',
    'utf8mb4_unicode_ci',
    'utf8mb4_unicode_520_ci'
]

comparison_string_pairs = [
    ('A', 'a'),
    ('A', 'Ａ'),
    ('Ａ', 'ａ'),
    ('あ', 'ぁ'),
    ('あ', 'ア'),
    ('は', 'ば'),
    ('ば', 'ぱ'),
    ('1', '①'),
    ('0', '〇'),
    ('平成', '㍻'),
    ('🍣', '🍺')
]

try:
    with mysql.connector.connect(**connection_configuration) as conn:
        conn: MySQLConnection = conn

        with conn.cursor() as cur:
            cur: MySQLCursor = cur

            print('=====================================================')
            print('print utf8mb4 collations.')
            print('=====================================================')

            cur.execute("""
                 show collation
                 where
                    collation like 'utf8mb4_ja%' or
                    collation like 'utf8mb4_0900%' or
                    collation like 'utf8mb4%bin' or
                    collation like 'utf8mb4%general%' or
                    collation like 'utf8mb4%unicode%'
                    """)
            rows: list = cur.fetchall()

            sorted_collations = sorted(map(lambda r: r[0], rows), reverse=True)

            for c in sorted_collations:
                print(c)

            print()

            print('=====================================================')
            print('print strings equals comparison')
            print('=====================================================')

            print('| Comparison | ' + ' | '.join(collations) + ' |')
            print('|:----:|:----' + ':|:----'.join(list(map(lambda x: '', collations))) + ':|')

            for pair in comparison_string_pairs:
                print(f'| {pair[0]} = {pair[1]} ', end='')

                for collation in collations:
                    cur.execute("set names utf8mb4 collate %s", (collation,))
                    cur.execute("select case %s = %s when 1 then '○' else '×' end", (pair[0], pair[1]))
                    print(f' | {cur.fetchone()[0]}', end='')

                print(' |')

            print()

            print('=====================================================')
            print('print strings sort comparison')
            print('=====================================================')

            print('| Comparison | ' + ' | '.join(collations) + ' |')
            print('|:----:|:----' + ':|:----'.join(list(map(lambda x: '', collations))) + ':|')

            for pair in comparison_string_pairs:
                print(f'| {pair[0]} comp {pair[1]} ', end='')

                for collation in collations:
                    cur.execute("set names utf8mb4 collate %s", (collation,))
                    cur.execute("select %s > %s, %s < %s", (pair[0], pair[1], pair[0], pair[1]))
                    row = cur.fetchone()

                    if row[0] == 0 and row[1] == 0:
                        result = '='
                    elif row[0] == 1:
                        result = '>'
                    elif row[1] == 1:
                        result = '<'

                    print(f' | {result}', end='')

                print(' |')

            print()

            print('=====================================================')
            print('print strings weight')
            print('=====================================================')

            print('| Character (hex) | ' + ' | '.join(map(lambda c: f'{c} (weight)', collations)) + ' |')
            print('|:----|:----' + '|:----'.join(list(map(lambda x: '', collations))) + '|')

            characters = []

            for pair in comparison_string_pairs:
                for s in pair:
                    if s not in characters:
                        characters.append(s)

            for c in characters:
                cur.execute('select hex(%s)', (c,))
                c_hex = cur.fetchone()[0]
                print(f'| {c}({c_hex})', end='')

                for collation in collations:
                    cur.execute("set names utf8mb4 collate %s", (collation,))
                    cur.execute("select hex(weight_string(%s))", (c,))
                    print(f' | {cur.fetchone()[0]}', end='')

                print(' |')

            print()

except mysql.connector.Error as err:
    print(f'VendorError: {err.errno}')
    print(f'SQLState: {err.sqlstate}')
    print(f'SQLException: {err.msg}')

CLOVER🍀

That was when it all began.
