Table of Contents
    MySQL includes character set support that enables you to store data
    using a variety of character sets and perform comparisons according
    to a variety of collations. You can specify character sets at the
    server, database, table, and column level. MySQL supports the use of
    character sets for the MyISAM,
    MEMORY, NDBCluster, and
    InnoDB storage engines.
  
This chapter discusses the following topics:
What are character sets and collations?
The multiple-level default system for character set assignment
Syntax for specifying character sets and collations
Affected functions and operations
Unicode support
The character sets and collations that are available, with notes
    Character set issues affect data storage, but also communication
    between client programs and the MySQL server. If you want the client
    program to communicate with the server using a character set
    different from the default, you'll need to indicate which one. For
    example, to use the utf8 Unicode character set,
    issue this statement after connecting to the server:
  
SET NAMES 'utf8';
For more information about character set-related issues in client/server communication, see Section聽10.4, 鈥淐onnection Character Sets and Collations鈥.
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
      Suppose that we have an alphabet with four letters:
      鈥A鈥,
      鈥B鈥,
      鈥a鈥,
      鈥b鈥. We give each letter a number:
      鈥A鈥 = 0,
      鈥B鈥 = 1,
      鈥a鈥 = 2,
      鈥b鈥 = 3. The letter
      鈥A鈥 is a symbol, the number 0 is
      the encoding for
      鈥A鈥, and the combination of all
      four letters and their encodings is a
      character set.
    
      Suppose that we want to compare two string values,
      鈥A鈥 and
      鈥B鈥. The simplest way to do this is
      to look at the encodings: 0 for 鈥A鈥
      and 1 for 鈥B鈥. Because 0 is less
      than 1, we say 鈥A鈥 is less than
      鈥B鈥. What we've just done is apply
      a collation to our character set. The collation is a set of rules
      (only one rule in this case): 鈥compare the
      encodings.鈥 We call this simplest of all possible
      collations a binary collation.
    
      But what if we want to say that the lowercase and uppercase
      letters are equivalent? Then we would have at least two rules: (1)
      treat the lowercase letters 鈥a鈥 and
      鈥b鈥 as equivalent to
      鈥A鈥 and
      鈥B鈥; (2) then compare the
      encodings. We call this a case-insensitive
      collation. It's a little more complex than a binary collation.
    
      In real life, most character sets have many characters: not just
      鈥A鈥 and
      鈥B鈥 but whole alphabets, sometimes
      multiple alphabets or eastern writing systems with thousands of
      characters, along with many special symbols and punctuation marks.
      Also in real life, most collations have many rules, not just for
      whether to distinguish lettercase, but also for whether to
      distinguish accents (an 鈥accent鈥 is a mark attached
      to a character as in German 鈥脰鈥),
      and for multiple-character mappings (such as the rule that
      鈥脰鈥 =
      鈥OE鈥 in one of the two German
      collations).
    
MySQL can do these things for you:
Store strings using a variety of character sets
Compare strings using a variety of collations
Mix strings with different character sets or collations in the same server, the same database, or even the same table
Allow specification of character set and collation at any level
In these respects, MySQL is far ahead of most other database management systems. However, to use these features effectively, you need to know what character sets and collations are available, how to change the defaults, and how they affect the behavior of string operators and functions.
      The MySQL server can support multiple character sets. To list the
      available character sets, use the SHOW CHARACTER
      SET statement. A partial listing follows. For more
      complete information, see Section聽10.10, 鈥淐haracter Sets and Collations That MySQL Supports鈥.
    
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+--------+
| Charset  | Description                 | Default collation   | Maxlen |
+----------+-----------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |      2 |
| dec8     | DEC West European           | dec8_swedish_ci     |      1 |
| cp850    | DOS West European           | cp850_general_ci    |      1 |
| hp8      | HP West European            | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European        | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                    | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |      1 |
| cp1250   | Windows Central European    | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |      1 |
...
      Any given character set always has at least one collation. It may
      have several collations. To list the collations for a character
      set, use the SHOW COLLATION statement. For
      example, to see the collations for the latin1
      (cp1252 West European) character set, use this statement to find
      those collation names that begin with latin1:
    
mysql> SHOW COLLATION LIKE 'latin1%';
+---------------------+---------+----+---------+----------+---------+
| Collation           | Charset | Id | Default | Compiled | Sortlen |
+---------------------+---------+----+---------+----------+---------+
| latin1_german1_ci   | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci   | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci    | latin1  | 15 |         |          |       0 |
| latin1_german2_ci   | latin1  | 31 |         | Yes      |       2 |
| latin1_bin          | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci   | latin1  | 48 |         |          |       0 |
| latin1_general_cs   | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci   | latin1  | 94 |         |          |       0 |
+---------------------+---------+----+---------+----------+---------+
      The latin1 collations have the following
      meanings:
    
| Collation | Meaning | 
| latin1_german1_ci | German DIN-1 | 
| latin1_swedish_ci | Swedish/Finnish | 
| latin1_danish_ci | Danish/Norwegian | 
| latin1_german2_ci | German DIN-2 | 
| latin1_bin | Binary according to latin1encoding | 
| latin1_general_ci | Multilingual (Western European) | 
| latin1_general_cs | Multilingual (ISO Western European), case sensitive | 
| latin1_spanish_ci | Modern Spanish | 
Collations have these general characteristics:
Two different character sets cannot have the same collation.
          Each character set has one collation that is the
          default collation. For example, the
          default collation for latin1 is
          latin1_swedish_ci. The output for
          SHOW CHARACTER SET indicates which
          collation is the default for each displayed character set.
        
          There is a convention for collation names: They start with the
          name of the character set with which they are associated, they
          usually include a language name, and they end with
          _ci (case insensitive),
          _cs (case sensitive), or
          _bin (binary).
        
There are default settings for character sets and collations at four levels: server, database, table, and column. The following description may appear complex, but it has been found in practice that multiple-level defaulting leads to natural and obvious results.
      CHARACTER SET is used in clauses that specify a
      character set. CHARSET may be used as a synonym
      for CHARACTER SET.
    
MySQL Server has a server character set and a server collation. These can be set at server startup and changed at runtime.
        Initially, the server character set and collation depend on the
        options that you use when you start mysqld.
        You can use --character-set-server for the
        character set. Along with it, you can add
        --collation-server for the collation. If you
        don't specify a character set, that is the same as saying
        --character-set-server=latin1. If you specify
        only a character set (for example, latin1)
        but not a collation, that is the same as saying
        --character-set-server=latin1
        --collation-server=latin1_swedish_ci because
        latin1_swedish_ci is the default collation
        for latin1. Therefore, the following three
        commands all have the same effect:
      
shell>mysqldshell>mysqld --character-set-server=latin1shell>mysqld --character-set-server=latin1 \--collation-server=latin1_swedish_ci
        One way to change the settings is by recompiling. If you want to
        change the default server character set and collation when
        building from sources, use: --with-charset and
        --with-collation as arguments for
        configure. For example:
      
shell> ./configure --with-charset=latin1
Or:
shell>./configure --with-charset=latin1 \--with-collation=latin1_german1_ci
Both mysqld and configure verify that the character set/collation combination is valid. If not, each program displays an error message and terminates.
        The current server character set and collation can be determined
        from the values of the character_set_server
        and collation_server system variables. These
        variables can be changed at runtime.
      
        Every database has a database character set and a database
        collation. The CREATE DATABASE and
        ALTER DATABASE statements have optional
        clauses for specifying the database character set and collation:
      
CREATE DATABASEdb_name[[DEFAULT] CHARACTER SETcharset_name] [[DEFAULT] COLLATEcollation_name] ALTER DATABASEdb_name[[DEFAULT] CHARACTER SETcharset_name] [[DEFAULT] COLLATEcollation_name]
        The keyword SCHEMA can be used instead of
        DATABASE.
      
        All database options are stored in a text file named
        db.opt that can be found in the database
        directory.
      
        The CHARACTER SET and
        COLLATE clauses make it possible to create
        databases with different character sets and collations on the
        same MySQL server.
      
Example:
CREATE DATABASE db_name CHARACTER SET latin1 COLLATE latin1_swedish_ci;
MySQL chooses the database character set and database collation in the following manner:
            If both CHARACTER SET
             and XCOLLATE
             were specified, then
            character set YX and collation
            Y.
          
            If CHARACTER SET
             was specified without
            XCOLLATE, then character set
            X and its default collation.
          
            If COLLATE 
            was specified without YCHARACTER SET, then
            the character set associated with
            Y and collation
            Y.
          
Otherwise, the server character set and server collation.
        The database character set and collation are used as default
        values if the table character set and collation are not
        specified in CREATE TABLE statements. They
        have no other purpose.
      
        The character set and collation for the default database can be
        determined from the values of the
        character_set_database and
        collation_database system variables. The
        server sets these variables whenever the default database
        changes. If there is no default database, the variables have the
        same value as the corresponding server-level system variables,
        character_set_server and
        collation_server.
      
        Every table has a table character set and a table collation. The
        CREATE TABLE and ALTER
        TABLE statements have optional clauses for specifying
        the table character set and collation:
      
CREATE TABLEtbl_name(column_list) [[DEFAULT] CHARACTER SETcharset_name] [COLLATEcollation_name]] ALTER TABLEtbl_name[[DEFAULT] CHARACTER SETcharset_name] [COLLATEcollation_name]
Example:
CREATE TABLE t1 ( ... ) CHARACTER SET latin1 COLLATE latin1_danish_ci;
MySQL chooses the table character set and collation in the following manner:
            If both CHARACTER SET
             and XCOLLATE
             were specified, then
            character set YX and collation
            Y.
          
            If CHARACTER SET
             was specified without
            XCOLLATE, then character set
            X and its default collation.
          
            If COLLATE 
            was specified without YCHARACTER SET, then
            the character set associated with
            Y and collation
            Y.
          
Otherwise, the database character set and collation.
The table character set and collation are used as default values if the column character set and collation are not specified in individual column definitions. The table character set and collation are MySQL extensions; there are no such things in standard SQL.
        Every 鈥character鈥 column (that is, a column of type
        CHAR, VARCHAR, or
        TEXT) has a column character set and a column
        collation. Column definition syntax has optional clauses for
        specifying the column character set and collation:
      
col_name{CHAR | VARCHAR | TEXT} (col_length) [CHARACTER SETcharset_name] [COLLATEcollation_name]
Example:
CREATE TABLE Table1
(
    column1 VARCHAR(5) CHARACTER SET latin1 COLLATE latin1_german1_ci
);
MySQL chooses the column character set and collation in the following manner:
            If both CHARACTER SET
             and XCOLLATE
             were specified, then
            character set YX and collation
            Y are used.
          
            If CHARACTER SET
             was specified without
            XCOLLATE, then character set
            X and its default collation are
            used.
          
            If COLLATE 
            was specified without YCHARACTER SET, then
            the character set associated with
            Y and collation
            Y.
          
Otherwise, the table character set and collation are used.
        The CHARACTER SET and
        COLLATE clauses are standard SQL.
      
Every character string literal has a character set and a collation.
        A character string literal may have an optional character set
        introducer and COLLATE clause:
      
[_charset_name]'string' [COLLATEcollation_name]
Examples:
SELECT 'string'; SELECT _latin1'string'; SELECT _latin1'string' COLLATE latin1_danish_ci;
        For the simple statement SELECT
        ', the string has
        the character set and collation defined by the
        string'character_set_connection and
        collation_connection system variables.
      
        The _
        expression is formally called an
        introducer. It tells the parser, 鈥the
        string that is about to follow uses character set
        charset_nameX.鈥 Because this has confused
        people in the past, we emphasize that an introducer does not
        cause any conversion; it is strictly a signal that does not
        change the string's value. An introducer is also legal before
        standard hex literal and numeric hex literal notation
        (x' and
        literal'0x)>.
      nnnn
Examples:
SELECT _latin1 x'AABBCC'; SELECT _latin1 0xAABBCC;
MySQL determines a literal's character set and collation in the following manner:
            If both _X and COLLATE
             were specified, then
            character set YX and collation
            Y are used.
          
            If _X is specified but
            COLLATE is not specified, then character
            set X and its default collation
            are used.
          
            Otherwise, the character set and collation given by the
            character_set_connection and
            collation_connection system variables are
            used.
          
Examples:
            A string with latin1 character set and
            latin1_german1_ci collation:
          
SELECT _latin1'M眉ller' COLLATE latin1_german1_ci;
            A string with latin1 character set and
            its default collation (that is,
            latin1_swedish_ci):
          
SELECT _latin1'M眉ller';
A string with the connection default character set and collation:
SELECT 'M眉ller';
        Character set introducers and the COLLATE
        clause are implemented according to standard SQL specifications.
      
        An introducer indicates the character set for the following
        string, but does not change now how the parser performs escape
        processing within the string. Escapes are always interpreted by
        the parser according to the character set given by
        character_set_connection.
      
        The following examples show that escape processsing occurs using
        character_set_connection even in the presence
        of an introducer. The examples use SET NAMES
        (which changes character_set_connection, as
        discussed in Section聽10.4, 鈥淐onnection Character Sets and Collations鈥), and display
        the resulting strings using the HEX()
        function so that the exact string contents can be seen.
      
Example 1:
mysql>SET NAMES latin1;Query OK, 0 rows affected (0.01 sec) mysql>SELECT HEX('脿\n'), HEX(_sjis'脿\n');+------------+-----------------+ | HEX('脿\n') | HEX(_sjis'脿\n') | +------------+-----------------+ | E00A | E00A | +------------+-----------------+ 1 row in set (0.00 sec)
        Here, 鈥脿鈥 (hex value
        E0) is followed by
        鈥\n鈥, the escape sequence for
        newline. The escape sequence is interpreted using the
        character_set_connection value of
        latin1 to produce a literal newline (hex
        value 0A). This happens even for the second
        string. That is, the introducer of _sjis does
        not affect the parser's escape processing.
      
Example 2:
mysql>SET NAMES sjis;Query OK, 0 rows affected (0.00 sec) mysql>SELECT HEX('脿\n'), HEX(_latin1'脿\n');+------------+-------------------+ | HEX('脿\n') | HEX(_latin1'脿\n') | +------------+-------------------+ | E05C6E | E05C6E | +------------+-------------------+ 1 row in set (0.04 sec)
        Here, character_set_connection is
        sjis, a character set in which the sequence
        of 鈥脿鈥 followed by
        鈥\鈥 (hex values
        05 and 5C) is a valid
        multi-byte character. Hence, the first two bytes of the string
        are interpreted as a single sjis character,
        and the 鈥\鈥 is not intrepreted as
        an escape character. The following
        鈥n鈥 (hex value
        6E) is not interpreted as part of an escape
        sequence. This is true even for the second string; the
        introducer of _latin1 does not affect escape
        processing.
      
        Standard SQL defines NCHAR or
        NATIONAL CHAR as a way to indicate that a
        CHAR column should use some predefined
        character set. MySQL 5.1 uses
        utf8 as this predefined character set. For
        example, these data type declarations are equivalent:
      
CHAR(10) CHARACTER SET utf8 NATIONAL CHARACTER(10) NCHAR(10)
As are these:
VARCHAR(10) CHARACTER SET utf8 NATIONAL VARCHAR(10) NCHAR VARCHAR(10) NATIONAL CHARACTER VARYING(10) NATIONAL CHAR VARYING(10)
        You can use
        N' to
        create a string in the national character set. These two
        statements are equivalent:
      literal'
SELECT N'some text'; SELECT _utf8'some text';
For information on upgrading character sets to MySQL 5.1 from versions prior to 4.1, see the MySQL 3.23, 4.0, 4.1 Reference Manual.
The following examples show how MySQL determines default character set and collation values.
Example 1: Table and Column Definition
CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1 COLLATE latin1_german1_ci
) DEFAULT CHARACTER SET latin2 COLLATE latin2_bin;
        Here we have a column with a latin1 character
        set and a latin1_german1_ci collation. The
        definition is explicit, so that's straightforward. Notice that
        there is no problem with storing a latin1
        column in a latin2 table.
      
Example 2: Table and Column Definition
CREATE TABLE t1
(
    c1 CHAR(10) CHARACTER SET latin1
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
        This time we have a column with a latin1
        character set and a default collation. Although it might seem
        natural, the default collation is not taken from the table
        level. Instead, because the default collation for
        latin1 is always
        latin1_swedish_ci, column
        c1 has a collation of
        latin1_swedish_ci (not
        latin1_danish_ci).
      
Example 3: Table and Column Definition
CREATE TABLE t1
(
    c1 CHAR(10)
) DEFAULT CHARACTER SET latin1 COLLATE latin1_danish_ci;
        We have a column with a default character set and a default
        collation. In this circumstance, MySQL checks the table level to
        determine the column character set and collation. Consequently,
        the character set for column c1 is
        latin1 and its collation is
        latin1_danish_ci.
      
Example 4: Database, Table, and Column Definition
CREATE DATABASE d1
    DEFAULT CHARACTER SET latin2 COLLATE latin2_czech_ci;
USE d1;
CREATE TABLE t1
(
    c1 CHAR(10)
);
        We create a column without specifying its character set and
        collation. We're also not specifying a character set and a
        collation at the table level. In this circumstance, MySQL checks
        the database level to determine the table settings, which
        thereafter become the column settings.) Consequently, the
        character set for column c1 is
        latin2 and its collation is
        latin2_czech_ci.
      
Several character set and collation system variables relate to a client's interaction with the server. Some of these have been mentioned in earlier sections:
          The server character set and collation can be determined from
          the values of the character_set_server and
          collation_server system variables.
        
          The character set and collation of the default database can be
          determined from the values of the
          character_set_database and
          collation_database system variables.
        
Additional character set and collation system variables are involved in handling traffic for the connection between a client and the server. Every client has connection-related character set and collation system variables.
Consider what a 鈥connection鈥 is: It's what you make when you connect to the server. The client sends SQL statements, such as queries, over the connection to the server. The server sends responses, such as result sets, over the connection back to the client. This leads to several questions about character set and collation handling for client connections, each of which can be answered in terms of system variables:
What character set is the statement in when it leaves the client?
          The server takes the character_set_client
          system variable to be the character set in which statements
          are sent by the client.
        
What character set should the server translate a statement to after receiving it?
          For this, the server uses the
          character_set_connection and
          collation_connection system variables. It
          converts statements sent by the client from
          character_set_client to
          character_set_connection (except for string
          literals that have an introducer such as
          _latin1 or _utf8).
          collation_connection is important for
          comparisons of literal strings. For comparisons of strings
          with column values, collation_connection
          does not matter because columns have their own collation,
          which has a higher collation precedence.
        
What character set should the server translate to before shipping result sets or error messages back to the client?
          The character_set_results system variable
          indicates the character set in which the server returns query
          results to the client. This includes result data such as
          column values, and result metadata such as column names.
        
You can fine-tune the settings for these variables, or you can depend on the defaults (in which case, you can skip the rest of this section).
There are two statements that affect the connection character sets:
SET NAMES 'charset_name' SET CHARACTER SETcharset_name
      SET NAMES indicates what character set the
      client will use to send SQL statements to the server. Thus,
      SET NAMES 'cp1251' tells the server
      鈥future incoming messages from this client are in character
      set cp1251.鈥 It also specifies the
      character set that the server should use for sending results back
      to the client. (For example, it indicates what character set to
      use for column values if you use a SELECT
      statement.)
    
      A SET NAMES '
      statement is equivalent to these three statements:
    x'
SET character_set_client =x; SET character_set_results =x; SET character_set_connection =x;
      Setting character_set_connection to
      x also sets
      collation_connection to the default collation
      for x. It is not necessary to set that
      collation explicitly. To specify a particular collation for the
      character sets, use the optional COLLATE
      clause:
    
SET NAMES 'charset_name' COLLATE 'collation_name'
      SET CHARACTER SET is similar to SET
      NAMES but sets
      character_set_connection and
      collation_connection to
      character_set_database and
      collation_database. A SET CHARACTER
      SET  statement is equivalent
      to these three statements:
    x
SET character_set_client =x; SET character_set_results =x; SET collation_connection = @@collation_database;
      Setting collation_connection also sets
      character_set_connection to the character set
      associated with the collation (equivalent to executing
      SET character_set_connection =
      @@character_set_database). It is not necessary to set
      character_set_connection explicitly.
    
      When a client connects, it sends to the server the name of the
      character set that it wants to use. The server uses the name to
      set the character_set_client,
      character_set_results, and
      character_set_connection system variables. In
      effect, the server performs a SET NAMES
      operation using the character set name.
    
      With the mysql client, it is not necessary to
      execute SET NAMES every time you start up if
      you want to use a character set different from the default. You
      can add the --default-character-set option
      setting to your mysql statement line, or in
      your option file. For example, the following option file setting
      changes the three character set variables set to
      koi8r each time you invoke
      mysql:
    
[mysql] default-character-set=koi8r
      If you are using the mysql client with
      auto-reconnect enabled (which is not recommended), it is
      preferable to use the charset command rather
      than SET NAMES. For example:
    
mysql> charset utf8
Charset changed
      The charset command issues a SET
      NAMES statement, and also changes the default character
      set that is used if mysql reconnects after the
      connection has dropped.
    
      Example: Suppose that column1 is defined as
      CHAR(5) CHARACTER SET latin2. If you do not say
      SET NAMES or SET CHARACTER
      SET, then for SELECT column1 FROM t,
      the server sends back all the values for
      column1 using the character set that the client
      specified when it connected. On the other hand, if you say
      SET NAMES 'latin1' or SET CHARACTER
      SET latin1 before issuing the SELECT
      statement, the server converts the latin2
      values to latin1 just before sending results
      back. Conversion may be lossy if there are characters that are not
      in both character sets.
    
      If you do not want the server to perform any conversion of result
      sets, set character_set_results to
      NULL:
    
SET character_set_results = NULL;
      Note: Currently, UCS-2 cannot be
      used as a client character set, which means that SET
      NAMES 'ucs2' does not work.
    
To see the values of the character set and collation system variables that apply to your connection, use these statements:
SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';
The following sections discuss various aspects of character set collations.
        With the COLLATE clause, you can override
        whatever the default collation is for a comparison.
        COLLATE may be used in various parts of SQL
        statements. Here are some examples:
      
            With ORDER BY:
          
SELECT k FROM t1 ORDER BY k COLLATE latin1_german2_ci;
            With AS:
          
SELECT k COLLATE latin1_german2_ci AS k1 FROM t1 ORDER BY k1;
            With GROUP BY:
          
SELECT k FROM t1 GROUP BY k COLLATE latin1_german2_ci;
With aggregate functions:
SELECT MAX(k COLLATE latin1_german2_ci) FROM t1;
            With DISTINCT:
          
SELECT DISTINCT k COLLATE latin1_german2_ci FROM t1;
            With WHERE:
          
     SELECT *
     FROM t1
     WHERE _latin1 'M眉ller' COLLATE latin1_german2_ci = k;
     SELECT *
     FROM t1
     WHERE k LIKE _latin1 'M眉ller' COLLATE latin1_german2_ci;
            With HAVING:
          
SELECT k FROM t1 GROUP BY k HAVING k = _latin1 'M眉ller' COLLATE latin1_german2_ci;
        The COLLATE clause has high precedence
        (higher than ||), so the following two
        expressions are equivalent:
      
x || y COLLATE z x || (y COLLATE z)
        The BINARY operator casts the string
        following it to a binary string. This is an easy way to force a
        comparison to be done byte by byte rather than character by
        character. BINARY also causes trailing spaces
        to be significant.
      
mysql>SELECT 'a' = 'A';-> 1 mysql>SELECT BINARY 'a' = 'A';-> 0 mysql>SELECT 'a' = 'a ';-> 1 mysql>SELECT BINARY 'a' = 'a ';-> 0
        BINARY  is
        shorthand for strCAST(.
      str AS
        BINARY)
        The BINARY attribute in character column
        definitions has a different effect. A character column defined
        with the BINARY attribute is assigned the
        binary collation of the column's character set. Every character
        set has a binary collation. For example, the binary collation
        for the latin1 character set is
        latin1_bin, so if the table default character
        set is latin1, these two column definitions
        are equivalent:
      
CHAR(10) BINARY CHAR(10) CHARACTER SET latin1 COLLATE latin1_bin
        The effect of BINARY as a column attribute
        differs from its effect prior to MySQL 4.1. Formerly,
        BINARY resulted in a column that was treated
        as a binary string. A binary string is a string of bytes that
        has no character set or collation, which differs from a
        non-binary character string that has a binary collation. For
        both types of strings, comparisons are based on the numeric
        values of the string unit, but for non-binary strings the unit
        is the character and some character sets allow multi-byte
        characters. Section聽11.4.2, 鈥淭he BINARY and VARBINARY Types鈥.
      
        The use of CHARACTER SET binary in the
        definition of a CHAR,
        VARCHAR, or TEXT column
        causes the column to be treated as a binary data type. For
        example, the following pairs of definitions are equivalent:
      
CHAR(10) CHARACTER SET binary BINARY(10) VARCHAR(10) CHARACTER SET binary VARBINARY(10) TEXT CHARACTER SET binary BLOB
        In the great majority of statements, it is obvious what
        collation MySQL uses to resolve a comparison operation. For
        example, in the following cases, it should be clear that the
        collation is the collation of column x:
      
SELECT x FROM T ORDER BY x; SELECT x FROM T WHERE x = x; SELECT DISTINCT x FROM T;
However, when multiple operands are involved, there can be ambiguity. For example:
SELECT x FROM T WHERE x = 'Y';
        Should this query use the collation of the column
        x, or of the string literal
        'Y'?
      
        Standard SQL resolves such questions using what used to be
        called 鈥coercibility鈥 rules. Basically, this means:
        Both x and 'Y' have
        collations, so which collation takes precedence? This can be
        difficult to resolve, but the following rules cover most
        situations:
      
            An explicit COLLATE clause has a
            coercibility of 0. (Not coercible at all.)
          
The concatenation of two strings with different collations has a coercibility of 1.
The collation of a column or a stored routine parameter or local variable has a coercibility of 2.
            A 鈥system constant鈥 (the string returned by
            functions such as USER() or
            VERSION()) has a coercibility of 3.
          
A literal's collation has a coercibility of 4.
            NULL or an expression that is derived
            from NULL has a coercibility of 5.
          
The preceding coercibility values are current for MySQL 5.1.
Those rules resolve ambiguities in the following manner:
Use the collation with the lowest coercibility value.
If both sides have the same coercibility, then it is an error if the collations aren't the same.
Examples:
| column1 = 'A' | Use collation of column1 | 
| column1 = 'A' COLLATE x | Use collation of 'A' COLLATE x | 
| column1 COLLATE x = 'A' COLLATE y | Error | 
        The COERCIBILITY() function can be used to
        determine the coercibility of a string expression:
      
mysql>SELECT COERCIBILITY('A' COLLATE latin1_swedish_ci);-> 0 mysql>SELECT COERCIBILITY(VERSION());-> 3 mysql>SELECT COERCIBILITY('A');-> 4
        Each character set has one or more collations, but each
        collation is associated with one and only one character set.
        Therefore, the following statement causes an error message
        because the latin2_bin collation is not legal
        with the latin1 character set:
      
mysql> SELECT _latin1 'x' COLLATE latin2_bin;
ERROR 1253 (42000): COLLATION 'latin2_bin' is not valid
for CHARACTER SET 'latin1'
        Suppose that column X in table
        T has these latin1 column
        values:
      
Muffler M眉ller MX Systems MySQL
Suppose also that the column values are retrieved using the following statement:
SELECT X FROM T ORDER BY X COLLATE collation_name;
        The following table shows the resulting order of the values if
        we use ORDER BY with different collations:
      
| latin1_swedish_ci | latin1_german1_ci | latin1_german2_ci | 
| Muffler | Muffler | M眉ller | 
| MX Systems | M眉ller | Muffler | 
| M眉ller | MX Systems | MX Systems | 
| MySQL | MySQL | MySQL | 
        The character that causes the different sort orders in this
        example is the U with two dots over it
        (眉), which the Germans call
        鈥U-umlaut.鈥
      
            The first column shows the result of the
            SELECT using the Swedish/Finnish
            collating rule, which says that U-umlaut sorts with Y.
          
            The second column shows the result of the
            SELECT using the German DIN-1 rule, which
            says that U-umlaut sorts with U.
          
            The third column shows the result of the
            SELECT using the German DIN-2 rule, which
            says that U-umlaut sorts with UE.
          
This section describes operations that take character set information into account.
MySQL has many operators and functions that return a string. This section answers the question: What is the character set and collation of such a string?
        For simple functions that take string input and return a string
        result as output, the output's character set and collation are
        the same as those of the principal input value. For example,
        UPPER( returns a
        string whose character string and collation are the same as that
        of X)X. The same applies for
        INSTR(), LCASE(),
        LOWER(), LTRIM(),
        MID(), REPEAT(),
        REPLACE(), REVERSE(),
        RIGHT(), RPAD(),
        RTRIM(), SOUNDEX(),
        SUBSTRING(), TRIM(),
        UCASE(), and UPPER().
      
        Note: The REPLACE() function, unlike all
        other functions, always ignores the collation of the string
        input and performs a case-sensitive comparison.
      
        If a string input or function result is a binary string, the
        string has no character set or collation. This can be checked by
        using the CHARSET() and
        COLLATION() functions, both of which return
        binary to indicate that their argument is a
        binary string:
      
mysql> SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');
+---------------------+-----------------------+
| CHARSET(BINARY 'a') | COLLATION(BINARY 'a') |
+---------------------+-----------------------+
| binary              | binary                |
+---------------------+-----------------------+
For operations that combine multiple string inputs and return a single string output, the 鈥aggregation rules鈥 of standard SQL apply for determining the collation of the result:
            If an explicit COLLATE
             occurs, use
            XX.
          
            If explicit COLLATE
             and XCOLLATE
             occur, raise an
            error.
          Y
            Otherwise, if all collations are
            X, use
            X.
          
Otherwise, the result has no collation.
        For example, with CASE ... WHEN a THEN b WHEN b THEN c
        COLLATE , the
        resulting collation is X ENDX. The same
        applies for UNION, ||,
        CONCAT(), ELT(),
        GREATEST(), IF(), and
        LEAST().
      
        For operations that convert to character data, the character set
        and collation of the strings that result from the operations are
        defined by the character_set_connection and
        collation_connection system variables. This
        applies only to CAST(),
        CONV(), FORMAT(),
        HEX(), and SPACE().
      
        If you are uncertain about the character set or collation of the
        result returned by a string function, you can use the
        CHARSET() or COLLATE()
        function to find out:
      
mysql> SELECT USER(), CHARSET(USER()), COLLATION(USER());
+----------------+-----------------+-------------------+
| USER()         | CHARSET(USER()) | COLLATION(USER()) |
+----------------+-----------------+-------------------+
| test@localhost | utf8            | utf8_general_ci   | 
+----------------+-----------------+-------------------+
        CONVERT() provides a way to convert data
        between different character sets. The syntax is:
      
CONVERT(exprUSINGtranscoding_name)
In MySQL, transcoding names are the same as the corresponding character set names.
Examples:
SELECT CONVERT(_latin1'M眉ller' USING utf8);
INSERT INTO utf8table (utf8column)
    SELECT CONVERT(latin1field USING utf8) FROM latin1table;
        CONVERT(... USING ...) is implemented
        according to the standard SQL specification.
      
        You may also use CAST() to convert a string
        to a different character set. The syntax is:
      
CAST(character_stringAScharacter_data_typeCHARACTER SETcharset_name)
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8);
        If you use CAST() without specifying
        CHARACTER SET, the resulting character set
        and collation are defined by the
        character_set_connection and
        collation_connection system variables. If you
        use CAST() with CHARACTER SET
        X, the resulting character set and collation are
        X and the default collation of
        X.
      
        You may not use a COLLATE clause inside a
        CAST(), but you may use it outside. That is,
        CAST(... COLLATE ...) is illegal, but
        CAST(...) COLLATE ... is legal.
      
Example:
SELECT CAST(_latin1'test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;
        Several SHOW statements provide additional
        character set information. These include SHOW CHARACTER
        SET, SHOW COLLATION, SHOW
        CREATE DATABASE, SHOW CREATE TABLE
        and SHOW COLUMNS. These statements are
        described here briefly.
      
        For more information, see Section聽13.5.4, 鈥SHOW Syntax鈥.
      
        INFORMATION_SCHEMA has several tables that
        contain information similar to that displayed by the
        SHOW statements. For example, the
        CHARACTER_SETS and
        COLLATIONS tables contain the information
        displayed by SHOW CHARACTER SET and
        SHOW COLLATION.
        Chapter聽22, The INFORMATION_SCHEMA Database.
      
        The SHOW CHARACTER SET command shows all
        available character sets. It takes an optional
        LIKE clause that indicates which character
        set names to match. For example:
      
mysql> SHOW CHARACTER SET LIKE 'latin%';
+---------+-----------------------------+-------------------+--------+
| Charset | Description                 | Default collation | Maxlen |
+---------+-----------------------------+-------------------+--------+
| latin1  | cp1252 West European        | latin1_swedish_ci |      1 |
| latin2  | ISO 8859-2 Central European | latin2_general_ci |      1 |
| latin5  | ISO 8859-9 Turkish          | latin5_turkish_ci |      1 |
| latin7  | ISO 8859-13 Baltic          | latin7_general_ci |      1 |
+---------+-----------------------------+-------------------+--------+
        The output from SHOW COLLATION includes all
        available character sets. It takes an optional
        LIKE clause that indicates which collation
        names to match. For example:
      
mysql> SHOW COLLATION LIKE 'latin1%';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         |          |       0 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       0 |
| latin1_danish_ci  | latin1  | 15 |         |          |       0 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       0 |
| latin1_general_ci | latin1  | 48 |         |          |       0 |
| latin1_general_cs | latin1  | 49 |         |          |       0 |
| latin1_spanish_ci | latin1  | 94 |         |          |       0 |
+-------------------+---------+----+---------+----------+---------+
        SHOW CREATE DATABASE displays the
        CREATE DATABASE statement that creates a
        given database:
      
mysql> SHOW CREATE DATABASE test;
+----------+-----------------------------------------------------------------+
| Database | Create Database                                                 |
+----------+-----------------------------------------------------------------+
| test     | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */ |
+----------+-----------------------------------------------------------------+
        If no COLLATE clause is shown, the default
        collation for the character set applies.
      
        SHOW CREATE TABLE is similar, but displays
        the CREATE TABLE statement to create a given
        table. The column definitions indicate any character set
        specifications, and the table options include character set
        information.
      
        The SHOW COLUMNS statement displays the
        collations of a table's columns when invoked as SHOW
        FULL COLUMNS. Columns with CHAR,
        VARCHAR, or TEXT data
        types have collations. Numeric and other non-character types
        have no collation (indicated by NULL as the
        Collation value). For example:
      
mysql> SHOW FULL COLUMNS FROM person\G
*************************** 1. row ***************************
     Field: id
      Type: smallint(5) unsigned
 Collation: NULL
      Null: NO
       Key: PRI
   Default: NULL
     Extra: auto_increment
Privileges: select,insert,update,references
   Comment:
*************************** 2. row ***************************
     Field: name
      Type: char(60)
 Collation: latin1_swedish_ci
      Null: NO
       Key:
   Default:
     Extra:
Privileges: select,insert,update,references
   Comment:
The character set is not part of the display but is implied by the collation name.
MySQL 5.1 supports two character sets for storing Unicode data:
          ucs2, the UCS-2 Unicode character set.
        
          utf8, the UTF-8 encoding of the Unicode
          character set.
        
      In UCS-2 (binary Unicode representation), every character is
      represented by a two-byte Unicode code with the most significant
      byte first. For example: LATIN CAPITAL LETTER A
      has the code 0x0041 and it is stored as a
      two-byte sequence: 0x00 0x41. CYRILLIC
      SMALL LETTER YERU (Unicode 0x044B) is
      stored as a two-byte sequence: 0x04 0x4B. For
      Unicode characters and their codes, please refer to the
      Unicode Home Page.
    
The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-ending byte order or a BOM, in which case conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.
      Currently, UCS-2 cannot be used as a client character set, which
      means that SET NAMES 'ucs2' does not work.
    
UTF-8 (Unicode Transform representation) is an alternative way to store Unicode data. It is implemented according to RFC 3629. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a two-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use three-byte sequences.
RFC 3629 describes encoding sequences that take from one to four bytes. Currently, MySQL support for UTF-8 does not include four-byte sequences. (An older standard for UTF-8 encoding is given by RFC 2279, which describes UTF-8 sequences that take from one to six bytes. RFC 3629 renders RFC 2279 obsolete; for this reason, sequences with five and six bytes are no longer used.)
      Tip: To save space with UTF-8,
      use VARCHAR instead of CHAR.
      Otherwise, MySQL must reserve three bytes for each character in a
      CHAR CHARACTER SET utf8 column because that is
      the maximum possible length. For example, MySQL must reserve 30
      bytes for a CHAR(10) CHARACTER SET utf8 column.
    
      Metadata is 鈥the data about the
      data.鈥 Anything that describes the
      database 鈥 as opposed to being the
      contents of the database 鈥 is metadata.
      Thus column names, database names, usernames, version names, and
      most of the string results from SHOW are
      metadata. This is also true of the contents of tables in
      INFORMATION_SCHEMA, because those tables by
      definition contain information about database objects.
    
Representation of metadata must satisfy these requirements:
          All metadata must be in the same character set. Otherwise,
          neither the SHOW commands nor
          SELECT statements for tables in
          INFORMATION_SCHEMA would work properly
          because different rows in the same column of the results of
          these operations would be in different character sets.
        
Metadata must include all characters in all languages. Otherwise, users would not be able to name columns and tables using their own languages.
To satisfy both requirements, MySQL stores metadata in a Unicode character set, namely UTF-8. This does not cause any disruption if you never use accented or non-Latin characters. But if you do, you should be aware that metadata is in UTF-8.
      The metadata requirements mean that the return values of the
      USER(), CURRENT_USER(),
      SESSION_USER(),
      SYSTEM_USER(), DATABASE(),
      and VERSION() functions have the UTF-8
      character set by default.
    
      The server sets the character_set_system system
      variable to the name of the metadata character set:
    
mysql> SHOW VARIABLES LIKE 'character_set_system';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| character_set_system | utf8  |
+----------------------+-------+
      Storage of metadata using Unicode does not
      mean that the server returns headers of columns and the results of
      DESCRIBE functions in the
      character_set_system character set by default.
      When you use SELECT column1 FROM t, the name
      column1 itself is returned from the server to
      the client in the character set determined by the value of the
      character_set_results system variable, which
      has a default value of latin1. If you want the
      server to pass metadata results back in a different character set,
      use the SET NAMES statement to force the server
      to perform character set conversion. SET NAMES
      sets the character_set_results and other
      related system variables. (See
      Section聽10.4, 鈥淐onnection Character Sets and Collations鈥.) Alternatively, a client
      program can perform the conversion after receiving the result from
      the server. It is more efficient for the client perform the
      conversion, but this option is not always available for all
      clients.
    
      If character_set_results is set to
      NULL, no conversion is performed and the server
      returns metadata using its original character set (the set
      indicated by character_set_system).
    
Error messages returned from the server to the client are converted to the client character set automatically, as with metadata.
      If you are using (for example) the USER()
      function for comparison or assignment within a single statement,
      don't worry. MySQL performs some automatic conversion for you.
    
SELECT * FROM Table1 WHERE USER() = latin1_column;
      This works because the contents of
      latin1_column are automatically converted to
      UTF-8 before the comparison.
    
INSERT INTO Table1 (latin1_column) SELECT USER();
      This works because the contents of USER() are
      automatically converted to latin1 before the
      assignment. Automatic conversion is not fully implemented yet, but
      should work correctly in a later version.
    
Although automatic conversion is not in the SQL standard, the SQL standard document does say that every character set is (in terms of supported characters) a 鈥subset鈥 of Unicode. Because it is a well-known principle that 鈥what applies to a superset can apply to a subset,鈥 we believe that a collation for Unicode can apply for comparisons with non-Unicode strings.
      To convert a binary or non-binary string column to use a
      particular character set, use ALTER TABLE. For
      successful conversion to occur, one of the following conditions
      must apply:
    
          If the column has a binary data type
          (BINARY, VARBINARY,
          BLOB), all the values that it contains must
          be encoded using a single character set (the character set
          you're converting the column to). If you use a binary column
          to store information in multiple character sets, MySQL has no
          way to know which values use which character set and cannot
          convert the data properly.
        
          If the column has a non-binary data type
          (CHAR, VARCHAR,
          TEXT), its contents should be encoded in
          the column's character set, not some other character set. If
          the contents are encoded in a different character set, you can
          convert the column to use a binary data type first, and then
          to a non-binary column with the desired character set.
        
      Suppose that a table t has a binary column
      named col1 defined as
      BINARY(50). Assuming that the information in
      the column is encoded using a single character set, you can
      convert it to a non-binary column that has that character set. For
      example, if col1 contains binary data
      representing characters in the greek character
      set, you can convert it as follows:
    
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET greek;
      Suppose that table t has a non-binary column
      named col1 defined as CHAR(50)
      CHARACTER SET latin1 but you want to convert it to use
      utf8 so that you can store values from many
      languages. The following statement accomplishes this:
    
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;
Conversion may be lossy if the column contains characters that are not in both character sets.
      A special case occurs if you have old tables from MySQL 4.0 or
      earlier where a non-binary column contains values that actually
      are encoded in a character set different from the server's default
      character set. For example, an application might have stored
      sjis values in a column, even though MySQL's
      default character set was latin1. It is
      possible to convert the column to use the proper character set but
      an additional step is required. Suppose that the server's default
      character set was latin1 and
      col1 is defined as CHAR(50)
      but its contents are sjis values. The first
      step is to convert the column to a binary data type, which removes
      the existing character set information without performing any
      character conversion:
    
ALTER TABLE t MODIFY col1 BINARY(50);
The next step is to convert the column to a non-binary data type with the proper character set:
ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET sjis;
      This procedure requires that the table not have been modified
      already with statements such as INSERT or
      UPDATE after an upgrade to MySQL 4.1 or later.
      In that case, MySQL would store new values in the column using
      latin1, and the column will contain a mix of
      sjis and latin1 values and
      cannot be converted properly.
    
      If you specified attributes when creating a column initially, you
      should also specify them when altering the table with
      ALTER TABLE. For example, if you specified
      NOT NULL and an explicit
      DEFAULT value, you should also provide them in
      the ALTER TABLE statement. Otherwise, the
      resulting column definition will not include those attributes.
    
MySQL supports 70+ collations for 30+ character sets. This section indicates which character sets MySQL supports. There is one subsection for each group of related character sets. For each character set, the allowable collations are listed.
      You can always list the available character sets and their default
      collations with the SHOW CHARACTER SET
      statement:
    
mysql> SHOW CHARACTER SET;
+----------+-----------------------------+---------------------+
| Charset  | Description                 | Default collation   |
+----------+-----------------------------+---------------------+
| big5     | Big5 Traditional Chinese    | big5_chinese_ci     |
| dec8     | DEC West European           | dec8_swedish_ci     |
| cp850    | DOS West European           | cp850_general_ci    |
| hp8      | HP West European            | hp8_english_ci      |
| koi8r    | KOI8-R Relcom Russian       | koi8r_general_ci    |
| latin1   | cp1252 West European        | latin1_swedish_ci   |
| latin2   | ISO 8859-2 Central European | latin2_general_ci   |
| swe7     | 7bit Swedish                | swe7_swedish_ci     |
| ascii    | US ASCII                    | ascii_general_ci    |
| ujis     | EUC-JP Japanese             | ujis_japanese_ci    |
| sjis     | Shift-JIS Japanese          | sjis_japanese_ci    |
| hebrew   | ISO 8859-8 Hebrew           | hebrew_general_ci   |
| tis620   | TIS620 Thai                 | tis620_thai_ci      |
| euckr    | EUC-KR Korean               | euckr_korean_ci     |
| koi8u    | KOI8-U Ukrainian            | koi8u_general_ci    |
| gb2312   | GB2312 Simplified Chinese   | gb2312_chinese_ci   |
| greek    | ISO 8859-7 Greek            | greek_general_ci    |
| cp1250   | Windows Central European    | cp1250_general_ci   |
| gbk      | GBK Simplified Chinese      | gbk_chinese_ci      |
| latin5   | ISO 8859-9 Turkish          | latin5_turkish_ci   |
| armscii8 | ARMSCII-8 Armenian          | armscii8_general_ci |
| utf8     | UTF-8 Unicode               | utf8_general_ci     |
| ucs2     | UCS-2 Unicode               | ucs2_general_ci     |
| cp866    | DOS Russian                 | cp866_general_ci    |
| keybcs2  | DOS Kamenicky Czech-Slovak  | keybcs2_general_ci  |
| macce    | Mac Central European        | macce_general_ci    |
| macroman | Mac West European           | macroman_general_ci |
| cp852    | DOS Central European        | cp852_general_ci    |
| latin7   | ISO 8859-13 Baltic          | latin7_general_ci   |
| cp1251   | Windows Cyrillic            | cp1251_general_ci   |
| cp1256   | Windows Arabic              | cp1256_general_ci   |
| cp1257   | Windows Baltic              | cp1257_general_ci   |
| binary   | Binary pseudo charset       | binary              |
| geostd8  | GEOSTD8 Georgian            | geostd8_general_ci  |
| cp932    | SJIS for Windows Japanese   | cp932_japanese_ci   |
| eucjpms  | UJIS for Windows Japanese   | eucjpms_japanese_ci |
+----------+-----------------------------+---------------------+
MySQL has two Unicode character sets. You can store text in about 650 languages using these character sets.
            ucs2 (UCS-2 Unicode) collations:
          
                ucs2_bin
              
                ucs2_czech_ci
              
                ucs2_danish_ci
              
                ucs2_esperanto_ci
              
                ucs2_estonian_ci
              
                ucs2_general_ci (default)
              
                ucs2_hungarian_ci
              
                ucs2_icelandic_ci
              
                ucs2_latvian_ci
              
                ucs2_lithuanian_ci
              
                ucs2_persian_ci
              
                ucs2_polish_ci
              
                ucs2_roman_ci
              
                ucs2_romanian_ci
              
                ucs2_slovak_ci
              
                ucs2_slovenian_ci
              
                ucs2_spanish2_ci
              
                ucs2_spanish_ci
              
                ucs2_swedish_ci
              
                ucs2_turkish_ci
              
                ucs2_unicode_ci
              
            utf8 (UTF-8 Unicode) collations:
          
                utf8_bin
              
                utf8_czech_ci
              
                utf8_danish_ci
              
                utf8_esperanto_ci
              
                utf8_estonian_ci
              
                utf8_general_ci (default)
              
                utf8_hungarian_ci
              
                utf8_icelandic_ci
              
                utf8_latvian_ci
              
                utf8_lithuanian_ci
              
                utf8_persian_ci
              
                utf8_polish_ci
              
                utf8_roman_ci
              
                utf8_romanian_ci
              
                utf8_slovak_ci
              
                utf8_slovenian_ci
              
                utf8_spanish2_ci
              
                utf8_spanish_ci
              
                utf8_swedish_ci
              
                utf8_turkish_ci
              
                utf8_unicode_ci
              
The MySQL implementation of UCS-2 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of UCS-2 values. Other database systems might use little-ending byte order or a BOM, in which case conversion of UCS-2 values will need to be performed when transferring data between those systems and MySQL.
        Note that in the ucs2_roman_ci and
        utf8_roman_ci collations,
        I and J compare as equals,
        and U and V compare as
        equals.
      
        The ucs2_hungarian_ci and
        utf8_hungarian_ci collations were added in
        MySQL 5.1.5.
      
        MySQL implements the utf8_unicode_ci
        collation according to the Unicode Collation Algorithm (UCA)
        described at
        http://www.unicode.org/reports/tr10/. The
        collation uses the version-4.0.0 UCA weight keys:
        http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
        The following discussion uses
        utf8_unicode_ci, but it is also true for
        ucs2_unicode_ci.
      
        Currently, the utf8_unicode_ci collation has
        only partial support for the Unicode Collation Algorithm. Some
        characters are not supported yet. Also, combining marks are not
        fully supported. This affects primarily Vietnamese, Yoruba, and
        some smaller languages such as Navajo.
      
        The most significant feature in
        utf8_unicode_ci is that it supports
        expansions; that is, when one character compares as equal to
        combinations of other characters. For example, in German and
        some other languages 鈥脽鈥 is
        equal to 鈥ss鈥.
      
        utf8_general_ci is a legacy collation that
        does not support expansions. It can make only one-to-one
        comparisons between characters. This means that comparisons for
        the utf8_general_ci collation are faster, but
        slightly less correct, than comparisons for
        utf8_unicode_ci.
      
        For example, the following equalities hold in both
        utf8_general_ci and
        utf8_unicode_ci:
      
脛 = A 脰 = O 脺 = U
        A difference between the collations is that this is true for
        utf8_general_ci:
      
脽 = s
        Whereas this is true for utf8_unicode_ci:
      
脽 = ss
        MySQL implements language-specific collations for the
        utf8 character set only if the ordering with
        utf8_unicode_ci does not work well for a
        language. For example, utf8_unicode_ci works
        fine for German and French, so there is no need to create
        special utf8 collations for these two
        languages.
      
        utf8_general_ci also is satisfactory for both
        German and French, except that
        鈥脽鈥 is equal to
        鈥s鈥, and not to
        鈥ss鈥. If this is acceptable for
        your application, then you should use
        utf8_general_ci because it is faster.
        Otherwise, use utf8_unicode_ci because it is
        more accurate.
      
        utf8_swedish_ci, like other
        utf8 language-specific collations, is derived
        from utf8_unicode_ci with additional language
        rules. For example, in Swedish, the following relationship
        holds, which is not something expected by a German or French
        speaker:
      
脺 = Y < 脰
        The utf8_spanish_ci and
        utf8_spanish2_ci collations correspond to
        modern Spanish and traditional Spanish, respectively. In both
        collations, 鈥帽鈥 (n-tilde) is a
        separate letter between 鈥n鈥 and
        鈥o鈥. In addition, for traditional
        Spanish, 鈥ch鈥 is a separate
        letter between 鈥c鈥 and
        鈥d鈥, and
        鈥ll鈥 is a separate letter between
        鈥l鈥 and
        鈥m鈥
      
Western European character sets cover most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English.
            ascii (US ASCII) collations:
          
                ascii_bin
              
                ascii_general_ci (default)
              
            cp850 (DOS West European) collations:
          
                cp850_bin
              
                cp850_general_ci (default)
              
            dec8 (DEC Western European) collations:
          
                dec8_bin
              
                dec8_swedish_ci (default)
              
            hp8 (HP Western European) collations:
          
                hp8_bin
              
                hp8_english_ci (default)
              
            latin1 (cp1252 West European) collations:
          
                latin1_bin
              
                latin1_danish_ci
              
                latin1_general_ci
              
                latin1_general_cs
              
                latin1_german1_ci
              
                latin1_german2_ci
              
                latin1_spanish_ci
              
                latin1_swedish_ci (default)
              
            latin1 is the default character set.
            MySQL's latin1 is the same as the Windows
            cp1252 character set. This means it is
            the same as the official ISO 8859-1 or
            IANA (Internet Assigned Numbers Authority)
            latin1, but IANA
            latin1 treats the code points between
            0x80 and 0x9f as
            鈥undefined,鈥 whereas cp1252,
            and therefore MySQL's latin1, assign
            characters for those positions. For example,
            0x80 is the Euro sign. For the
            鈥undefined鈥 entries in
            cp1252, MySQL translates
            0x81 to Unicode
            0x0081, 0x8d to
            0x008d, 0x8f to
            0x008f, 0x90 to
            0x0090, and 0x9d to
            0x009d.
          
            The latin1_swedish_ci collation is the
            default that probably is used by the majority of MySQL
            customers. Although it is frequently said that it is based
            on the Swedish/Finnish collation rules, there are Swedes and
            Finns who disagree with this statement.
          
            The latin1_german1_ci and
            latin1_german2_ci collations are based on
            the DIN-1 and DIN-2 standards, where DIN stands for
            Deutsches Institut f眉r
            Normung (the German equivalent of ANSI).
            DIN-1 is called the 鈥dictionary collation鈥 and
            DIN-2 is called the 鈥phone book collation.鈥
          
                latin1_german1_ci (dictionary) rules:
              
脛 = A 脰 = O 脺 = U 脽 = s
                latin1_german2_ci (phone-book) rules:
              
脛 = AE 脰 = OE 脺 = UE 脽 = ss
            In the latin1_spanish_ci collation,
            鈥帽鈥 (n-tilde) is a separate
            letter between 鈥n鈥 and
            鈥o鈥.
          
            macroman (Mac West European) collations:
          
                macroman_bin
              
                macroman_general_ci (default)
              
            swe7 (7bit Swedish) collations:
          
                swe7_bin
              
                swe7_swedish_ci (default)
              
MySQL provides some support for character sets used in the Czech Republic, Slovakia, Hungary, Romania, Slovenia, Croatia, Poland, and Serbia (Latin).
            cp1250 (Windows Central European)
            collations:
          
                cp1250_bin
              
                cp1250_croatian_ci
              
                cp1250_czech_cs
              
                cp1250_general_ci (default)
              
                cp1250_polish_ci
              
            cp852 (DOS Central European) collations:
          
                cp852_bin
              
                cp852_general_ci (default)
              
            keybcs2 (DOS Kamenicky Czech-Slovak)
            collations:
          
                keybcs2_bin
              
                keybcs2_general_ci (default)
              
            latin2 (ISO 8859-2 Central European)
            collations:
          
                latin2_bin
              
                latin2_croatian_ci
              
                latin2_czech_cs
              
                latin2_general_ci (default)
              
                latin2_hungarian_ci
              
            macce (Mac Central European) collations:
          
                macce_bin
              
                macce_general_ci (default)
              
South European and Middle Eastern character sets supported by MySQL include Armenian, Arabic, Georgian, Greek, Hebrew, and Turkish.
            armscii8 (ARMSCII-8 Armenian) collations:
          
                armscii8_bin
              
                armscii8_general_ci (default)
              
            cp1256 (Windows Arabic) collations:
          
                cp1256_bin
              
                cp1256_general_ci (default)
              
            geostd8 (GEOSTD8 Georgian) collations:
          
                geostd8_bin
              
                geostd8_general_ci (default)
              
            greek (ISO 8859-7 Greek) collations:
          
                greek_bin
              
                greek_general_ci (default)
              
            hebrew (ISO 8859-8 Hebrew) collations:
          
                hebrew_bin
              
                hebrew_general_ci (default)
              
            latin5 (ISO 8859-9 Turkish) collations:
          
                latin5_bin
              
                latin5_turkish_ci (default)
              
The Baltic character sets cover Estonian, Latvian, and Lithuanian languages.
            cp1257 (Windows Baltic) collations:
          
                cp1257_bin
              
                cp1257_general_ci (default)
              
                cp1257_lithuanian_ci
              
            latin7 (ISO 8859-13 Baltic) collations:
          
                latin7_bin
              
                latin7_estonian_cs
              
                latin7_general_ci (default)
              
                latin7_general_cs
              
The Cyrillic character sets and collations are for use with Belarusian, Bulgarian, Russian, Ukrainian, and Serbian (Cyrillic) languages.
            cp1251 (Windows Cyrillic) collations:
          
                cp1251_bin
              
                cp1251_bulgarian_ci
              
                cp1251_general_ci (default)
              
                cp1251_general_cs
              
                cp1251_ukrainian_ci
              
            cp866 (DOS Russian) collations:
          
                cp866_bin
              
                cp866_general_ci (default)
              
            koi8r (KOI8-R Relcom Russian) collations:
          
                koi8r_bin
              
                koi8r_general_ci (default)
              
            koi8u (KOI8-U Ukrainian) collations:
          
                koi8u_bin
              
                koi8u_general_ci (default)
              
        The Asian character sets that we support include Chinese,
        Japanese, Korean, and Thai. These can be complicated. For
        example, the Chinese sets must allow for thousands of different
        characters. See Section聽10.10.7.1, 鈥淭he cp932 Character Set鈥, for additional
        information about the cp932 and
        sjis character sets.
      
For answers to some common questions and problems relating support for Asian character sets in MySQL, see Section聽A.11, 鈥淢ySQL 5.1 FAQ 鈥 MySQL Chinese, Japanese, and Korean Character Sets鈥.
            big5 (Big5 Traditional Chinese)
            collations:
          
                big5_bin
              
                big5_chinese_ci (default)
              
            cp932 (SJIS for Windows Japanese)
            collations:
          
                cp932_bin
              
                cp932_japanese_ci (default)
              
            eucjpms (UJIS for Windows Japanese)
            collations:
          
                eucjpms_bin
              
                eucjpms_japanese_ci (default)
              
            euckr (EUC-KR Korean) collations:
          
                euckr_bin
              
                euckr_korean_ci (default)
              
            gb2312 (GB2312 Simplified Chinese)
            collations:
          
                gb2312_bin
              
                gb2312_chinese_ci (default)
              
            gbk (GBK Simplified Chinese) collations:
          
                gbk_bin
              
                gbk_chinese_ci (default)
              
            sjis (Shift-JIS Japanese) collations:
          
                sjis_bin
              
                sjis_japanese_ci (default)
              
            tis620 (TIS620 Thai) collations:
          
                tis620_bin
              
                tis620_thai_ci (default)
              
            ujis (EUC-JP Japanese) collations:
          
                ujis_bin
              
                ujis_japanese_ci (default)
              
          Why is cp932
          needed?
        
          In MySQL, the sjis character set
          corresponds to the Shift_JIS character set
          defined by IANA, which supports JIS X0201 and JIS X0208
          characters. (See
          http://www.iana.org/assignments/character-sets.)
        
          However, the meaning of 鈥SHIFT JIS鈥 as a
          descriptive term has become very vague and it often includes
          the extensions to Shift_JIS that are
          defined by various vendors.
        
          For example, 鈥SHIFT JIS鈥 used in Japanese Windows
          environments is a Microsoft extension of
          Shift_JIS and its exact name is
          Microsoft Windows Codepage : 932 or
          cp932. In addition to the characters
          supported by Shift_JIS,
          cp932 supports extension characters such as
          NEC special characters, NEC selected 鈥 IBM extended
          characters, and IBM extended characters.
        
Many Japanese users have experienced problems using these extension characters. These problems stem from the following factors:
MySQL automatically converts character sets.
              Character sets are converted via Unicode
              (ucs2).
            
              The sjis character set does not support
              the conversion of these extension characters.
            
There are several conversion rules from so-called 鈥SHIFT JIS鈥 to Unicode, and some characters are converted to Unicode differently depending on the conversion rule. MySQL supports only one of these rules (described later).
          The MySQL cp932 character set is designed
          to solve these problems.
        
          Because MySQL supports character set conversion, it is
          important to separate IANA Shift_JIS and
          cp932 into two different character sets
          because they provide different conversion rules.
        
          How does cp932 differ
          from sjis?
        
          The cp932 character set differs from
          sjis in the following ways:
        
              cp932 supports NEC special characters,
              NEC selected 鈥 IBM extended characters, and IBM
              selected characters.
            
              Some cp932 characters have two
              different code points, both of which convert to the same
              Unicode code point. When converting from Unicode back to
              cp932, one of the code points must be
              selected. For this 鈥round trip conversion,鈥
              the rule recommended by Microsoft is used. (See
              http://support.microsoft.com/kb/170559/EN-US/.)
            
The conversion rule works like this:
If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
If the character is in both IBM selected characters and NEC selected 鈥 IBM extended characters, use the code point of IBM extended characters.
              The table shown at
              http://www.microsoft.com/globaldev/reference/dbcs/932.htm
              provides information about the Unicode values of
              cp932 characters. For
              cp932 table entries with characters
              under which a four-digit number appears, the number
              represents the corresponding Unicode
              (ucs2) encoding. For table entries with
              an underlined two-digit value appears, there is a range of
              cp932 character values that begin with
              those two digits. Clicking such a table entry takes you to
              a page that displays the Unicode value for each of the
              cp932 characters that begin with those
              digits.
            
The following links are of special interest. They correspond to the encodings for the following sets of characters:
NEC special characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_87.htm
NEC selected 鈥 IBM extended characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_ED.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_EE.htm
IBM selected characters:
http://www.microsoft.com/globaldev/reference/dbcs/932/932_FA.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FB.htm http://www.microsoft.com/globaldev/reference/dbcs/932/932_FC.htm
              cp932 supports conversion of
              user-defined characters in combination with
              eucjpms, and solves the problems with
              sjis/ujis
              conversion. For details, please refer to
              http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
            
          For some characters, conversion to and from
          ucs2 is different for
          sjis and cp932. The
          following tables illustrate these differences.
        
          Conversion to ucs2:
        
| sjis/cp932Value | sjis->ucs2Conversion | cp932->ucs2Conversion | 
| 5C | 005C | 005C | 
| 7E | 007E | 007E | 
| 815C | 2015 | 2015 | 
| 815F | 005C | FF3C | 
| 8160 | 301C | FF5E | 
| 8161 | 2016 | 2225 | 
| 817C | 2212 | FF0D | 
| 8191 | 00A2 | FFE0 | 
| 8192 | 00A3 | FFE1 | 
| 81CA | 00AC | FFE2 | 
          Conversion from ucs2:
        
| ucs2value | ucs2->sjisConversion | ucs2->cp932Conversion | 
| 005C | 815F | 5C | 
| 007E | 7E | 7E | 
| 00A2 | 8191 | 3F | 
| 00A3 | 8192 | 3F | 
| 00AC | 81CA | 3F | 
| 2015 | 815C | 815C | 
| 2016 | 8161 | 3F | 
| 2212 | 817C | 3F | 
| 2225 | 3F | 8161 | 
| 301C | 8160 | 3F | 
| FF0D | 3F | 817C | 
| FF3C | 3F | 815F | 
| FF5E | 3F | 8160 | 
| FFE0 | 3F | 8191 | 
| FFE1 | 3F | 8192 | 
| FFE2 | 3F | 81CA | 
          Users of any Japanese character sets should be aware that
          using --character-set-client-handshake (or
          --skip-character-set-client-handshake) has an
          important effect. See Section聽5.2.2, 鈥淐ommand Options鈥.