Qual charset devo usar? ISO8859_1 WIN1252 ou UTF8?

This article is still in draft, the information is correct, but it lacks formatting adjustments, grammatical agreement and finding a more didactic way. But if you understand, you can use it.

If you use Windows, you might be wondering why this question is important.
Please open cmd and run the command 'chcp' with me, it returns the character page code that is currently in use. It will return code page '850'.

Let's google and search this page with the terms 'Code page 850' and we will be forwarded to wikipedia:

https://en.wikipedia.org/wiki/Code_page_850

And then we find that we are using iso8859_1 aka 'Latin1'. So every time we refer to ISO8859_1, let's call it Latin1, ok?

If you've come to know MS-DOS you may remember the chcp command we used in autoexec.bat to make it possible to see accents on the screen. So since the early days of MS-DOS we have used Latin1 (or iso8859_1) to make it possible to see our accents in the terminal. Latin1 (or iso8859_1) is prefixed with 'iso' because it was approved by the international committee of the same name: International Organization for Standardization, the equivalent of our ABNT only international.

But one thing is the terminal, something considered legacy by some to run applications. When Windows came up with a graphical interface, microsoft decided to create a new code page called 'Win1252' so it has 'Win' in the name, it was made exclusively for the Windows environment and had the approval of ANSI, the American ABNT.

Why create your own standard instead of continuing to use ISO8859_1(Latin1)? This is a difficult question to answer, iso8859_1 was initially created by Apple and later approved by ISO and at that time Microsoft liked to dictate industry standards and maybe that's why they created Win1252. It should be unnecessary since it was based on ISO8859_1 itself and both with few exceptions are interchangeable with each other. Personally, I only know of two differences between both, Win1252 has a long dash which is different from the normal dash and a serif(curve) quote set

Having a different character generation pattern only created problems for Microsoft, their terminal(cmd) is Latin1, the GUI is Win1252, the file system is UNICODE. Not long ago, if you used a notepad to edit a php or html file, chances were good that your work would be lost. Systems like Linux are all UNICODE and you usually don't have to deal with conversions between code from different pages.

Another problem with the Windows environment is that it is not as universal as it used to be, today applications are in the cloud and most database services are hosted on computers that do not run Windows. If you host data on a Windows server storing data in WIN1252 format, you would probably have to think what would happen if you decided to host a database on a Linux operating system that only follows international ISO standards where WIN1252 does not exist. You may have to deal with character transliteration issues.

[todo: show in notepad++ this transliteration]

So, if we want an application or database that is portable between operating systems, it is best to give up WIN1252 and opt for ISO8859_1(latin1) instead. Now, from now on I will only talk about Latin1(ISO8859_1) and the Unicode standard. But when referring to Latin1, remember that it is also applicable to WIN1252.

UNICODE came to solve some problems, There is a great talk Firebird Conference 2011 Luxembourg where the speaker talked a lot about charsets, in his talk entitled “Character Sets and Unicode in Firebird” and mentioned the advantages:

[I'll include the link in the video:
https://www.firebirdsql.org/file/community/ppts/fbcon11/FbCon2011-Charsets-Heymann.pdf
]

A single character set for all languages/scripts
No code overlays, no arbitrary code needed to fix inconsistencies
Independent of hardware and operating system

todo: show example in linux terminal, changing prompt to display latin characters and utf

todo: on windows change code page from 850 to chcp 65001

chcp

We then check that we are using latin1, let's now change to win1252:

chcp 1252
echo line #1: test with with page latin1(iso8859_1)>test.txt
echo line #2: attention, my accent is inoquable>>test.txt
test.txt hex editor

It all seems right, right? but let's now change the page code to utf:

chcp 65001
echo line #3: test with with unicode page>>test.txt
echo line #4: attention, my accent is inoquable>>test.txt
test.txt hex editor

Now we have two initial lines displayed correctly, but the last two erroneously, why? Because when starting the file for the first time, there are no codepage indications inside the file and so windows assumes it is Ansi.

Let's create the same 4 lines in notepad++, note that it is already created in UTF8
but if we look in notepad, we will notice that none of the lines look correct because neither Latin1 nor UTF is compatible with WIN1252(Ansi).

Se usarmos o notepad++ e convertermos o arquivo para UTF-8 (Formatar->Codificação UTF-8 ou europa ocidental) as duas primeiras ou as duas últimas linhas estão corretas, mas nunca as 4 linhas.

But if unicode is so good, why don't we use it?

We use it all the time, Microsoft has already surrendered to UNICODE in development, although Windows still has a page code salad, for example the file system is unicode, the graphical environment is still win1252 and the cmd(terminal) Lantin1 , so there are a lot of inconsistencies when we program in Windows to unicode. If a programmer generates an SQL script and saves it to disk using WIN1252, but the characters generated in it are in UNICODE, there will be problems. If you had to create edit a unicode file using notepad and then realized.

todo: demonstrate an SQL script using IBExpert, UTF-8 database will not run a seemingly harmless script that was written on Win1252.

This is a hell that only exists on Windows, and that when you try to port to Linux you have to solve problems that didn't exist before. Therefore, for our convenience, databases include WIN1252 among their charsets, although this is not ideal.

Still about databases, UNICODE is not always desirable, as a single character stored in it can occupy 1 to 4 bytes, so it is not widely used if an application already “speaks” ISO8859_1 or WIN1252, as nations that use ideograms or other typographic symbols also know how to “speak” ISO8859_1 or WIN1252, although the opposite is not true, if you were from a Japanese company you would have to write an application that uses Japanese ideograms and most likely also the Western one and only UNICODE would solve this problem .

So if you live in a western country that uses the characters AZ, you can consider yourself lucky because we have options to choose from, and we usually choose the cheapest that is also the most performant, disregard UNICODE and choose between WIN1252 or ISO8859_1.
A database entirely in UNICODE is less performant when compared to Latin1, not only because it is more greedy in terms of space, remember a varchar(64) depending on the UNICODE used, instead of 64 characters it can only contain 16, but also because the 'case insensitive' and 'accent insensitive' algorithm may be more complex. MySQL for example is recognized for adopting a UNICODE hack that sometimes creates problems.

IBExpert – Create database:

C:\TEMP\TEST_ISO8859_1.FDB
CHARSET: ISO8859_1
COLLATE: PT_BR
ALIAS: TEST_ISO8859_1.FDB
PORT: 3040

— CREATE THE TABLE:

CREATE TABLE T1(
    FRUTA_ISO8859 VARCHAR(10) CHARACTER SET ISO8859_1 COLLATE PT_BR,
    FRUTA_WIN1252 VARCHAR(10) CHARACTER SET WIN1252 COLLATE WIN_PTBR,
    FRUTA_UNICODE VARCHAR(10) CHARACTER SET UTF8 COLLATE UNICODE_CI_AI
);

POPULAR VALUES ACCREDITED WITH THE SCRIPT:

INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Maracujá', 'Maracujá', 'Maracujá');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Açaí', 'Açaí', 'Açaí');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Gravatá', 'Gravatá', 'Gravatá');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Avelã', 'Avelã', 'Avelã');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Melão', 'Melão', 'Melão');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Maçã', 'Maçã', 'Maçã');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Mamão', 'Mamão', 'Mamão');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Jiló', 'Jiló', 'Jiló');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Babaçu', 'Babaçu', 'Babaçu');

COMMIT WORK;

Are all characters the same size in bytes? Run:

select
  a.fruta_iso8859||'('||octet_length(a.fruta_iso8859)||')' as iso8859,
  a.fruta_win1252||'('||octet_length(a.fruta_win1252)||')' as win1252,
  a.fruta_unicode||'('||octet_length(a.fruta_unicode)||')' as unicode
from T1 a

Among our fruits, the character limit is 10, passion fruit was the fruit that took up the most space in bytes with 9 bytes in unicode. What will happen if we increase it to 10 characters and we accentuate even more letters? Let's check:

update T1 set
  FRUTA_UNICODE='Maráçujãío' -- 10 caracteres, deveria caber, o limite é 10
where FRUTA_UNICODE='Maracujá';

We did it, but note the divergence between character size and occupied byte size:

select
  a.fruta_iso8859||octet_length(a.fruta_iso8859),
  a.fruta_win1252||octet_length(a.fruta_win1252),
  a.fruta_unicode||octet_length(a.fruta_unicode)
from T1 a
where FRUTA_UNICODE='Maráçujãío'

Passion fruit is 10 characters long, but now occupies 14 bytes. So the size of a field is not the same as saying the size occupied. What do you think would happen if all 10 letters had an accent? We will try:

update T1 set
FRUTA_UNICODE='áéíóúáéíóú'-- 10 caracteres
where FRUTA_UNICODE='Maráçujãío'

We then came to the conclusion that in unicode – especially UTF-8 – your database storage will be larger containing accents than it would be in Latin1 or Ansi.

With storage so cheap, should I be worried?

Se seu objetivo for internacionalizar, não deve se preocupar com o armazenamento, mas deve se preocupar com os limites teóricos de seus metadados, como nome de objetos. Por exemplo, se eu criar uma tabela cuja chave é um varchar(x) qual é o tamanho máximo de uma chave dentro de um índice? No FirebirdSQL, este tamanho é ¼ do tamanho da pagina de dados. Se por exemplo, seu banco tem paginas de 4kb, significa que o tamanho de um índice não vai poder ultrapassar 1k então um indice simples ou composto que tenha mais de 1024bytes não será possível. Então quando estiver usando unicode tome cuidado com os limites do FirebirdSQL que forem em bytes, pois o que antes era 1 byte=1 caractere não é mais aplicável. Os limites que eram estabelecidos em caracteres, estes não mudam, se o limite para tamanho de nome para uma tabela é 63 caracteres, continuará sendo 63 caracteres não importando se é unicode ou não.

Aqui temos uma gritante diferença para iso8859_1 ou win1252, pois nestes cada caractere é 1 byte, enquanto em unicode o FirebirdSQL assume que cada caractere consumido é 4 bytes. Então se voce criar um campo varchar(18000) e inserir 18.000 caracteres acentuados, ele irá deixar, mas o espaço ocupado será 18.000×2 ou 18.000×4 dependendo do acento.

Portanto, quando criar um campo que você sabe que no máximo terá X caracteres, leve em consideração:

O tamanho declarado não é o mesmo de tamanho ocupado.
Cuidado com o metadados, se um tipo de limite for informado em bytes então usando unicode você terá 4x menos do que o informado.

LET'S GO TO COLLATE

Charset is the available character set, mostly ISO8859_1, WIN1252 and UNICODE. Each with their limited set of characters. If your database needs Latin and Western characters ISO8859_1 is fine for you, you should not use WIN1252 because it is only provided for Windows and its condition may change in the future.

But if you need a large character set that allows you to write Western characters and Japanese characters in the same sentence then you have to choose UNICODE, usually UTF-8.

Collate is how characters will be treated and/or ordered. Does 'A' come before 'á'? Are 'Pharmacia' and Farmacia' the same thing? Are 'Jose' and 'Jose' also the same?

Whoever creates the collates defines this, in Brazil there is no such thing as the 'Ph' of pharmacia from the 30s, but in other countries there may be grouping of characters (collate) that should be treated jointly and not individually.

CREATE TABLE T2 (
    NOME  VARCHAR(30) NOT NULL COLLATE PT_BR
);

INSERT INTO T2 (NOME)   VALUES ('FARMACIA');
INSERT INTO T2 (NOME)   VALUES ('FARMÁCIA');
INSERT INTO T2 (NOME)   VALUES ('Jose');
INSERT INTO T2 (NOME)    VALUES ('José');
INSERT INTO T2 (NOME)    VALUES ('JOSÉ');

Note now the ordering of data between two different colates using the same charset:

SELECT * FROM T2 to ORDER BY a.name collate pt_pt

Now the other:

SELECT * FROM T2 to ORDER BY a.name collate pt_br

Collates are tied to the charset because without them, the bank would not have the sort regionality, or which diacritics are equal to their versions without those signs, and so on.

Even if a database has a table using 3 charsets different, containing the same data, the collate can make the data behave using the same linguistic rule. For example, for Brazil, case/accent insensitive means that the collate will not distinguish between uppercase and lowercase and that the ordering will follow the same pattern.

Let's test if the ordering was influenced by the charset running this query:

execute block
returns(
  iso8859_1 varchar(10),
  win1252 varchar(10),
  unicode varchar(10))
as
begin
  --iso8859_1
  iso8859_1='Sim';
  win1252='-';
  unicode='-';
  suspend;
  for select
    a.fruta_iso8859, a.fruta_win1252, a.fruta_unicode
    from T1 a
    order by a.fruta_iso8859
    into iso8859_1, win1252, unicode
  do begin
    suspend;
  end
  -- win1252
  iso8859_1='-';
  win1252='Sim';
  unicode='-';
  suspend;
  for select
    a.fruta_iso8859, a.fruta_win1252, a.fruta_unicode
    from T1 a
    order by a.fruta_win1252
    into iso8859_1, win1252, unicode
  do begin
    suspend;
  end
  -- unicode
  iso8859_1='-';
  win1252='-';
  unicode='Sim';
  suspend;
  for select
    a.fruta_iso8859, a.fruta_win1252, a.fruta_unicode
    from T1 a
    order by a.fruta_unicode
    into iso8859_1, win1252, unicode
  do begin
    suspend;
  end
end

We noticed in the output of the execute block command that the ordering by the iso8859_1 (latin1), win1252 or unicode charset had no difference!

Let's complicate and insert unaccented characters, running this sequence of ExecSQL:

INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Maracuja', 'Maracuja', 'Maracuja');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Açai', 'Açai', 'Açai');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Gravata', 'Gravata', 'Gravata');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Avela', 'Avela', 'Avela');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Melao', 'Melao', 'Melao');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Maça', 'Maça', 'Maça');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Mamao', 'Mamao', 'Mamao');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Jilo', 'Jilo', 'Jilo');
INSERT INTO T1 (FRUTA_ISO8859, FRUTA_WIN1252, FRUTA_UNICODE)
            VALUES ('Babacu', 'Babacu', 'Babacu');

We perform the run block again and we noticed that there was no difference, the order was the same for all cases, only the unaccented version took precedence.