Application impact of moving to UTF-8

Yes, UTF-8 is spreading everywhere… and that is clearly a good thing: how could we keep building applications that only work in a single language?!

But what happens when you turn an existing application that uses the classic WEBISO (ISO-8859-1) for its interfaces and processing into an application of the future based on UTF-8? That is what has kept me busy at work these last few days, so I might as well share the experience!

To understand what follows, keep in mind that in WEBISO every character is encoded on 8 bits (1 byte). As a consequence not every character in the world can be represented, but processing is simple: truncating a string to 10 characters, for instance, simply means keeping 10 bytes.
In UTF-8, the size of a character depends on the character itself. The basic characters (the first 128) are encoded on 1 byte, identically to ASCII, so a plain English text will be roughly the same in UTF-8 and in WEBISO. For what we Latin speakers write, it is another story: accented characters generally take two bytes, and slightly more exotic characters such as ™ take three bytes…
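
To make this concrete, here is a small query you can run in an AL32UTF8 database (the Oracle functions are real, the string is just an example): LENGTH counts characters while LENGTHB counts bytes.

    -- Assuming an AL32UTF8 database: LENGTH counts characters, LENGTHB counts bytes.
    SELECT LENGTH('hôtel™')  AS nb_chars,   -- 6 characters
           LENGTHB('hôtel™') AS nb_bytes    -- 9 bytes: ô takes 2, ™ takes 3
    FROM dual;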

This has many impacts on programs, which can be grouped into two families:

  • Those related to the appearance of characters
  • Those related to the size of characters

As for the appearance problem, you have probably already met it on the Internet: instead of characters such as é you will see things like Ã© or other hieroglyphs of the same kind. This problem is solved by transcoding. It occurs when data in one encoding (WEBISO, for example) is loaded into a database that uses another encoding (UTF-8). So when migrating an application to UTF-8, you must pay close attention to the encoding of the flows entering the database and of the flows leaving it. If you don't, the receiving applications may no longer be able to interpret the characters correctly.
When files are imported with SQL*Loader or Java, the encoding of the source file can usually be specified simply by setting the locale of the user running the load batch; the target encoding is then handled by the Oracle engine.
For files extracted from the database through PL/SQL, if nothing is specified it is the database encoding that is used (with Java, on the other hand, it is always the batch user's encoding). The file then has to be converted; the conversion tools (such as iconv) are standard on UNIX.
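
As a quick illustration of the transcoding issue, here is a sketch you can try directly in SQL (assuming the database character set is AL32UTF8): Oracle's CONVERT function re-interprets the bytes of a string in another character set, which reproduces the famous hieroglyphs.

    -- Re-interpreting the UTF-8 bytes of 'é' (0xC3 0xA9) as ISO-8859-1
    -- produces the classic mojibake: 'é' turns into 'Ã©'.
    SELECT CONVERT('é', 'AL32UTF8', 'WE8ISO8859P1') AS mojibake FROM dual;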

The second category of impact, the one related to data size, is more troublesome because it often requires changes to the data model.
The usual SQL data type for storing character strings is VARCHAR2. A typical declaration is VARCHAR2(10), meaning that you want 10… what, exactly?!… BYTES! And that is the whole problem. By default Oracle counts in bytes, because of a parameter called NLS_LENGTH_SEMANTICS which decides whether lengths are expressed in BYTE or in CHAR. As a result, a UTF-8 string such as “abcdefghij”, which is 10 characters and 10 bytes, fits in the column, while a string such as “àbcdéfghïj” is still 10 characters but now 13 bytes (à=2, é=2, ï=2) and Oracle returns an error. The failure is therefore not systematic, which is exactly what makes it dangerous…
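
A minimal sketch of the problem, assuming a database in AL32UTF8 with the default BYTE semantics (the table and column names are made up):

    CREATE TABLE demo_utf8 (code VARCHAR2(10));     -- 10 BYTES by default

    INSERT INTO demo_utf8 VALUES ('abcdefghij');    -- 10 characters, 10 bytes: accepted
    INSERT INTO demo_utf8 VALUES ('àbcdéfghïj');    -- 10 characters, 13 bytes:
    -- fails with ORA-12899: value too large for column (actual: 13, maximum: 10)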
To get out of this trap there is a solution that looks simple: change the NLS_LENGTH_SEMANTICS parameter. But Oracle is, how shall I put it… a leader (that's the word), and this parameter is poorly supported. Applying a patch, for example, requires switching back to BYTE mode, which can be disastrous for the integrity of the database if you are not careful… add to that incompatibilities with some packaged software… in short, this simple solution is not really workable.
The more suitable solutions therefore involve reworking the data model, either by turning the VARCHAR2 columns into NVARCHAR2 (this second type has the good taste of being defined in characters rather than bytes by default), or by using explicit declarations for the VARCHAR2 columns. For example VARCHAR2(10 CHAR) states that we really mean 10 characters and not 10 bytes. In that case Oracle allocates not 10 but 40 bytes in the column (the maximum that 10 characters can require) and checks on insert/update that the requested number of characters is not exceeded.
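
The same hypothetical table, declared the two ways described above:

    CREATE TABLE demo_utf8_char (
      code  VARCHAR2(10 CHAR),   -- 10 characters, up to 40 bytes reserved
      label NVARCHAR2(10)        -- 10 characters, stored in the national character set
    );

    INSERT INTO demo_utf8_char (code) VALUES ('àbcdéfghïj');   -- 13 bytes, accepted this time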

Reworking the data model this way is rather heavy, but it is probably the only truly durable solution. Besides the table definitions, part of the code will have to be revisited:
  • The PL/SQL code, whenever intermediate variables are not declared by reference to the table columns (%TYPE).
  • The SQL*Loader control files, when they specify the destination type.
  • The ETL loads, when they define the input and output types (a common case).

A fallback that can be considered temporarily is to truncate incoming data (SQL*Loader-style loads) to a size expressed in bytes rather than in characters. In PL/SQL, for example, the SUBSTRB function cuts a string down to a sub-string of n bytes; the cut is smart and will not leave half a character behind.
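
A small sketch of the difference between the two functions, using the same example string as above:

    -- SUBSTR counts characters, SUBSTRB counts bytes: a quick way to make incoming
    -- data fit a column that is still declared in BYTE semantics.
    SELECT SUBSTR ('àbcdéfghïj', 1, 10) AS first_10_chars,  -- àbcdéfghïj (13 bytes)
           SUBSTRB('àbcdéfghïj', 1, 10) AS first_10_bytes   -- àbcdéfgh   (10 bytes)
    FROM dual;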

Below is another, more detailed version of the same story that I originally wrote for our English-speaking colleagues…

Intro

The xxxxx standard for database encoding, mainly for applications, is now UTF-8, and more precisely AL32UTF8. UTF-8 encoding allows databases to support any language, even Asian ones. Most of the XXXX legacy applications are based on WEBISO (ISO-8859-1, also named Latin-1) to support special characters like é, à, î… This choice brings a lot of benefits to our master applications, but it also creates some new considerations that we have to take into account during the development phase.

Let's try to explain what changes with UTF-8 encoding…

Some encoding definitions

WEBISO was an easy encoding for systems, as it represents each character on one byte. The first 128 characters are ASCII-compatible and the next 128 represent local and special characters like our “éçà”. Different encoding names and code pages are used to change the meaning of character codes 128 to 255 for each country. This means that, in the past, a file sent from one country to another (say from Eastern Europe to Western Europe) could arrive with strange or incorrect characters. For example, a Central European file could contain a character like Č (encoded as 0xC8 in ISO-8859-2); this character would be printed as È (0xC8 in ISO-8859-1) by a Western European application. A transcoding of the file is needed, but since the character Č does not exist in ISO-8859-1 it has to be replaced by something close, like a plain C. So much for the past…

UTF-8 is a good way to represent any character in the world. It uses up to 4 bytes per character when the chosen encoding is AL32UTF8. As a consequence, the characters Č and È can coexist in the same page or file. Processing this kind of encoding is a little more complicated, because not every character requires 4 bytes: to save space, the most frequent characters are encoded on 1 byte, less frequent ones on 2 bytes, and so on. Some characters like ™ require 3 bytes (the UTF-8 encoding is 0xE2 0x84 0xA2).
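
If you want to see those byte sequences for yourself, Oracle's DUMP function shows how a literal is actually stored (this assumes an AL32UTF8 database):

    SELECT DUMP('A', 16) AS one_byte,     -- Typ=96 Len=1: 41
           DUMP('é', 16) AS two_bytes,    -- Typ=96 Len=2: c3,a9
           DUMP('™', 16) AS three_bytes   -- Typ=96 Len=3: e2,84,a2
    FROM dual;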

To keep files compatible (mainly with the US), a file containing only ASCII-7 characters is exactly the same in ISO-8859-1, ISO-8859-2 and UTF-8. This looks like good news, but it hides most of the problems we run into when implementing UTF-8.

Consequences of encoding differences (first issues)

In the past, when an application received a file in the wrong encoding, the data was loaded normally but displayed badly, as we saw above. This end-user issue was solved by converting the incoming file, mapping each unknown letter to the nearest one.

UTF-8 can create the same issue when a file or flow is imported into an ISO-8859-1 application without the necessary conversion: the application will display unreadable characters. For example, the ™ character will be printed as “â?¢” (where ? stands for a non-printable character). That is the classical problem…

Now imagine that the incoming file uses a fixed-column format: the program will try to read a field at a given position. Treating the file as ISO-8859-1, it will look for that position a certain number of bytes from the beginning. Since the character length is no longer fixed, every character that needs more than 1 byte shifts the program to the wrong position. As a result, jobs can abend or load wrong data into the database.

The way to avoid this situation is simply to convert the incoming file to the expected encoding, as was done in the past.
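
Here is a small sketch of the fixed-position problem, written in SQL for convenience (the string is an invented record where the year is expected at character positions 4 to 7):

    -- Reading by character position finds the field; reading by byte position
    -- is shifted as soon as a preceding character needs two bytes.
    SELECT SUBSTR ('Né 2024 PARIS', 4, 4) AS by_char_position,  -- 2024
           SUBSTRB('Né 2024 PARIS', 4, 4) AS by_byte_position   -- ' 202'
    FROM dual;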

Now, since everyday characters like “éèà” are enough to trigger the issue described above, and since our applications mix UTF-8 and WEBISO, it is really important to describe every flow with an encoding type. In other words, when a functional analyst describes a flow between two applications, he must state the source encoding and the destination encoding. Every developer should ask which encoding he has to handle when he develops a flow input/output procedure. And during unit tests or qualification, testers should fill in data containing as many non-ASCII characters as possible to verify the application's UTF-8 compatibility.

From our experience with the problem:

  • EAI flow encoding is generally defined in the flow itself.
  • The default encoding for file imports is generally defined by the batch user's encoding. So by changing this batch user encoding, Java will pick the right input encoding. Once Java has put the data in a String, it is converted to Unicode, then re-converted to the database encoding when stored. SQL*Loader works the same way: it uses the batch user's encoding and performs the conversion before updating the database.
  • File exports with Java behave like imports: the output encoding is, by default, the batch user's encoding.
  • PL/SQL exports are the troublesome case, as the output encoding used is the database encoding. In part of the cases the generated file will have to be converted by a shell command (or re-encoded at the source, as in the sketch below).
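
For the PL/SQL case, an alternative to converting the file afterwards is to re-encode the data before writing it. This is only a sketch under assumptions: the directory object EXPORT_DIR, the table ITEMS and its LABEL column are hypothetical, and the database character set is assumed to be AL32UTF8.

    DECLARE
      f UTL_FILE.FILE_TYPE;
    BEGIN
      -- 'wb' opens the file in byte mode so that we control the bytes written.
      f := UTL_FILE.FOPEN('EXPORT_DIR', 'items_latin1.txt', 'wb', 32767);
      FOR rec IN (SELECT label FROM items) LOOP
        -- Re-encode each line from the database character set to ISO-8859-1
        -- before writing, instead of running a shell conversion on the file later.
        UTL_FILE.PUT_RAW(f, UTL_I18N.STRING_TO_RAW(rec.label || CHR(10), 'WE8ISO8859P1'));
      END LOOP;
      UTL_FILE.FCLOSE(f);
    END;
    /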

Consequences on data length (second issues)

The problem described above hides another, more sensitive one, which is at the origin of this article… To understand it we must dig deeper into Oracle configuration parameters. By default an Oracle database has a parameter named NLS_LENGTH_SEMANTICS set to BYTE. The purpose of this parameter is to define what we are talking about when we declare a classical type such as VARCHAR2(10); it answers the question: what kind of unit do I want to put 10 of in my string?

With this setting of NLS_LENGTH_SEMANTICS, VARCHAR2(10) means “reserve a maximum of 10 BYTES for the string”. Here is the main issue: most of us use this syntax expecting Oracle to reserve room for 10 characters, as we have always done! Why is it so different with UTF-8? It is not: VARCHAR2(10), CHAR and VARCHAR have always defined a number of bytes, but before UTF-8 one character was equivalent to one byte.
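
You can check which semantics your session is using, and even switch the session before running DDL, although the explicit CHAR declarations discussed below remain the safer route:

    -- Current value of the parameter (BYTE by default).
    SELECT value
    FROM   nls_session_parameters
    WHERE  parameter = 'NLS_LENGTH_SEMANTICS';

    -- The setting can also be changed for the current session only;
    -- it affects the DDL run from this session, not existing columns.
    ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;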

As a consequence of this byte semantics, imagine an Xnet screen with a form containing a free-text field limited to 10. The user enters, say, “à découper”; the Java program gets it, counts the characters and accepts it, since the number of characters is equal to 10 and therefore within the constraint. It then generates the SQL request and sends it to Oracle. Oracle checks the 10-byte constraint and rejects it… This issue also impacts data loading: any feed that can contain non-ASCII-7 characters can abend a loading process. Referential files (such as item definitions) can therefore fail during loading and stop a production batch plan, simply because the day before someone created a slightly longer line containing “éàè” characters in the referential.

There are three ways to avoid this situation:

  • The simplest one is to change the NLS_LENGTH_SEMANTICS parameter. It can be set to CHAR, in which case VARCHAR2(10) means 10 CHARS. This solution is really low cost, since the database definitions do not change and no program has to be rewritten. Unfortunately Oracle does not fully support this parameter, and our DBAs currently refuse to put it in place. As an example of known problem, every patch of the Oracle engine has to be applied in BYTE mode; if the parameter is not set back to CHAR at the end of the upgrade, the database can end up corrupted and unrecoverable. Moreover, some tools, such as DataStage, are not yet validated in that mode.
  • Another way (the best one in my opinion) is to state the desired unit for each VARCHAR2, i.e. declare VARCHAR2(10 CHAR). In this case Oracle physically reserves four times as many bytes (10 CHARS correspond to 40 BYTES) but also adds a 10-character constraint on the column. In Toad or DbVisualizer you will generally see the column definition as 40; be careful though, Toad seems to display this information inconsistently.
  • The last way is to use NVARCHAR2, since NVARCHAR2 (and NCHAR) use character semantics by default: NVARCHAR2(10) corresponds to 10 CHARS. So one way to make a whole database UTF-8 ready with less effort is to convert every VARCHAR2 element into its NVARCHAR2 equivalent. Be careful though: the encoding of these N* types is not defined by the usual NLS_CHARACTERSET parameter but by NLS_NCHAR_CHARACTERSET, and in most of our databases this last one was not defined as expected (AL32UTF8 is our standard, not AL16UTF8), so check it before relying on NVARCHAR2 columns (see the query sketched after this list).
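
Before choosing one of these options, a quick check of both character sets is worth the two lines it costs (the view and parameter names below are standard Oracle):

    -- NVARCHAR2/NCHAR columns use the national character set,
    -- not the database character set.
    SELECT parameter, value
    FROM   nls_database_parameters
    WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');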

Implemented solution

The best approach is to study the situation during the analysis phase and build the solution with the right types from the start: using VARCHAR2(xx CHAR), CHAR(xx CHAR)… definitions is the best way to avoid encoding problems.

Otherwise you will discover these issues during Central Acceptance or, worse, in Production. In those situations we studied two ways of solving the problem:

  • The global way: convert all the VARCHAR2 types into NVARCHAR2 or VARCHAR2(xx CHAR). As an impact, most of the SQL*Loader and ETL processes will have to be reworked. Each unit change is simple and quick, but if you own a few hundred ETL processes it will still require some weeks of work. Java should not be impacted. PL/SQL routines should be compatible as long as their internal variables use referenced types (%TYPE) rather than fixed ones, but they will have to be recompiled. All the non-regression tests have to be performed.
  • The selective way: this method is the opposite one; it consists in changing the minimum number of columns. Since the UTF-8 length issue only impacts user-typed data, we can limit the perimeter to the columns fed by the internal UI and by external flows. These labels generally represent less than 1% of the overall columns. The method we used is to identify the source columns with the rules given earlier, then to follow where those columns are propagated in the application (data copies or transformations) until they reach an output such as an ETL flow (a query like the one sketched after this list helps build the starting inventory). The objective of this approach is to reduce the rework perimeter, mainly the number of impacted ETL processes, and to limit the non-regression tests. The final perimeter can then be adjusted according to the level of risk of each field.
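
As a starting point for that inventory, a query along these lines lists the character columns of a schema that are still declared in BYTE semantics (MYSCHEMA is a placeholder):

    SELECT table_name, column_name, data_type, data_length, char_length
    FROM   all_tab_columns
    WHERE  owner = 'MYSCHEMA'
    AND    data_type IN ('VARCHAR2', 'CHAR')
    AND    char_used = 'B'
    ORDER  BY table_name, column_name;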

A good way to quickly verify the patch and the non-regression is to fill every source column with a full string of “éàè” characters and run the application…
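
In SQL terms the probe can be as simple as the statement below; ITEMS, ITEM_LABEL and the length of 30 are of course invented, to be replaced by your own sensitive columns:

    -- Fill the column to its declared length with 2-byte characters:
    -- this fails with ORA-12899 if the column is still VARCHAR2(30 BYTE).
    UPDATE items
    SET    item_label = RPAD('é', 30, 'é')
    WHERE  item_id = 1;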

Conclusion

UTF-8 is not just a technical improvement: it has consequences from the analysis phase onwards, and the data model and the resulting database schema have to take it into account. The encoding of every application interface also has to be described, and conversion scripts will sometimes be needed. As a standard, a Master Application should store data in UTF-8 and should provide a UTF-8 interface to the other Master Applications.

When UTF-8 impacts your application as described above, we now know several workarounds, intermediate solutions and global solutions that can be put in place. All of them require time, but they can be combined and planned step by step across patches and releases.

If your application is already in UTF-8 and you have never thought about these problems, I invite you to type a full line of “éééééé” in one of your screens and check the result in the database… You may discover a source of risk…
