Unicode routines



Allegro can manipulate and display text using character values ranging from 0 to 2^32-1 (although the current implementation of the grabber can only create fonts using at most 2^16-1 characters). You can choose between several text encoding formats, which control how strings are stored and how Allegro interprets the strings you pass to it. This setting affects all aspects of the system: whenever a function returns a char *, or takes a char * as an argument, that text will be in whichever format you have told Allegro to use.

By default, Allegro uses UTF-8 encoded text (U_UTF8). This is a variable-width format, where characters can occupy anywhere from one to six bytes. The nice thing about it is that character values from 0 to 127 are encoded directly as themselves, so UTF-8 is upwardly compatible with 7 bit ASCII ("Hello, World!" means the same thing regardless of whether you interpret it as ASCII or UTF-8 data). Any character values of 128 or above, such as accented vowels, the UK currency symbol, and Arabic or Chinese characters, will be encoded as a sequence of two or more bytes, each in the range 128-255. This means you will never get what looks like a 7 bit ASCII character as part of the encoding of a different character value, which makes it very easy to manipulate UTF-8 strings.
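The byte layout described above can be sketched in stand-alone C. This is an illustration only, not Allegro's own code: the name sketch_utf8_encode is hypothetical, and the sketch assumes the value is at most 2^31-1, which is the largest value a six byte sequence can hold.

```c
/* Stand-alone sketch, not Allegro's own code: encodes the character
 * value c (assumed to be at most 2^31-1) as UTF-8 into out, which must
 * have room for 6 bytes, and returns the number of bytes written. */
int sketch_utf8_encode(unsigned long c, unsigned char *out)
{
   int width, i;

   if (c < 0x80UL) {            /* 0-127: stored directly, as in ASCII */
      out[0] = (unsigned char)c;
      return 1;
   }

   /* Pick the sequence length from the size of the value. */
   if (c < 0x800UL)           width = 2;
   else if (c < 0x10000UL)    width = 3;
   else if (c < 0x200000UL)   width = 4;
   else if (c < 0x4000000UL)  width = 5;
   else                       width = 6;

   /* Trailing bytes carry 6 bits each and always look like 10xxxxxx,
    * so every one of them is in the range 128-191. */
   for (i = width - 1; i > 0; i--) {
      out[i] = (unsigned char)(0x80 | (c & 0x3F));
      c >>= 6;
   }

   /* The lead byte starts with `width` one bits followed by a zero. */
   out[0] = (unsigned char)((0xFF << (8 - width)) | c);
   return width;
}
```

Because both the lead byte and the trailing bytes of a multi-byte sequence have their top bit set, a byte below 128 found anywhere in such a string can only be a genuine 7 bit ASCII character.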

There are a few editing programs that understand UTF-8 format text files. Alternatively, you can write your strings in plain ASCII or 16 bit Unicode formats, and then use the Allegro textconv program to convert them into UTF-8.

If you prefer to use some other text format, you can set Allegro to work with normal 8 bit ASCII (U_ASCII), or 16 bit Unicode (U_UNICODE) instead, or you can provide some handler functions to make it support whatever other text encoding you like (for example it would be easy to add support for 32 bit UCS-4 characters, or the Chinese GB-code format).

There is some limited support for alternative 8 bit codepages, via the U_ASCII_CP mode. This is very slow, so you shouldn't use it for serious work, but it can be handy as an easy way to convert text between different codepages. By default the U_ASCII_CP mode is set up to reduce text to a clean 7 bit ASCII format, trying to replace any accented vowels with their simpler equivalents (this is used by the allegro_message() function when it needs to print an error report onto a text mode DOS screen). If you want to work with other codepages, you can do this by passing a character mapping table to the set_ucodepage() function.

void set_uformat(int type);
Sets the current text encoding format. This will affect all parts of Allegro, wherever you see a function that returns a char *, or takes a char * as a parameter. The type should be one of the values:

      U_ASCII     - fixed size, 8 bit ASCII characters
      U_ASCII_CP  - alternative 8 bit codepage (see set_ucodepage())
      U_UNICODE   - fixed size, 16 bit Unicode characters
      U_UTF8      - variable size, UTF-8 format Unicode characters

Although you can change the text format on the fly, this is not a good idea. Many strings, for example the names of your hardware drivers and any language translations, are loaded when you call allegro_init(), so if you change the encoding format after this, they will be in the wrong format, and things will not work properly. Generally you should only call set_uformat() once, before allegro_init(), and then leave it on the same setting for the duration of your program.

int get_uformat(void);
Returns the currently selected text encoding format.

void register_uformat(int type, int (*u_getc)(char *s), int (*u_getx)(char **s), int (*u_setc)(char *s, int c), int (*u_width)(char *s), int (*u_cwidth)(int c), int (*u_isok)(int c));
Installs a set of custom handler functions for a new text encoding format. The type is the ID code for your new format, which should be a 4-character string as produced by the AL_ID() macro, and which can later be passed to functions like set_uformat() and uconvert(). The function parameters are handlers that implement the character access for your new type: see below for details of these.

void set_ucodepage(unsigned short *table, unsigned short *extras);
When you select the U_ASCII_CP encoding mode, a set of tables is used to convert between 8 bit characters and their Unicode equivalents. You can use this function to specify a custom set of mapping tables, which allows you to support different 8 bit codepages. The table parameter points to an array of 256 shorts, which contain the Unicode value for each character in your codepage. The extras parameter, if not NULL, points to a list of mapping pairs, which will be used when reducing Unicode data to your codepage. Each pair consists of a Unicode value, followed by the way it should be represented in your codepage. The extras list is terminated by a zero Unicode value. This allows you to create a many->one mapping, where many different Unicode characters can be represented by a single codepage value (e.g. for reducing accented vowels to 7 bit ASCII).
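To make the shape of the two tables concrete, here is a stand-alone sketch of how such data could be consulted. The function names are hypothetical, and the search order (table first, then extras) and the '?' fallback are assumptions of this sketch rather than documented Allegro behaviour.

```c
#include <stddef.h>

/* Hypothetical stand-alone sketch of the two lookups described above.
 * table:  256 entries, table[b] is the Unicode value of codepage byte b.
 * extras: (Unicode value, codepage byte) pairs ending with a zero
 *         Unicode value, or NULL. */

/* Codepage byte -> Unicode value: a direct table lookup. */
int sketch_cp_to_unicode(const unsigned short *table, unsigned char b)
{
   return table[b];
}

/* Unicode value -> codepage byte. The search order and the '?'
 * fallback are assumptions made for this sketch. */
int sketch_unicode_to_cp(const unsigned short *table,
                         const unsigned short *extras, int c)
{
   int i;

   /* First look for an exact match in the main table. */
   for (i = 0; i < 256; i++)
      if (table[i] == c)
         return i;

   /* Then try the many->one reductions, if any were supplied. */
   if (extras != NULL)
      for (i = 0; extras[i] != 0; i += 2)
         if (extras[i] == c)
            return extras[i + 1];

   return '?';
}
```

With an extras list such as { 0xE9, 'e', 0xE8, 'e', 0 }, both accented forms of "e" reduce to the same plain letter, which is the many->one case mentioned above.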

int need_uconvert(char *s, int type, int newtype);
Given a pointer to a string, a description of the type of the string, and the type that you would like this string to be converted into, this function tells you whether any conversion is required. No conversion will be needed if type and newtype are the same, or if one type is ASCII, the other is UTF-8, and the string contains only character values less than 128. As a convenience shortcut, you can pass the value U_CURRENT as either of the type parameters, to represent whatever text format is currently selected.
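The rule above can be written out as a small stand-alone sketch. The enum values and the function name are hypothetical stand-ins, not the real U_* constants or Allegro's implementation, and only three formats are modelled.

```c
/* Hypothetical format codes for this sketch only; they are not the
 * values Allegro uses for U_ASCII, U_UTF8, and so on. */
enum sketch_format { SKETCH_ASCII, SKETCH_UTF8, SKETCH_UNICODE };

/* Returns nonzero if converting s from type to newtype would change
 * its bytes, following the rule described above. */
int sketch_need_convert(const char *s, int type, int newtype)
{
   const unsigned char *p = (const unsigned char *)s;

   if (type == newtype)
      return 0;

   if ((type == SKETCH_ASCII && newtype == SKETCH_UTF8) ||
       (type == SKETCH_UTF8 && newtype == SKETCH_ASCII)) {
      /* ASCII and UTF-8 agree byte for byte on values below 128. */
      for (; *p != 0; p++)
         if (*p >= 128)
            return 1;
      return 0;
   }

   /* Any other pair of formats stores characters differently. */
   return 1;
}
```

Under these assumptions "Hello" needs no work when going from ASCII to UTF-8, while a string containing a single accented character does.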

int uconvert_size(char *s, int type, int newtype);
Returns the number of bytes that will be required to store the specified string after a conversion from type to newtype, including the zero terminator. The type parameters can use the value U_CURRENT as a shortcut to represent the currently selected encoding format.

void do_uconvert(char *s, int type, char *buf, int newtype, int size);
Converts the specified string from type to newtype, storing at most size bytes into the output buf. The type parameters can use the value U_CURRENT as a shortcut to represent the currently selected encoding format.

char *uconvert(char *s, int type, char *buf, int newtype, int size);
Higher level function running on top of do_uconvert(). This function converts the specified string from type to newtype, storing at most size bytes into the output buf, but it checks before doing the conversion, and doesn't bother if the string formats are already the same (either both types are equal, or one is ASCII, the other is UTF-8, and the string contains only 7 bit ASCII characters). If a conversion was performed it returns a pointer to buf, otherwise it returns s itself, so you must use the return value rather than assuming that the string will always be moved to buf. As a convenience, if buf is NULL it will convert the string into an internal static buffer. You should be wary of using this feature, though, because that buffer will be overwritten the next time this routine is called, so don't expect the data to persist across any other library calls.

char *uconvert_ascii(char *s, char buf[]);
Helper macro for converting strings from ASCII into the current encoding format. Expands to uconvert(s, U_ASCII, buf, U_CURRENT, sizeof(buf)).

char *uconvert_toascii(char *s, char buf[]);
Helper macro for converting strings from the current encoding format into ASCII. Expands to uconvert(s, U_CURRENT, buf, U_ASCII, sizeof(buf)).
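Since both helper macros expand to an expression containing sizeof(buf), the size they pass on depends on how buf was declared at the point of use. The following stand-alone sketch shows the difference in plain C; SKETCH_SIZE_PASSED and the two functions are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in that reports the size the helper macros would
 * pass on, since both expand to an expression using sizeof(buf). */
#define SKETCH_SIZE_PASSED(buf) (sizeof(buf))

/* With a real array, sizeof sees the whole buffer. */
size_t sketch_size_from_array(void)
{
   char buf[64];
   return SKETCH_SIZE_PASSED(buf);     /* 64 */
}

/* With a pointer (including an array received as a function
 * parameter), sizeof only sees the pointer itself. */
size_t sketch_size_from_pointer(char *buf)
{
   return SKETCH_SIZE_PASSED(buf);     /* sizeof(char *) */
}
```

This suggests using the macros only with buffers declared as arrays in the same scope, and calling uconvert() directly with an explicit size when all you have is a pointer.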

extern char empty_string[];
You can't just rely on "" to be a valid empty string in any encoding format. This global buffer contains a number of consecutive zeros, so it will be a valid empty string no matter whether the program is running in ASCII, Unicode, or UTF-8 mode.
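The reason is that the terminator is one character wide, not one byte wide. A stand-alone sketch of the idea follows; sketch_is_empty and sketch_empty are hypothetical names, and the buffer length of 4 is an assumption of this sketch, not the size of Allegro's empty_string.

```c
/* Hypothetical sketch: checks whether s starts with a terminator when
 * read with char_size bytes per character (1 for 8 bit formats, 2 for
 * 16 bit Unicode). The caller must own at least char_size bytes. */
int sketch_is_empty(const char *s, int char_size)
{
   int i;

   for (i = 0; i < char_size; i++)
      if (s[i] != 0)
         return 0;

   return 1;
}

/* Several zero bytes in a row, in the spirit of empty_string. The
 * length of 4 is an assumption of this sketch. */
const char sketch_empty[4] = { 0, 0, 0, 0 };
```

The literal "" only guarantees a single zero byte, so reading it as a 16 bit character would look at a second byte that does not belong to it. A buffer of consecutive zeros avoids that problem.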

int ugetc(char *s);
Low level helper function for reading Unicode text data. Given a pointer to a string in the current encoding format, it returns the next character from the string.

int ugetx(char **s);
Low level helper function for reading Unicode text data. Given the address of a pointer to a string in the current encoding format, it returns the next character from the string, and advances the pointer to the character after the one just read.
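The read-and-advance pattern is what makes it possible to walk a variable width string one character at a time. Below is a stand-alone sketch of that pattern for UTF-8 only. It is not Allegro's implementation: the function names are hypothetical, and the decoder assumes well-formed input and does no error checking.

```c
/* Stand-alone sketch of a ugetx()-style reader for UTF-8 only. It
 * assumes well-formed input and does no error checking. */
int sketch_utf8_getx(const char **s)
{
   const unsigned char *p = (const unsigned char *)*s;
   int c = *p++;
   int extra = 0;

   if (c >= 0x80) {
      /* Count the one bits after the top bit of the lead byte to find
       * the number of trailing bytes. */
      int mask = 0x40;
      while (c & mask) {
         extra++;
         mask >>= 1;
      }
      c &= mask - 1;            /* keep the payload bits of the lead byte */
      while (extra-- > 0)
         c = (c << 6) | (*p++ & 0x3F);
   }

   *s = (const char *)p;        /* leave the pointer after this character */
   return c;
}

/* Counts characters (not bytes) by reading until the terminator. */
int sketch_utf8_length(const char *s)
{
   int n = 0;

   while (sketch_utf8_getx(&s) != 0)
      n++;

   return n;
}
```

A loop of the same shape written with the real ugetx() would work in whichever encoding is currently selected, since the caller never needs to know how many bytes each character occupied.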

int usetc(char *s, int c);
Low level helper function for writing Unicode text data. It writes the specified character to the given address in the current encoding format, and returns the number of bytes written.

int uwidth(char *s);
Low level helper function for testing Unicode text data. It returns the number of bytes occupied by the first character of the specified string, in the current encoding format.

int ucwidth(int c);
Low level helper function for testing Unicode text data. It returns the number of bytes that would be occupied by the specified character value, when encoded in the current format.

int uisok(int c);
Low level helper function for testing Unicode text data. Tests whether the specified value can be correctly encoded in the current format.

int uoffset(char *s, int index);
Returns the offset in bytes from the start of the string to the character at the specified index. A zero index returns an offset of zero. If the index is negative, it counts backward from the end of the string, so an index of -1 will return an offset to the last character.

int ugetat(char *s, int index);
Returns the character value at the specified index within the string. A zero index parameter will return the first character of the string. If the index is negative, it counts backward from the end of the string, so an index of -1 will return the last character of the string.
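The difference between a character index and a byte offset, and the meaning of a negative index, can be seen in this stand-alone sketch for UTF-8 data. All three function names are hypothetical, the input is assumed to be well-formed, and stopping at the terminator for out-of-range indexes is an assumption of the sketch rather than documented behaviour.

```c
/* Byte width of the UTF-8 character starting at s, taken from its
 * lead byte. Assumes well-formed input. */
int sketch_width(const char *s)
{
   unsigned char c = (unsigned char)*s;
   int width = 1;

   if (c >= 0xC0) {
      int mask = 0x20;
      width = 2;
      while (c & mask) {
         width++;
         mask >>= 1;
      }
   }

   return width;
}

/* Character count of s, not including the terminator. */
int sketch_length(const char *s)
{
   int n = 0;

   for (; *s != 0; s += sketch_width(s))
      n++;

   return n;
}

/* uoffset()-style lookup: byte offset of the character at index,
 * where a negative index counts back from the end of the string. */
int sketch_offset(const char *s, int index)
{
   const char *p = s;

   if (index < 0) {
      index += sketch_length(s);   /* -1 becomes the last character */
      if (index < 0)
         index = 0;
   }

   while (index-- > 0 && *p != 0)
      p += sketch_width(p);

   return (int)(p - s);
}
```

For the UTF-8 string "h\xC3\xA9llo" (five characters in six bytes), the character at index 2 sits at byte offset 3, and index -1 refers to the final "o" at byte offset 5.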

int usetat(char *s, int index, int c);
Replaces the character at the specified index within the string with value c, handling any adjustments for variable width data (i.e. if c encodes to a different width than the previous value at that location). Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.
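The adjustment for variable width data amounts to sliding the tail of the string before writing the new character. Here is a stand-alone sketch for UTF-8, restricted to non-negative indexes inside the string and to character values below 0x10000. The names are hypothetical, and returning a negative number when the tail moves backward is an assumption of this sketch.

```c
#include <string.h>

/* Byte width of the UTF-8 character starting at s (well-formed input;
 * forms longer than four bytes are left out of this sketch). */
int sketch_char_width(const char *s)
{
   unsigned char c = (unsigned char)*s;

   if (c < 0xC0) return 1;
   if (c < 0xE0) return 2;
   if (c < 0xF0) return 3;
   return 4;
}

/* Encodes c (below 0x10000 in this sketch) into out, returning its width. */
int sketch_encode(int c, unsigned char *out)
{
   if (c < 0x80) {
      out[0] = (unsigned char)c;
      return 1;
   }
   if (c < 0x800) {
      out[0] = (unsigned char)(0xC0 | (c >> 6));
      out[1] = (unsigned char)(0x80 | (c & 0x3F));
      return 2;
   }
   out[0] = (unsigned char)(0xE0 | (c >> 12));
   out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
   out[2] = (unsigned char)(0x80 | (c & 0x3F));
   return 3;
}

/* usetat()-style replace. The caller must make sure the buffer has
 * room to grow. Returns the number of bytes the tail moved by
 * (negative if it moved backward, an assumption of this sketch). */
int sketch_setat(char *s, int index, int c)
{
   unsigned char enc[3];
   int new_width = sketch_encode(c, enc);
   int old_width;

   while (index-- > 0)
      s += sketch_char_width(s);

   old_width = sketch_char_width(s);

   /* Slide the tail, including the terminator, to its new position. */
   memmove(s + new_width, s + old_width, strlen(s + old_width) + 1);
   memcpy(s, enc, (size_t)new_width);

   return new_width - old_width;
}
```

Replacing the "e" of "hello" with the value 0xE9 grows the string by one byte, so the buffer holding it needs spare room beyond the original text.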

int uinsert(char *s, int index, int c);
Inserts the character c at the specified index within the string, sliding the rest of the data along to make room. Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.

int uremove(char *s, int index);
Removes the character at the specified index within the string, sliding the rest of the data back to fill the gap. Returns the number of bytes by which the trailing part of the string was moved. If the index is negative, it counts backward from the end of the string.

int ustrsize(char *s);
Returns the size of the specified string in bytes, not including the trailing zero.

int ustrsizez(char *s);
Returns the size of the specified string in bytes, including the trailing zero.

int utolower(int c);
int utoupper(int c);
int uisspace(int c);
int uisdigit(int c);
char * ustrdup(char *src);
char * ustrcpy(char *dest, char *src);
char * ustrcat(char *dest, char *src);
int ustrlen(char *s);
int ustrcmp(char *s1, char *s2);
char * ustrncpy(char *dest, char *src, int n);
char * ustrncat(char *dest, char *src, int n);
int ustrncmp(char *s1, char *s2, int n);
int ustricmp(char *s1, char *s2);
char * ustrlwr(char *s);
char * ustrupr(char *s);
char * ustrchr(char *s, int c);
char * ustrrchr(char *s, int c);
char * ustrstr(char *s1, char *s2);
char * ustrpbrk(char *s, char *set);
char * ustrtok(char *s, char *set);
double uatof(char *s);
long ustrtol(char *s, char **endp, int base);
double ustrtod(char *s, char **endp);
char * ustrerror(int err);
int uvsprintf(char *buf, char *format, va_list args);
int usprintf(char *buf, char *format, ...);

These all work like the equivalent ANSI C functions, but using whatever Unicode text format is currently selected. The size parameter to ustrncpy() and ustrncat() is given in bytes rather than characters (on the assumption that you will be using these routines to prevent overflowing the size of a memory buffer), while the size parameter to ustrncmp() is given in characters (because it doesn't make any sense for this to be in bytes). The usprintf() implementation complies with as much of the ANSI spec as I could remember when I wrote it, except that it doesn't support exponential notation for floating point values.



