
`<uninorm.h>`

This include file defines functions for transforming Unicode strings to one of the four normal forms, known as NFC, NFD, NFKC, NFKD. These transformations involve decomposition and, for NFC and NFKC, composition of Unicode characters.

The following enumerated values are the possible types of decomposition of a Unicode character.

__Constant:__ int **UC_DECOMP_CANONICAL** Denotes canonical decomposition.

__Constant:__ int **UC_DECOMP_FONT** UCD marker: `<font>`. Denotes a font variant (e.g. a blackletter form).

__Constant:__ int **UC_DECOMP_NOBREAK** UCD marker: `<noBreak>`. Denotes a no-break version of a space or hyphen.

__Constant:__ int **UC_DECOMP_INITIAL** UCD marker: `<initial>`. Denotes an initial presentation form (Arabic).

__Constant:__ int **UC_DECOMP_MEDIAL** UCD marker: `<medial>`. Denotes a medial presentation form (Arabic).

__Constant:__ int **UC_DECOMP_FINAL** UCD marker: `<final>`. Denotes a final presentation form (Arabic).

__Constant:__ int **UC_DECOMP_ISOLATED** UCD marker: `<isolated>`. Denotes an isolated presentation form (Arabic).

__Constant:__ int **UC_DECOMP_CIRCLE** UCD marker: `<circle>`. Denotes an encircled form.

__Constant:__ int **UC_DECOMP_SUPER** UCD marker: `<super>`. Denotes a superscript form.

__Constant:__ int **UC_DECOMP_SUB** UCD marker: `<sub>`. Denotes a subscript form.

__Constant:__ int **UC_DECOMP_VERTICAL** UCD marker: `<vertical>`. Denotes a vertical layout presentation form.

__Constant:__ int **UC_DECOMP_WIDE** UCD marker: `<wide>`. Denotes a wide (or zenkaku) compatibility character.

__Constant:__ int **UC_DECOMP_NARROW** UCD marker: `<narrow>`. Denotes a narrow (or hankaku) compatibility character.

__Constant:__ int **UC_DECOMP_SMALL** UCD marker: `<small>`. Denotes a small variant form (CNS compatibility).

__Constant:__ int **UC_DECOMP_SQUARE** UCD marker: `<square>`. Denotes a CJK squared font variant.

__Constant:__ int **UC_DECOMP_FRACTION** UCD marker: `<fraction>`. Denotes a vulgar fraction form.

__Constant:__ int **UC_DECOMP_COMPAT** UCD marker: `<compat>`. Denotes an otherwise unspecified compatibility character.

The following constant denotes the maximum size of decomposition of a single Unicode character.

__Macro:__ unsigned int **UC_DECOMPOSITION_MAX_LENGTH** This macro expands to a constant that is the required size of the buffer passed to the `uc_decomposition` and `uc_canonical_decomposition` functions.

The following functions decompose a Unicode character.

__Function:__ int **uc_decomposition** (ucs4_t `uc`, int *`decomp_tag`, ucs4_t *`decomposition`) Returns the character decomposition mapping of the Unicode character `uc`. `decomposition` must point to an array of at least `UC_DECOMPOSITION_MAX_LENGTH` `ucs4_t` elements.

When a decomposition exists, `decomposition`[0..`n`-1] and `*decomp_tag` are filled and `n` is returned. Otherwise -1 is returned.

__Function:__ int **uc_canonical_decomposition** (ucs4_t `uc`, ucs4_t *`decomposition`) Returns the canonical character decomposition mapping of the Unicode character `uc`. `decomposition` must point to an array of at least `UC_DECOMPOSITION_MAX_LENGTH` `ucs4_t` elements.

When a decomposition exists, `decomposition`[0..`n`-1] is filled and `n` is returned. Otherwise -1 is returned.

Note: This function returns the (simple) “canonical decomposition” of `uc`. If you want the “full canonical decomposition” of `uc`, that is, the recursive application of “canonical decomposition”, use the function `u*_normalize` with argument `UNINORM_NFD` instead.

The following function composes a Unicode character from two Unicode characters.

__Function:__ ucs4_t **uc_composition** (ucs4_t `uc1`, ucs4_t `uc2`) Attempts to combine the Unicode characters `uc1`, `uc2`. `uc1` is known to have canonical combining class 0.

Returns the combination of `uc1` and `uc2`, if it exists. Returns 0 otherwise.

Not all decompositions can be recombined using this function. See the Unicode file ‘`CompositionExclusions.txt`’ for details.

The Unicode standard defines four normalization forms for Unicode strings. The following type is used to denote a normalization form.

__Type:__ **uninorm_t** An object of type `uninorm_t` denotes a Unicode normalization form. This is a scalar type; its values can be compared with `==`.

The following constants denote the four normalization forms.

__Macro:__ uninorm_t **UNINORM_NFD** Denotes Normalization form D: canonical decomposition.

__Macro:__ uninorm_t **UNINORM_NFC** Normalization form C: canonical decomposition, then canonical composition.

__Macro:__ uninorm_t **UNINORM_NFKD** Normalization form KD: compatibility decomposition.

__Macro:__ uninorm_t **UNINORM_NFKC** Normalization form KC: compatibility decomposition, then canonical composition.

The following functions operate on `uninorm_t` objects.

__Function:__ bool **uninorm_is_compat_decomposing** (uninorm_t `nf`) Tests whether the normalization form `nf` does compatibility decomposition.

__Function:__ bool **uninorm_is_composing** (uninorm_t `nf`) Tests whether the normalization form `nf` includes canonical composition.

__Function:__ uninorm_t **uninorm_decomposing_form** (uninorm_t `nf`) Returns the decomposing variant of the normalization form `nf`. This maps NFC,NFD → NFD and NFKC,NFKD → NFKD.

The following functions apply a Unicode normalization form to a Unicode string.

__Function:__ uint8_t * **u8_normalize** (uninorm_t `nf`, const uint8_t *`s`, size_t `n`, uint8_t *`resultbuf`, size_t *`lengthp`)

__Function:__ uint16_t * **u16_normalize** (uninorm_t `nf`, const uint16_t *`s`, size_t `n`, uint16_t *`resultbuf`, size_t *`lengthp`)

__Function:__ uint32_t * **u32_normalize** (uninorm_t `nf`, const uint32_t *`s`, size_t `n`, uint32_t *`resultbuf`, size_t *`lengthp`)

Returns the specified normalization form of a string.

The `resultbuf` and `lengthp` arguments are as described in chapter Conventions.

The following functions compare Unicode strings, ignoring differences in normalization.

__Function:__ int **u8_normcmp** (const uint8_t *`s1`, size_t `n1`, const uint8_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

__Function:__ int **u16_normcmp** (const uint16_t *`s1`, size_t `n1`, const uint16_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

__Function:__ int **u32_normcmp** (const uint32_t *`s1`, size_t `n1`, const uint32_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

Compares `s1` and `s2`, ignoring differences in normalization.

`nf` must be either `UNINORM_NFD` or `UNINORM_NFKD`.

If successful, sets `*resultp` to -1 if `s1` < `s2`, 0 if `s1` = `s2`, 1 if `s1` > `s2`, and returns 0. Upon failure, returns -1 with `errno` set.

__Function:__ char * **u8_normxfrm** (const uint8_t *`s`, size_t `n`, uninorm_t `nf`, char *`resultbuf`, size_t *`lengthp`)

__Function:__ char * **u16_normxfrm** (const uint16_t *`s`, size_t `n`, uninorm_t `nf`, char *`resultbuf`, size_t *`lengthp`)

__Function:__ char * **u32_normxfrm** (const uint32_t *`s`, size_t `n`, uninorm_t `nf`, char *`resultbuf`, size_t *`lengthp`)

Converts the string `s` of length `n` to a NUL-terminated byte sequence, in such a way that comparing `u8_normxfrm (s1)` and `u8_normxfrm (s2)` with the `u8_cmp2` function is equivalent to comparing `s1` and `s2` with the `u8_normcoll` function.

`nf` must be either `UNINORM_NFC` or `UNINORM_NFKC`.

The `resultbuf` and `lengthp` arguments are as described in chapter Conventions.

__Function:__ int **u8_normcoll** (const uint8_t *`s1`, size_t `n1`, const uint8_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

__Function:__ int **u16_normcoll** (const uint16_t *`s1`, size_t `n1`, const uint16_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

__Function:__ int **u32_normcoll** (const uint32_t *`s1`, size_t `n1`, const uint32_t *`s2`, size_t `n2`, uninorm_t `nf`, int *`resultp`)

Compares `s1` and `s2`, ignoring differences in normalization, using the collation rules of the current locale.

`nf` must be either `UNINORM_NFC` or `UNINORM_NFKC`.

If successful, sets `*resultp` to -1 if `s1` < `s2`, 0 if `s1` = `s2`, 1 if `s1` > `s2`, and returns 0. Upon failure, returns -1 with `errno` set.

A “stream of Unicode characters” is essentially a function that accepts a `ucs4_t` argument repeatedly, optionally combined with a function that “flushes” the stream.
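To make the notion concrete, here is a minimal library-independent sketch of such a stream: a function plus an opaque data pointer that collects characters into a fixed-size array. The names `codepoint_sink` and `sink_put` are hypothetical, and `uint32_t` stands in for `ucs4_t` so that the sketch compiles without the library headers.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stream state: a small fixed-capacity array of characters. */
struct codepoint_sink {
    uint32_t items[16];
    size_t count;
};

/* The stream function: accepts one character per call.  Returns 0 if
   successful, or -1 with errno set when the sink is full.
   uint32_t stands in for ucs4_t here. */
static int sink_put(void *stream_data, uint32_t uc)
{
    struct codepoint_sink *sink = stream_data;

    if (sink->count >= sizeof sink->items / sizeof sink->items[0]) {
        errno = ENOMEM;
        return -1;
    }
    sink->items[sink->count++] = uc;
    return 0;
}
```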

__Type:__ **struct uninorm_filter** This is the data type of a stream of Unicode characters that normalizes its input according to a given normalization form and passes the normalized character sequence to the encapsulated stream of Unicode characters.

__Function:__ struct uninorm_filter * **uninorm_filter_create** (uninorm_t `nf`, int (*`stream_func`) (void *`stream_data`, ucs4_t `uc`), void *`stream_data`) Creates and returns a normalization filter for Unicode characters.

The pair (`stream_func`, `stream_data`) is the encapsulated stream. `stream_func` (`stream_data`, `uc`) receives the Unicode character `uc` and returns 0 if successful, or -1 with `errno` set upon failure.

Returns the new filter, or NULL with `errno` set upon failure.

__Function:__ int **uninorm_filter_write** (struct uninorm_filter *`filter`, ucs4_t `uc`) Stuffs a Unicode character into a normalizing filter. Returns 0 if successful, or -1 with `errno` set upon failure.

__Function:__ int **uninorm_filter_flush** (struct uninorm_filter *`filter`) Brings data buffered in the filter to its destination, the encapsulated stream.

Returns 0 if successful, or -1 with `errno` set upon failure.

Note! If after calling this function, additional characters are written into the filter, the resulting character sequence in the encapsulated stream will not necessarily be normalized.

__Function:__ int **uninorm_filter_free** (struct uninorm_filter *`filter`) Brings data buffered in the filter to its destination, the encapsulated stream, then closes and frees the filter.

Returns 0 if successful, or -1 with `errno` set upon failure.


This document was generated by *Daiki Ueno* on *May, 25 2018* using *texi2html 1.78a*.