The Unicode Standard encodes characters rather than glyphs, which is to say the idea rather than any particular representation. For example, the code point U+0041 always represents Latin Capital Letter A, no matter which font is used to render it. • Abstract Characters are the smallest meaningful units within a writing system • Coded Characters assign a name and a code point to identify each abstract character • Character Encoding Forms determine how coded characters are represented by computers • Character Encoding Schemes specify how encoded representations are serialized into bytes

Terminology

  • In Unicode, a grapheme ≠ a code point ≠ a byte:
    • Unicode unaware: traditionally programming languages such as C or Python assume that 1 byte = 1 grapheme. This is not the case for unicode.
  • Grapheme: a single unit of a human writing system
    • think of a grapheme as what would go on a single Scrabble tile.
    • similar to a character, but not exactly the same thing.
    • In Unicode, a grapheme is represented by one or more code points (the actual number representing the grapheme)
      • e.g. è can be represented as
        • 233: Latin small letter E with Acute or,
        • 101: Latin Small Letter E + 769: Combining Acute Accent Modifier
    • Grapheme size in bytes:
      • In ASCII 1 grapheme = 1 byte
      • In index, 1 grapheme ≠ 1 byte
  • Code point: A number assigned to grapheme1
    • These numbers must then be encoded into binary via an encoding strategy such as UTF-8, or UTF-32.
    • Code points are expressed in the form U+1234 where 1234 is the assigned hexadecimal number for that character.

Normalization (Comparing Unicode Strings)

See: Unicode equivalence - Wikipedia Remember that Unicode is not a one-to-one mapping to visual symbols. This is for a variety of reasons. One reason is that there can be more than one way unicode representation (code point) of a particular symbol (glyph). For example, è can be represented as one code point (“è” U+00E8 Latin Small Letter E with Grave) or two (“e” U+0065 Latin Small Letter E and “◌́” U+0301 Combining Acute Accent). Both of these code point combinations result in exactly the same glyph. In other words, it looks virtually identical to a human, but completely different to a computer. This creates a problem. For example, if you are implementing a search engine then you will only find search results for the results that match your exact code points.

This is why Unicode added normalization. See normal forms

Docs

Tools

Unicode Explorers

Unicode Analyzers

Unicode Transformations

Sources

Footnotes

Footnotes

  1. Don’t forget that each grapheme might be represented by more than one code point

2 items under this folder.