

ASCII, Unicode and EBCDIC

 

Definitions

I'll try to be consistent, but I'm a programmer, so I may get lazy. When I use these terms, this is what I mean:-

Term / What I mean
 Unicode  

The (successful) project building a collection of every character used digitally, worldwide. Unicode can support over 1,000,000 characters (1,114,112 code points, to be precise).

E.g. Unicode U+FF21 is "FULLWIDTH LATIN CAPITAL LETTER A", while plain capital A is U+0041, encoded in both ASCII and UTF-8 as the single byte 0x41 and in EBCDIC as 0xC1.

n.b. Capital A is encoded differently again in UTF-16 and UTF-32 (twice each: little-endian and big-endian versions).
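
If you have Python 3 to hand (my choice of language here, nothing above requires it), you can check those byte values yourself; cp037 is one common EBCDIC code page, picked as an assumption:

    # Same capital A, different bytes in different encodings
    print('A'.encode('ascii'))       # b'A'  (one byte, 0x41)
    print('A'.encode('utf-8'))       # b'A'  (same single byte 0x41)
    print('A'.encode('cp037'))       # b'\xc1'  (EBCDIC)
    print('\uFF21'.encode('utf-8'))  # b'\xef\xbc\xa1'  (FULLWIDTH LATIN CAPITAL LETTER A, 3 bytes)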

 UTF-8

An encoding of all Unicode characters. The WWW pretty much runs on this, and the most common characters match ASCII byte-for-byte. There are no byte order (BE versus LE) issues. I am a BIG fan.

On one hand, it can encode all 154,998 characters (and rising) currently defined in Unicode, including combining characters, and on up to the full range of over 1,000,000 code points, using a variable-width encoding of 1 to 4 bytes per code point.

On the other hand, web pages, program source code and English/European/American text use only 1 byte of UTF-8 for the most common characters (also found in the old 7-bit US-ASCII standard), so it saves space and network bandwidth.
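
A quick way to see the 1-to-4-byte widths, again using Python 3 (my assumption):

    # UTF-8 widths: ASCII in 1 byte, up to 4 bytes for the highest code points
    for ch in ('A', '£', '€', '😀'):
        print(ch, len(ch.encode('utf-8')), 'byte(s)')
    # prints: A 1, £ 2, € 3, 😀 4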

 UTF-16

A silly encoding of all Unicode characters (I am heavily biased). It was developed from the old, fixed-width UCS-2 (used in some earlier versions of Windows, for example), when it was obvious 16 bits weren't enough. UTF-16 is variable width, but it uses either 2 or 4 bytes per character (so ordinary source code, web pages and text take more space and bandwidth) and it has byte order (BE versus LE) issues. Files using UTF-16 need a Byte Order Mark (BOM), because (surprise, surprise) big-endian machines need to read little-endian files, and vice versa.
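
To watch the 2-or-4-byte behaviour and the BOM in action (a Python 3 sketch; the BOM shown assumes a little-endian machine):

    # 'utf-16' adds a BOM; 'utf-16-le' / 'utf-16-be' do not
    print('A'.encode('utf-16-le'))   # b'A\x00' - 2 bytes
    print('😀'.encode('utf-16-le'))  # 4 bytes - a surrogate pair
    print('A'.encode('utf-16'))      # BOM (FF FE here) followed by the same 2 bytes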

 UTF-32

Another encoding of all Unicode characters. It uses a fixed width of 4 bytes per character (so ordinary source code, web pages and text take A LOT more space and bandwidth) and it has byte order (BE versus LE) issues. But the fixed 4-byte width means a program can jump straight to the Nth code point in a string, which can improve performance when processing UTF-32 string data. Files using UTF-32 need a Byte Order Mark (BOM), for the same reasons UTF-16 files do.

Practically speaking, it is the same as the old UCS-4 standard.
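
And the fixed-width equivalent, still assuming Python 3:

    # UTF-32: always 4 bytes per code point, plus an optional BOM
    print('A'.encode('utf-32-be'))        # b'\x00\x00\x00A' - 4 bytes even for plain A
    print('😀'.encode('utf-32-be'))       # b'\x00\x01\xf6\x00' - still 4 bytes
    print(len('ABC'.encode('utf-32')))    # 16: a 4-byte BOM plus 3 x 4 bytes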

 US-ASCII

An ancient encoding (1960s) that uses only 7 bits per byte, giving just the Latin letters A-Z and a-z, the digits 0123456789 and punctuation such as #~'-_+$"%&()<>[]{}/\|=*!^,.;? Most program source code is written in these characters.

ASCII suited American users better than anyone else, and it soon got extended in various ways that defined the top characters (byte values 128 to 255 decimal) with national variations such as £ (the UK Pound Sterling sign), accented characters etc.

n.b. If your printer and PC were using different code pages, you soon found out the hard way :-)
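
A small Python 3 illustration of the 7-bit limit (the language is my choice, not part of the standard):

    # ASCII only covers code points 0-127; anything else refuses to encode
    print(hex(ord('A')))                     # 0x41
    print('Hello, world!'.encode('ascii'))   # fine - all 7-bit characters
    try:
        '£'.encode('ascii')
    except UnicodeEncodeError as e:
        print('not ASCII:', e.reason)        # ordinal not in range(128)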

 EBCDIC

An ancient IBM encoding (1960s again) that uses 8 bits per byte, giving the same basic a-z, A-Z, 0-9 and mostly similar punctuation to ASCII (but not in the same order!), plus a few extras such as ¬. IBM mainframes still read both ASCII/UTF-8 and EBCDIC, and I'll use either quite happily :-)
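
Python 3 ships codecs for several EBCDIC code pages; cp037 (US/Canada) is assumed below, since the entry above doesn't name one:

    # EBCDIC puts letters and digits in quite different places from ASCII
    print('A'.encode('cp037'))   # b'\xc1' - not 0x41
    print('a'.encode('cp037'))   # b'\x81'
    print('0'.encode('cp037'))   # b'\xf0' - digits start at 0xF0, not 0x30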

 grapheme or character  

What we think of when somebody says "the character A". You choose the typeface and size :-)
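
One grapheme isn't always one code point, which is part of why the distinction matters; a short Python 3 sketch using the standard unicodedata module:

    import unicodedata
    one = '\u00E9'      # é as a single precomposed code point
    two = 'e\u0301'     # e plus COMBINING ACUTE ACCENT - same grapheme, two code points
    print(len(one), len(two))                          # 1 2
    print(unicodedata.normalize('NFC', two) == one)    # True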

 glyph  

How a grapheme (character) gets shown in a particular typeface and size (font)

 typeface

Basically, a collection of fonts in the same typestyle, but with different sizes and options for bold, italic etc. Arial Bold, for example, is part of the Arial typeface.

 font

A typeface in a particular size