Characters

Scripts, characters, glyphs, letters, and cases

A script is a system of writing, that is, a way of representing and recording utterances of some language in written form. A character is a component of a script, a repeatable unit that may appear in forms that differ visually but are recognized by users of the writing system as depictions of the same linguistic constituent. These various visual depictions are called glyphs to distinguish them from characters; for instance, a boldface letter p and an italic letter p are different glyphs depicting the same character.

Some scripts are alphabetic; they contain a small number of heavily used characters (called letters) that roughly represent the constituent sounds of utterances rather than their meanings. In alphabetic scripts, glyphs are sometimes grouped into cases -- letter variants that are conventionally used in different printed contexts. For example, capital letters form a case in English; H and h are depictions of the same letter, but the first is in upper case and the second in lower case.

The question of whether H and h are instances of the same character is somewhat arbitrary, and computer programmers have usually found it convenient to treat them as distinct characters.

Character codes

When a character is stored in a computer, it must be represented as a sequence of bits, just like any other datum. In the early days of computing, each equipment manufacturer developed one or more ``character codes'' of its own, so that, for example, the capital letter A was represented by the sequence 110001 on an IBM 1401 computer, by 000001 on a Control Data 6600, by 11000001 on an IBM 360, and so on. This made it troublesome to transfer character data from one computer to another, since it was necessary to convert each character from the source machine's encoding to the target machine's encoding. The difficulty was compounded by the fact that different manufacturers supported different characters; all provided the twenty-six capital letters used in writing English and the ten digits used in writing Arabic numerals, but there was much variation in the selection of mathematical symbols, punctuation marks, etc.

ASCII

In 1963, a number of manufacturers agreed to use the American Standard Code for Information Interchange (ASCII), which is now by far the most widely used character code. It includes representations for ninety-four characters selected from American and Western European text, commercial, and technical scripts: the twenty-six English letters in both upper and lower case, the ten digits, sixteen standard punctuation marks used in typescript (the period, the comma, the colon, the semicolon, the question mark, the exclamation point, the hyphen, the apostrophe, the double quotation mark, left and right parentheses, left and right square brackets, the virgule or slash, the asterisk, and the underscore), some mathematical symbols (the plus sign, the equals sign, the less-than sign, the greater-than sign, left and right braces), some commercial symbols (the at-sign, the mesh or pound-sign, the dollar sign, the percent sign), and a few accents and miscellaneous characters (the grave accent, the circumflex, the tilde, the ampersand, the vertical bar, and the backslash).

ASCII also reserves a bit sequence for a ``space'' character, usually depicted by a blank (a glyph that doesn't include any marks), and thirty-three bit sequences for so-called ``control characters,'' which have various implementation-dependent effects on printing and display devices.

Each character or control character is represented by a sequence of exactly seven bits, and every sequence of seven bit values represents a different character or control character. It is common to read these bit sequences as binary numerals -- for instance, to read 1000001, which is the ASCII sequence representing the capital letter A, as sixty-five, which is the value of that bit sequence considered as a binary numeral. In Pascal, this is reflected in the fact that on a machine using ASCII characters, the value of the expression Ord ('A') is 65.
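
As a minimal sketch of this reading of bit sequences as binary numerals, the following program writes the seven bits of a character next to the value that Ord reports for it. It assumes an implementation that uses ASCII; the procedure name WriteSevenBits is introduced here only for illustration.

```pascal
program ShowBits (output);

  { Writes the seven-bit representation of Ch, leftmost bit first.
    Assumption: the implementation uses ASCII, so Ord (Ch) < 128. }
  procedure WriteSevenBits (Ch: Char);
  var
    Code, Weight: Integer;
  begin
    Code := Ord (Ch);
    Weight := 64;                 { the value of the leftmost of seven bits }
    while Weight > 0 do begin
      if Code >= Weight then begin
        write ('1');
        Code := Code - Weight
      end
      else
        write ('0');
      Weight := Weight div 2
    end
  end;

begin
  WriteSevenBits ('A');
  writeln (' = ', Ord ('A'): 1)   { on an ASCII machine: 1000001 = 65 }
end.
```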

When an ASCII character is stored in the memory of a computer, it is usually given one byte (eight bits) of storage. Since the character is completely fixed by its seven-bit representation, this leaves a bit free for some other purpose. Often the extra bit is simply turned off and ignored. Alternatively, it can be used as a way of checking that the hardware is storing and recovering characters accurately -- for instance, by turning on the extra bit if, and only if, the number of 1 bits in the ASCII representation of the character to be stored is even. This convention ensures that the byte in which the character is stored will always have an odd number of 1 bits in it; if such a byte is ever found to have an even number of 1 bits, an error has occurred. When the extra bit is used in this way, it is called a ``parity bit'' (because it is adjusted to ensure a specified parity -- evenness or oddness -- of the number of 1 bits in the byte), and the character is said to be stored as ``seven bits, odd parity.'' If the parity bit is set in the opposite way -- turned on if the number of 1 bits in the ASCII representation of the character is odd -- the character is stored as ``seven bits, even parity.'' The phrase ``seven bits, no parity'' means that the extra bit is always turned off and ignored.
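
The odd-parity convention can be sketched as follows. The sketch assumes an ASCII implementation and assumes that the extra bit is the leftmost bit of the byte, so that turning it on adds 128 to the value of the byte; the function names OneBits and WithOddParity are hypothetical.

```pascal
program ParityDemo (output);

  { Counts the 1 bits in the seven-bit code of Ch (assumes ASCII). }
  function OneBits (Ch: Char): Integer;
  var
    Code, Count: Integer;
  begin
    Code := Ord (Ch);
    Count := 0;
    while Code > 0 do begin
      Count := Count + Code mod 2;
      Code := Code div 2
    end;
    OneBits := Count
  end;

  { Returns the value of the byte that stores Ch as "seven bits, odd
    parity": the extra bit (assumed to be the leftmost, worth 128) is
    turned on if, and only if, the code has an even number of 1 bits. }
  function WithOddParity (Ch: Char): Integer;
  begin
    if OneBits (Ch) mod 2 = 0 then
      WithOddParity := Ord (Ch) + 128
    else
      WithOddParity := Ord (Ch)
  end;

begin
  writeln (WithOddParity ('A'): 1);  { A is 1000001, two 1 bits: 193 }
  writeln (WithOddParity ('C'): 1)   { C is 1000011, three 1 bits: 67 }
end.
```

Swapping the two branches of the if statement would give the ``seven bits, even parity'' convention instead.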

Still another possibility is to use all the possible bit sequences that will fit into the eight-bit byte to represent characters, 256 of them in all. Typically, those in which the first bit is 0 are the usual ASCII characters, while those in which the first bit is 1 are extras -- characters from European languages other than English, the symbols for cents, pounds, and yen, subscripts and superscripts, printers' marks, common fractions, and additional control characters. Various manufacturers have developed different eight-bit extensions of ASCII, thus restoring the chaos that ASCII was designed to eliminate. Characters stored in this way are ``eight bits, no parity.'' (In some machines, the hardware provides a ninth bit in each byte, which is used only for error detection through parity control, and on such machines it is possible to have ``eight bits, even parity'' or ``eight bits, odd parity'' representations.)

Unicode

Even the extended versions of ASCII support only a few of the world's scripts, and rather inadequately at that. A more recently devised code, the Unicode Worldwide Character Standard, currently defines 34168 characters and supports Arabic, Armenian, Bengali, Bopomofo, Cyrillic, Devanagari, Georgian, Greek, Gujarati, Gurmukhi, Han, Hangul, Hebrew, Hiragana, Kannada, Katakana, Latin, Lao, Malayalam, Oriya, Tamil, Telugu, and Thai scripts, the International Phonetic Alphabet, Optical Character Recognition symbols, and the special character set used for the APL programming language, as well as a large number of miscellaneous numerical, mathematical, musical, astronomical, religious, technical, and printers' symbols and pieces of diagrams and geometric shapes. (In this list, the writing system used for English is characterized as the ``Latin'' script because most of the letters of its alphabet were initially used to write Latin.)

Unicode uses a sequence of sixteen bits for each character, allowing for two to the sixteenth power (that is, 65536) codes altogether. Thus a number of bit sequences are still unassigned and may, in future versions of Unicode, be allocated for some of the numerous scripts that are not yet supported. The designers are actively working on Burmese, Cherokee, Cree, Ethiopic, Khmer, Maldivian, Mongolian, Moso, Pahawh Hmong, Rong, Sinhalese, Tai Lu, Tai Mau, Tibetan, Tifinagh, and Yi.

In a sense, ASCII is contained in Unicode: By adding nine 0 bits at the beginning of any ASCII bit sequence, one obtains a Unicode bit sequence that represents the same character. For instance, the Unicode representation of a capital A is 0000000001000001.
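
The padding step can be sketched in a few lines, again assuming an ASCII implementation. The program writes nine 0 bits followed by the seven ASCII bits of a character; WriteSevenBits is a hypothetical helper, not a standard procedure.

```pascal
program AsciiInUnicode (output);

var
  Position: Integer;

  { Writes the seven-bit representation of Ch, leftmost bit first.
    Assumption: the implementation uses ASCII, so Ord (Ch) < 128. }
  procedure WriteSevenBits (Ch: Char);
  var
    Code, Weight: Integer;
  begin
    Code := Ord (Ch);
    Weight := 64;
    while Weight > 0 do begin
      if Code >= Weight then begin
        write ('1');
        Code := Code - Weight
      end
      else
        write ('0');
      Weight := Weight div 2
    end
  end;

begin
  for Position := 1 to 9 do       { the nine leading 0 bits }
    write ('0');
  WriteSevenBits ('A');           { then the ASCII bits: 1000001 }
  writeln                         { together: 0000000001000001 }
end.
```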

A Unicode character stored in the memory of a computer occupies two adjacent eight-bit bytes. Depending on the manufacturer of the computer, the eight bits that are stored into the byte that has the smaller machine address may be the first eight bits or the last eight bits. Unicode requires that the convention that is adopted be the same one that is used for the storage of sixteen-bit unsigned integers on the same machine.
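
To illustrate the two storage conventions, the sketch below splits a sixteen-bit code into its first eight bits and its last eight bits and writes them in the order in which they would sit in memory. The procedure name and its Boolean parameter are hypothetical. The example code 913 (hexadecimal 0391) is the Unicode value of the Greek capital letter alpha.

```pascal
program ByteOrder (output);

  { Writes the values of the two bytes that hold a sixteen-bit code,
    the byte at the smaller machine address first.
    Assumption: 0 <= Code <= MaxInt, so codes above MaxInt would need
    a wider integer type on implementations with a small Integer. }
  procedure WriteStoredBytes (Code: Integer; FirstBitsFirst: Boolean);
  var
    FirstEight, LastEight: Integer;
  begin
    FirstEight := Code div 256;   { the first (leftmost) eight bits }
    LastEight := Code mod 256;    { the last (rightmost) eight bits }
    if FirstBitsFirst then
      writeln (FirstEight: 1, ' ', LastEight: 1)
    else
      writeln (LastEight: 1, ' ', FirstEight: 1)
  end;

begin
  WriteStoredBytes (913, true);   { prints 3 145 }
  WriteStoredBytes (913, false)   { prints 145 3 }
end.
```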

Both ASCII and Unicode implicitly define an ordering of the characters: One character precedes another in the ordering if the number expressed by its bit sequence, considered as a binary numeral, is less than the number expressed by the other's. As long as only letters of the English alphabet in a single case are considered, both ASCII and Unicode match the traditional alphabetical order, but this is not true more generally; for instance, every upper-case letter precedes every lower-case letter.
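
A small check of the code ordering, assuming an ASCII implementation: the capital letter B has a smaller code than the small letter a, so it comes first in the code ordering even though a comes first in the alphabet.

```pascal
program CodeOrder (output);
begin
  { The codes themselves; on an ASCII machine this prints 66 97. }
  writeln (Ord ('B'): 1, ' ', Ord ('a'): 1);
  { Pascal compares characters by their codes, so this is true under ASCII. }
  writeln ('B' < 'a')
end.
```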

Characters as an abstract data type

No single set of characters is universally accepted and well adapted to all purposes. Because of its widespread use, I'll take the seven-bit ASCII character set as the basis for my description of characters as an abstract data type.

It is convenient to have a name for each of the possible character values; the following names are derived from those applied to ASCII characters in the Unicode standard:

0000000 null
0000001 start-of-heading
0000010 start-of-text
0000011 end-of-text
0000100 end-of-transmission
0000101 enquiry
0000110 acknowledge
0000111 bell
0001000 backspace
0001001 horizontal-tabulation
0001010 line-feed
0001011 vertical-tabulation
0001100 form-feed
0001101 carriage-return
0001110 shift-out
0001111 shift-in
0010000 data-link-escape
0010001 device-control-one
0010010 device-control-two
0010011 device-control-three
0010100 device-control-four
0010101 negative-acknowledge
0010110 synchronous-idle
0010111 end-of-transmission-block
0011000 cancel
0011001 end-of-medium
0011010 substitute
0011011 escape
0011100 file-separator
0011101 group-separator
0011110 record-separator
0011111 unit-separator
0100000 space
0100001 exclamation-mark
0100010 quotation-mark
0100011 number-sign
0100100 dollar-sign
0100101 percent-sign
0100110 ampersand
0100111 apostrophe
0101000 left-parenthesis
0101001 right-parenthesis
0101010 asterisk
0101011 plus-sign
0101100 comma
0101101 hyphen-minus
0101110 full-stop
0101111 solidus
0110000 digit-zero
0110001 digit-one
0110010 digit-two
0110011 digit-three
0110100 digit-four
0110101 digit-five
0110110 digit-six
0110111 digit-seven
0111000 digit-eight
0111001 digit-nine
0111010 colon
0111011 semicolon
0111100 less-than-sign
0111101 equals-sign
0111110 greater-than-sign
0111111 question-mark
1000000 commercial-at
1000001 capital-letter-a
1000010 capital-letter-b
1000011 capital-letter-c
1000100 capital-letter-d
1000101 capital-letter-e
1000110 capital-letter-f
1000111 capital-letter-g
1001000 capital-letter-h
1001001 capital-letter-i
1001010 capital-letter-j
1001011 capital-letter-k
1001100 capital-letter-l
1001101 capital-letter-m
1001110 capital-letter-n
1001111 capital-letter-o
1010000 capital-letter-p
1010001 capital-letter-q
1010010 capital-letter-r
1010011 capital-letter-s
1010100 capital-letter-t
1010101 capital-letter-u
1010110 capital-letter-v
1010111 capital-letter-w
1011000 capital-letter-x
1011001 capital-letter-y
1011010 capital-letter-z
1011011 left-square-bracket
1011100 reverse-solidus
1011101 right-square-bracket
1011110 circumflex-accent
1011111 low-line
1100000 grave-accent
1100001 small-letter-a
1100010 small-letter-b
1100011 small-letter-c
1100100 small-letter-d
1100101 small-letter-e
1100110 small-letter-f
1100111 small-letter-g
1101000 small-letter-h
1101001 small-letter-i
1101010 small-letter-j
1101011 small-letter-k
1101100 small-letter-l
1101101 small-letter-m
1101110 small-letter-n
1101111 small-letter-o
1110000 small-letter-p
1110001 small-letter-q
1110010 small-letter-r
1110011 small-letter-s
1110100 small-letter-t
1110101 small-letter-u
1110110 small-letter-v
1110111 small-letter-w
1111000 small-letter-x
1111001 small-letter-y
1111010 small-letter-z
1111011 left-curly-bracket
1111100 vertical-line
1111101 right-curly-bracket
1111110 tilde
1111111 delete

Here are the operations that I propose for characters as an abstract data type:

size-of-character-set
Inputs: none.
Output: result, a natural number.
Preconditions: none.
Postcondition: result is 128, the number of characters in the seven-bit ASCII character set.

equal
Inputs: left-operand and right-operand, both characters.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if the two inputs are the same character (specifically, if their character codes are identical), false if they are different characters.

alphabetic
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a letter, false if it is not.

uppercase
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is an upper-case letter, false if it is not.

lowercase
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a lower-case letter, false if it is not.

numeric
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a decimal digit, false if it is not.

control
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a control character, false if it is not.

precedes
Inputs: left-operand and right-operand, both characters.
Output: result, a Boolean.
Precondition: Both inputs are alphabetic.
Postcondition: result is true if the first input comes before the second in the alphabet, false if it comes after or if the two letters are the same. Case is disregarded (so that, for example, d precedes F).

upcase
Input: ch, a character.
Output: result, a character.
Preconditions: none.
Postcondition: If ch is a lower-case letter, result is the upper-case version of the same letter; otherwise, result is ch.

downcase
Input: ch, a character.
Output: result, a character.
Preconditions: none.
Postcondition: If ch is an upper-case letter, result is the lower-case version of the same letter; otherwise, result is ch.

encode
Input: ch, a character.
Output: code, a natural number.
Preconditions: none.
Postcondition: When the bit sequence that represents ch is interpreted as a binary numeral, its value is code.

decode
Input: code, a natural number.
Output: ch, a character.
Preconditions: code is less than the value returned by size-of-character-set.
Postcondition: When the bit sequence that represents ch is interpreted as a binary numeral, its value is code.

read
Input: source, a data source (e.g., a file, the keyboard, a device).
Output: legend, a character.
Preconditions: source can supply another character on demand.
Postcondition: A character has been extracted from source, and legend is that character.

peek
Input: source, a data source (e.g., a file, the keyboard, a device).
Output: legend, a character.
Preconditions: source can supply another character on demand.
Postcondition: legend is the character that source will supply next on demand.

write
Inputs: target, a data sink (e.g., a file, a window, a device), and scribend, a character.
Outputs: none.
Preconditions: none.
Postcondition: The character scribend has been appended to target.
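
The encode and decode operations specified above are inverses of each other over the whole character set. As a sketch, they can be written as thin wrappers around Pascal's Ord and Chr functions and checked for every code. The wrapper names Encode and Decode are illustrative, and the program assumes an implementation in which Chr accepts every code from 0 through 127.

```pascal
program EncodeDecode (output);

const
  SizeOfCharacterSet = 128;

var
  Code: Integer;
  RoundTripOK: Boolean;

  function Encode (Ch: Char): Integer;
  begin
    Encode := Ord (Ch)
  end;

  { Precondition: 0 <= Code < SizeOfCharacterSet. }
  function Decode (Code: Integer): Char;
  begin
    Decode := Chr (Code)
  end;

begin
  { Decoding a code and encoding the result should give the code back. }
  RoundTripOK := true;
  for Code := 0 to SizeOfCharacterSet - 1 do
    if Encode (Decode (Code)) <> Code then
      RoundTripOK := false;
  writeln (RoundTripOK)           { expected to be true }
end.
```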

Characters in Pascal

The Pascal standard makes very few assumptions about the values in its Char data type. It presupposes that the ten digits used in Arabic numerals are characters that occupy adjacent positions in the ordering of the characters, starting with 0 and continuing in ascending order to 9. It presupposes that the apostrophe is a character. It states that if there are representations of the capital letters A through Z, they must be arranged in the traditional alphabetical order but need not be adjacent; this accommodates yet another code, the Extended Binary Coded Decimal Interchange Code (EBCDIC), that is still used on some IBM mainframe computers and features ``gaps'' between the representations of I and J and between R and S. Pascal imposes the same requirement on the representations of the lower-case letters a through z, if they are present, but does not posit any particular relationship between the representations of upper- and lower-case letters or between letters and digits. Both ASCII and Unicode are straightforwardly consistent with the Pascal standard.
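
Because the digits are guaranteed to be adjacent and in ascending order, the numeric value of a digit character can be computed portably by subtraction, as in the sketch below (the function name ValueOfDigit is hypothetical). The analogous computation for letters is not portable, since the standard allows gaps between letters, as in EBCDIC.

```pascal
program DigitValue (output);

  { Precondition: Ch is one of the digits '0' through '9'. }
  function ValueOfDigit (Ch: Char): Integer;
  begin
    ValueOfDigit := Ord (Ch) - Ord ('0')
  end;

begin
  writeln (ValueOfDigit ('7'): 1)   { prints 7 }
end.
```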

Pascal provides a total of twelve built-in operations involving Char values: the six comparison operations (=, <, >, <=, >=, <>), the Pred and Succ functions, input and output procedures, and the transfer functions Ord and Chr that correlate characters with the integer values of their bit-sequence representations. In general, these built-in operations do their work by performing arithmetic on those integer values. This implies that character comparisons are case-sensitive: If an implementation of Pascal supports both upper- and lower-case letters, they will be counted as distinct.
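
A few of these built-in operations are exercised in the sketch below. The results noted in the comments assume a character set, such as ASCII, in which c, d, and e occupy adjacent positions.

```pascal
program BuiltInCharOps (output);
var
  Ch: Char;
begin
  Ch := 'd';
  writeln (Succ (Ch));            { the next character: e }
  writeln (Pred (Ch));            { the preceding character: c }
  writeln (Chr (Ord (Ch) + 1));   { arithmetic on the code, again e }
  writeln (Ch = 'D')              { false: d and D are distinct characters }
end.
```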

Standard Pascal provides names, in the form of literal constants, for all of the graphic characters, but not for any of the control characters. It does not provide the size-of-character-set operation in any form, nor any of the predicates alphabetic, uppercase, lowercase, numeric, or control, nor the case-insensitive comparison operation precedes, nor the case-conversion operations upcase and downcase. (The peek operation is provided, though, in the form of the ^ operation on files; Source^ is the character that would result from the application of peek to Source.)

Supplying the missing operations

However, Pascal is an extensible language, and it is easy to provide a library of character functions that fill in all of the gaps just mentioned:

{ This file contains a collection of character functions to complete the
  built-in char type in Pascal and to bring it into conformity with an
  abstract data type specification for characters developed in the handout

   http://www.math.grin.edu/~stone/courses/fundamentals/characters.html

  Programmer: John Stone, Grinnell College.
  Date of this version: July 24, 1996. }

const
  SizeOfCharacterSet = 128;
    { the number of characters in the seven-bit ASCII character set }

  { The first group of functions simply construct and return characters for
    which standard Pascal provides no literals.  It would be better if one
    could define these as constants, but there is no way to do this in
    standard Pascal (and, in practice, no way to do it portably in
    non-standard Pascal). }

  function Null: Char;
  begin
    Null := Chr (0);
  end;

  function StartOfHeading: Char;
  begin
    StartOfHeading := Chr (1);
  end;

  function StartOfText: Char;
  begin
    StartOfText := Chr (2);
  end;

  function EndOfText: Char;
  begin
    EndOfText := Chr (3);
  end;

  function EndOfTransmission: Char;
  begin
    EndOfTransmission := Chr (4);
  end;

  function Enquiry: Char;
  begin
    Enquiry := Chr (5);
  end;

  function Acknowledge: Char;
  begin
    Acknowledge := Chr (6);
  end;

  function Bell: Char;
  begin
    Bell := Chr (7);
  end;

  function Backspace: Char;
  begin
    Backspace := Chr (8);
  end;

  function HorizontalTabulation: Char;
  begin
    HorizontalTabulation := Chr (9);
  end;

  function LineFeed: Char;
  begin
    LineFeed := Chr (10);
  end;

  function VerticalTabulation: Char;
  begin
    VerticalTabulation := Chr (11);
  end;

  function FormFeed: Char;
  begin
    FormFeed := Chr (12);
  end;

  function CarriageReturn: Char;
  begin
    CarriageReturn := Chr (13);
  end;

  function ShiftOut: Char;
  begin
    ShiftOut := Chr (14);
  end;

  function ShiftIn: Char;
  begin
    ShiftIn := Chr (15);
  end;

  function DataLinkEscape: Char;
  begin
    DataLinkEscape := Chr (16);
  end;

  function DeviceControlOne: Char;
  begin
    DeviceControlOne := Chr (17);
  end;

  function DeviceControlTwo: Char;
  begin
    DeviceControlTwo := Chr (18);
  end;

  function DeviceControlThree: Char;
  begin
    DeviceControlThree := Chr (19);
  end;

  function DeviceControlFour: Char;
  begin
    DeviceControlFour := Chr (20);
  end;

  function NegativeAcknowledge: Char;
  begin
    NegativeAcknowledge := Chr (21);
  end;

  function SynchronousIdle: Char;
  begin
    SynchronousIdle := Chr (22);
  end;

  function EndOfTransmissionBlock: Char;
  begin
    EndOfTransmissionBlock := Chr (23);
  end;

  function Cancel: Char;
  begin
    Cancel := Chr (24);
  end;

  function EndOfMedium: Char;
  begin
    EndOfMedium := Chr (25);
  end;

  function Substitute: Char;
  begin
    Substitute := Chr (26);
  end;

  function Escape: Char;
  begin
    Escape := Chr (27);
  end;

  function FileSeparator: Char;
  begin
    FileSeparator := Chr (28);
  end;

  function GroupSeparator: Char;
  begin
    GroupSeparator := Chr (29);
  end;

  function RecordSeparator: Char;
  begin
    RecordSeparator := Chr (30);
  end;

  function UnitSeparator: Char;
  begin
    UnitSeparator := Chr (31);
  end;

  function Delete: Char;
  begin
    Delete := Chr (127);
  end;

  { The next five functions test attributes of characters. }

  function Alphabetic (Ch: Char): Boolean;
  begin
    Alphabetic := (('A' <= Ch) and (Ch <= 'Z')) or
                  (('a' <= Ch) and (Ch <= 'z'))
  end;

  function Uppercase (Ch: Char): Boolean;
  begin
    Uppercase := ('A' <= Ch) and (Ch <= 'Z')
  end;

  function Lowercase (Ch: Char): Boolean;
  begin
    Lowercase := ('a' <= Ch) and (Ch <= 'z')
  end;

  function Numeric (Ch: Char): Boolean;
  begin
    Numeric := ('0' <= Ch) and (Ch <= '9')
  end;

  function Control (Ch: Char): Boolean;
  begin
    Control := (Ch < ' ') or (Ch = Delete)
  end;

  { The Upcase and Downcase functions yield upper-case and lower-case
    versions of letters, returning other characters unchanged. }

  function Upcase (Ch: Char): Char;
  const
    CaseSeparation = 32;
      { the distance between upper-case letters and their lower-case
        counterparts in the ASCII character set }
  begin
    if Lowercase (Ch) then
      Upcase := Chr (Ord (Ch) - CaseSeparation)
    else
      Upcase := Ch
  end;

  function Downcase (Ch: Char): Char;
  const
    CaseSeparation = 32;
      { the distance between upper-case letters and their lower-case
        counterparts in the ASCII character set }
  begin
    if Uppercase (Ch) then
      Downcase := Chr (Ord (Ch) + CaseSeparation)
    else
      Downcase := Ch
  end;

  { The Precedes function determines whether one letter precedes another
    alphabetically, ignoring case. }

  function Precedes (LeftOperand, RightOperand: Char): Boolean;
  begin
    { Assert (Alphabetic (LeftOperand) and Alphabetic (RightOperand)); }
    Precedes := Downcase (LeftOperand) < Downcase (RightOperand)
  end;
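
As a usage sketch, the following self-contained program repeats three of the functions from the library above and contrasts Precedes with Pascal's built-in, case-sensitive comparison. Like the library itself, it assumes an ASCII implementation; the program name is arbitrary.

```pascal
program PrecedesDemo (output);

  function Uppercase (Ch: Char): Boolean;
  begin
    Uppercase := ('A' <= Ch) and (Ch <= 'Z')
  end;

  function Downcase (Ch: Char): Char;
  const
    CaseSeparation = 32;
      { the distance between upper-case letters and their lower-case
        counterparts in the ASCII character set }
  begin
    if Uppercase (Ch) then
      Downcase := Chr (Ord (Ch) + CaseSeparation)
    else
      Downcase := Ch
  end;

  function Precedes (LeftOperand, RightOperand: Char): Boolean;
  begin
    Precedes := Downcase (LeftOperand) < Downcase (RightOperand)
  end;

begin
  writeln (Precedes ('d', 'F'));  { true: d comes before f in the alphabet }
  writeln ('d' < 'F');            { false under ASCII: 100 is not less than 70 }
  writeln (Precedes ('d', 'D'))   { false: the two inputs are the same letter }
end.
```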

This document is available on the World Wide Web as

http://www.math.grin.edu/~stone/courses/fundamentals/characters.html

created January 5, 1996
last revised July 24, 1996