Some scripts are alphabetic; they contain a small number of heavily used characters (called letters) that roughly represent the constituent sounds of utterances rather than their meanings. In alphabetic scripts, glyphs are sometimes grouped into cases -- letter variants that are conventionally used in different printed contexts. For example, capital letters form a case in English; H and h are depictions of the same letter, but the first is in upper case and the second in lower case.
The question of whether H and h are instances of the same character is somewhat arbitrary, and computer programmers have usually found it convenient to treat them as distinct characters.
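Early computer manufacturers each chose their own way of representing
characters as bit sequences. The capital letter A, for example, was
represented by 110001 on an IBM 1401 computer, by
000001 on a Control Data 6600, by 11000001 on an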
IBM 360, and so on. This made it troublesome to transfer character data
from one computer to another, since it was necessary to convert each
character from the source machine's encoding to the target machine's
encoding. The difficulty was compounded by the fact that different
manufacturers supported different characters; all provided the twenty-six
capital letters used in writing English and the ten digits used in writing
Arabic numerals, but there was much variation in the selection of
mathematical symbols, punctuation marks, etc.
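The American Standard Code for Information Interchange (ASCII) was designed to eliminate this chaos by fixing both a set of characters and the bit sequences that represent them. Besides ninety-four graphic characters (the letters in both cases, the digits, and an assortment of punctuation marks and symbols), ASCII reserves a bit sequence for a ``space'' character, usually depicted by a blank (a glyph that doesn't include any marks), and thirty-three bit sequences for so-called ``control characters,'' which have various implementation-dependent effects on printing and display devices.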
Each character or control character is represented by a sequence of exactly
seven bits, and every sequence of seven bit values represents a different
character or control character. It is common to read these bit sequences
as binary numerals -- for instance, to read 1000001, which is
the ASCII sequence representing the capital letter A, as sixty-five, which
is the value of that bit sequence considered as a binary numeral. In
Pascal, this is reflected in the fact that on a machine using ASCII
characters, the value of the expression Ord ('A') is 65.
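The reading of a bit sequence as a binary numeral can be sketched in Pascal. The following short program is an illustration only: it assumes a machine that uses ASCII, and the procedure name WriteSevenBits is invented for this example. It writes the seven bits of a character's code, leftmost bit first, followed by the value that Ord reports.

```pascal
program ShowBits;

{ Writes the seven-bit sequence for Ch, leftmost bit first.
  Assumption: Ord (Ch) is less than 128, as it is for every ASCII character. }
procedure WriteSevenBits (Ch: Char);
var
  Code, Weight: Integer;
begin
  Code := Ord (Ch);
  Weight := 64;  { two to the sixth power, the weight of the leftmost bit }
  while Weight >= 1 do
  begin
    if Code >= Weight then
    begin
      Write ('1');
      Code := Code - Weight
    end
    else
      Write ('0');
    Weight := Weight div 2
  end
end;

begin
  WriteSevenBits ('A');
  WriteLn (' = ', Ord ('A') : 1)
end.
```

On an ASCII machine this should write 1000001 = 65.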
When an ASCII character is stored in the memory of a computer, it is usually given one byte (eight bits) of storage. Since the character is completely fixed by its seven-bit representation, this leaves a bit free for some other purpose. Often the extra bit is simply turned off and ignored. Alternatively, it can be used as a way of checking that the hardware is storing and recovering characters accurately -- for instance, by turning on the extra bit if, and only if, the number of 1 bits in the ASCII representation of the character to be stored is even. This convention ensures that the byte in which the character is stored will always have an odd number of 1 bits in it; if such a byte is ever found to have an even number of 1 bits, an error has occurred. When the extra bit is used in this way, it is called a ``parity bit'' (because it is adjusted to ensure a specified parity -- evenness or oddness -- of the number of 1 bits in the byte), and the character is said to be stored as ``seven bits, odd parity.'' If the parity bit is set in the opposite way -- turned on if the number of 1 bits in the ASCII representation of the character is odd -- the character is stored as ``seven bits, even parity.'' The phrase ``seven bits, no parity'' means that the extra bit is always turned off and ignored.
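The odd-parity convention just described might be sketched as follows. This is a minimal illustration, not a description of any particular hardware; it assumes that the extra bit is the leftmost bit of the byte (weight 128), and the names CountOneBits, WithOddParity, and ParityOK are invented for the example.

```pascal
program ParityDemo;

{ Counts the 1 bits in the binary numeral for Code.
  Precondition: Code is not negative. }
function CountOneBits (Code: Integer): Integer;
var
  Count: Integer;
begin
  Count := 0;
  while Code > 0 do
  begin
    if Odd (Code) then
      Count := Count + 1;
    Code := Code div 2
  end;
  CountOneBits := Count
end;

{ Returns the value of the byte that stores the seven-bit code Code as
  seven bits, odd parity.  The extra bit (assumed here to have weight 128)
  is turned on exactly when Code contains an even number of 1 bits. }
function WithOddParity (Code: Integer): Integer;
begin
  if Odd (CountOneBits (Code)) then
    WithOddParity := Code
  else
    WithOddParity := Code + 128
end;

{ Reports whether a stored byte has an odd number of 1 bits. }
function ParityOK (StoredByte: Integer): Boolean;
begin
  ParityOK := Odd (CountOneBits (StoredByte))
end;

begin
  { A is 1000001, with two 1 bits, so the extra bit is turned on. }
  WriteLn (WithOddParity (65) : 1);
  { C is 1000011, with three 1 bits, so the extra bit stays off. }
  WriteLn (WithOddParity (67) : 1);
  { Changing a single bit of a stored byte makes the check fail. }
  if ParityOK (WithOddParity (65)) and not ParityOK (WithOddParity (65) - 1) then
    WriteLn ('the one-bit error was detected')
end.
```

Under these assumptions the program writes 193, then 67, then the message. For even parity, the test in WithOddParity would simply be reversed.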
Still another possibility is to use all the possible bit sequences that will fit into the eight-bit byte to represent characters, 256 of them in all. Typically, those in which the first bit is 0 are the usual ASCII characters, while those in which the first bit is 1 are extras -- characters from European languages other than English, the symbols for cents, pounds, and yen, subscripts and superscripts, printers' marks, common fractions, additional control characters. Various manufacturers have developed different eight-bit extensions of ASCII, thus restoring the chaos that ASCII was designed to eliminate. Characters stored in this way are ``eight bits, no parity.'' (In some machines, the hardware provides a ninth bit in each byte, which is used only for error detection through parity control, and on such machines it is possible to have ``eight bits, even parity'' or ``eight bits, odd parity'' representations.)
Unicode uses a sequence of sixteen bits for each character, allowing for two to the sixteenth power (that is, 65536) codes altogether. Not all of these codes have been assigned to characters; a number of bit sequences are still unassigned and may, in future versions of Unicode, be allocated for some of the numerous scripts that are not yet supported. The designers are actively working on Burmese, Cherokee, Cree, Ethiopic, Khmer, Maldivian, Mongolian, Moso, Pahawh Hmong, Rong, Sinhalese, Tai Lu, Tai Mau, Tibetan, Tifinagh, and Yi.
In a sense, ASCII is contained in Unicode: By adding nine 0 bits at the
beginning of any ASCII bit sequence, one obtains a Unicode bit sequence
that represents the same character. For instance, the Unicode
representation of a capital A is 0000000001000001.
A Unicode character stored in the memory of a computer occupies two adjacent eight-bit bytes. Depending on the manufacturer of the computer, the eight bits that are stored into the byte that has the smaller machine address may be the first eight bits or the last eight bits. Unicode requires that the convention that is adopted be the same one that is used for the storage of sixteen-bit unsigned integers on the same machine.
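The two storage conventions can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names FirstEightBits and LastEightBits are invented here, and the code assumes that Integer can hold the value being split (on an implementation with sixteen-bit integers, that means codes below 32768).

```pascal
program ByteOrderDemo;

{ Value of the first (leftmost) eight bits of a sixteen-bit code.
  Assumption: Code is not negative and fits in an Integer. }
function FirstEightBits (Code: Integer): Integer;
begin
  FirstEightBits := Code div 256
end;

{ Value of the last (rightmost) eight bits of a sixteen-bit code. }
function LastEightBits (Code: Integer): Integer;
begin
  LastEightBits := Code mod 256
end;

begin
  { Capital A is nine 0 bits followed by 1000001, that is, 65. }
  { First eight bits at the smaller address: }
  WriteLn (FirstEightBits (65) : 1, ' ', LastEightBits (65) : 1);
  { Last eight bits at the smaller address: }
  WriteLn (LastEightBits (65) : 1, ' ', FirstEightBits (65) : 1)
end.
```

The program writes 0 65 and then 65 0: the same two byte values, laid out in memory in opposite orders.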
Both ASCII and Unicode implicitly define an ordering of the characters: One character precedes another in the ordering if the number expressed by its bit sequence, considered as a binary numeral, is less than the number expressed by the other's. As long as only letters of the English alphabet in a single case are compared, both ASCII and Unicode match the traditional alphabetical order, but this is not true more generally: every capital letter, for example, precedes every lower-case letter, so that Z precedes a.
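A few comparisons show where the implicit ordering agrees with alphabetical order and where it does not. This small program is a sketch that assumes a Pascal implementation using ASCII; the messages are written only if the comparisons come out as ASCII dictates.

```pascal
program OrderingDemo;

begin
  { Within one case, code order and alphabetical order agree. }
  if ('A' < 'B') and ('y' < 'z') then
    WriteLn ('within a case, the ordering is alphabetical');
  { Across cases they do not: Z is 1011010 (90), a is 1100001 (97). }
  if 'Z' < 'a' then
    WriteLn ('Z precedes a in the implicit ordering');
  { The digits 0110000 through 0111001 precede all of the letters. }
  if '9' < 'A' then
    WriteLn ('9 precedes A in the implicit ordering')
end.
```

On an ASCII machine all three messages are written.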
It is convenient to have a name for each of the possible character values; the following names are derived from those applied to ASCII characters in the Unicode standard:
0000000 null
0000001 start-of-heading
0000010 start-of-text
0000011 end-of-text
0000100 end-of-transmission
0000101 enquiry
0000110 acknowledge
0000111 bell
0001000 backspace
0001001 horizontal-tabulation
0001010 line-feed
0001011 vertical-tabulation
0001100 form-feed
0001101 carriage-return
0001110 shift-out
0001111 shift-in
0010000 data-link-escape
0010001 device-control-one
0010010 device-control-two
0010011 device-control-three
0010100 device-control-four
0010101 negative-acknowledge
0010110 synchronous-idle
0010111 end-of-transmission-block
0011000 cancel
0011001 end-of-medium
0011010 substitute
0011011 escape
0011100 file-separator
0011101 group-separator
0011110 record-separator
0011111 unit-separator
0100000 space
0100001 exclamation-mark
0100010 quotation-mark
0100011 number-sign
0100100 dollar-sign
0100101 percent-sign
0100110 ampersand
0100111 apostrophe
0101000 left-parenthesis
0101001 right-parenthesis
0101010 asterisk
0101011 plus-sign
0101100 comma
0101101 hyphen-minus
0101110 full-stop
0101111 solidus
0110000 digit-zero
0110001 digit-one
0110010 digit-two
0110011 digit-three
0110100 digit-four
0110101 digit-five
0110110 digit-six
0110111 digit-seven
0111000 digit-eight
0111001 digit-nine
0111010 colon
0111011 semicolon
0111100 less-than-sign
0111101 equals-sign
0111110 greater-than-sign
0111111 question-mark
1000000 commercial-at
1000001 capital-letter-a
1000010 capital-letter-b
1000011 capital-letter-c
1000100 capital-letter-d
1000101 capital-letter-e
1000110 capital-letter-f
1000111 capital-letter-g
1001000 capital-letter-h
1001001 capital-letter-i
1001010 capital-letter-j
1001011 capital-letter-k
1001100 capital-letter-l
1001101 capital-letter-m
1001110 capital-letter-n
1001111 capital-letter-o
1010000 capital-letter-p
1010001 capital-letter-q
1010010 capital-letter-r
1010011 capital-letter-s
1010100 capital-letter-t
1010101 capital-letter-u
1010110 capital-letter-v
1010111 capital-letter-w
1011000 capital-letter-x
1011001 capital-letter-y
1011010 capital-letter-z
1011011 left-square-bracket
1011100 reverse-solidus
1011101 right-square-bracket
1011110 circumflex-accent
1011111 low-line
1100000 grave-accent
1100001 small-letter-a
1100010 small-letter-b
1100011 small-letter-c
1100100 small-letter-d
1100101 small-letter-e
1100110 small-letter-f
1100111 small-letter-g
1101000 small-letter-h
1101001 small-letter-i
1101010 small-letter-j
1101011 small-letter-k
1101100 small-letter-l
1101101 small-letter-m
1101110 small-letter-n
1101111 small-letter-o
1110000 small-letter-p
1110001 small-letter-q
1110010 small-letter-r
1110011 small-letter-s
1110100 small-letter-t
1110101 small-letter-u
1110110 small-letter-v
1110111 small-letter-w
1111000 small-letter-x
1111001 small-letter-y
1111010 small-letter-z
1111011 left-curly-bracket
1111100 vertical-line
1111101 right-curly-bracket
1111110 tilde
1111111 delete
Here are the operations that I propose for characters as an abstract data type:
size-of-character-set
Inputs: none.
Output: result, a natural number.
Preconditions: none.
Postcondition: result is 128, the number of characters in the
seven-bit ASCII character set.
equal
Inputs: left-operand and right-operand, both
characters.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if the two inputs are the
same character (specifically, if their character codes are identical),
false if they are different characters.
alphabetic
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a
letter, false if it is not.
uppercase
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is an
upper-case letter, false if it is not.
lowercase
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a
lower-case letter, false if it is not.
numeric
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a
decimal digit, false if it is not.
control
Input: ch, a character.
Output: result, a Boolean.
Preconditions: none.
Postcondition: result is true if ch is a
control character, false if it is not.
precedes
Inputs: left-operand and right-operand, both
characters.
Output: result, a Boolean.
Precondition: Both inputs are alphabetic.
Postcondition: result is true if the first input comes
before the second in the alphabet, false if it comes after or if
the two letters are the same. Case is disregarded (so that, for example,
d precedes F).
upcase
Input: ch, a character.
Output: result, a character.
Preconditions: none.
Postcondition: If ch is a lower-case letter,
result is the upper-case version of the same letter;
otherwise, result is ch.
downcase
Input: ch, a character.
Output: result, a character.
Preconditions: none.
Postcondition: If ch is an upper-case letter,
result is the lower-case version of the same letter;
otherwise, result is ch.
encode
Input: ch, a character.
Output: code, a natural number.
Preconditions: none.
Postcondition: When the bit sequence that represents ch is
interpreted as a binary numeral, its value is code.
decode
Input: code, a natural number.
Output: ch, a character.
Preconditions: code is less than the value returned by
size-of-character-set.
Postcondition: When the bit sequence that represents ch is
interpreted as a binary numeral, its value is code.
read
Input: source, a data source (e.g., a file, the keyboard, a
device).
Output: legend, a character.
Preconditions: source can supply another character on
demand.
Postcondition: A character has been extracted from source, and
legend is that character.
peek
Input: source, a data source (e.g., a file, the keyboard, a
device).
Output: legend, a character.
Preconditions: source can supply another character on
demand.
Postcondition: legend is the character that
source will supply next on demand.
write
Inputs: target, a data sink (e.g., a file, a window, a
device), and scribend, a character.
Outputs: none.
Preconditions: none.
Postcondition: The character scribend has been appended to
target.
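As an illustration of how the encode and decode specifications fit together, here is one plausible rendering in Pascal. It is a sketch rather than a definitive implementation: it assumes an ASCII machine and simply wraps the built-in Ord and Chr functions under the names Encode and Decode, which are chosen here to match the specification.

```pascal
program EncodeDecodeDemo;

const
  SizeOfCharacterSet = 128;

var
  Code: Integer;
  AllMatch: Boolean;

{ Postcondition: the result is the value of the bit sequence for Ch. }
function Encode (Ch: Char): Integer;
begin
  Encode := Ord (Ch)
end;

{ Precondition: Code is not negative and is less than SizeOfCharacterSet. }
function Decode (Code: Integer): Char;
begin
  Decode := Chr (Code)
end;

begin
  { Every legal code should survive a trip through Decode and Encode. }
  AllMatch := True;
  for Code := 0 to SizeOfCharacterSet - 1 do
    if Encode (Decode (Code)) <> Code then
      AllMatch := False;
  if AllMatch then
    WriteLn ('every code survives decoding and re-encoding')
end.
```

The identical postconditions of encode and decode are what make this round trip succeed.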
The Pascal standard places only a few requirements on the
Char data type. It presupposes that the ten digits used in
Arabic numerals are characters that occupy adjacent positions in the
ordering of the characters, starting with 0 and continuing in ascending
order to 9. It presupposes that the apostrophe is a character. It states
that if there are representations of the capital letters A through Z, they
must be arranged in the traditional alphabetical order but need not be
adjacent; this accommodates yet another code, the Extended Binary Coded
Decimal Interchange Code (EBCDIC), that is still used on some IBM mainframe
computers and features ``gaps'' between the representations of I and J and
between R and S. Pascal imposes the same requirement on the
representations of the lower-case letters a through z, if they are present,
but does not posit any particular relationship between the representations
of upper- and lower-case letters or between letters and digits. Both ASCII
and Unicode are straightforwardly consistent with the Pascal standard.
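These guarantees are enough to write a little character code that does not depend on ASCII. The sketch below is an illustration under stated assumptions: the names DigitValue and IsCapital are invented here, and IsCapital relies on the capitals being adjacent within each of the three runs A to I, J to R, and S to Z, which holds in both ASCII and EBCDIC but is not promised by the Pascal standard.

```pascal
program PortableTests;

{ Returns the number that the decimal digit Ch stands for.  This relies
  only on the digits being adjacent and in ascending order.
  Precondition: Ch is one of the characters '0' through '9'. }
function DigitValue (Ch: Char): Integer;
begin
  DigitValue := Ord (Ch) - Ord ('0')
end;

{ Tests for a capital letter without assuming that all twenty-six capitals
  are adjacent.  The three subranges step over the EBCDIC gaps between
  I and J and between R and S. }
function IsCapital (Ch: Char): Boolean;
begin
  IsCapital := Ch in ['A'..'I', 'J'..'R', 'S'..'Z']
end;

begin
  WriteLn (DigitValue ('7') : 1);
  if IsCapital ('Q') and not IsCapital ('q') then
    WriteLn ('Q is a capital letter; q is not')
end.
```

The program writes 7 and then the message.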
Pascal provides a total of twelve built-in operations involving
Char values: the six comparison operations (=,
<, >, <=,
>=, <>), the Pred and
Succ functions, input and output procedures, and the transfer
functions Ord and Chr that correlate characters
with the integer values of their bit-sequence representations. In general,
these built-in operations do their work by performing arithmetic on those
integer values. This implies that character comparisons are
case-sensitive: If an implementation of Pascal supports both
upper- and lower-case letters, they will be counted as distinct.
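A short program can exercise several of these built-in operations. It is offered as a sketch and assumes an implementation that uses ASCII and supports both cases of letters.

```pascal
program BuiltInsDemo;

begin
  WriteLn (Succ ('A'));                  { the character after A, namely B }
  WriteLn (Pred ('z'));                  { the character before z, namely y }
  WriteLn (Chr (Ord ('A') + 2));         { arithmetic on the code: 65 + 2 gives C }
  WriteLn (Ord ('a') - Ord ('A') : 1);   { 97 - 65 = 32 in ASCII }
  if 'a' <> 'A' then
    WriteLn ('a and A are distinct characters')
end.
```

On an ASCII machine the program writes B, y, C, and 32 on separate lines, followed by the message.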
Standard Pascal provides names, in the form of literal constants, for all
of the graphic characters, but not for any of the control characters. It
does not provide the size-of-character-set operation in any form,
nor any of the predicates alphabetic, uppercase,
lowercase, numeric, or control, nor the
case-insensitive comparison operation precedes, nor the
case-conversion operations upcase and downcase. (The
peek operation is provided, though, in the form of the
^ operation on files; Source^ is the character
that would result from the application of peek to
Source.)
{ This file contains a collection of character functions to complete the
built-in char type in Pascal and to bring it into conformity with an
abstract data type specification for characters developed in the handout
http://www.math.grin.edu/~stone/courses/fundamentals/characters.html
Programmer: John Stone, Grinnell College.
Date of this version: July 24, 1996. }
const
  SizeOfCharacterSet = 128;
    { the number of characters in the seven-bit ASCII character set }

{ The first group of functions simply construct and return characters for
  which standard Pascal provides no literals.  It would be better if one
  could define these as constants, but there is no way to do this in
  standard Pascal (and, in practice, no way to do it portably in
  non-standard Pascal). }

function Null: Char;
begin
  Null := Chr (0);
end;

function StartOfHeading: Char;
begin
  StartOfHeading := Chr (1);
end;

function StartOfText: Char;
begin
  StartOfText := Chr (2);
end;

function EndOfText: Char;
begin
  EndOfText := Chr (3);
end;

function EndOfTransmission: Char;
begin
  EndOfTransmission := Chr (4);
end;

function Enquiry: Char;
begin
  Enquiry := Chr (5);
end;

function Acknowledge: Char;
begin
  Acknowledge := Chr (6);
end;

function Bell: Char;
begin
  Bell := Chr (7);
end;

function Backspace: Char;
begin
  Backspace := Chr (8);
end;

function HorizontalTabulation: Char;
begin
  HorizontalTabulation := Chr (9);
end;

function LineFeed: Char;
begin
  LineFeed := Chr (10);
end;

function VerticalTabulation: Char;
begin
  VerticalTabulation := Chr (11);
end;

function FormFeed: Char;
begin
  FormFeed := Chr (12);
end;

function CarriageReturn: Char;
begin
  CarriageReturn := Chr (13);
end;

function ShiftOut: Char;
begin
  ShiftOut := Chr (14);
end;

function ShiftIn: Char;
begin
  ShiftIn := Chr (15);
end;

function DataLinkEscape: Char;
begin
  DataLinkEscape := Chr (16);
end;

function DeviceControlOne: Char;
begin
  DeviceControlOne := Chr (17);
end;

function DeviceControlTwo: Char;
begin
  DeviceControlTwo := Chr (18);
end;

function DeviceControlThree: Char;
begin
  DeviceControlThree := Chr (19);
end;

function DeviceControlFour: Char;
begin
  DeviceControlFour := Chr (20);
end;

function NegativeAcknowledge: Char;
begin
  NegativeAcknowledge := Chr (21);
end;

function SynchronousIdle: Char;
begin
  SynchronousIdle := Chr (22);
end;

function EndOfTransmissionBlock: Char;
begin
  EndOfTransmissionBlock := Chr (23);
end;

function Cancel: Char;
begin
  Cancel := Chr (24);
end;

function EndOfMedium: Char;
begin
  EndOfMedium := Chr (25);
end;

function Substitute: Char;
begin
  Substitute := Chr (26);
end;

function Escape: Char;
begin
  Escape := Chr (27);
end;

function FileSeparator: Char;
begin
  FileSeparator := Chr (28);
end;

function GroupSeparator: Char;
begin
  GroupSeparator := Chr (29);
end;

function RecordSeparator: Char;
begin
  RecordSeparator := Chr (30);
end;

function UnitSeparator: Char;
begin
  UnitSeparator := Chr (31);
end;

function Delete: Char;
begin
  Delete := Chr (127);
end;

{ The next five functions test attributes of characters. }

function Alphabetic (Ch: Char): Boolean;
begin
  Alphabetic := (('A' <= Ch) and (Ch <= 'Z')) or
                (('a' <= Ch) and (Ch <= 'z'))
end;

function Uppercase (Ch: Char): Boolean;
begin
  Uppercase := ('A' <= Ch) and (Ch <= 'Z')
end;

function Lowercase (Ch: Char): Boolean;
begin
  Lowercase := ('a' <= Ch) and (Ch <= 'z')
end;

function Numeric (Ch: Char): Boolean;
begin
  Numeric := ('0' <= Ch) and (Ch <= '9')
end;

function Control (Ch: Char): Boolean;
begin
  Control := (Ch < ' ') or (Ch = Delete)
end;

{ The Upcase and Downcase functions yield upper-case and lower-case
  versions of letters, returning other characters unchanged. }

function Upcase (Ch: Char): Char;
const
  CaseSeparation = 32;
    { the distance between upper-case letters and their lower-case
      counterparts in the ASCII character set }
begin
  if Lowercase (Ch) then
    Upcase := Chr (Ord (Ch) - CaseSeparation)
  else
    Upcase := Ch
end;

function Downcase (Ch: Char): Char;
const
  CaseSeparation = 32;
    { the distance between upper-case letters and their lower-case
      counterparts in the ASCII character set }
begin
  if Uppercase (Ch) then
    Downcase := Chr (Ord (Ch) + CaseSeparation)
  else
    Downcase := Ch
end;

{ The Precedes function determines whether one letter precedes another
  alphabetically, ignoring case. }

function Precedes (LeftOperand, RightOperand: Char): Boolean;
begin
  { Assert (Alphabetic (LeftOperand) and Alphabetic (RightOperand)); }
  Precedes := Downcase (LeftOperand) < Downcase (RightOperand)
end;
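
To see the case-insensitive comparison at work, one might try a small test program such as the following. It is only a sketch: it assumes an ASCII machine, and so that it can be compiled on its own it repeats the Uppercase, Downcase, and Precedes functions from the listing above.

```pascal
program PrecedesDemo;

function Uppercase (Ch: Char): Boolean;
begin
  Uppercase := ('A' <= Ch) and (Ch <= 'Z')
end;

function Downcase (Ch: Char): Char;
const
  CaseSeparation = 32;
begin
  if Uppercase (Ch) then
    Downcase := Chr (Ord (Ch) + CaseSeparation)
  else
    Downcase := Ch
end;

function Precedes (LeftOperand, RightOperand: Char): Boolean;
begin
  Precedes := Downcase (LeftOperand) < Downcase (RightOperand)
end;

begin
  { d is 100 and F is 70 in ASCII, so the built-in comparison is false, }
  { but Precedes compares d (100) with f (102) and so yields true. }
  if Precedes ('d', 'F') then
    WriteLn ('d precedes F when case is disregarded');
  if not ('d' < 'F') then
    WriteLn ('the built-in comparison d < F is false')
end.
```

On an ASCII machine both messages are written.
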
This document is available on the World Wide Web as
http://www.math.grin.edu/~stone/courses/fundamentals/characters.html