text file formats Flashcards
ASCII
American Standard Code for Information Interchange - 7-bit encoding, each character stored in 1 byte - 65-90 are capitals, add 32 to a capital to get its lower case (97-122) - 0-31 (and 127) are control characters - English characters only
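The "+ 32" rule can be checked directly. A minimal sketch, where `to_lower_ascii` is a hypothetical helper name used only for illustration:

```python
def to_lower_ascii(ch: str) -> str:
    """Lower-case a single ASCII capital by adding 32 to its code."""
    code = ord(ch)
    if 65 <= code <= 90:          # 'A'..'Z'
        return chr(code + 32)     # 'a'..'z' sit 32 places higher
    return ch                     # anything else is left unchanged


print(ord("A"), ord("a"))         # 65 97
print(to_lower_ascii("Q"))        # q
print(repr(chr(10)))              # '\n', one of the control characters below 32
```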
new line standards
UNIX / Linux = LF
Windows = CR + LF
Mac = CR (up to Mac OS 9), then LF (UNIX-based from Mac OS X)
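A short sketch of how the three conventions might be told apart and converted to LF. The function names are hypothetical, and the detection only reports the first style it finds, so mixed files are not handled:

```python
def guess_newline_style(data: bytes) -> str:
    """Return 'CRLF', 'CR', 'LF' or 'none' for a chunk of raw bytes."""
    if b"\r\n" in data:           # check the two-byte form first
        return "CRLF"
    if b"\r" in data:
        return "CR"
    if b"\n" in data:
        return "LF"
    return "none"


def normalise_newlines(data: bytes) -> bytes:
    """Convert Windows and classic Mac line endings to UNIX LF."""
    # CR LF must be replaced before lone CR, otherwise each
    # Windows line break would turn into two LF characters.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


print(guess_newline_style(b"one\r\ntwo"))
print(normalise_newlines(b"one\r\ntwo\rthree\nfour"))
```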
ISO-8859 encoding
each language family has its own code page - 0-127 identical to ASCII, 128-255 used for characters beyond English
+ 8 bit
+ simple
- one language family at a time
- only languages with fewer than 128 non-ASCII characters
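The "one language family at a time" limit shows up when the same byte is decoded with different ISO-8859 code pages. A small sketch using Python's built-in codecs; the byte 0xE4 is just an arbitrary example value:

```python
raw = b"\xe4"                                # one byte above 127

western = raw.decode("iso-8859-1")           # Western European code page
cyrillic = raw.decode("iso-8859-5")          # Cyrillic code page
greek = raw.decode("iso-8859-7")             # Greek code page

# Same byte, three different characters.
print(western, cyrillic, greek)

# Bytes 0-127 decode identically everywhere (plain ASCII).
same = {b"A".decode(cp) for cp in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
print(same)
```

A file therefore has to be read with the same code page it was written in, and text mixing Greek and Cyrillic cannot be stored in a single ISO-8859 file.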
Unicode
does not define a byte encoding itself, just gives each character a code point (unique numerical identity) - 0-255 correspond to ISO-8859-1 (Western Europe)
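In Python, `ord` and `chr` convert between a character and its code point, which makes the overlap with ISO-8859-1 easy to check. A minimal sketch:

```python
# A code point is just a number; no bytes are involved yet.
print(ord("A"))                   # 65
print(hex(ord("\u20ac")))         # 0x20ac (euro sign), well above 255

# For 0-255 the code point equals the ISO-8859-1 byte value.
matches = all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))
print(matches)                    # True
```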
encoding Unicode at byte level
UCS-2 - uses 2 bytes to represent each code point
- not backward compatible with ASCII
- Unicode has > 65k code points, so 2 bytes cannot cover them all
UTF-8 - 1 byte if ASCII, multiple bytes if not
+ backward compatible with ASCII
+ covers all of Unicode
+ standard
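The compatibility difference can be seen by encoding the same ASCII letter both ways. Python has no dedicated UCS-2 codec, so this sketch assumes `utf-16-be` as a stand-in (it produces the same 2 bytes for code points up to 65,535); `fits_in_ucs2` is a hypothetical helper:

```python
ascii_text = "A"

as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")          # identical to the ASCII byte
as_two_byte = ascii_text.encode("utf-16-be")  # stand-in for UCS-2: 2 bytes


def fits_in_ucs2(ch: str) -> bool:
    """True if the code point fits in 2 bytes (0..65535)."""
    return ord(ch) <= 0xFFFF


print(as_ascii, as_utf8, as_two_byte)
print(fits_in_ucs2("\u20ac"))                 # euro sign
print(fits_in_ucs2("\U0001F600"))             # emoji beyond 65k
print(len("\U0001F600".encode("utf-8")))      # UTF-8 still handles it
```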
UTF-8
up to 4 bytes, depending on how many bits the code point needs (X = code point bits)
0-7 bits - 0XXXXXXX
8-11 bits - 110XXXXX 10XXXXXX
12-16 bits - 1110XXXX 10XXXXXX 10XXXXXX
17-21 bits - 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
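The four bit patterns above can be turned into a hand-written encoder. This is a sketch for illustration only (`utf8_encode` is a hypothetical name); real code would simply call `str.encode("utf-8")`:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point using the 1-4 byte UTF-8 patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:                             # up to 7 bits: 0XXXXXXX
        return bytes([cp])
    if cp < 0x800:                            # up to 11 bits: 110XXXXX 10XXXXXX
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                          # up to 16 bits: 1110XXXX + 2 more
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),          # up to 21 bits: 11110XXX + 3 more
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(hex(cp), utf8_encode(cp).hex(" "))
```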
recognising the character encoding
guessing from statistical analysis - not recommended
Byte Order Mark - string at beginning of file indicating encoding type - not recommended
HTTP header - header item specifies encoding type - requires web server to be configured
in the HTML file - meta tag - use if no control over the web server
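Three of these hints (HTTP header, Byte Order Mark, meta tag) can be read mechanically. A rough sketch: `sniff_encoding` is a hypothetical helper, the order in which it checks the hints is an assumption made for illustration, and the regular expressions are far looser than a real HTML parser:

```python
import codecs
import re
from typing import Optional

# Byte Order Marks for a few encodings (UTF-32 is left out of this sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes, content_type: Optional[str] = None) -> Optional[str]:
    """Return a declared encoding name, or None if nothing is declared."""
    # 1. HTTP header, e.g. "text/html; charset=UTF-8"
    if content_type:
        m = re.search(r'charset\s*=\s*"?([A-Za-z0-9_-]+)', content_type, re.IGNORECASE)
        if m:
            return m.group(1).lower()
    # 2. Byte Order Mark at the very start of the file
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 3. Meta tag near the top of an HTML file
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
                  data[:1024], re.IGNORECASE)
    if m:
        return m.group(1).decode("ascii").lower()
    return None                               # would have to fall back to guessing


print(sniff_encoding(b"", "text/html; charset=UTF-8"))
print(sniff_encoding(b'<html><head><meta charset="iso-8859-1"></head>'))
```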
what is not encoded
font
font size
special formatting
colour