Representing Text Flashcards
What is a character set?
a list of characters and the codes used to represent each one
What is ASCII? How many characters are used?
a character set (american standard code for information interchange)
uses 7 bits for each character and there are 128 unique characters, 33 are control characters and the others were numbers, letters, and punctuation
What is the difference between ASCII and extended ASCII?
extended ascii uses all 8 bits instead of 7, can also represent lines, symbols, and letters with accents, there are several incompatbile versions
What are control characters? Which codes are they in ASCII?
control how text appears, do not appear as text (space, tab)
0-31 and 127 (first 32 and last one)
What are the codes for decimal digits? How can you find the code for the number 6?
start at 48, 0=48, 48-57
48 + 6 = 54
Where do the codes for uppercase letters start? What about lowercase? If you know the code for a lowercase letter how can you find upper case and vice versa?
start at 65-90
start at 97-122
they are seperated by 32 codes, lower case -32 = uppercase and uppercase +32 = lowercase
How can you find the code for K? What about k?
K is 11th letter, 65 + (11-1)= 75
k is 11th letter, 97 + (11-1)= 107
Is extended ASCII still limited?
yes, missing common symbols, as well as letters in different languages, can’t use multiple languages at the same time
What is Unicode? What is it compared to ASCII?
a character set that is a superset of ASCII
the first 128 characters correspond to the same ones in ASCII, it uses 16+ bits per character and can represent more than 1 million characters
How is Unicode organized?
divided into blocks or characters, each block has a theme
ex. arabic, hebrew, etc
What are emoji? where did they come from? Where are they located in unicode?
symbol typically appearning in text
japanese word for picture/letter
Miscellaneous Symbols and Pictographs, range 1F300-1F5FF and Emoticons, range 1F600-1F64F
What is data compression?
a reduction in the amount of
space needed to store a piece of data
What is compression ratio?
the size of the compressed data divided by the size of the original data
Is a compression ratio of 90% or 80% more?
80 b/c it means the original data was compressed by 20% while 90 means it was only compressed by 10%
What are the two types of data compression?
Lossless, when the data can be retrieved without losing any of the original information
Lossy, some information may be lost in the process of compaction
What is keyword encoding?
words that are frequently used in text are replaced with a character, characters that are used to encode cannot be used in the text
ex. as is represented as ~
What is run length encoding?
when a single character is repeated a lot of times in a long sequence, doesn’t usually occur in english text but it often occurs in large data streams
In run lenght encoding what are sequences of repeating character replaced with?
a sequence like this is replaced by a flag character (*), followed by the repeated character, and a digit to signifiy how many times the character is repeated
What is huffman encoding?
huffman codes use variable-length bit strings to represent each character, some characters are represented by 5 bits, 7 bits etc
What is one challenge that huffman encoding faced? What was the solution?
hard to know where one symbol ends and the other begins
prefix property: no code is a prefix of another code
What is a benefit of huffman encoding in terms of compression?
because it only uses a few bits for characters that appear often and reserves longer bit strings for characters that are less common the overall size of the document is smaller
Where did the idea for huffman encoding come from?
morse code, fewer dots and dashes for common letters
What is input/output/property in reference to huffman encoding?
input is symbols and their frequency
output is the binary code for each symbol
property is the optimal compression rate with prefix property
What are the steps in creating a huffman algorithm?
- sort the symbols in ascending order of frequency (least to most frequently used) place them in a sorted queue
- replace the symbols with the two smallest frequencies with a combined symbol, place in 2nd queue
- repeat step two until one remains in the queue
- result is binary tree
What is root of binary tree?
topmost node or spot on the tree
What is branch of binary tree?
connects root to node or node to node
What is node of binary tree?
intersection of binary tree
What is leaf of binary tree?
bottommost node, end of tree, symbol
How do you label the branches of binary tree? Can you label it another way?
left branches is 0 and right branches are 1
yes as long as one branch is 0 and the other is 1
How can you find the binary code for a symbol using a completed binary tree?
follow the path from the root of the tree to the leaf and the corresponding binary sequence is the code
How can you find the compressed bit length using binary tree?
the summation of (character code length x frequency count)
ex. c appears 3 times and the code is 000, (3 x 3), a appears 17 times and the code is 01, (2 x 17)
What is the least and most effective lossless encoding on their own?
least is keyword encoding
most if huffman encoding