Representing Text Flashcards

1
Q

What is a character set?

A

a list of characters and the codes used to represent each one

2
Q

What is ASCII? How many characters are used?

A

a character set (American Standard Code for Information Interchange)

uses 7 bits for each character, giving 128 unique characters; 33 are control characters and the others are digits, letters, and punctuation

3
Q

What is the difference between ASCII and extended ASCII?

A

extended ASCII uses all 8 bits instead of 7, so it can also represent line-drawing characters, extra symbols, and letters with accents; there are several incompatible versions

4
Q

What are control characters? Which codes are they in ASCII?

A

characters that control how text is laid out or handled and do not appear as visible text (ex. tab, line feed, carriage return)

0-31 and 127 (first 32 and last one)

5
Q

What are the codes for decimal digits? How can you find the code for the number 6?

A

start at 48: the digit 0 is code 48, so the digits 0-9 are codes 48-57

48 + 6 = 54
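
The arithmetic above can be checked with a minimal Python sketch. The helper name `digit_code` is hypothetical; it relies only on Python's built-in `ord` and `chr`.

```python
def digit_code(d):
    """Return the ASCII code of a single decimal digit (0-9)."""
    if not 0 <= d <= 9:
        raise ValueError("expected a single decimal digit")
    return ord("0") + d  # ord("0") is 48, so this is 48 + d


print(digit_code(6))       # 54
print(chr(digit_code(6)))  # 6
```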

6
Q

Where do the codes for uppercase letters start? What about lowercase? If you know the code for a lowercase letter how can you find upper case and vice versa?

A

uppercase letters start at 65 and run to 90

lowercase letters start at 97 and run to 122

they are separated by 32 codes: lowercase - 32 = uppercase and uppercase + 32 = lowercase
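
A minimal Python sketch of the 32-code offset between the two cases. The names `CASE_OFFSET`, `to_upper_code`, and `to_lower_code` are hypothetical and introduced only for illustration.

```python
CASE_OFFSET = ord("a") - ord("A")  # 97 - 65 = 32


def to_upper_code(code):
    """Convert the code of a lowercase ASCII letter to its uppercase code."""
    if not ord("a") <= code <= ord("z"):
        raise ValueError("not a lowercase ASCII letter code")
    return code - CASE_OFFSET


def to_lower_code(code):
    """Convert the code of an uppercase ASCII letter to its lowercase code."""
    if not ord("A") <= code <= ord("Z"):
        raise ValueError("not an uppercase ASCII letter code")
    return code + CASE_OFFSET


print(to_upper_code(107))  # 75
print(to_lower_code(75))   # 107
```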

7
Q

How can you find the code for K? What about k?

A

K is 11th letter, 65 + (11-1)= 75

k is 11th letter, 97 + (11-1)= 107
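
The same calculation works for any letter once its position in the alphabet is known. This is a small Python sketch; the function name `letter_code` and its `uppercase` parameter are assumptions made for this example.

```python
def letter_code(position, uppercase=True):
    """Return the ASCII code of the letter at a 1-based alphabet position."""
    if not 1 <= position <= 26:
        raise ValueError("position must be between 1 and 26")
    start = 65 if uppercase else 97  # codes of 'A' and 'a'
    return start + (position - 1)


print(letter_code(11))                   # 75
print(letter_code(11, uppercase=False))  # 107
```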

8
Q

Is extended ASCII still limited?

A

yes: it is still missing many common symbols and the letters of many languages, and one text can't use multiple languages at the same time

9
Q

What is Unicode? What is it compared to ASCII?

A

a character set that is a superset of ASCII

the first 128 characters correspond to the same ones in ASCII; it uses 16 or more bits per character (as in the UTF-16 encoding) and has room for more than 1 million characters
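
A short Python sketch can check these claims. It assumes a standard CPython 3 build, where strings are Unicode and `sys.maxunicode` is the highest code point; the UTF-16 encoding is used here only to show the bits-per-character point.

```python
import sys

# The first 128 Unicode code points decode identically to 7-bit ASCII.
ascii_text = bytes(range(128)).decode("ascii")
unicode_text = "".join(chr(cp) for cp in range(128))
print(ascii_text == unicode_text)  # True

# In UTF-16, a basic character takes 16 bits and an emoji takes 32.
print(len("A".encode("utf-16-le")) * 8)            # 16
print(len(chr(0x1F600).encode("utf-16-le")) * 8)   # 32

# Size of the Unicode code space (more than 1 million code points).
print(sys.maxunicode + 1)  # 1114112
```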

10
Q

How is Unicode organized?

A

divided into blocks of characters, each block has a theme

ex. Arabic, Hebrew, etc.

11
Q

What are emoji? Where did they come from? Where are they located in Unicode?

A

symbols that typically appear in text

the name comes from the Japanese words for picture (e) and character (moji)

Miscellaneous Symbols and Pictographs, range 1F300-1F5FF and Emoticons, range 1F600-1F64F
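
As an illustration, the two block ranges above can be turned into a small lookup in Python. The table `BLOCKS` and the function `emoji_block` are hypothetical names, and only the two blocks from this card are included, so emoji in other Unicode blocks return `None`.

```python
# Block ranges taken from the card above (inclusive code points).
BLOCKS = {
    "Miscellaneous Symbols and Pictographs": (0x1F300, 0x1F5FF),
    "Emoticons": (0x1F600, 0x1F64F),
}


def emoji_block(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= cp <= high:
            return name
    return None


print(emoji_block(chr(0x1F600)))  # Emoticons
print(emoji_block(chr(0x1F30D)))  # Miscellaneous Symbols and Pictographs
print(emoji_block("A"))           # None
```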

12
Q

What is data compression?

A

a reduction in the amount of space needed to store a piece of data

13
Q

What is compression ratio?

A

the size of the compressed data divided by the size of the original data

14
Q

Which means more compression, a compression ratio of 90% or 80%?

A

80%, because the compressed data is 80% of the original size (compressed by 20%), while 90% means it was only compressed by 10%
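
A minimal Python sketch of the ratio, using the definition of compressed size divided by original size. The function name `compression_ratio` and the sizes of 80, 90, and 100 units are made up for illustration.

```python
def compression_ratio(compressed_size, original_size):
    """Compressed size divided by original size (smaller means more compression)."""
    if original_size <= 0:
        raise ValueError("original size must be positive")
    return compressed_size / original_size


# Hypothetical sizes: a 100-unit original compressed to 80 or to 90 units.
print(compression_ratio(80, 100))  # 0.8
print(compression_ratio(90, 100))  # 0.9
print(compression_ratio(80, 100) < compression_ratio(90, 100))  # True
```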

15
Q

What are the two types of data compression?

A

Lossless: the data can be retrieved without losing any of the original information

Lossy: some information may be lost in the process of compression

16
Q

What is keyword encoding?

A

words that are frequently used in the text are each replaced with a single character; the characters used as codes cannot appear in the original text

ex. the word "as" is represented as ~
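
Keyword encoding can be sketched in Python as follows. Only the "as" to "~" pair comes from the card; the other table entries and the names `KEYWORDS`, `keyword_encode`, and `keyword_decode` are assumptions for this example. Matching is case-sensitive and whole-word only.

```python
import re

# Hypothetical keyword table; only "as" -> "~" is from the card.
KEYWORDS = {"as": "~", "the": "^", "and": "+"}


def keyword_encode(text, table=KEYWORDS):
    """Replace each whole-word keyword with its single-character code."""
    for symbol in table.values():
        if symbol in text:
            # Code characters must not occur in the text being encoded.
            raise ValueError(f"text already contains code character {symbol!r}")
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def keyword_decode(encoded, table=KEYWORDS):
    """Expand each code character back into its keyword."""
    reverse = {symbol: word for word, symbol in table.items()}
    return "".join(reverse.get(ch, ch) for ch in encoded)


encoded = keyword_encode("as the cat and dog")
print(encoded)                  # ~ ^ cat + dog
print(keyword_decode(encoded))  # as the cat and dog
```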

17
Q

What is run-length encoding?

A

a long run of a single repeated character is replaced with a short code; such runs don't usually occur in English text, but they often occur in large data streams

18
Q

In run-length encoding, what are sequences of repeating characters replaced with?

A

the sequence is replaced by a flag character (*), followed by the repeated character and a digit that signifies how many times the character is repeated
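
The flag, character, digit scheme can be sketched in Python. Several details here are assumptions not stated on the card: the count is a single digit, so runs are capped at 9; runs shorter than 4 are left alone because the code itself takes 3 characters; and the input must not contain the flag character. The names `FLAG`, `rle_encode`, and `rle_decode` are hypothetical.

```python
FLAG = "*"


def rle_encode(text, min_run=4):
    """Replace runs of min_run to 9 repeated characters with FLAG + char + digit."""
    if FLAG in text:
        raise ValueError("input must not contain the flag character")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        # Count the run, capped at 9 so the count fits in one digit.
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if run >= min_run:
            out.append(f"{FLAG}{ch}{run}")
        else:
            out.append(ch * run)
        i += run
    return "".join(out)


def rle_decode(encoded):
    """Expand each FLAG + char + digit triple back into the repeated run."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == FLAG:
            out.append(encoded[i + 1] * int(encoded[i + 2]))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print(rle_encode("AAAAAAAB"))  # *A7B
print(rle_decode("*A7B"))      # AAAAAAAB
```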

19
Q

What is Huffman encoding?

A

Huffman codes use variable-length bit strings to represent each character; for example, some characters are represented by 5 bits and others by 7 bits

20
Q

What is one challenge that Huffman encoding faced? What was the solution?

A

hard to know where one symbol ends and the other begins

prefix property: no code is a prefix of another code
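
The prefix property is what lets a decoder read a bit string left to right without separators. A minimal Python sketch follows; the code table `CODES` is a made-up example (not from a real Huffman tree), and the function names are hypothetical.

```python
# Hypothetical code table that satisfies the prefix property.
CODES = {"e": "0", "t": "10", "a": "110", "q": "111"}


def has_prefix_property(codes):
    """True if no code is a prefix of a different code."""
    values = list(codes.values())
    return not any(a != b and b.startswith(a) for a in values for b in values)


def decode(bits, codes):
    """Decode a bit string by emitting a symbol as soon as a full code is read."""
    reverse = {code: symbol for symbol, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code")
    return "".join(out)


print(has_prefix_property(CODES))                 # True
print(has_prefix_property({"a": "0", "b": "01"}))  # False
print(decode("010110", CODES))                    # eta
```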

21
Q

What is a benefit of Huffman encoding in terms of compression?

A

because it uses only a few bits for characters that appear often and reserves longer bit strings for characters that are less common, the overall size of the document is smaller

22
Q

Where did the idea for Huffman encoding come from?

A

Morse code, which uses fewer dots and dashes for common letters

23
Q

What are the input, output, and property of Huffman encoding?

A

input is symbols and their frequency

output is the binary code for each symbol

property is the optimal compression rate with prefix property

24
Q

What are the steps of the Huffman algorithm?

A
  1. sort the symbols in ascending order of frequency (least to most frequently used) and place them in a sorted queue
  2. remove the two symbols with the smallest frequencies and replace them with a combined symbol whose frequency is their sum; place it in a second queue
  3. repeat step 2 until only one symbol remains
  4. the result is a binary tree
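
The steps above can be sketched in Python. Note that this sketch swaps the two sorted queues for a single priority queue from the standard `heapq` module, which also always combines the two smallest frequencies; ties may be broken differently than by hand, giving different codes with the same total length. The function name `huffman_codes` and the sample frequencies are assumptions, and symbols are assumed to be strings.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman tree from {symbol: frequency} and return {symbol: code}."""
    if not freqs:
        return {}
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly combine the two smallest frequencies into one node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        # Internal nodes are (left, right) tuples; leaves are symbols.
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # left branch is 0
            walk(node[1], prefix + "1")  # right branch is 1
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# Hypothetical frequencies for three symbols.
print(huffman_codes({"a": 17, "b": 5, "c": 3}))
# {'c': '00', 'b': '01', 'a': '1'}
```

The most frequent symbol ends up nearest the root and so gets the shortest code.
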
25
Q

What is the root of a binary tree?

A

the topmost node of the tree

26
Q

What is a branch of a binary tree?

A

a connection from the root to a node or from one node to another

27
Q

What is a node of a binary tree?

A

a point in the tree where branches meet

28
Q

What is a leaf of a binary tree?

A

a bottommost node at the end of a path through the tree; in a Huffman tree each leaf holds a symbol

29
Q

How do you label the branches of a binary tree? Can you label them another way?

A

left branches are labeled 0 and right branches are labeled 1

yes, as long as one branch from each node is 0 and the other is 1

30
Q

How can you find the binary code for a symbol using a completed binary tree?

A

follow the path from the root of the tree to the symbol's leaf; the sequence of branch labels along the path is the code

31
Q

How can you find the compressed bit length using a binary tree?

A

sum (code length x frequency count) over all characters

ex. c appears 3 times and its code is 000, giving (3 x 3); a appears 17 times and its code is 01, giving (2 x 17)

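
The summation can be written as a one-line Python function. The name `compressed_bits` is hypothetical; the codes and counts are the two characters from the example above, so the total covers only those two terms.

```python
def compressed_bits(codes, freqs):
    """Total bits = sum of (code length x frequency) over all symbols."""
    return sum(len(codes[sym]) * freqs[sym] for sym in freqs)


# The two characters from the card's example: (3 x 3) + (2 x 17).
print(compressed_bits({"c": "000", "a": "01"}, {"c": 3, "a": 17}))  # 43
```
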
32
Q

What are the least and most effective lossless encodings on their own?

A

least is keyword encoding, most is Huffman encoding