Representing Text Flashcards

1
Q

What is a character set?

A

a list of characters and the codes used to represent each one

2
Q

What is ASCII? How many characters are used?

A

a character set (American Standard Code for Information Interchange)

uses 7 bits for each character, giving 128 unique characters: 33 are control characters and the other 95 are digits, letters, punctuation, and the space

3
Q

What is the difference between ASCII and extended ASCII?

A

extended ASCII uses all 8 bits instead of 7 (256 codes instead of 128), so it can also represent lines, symbols, and letters with accents; there are several incompatible versions

4
Q

What are control characters? Which codes are they in ASCII?

A

they control how text is laid out or transmitted and do not appear as visible characters (tab, line feed, carriage return)

0-31 and 127 (first 32 and last one)

5
Q

What are the codes for decimal digits? How can you find the code for the number 6?

A

the digits 0-9 use codes 48-57, starting with 0 = 48

48 + 6 = 54
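
The same arithmetic can be checked in Python. This is a minimal sketch that assumes the built-in `ord` and `chr` functions; the helper name `digit_code` is made up for illustration.

```python
def digit_code(digit: int) -> int:
    """Return the ASCII code of a single decimal digit (0-9)."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return ord("0") + digit  # ord("0") is 48


print(digit_code(6))       # 54
print(chr(digit_code(6)))  # 6
```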

6
Q

Where do the codes for uppercase letters start? What about lowercase? If you know the code for a lowercase letter how can you find upper case and vice versa?

A

uppercase: 65-90 (A = 65)

lowercase: 97-122 (a = 97)

they are separated by 32 codes: lowercase - 32 = uppercase and uppercase + 32 = lowercase
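
A small sketch of the 32-code offset in Python. The function names `to_upper` and `to_lower` are hypothetical, and the code only handles the plain ASCII letters a-z and A-Z.

```python
CASE_OFFSET = ord("a") - ord("A")  # 97 - 65 = 32


def to_upper(ch: str) -> str:
    """Convert one lowercase ASCII letter to uppercase by subtracting 32."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - CASE_OFFSET)
    return ch  # anything else is left unchanged


def to_lower(ch: str) -> str:
    """Convert one uppercase ASCII letter to lowercase by adding 32."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + CASE_OFFSET)
    return ch


print(to_upper("k"), to_lower("K"))  # K k
```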

7
Q

How can you find the code for K? What about k?

A

K is 11th letter, 65 + (11-1)= 75

k is 11th letter, 97 + (11-1)= 107
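
The position formula can be written as a short Python function. This is only a sketch; `letter_code` and its parameters are illustrative names, not part of any library.

```python
def letter_code(position: int, uppercase: bool = True) -> int:
    """Return the ASCII code of the letter at a 1-based alphabet position."""
    if not 1 <= position <= 26:
        raise ValueError("position must be between 1 and 26")
    base = 65 if uppercase else 97  # codes of 'A' and 'a'
    return base + (position - 1)


print(letter_code(11))                   # 75
print(letter_code(11, uppercase=False))  # 107
```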

8
Q

Is extended ASCII still limited?

A

yes, it is still missing common symbols and letters from many languages, and a document can't use multiple languages at the same time

9
Q

What is Unicode? What is it compared to ASCII?

A

a character set that is a superset of ASCII

the first 128 characters correspond to the same ones in ASCII; encodings such as UTF-16 use 16 or more bits per character (UTF-8 uses 8 to 32), and there is room for more than 1 million characters
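
The superset relationship can be checked in Python, whose strings are Unicode. This sketch assumes the UTF-8 encoding, which the card does not mention; the helper name `same_in_ascii_and_utf8` is made up.

```python
def same_in_ascii_and_utf8(code: int) -> bool:
    """True if the code point encodes to the same single byte in both."""
    ch = chr(code)
    return ch.encode("ascii") == ch.encode("utf-8") == bytes([code])


print(ord("A"))                                            # 65
print(all(same_in_ascii_and_utf8(c) for c in range(128)))  # True
print(len("\u00e9".encode("utf-8")))                       # 2 (accented e, outside ASCII)
```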

10
Q

How is Unicode organized?

A

divided into blocks of characters; each block has a theme

ex. arabic, hebrew, etc

11
Q

What are emoji? Where did they come from? Where are they located in Unicode?

A

symbols (small pictures) that typically appear in text

from the Japanese words for picture (e) and character (moji)

Miscellaneous Symbols and Pictographs, range 1F300-1F5FF and Emoticons, range 1F600-1F64F
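
A quick way to see which of these two blocks a character falls in, sketched in Python. The ranges are the hexadecimal ones listed on this card; the `BLOCKS` table and `block_of` function are hypothetical names, and other emoji blocks are not covered.

```python
# Block ranges as listed on the card (inclusive, hexadecimal code points).
BLOCKS = {
    "Miscellaneous Symbols and Pictographs": (0x1F300, 0x1F5FF),
    "Emoticons": (0x1F600, 0x1F64F),
}


def block_of(ch: str):
    """Return the name of the block containing ch, or None if not listed."""
    code = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= code <= high:
            return name
    return None


grinning_face = "\U0001F600"
print(hex(ord(grinning_face)))  # 0x1f600
print(block_of(grinning_face))  # Emoticons
```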

12
Q

What is data compression?

A

a reduction in the amount of space needed to store a piece of data

13
Q

What is compression ratio?

A

the size of the compressed data divided by the size of the original data
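
As a worked example, the definition can be written as a one-line Python function. The name `compression_ratio` and the byte counts below are made up for illustration.

```python
def compression_ratio(compressed_size: int, original_size: int) -> float:
    """Size of the compressed data divided by the size of the original."""
    if original_size <= 0:
        raise ValueError("original size must be positive")
    return compressed_size / original_size


print(compression_ratio(80, 100))  # 0.8
print(compression_ratio(90, 100))  # 0.9
```

With this definition a smaller ratio means the data shrank more.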

14
Q

Which means more compression: a compression ratio of 90% or 80%?

A

80%, because the compressed data is 80% of the original size (compressed by 20%), while 90% means it was only compressed by 10%

15
Q

What are the two types of data compression?

A

Lossless, when the data can be retrieved without losing any of the original information

Lossy, when some information may be lost in the process of compression

16
Q

What is keyword encoding?

A

frequently used words are replaced with a single character; the characters used as codes cannot appear in the original text

ex. the word "as" is represented as ~
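
A minimal sketch of keyword encoding in Python. Only the pair "as" and ~ comes from the card; the other table entries and the function names are assumptions, and words are split on single spaces for simplicity.

```python
# Hypothetical keyword table; only "as" -> "~" is taken from the card.
KEYWORDS = {"as": "~", "the": "^", "and": "+"}


def keyword_encode(text: str) -> str:
    """Replace each whole keyword with its one-character code."""
    for code in KEYWORDS.values():
        if code in text:
            raise ValueError(f"code character {code!r} already appears in the text")
    return " ".join(KEYWORDS.get(word, word) for word in text.split(" "))


def keyword_decode(text: str) -> str:
    """Reverse keyword_encode by mapping codes back to words."""
    reverse = {code: word for word, code in KEYWORDS.items()}
    return " ".join(reverse.get(word, word) for word in text.split(" "))


encoded = keyword_encode("as soon as the cat and the dog")
print(encoded)                  # ~ soon ~ ^ cat + ^ dog
print(keyword_decode(encoded))  # as soon as the cat and the dog
```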

17
Q

What is run length encoding?

A

a long run of a single repeated character is replaced with a short code; such runs don't usually occur in English text but often occur in large data streams

18
Q

In run length encoding, what are sequences of repeating characters replaced with?

A

a sequence like this is replaced by a flag character (*), followed by the repeated character, and a digit to signify how many times the character is repeated
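
The flag, character, digit scheme can be sketched in Python as follows. The choice to encode only runs of 4 or more, and to split runs longer than 9 so the count fits in one digit, are assumptions made for this example; the function names are hypothetical.

```python
FLAG = "*"


def rle_encode(text: str) -> str:
    """Replace runs of 4-9 identical characters with FLAG + char + count."""
    if FLAG in text:
        raise ValueError("flag character must not appear in the text")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        # Count the run, capped at 9 so the count is a single digit.
        while i + run < len(text) and text[i + run] == ch and run < 9:
            run += 1
        if run >= 4:  # assumed threshold: shorter runs would not save space
            out.append(f"{FLAG}{ch}{run}")
        else:
            out.append(ch * run)
        i += run
    return "".join(out)


def rle_decode(encoded: str) -> str:
    """Expand every FLAG + char + count triple back into the run."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == FLAG:
            out.append(encoded[i + 1] * int(encoded[i + 2]))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print(rle_encode("AAAAAAABBBCCCCCCCCCCCC"))  # *A7BBB*C9CCC
```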

19
Q

What is Huffman encoding?

A

Huffman codes use variable-length bit strings to represent each character, so some characters are represented by 5 bits, others by 7 bits, etc.

20
Q

What is one challenge that Huffman encoding faced? What was the solution?

A

hard to know where one symbol ends and the other begins

prefix property: no code is a prefix of another code
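
A short Python sketch of why the prefix property matters: with it, a bit string can be decoded left to right with no separators. The code table `CODES` is a made-up example, not a real Huffman table, and the function names are illustrative.

```python
CODES = {"a": "0", "b": "10", "c": "11"}  # hypothetical prefix-free codes


def has_prefix_property(codes: dict) -> bool:
    """True if no code is a prefix of (or equal to) another code."""
    values = list(codes.values())
    for i, x in enumerate(values):
        for j, y in enumerate(values):
            if i != j and y.startswith(x):
                return False
    return True


def decode_bits(bits: str, codes: dict) -> str:
    """Read bits left to right, emitting a symbol whenever a code matches."""
    reverse = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a complete code")
    return "".join(out)


print(has_prefix_property(CODES))    # True
print(decode_bits("010110", CODES))  # abca
```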

21
Q

What is a benefit of Huffman encoding in terms of compression?

A

it uses only a few bits for characters that appear often and reserves longer bit strings for characters that are less common, so the overall size of the document is smaller

22
Q

Where did the idea for Huffman encoding come from?

A

morse code, fewer dots and dashes for common letters

23
Q

What is input/output/property in reference to Huffman encoding?

A

input is symbols and their frequency

output is the binary code for each symbol

property is that the codes give the optimal compression among codes with the prefix property

24
Q

What are the steps in the Huffman algorithm?

A
  1. sort the symbols in ascending order of frequency (least to most frequently used) and place them in a sorted queue
  2. remove the two symbols with the smallest frequencies, replace them with a combined symbol whose frequency is their sum, and place it in a second queue
  3. repeat step 2, always taking the two smallest items from the fronts of the two queues, until only one item remains
  4. the result is a binary tree
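
The steps above can be sketched in Python with two queues. This is one possible reading of the card, not a definitive implementation: the node layout `(frequency, symbol, left, right)`, the function name, and the sample frequencies are all assumptions.

```python
from collections import deque


def build_huffman_tree(freqs: dict):
    """Build a Huffman tree from {symbol: frequency} using two queues.

    Each node is a tuple (frequency, symbol, left, right); leaves have
    left = right = None and combined nodes have symbol = None.
    """
    if not freqs:
        raise ValueError("need at least one symbol")
    # Step 1: queue of leaves sorted by ascending frequency.
    leaves = deque(sorted(((f, s, None, None) for s, f in freqs.items()),
                          key=lambda node: node[0]))
    combined = deque()  # second queue, filled in ascending order

    def pop_smallest():
        # The smallest remaining item is at the front of one of the queues.
        if not combined or (leaves and leaves[0][0] <= combined[0][0]):
            return leaves.popleft()
        return combined.popleft()

    # Steps 2 and 3: merge the two smallest items until one remains.
    while len(leaves) + len(combined) > 1:
        a = pop_smallest()
        b = pop_smallest()
        combined.append((a[0] + b[0], None, a, b))
    return pop_smallest()  # step 4: the root of the binary tree


root = build_huffman_tree({"a": 5, "b": 2, "c": 1, "d": 1})
print(root[0])  # 9, the total of all frequencies
```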
25
Q

What is the root of a binary tree?

A

topmost node or spot on the tree

26
Q

What is a branch of a binary tree?

A

a link that connects the root to a node or one node to another

27
Q

What is a node of a binary tree?

A

a point in the tree where branches meet; in a binary tree each node has at most two branches below it

28
Q

What is a leaf of a binary tree?

A

a node with no branches below it, at the end of a path; in a Huffman tree each leaf holds a symbol

29
Q

How do you label the branches of a binary tree? Can you label it another way?

A

left branches are 0 and right branches are 1

yes as long as one branch is 0 and the other is 1

30
Q

How can you find the binary code for a symbol using a completed binary tree?

A

follow the path from the root of the tree to the leaf and the corresponding binary sequence is the code
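
Walking the tree can be sketched in Python. The tree below is a small hand-built example (internal nodes are `(left, right)` pairs, leaves are symbol strings), not the output of a real Huffman run, and `find_code` is a hypothetical helper. Left branches are labeled 0 and right branches 1.

```python
# Hypothetical tree: internal nodes are (left, right) pairs, leaves are symbols.
TREE = (("c", "b"), "a")


def find_code(tree, symbol: str, path: str = ""):
    """Return the 0/1 path from the root to the leaf holding symbol."""
    if isinstance(tree, str):  # reached a leaf
        return path if tree == symbol else None
    left, right = tree
    found = find_code(left, symbol, path + "0")   # left branch is 0
    if found is not None:
        return found
    return find_code(right, symbol, path + "1")   # right branch is 1


print(find_code(TREE, "a"))  # 1
print(find_code(TREE, "b"))  # 01
print(find_code(TREE, "c"))  # 00
```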

31
Q

How can you find the compressed bit length using a binary tree?

A

the summation of (character code length x frequency count)

ex. c appears 3 times and the code is 000, (3 x 3), a appears 17 times and the code is 01, (2 x 17)
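
The summation can be checked with a few lines of Python, using the two symbols from the example on this card. The function name `compressed_bits` is made up, and a real document would have more than two symbols.

```python
def compressed_bits(codes: dict, counts: dict) -> int:
    """Sum of (code length x frequency count) over all symbols."""
    return sum(len(codes[symbol]) * counts[symbol] for symbol in counts)


codes = {"c": "000", "a": "01"}
counts = {"c": 3, "a": 17}
print(compressed_bits(codes, counts))  # 43, from (3 x 3) + (2 x 17)
```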

32
Q

Which are the least and most effective lossless encodings on their own?

A

least is keyword encoding

most is Huffman encoding