W1 Flashcards

1
Q

What is data?

A

Data refers to raw facts, information, or observations that are collected, stored, and processed for various purposes. It can take various forms, including numbers, text, images, or any other representation of information.

2
Q

What is datafication?

A

When aspects of our life are turned into digital data (typically automatically).

  • Online behaviour:
  • Interactions with other people being “datafied” (e.g. “likes” on Instagram).
  • Browsing history (through cookies) and searches being “datafied”
  • Offline behaviour:
  • Being “datafied” when visiting places via sensors, cameras, etc.
3
Q

What is Big Data?

A

Big Data refers not only to the quantity of data but to data with the following characteristics:

  • Volume: Large amount of data, from terabytes to petabytes
  • Velocity: High speed of data generation
  • Value: Valuable information buried in data sea
  • Variety: Lack of homogeneity in data types, formats and quality

Some people may also include other Vs like:
* Veracity: Can we trust the data?
* Variability: Changing formats, structure, or sources of big data

4
Q

What is the power of data?

A
  • Know better about the (potential) customers/users.
  • Examples: Walmart found that the top-selling pre-hurricane item was beer by mining trillions of bytes of sales history; Netflix uses viewing data for show recommendations.
  • Deep learning
  • Image recognition algorithms like ResNet often use millions of images to train
  • Generate human-like text. GPT-3 (released by OpenAI in 2020) uses billions of tokens to train
  • Data → Value (→ Profits)
5
Q

What is a data science project life cycle?

A
  • Ask Questions
  • Data Acquisition
  • Data Preparation
  • Data Exploration (and ask questions AGAIN)
  • Analyse/model data
  • Evaluation
  • Action
6
Q

What are the advantages and disadvantages of ‘Ready to use datasets’?

A

Advantage
* Minimal effort to process the data, can focus on modelling/analysis techniques

Disadvantage
* Real-world data is seldom available in such a nice, clean, ready-to-use form

7
Q

What are some features of data in real life?

A
  1. Requires effort to collect
    - Some data is available somewhere on the Internet but not stored as a table.
    - How can we gather data automatically?
    - EXAMPLE: We can get more up-to-date car data from some car-selling websites, but information about each car is separate.
  2. Non-Tabular
    - A lot of the data available is not in the form of a table.
    - How can we extract the required information and organise it in table form for analysis?
    - EXAMPLE: The data from the car-selling web page is in HTML.
  3. Issues with missing data, incorrect values, and duplicate data
    - How do we detect and deal with these problems?
    - EXAMPLE: A dataset may include NaN (missing) values.
  4. Data scattered in different locations
    - It is common that data required for an analysis is available from different data sources.
    - How can we get the data from the database and combine it with different tables?
  5. Different types of data
    - EXAMPLE: Graph (network) data, e.g. how American politicians are connected on Twitter.
    - How can we work with and visualise this type of data?
8
Q

Why is data visualisation useful?

A
  • Visualisation can help to discover patterns in the data that statistics may miss
  • Visualisation helps to raise questions that stimulate research and further analysis
  • Visualisation helps to answer questions and effectively conveys the message of the analysis/result
  • EXAMPLE. The graph on the percentage of new DC characters by gender can (partially) answer the question: female characters are not introduced at a rate approaching gender parity
9
Q

What are the 2 main types of data?

A
  1. Quantitative
  2. Qualitative
10
Q

What is Quantitative Data?

A

Quantitative data (or numerical data) refers to numerical information or data that can be expressed as numbers and can be measured.

Can perform meaningful computation like sum, average, difference, etc. E.g. Average Bitcoin price per week

There are two types of quantitative data:
1. Discrete
2. Continuous

11
Q

What are the 2 types of Quantitative Data?

A

Two types of quantitative data:

  1. Discrete
    - Can only take distinct values and cannot be subdivided infinitely
    - E.g. count data (The number of students in a class, the number of likes, etc).
  2. Continuous
    - Can take on any value within a range
    - E.g. Height, temperature, Bitcoin prices
12
Q

What is Qualitative Data?

A

Qualitative data (or categorical data) refers to non-numerical information that describes qualities, characteristics, or attributes.

  • Measures of “categories”
  • E.g. Genders, BSc programmes, Python levels
  • While qualitative data is non-numerical in nature, it can be “mapped” or “coded” as numbers
  • E.g. LSE student ID
  • Numerical calculation may not make sense
  • E.g. Averaging the student IDs for students taking this course
  • Two types of Qualitative Data:
    1. Ordinal
    2. Nominal
13
Q

What are the 2 types of Qualitative Data?

A
  1. Ordinal: have meaningful order or rank
    * EXAMPLE:
    - Survey response options: strongly disagree, disagree, neither agree nor disagree, agree and strongly agree
    - Python level: None, Basic, Intermediate, Advanced (order by proficiency in Python)
    * Note the exact differences between the values may not be well-defined
  2. Nominal: no natural order
    * No meaningful way to compare the categories in terms of magnitude or order
    - Each category is considered equal to the others
    * e.g. Genders, BSc programmes
14
Q

What is sequential data?

A

Sequential data is data arranged in sequences where order matters.

  • Each data point is associated with a specific time or position in a sequence
  • EXAMPLE:
  • Text data
  • Gene sequence (ACGT)
  • Daily temperature readings
  • The closing price of Bitcoin in December 2023
15
Q

What is Time Series Data?

A

A time series is a sequence of data points indexed in time order.

  • A type of sequential data
  • EXAMPLE:
  • Daily temperature readings
  • The closing price of Bitcoin in December 2023
  • Number of covid cases
16
Q

Why is it important to understand the data types?

A

Understanding the data type helps to determine the appropriate analysis methods.

  • Different types of data require different:
  • Data cleaning and preprocessing
  • Descriptive statistics
  • Visualisation techniques
  • Statistical models
17
Q

What are the 3 categories of data based on how the data is processed, organised, and stored?

A

Data can be categorised into structured data, semi-structured data and unstructured data
based on how the data is processed, organised and stored.

18
Q

What is structured data?

A

Structured data has a pre-defined data model and is organised in a pre-defined way

  • Often stored in tabular formats e.g. auto dataset

*EXAMPLE:
start year   end year   name
2017         2023       Minouche Shafik
2023         2024       Eric Neumayer
2024         NA         Larry Kramer

19
Q

What is unstructured data?

A

Unstructured data is information that has neither a pre-defined data model nor a useful/consistent structure to help process the data

*Often in a raw, natural form and can include text, images, audio, and video

*Accounts for the majority of the data available in the world

  • EXAMPLE:From Wikipedia pages:
  • Larry D. Kramer (born June 23, 1958) is an American legal scholar serving as the president and vice chancellor of the London School of Economics since April 2024. Previously, Kramer served as president of the William and Flora Hewlett Foundation from 2012 through 2023. Prior to that role, he was the Dean of Stanford Law School (2004–2012). He is a scholar of both constitutional law and civil procedure.
  • Nemat Talaat Shafik, Baroness Shafik (born 13 August 1962), commonly known as Minouche Shafik, is a British-American academic and economist. She served as the president and vice chancellor of the London School of Economics from 2017 to 2023, and then as the 20th president of Columbia University from July 2023 to August 2024.
20
Q

What is semi-structured data?

A

Semi-structured data may not have a rigid pre-defined structure but has some level of organisation

  • Information about the organisation of the data is often within the data in the form of tags or a hierarchical structure
  • “Self-describing”

*EXAMPLE: (data in JSON format):
{"presidents": [
  {
    "name": "Larry Kramer",
    "start year": 2024,
    "universities": ["Brown University", "University of Chicago Law School"]
  },
  {
    "name": "Minouche Shafik",
    "end year": 2023,
    "start year": 2017
  }
]}
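
As a rough illustration of the "self-describing" idea, the example above can be parsed with Python's standard json module. This is only a sketch: the variable names raw, data and rows are made up for the illustration, and the JSON is typed with straight quotes, which the JSON format requires.

```python
import json

# The example JSON, written with straight quotes (required by the JSON format).
raw = '''
{"presidents": [
  {"name": "Larry Kramer", "start year": 2024,
   "universities": ["Brown University", "University of Chicago Law School"]},
  {"name": "Minouche Shafik", "end year": 2023, "start year": 2017}
]}
'''

data = json.loads(raw)  # parse the text into nested dicts and lists

# Records do not all share the same keys, so .get() is used:
# it returns None when a key is absent instead of raising KeyError.
rows = [
    (p["name"], p.get("start year"), p.get("end year"))
    for p in data["presidents"]
]
print(rows)
```

Because each record names its own fields, a missing "end year" simply shows up as None when the data is flattened into rows.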

21
Q

How is data represented and stored?

A

BITS

In a computer, data is represented using the binary numeral system - text, numbers, images, audio, etc. are stored as a sequence of BITS.

  • Bit is a basic unit of information in a computer
  • It can only take two possible values, which can be considered as on / off, true / false, or 0 / 1
  • n bits can represent 2^n distinct patterns
  • For n = 2, there are 2^2 = 4 distinct combinations (00, 01, 10, 11) to represent 4 patterns
  • For n = 3, there are 2^3 = 8 distinct combinations (000, 001, 010, 011, 100, 101, 110, 111) to represent 8 patterns
  • For n=8, there are 2^8 = 256 distinct combinations to represent 256 patterns
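
The counts above can be checked by listing the patterns directly. A minimal sketch, where bit_patterns is a hypothetical helper name rather than anything from the course material:

```python
from itertools import product

def bit_patterns(n):
    """Return every distinct pattern of n bits as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]

for n in (2, 3, 8):
    patterns = bit_patterns(n)
    # The number of patterns should equal 2**n.
    print(n, len(patterns), 2**n)
```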
22
Q

What is a Byte?

A

Byte is a common unit of digital information.

  • Eight bits = one byte
  • How many distinct patterns can 1 byte represent?
  • 2^8 = 256
23
Q

What are some of the multi-byte units?

A
  • Kilobyte (kB) = 1000 bytes
  • Megabyte (MB) = 1000^2 bytes
  • Gigabyte (GB) = 1000^3 bytes
  • Terabyte (TB) = 1000^4 bytes
  • Petabyte (PB) = 1000^5 bytes
  • Exabyte (EB) = 1000^6 bytes
  • Zettabyte (ZB) = 1000^7 bytes
  • Yottabyte (YB) = 1000^8 bytes
24
Q

How are unsigned integers represented?

A

Unsigned integers (i.e. non-negative integers) are commonly stored in 4 bytes (i.e. 32 bits).

  • This allows us to represent non-negative integers from 0 to 4294967295 (2^32 − 1)

*With a finite number of bits n, we can only represent a finite range of integers
- Unsigned integer: [0, 2^n − 1]
- Signed integer: [−2^(n−1), 2^(n−1) − 1]

*Attempting to represent an integer that is outside of the range can cause unexpected consequences
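
The two range formulas can be evaluated for any bit width. A small sketch, with unsigned_range and signed_range as made-up helper names:

```python
def unsigned_range(n):
    """Smallest and largest unsigned integer that fit in n bits."""
    return 0, 2**n - 1

def signed_range(n):
    """Smallest and largest signed (two's complement) integer in n bits."""
    return -(2**(n - 1)), 2**(n - 1) - 1

for n in (8, 32):
    print(n, unsigned_range(n), signed_range(n))
```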

25
How can you represent the integer 13 in bytes?
1101 (base 2) = 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 13
* Therefore, using 4 bytes (32 bits), we can represent the integer 13 by 00000000 00000000 00000000 00001101
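
The same conversion can be reproduced with Python built-ins. This is just one way to check the arithmetic; the variable names are illustrative:

```python
value = 13

# format(..., "032b") gives the binary digits, zero-padded to 32 bits.
bits = format(value, "032b")

# Group the 32 bits into 4 bytes for readability.
grouped = " ".join(bits[i:i + 8] for i in range(0, 32, 8))
print(grouped)

# Positional sum: each bit times the matching power of 2, rightmost bit first.
positional_sum = sum(
    int(bit) * 2**power for power, bit in enumerate(reversed("1101"))
)
print(positional_sum, int("1101", 2))
```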
26
How can you represent a negative integer in bytes?
The leftmost bit indicates the sign:
- If the leftmost bit is 0, it represents a non-negative integer
- If the leftmost bit is 1, it represents a negative integer
* With n bits, we can represent integers from −2^(n−1) through 2^(n−1) − 1
27
What happens if we need to represent an integer larger than 32 bits can hold?
* Take 2152382740 as an example: it is greater than the maximum integer that 32 bits can represent (2152382740 > 2147483647)
* The binary representation of 2152382740 is 10000000 01001010 11000001 00010100, whose leftmost bit is 1, so it is interpreted as a negative integer
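
Python's own integers do not overflow, but the effect can be imitated by writing the value into 4 bytes and reading those same bytes back as a signed integer. A sketch under that assumption, using only int.to_bytes and int.from_bytes:

```python
value = 2152382740

# Store the value in 4 bytes as if it were an unsigned 32-bit integer.
raw = value.to_bytes(4, byteorder="big", signed=False)

# Read the same 4 bytes back as a signed (two's complement) integer.
as_signed = int.from_bytes(raw, byteorder="big", signed=True)

print(format(value, "032b"))  # leftmost bit is 1
print(as_signed)              # -2142584556, i.e. value - 2**32
```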
28
How are real numbers represented in a computer?
In a computer, real numbers are represented by floating-point numbers.
* Notice that:
- The number of real numbers is infinite (e.g. imagine the real numbers within a small range like [0, 0.01])
- The number of patterns that can be represented by n bits is finite (e.g. with 64 bits, we can represent 2^64 = 18446744073709551616 distinct numbers)
- It is not possible to represent all real numbers with a finite number of bits
29
What is a floating-point number?
A floating-point number is represented approximately with a fixed number of significant digits (the significand) and scaled using an exponent in some fixed base, in the form: significand × base^(exponent)
* EXAMPLE:
- 0.0125 = 1.25 × 10^(−2)
- 1250 = 1.25 × 10^3
- 0.0125 = 125 × 10^(−4)
* The term floating point refers to the fact that a number's "point" can "float"
- The point can "move" by changing the exponent
* In base 10 with finite digits, we are not able to represent some numbers exactly
* By increasing the number of digits we can have a better approximation
* Floating-point numbers in a computer are represented using finite bits. Therefore, some numbers are not represented exactly
30
What is the difference between floating-point numbers with base=2, and base=10?
With base = 2: 0.001 (base 2) = 1 × 2^(−3) = 0.125 (base 10)
* In base 2 with finite digits, we are not able to represent some numbers exactly
* Some numbers that can be represented exactly in base 10 cannot be expressed exactly in base 2 with finite digits
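
One way to see this in Python is to ask for the exact value a float actually stores. The sketch below uses decimal.Decimal and math.frexp from the standard library; 0.1 is chosen here as an illustrative number that is exact in base 10 but not in base 2.

```python
import math
from decimal import Decimal

# 0.125 = 2**-3 is exactly representable in base 2.
exact_eighth = Decimal(0.125)
print(exact_eighth)  # 0.125

# 0.1 is exact in base 10, but the stored binary float is only close to it.
exact_tenth = Decimal(0.1)
print(exact_tenth)  # 0.1000000000000000055511151231257827021181583404541015625

# frexp splits a float into significand and base-2 exponent:
# 0.125 == 0.5 * 2**-2
significand, exponent = math.frexp(0.125)
print(significand, exponent)  # 0.5 -2
```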
31
Why is it important to know that floats are only approximations?
* Real numbers are approximated in a computer
* We need to be cautious when we compare floats
* EXAMPLE:
    3.14 + 3
    Out: 6.140000000000001
    3.14 + 3 == 6.14
    Out: False
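
A common way to compare floats cautiously is to test whether they are close rather than equal. A minimal sketch using math.isclose from the standard library (the numbers 0.1, 0.2 and 0.3 are an illustrative choice, not taken from the card):

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because of the tiny approximation error.
print(total == 0.3)  # False

# math.isclose compares within a relative tolerance (default 1e-09).
print(math.isclose(total, 0.3))  # True

# An explicit absolute tolerance is another option.
print(abs(total - 0.3) < 1e-9)  # True
```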
32
How can text be represented in a computer?
Text can be represented by a sequence of characters. * A character is a unit of information that roughly corresponds to a symbol, such as: - Letter - Digit - Punctuation mark (such as '.', '-') - space ' '
33
What are control characters?
Control characters do not correspond to visible symbols but rather to instructions to format or process the text.
EXAMPLE: the newline character '\n', which moves the print head down one line:
    print('Hello\nworld')
    OUT:
    Hello
    world
34
What is the escape character?
The escape character is the backslash (written '\\' in a Python string literal). It is used to distinguish the newline character '\n' from the character 'n'.
EXAMPLE:
    print('n\nn')  # n newline n
    OUT:
    n
    n
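
To see what the backslash changes, it can help to compare string lengths. In this sketch the variable names are illustrative; the raw-string prefix r is standard Python syntax.

```python
newline = "\n"    # one character: the newline control character
literal = "\\n"   # two characters: a backslash followed by the letter n
raw = r"\n"       # raw string: backslashes are not treated as escapes

print(len(newline), len(literal), len(raw))  # 1 2 2
print(repr(newline), repr(literal))          # repr shows the escapes explicitly
```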
35
What is a whitespace?
Whitespace is any character or series of characters that represents horizontal or vertical space.
EXAMPLE:
- space: ' '
- tab: '\t'
- newline: '\n'
    print('Hello\tworld')
    print('st\t115')
    OUT:
    Hello   world
    st      115
36
What is character encoding and an example?
Characters are represented using a character encoding that maps each character to something (e.g. an integer).
* Common examples of character encoding systems include:
- ASCII (American Standard Code for Information Interchange)
- UTF-8 or UTF-16 encoding for Unicode
37
What is ASCII?
ASCII (American Standard Code for Information Interchange)
* ASCII defines 128 characters, which map to the integers 0–127 (i.e. requires only 7 bits)
- These include the 52 letters of the English alphabet (uppercase and lowercase), e.g. 'A' maps to 65, 'a' maps to 97
* Many more letters and symbols are desirable or required to directly represent other characters, such as:
- Letters of alphabets other than English (e.g. é)
- More kinds of punctuation and symbols (e.g. ¡ and £)
- More mathematical operators (e.g. ÷)
* As characters are often stored in 8 bits, and 8 bits can represent 256 distinct characters, an additional 128 character definitions can be added!
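
The mapping can be inspected with the built-ins ord() and chr(). The helper is_ascii below is a hypothetical name used only for this sketch:

```python
def is_ascii(text):
    """True if every character maps to an integer in 0-127."""
    return all(ord(ch) < 128 for ch in text)

print(ord("A"), ord("a"))       # 65 97
print(chr(65))                  # A
print(format(ord("A"), "07b"))  # 1000001, which fits in 7 bits

print(is_ascii("Hello"))   # True
print(is_ascii("\u00a3"))  # False: the pound sign is outside ASCII
```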
38
What is Unicode?
The Unicode Standard (Unicode for short) provides a unique mapping for every character, no matter what platform, device, application or language.
* Every character:
- It currently defines more than 150000 characters covering ~160 modern and historic scripts, as well as symbols, emojis, and non-visual control and formatting codes
- It can represent at most around a million characters
* Unique mapping: each character has its own unique Unicode code point
* It is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode (e.g. the number 65 is mapped to 'A')
* Unicode characters do not generally fit into 8 bits
* The Unicode standard defines Unicode Transformation Formats (UTF):
- UTF-8
- UTF-16
- UTF-32
39
What is UTF-8?
* The name is derived from Unicode Transformation Format – 8-bit
* UTF-8 encodes all Unicode characters using one to four 8-bit (i.e. 1-byte) code units
- Variable-width character encoding!
* EXAMPLE:
- ASCII characters require 1 byte
- Greek letters require 2 bytes (e.g. α)
- Most Chinese, Japanese and Korean characters require 3 bytes (e.g. 食)
- Most emojis require 4 bytes (e.g. 😀)
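
The byte counts can be checked by encoding one sample character of each kind. A short sketch; the list of samples and the dictionary name are illustrative:

```python
# Sample characters: ASCII letter, Greek letter, CJK character, emoji.
samples = ["A", "α", "食", "😀"]

byte_counts = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_counts[ch] = len(encoded)
    # Print the code point rather than the character so the output is plain ASCII.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded}")
```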
40
What is UTF-16?
* Variable-width character encoding
* Encodes all Unicode characters using one or two 16-bit (i.e. 2-byte) code units
41
What is UTF-32?
* Fixed-length character encoding
* Encodes all Unicode characters using 32 bits (i.e. 4 bytes)
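
To compare the variable-width UTF-16 with the fixed-width UTF-32, the sketch below encodes a few sample characters. It uses the 'utf-16-le' and 'utf-32-le' codecs so that no byte order mark is added to the counts; that choice is an assumption made for the illustration.

```python
samples = ["A", "食", "😀"]

sizes = {}
for ch in samples:
    utf16 = ch.encode("utf-16-le")  # one or two 16-bit code units
    utf32 = ch.encode("utf-32-le")  # always one 32-bit code unit
    sizes[ch] = (len(utf16), len(utf32))
    print(f"U+{ord(ch):04X}: UTF-16 {len(utf16)} bytes, UTF-32 {len(utf32)} bytes")
```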
42
What can we use to code and decode characters and why is it useful?
Use the string method .encode() and the bytes method .decode():
    smile_utf_8 = '😀'.encode('utf-8')
    smile_utf_8.decode('utf-8')
* Characters are represented differently internally when we use different encoding methods:
EXAMPLE (UTF-8)
    smile_utf_8 = '😀'.encode('utf-8')
    smile_utf_8
    Out: b'\xf0\x9f\x98\x80'
EXAMPLE (UTF-16)
    smile_utf_16 = '😀'.encode('utf-16')
    smile_utf_16
    Out: b'\xff\xfe=\xd8\x00\xde'
* When we decode, we need to know how the data was encoded.
EXAMPLE (Correct: UTF-8)
    smile_utf_8.decode('utf-8')
    Out: '😀'
EXAMPLE (WRONG: UTF-16)
    smile_utf_8.decode('utf-16')
    Out: '鿰肘'
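
Decoding with the wrong encoding does not always produce garbage silently; sometimes it fails outright. The sketch below is an illustration that goes slightly beyond the card: it encodes the pound sign with Latin-1 (an encoding not mentioned above) and then tries to decode the bytes as UTF-8.

```python
# Encode the pound sign with Latin-1: a single byte, 0xA3.
data = "\u00a3".encode("latin-1")

try:
    data.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    # 0xA3 on its own is not a valid UTF-8 sequence.
    outcome = "UnicodeDecodeError"

# errors="replace" substitutes U+FFFD (the replacement character) instead of failing.
lenient = data.decode("utf-8", errors="replace")

print(outcome, ascii(lenient))
```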
43
How can you make headings in Markdown in Jupyter notebook?
Headings use one or more # characters followed by a space:
# H1
## H2
### H3
44
How can you make text BOLD in Markdown in Jupyter notebook?
Bold: **bold**
45
How can you make text Italic in Markdown in Jupyter notebook?
Italic: *italic*
46
How do you make blockquote in Markdown in Jupyter notebook?
> blockquote
NOTE: You can also nest blockquotes via >>
EXAMPLE:
> Dorothy followed her through many of the beautiful rooms in her castle.
>
>> The Witch bade her clean the pots and kettles and sweep the floor and keep the fire fed with wood.
47
How do you add a link in Markdown in Jupyter notebook?
Link: [title](https://www.example.com)
48
How do you add an image in Markdown in Jupyter notebook?
Image: ![alt text](image.jpg)
49
How do you build a table using markdown in Jupyter notebook?
You can create a table using the pipe (|) and hyphen (-) syntax:
| Header 1 | Header 2 | Header 3 |
|------------|:----------:|-----------:|
| Row 1, Col 1 | Row 1, Col 2 | Row 1, Col 3 |
| Row 2, Col 1 | Row 2, Col 2 | Row 2, Col 3 |
NOTE:
* Header Row: The first row defines the headers of your table.
* Separator Row: The second row uses hyphens (with optional colons for alignment) to separate the header from the rest of the table.
- |------------| creates a left-aligned column.
- |:----------:| creates a centered column.
- |-----------:| creates a right-aligned column.
* Data Rows: Each subsequent row adds data cells, separated by pipes.
50
How can you use mathematical notation to convert a binary number into the decimal system using Markdown in Jupyter notebook?
1101 (base 2) = 1 × 2^3 + 1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 13 (base 10)
MARKDOWN CODE:
$$1101_2 = 1\times 2^3 + 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 13_{10}$$
51
How do you use code to update a list (aapl_prices) so that it stores the prices rounded to the nearest integer?
    # Replace each price in place with its rounded value
    for i in range(0, len(aapl_prices)):
        aapl_prices[i] = round(aapl_prices[i])
    aapl_prices
52
How do you write code to calculate the difference between each pair of consecutive numbers (in aapl_prices), and store the results in a list and bind it to the variable (price_diff)?
    price_diff = []
    # Stop one short of the end so that i + 1 is always a valid index
    for i in range(0, len(aapl_prices) - 1):
        price_diff.append(aapl_prices[i + 1] - aapl_prices[i])
    price_diff
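
The two list exercises above can also be written without index arithmetic. The sketch below is an alternative formulation, not the intended solution, and the prices in aapl_prices are made-up values for illustration only.

```python
# Hypothetical prices, invented for this example.
aapl_prices = [150.2, 151.7, 149.9, 153.4]

# Round every price to the nearest integer with a list comprehension.
rounded = [round(p) for p in aapl_prices]

# Pair each value with its successor and take the difference.
price_diff = [later - earlier for earlier, later in zip(rounded, rounded[1:])]

print(rounded)     # [150, 152, 150, 153]
print(price_diff)  # [2, -2, 3]
```

Note that Python's round() rounds exact halves to the nearest even integer, so round(2.5) gives 2.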
53
How do you convert a string to be all lowercase, uppercase, capitalise, and title?
* To convert to lowercase, use .lower()
* To convert to uppercase, use .upper()
* The capitalize() method converts the first character of the string to uppercase and the rest to lowercase.
* The title() method converts the first character of each word in the string to uppercase and the rest to lowercase.
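
A quick side-by-side of the four methods on one sample string (the string itself is arbitrary):

```python
s = "hello WORLD from python"

print(s.lower())       # hello world from python
print(s.upper())       # HELLO WORLD FROM PYTHON
print(s.capitalize())  # Hello world from python
print(s.title())       # Hello World From Python

# Strings are immutable: each method returns a new string and s is unchanged.
print(s)
```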
54
How do you count the number of times a letter appears in a string?
.count('t') counts the number of times the character 't' appears in a string. It also works for longer substrings (e.g. whole words). It is case sensitive, so you may need to convert the string to lowercase first.
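
A short sketch of the case-sensitivity point, using an arbitrary sample sentence:

```python
s = "The cat sat on the mat"

print(s.count("t"))            # 4: the capital T is not counted
print(s.lower().count("t"))    # 5: lowercase first to count both cases
print(s.lower().count("the"))  # 2: count() also works on substrings
```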
55
How can you replace words/letters in a string?
Use .replace(old, new), which returns a new string with every occurrence of old replaced by new:
    s = "I intend to live forever, or die trying."
    s.replace("to", "three")
    OUT: 'I intend three live forever, or die trying.'
56
How can you define a function that returns the absolute value?
    def absolute(x):
        # Non-negative values are returned unchanged; negatives are negated
        if x >= 0:
            return x
        else:
            return -x
57
How can you define a function number_of_combinations() to determine the number of distinct combinations that can be represented by a given number of bits?
    def number_of_combinations(num_bits):
        # n bits can represent 2^n distinct patterns
        return 2**num_bits
58
How do you define a function uint_to_32_bit_binary() to convert a non-negative integer to a string representing the 32-bit binary representation of the given integer?
POSSIBLE SOLUTION 1:
    def uint_to_32_bit_binary(x):
        binary_str = bin(x)[2:]  # drop the '0b' prefix
        binary_str = (32 - len(binary_str)) * '0' + binary_str  # pad with leading zeros
        return binary_str[-32:]  # keep only the lowest 32 bits
POSSIBLE SOLUTION 2:
    def uint_to_32_bit_binary(x):
        binx = bin(x)[2:]  # drop the '0b' prefix
        if len(binx) <= 32:
            return ('0' * (32 - len(binx))) + binx
        else:
            # str(x) is needed: an integer cannot be concatenated with a string
            return str(x) + " cannot be represented in 32 bits"
59
How do you define the function int_to_32_bit_binary() to convert an integer to a string representing its 32-bit binary representation using two's complement?
    def int_to_32_bit_binary(x):
        if x >= 0:
            return uint_to_32_bit_binary(x)
        else:
            # Two's complement: a negative x is stored as 2^32 + x
            return uint_to_32_bit_binary(2**32 + x)
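
As a cross-check, the same two's complement mapping can be written compactly with the modulo operator. This is an alternative sketch, not the solution above: it is self-contained, and the range check that raises ValueError is an added assumption about how out-of-range input should be handled.

```python
def int_to_32_bit_binary(x):
    """32-bit two's complement bit string of x, via x mod 2**32."""
    if not -2**31 <= x <= 2**31 - 1:
        raise ValueError(f"{x} does not fit in a signed 32-bit integer")
    # In Python, x % 2**32 is always non-negative, so a negative x maps to 2**32 + x.
    return format(x % 2**32, "032b")

print(int_to_32_bit_binary(13))
print(int_to_32_bit_binary(-13))
print(int_to_32_bit_binary(-1))
```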
60
How can you add 0.1 for 10 times in Python using a loop (float)?
If you add 0.1 ten times in Python:
    x = 0
    for i in range(10):
        x += 0.1
    x
    OUT: 0.9999999999999999
NOTE: The total is not exactly 1, but a number close to 1. This happens because a float is only an approximation of the given real number.
    x == 1
    OUT: False
You should use an integer as a counter to control the loop, as an integer is exact (i.e. step by 1, not 0.1):
    x = 0
    while x != 10:
        x += 1
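
A few ways of living with this approximation, sketched with the standard math module. The variable names are illustrative, and which option is appropriate depends on the task.

```python
import math

x = 0.0
for _ in range(10):
    x += 0.1

print(x == 1.0)              # False: rounding errors accumulate
print(math.isclose(x, 1.0))  # True: compare within a tolerance instead

# math.fsum tracks the lost precision and returns the correctly rounded sum.
exact_sum = math.fsum([0.1] * 10)
print(exact_sum)             # 1.0

# Drive the loop with an exact integer counter and derive the float from it.
steps = [i / 10 for i in range(11)]
print(steps[-1])             # 1.0
```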
61
What does the ord() command do?
The ord() function returns the integer Unicode code point of a specified character, e.g. ord('A') returns 65.
62
How many bytes do Greek Letters, Chinese/Japanese/Korean characters, and emojis need?
Greek letters need 2 bytes to represent, Chinese/Japanese/Korean characters often need 3 bytes and emoji often requires 4 bytes.
63
How do you split a string?
Use .split(); with no argument, it splits on any run of whitespace:
    score = 'harry \t 85\t79\n'
    print(score)
    name = score.split()[0]
64
How do you remove whitespace at the front/back of a string?
.strip() EXAMPLE: 'abc '.strip() returns 'abc'
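
Related methods strip only one side, and an optional argument changes which characters are removed. A brief sketch with arbitrary sample strings:

```python
padded = "  abc  "

print(repr(padded.strip()))   # 'abc'    both ends
print(repr(padded.lstrip()))  # 'abc  '  front only
print(repr(padded.rstrip()))  # '  abc'  back only

# Tabs and newlines count as whitespace too.
print(repr("\t abc \n".strip()))  # 'abc'

# With an argument, strip() removes those characters instead of whitespace.
print(repr("xxabcxx".strip("x")))  # 'abc'
```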
65
How can you convert scores from a string into integers?
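Using a list comprehension (where tokens is a list of strings, e.g. the result of .split()):
    scores = [int(score) for score in tokens[1:]]
Or, if scores is already a list of strings, using a loop:
    for i in range(len(scores)):
        scores[i] = int(scores[i])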
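
Putting the last few string cards together: a minimal end-to-end sketch that takes a raw line of text, splits it, and converts the scores. It assumes the line has the layout "name score score", and the final average is an added illustration.

```python
score = "harry \t 85\t79\n"

# split() with no argument splits on any run of whitespace
# and ignores leading/trailing whitespace.
tokens = score.split()
print(tokens)  # ['harry', '85', '79']

name = tokens[0]
scores = [int(token) for token in tokens[1:]]  # strings -> integers

average = sum(scores) / len(scores)
print(name, scores, average)  # harry [85, 79] 82.0
```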
66
REMEMBER
When a for loop needs to go over the indices of a list, don't forget range(len(...))!
67
What should often determine how you categorise data?
* How you categorise the data type depends on how you are going to use the data.
* What matters more is that you work with the data appropriately; having some idea of how the data is categorised can guide you.