W1 Flashcards
What is data?
Data refers to raw facts, information, or observations that are collected, stored, and processed for various purposes. It can take various forms, including numbers, text, images, or any other representation of information.
What is datafication?
When aspects of our life are turned into digital data (typically automatically).
- Online behaviour:
- Interactions with other people being “datafied” (e.g. “likes” on Instagram).
- Browsing history (through cookies) and searches being “datafied”
- Offline behaviour:
- Being “datafied” when visiting places via sensors, cameras, etc.
What is Big Data?
Big Data refers not only to the quantity of data but also to the following characteristics:
- Volume: Large amount of data, from terabytes to petabytes
- Velocity: High speed of data generation
- Value: Valuable information buried in data sea
- Variety: Lack of homogeneity in data types, formats and quality
Some people may also include other Vs like:
* Veracity: Can we trust the data?
* Variability: Changing formats, structure, or sources of big data
What is the power of data?
- Know better about the (potential) customers/users.
- Examples: Walmart: found out the pre-hurricane top-selling item was beer by mining trillions of bytes of sales history. Netflix: show recommendations
- Deep learning
- Image recognition algorithms like ResNet often use millions of images to train
- Generate human-like text. GPT-3 (released by OpenAI in 2020) uses billions of tokens to train
- Data → Value (→ Profits)
What is a data science project life cycle?
- Ask Questions
- Data Acquisition
- Data Preparation
- Data Exploration (and *Ask Questions AGAIN)
- Analyse/model data
- Evaluation
- Action
What are the advantages and disadvantages of ‘Ready to use datasets’?
Advantage
* Minimal effort to process the data, can focus on modelling/analysis techniques
Disadvantage
* Real-world data is seldom available in such nice, clean, ready-to-use way
What are some features of data in real life?
- Requires effort to collect
- Some data is available somewhere on the Internet but not stored as a table.
- How can we gather data automatically?
- EXAMPLE: We can get more up-to-date car data from some car-selling websites, but information about each car is separate.
- Non-Tabular
- A lot of the data available is not in the form of a table.
- How can we extract the required information and organise it in table form for analysis?
- EXAMPLE: The data from the car-selling web page is in HTML.
- Issues with missing data, incorrect values, and duplicate data
- How do we detect and deal with these problems?
- EXAMPLE: The data may include NaN (missing) values.
- Data scattered in different locations
- It is common that data required for an analysis is available from different data sources.
- How can we get the data from the database and combine it with different tables?
- Different types of data
- Graph data can be used to visualise, e.g., how American politicians are connected on Twitter.
- How can we work with and visualise this type of data?
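One way to detect the missing, incorrect, and duplicate values mentioned above is sketched below. This is a minimal illustration using only the Python standard library; the `cars` records, the `is_missing` helper, and the "negative price is incorrect" rule are all made up for this example and are not taken from any real dataset.

```python
import math

# Hypothetical car records (made-up values for illustration only).
cars = [
    {"model": "A", "price": 12000.0},
    {"model": "B", "price": float("nan")},  # missing price stored as NaN
    {"model": "A", "price": 12000.0},       # exact duplicate of the first row
    {"model": "C", "price": -500.0},        # implausible (negative) price
]


def is_missing(value):
    """Return True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


# Indices of rows that contain at least one missing value.
missing_rows = [
    i for i, row in enumerate(cars)
    if any(is_missing(v) for v in row.values())
]

# Indices of rows that repeat an earlier row exactly.
seen = set()
duplicate_rows = []
for i, row in enumerate(cars):
    key = tuple(sorted(row.items()))
    if key in seen:
        duplicate_rows.append(i)
    else:
        seen.add(key)

# Indices of rows whose price is present but fails a simple sanity check.
incorrect_rows = [
    i for i, row in enumerate(cars)
    if not is_missing(row["price"]) and row["price"] < 0
]

print(missing_rows, duplicate_rows, incorrect_rows)
```

How to deal with the flagged rows (drop them, fill them in, or correct them) depends on the analysis, so the sketch only reports where the problems are.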
Why is data visualisation useful?
- Visualisation can help to discover patterns in the data that statistics may miss
- Visualisation helps to raise questions that stimulate research and further analysis
- Visualisation helps to answer questions and effectively conveys the message of the analysis/result
- EXAMPLE: The graph on the percentage of new DC characters by gender can (partially) answer the question of whether female characters are introduced at a rate approaching gender parity: they are not.
What are the 2 main types of data?
- Quantitative
- Qualitative
What is Quantitative Data?
Quantitative data (or numerical data) refers to numerical information or data that can be expressed as numbers and can be measured.
Can perform meaningful computation like sum, average, difference, etc. E.g. Average Bitcoin price per week
There are two types of quantitative data:
1. Discrete
2. Continuous
What are the 2 types of Quantitative Data?
Two types of quantitative data:
- Discrete
- Can only take distinct values and cannot be subdivided infinitely
- E.g. count data (the number of students in a class, the number of likes, etc.).
- Continuous
- Can take on any value within a range
- E.g. Height, temperature, Bitcoin prices
What is Qualitative Data?
Qualitative data (or categorical data) refers to non-numerical information that describes qualities, characteristics, or attributes.
- Measures of “categories”
- E.g. Genders, BSc programmes, Python levels
- While qualitative data is non-numerical in nature, it can be “mapped” or “coded” as numbers
- E.g. LSE student ID
- Numerical calculation may not make sense
- E.g. Averaging the student IDs for students taking this course
- Two types of Qualitative Data:
1. Ordinal
2. Nominal
What are the 2 types of Qualitative Data?
- Ordinal: have meaningful order or rank
* EXAMPLE:
- Survey response options: strongly disagree, disagree, neither agree nor disagree, agree and strongly agree
- Python level: None, Basic, Intermediate, Advanced (order by proficiency in Python)
* Note that the exact differences between the values may not be well-defined
- Nominal: no natural order
* No meaningful way to compare the categories in terms of magnitude or order
- Each category is considered equal to the others
* e.g. Genders, BSc programmes
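The difference between ordinal and nominal data can be sketched in code. The example below assumes the Python levels listed above as the ordinal scale; the `responses` and `programmes` lists (including the programme names) are made up for illustration.

```python
from collections import Counter

# Ordinal: the categories have a meaningful order, so each one can be
# mapped ("coded") to a rank and then sorted or compared.
PYTHON_LEVELS = ["None", "Basic", "Intermediate", "Advanced"]
level_rank = {level: rank for rank, level in enumerate(PYTHON_LEVELS)}

responses = ["Basic", "Advanced", "None", "Intermediate", "Basic"]  # made-up
sorted_responses = sorted(responses, key=level_rank.get)
highest_level = max(responses, key=level_rank.get)

# Nominal: there is no natural order, so counting how often each
# category occurs is meaningful, but sorting by "size" is not.
programmes = ["BSc Data Science", "BSc Economics", "BSc Data Science"]  # made-up
programme_counts = Counter(programmes)

print(sorted_responses)
print(highest_level)
print(programme_counts)
```

The ranks 0 to 3 only record the order. The gap between "Basic" and "Intermediate" is not necessarily the same as the gap between "Intermediate" and "Advanced", so averaging the ranks should be treated with caution.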
What is sequential data?
Sequential data is data arranged in sequences where order matters.
- Each data point is associated with a specific time or position in a sequence
- EXAMPLE:
- Text data
- Gene sequence (ACGT)
- Daily temperature readings
- The closing price of Bitcoin in December 2023
What is Time Series Data?
A time series is a sequence of data points indexed in time order.
- A type of sequential data
- EXAMPLE:
- Daily temperature readings
- The closing price of Bitcoin in December 2023
- Number of covid cases
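As a small sketch of "indexed in time order", the example below stores daily temperature readings keyed by date, sorts them by date, and computes the day-to-day change. The readings are made-up values, and the variable names are only illustrative.

```python
from datetime import date

# Made-up daily temperature readings (degrees Celsius), deliberately
# stored out of order.
readings = {
    date(2024, 1, 3): 6.5,
    date(2024, 1, 1): 4.0,
    date(2024, 1, 2): 5.5,
}

# A time series is indexed in time order, so sort by date first.
series = sorted(readings.items())

# Order matters: the change from one day to the next is only meaningful
# once the points are arranged in time order.
changes = [
    (later_day, later_value - earlier_value)
    for (_, earlier_value), (later_day, later_value) in zip(series, series[1:])
]

for day, change in changes:
    print(day.isoformat(), change)
```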
Why is it important to understand the data types?
Understanding the data type helps to determine the appropriate analysis methods.
- Different types of data require different:
- Data cleaning and preprocessing
- Descriptive statistics
- Visualisation techniques
- Statistical models
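To illustrate how the data type drives the choice of descriptive statistic, the sketch below uses Python's standard `statistics` module: a mean for quantitative data, a median for ordinal data (computed on ranks), and a mode for nominal data. All the sample values and programme names are made up, and pairing each type with exactly one statistic is a simplification.

```python
import statistics

# Quantitative (continuous): arithmetic such as the mean is meaningful.
heights_cm = [160.0, 172.5, 181.0, 172.5]  # made-up values
mean_height = statistics.mean(heights_cm)

# Ordinal: order is meaningful, so a median works, but it is computed on
# the ranks and then mapped back to the category label.
LIKERT = [
    "strongly disagree",
    "disagree",
    "neither agree nor disagree",
    "agree",
    "strongly agree",
]
answers = ["agree", "disagree", "agree", "strongly agree",
           "neither agree nor disagree"]  # made-up values
ranks = [LIKERT.index(answer) for answer in answers]
median_answer = LIKERT[statistics.median_low(ranks)]

# Nominal: no order, so only the most frequent category (the mode) is
# a sensible "typical value".
programmes = ["BSc A", "BSc B", "BSc A"]  # made-up labels
most_common_programme = statistics.mode(programmes)

print(mean_height, median_answer, most_common_programme)
```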
What are the 3 categories of data based on how the data is processed, organised, and stored?
Data can be categorised into structured data, semi-structured data and unstructured data
based on how the data is processed, organised and stored.
What is structured data?
Structured data has a pre-defined data model and is organised in a pre-defined way
- Often stored in tabular formats e.g. auto dataset
*EXAMPLE:
start year   end year   name
2017         2023       Minouche Shafik
2023         2024       Eric Neumayer
2024         NA         Larry Kramer
What is unstructured data?
Unstructured data is information that either does not have a pre-defined data model or lacks a useful/consistent structure to help process the data
* Often in a raw, natural form and can include text, images, audio, and video
* Accounts for the majority of the data available in the world
- EXAMPLE: From Wikipedia pages:
- Larry D. Kramer (born June 23, 1958) is an American legal scholar serving as the president and vice chancellor of the London School of Economics since April 2024. Previously, Kramer served as president of the William and Flora Hewlett Foundation from 2012 through 2023. Prior to that role, he was the Dean of Stanford Law School (2004–2012). He is a scholar of both constitutional law and civil procedure.
- Nemat Talaat Shafik, Baroness Shafik (born 13 August 1962), commonly known as Minouche Shafik, is a British-American academic and economist. She served as the president and vice chancellor of the London School of Economics from 2017 to 2023, and then as the 20th president of Columbia University from July 2023 to August 2024.
What is semi-structured data?
Semi-structured data may not have a rigid pre-defined structure but has some level of organisation
- Information about the organisation of the data is often within the data in the form of tags or a hierarchical structure
- “Self-describing”
* EXAMPLE (data in JSON format):
{"presidents": [
…
{
"name": "Larry Kramer",
"start year": 2024,
"universities": ["Brown University", "University of Chicago Law School"]
},
{
"name": "Minouche Shafik",
"end year": 2023,
"start year": 2017
}
]}
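Because semi-structured data is self-describing, it can be read by key and reorganised into table form. The sketch below parses the two records shown in the JSON example with Python's standard `json` module (the elided "…" entries are left out) and builds rows with the same three columns as the structured example. The variable names are illustrative, and filling absent keys with `None` is an assumption of this sketch.

```python
import json

# The two records from the JSON example, with straight quotes so that
# the text is valid JSON. The elided entries are omitted.
raw = """
{"presidents": [
  {"name": "Larry Kramer",
   "start year": 2024,
   "universities": ["Brown University", "University of Chicago Law School"]},
  {"name": "Minouche Shafik",
   "end year": 2023,
   "start year": 2017}
]}
"""

data = json.loads(raw)

# Records need not share the same keys or key order, so look each
# column up by name and use None where a key is absent.
columns = ["start year", "end year", "name"]
rows = [[person.get(column) for column in columns]
        for person in data["presidents"]]

for row in rows:
    print(row)
```

Note that the first record has no "end year" and the second has no "universities", which is exactly the kind of flexibility a rigid table would not allow.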
How is data represented and stored?
BITS
In a computer, data is represented using the binary numeral system: text, numbers, images, audio, etc. are stored as a sequence of BITS.
- Bit is a basic unit of information in a computer
- It can only take two possible values, which can be considered as on / off, true / false, or 0 / 1
- n bits can represent 2^n distinct patterns
- For n = 2, there are 2^2 = 4 distinct combinations (00, 01, 10, 11) to represent 4 patterns
- For n = 3, there are 2^3 = 8 distinct combinations (000, 001, 010, 011, 100, 101, 110, 111) to represent 8 patterns
- For n=8, there are 2^8 = 256 distinct combinations to represent 256 patterns
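The counts above can be checked by enumerating every pattern. The helper `bit_patterns` below is a hypothetical name used only for this sketch; it relies on `itertools.product` from the Python standard library.

```python
from itertools import product


def bit_patterns(n):
    """Return every distinct pattern of n bits as a string of 0s and 1s."""
    return ["".join(bits) for bits in product("01", repeat=n)]


for n in (2, 3, 8):
    patterns = bit_patterns(n)
    # The number of patterns always equals 2 to the power n.
    print(n, len(patterns), 2 ** n)

print(bit_patterns(2))
```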
What is a Byte?
Byte is a common unit of digital information.
- Eight bits = one byte
- How many distinct patterns can 1 byte represent?
- 2^8 = 256
What are some of the multi-byte units?
- Kilobyte (kB) = 1000 bytes
- Megabyte (MB) = 1000^2 bytes
- Gigabyte (GB) = 1000^3 bytes
- Terabyte (TB) = 1000^4 bytes
- Petabyte (PB) = 1000^5 bytes
- Exabyte (EB) = 1000^6 bytes
- Zettabyte (ZB) = 1000^7 bytes
- Yottabyte (YB) = 1000^8 bytes
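The units above are successive powers of 1000, which makes conversion a short calculation. The sketch below is one possible way to write it; the function names `to_bytes` and `human_readable` are made up for this example.

```python
# Unit symbols in increasing order; the index is the power of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def human_readable(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(to_bytes(1, "GB"))
print(human_readable(1500))
print(human_readable(1000 ** 4))
```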
How are unsigned integers represented?
Unsigned integers (i.e. non-negative integers) are commonly stored in 4 bytes (i.e. 32 bits).
*With finite n bits, we can only represent a finite range of integers
- Unsigned integer: [0, 2^n − 1]
- Signed integer: [−2^(n−1), 2^(n−1) − 1]
- With 32 bits, an unsigned integer can represent non-negative integers from 0 to 4294967295 (2^32 − 1)
*Attempting to represent an integer that is outside of the range can cause unexpected consequences
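The ranges above, and one common form of "unexpected consequence", can be sketched as follows. Python's own integers do not have a fixed width, so `wrap_unsigned` only simulates what happens in a fixed-width unsigned type where out-of-range values wrap around modulo 2^n (as in C's unsigned arithmetic). All three function names are made up for this sketch, and other systems may instead raise an error or saturate.

```python
def unsigned_range(n):
    """Smallest and largest value of an n-bit unsigned integer."""
    return 0, 2 ** n - 1


def signed_range(n):
    """Smallest and largest value of an n-bit signed integer."""
    return -(2 ** (n - 1)), 2 ** (n - 1) - 1


def wrap_unsigned(value, n=32):
    """Simulate storing value in n bits by keeping only the low n bits."""
    return value % (2 ** n)


largest = unsigned_range(32)[1]
print(unsigned_range(32))
print(signed_range(32))
print(wrap_unsigned(largest + 1))  # one past the maximum wraps around
print(wrap_unsigned(-1))           # a negative value becomes a large positive one
```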