03_data and features Flashcards
What is data?
-Info output by a sensing device or organ
-includes both useful and irrelevant or redundant info
-must be processed to be meaningful
-info in digital form that can be transmitted or processed
-factual information (eg measurements or statistics)
-used as a basis for reasoning, discussion or calculation
What is the pipeline/process associated with data? (3 steps)
1) data acquisition
2) data storage (used to be a bottleneck)
3) data analysis
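A minimal sketch of the three steps, assuming hard-coded readings in place of a real sensor and a JSON file in place of a real data store. The function names `acquire`, `store` and `analyze` are hypothetical:

```python
import json
import statistics
import tempfile
from pathlib import Path


def acquire():
    """Step 1: data acquisition (hard-coded readings stand in for a sensor)."""
    return [21.5, 22.0, 21.8, 35.0, 22.1]


def store(readings, path):
    """Step 2: data storage (a JSON file stands in for a real data store)."""
    path.write_text(json.dumps(readings))


def analyze(path):
    """Step 3: data analysis (load the stored readings and summarize them)."""
    readings = json.loads(path.read_text())
    return {
        "n": len(readings),
        "mean": statistics.mean(readings),
        "max": max(readings),
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "readings.json"
    store(acquire(), path)
    summary = analyze(path)

print(summary)
```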
Is all existing data technically accessible for analysis?
No, most of it is privately owned
What are two types of data?
1) structured data
2) unstructured data
What is structured data?
preprocessed and formatted data that is easily queryable
eg quantitative data in a table
most data analysis techniques require data to be available in a structured form for easier processing
How is structured data represented?
Typically in a database schema
(eg a table in 2 dimensions)
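A minimal sketch of why a schema makes data easily queryable, using Python's built-in `sqlite3` module with an in-memory database. The table name, column names and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema fixes the columns and their types up front.
conn.execute(
    "CREATE TABLE measurements (city TEXT, temperature_c REAL, visitors INTEGER)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [("Berlin", 18.5, 120), ("Madrid", 27.0, 95), ("Oslo", 11.2, 40)],
)

# Because every row follows the same structure, a query is one declarative statement.
warm_cities = [
    row[0]
    for row in conn.execute(
        "SELECT city FROM measurements WHERE temperature_c > 15 ORDER BY city"
    )
]
conn.close()

print(warm_cities)
```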
What is unstructured data?
Unprocessed and unformatted data that is not easily queryable
eg qualitative data, textual data, image data, data stream, audio data, video data (with increasing data complexity)
What is quantitative data?
can be measured,
distances can be defined
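A small sketch of what "distances can be defined" means in practice. The values are invented, and Euclidean distance is only one possible choice of distance:

```python
import math

# One quantitative feature: the distance is the absolute difference.
height_a, height_b = 1.82, 1.65  # metres
difference = abs(height_a - height_b)

# Several quantitative features: Euclidean distance is one common choice.
person_a = (1.82, 80.0)  # (height in m, weight in kg)
person_b = (1.65, 62.0)
euclidean = math.dist(person_a, person_b)

print(difference, euclidean)
```

In this unscaled example the weight difference dominates the result, because it is numerically much larger than the height difference.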
What are two kinds of quantitative data?
1) continuous data
2) discrete data
What is continuous data?
real-valued numbers;
potentially within a given range
eg
- temperatures
- a person’s height
- prices
What is discrete data?
discrete numbers;
whole numbers, or real numbers restricted to a countable set of values;
potentially within a given range
eg
- number of people in a room
- inventory counts
What is qualitative (categorical) data?
cannot be measured,
distances are not defined
What are two types of qualitative data?
1) nominal data
2) ordinal data
What is nominal data?
Labels for different categories
without ordering
eg
- color of hair
- names of persons
- types of fruit
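Because nominal labels have no ordering, numbering them 1, 2, 3 would suggest a ranking that does not exist. One common workaround is one-hot encoding, sketched here with a hypothetical helper `one_hot`:

```python
def one_hot(values):
    """Map each nominal label to a 0/1 vector with one position per category."""
    # Sorting only makes the column order reproducible; it is not a ranking.
    categories = sorted(set(values))
    vectors = [[int(value == category) for category in categories] for value in values]
    return categories, vectors


categories, encoded = one_hot(["apple", "banana", "apple", "cherry"])

print(categories)
print(encoded)
```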
What is ordinal data?
Labels for different categories
following an inherent ranking scheme
eg
- rank in a competition
- grades
- day of the week
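A sketch of encoding ordinal labels so that their ranking is preserved. The grade scale `GRADE_ORDER` is an assumption made for illustration:

```python
# Assumed ranking, from worst to best.
GRADE_ORDER = ["F", "D", "C", "B", "A"]
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

grades = ["B", "A", "D", "C"]

# Integer codes keep the order, so comparisons and sorting are meaningful.
encoded = [rank[grade] for grade in grades]
best_first = sorted(grades, key=rank.__getitem__, reverse=True)

print(encoded, best_first)
```

Only the order of the codes carries meaning. The gap between two codes (eg between "B" and "A") is not a measured distance.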
What is feature engineering?
Turning unstructured data into structured data
Why do we need feature engineering?
Before ML methods can be applied to unstructured data, we have to process it and extract useful features from it
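A minimal sketch of one possible feature extraction for text: word counts (a "bag of words"), which turn each document into one row of a table. The helper names `tokenize` and `to_features` and the example sentences are hypothetical:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())


documents = ["The cat sat.", "The dog sat on the cat!"]

# The vocabulary defines the columns of the resulting table.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})


def to_features(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocabulary]


feature_table = [to_features(doc) for doc in documents]

print(vocabulary)
print(feature_table)
```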
What are features?
Features are quantitative, independent variables
from which our ML model learns