1. Data Flashcards
What is data mining?
The extraction of useful knowledge from noisy data
What are the 4 steps of successful data mining?
Acquisition, Marshaling, Analysis, Action
What does data mining specifically focus on?
Turning sets of data into useful knowledge
What is a relational database?
A database that organizes data into predefined, structured tables.
Ex: rows and columns
What is the purpose of data pre-processing?
Preprocessing cleans up the data and makes it easier to analyze.
Think: cleaning, normalization, transformation, feature extraction, and selection.
It handles issues like missing values, outliers, and conflicts (like incorrect information).
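A minimal sketch of these preprocessing steps, using made-up sensor readings. The valid range of 0 to 100 and the choice of median filling and min-max normalization are assumptions for illustration, not methods prescribed by the course.

```python
from statistics import median

# Hypothetical raw readings: None marks a missing value, 400.0 is an outlier.
raw = [12.0, None, 15.0, 14.0, 400.0, 13.0]

# Assumed domain rule: valid readings lie between 0 and 100.
LOW, HIGH = 0.0, 100.0

# Cleaning: treat out-of-range values as missing.
valid = [x if x is not None and LOW <= x <= HIGH else None for x in raw]

# Missing values: fill with the median of the remaining observations.
observed = [x for x in valid if x is not None]
fill = median(observed)
cleaned = [fill if x is None else x for x in valid]

# Normalization: min-max scaling maps the cleaned values onto [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(x - lo) / (hi - lo) for x in cleaned]
```

After these steps, `cleaned` has no missing or out-of-range values, and `normalized` puts every reading on a common 0 to 1 scale.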
What are association rules?
A set of rules that characterize associations between items.
Think: interesting relationships between variables in large databases.
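One common way to quantify such a rule is with support and confidence. These measures are not named in the cards above, so treat the sketch below as an assumption about how a rule might be scored. The transactions and the helper names `support` and `confidence` are hypothetical.

```python
# Hypothetical market-basket data: each set is one customer's purchase.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

# Rule under test: {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of 4 bread baskets
```

Here the rule "bread implies milk" holds in 3 of the 4 baskets that contain bread, which is the kind of relationship an association rule describes.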
What kinds of things can we do with customer purchasing data?
We can analyze customers' spending habits to determine when and where they are most likely to purchase particular products. Knowing that, we can market those products more effectively.
What kind of data are we interested in (for this course)?
Primarily relational data. Most of our data will come in preprocessed lists that we will have to mine for relationships/associations.
What is clustering?
The process of partitioning a set of data into meaningful groupings so that each cluster differs from the next in some respect.
Clustering helps the user understand the natural structure of the data and gives insight into its distribution. It can also be used as a preprocessing step for other algorithms.
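As an illustration, here is a minimal one-dimensional k-means sketch. K-means is only one common clustering technique, and the cards above do not say which algorithm the course uses. The function name `kmeans_1d`, the data, and the starting centroids are all hypothetical.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Partition 1-D points into clusters around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
```

With this data the points separate into a low group and a high group, which is the "natural structure" the clustering is meant to reveal.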
What does it mean for data to be discrete?
Finite or countably infinite values.
Examples: zip codes, age in whole years, eye color, counts (whole numbers)
What does it mean for data to be continuous?
Data that are not restricted to defined, separate values. The values can fall anywhere in a continuous range (infinitely specific).
Ex: temperature, real numbers, weight
What are the four types of data attributes?
Nominal data, ordinal data, interval data, and ratio data
What are nominal data?
Data that consist of labels with no quantitative value and no inherent order.
Ex: male/female; hair color; north/south/east/west
What are ordinal data?
Data where the order is important (but the difference between the values is not necessarily known).
Ex: 1st place beat 2nd place, but we don’t know by how much.
Ex: “how do you feel today?”
- very unhappy, unhappy, ok, happy, very happy
What are interval data?
Data where we know the order AND the difference between the values.
Note: interval data does NOT have a “true zero” (required to calculate ratios).
In other words, 10 deg + 10 deg is 20 deg, but 20 deg is not twice as hot as 10 deg, because there is no true zero on the Celsius scale.
Interval values can be added and subtracted, but multiplying or dividing one value by another is not meaningful.
Interval = space in between
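A small worked sketch of the Celsius example above. It converts to Kelvin, which does have a true zero, to show what the real ratio between the two temperatures is. The variable names are chosen here for illustration.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
diff = b_c - a_c

# Dividing Celsius values gives 2.0, but that number has no physical meaning.
naive_ratio = b_c / a_c

# On the Kelvin scale the ratio is meaningful, and it is only about 1.035.
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)
```

So 20 deg C is roughly 3.5% hotter than 10 deg C in absolute terms, not twice as hot.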
What are ratio data?
Data that are ordered (like ordinal data), have known differences between values (like interval data), AND have an absolute zero point, which allows for multiplication and division operations (and all the statistical power that comes with them).
Ex: height; weight
What operations can we perform on nominal data?
Counting and mode
What operations can we perform on ordinal data?
Order, counting, mode, median
What operations can we perform on interval data?
Order, counting, mode, median, mean, quantify difference between each value, add/subtract values
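The sketch below applies the strongest summary statistic allowed at each level, using Python's standard `statistics` module. The sample values are made up, and encoding ordinal responses by their position in the scale is an assumption made for illustration.

```python
from statistics import mean, median_low, mode

# Nominal: only counting and mode are meaningful.
hair = ["brown", "black", "brown", "blond"]
hair_mode = mode(hair)

# Ordinal: order matters, so the median is meaningful too.
# Each response is ranked by its position in the scale.
scale = ["very unhappy", "unhappy", "ok", "happy", "very happy"]
responses = ["ok", "happy", "unhappy", "happy", "very happy"]
ranks = [scale.index(r) for r in responses]
# median_low always returns an actual data point, so it maps back to a label.
median_response = scale[median_low(ranks)]

# Interval: differences are known, so mean and subtraction are meaningful.
temps_c = [10.0, 20.0, 30.0]
mean_temp = mean(temps_c)
temp_range = max(temps_c) - min(temps_c)
```

Taking the mean of the ordinal ranks would run without error, but the result would assume equal spacing between "unhappy" and "ok", which ordinal data does not guarantee.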