Introduction and foundations Flashcards by Marcus Hellberg

What is data mining?

a) Data cleaning process

b) Discovering patterns in large datasets

c) Data storage technique

d) A way to delete unnecessary data

b) Discovering patterns in large datasets

Which analogy is commonly used to describe data mining?

a) Searching for errors in data

b) Extracting gold from ore

c) Planting information seeds

d) Building data warehouses

b) Extracting gold from ore

Why is data mining often called “Knowledge Discovery from Data (KDD)”?

a) It involves cleaning the data

b) It transforms raw data into useful knowledge

c) It focuses on visualization

d) It is synonymous with machine learning

b) It transforms raw data into useful knowledge

Which step in the KDD process involves combining data from multiple sources?

a) Data cleaning

b) Data integration

c) Data selection

d) Pattern evaluation

b) Data integration
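
A minimal sketch of what data integration can look like, assuming two hypothetical sources (a customer list and an order export) that share a `customer_id` key. The record layouts and the `integrate` helper are illustrative only.

```python
# Hypothetical records from two separate sources sharing a customer_id key.
crm = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Linus"},
]
orders = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 1, "total": 35.5},
    {"customer_id": 3, "total": 15.0},
]


def integrate(customers, order_rows):
    """Combine both sources into one record per known customer."""
    merged = {c["customer_id"]: {**c, "order_totals": []} for c in customers}
    for row in order_rows:
        # Orders without a matching customer are skipped in this sketch.
        if row["customer_id"] in merged:
            merged[row["customer_id"]]["order_totals"].append(row["total"])
    return list(merged.values())


combined = integrate(crm, orders)
```

Real integration work also has to reconcile conflicting keys, units, and naming across sources, which this sketch leaves out.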

What is the primary goal of the data mining process?

a) To store data efficiently

b) To visualize data trends

c) To uncover interesting patterns and models

d) To clean and organize datasets

c) To uncover interesting patterns and models

What are “outliers” in data mining?

a) Common data points in a dataset

b) Data points that deviate significantly from others

c) Summary of the entire dataset

d) Missing data points

b) Data points that deviate significantly from others

Which type of pattern does data mining NOT aim to find?

a) Associations

b) Correlations

c) Predictions

d) Irrelevant trends

d) Irrelevant trends

What does the term “Big Data” refer to?

a) Small datasets processed in real-time

b) Vast amounts of data characterized by volume, velocity, and variety

c) Data limited to structured formats

d) Data that only includes images and videos

b) Vast amounts of data characterized by volume, velocity, and variety

Big Data is characterized by which three V’s?

a) Value, Validation, Velocity

b) Variety, Volume, Velocity

c) Volume, Verification, Variability

d) Visualization, Variety, Value

b) Variety, Volume, Velocity

What is a key challenge in mining Big Data?

a) Limited storage space

b) Poor visualization tools

c) Efficient handling of high velocity and volume

d) Incompatibility of algorithms with structured data

c) Efficient handling of high velocity and volume

Why is Big Data important for data mining?

a) It allows access to unlimited data storage

b) It provides vast, diverse datasets for uncovering patterns

c) It simplifies machine learning algorithms

d) It only focuses on small subsets of data

b) It provides vast, diverse datasets for uncovering patterns

What is Knowledge Discovery from Data (KDD)?

a) Cleaning and summarizing datasets

b) A process that involves extracting useful information from raw data

c) A tool used to query databases

d) A step focused solely on visualization

b) A process that involves extracting useful information from raw data

What distinguishes KDD from simple database querying?

a) KDD generates knowledge, not just results

b) KDD is only for structured data

c) KDD relies on external tools

d) KDD ignores data cleaning steps

a) KDD generates knowledge, not just results

Why might outliers be important rather than ignored?

a) They make data cleaning easier

b) They reveal valuable anomalies like fraud

c) They confirm dataset accuracy

d) They are always indicative of errors

b) They reveal valuable anomalies like fraud

What is the difference between structured and unstructured data?

a) Unstructured data cannot be analyzed

b) Structured data has clear formats and attributes

c) Unstructured data is error-prone

d) Structured data is always accurate

b) Structured data has clear formats and attributes

What is an example of predictive data mining?

a) Clustering similar customers

b) Analyzing frequent purchases

c) Predicting future sales based on patterns

d) Summarizing datasets

c) Predicting future sales based on patterns

Which of the following is NOT a step in the KDD process?

a) Data transformation

b) Pattern evaluation

c) Knowledge presentation

d) Web scraping

d) Web scraping

Which method can be used for outlier detection?

a) Statistical tests

b) Deep learning only

c) Manual analysis

d) Regression

a) Statistical tests
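
One simple statistical approach is the z-score rule: flag values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes roughly bell-shaped data; the `zscore_outliers` helper, the sample readings, and the cutoff of 2.5 are illustrative choices.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]  # hypothetical data
flagged = zscore_outliers(readings)  # flags only the value 95
```

With the common cutoff of 3, nothing is flagged in this small sample, because the extreme value inflates the standard deviation it is measured against. Rank-based rules such as the interquartile range are less affected by this.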

How does data cleaning contribute to data mining?

a) By adding more patterns

b) By removing noisy and inconsistent data

c) By ensuring all models fit all datasets

d) By storing data efficiently

b) By removing noisy and inconsistent data

What are the four primary types of data (levels of measurement)?

a) Binary, Continuous, Nominal, Ratio

b) Nominal, Ordinal, Interval, Ratio

c) Numeric, Text, Boolean, Ratio

d) Structured, Unstructured, Semi-structured, Nominal

b) Nominal, Ordinal, Interval, Ratio

What is nominal data?

a) Data that represents order but not magnitude

b) Data with categories that have no inherent order

c) Data that measures absolute zero

d) Data with equal intervals but no true zero

b) Data with categories that have no inherent order

Which type of data reflects order but not distance between values?

a) Nominal

b) Ordinal

c) Interval

d) Ratio

b) Ordinal

What distinguishes ratio data from interval data?

a) Ratio data cannot have a zero value

b) Ratio data includes a true zero point

c) Ratio data is categorical

d) Ratio data lacks any numerical meaning

b) Ratio data includes a true zero point

What is data quality?

a) The process of data storage

b) The degree to which data meets user needs for accuracy and reliability

c) A measure of data size and velocity

d) The ability to visualize datasets

b) The degree to which data meets user needs for accuracy and reliability

Which is NOT a factor that determines data quality?

a) Completeness

b) Consistency

c) Accessibility

d) Volume

d) Volume

How does data cleaning affect data quality?

a) By ensuring only structured data is used

b) By removing noise and correcting inconsistencies

c) By focusing on minimizing storage

d) By visualizing outliers

b) By removing noise and correcting inconsistencies
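
A minimal sketch of a cleaning pass, assuming a hypothetical column of city names with stray whitespace, mixed capitalization, a missing entry, and duplicates. The `clean` helper and the data are illustrative only.

```python
# Hypothetical raw values: stray whitespace, mixed case, a missing entry.
raw = [" Stockholm", "stockholm ", "Gothenburg", None, "GOTHENBURG", "Malmo"]


def clean(values):
    """Drop missing entries, normalize formatting, and remove duplicates."""
    seen = set()
    cleaned = []
    for value in values:
        if value is None:
            continue  # remove missing entries
        normalized = value.strip().title()  # correct inconsistent formatting
        if normalized not in seen:  # keep the first occurrence only
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned


cities = clean(raw)  # three distinct, consistently formatted names remain
```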

What do basic descriptive statistics help us understand about data?

a) Hidden patterns in datasets

b) The summary characteristics like central tendency and spread

c) Advanced predictive models

d) Data types and categories

b) The summary characteristics like central tendency and spread

Which of the following is a measure of central tendency?

a) Mean

b) Standard deviation

c) Range

d) Variance

a) Mean

What does the median represent in a dataset?

a) The most frequently occurring value

b) The average value of the dataset

c) The middle value when data is ordered

d) The spread of data around the mean

c) The middle value when data is ordered
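
To see how the three measures of central tendency differ, here is a small sketch using Python's standard `statistics` module. The salary figures are made up for illustration.

```python
import statistics

salaries = [30, 32, 32, 35, 38, 400]  # hypothetical values, in thousands

mean_salary = statistics.mean(salaries)      # 94.5, pulled up by the 400
median_salary = statistics.median(salaries)  # 33.5, average of the two middle values
mode_salary = statistics.mode(salaries)      # 32, the most frequent value
```

The single extreme value moves the mean far from the typical salary, while the median barely changes. This is why the median is often preferred for skewed data.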

What is a measure of dispersion?

a) A measure of the central value of data

b) A measure of how data values are spread around the central tendency

c) A technique for data visualization

d) A method for cleaning data

b) A measure of how data values are spread around the central tendency

Which is NOT a measure of dispersion?

a) Variance

b) Standard deviation

c) Mean

d) Range

c) Mean

What is the range of a dataset?

a) The average of all data points

b) The difference between the maximum and minimum values

c) The most frequent value

d) The spread around the median

b) The difference between the maximum and minimum values
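
The dispersion measures named in the cards above can be computed as follows. This is a small sketch on made-up values, using the population formulas (dividing by n); the sample versions, `statistics.variance` and `statistics.stdev`, divide by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

value_range = max(values) - min(values)  # 9 - 2 = 7
variance = statistics.pvariance(values)  # mean squared deviation = 4
std_dev = statistics.pstdev(values)      # square root of the variance = 2
```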

What are the five components of a five-number summary?

a) Mean, Variance, Standard Deviation, Median, Quartiles

b) Minimum, First Quartile, Median, Third Quartile, Maximum

c) Mode, Median, Mean, Range, Variance

d) Minimum, Range, Variance, Mean, Maximum

b) Minimum, First Quartile, Median, Third Quartile, Maximum
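
A minimal sketch of computing a five-number summary with the standard library. Quartile conventions differ between textbooks and tools; the "inclusive" method used here is one assumption among several, and `five_number_summary` is an illustrative helper.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    # The "inclusive" method is one of several quartile conventions.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary([7, 2, 9, 4, 40, 5, 11, 6, 8])
# Sorted data: 2, 4, 5, 6, 7, 8, 9, 11, 40, giving (2, 5, 7, 9, 40)
```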

What is the purpose of a box plot in data analysis?

a) To display the frequency of values

b) To visualize the spread and potential outliers in data

c) To summarize text data

d) To show relationships between variables

b) To visualize the spread and potential outliers in data
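
Box plots commonly mark a point as a potential outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below assumes that convention and the "inclusive" quartile method; `box_plot_outliers` is an illustrative helper, not a plotting function.

```python
import statistics


def box_plot_outliers(values, k=1.5):
    """Return values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Q1 = 5 and Q3 = 9, so the fences are -1 and 15; only 40 falls outside.
outliers = box_plot_outliers([2, 4, 5, 6, 7, 8, 9, 11, 40])
```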

What is a proximity measure?

a) A way to cluster data

b) A metric to evaluate similarity or distance between data points

c) A statistical summary of data

d) A method for cleaning noisy data

b) A metric to evaluate similarity or distance between data points

How is Euclidean distance calculated between two points in 2D space?

a) By taking the sum of the absolute differences in each dimension

b) By finding the square root of the sum of squared differences

c) By dividing the differences in values

d) By averaging all the dimensions

b) By finding the square root of the sum of squared differences
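
As a worked example, the definition translates directly into code. The `euclidean` helper and the two points are illustrative.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d = euclidean((1, 2), (4, 6))  # sqrt(3**2 + 4**2) = 5.0
```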

What is the Manhattan distance between two points?

a) The shortest straight-line distance

b) The sum of the absolute differences in each dimension

c) The square of the differences in each dimension

d) A measure that combines both Euclidean and cosine similarity

b) The sum of the absolute differences in each dimension
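
A short sketch of the same idea in code, using an illustrative `manhattan` helper and made-up points.

```python
def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


d = manhattan((1, 2), (4, 6))  # |1 - 4| + |2 - 6| = 7
```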

Which distance metric is a generalization of Euclidean and Manhattan distances?

a) Cosine Similarity

b) Minkowski Distance

c) Chebyshev Distance

d) Hamming Distance

b) Minkowski Distance
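
The generalization can be checked numerically. The Minkowski distance of order r takes the r-th root of the sum of absolute differences raised to the power r, so r = 1 gives the Manhattan distance and r = 2 gives the Euclidean distance. The `minkowski` helper and the points below are illustrative.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r (r >= 1)."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


a, b = (1, 2), (4, 6)
manhattan_like = minkowski(a, b, 1)  # order 1 matches Manhattan: 7
euclidean_like = minkowski(a, b, 2)  # order 2 matches Euclidean: 5
```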

What does cosine similarity measure?

a) The cosine of the angle between two vectors in a multidimensional space

b) The absolute difference between two vectors

c) The ratio of Euclidean distance to Manhattan distance

d) The squared differences between two data points

a) The cosine of the angle between two vectors in a multidimensional space
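
A minimal sketch of the calculation: the dot product of the two vectors divided by the product of their lengths. The `cosine_similarity` helper and the vectors are illustrative, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (undefined for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity((1, 2, 3), (2, 4, 6))  # close to 1
orthogonal = cosine_similarity((1, 0), (0, 1))            # 0
```

Because only direction matters, scaling a vector does not change the result. This makes the measure popular for comparing documents of different lengths.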

(39 cards)