Introduction and foundations Flashcards

1
Q

What is data mining?

a) Data cleaning process

b) Discovering patterns in large datasets

c) Data storage technique

d) A way to delete unnecessary data

A

b) Discovering patterns in large datasets

2
Q

Which analogy is commonly used to describe data mining?

a) Searching for errors in data

b) Extracting gold from ore

c) Planting information seeds

d) Building data warehouses

A

b) Extracting gold from ore

3
Q

Why is data mining often called “Knowledge Discovery from Data (KDD)”?

a) It involves cleaning the data

b) It transforms raw data into useful knowledge

c) It focuses on visualization

d) It is synonymous with machine learning

A

b) It transforms raw data into useful knowledge

4
Q

Which step in the KDD process involves combining data from multiple sources?

a) Data cleaning

b) Data integration

c) Data selection

d) Pattern evaluation

A

b) Data integration
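
As a rough illustration of data integration, the sketch below merges records from two sources on a shared key. The record layouts, the `customer_id` key, and the `integrate` helper are all hypothetical, chosen only for this example.

```python
# Hypothetical records from two separate sources, keyed by customer_id.
crm_records = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Ben"},
]
billing_records = [
    {"customer_id": 1, "total_spent": 120.0},
    {"customer_id": 3, "total_spent": 45.5},
]


def integrate(left, right, key):
    """Merge two lists of dicts on a shared key (a simple outer join)."""
    merged = {}
    for record in left + right:
        # Records with the same key value are folded into one combined dict.
        merged.setdefault(record[key], {}).update(record)
    return [merged[k] for k in sorted(merged)]


integrated = integrate(crm_records, billing_records, "customer_id")
for row in integrated:
    print(row)
```

Customers that appear in only one source are kept with the fields that source provides; a real pipeline would also have to reconcile conflicting values and differing schemas.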

5
Q

What is the primary goal of the data mining process?

a) To store data efficiently

b) To visualize data trends

c) To uncover interesting patterns and models

d) To clean and organize datasets

A

c) To uncover interesting patterns and models

6
Q

What are “outliers” in data mining?

a) Common data points in a dataset

b) Data points that deviate significantly from others

c) Summary of the entire dataset

d) Missing data points

A

b) Data points that deviate significantly from others

7
Q

Which type of pattern does data mining NOT aim to find?

a) Associations

b) Correlations

c) Predictions

d) Irrelevant trends

A

d) Irrelevant trends

8
Q

What does the term “Big Data” refer to?

a) Small datasets processed in real-time

b) Vast amounts of data characterized by volume, velocity, and variety

c) Data limited to structured formats

d) Data that only includes images and videos

A

b) Vast amounts of data characterized by volume, velocity, and variety

9
Q

Big Data is characterized by which three V’s?

a) Value, Validation, Velocity

b) Variety, Volume, Velocity

c) Volume, Verification, Variability

d) Visualization, Variety, Value

A

b) Variety, Volume, Velocity

10
Q

What is a key challenge in mining Big Data?

a) Limited storage space

b) Poor visualization tools

c) Efficient handling of high velocity and volume

d) Incompatibility of algorithms with structured data

A

c) Efficient handling of high velocity and volume

11
Q

Why is Big Data important for data mining?

a) It allows access to unlimited data storage

b) It provides vast, diverse datasets for uncovering patterns

c) It simplifies machine learning algorithms

d) It only focuses on small subsets of data

A

b) It provides vast, diverse datasets for uncovering patterns

12
Q

What is Knowledge Discovery from Data (KDD)?

a) Cleaning and summarizing datasets

b) A process that involves extracting useful information from raw data

c) A tool used to query databases

d) A step focused solely on visualization

A

b) A process that involves extracting useful information from raw data

13
Q

What distinguishes KDD from simple database querying?

a) KDD generates knowledge, not just results

b) KDD is only for structured data

c) KDD relies on external tools

d) KDD ignores data cleaning steps

A

a) KDD generates knowledge, not just results

14
Q

Why might outliers be important rather than ignored?

a) They make data cleaning easier

b) They reveal valuable anomalies like fraud

c) They confirm dataset accuracy

d) They are always indicative of errors

A

b) They reveal valuable anomalies like fraud

15
Q

What is the difference between structured and unstructured data?

a) Unstructured data cannot be analyzed

b) Structured data has clear formats and attributes

c) Unstructured data is error-prone

d) Structured data is always accurate

A

b) Structured data has clear formats and attributes

16
Q

What is an example of predictive data mining?

a) Clustering similar customers

b) Analyzing frequent purchases

c) Predicting future sales based on patterns

d) Summarizing datasets

A

c) Predicting future sales based on patterns

17
Q

Which of the following is NOT a step in the KDD process?

a) Data transformation

b) Pattern evaluation

c) Knowledge presentation

d) Web scraping

A

d) Web scraping

18
Q

Which method can be used for outlier detection?

a) Statistical tests

b) Deep learning only

c) Manual analysis

d) Regression

A

a) Statistical tests
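
One simple statistical approach is a z-score check: flag values that lie many standard deviations from the mean. The sketch below assumes roughly bell-shaped data; the `zscore_outliers` name, the threshold of 2.5, and the sample amounts are illustrative choices, not part of the card.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can deviate
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 22, 21, 19, 20, 500]
print(zscore_outliers(amounts))  # [500]
```

In small samples a single extreme value inflates the standard deviation, which caps how large its own z-score can get, so the threshold usually needs tuning to the data.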

19
Q

How does data cleaning contribute to data mining?

a) By adding more patterns

b) By removing noise and inconsistent data

c) By ensuring all models fit all datasets

d) By storing data efficiently

A

b) By removing noise and inconsistent data

20
Q

What are the four primary types of data (levels of measurement)?

a) Binary, Continuous, Nominal, Ratio

b) Nominal, Ordinal, Interval, Ratio

c) Numeric, Text, Boolean, Ratio

d) Structured, Unstructured, Semi-structured, Nominal

A

b) Nominal, Ordinal, Interval, Ratio

21
Q

What is nominal data?

a) Data that represents order but not magnitude

b) Data with categories that have no inherent order

c) Data that measures absolute zero

d) Data with equal intervals but no true zero

A

b) Data with categories that have no inherent order

22
Q

Which type of data reflects order but not distance between values?

a) Nominal

b) Ordinal

c) Interval

d) Ratio

A

b) Ordinal

23
Q

What distinguishes ratio data from interval data?

a) Ratio data cannot have a zero value

b) Ratio data includes a true zero point

c) Ratio data is categorical

d) Ratio data lacks any numerical meaning

A

b) Ratio data includes a true zero point
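
A small worked example may help: temperature in Celsius is interval data (its zero is arbitrary), while temperature in Kelvin is ratio data (zero means no thermal energy). The values below are illustrative.

```python
# Two hypothetical temperature readings in degrees Celsius (interval scale).
celsius_a, celsius_b = 10.0, 20.0

# The same readings on the Kelvin scale, which has a true zero (ratio scale).
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15

# Differences are meaningful on both scales and agree with each other.
difference_c = celsius_b - celsius_a
difference_k = kelvin_b - kelvin_a

# Ratios are only meaningful on the ratio scale.
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 C is not "twice as hot"
true_ratio = kelvin_b / kelvin_a     # about 1.035

print(difference_c, round(difference_k, 2))
print(naive_ratio, round(true_ratio, 3))
```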

24
Q

What is data quality?

a) The process of data storage

b) The degree to which data meets user needs for accuracy and reliability

c) A measure of data size and velocity

d) The ability to visualize datasets

A

b) The degree to which data meets user needs for accuracy and reliability

25
Q

Which is NOT a factor that determines data quality?

a) Completeness

b) Consistency

c) Accessibility

d) Volume

A

d) Volume

26
Q

How does data cleaning affect data quality?

a) By ensuring only structured data is used

b) By removing noise and correcting inconsistencies

c) By focusing on minimizing storage

d) By visualizing outliers

A

b) By removing noise and correcting inconsistencies
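
The sketch below shows one possible cleaning pass: it drops rows with missing or implausible readings and normalizes inconsistent labels. The rows, the `clean` helper, and the plausibility bounds are assumptions made for this example; dropping rows is only one option, and imputing values is another.

```python
# Hypothetical raw rows with a missing value, a noisy reading,
# and inconsistent city spellings.
raw_rows = [
    {"city": " Berlin ", "temp_c": 21.5},
    {"city": "berlin", "temp_c": None},   # missing value
    {"city": "PARIS", "temp_c": 19.0},
    {"city": "Paris", "temp_c": 999.0},   # implausible reading (noise)
]


def clean(rows, low=-90.0, high=60.0):
    """Drop missing or out-of-range temperatures and normalize city names."""
    cleaned = []
    for row in rows:
        temp = row["temp_c"]
        if temp is None or not (low <= temp <= high):
            continue  # skip missing or implausible values
        cleaned.append({"city": row["city"].strip().title(), "temp_c": temp})
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```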

27
Q

What do basic descriptive statistics help us understand about data?

a) Hidden patterns in datasets

b) The summary characteristics like central tendency and spread

c) Advanced predictive models

d) Data types and categories

A

b) The summary characteristics like central tendency and spread

28
Q

Which of the following is a measure of central tendency?

a) Mean

b) Standard deviation

c) Range

d) Variance

A

a) Mean

29
Q

What does the median represent in a dataset?

a) The most frequently occurring value

b) The average value of the dataset

c) The middle value when data is ordered

d) The spread of data around the mean

A

c) The middle value when data is ordered
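
To make the measures of central tendency concrete, the sketch below computes mean, median, and mode with Python's standard `statistics` module. The salary figures are made up; note how one extreme value pulls the mean but not the median.

```python
import statistics

# Hypothetical salaries (in thousands) with one very large value.
salaries = [30, 32, 35, 38, 40, 41, 200]

mean_salary = statistics.mean(salaries)      # sum / count = 416 / 7
median_salary = statistics.median(salaries)  # middle value of the sorted list
most_common = statistics.mode([1, 2, 2, 3])  # most frequently occurring value

print(round(mean_salary, 2))  # 59.43
print(median_salary)          # 38
print(most_common)            # 2
```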

30
Q

What is a measure of dispersion?

a) A measure of the central value of data

b) A measure of how data values are spread around the central tendency

c) A technique for data visualization

d) A method for cleaning data

A

b) A measure of how data values are spread around the central tendency

31
Q

Which is NOT a measure of dispersion?

a) Variance

b) Standard deviation

c) Mean

d) Range

A

c) Mean

32
Q

What is the range of a dataset?

a) The average of all data points

b) The difference between the maximum and minimum values

c) The most frequent value

d) The spread around the median

A

b) The difference between the maximum and minimum values
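
The common dispersion measures can be computed in a few lines. This sketch uses the population variance and standard deviation from the `statistics` module; the sample values are illustrative.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)     # maximum minus minimum
variance = statistics.pvariance(data)  # mean squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(data_range)  # 7
print(variance)    # 4
print(std_dev)     # 2.0
```

For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n.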

33
Q

What are the five components of a five-number summary?

a) Mean, Variance, Standard Deviation, Median, Quartiles

b) Minimum, First Quartile, Median, Third Quartile, Maximum

c) Mode, Median, Mean, Range, Variance

d) Minimum, Range, Variance, Mean, Maximum

A

b) Minimum, First Quartile, Median, Third Quartile, Maximum
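
A minimal sketch of a five-number summary follows. Quartile conventions differ between textbooks and tools; this version assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the `five_number_summary` helper is a name chosen for this example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return (ordered[0], q1, median, q3, ordered[-1])


summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```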

34
Q

What is the purpose of a box plot in data analysis?

a) To display the frequency of values

b) To visualize the spread and potential outliers in data

c) To summarize text data

d) To show relationships between variables

A

b) To visualize the spread and potential outliers in data
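
Box plots commonly mark points beyond 1.5 times the interquartile range (IQR) from the quartiles as potential outliers. The sketch below computes those fences; the 1.5 multiplier is the usual convention rather than something the card states, and the `iqr_outliers` helper and data are illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


readings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(iqr_outliers(readings))  # [40]
```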

35
Q

What is a proximity measure?

a) A way to cluster data

b) A metric to evaluate similarity or distance between data points

c) A statistical summary of data

d) A method for cleaning noisy data

A

b) A metric to evaluate similarity or distance between data points

36
Q

How is Euclidean distance calculated between two points in 2D space?

a) By taking the sum of the absolute differences in each dimension

b) By finding the square root of the sum of squared differences

c) By dividing the differences in values

d) By averaging all the dimensions

A

b) By finding the square root of the sum of squared differences
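
The formula can be written out directly. This sketch works for any number of dimensions, not only 2D; the `euclidean` helper is a name chosen for this example.

```python
import math


def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean((0, 0), (3, 4)))  # 5.0
```

From Python 3.8 onward, `math.dist(p, q)` computes the same quantity.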

37
Q

What is the Manhattan distance between two points?

a) The shortest straight-line distance

b) The sum of the absolute differences in each dimension

c) The square of the differences in each dimension

d) A measure that combines both Euclidean and cosine similarity

A

b) The sum of the absolute differences in each dimension
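
As a quick sketch, Manhattan distance just adds up the per-dimension absolute differences, like counting city blocks. The `manhattan` helper is a name chosen for this example.

```python
def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


print(manhattan((1, 2), (4, 6)))  # 7
```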

38
Q

Which distance metric is a generalization of Euclidean and Manhattan distances?

a) Cosine Similarity

b) Minkowski Distance

c) Chebyshev Distance

d) Hamming Distance

A

b) Minkowski Distance
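
The generalization can be seen by varying the order parameter: order 1 gives Manhattan distance and order 2 gives Euclidean distance. The `minkowski` helper below is an illustrative sketch.

```python
def minkowski(p, q, order):
    """Minkowski distance: (sum of |a - b|**order) ** (1 / order)."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))  # 7.0 (Manhattan)
print(minkowski(a, b, 2))  # 5.0 (Euclidean)
```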

39
Q

What does cosine similarity measure?

a) The cosine of the angle between two vectors in a multidimensional space

b) The absolute difference between two vectors

c) The ratio of Euclidean distance to Manhattan distance

d) The squared differences between two data points

A

a) The cosine of the angle between two vectors in a multidimensional space
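
Cosine similarity depends only on the direction of the vectors, not their length. The sketch below divides the dot product by the product of the vector norms; the `cosine_similarity` helper is a name chosen for this example, and it assumes neither vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # undefined if either vector is all zeros


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # close to 1
perpendicular = cosine_similarity([1, 0], [0, 1])         # 0.0
print(round(same_direction, 6), perpendicular)
```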