Data Science Flashcards

1
Q

What is the most common approach to transform categorical variables into numeric ones?
Select one:
a. Frequency count
b. One-hot encoding
d. “Bayesian” encoders
e. Contrast encoders

A

b. One-hot encoding
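
As a minimal sketch of what one-hot encoding does, the snippet below turns a categorical column into one 0/1 indicator column per category. The helper `one_hot_encode` and the sample `colors` column are illustrative assumptions, not part of the original card; in practice a library routine would normally be used.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row holds one 0/1 flag per category."""
    categories = sorted(set(values))  # fixed column order
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
categories, encoded = one_hot_encode(colors)

for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains exactly one 1, so no artificial ordering is imposed on the categories.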

2
Q

In what situation should observations with anomalies in values (outliers) be removed?
Select one:
a. When outliers represent more than 90% of the existing number of observations
b. When outliers deform models more than they can help, as is the case with models built with
numerical algorithms, such as K-Means or K-Medoids
c. When outliers represent more than 75% of the existing number of observations
d. When the algorithms used do not deal well with outliers, as is the case with algorithms based on
decision trees

A

b. When outliers deform models more than they can help, as is the case with models built with
numerical algorithms, such as K-Means or K-Medoids
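
A rough sketch of why a single outlier can distort a centroid-based model such as K-Means: a cluster center is the mean of its points, so one extreme observation drags it away from the rest. The `centroid` helper and the sample points are assumptions made for illustration only.

```python
def centroid(points):
    """Mean of each coordinate, as K-Means uses for a cluster center."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(1, 1), (2, 1), (1, 2), (2, 2)]  # hypothetical tight cluster
outlier = (19, 19)                          # hypothetical anomalous point

clean_center = centroid(cluster)
skewed_center = centroid(cluster + [outlier])

print(clean_center)   # center of the four regular points
print(skewed_center)  # center after adding the outlier
```

With the outlier included, the center lands in a region where no regular point lies.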

3
Q

What is the primary use of Data Visualization in Data Science?
a. To build machine learning models
b. To clean the data
c. To help understand and interpret the data better by representing it in a graphical format
d. To produce attractive graphics for publications

A

c. To help understand and interpret the data better by representing it in a graphical format

4
Q

When evaluating a descriptive analytics model using CRISP-DM, which of the following aspects is
often considered?
a. The quality of insights extracted and their relevance to the business problem
b. The predictive power of the model
d. The speed of the model in production
e. The accuracy of the clustering algorithms

A

a. The quality of insights extracted and their relevance to the business problem

5
Q

Which of these tasks is NOT part of the Business Understanding phase of CRISP-DM?
Select one:
a. Produce the project plan
c. Defining business objectives
d. Data exploration
e. Determine Data Mining objectives
f. Assess the situation

A

d. Data exploration

6
Q

Which of the following does NOT represent a valid reason for a dataset to be normalized?
Select one:
a. When the target is between 0 and 1
b. When one wants to interpret the coefficients of linear models
d. Improve an algorithm’s training speed by helping to accelerate convergence, even in algorithms that
do not require data to be normalized
e. When using algorithms that measure the distance (e.g., Euclidean distance) between data points

A

a. When the target is between 0 and 1
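
To illustrate the distance-based reason listed among the options, here is a small sketch of min-max normalization. The `min_max_scale` helper and the `income` and `age` columns are hypothetical; the point is only that, before scaling, the feature with the larger range dominates the Euclidean distance.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


income = [20000, 40000, 60000, 100000]  # hypothetical feature, large range
age = [20, 30, 40, 60]                  # hypothetical feature, small range

scaled_income = min_max_scale(income)
scaled_age = min_max_scale(age)

# Distance between the first two observations, before and after scaling.
raw_distance = math.dist((income[0], age[0]), (income[1], age[1]))
scaled_distance = math.dist(
    (scaled_income[0], scaled_age[0]), (scaled_income[1], scaled_age[1])
)
print(raw_distance, scaled_distance)
```

Before scaling, the distance is almost entirely the income difference; after scaling, both features contribute equally.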

7
Q

What is SQL often used for in the context of data science?
a. To manipulate and retrieve data stored in relational databases
b. To create interactive user interfaces
c. To design the output of descriptive models
e. To create 3D graphics

A

a. To manipulate and retrieve data stored in relational databases
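
A brief sketch of this kind of use, run through Python's built-in `sqlite3` module against an in-memory database. The `purchases` table, its columns, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ana", 20.0), ("ana", 30.0), ("rui", 15.0)],  # hypothetical rows
)

# Retrieve the total spent per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```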

8
Q

Which of the following is NOT a characteristic of CRISP-DM?
Select one:
a. Used by academia and by practitioners
c. Follows the waterfall project management standards defined by the Project Management Institute
d. It is independent of the tools used
e. Can be applied to different types of problems
f. Non-Proprietary (not patented or owned by an organization)

A

c. Follows the waterfall project management standards defined by the Project Management Institute

9
Q

Select the correct type of data.
a. {name: “Maria”, age: 32, city: “London”};
b. Image file
c. A column in a dataset with the product brand (rows are transactions)
d. A column in a dataset with the customer's total purchase amount, in Euros (rows are based on
customers’ behavior and characteristics)

Structured data - Numerical
Semi-structured data - JSON
Structured data - Categorical
Unstructured data - Multimedia

A

a. Semi-structured data - JSON
b. Unstructured data - Multimedia
c. Structured data - Categorical
d. Structured data - Numerical

10
Q

A histogram chart is helpful to…
a. Analyze variations over time
b. Make comparisons between two variables
d. Analyze the distribution of data points
e. Analyze the relationship between two variables

A

d. Analyze the distribution of data points
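
A histogram is built by counting how many values fall into each bin. The sketch below does only the counting step, with a hypothetical `histogram` helper, a made-up `ages` sample, and an assumed bin width of 10; a plotting library would then draw one bar per bin.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of a bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


ages = [21, 25, 29, 33, 35, 38, 41, 52]  # hypothetical sample
bins = histogram(ages, 10)

for lower_edge, count in bins.items():
    print(f"{lower_edge}-{lower_edge + 9}: {'#' * count}")
```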

11
Q

What is the main goal of the ‘Deployment’ phase in the CRISP-DM methodology?
a. To recollect the data for analysis
c. To evaluate the success of data cleaning
d. To explain the results of exploratory data analysis
e. To push the developed model into a production or operational environment where it can provide the intended business benefits

A

e. To push the developed model into a production or operational environment where it can provide the intended business benefits

12
Q

Which chart should be used to visualize the distribution between three numeric variables?

A

3D area chart

13
Q

In an RFM segmentation where segments are created using quartiles, “Good customers, but almost churned” (have not made purchases for some time, but buy frequently and spend the most) are typically situated in which segment?
Select one:
a. 344
b. 444
c. 144
e. 411

A

a. 344
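
The sketch below shows one way quartile-based RFM codes can be produced. It assumes a convention the card does not state explicitly: each metric is scored 1 to 4 with 4 as the best quartile, and recency is reversed so that fewer days since the last purchase scores higher. The `quartile_scores` helper uses a simple rank-based split, and the customer data is invented.

```python
def quartile_scores(values, higher_is_better=True):
    """Score each value 1-4 by its rank-based quartile (4 = best)."""
    ordered = sorted(values, reverse=not higher_is_better)
    n = len(values)
    return [1 + (ordered.index(v) * 4) // n for v in values]


# Hypothetical customers: days since last purchase, number of purchases, spend.
recency_days = [5, 40, 90, 200, 12, 60, 150, 30]
frequency = [20, 22, 3, 1, 8, 25, 2, 6]
monetary = [900, 1000, 100, 40, 300, 1200, 60, 250]

r = quartile_scores(recency_days, higher_is_better=False)  # fewer days is better
f = quartile_scores(frequency)
m = quartile_scores(monetary)

segments = [f"{ri}{fi}{mi}" for ri, fi, mi in zip(r, f, m)]
print(segments)
```

Under this convention, the second customer (40 days since the last purchase, top-quartile frequency and spend) receives the code 344.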

14
Q

How does Cosine Similarity differ from Euclidean distance?
a. Cosine Similarity can only be used when all data points are positive
b. Cosine Similarity can only be calculated in two dimensions
c. Cosine Similarity is used only with unstructured data.
e. Cosine Similarity takes into account the direction of data (the angle between data point vectors), and not the magnitude, making it useful when the dataset includes text or other high-dimensional data

A

e. Cosine Similarity takes into account the direction of data (the angle between data point vectors), and not the magnitude, making it useful when the dataset includes text or other high-dimensional data
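
A small sketch of the contrast, using two hypothetical term-count vectors that point in the same direction but differ in magnitude (for example, a short text and a longer text with the same word proportions). The `cosine_similarity` helper is written out for illustration and assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


short_doc = [1, 2, 3]  # hypothetical word counts
long_doc = [2, 4, 6]   # same proportions, twice the length

similarity = cosine_similarity(short_doc, long_doc)
distance = math.dist(short_doc, long_doc)  # Euclidean distance

print(similarity, distance)
```

The cosine similarity is (up to rounding) 1 because the direction is identical, while the Euclidean distance is clearly non-zero because the magnitudes differ.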

15
Q

What does the line marked inside the red oval represent in the following boxplot?
Select one:
a. The quartile 3 value plus the value of the interquartile range multiplied by 1.5
b. Quartile 2
c. The quartile 1 value minus the value of the interquartile range multiplied by 1.5
d. Quartile 3

A

a. The quartile 3 value plus the value of the interquartile range multiplied by 1.5
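
The limit named in this answer can be computed as sketched below. The sample `data` is invented, and the quartiles come from `statistics.quantiles` with its default method; other quartile conventions can give slightly different values. Many plotting libraries draw the whisker at the most extreme observation that still lies inside this limit.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 40]  # hypothetical sample with one extreme value

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # interquartile range

upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(q1, q3, iqr, upper_limit, lower_limit, outliers)
```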

16
Q

Which of the following are typical applications of Market Basket Analysis? Select one or more:
a. Improve product placement in stores
b. Identifying cross-selling opportunities
c. Segment customers into groups with similar characteristics
d. Identifying up-selling opportunities

A

a. Improve product placement in stores
b. Identifying cross-selling opportunities
d. Identifying up-selling opportunities
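
Market Basket Analysis usually rests on measures such as support and confidence for item combinations. The sketch below computes both for one rule over a handful of invented transactions; the `support_count` helper and the basket contents are assumptions for illustration.

```python
transactions = [  # hypothetical baskets
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support_count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


antecedent = {"bread"}
rule_items = {"bread", "butter"}

# Support: share of all baskets that contain both items.
support = support_count(rule_items) / len(transactions)
# Confidence of bread -> butter: share of bread baskets that also contain butter.
confidence = support_count(rule_items) / support_count(antecedent)

print(support, confidence)
```

A high-confidence rule like this one is the kind of signal used for product placement or cross-selling suggestions.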

17
Q

What is the primary use of Data Science in Marketing?
b. To understand and predict customer behaviors and improve marketing strategies
c. To entertain customers
d. To develop high-tech products
e. To retain talented employees

A

b. To understand and predict customer behaviors and improve marketing strategies

18
Q

Which of the following statements describes the Data Preparation phase in CRISP-DM?
Select one:
a. Phase that covers all the activities necessary to build the final dataset (data that will be used in the
modeling tool(s)) from the initial raw data
c. Phase in which various modeling techniques are selected and applied. The parameters of the various techniques are also calibrated to ideal values
d. Phase that focuses on understanding the objectives and requirements of the project from a business perspective, converting this knowledge into the definition of the Data Mining problem and the
development of the preliminary plan to achieve the objectives
e. Phase that begins with the initial data extraction and continues with activities that allow familiarization with the data, identification of data quality problems, discovery of first insights into the data, and/or detection of interesting subsets to form hypotheses about hidden information

A

a. Phase that covers all the activities necessary to build the final dataset (data that will be used in the
modeling tool(s)) from the initial raw data