Data Science Flashcards
What is the most common approach to transform categorical variables into numeric ones?
Select one:
a. Frequency count
b. One-hot encoding
d. “Bayesian” encoders
e. Contrast encoders
b. One-hot encoding
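A minimal sketch of one-hot encoding in plain Python, included only to illustrate the answer above. The function name one_hot_encode and the sample column are hypothetical; in practice a library routine would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "hot" position per observation
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical column
categories, encoded = one_hot_encode(colors)

print(categories)
for row in encoded:
    print(row)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the categories.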
In what situation should observations with anomalies in values (outliers) be removed?
Select one:
a. When outliers represent more than 90% of the existing number of observations
b. When outliers distort models more than they can help, as is the case with models built with
numerical algorithms, such as K-Means or K-Medoids
c. When outliers represent more than 75% of the existing number of observations
d. When the algorithms used do not deal well with outliers, as is the case with algorithms based on
decision trees
b. When outliers distort models more than they can help, as is the case with models built with
numerical algorithms, such as K-Means or K-Medoids
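One way to see the effect described in the answer: a K-Means centroid is the mean of the points assigned to it, so a single extreme value can pull it away from the bulk of the cluster. The sketch below uses a hypothetical one-dimensional cluster and is not a full K-Means implementation.

```python
from statistics import mean

cluster = [10.0, 11.0, 9.0, 10.0, 12.0]  # hypothetical 1-D cluster
with_outlier = cluster + [100.0]         # same cluster plus one outlier

centroid_clean = mean(cluster)           # centroid without the outlier
centroid_outlier = mean(with_outlier)    # centroid pulled toward the outlier

print(centroid_clean)
print(centroid_outlier)
```

With the outlier included, the centroid lands far from every typical observation in the cluster.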
What is the primary use of Data Visualization in Data Science?
a. To build machine learning models
b. To clean the data
c. To help understand and interpret the data better by representing it in a graphical format
d. To produce attractive graphics for publications
c. To help understand and interpret the data better by representing it in a graphical format
When evaluating a descriptive analytics model using CRISP-DM, which of the following aspects is
often considered?
a. The quality of insights extracted and their relevance to the business problem
b. The predictive power of the model
d. The speed of the model in production
e. The accuracy of the clustering algorithms
a. The quality of insights extracted and their relevance to the business problem
Which of these tasks is NOT part of the Business Understanding phase of CRISP-DM?
Select one:
a. Produce the project plan
c. Defining business objectives
d. Data exploration
e. Determine Data Mining objectives
f. Assess the situation
d. Data exploration
Which of the following does NOT represent a valid reason for a dataset to be normalized?
Select one:
a. When the target is between 0 and 1
b. When one wants to interpret the coefficients of linear models
d. When one wants to improve an algorithm’s training speed by helping to accelerate convergence, even in
algorithms that do not require data to be normalized
e. When using algorithms that measure the distance (e.g., Euclidean distance) between data points
a. When the target is between 0 and 1
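To illustrate the distance-based reason listed in option e, the following sketch applies min-max scaling to two hypothetical features on very different scales. The helper min_max_scale and the sample values are assumptions made for the example.

```python
import math


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


ages = [25, 50, 30]                # hypothetical feature, small scale
incomes = [30000, 31000, 60000]    # hypothetical feature, large scale

raw = list(zip(ages, incomes))
scaled = list(zip(min_max_scale(ages), min_max_scale(incomes)))

# Euclidean distances from the first observation to the other two
raw_d01 = math.dist(raw[0], raw[1])
raw_d02 = math.dist(raw[0], raw[2])
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])

print(raw_d01, raw_d02)        # income dominates the raw distances
print(scaled_d01, scaled_d02)  # after scaling, age contributes as well
```

On the raw data the 25-year age gap barely registers next to income; after scaling, both features influence the distance.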
What is SQL often used for in the context of data science?
a. To manipulate and retrieve data stored in relational databases
b. To create interactive user interfaces
c. To design the output of descriptive models
e. To create 3D graphics
a. To manipulate and retrieve data stored in relational databases
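As a small illustration of retrieving and aggregating relational data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.5), ("ben", 12.0)],  # hypothetical rows
)

# Retrieve the total amount spent per customer
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM transactions GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(rows)
```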
Which of the following is NOT a characteristic of CRISP-DM?
Select one:
a. Used by academia and by practitioners
c. Follows the waterfall project management standards defined by the Project Management Institute
d. It is independent of the tools used
e. Can be applied in different types of problems
f. Non-Proprietary (not patented or owned by an organization)
c. Follows the waterfall project management standards defined by the Project Management Institute
Select the correct type of data for each item.
a. {name: “Maria”, age: 32, city: “London”};
b. Image file
c. A column in a dataset with the product brand (rows are transactions)
d. A column in a dataset with the customer total purchase amount, in Euros (rows are based on
customers’ behavior and characteristics)
Structured data - Numerical
Semi-structured data - JSON
Structured data - Categorical
Unstructured data - Multimedia
a. Semi-structured data - JSON
b. Unstructured data - Multimedia
c. Structured data - Categorical
d. Structured data - Numerical
A histogram chart is helpful to…
a. Analyze variations over time
b. Make comparisons between two variables
d. Analyze the distribution of data points
e. Analyze the relationship between two variables
d. Analyze the distribution of data points
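A histogram is essentially a count of observations per bin. The sketch below computes those counts and prints a text version of the chart; the function histogram_counts, the bin width, and the sample ages are hypothetical choices for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [bin_start, bin_start + bin_width)."""
    counts = {}
    for v in values:
        bin_start = start + ((v - start) // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))


ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 58]  # hypothetical sample
counts = histogram_counts(ages, bin_width=10, start=20)

for bin_start, n in counts.items():
    print(f"{bin_start}-{bin_start + 9}: {'#' * n}")
```

The bar lengths show where the data points concentrate, which is the distribution the answer refers to.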
What is the main goal of the ‘Deployment’ phase in the CRISP-DM methodology?
a. To recollect the data for analysis
c. To evaluate the success of data cleaning
d. To explain the results of exploratory data analysis
e. To push the developed model into a production or operational environment where it can provide the intended business benefits
e. To push the developed model into a production or operational environment where it can provide the intended business benefits
Which chart should be used to visualize the distribution of three numeric variables?
3D area chart
In an RFM segmentation where segments are created using quartiles, typically, “Good customers, but almost churned” (have not made purchases for some time, but buy frequently and spend the most) are situated in which segment?
Select one:
a. 344
b. 444
c. 144
e. 411
a. 344
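The sketch below shows one possible way to build quartile-based RFM codes. It assumes that 4 is the best score on each dimension, that the code is ordered Recency-Frequency-Monetary, and that quartiles are assigned by rank; the helper quartile_scores and the eight customers are hypothetical.

```python
def quartile_scores(values, higher_is_better=True):
    """Score each value from 1 to 4 by rank quartile (assumption: 4 = best)."""
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=not higher_is_better,  # worst value first, so it gets score 1
    )
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * 4 // len(values) + 1
    return scores


# Hypothetical data for eight customers
recency_days = [5, 40, 90, 200, 10, 60, 150, 300]  # days since last purchase
frequency = [12, 20, 3, 1, 8, 25, 2, 4]            # number of purchases
monetary = [500, 900, 80, 20, 300, 1200, 50, 100]  # total spent

r = quartile_scores(recency_days, higher_is_better=False)  # fewer days is better
f = quartile_scores(frequency)
m = quartile_scores(monetary)

segments = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(segments)
```

Under these assumptions, a customer coded 344 is in the top quartile for frequency and spending but one quartile below the most recent buyers.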
How does Cosine Similarity differ from Euclidean distance?
a. Cosine Similarity can only be used when all data points are positive
b. Cosine Similarity can only be calculated in two dimensions
c. Cosine Similarity is used only with unstructured data.
e. Cosine Similarity takes into account the direction of data (the angle between data point vectors), and not the magnitude, making it useful when the dataset includes text or other high-dimensional data
e. Cosine Similarity takes into account the direction of data (the angle between data point vectors), and not the magnitude, making it useful when the dataset includes text or other high-dimensional data
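A short sketch of the contrast described in the answer: two vectors pointing in the same direction but with different magnitudes have a cosine similarity of about 1, while their Euclidean distance is large. The vectors are hypothetical word-count vectors for a short and a long document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


doc_short = [1.0, 2.0, 0.0]    # hypothetical word counts, short document
doc_long = [10.0, 20.0, 0.0]   # same proportions, ten times longer

cos = cosine_similarity(doc_short, doc_long)
dist = math.dist(doc_short, doc_long)

print(cos)   # close to 1: same direction
print(dist)  # large: very different magnitude
```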
What does the line marked by the red oval represent in the following boxplot?
Select one:
a. The quartile 3 value plus the value of the interquartile range multiplied by 1.5
b. Quartile 2
c. The quartile 1 value minus the value of the interquartile range multiplied by 1.5
d. Quartile 3
a. The quartile 3 value plus the value of the interquartile range multiplied by 1.5
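The limit named in the answer can be computed directly. The sketch below uses statistics.quantiles with the "inclusive" method on a hypothetical sample; other quartile conventions can give slightly different values, and many plotting libraries draw the whisker at the largest observation that does not exceed this limit.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical sample with one extreme value

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr  # limit described in the answer
lower_fence = q1 - 1.5 * iqr  # the symmetric lower limit (option c)

outliers = [x for x in data if x > upper_fence or x < lower_fence]

print(q1, q3, iqr)
print(upper_fence, lower_fence)
print(outliers)
```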