Quiz 1 Flashcards
Steps of the data science pipeline
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Evaluation/Interpretation
Ways to measure central tendency
Mean
Median
Midrange
Mode
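A minimal sketch of the four measures above, using Python's standard library. The function name central_tendency and the sample values are made up for illustration; midrange is computed as the average of the smallest and largest value.

```python
import statistics

def central_tendency(values):
    """Return the mean, median, midrange, and mode(s) of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Midrange: halfway between the minimum and the maximum.
        "midrange": (min(values) + max(values)) / 2,
        # multimode returns every most-frequent value, so ties are kept.
        "modes": statistics.multimode(values),
    }

# Illustrative data only; the 12 pulls the mean and midrange above the median.
summary = central_tendency([2, 3, 3, 5, 12])
print(summary)
```

On this sample the mean and midrange are pulled toward the large value 12, while the median and mode stay at 3.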
Ways to measure dispersion or spread
Range
Quartiles
Variance
Standard Deviation
Interquartile Range
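A minimal sketch of these spread measures, again with the standard library. The function name dispersion and the sample values are assumptions for illustration. Two choices are worth flagging: the population forms of variance and standard deviation are used (divide by n), and quartiles come from statistics.quantiles with its default method, so other tools may report slightly different quartile values.

```python
import statistics

def dispersion(values):
    """Return range, quartiles, IQR, variance, and standard deviation."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "quartiles": (q1, q2, q3),
        "iqr": q3 - q1,
        # Population variance and standard deviation (divide by n, not n - 1).
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }

# Illustrative data only.
spread = dispersion([2, 4, 4, 4, 5, 5, 7, 9])
print(spread)
```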
Unsupervised Learning
Includes Clustering
Find groups in data without provided labels
Types of supervised learning problems
Regression
Classification
Classifier
Discovers a pattern that can predict a class that a new data instance falls into
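One way to picture a classifier is a 1-nearest-neighbor rule: a new instance gets the class of the most similar labeled example. This is only a sketch of one simple technique, not the only kind of classifier; the function name, the two-feature points, and the labels "A" and "B" are hypothetical.

```python
import math

def nearest_neighbor_classify(training, new_point):
    """Return the label of the training example closest to new_point.

    training is a list of (features, label) pairs.
    """
    # Pick the labeled example with the smallest Euclidean distance.
    _, label = min(training, key=lambda example: math.dist(example[0], new_point))
    return label

# Hypothetical labeled data: two features per instance, two classes.
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

prediction = nearest_neighbor_classify(training, (4.8, 5.1))
print(prediction)
```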
What is clustering used for?
Anomaly Detection
Grouping points based on similarities between them
Does not require labeled data
Supervised learning examples
Examine a web page, and classify whether the content on the web page should be considered “child friendly” or “adult.”
In farming, given data on crop yields over the last 20 years, learn to predict next year’s crop yields.
Learn from historical data and determine whether a new user will respond to an ad campaign (or not).
Data discretization is part of data reduction
True
Scatter plot is not an effective graphical method to look for correlation between two numerical variables
False
Truths about correlation
If correlation is equal to -1 then two features are perfectly negatively correlated
Correlation between two features lies in the range [-1, 1]
If correlation is equal to 1 then two features are perfectly positively correlated
If correlation is equal to 0 then two features have no linear correlation
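A minimal sketch of the Pearson correlation coefficient, assuming that is the correlation measure these cards refer to. The function name pearson_r and all data are made up for illustration. The last example shows that a value of 0 only rules out a linear relationship: y = x squared is fully determined by x yet has zero correlation on symmetric data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
perfect_pos = pearson_r(xs, [2, 4, 6, 8, 10])   # y rises linearly with x
perfect_neg = pearson_r(xs, [10, 8, 6, 4, 2])   # y falls linearly with x
# y = x**2 on symmetric x: clearly related, but not linearly.
no_linear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(perfect_pos, perfect_neg, no_linear)
```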
Scatter plot
Can handle multiple Y values per X value
Bar Chart
Good for categorical X values and cases where the Y value is ratio-scaled.
Line Graph
Implies some importance of the connection between the data points