Data Science Essentials Flashcards
What is a Random Variable?
A random variable assigns a numerical value to each possible outcome of a random experiment.
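As a minimal sketch of this definition, the experiment below is three flips of a fair coin, and the random variable maps each outcome to its number of heads. The names outcomes, number_of_heads, and distribution are illustrative assumptions, not part of the original card.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: every possible outcome of flipping three fair coins.
outcomes = list(product("HT", repeat=3))


def number_of_heads(outcome):
    """The random variable: maps one outcome to a numerical value."""
    return outcome.count("H")


# Probability of each value, assuming all outcomes are equally likely.
counts = Counter(number_of_heads(o) for o in outcomes)
distribution = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

for value, probability in distribution.items():
    print(f"P(X = {value}) = {probability}")
```

Because the values can be counted (0, 1, 2, or 3 heads), this is also an example of a discrete random variable.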
What are the 5 Vs of Big Data?
- Velocity
- Veracity
- Variety (some lists name Variability instead)
- Volume
- Value

What does machine learning include?
Machine Learning is a computing technique that has its origins in artificial intelligence (AI) and statistics. Machine Learning solutions include:
- Classification - Predicting which category an entity belongs to (in the binary case, a Boolean true/false value) given a set of features.
- Regression - Predicting a real numeric value for an entity with a given set of features.
- Clustering - Grouping entities with similar features.
What does the five-number summary contain?
- Min
- Q1 (first quartile)
- Q2 (the median)
- Q3 (third quartile)
- Max
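A rough sketch of computing these five values with Python's standard library. The function name five_number_summary and the sample data are assumptions made for illustration, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) for a sequence of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    # The "inclusive" method is an assumed choice of quartile convention.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample
summary = five_number_summary(sample)
print(summary)
```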
Python: merging DataFrames (link to good examples)
http://chrisalbon.com/python/pandas_join_merge_dataframe.html
What is one of the first steps of machine learning?
In general, the first step in machine learning is to figure out how to represent your data as a vector.
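One possible way to do this, sketched under assumptions: a record with a numeric field and a categorical field becomes a list of floats, with the category one-hot encoded. The field names weight_kg and color, the COLORS list, and the to_vector helper are all hypothetical.

```python
# Assumed fixed list of categories; a real dataset would define its own.
COLORS = ["red", "green", "blue"]


def to_vector(record):
    """Encode a record as floats: the numeric value, then a one-hot color."""
    one_hot = [1.0 if record["color"] == c else 0.0 for c in COLORS]
    return [float(record["weight_kg"])] + one_hot


vector = to_vector({"weight_kg": 2.5, "color": "green"})
print(vector)
```

One-hot encoding is used here so that the categorical field does not imply an ordering between colors; other encodings are possible.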
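What is the CRISP-DM process?
CRISP-DM (Cross-Industry Standard Process for Data Mining) has six phases:
- Business understanding.
- Data understanding.
- Data preparation.
- Modeling.
- Evaluation.
- Deployment.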

What do summary statistics generally contain?
Summary statistics generally include the mean, the median, and the quartiles of the data. This gives you a first quick look at the distribution of data values.
What is the benefit of a scatter plot matrix?
A scatter plot matrix quickly produces a single overall view of a dataset, so you can examine the relationships between many pairs of variables at once.
The data science process includes the following activities:
- Data selection.
- Preprocessing.
- Transformation.
- Data Mining.
- Interpretation and evaluation.
What is a discrete random variable?
A discrete random variable has a number of outcomes that you could count.
What are some aspects of Data Analytic Thinking?
- Replace intuition with data-driven analytical decisions.
- Transform raw data into a valuable asset.
- Increase the pace of action.
What is Data Science?
Data Science is the exploration and quantitative analysis of all available structured and unstructured data to develop understanding, extract knowledge, and formulate actionable results.
What is a continuous variable?
A continuous variable is a variable that can take any value within a range, so it has infinitely many possible values. A continuous variable is the opposite of a discrete variable, which can only take on values that you could count.
Types of Machine Learning algorithms?
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction Algorithms
- Gradient Boosting and AdaBoost
Machine Learning: Good to Remember
Machine learning is a powerful set of techniques for prediction.
Machine learning allows you to make predictions and detect patterns that would otherwise have gone unnoticed.
Machine learning started as a subfield of artificial intelligence, and its goal is to allow computers to learn by example.
Good to remember about discrete and continuous variables?
A discrete variable is a variable whose value is obtained by counting. Examples:
- number of students present
- number of red marbles in a jar
- number of heads when flipping three coins
- students’ grade level
A continuous variable is a variable whose value is obtained by measuring. Examples:
- height of students in class
- weight of students in class
- time it takes to get to school
- distance traveled between classes
What are some common data cleaning tasks in Azure?
- Ingesting and joining data from multiple sources.
- Deleting unnecessary and redundant columns.
- Consolidating the number of categories of a categorical feature.
- Treating missing values.
- Removing duplicate rows.
- Generating a calculated column.
- Locating and treating outliers.
- Scaling numeric values.
What are some important aspects of Data Cleansing?
One of the most important aspects of any data science project is to clean, filter, and otherwise transform data to prepare it for use in a model. Common tasks when preparing data include:
- Identifying and handling missing or duplicate values.
- Identifying and handling outliers and errors.
- Scaling numeric values to make them easier to compare.
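Three of these tasks can be sketched in plain Python as follows. The rows, the column names id and age, the choice of median imputation, and the choice of min-max scaling are all assumptions made for illustration; real projects may treat missing values and scaling differently.

```python
import statistics

# Hypothetical raw data with one missing value and one duplicate row.
raw = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 50},
    {"id": 3, "age": 50},    # duplicate row
    {"id": 4, "age": 18},
]

# 1. Remove duplicate rows, keeping the first occurrence of each.
seen, rows = set(), []
for row in raw:
    key = (row["id"], row["age"])
    if key not in seen:
        seen.add(key)
        rows.append(dict(row))

# 2. Treat missing values by imputing the median of the known ages.
median_age = statistics.median(r["age"] for r in rows if r["age"] is not None)
for r in rows:
    if r["age"] is None:
        r["age"] = median_age

# 3. Min-max scale the ages to [0, 1] so they are easier to compare.
lo, hi = min(r["age"] for r in rows), max(r["age"] for r in rows)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)

for r in rows:
    print(r)
```

Duplicates are removed before imputing so that a repeated row does not skew the median.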
What is a conditioned histogram?
A conditioned histogram is a histogram of a subset of data conditioned on another variable in the dataset. Often the histogram of a numeric variable is conditioned on a categorical variable. It is also possible to condition a histogram on (generally overlapping) ranges of a numeric variable.
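A small text-only sketch of the idea: a numeric variable is binned separately for each value of a categorical variable. The data, the bin width, and the helper name conditioned_histogram are hypothetical; a plotting library would normally draw one panel per category instead of printing bars.

```python
from collections import Counter, defaultdict

# Hypothetical (category, numeric value) pairs.
data = [
    ("sedan", 21), ("sedan", 24), ("sedan", 29),
    ("suv", 14), ("suv", 18), ("suv", 19), ("suv", 22),
]

BIN_WIDTH = 5  # assumed bin width


def conditioned_histogram(pairs, bin_width):
    """Map each category to a Counter of {bin lower edge: count}."""
    hist = defaultdict(Counter)
    for category, value in pairs:
        bin_start = (value // bin_width) * bin_width
        hist[category][bin_start] += 1
    return hist


hist = conditioned_histogram(data, BIN_WIDTH)

# Print one small histogram per category.
for category in sorted(hist):
    for bin_start in sorted(hist[category]):
        bar = "#" * hist[category][bin_start]
        print(f"{category:>5} [{bin_start:>2}, {bin_start + BIN_WIDTH:>2}) {bar}")
```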
How do you judge the quality of your prediction model?
The prediction quality should always be judged out of sample. Base your judgment on the results for the test data set, not the training data set, because a model can fit the data it was trained on far better than data it has not seen.
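To illustrate why, here is a sketch with a 1-nearest-neighbor predictor, a model chosen for this example because it memorizes its training data. The data points, the fixed train/test split, and the function names are assumptions; in practice the split is usually made at random.

```python
# Hypothetical data in which y is roughly 2 * x.
train = [(1, 2.1), (3, 5.8), (5, 10.3), (7, 13.9)]
test = [(2, 4.0), (4, 8.1), (6, 11.9)]


def one_nearest_neighbor(x):
    """Predict the y value of the training point whose x is closest."""
    _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
    return nearest_y


def mean_abs_error(predict, points):
    """Average absolute difference between predictions and true values."""
    return sum(abs(predict(x) - y) for x, y in points) / len(points)


train_error = mean_abs_error(one_nearest_neighbor, train)
test_error = mean_abs_error(one_nearest_neighbor, test)

print(f"training error: {train_error:.2f}")
print(f"test error:     {test_error:.2f}")
```

The training error is zero by construction, since every training point is its own nearest neighbor, while the test error is not. Judging this model on its training data alone would make it look perfect.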