Statistical Thinking for Data Science and Analytics Flashcards
What is data science? (Professors’ definitions)
I really think of data science as the pairing of people who develop technology that can learn from data with people who have data and who have problems to solve.
And really, it’s interdisciplinary at heart.
To me, data science is about building tools to help solve problems with data.
Data science is about building tools to uncover patterns, to form predictions, to help us explore data to understand the world.
And this involves pushing fields like computer science and optimization and statistics in new ways.
But there has to be an application of some sort.
So the intersection of probability, statistics, computer science, and an application is an essential definition of data science, I believe.
What questions can data science answer?
Why is there an explosion of data?
Why is data visualization important?
Data visualization is important for people who want to explore their data, to get some idea of what it contains, and therefore, perhaps, to develop some intuitions about how they would go about solving a problem, or learning from that data.
Visualization is also really important when we’re looking at the output of data science systems.
relies heavily on data visualization for interpretation. So to be able to take things from a mathematical space, which is fairly abstract, and convert them and be able to speak to a clinician and map the areas on the brain that are affected is absolutely critical.
For dashboards: people won’t be able to see the benefits that are being provided if they are not able to visualize things.
Secondly, data visualization is important for the communication of the results of data science to a general audience.
Number one is, it makes it very easy to understand what’s going on.
at least the best data visualizations that I’ve seen, is that they introduce new questions.
So it’s both the initial exploration of the data set, as well as the presentation of the results to the people that need to understand what we’ve learned.
What skills does a data scientist need?
- Math and statistics foundation
- Algorithms for big data
- Computer Science knowledge
- Storing and accessing data
- Parallel processing of data
- How to apply DS to real world problems
- Optimization
- Statistical way of thinking beyond theory
- Machine Learning
- Complexity Theory
In 2011, Peter Warden wrote in “Why the term ‘data science’ is flawed but useful” that:
Traditional scientists choose a problem and then find data to shed light on it
What is the first step in any data analysis project?
The first step of any data analysis project is “data conditioning,” or getting data into a state where it’s usable
When a company does a data mashup it is:
Using data from multiple, disparate sources to create a data product
Data conditioning involves:
Getting data into a state where it is usable
The MapReduce approach is a strategy for:
Processing a large data set using a large number of computers
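A minimal single-machine sketch of the MapReduce idea, using a word count as the example. The function names and the sample documents are illustrative assumptions; a real framework would run the map and reduce steps on many computers and handle the grouping step itself.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key (done by the framework between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input: each string stands for a chunk handled by one worker.
documents = ["data science uses data", "science needs data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because each call to map_phase only looks at its own document, and each call to reduce_phase only looks at one key, both steps can be spread across machines.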
The term “stream processing” refers to:
Processing data as it arrives
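As a small illustration of processing data as it arrives, the sketch below updates a running mean one value at a time without storing the whole data set. The readings are made-up values, and running_mean is a hypothetical helper name.

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, updating as each value arrives."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Hypothetical stream of sensor readings, consumed one at a time.
readings = iter([4, 8, 6, 2])
means = list(running_mean(readings))
print(means)
```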
Machine learning almost always requires:
A training set of data
The role of Statistics in Data Science:
Showing trends in the data being analyzed
Making data tell its story:
Involves creating visualizations
What is the validity of our results of data experiments based on?
So the validity of results depends on the validity of assumptions we make on the data generating process.
Such assumptions include assumptions on sampling, randomization, the measurements of the data, and independence between variables, and so forth.
When we call an observed effect statistically significant, we mean that:
The effect is unlikely to occur purely by chance
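One way to make "unlikely to occur purely by chance" concrete is a permutation test: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the observed one. This is a sketch of one common technique, not necessarily the method used in the course; the two groups below are invented numbers.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often random relabeling gives a mean difference
    at least as large as the observed one (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any real link between label and value
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical measurements for two groups.
group_a = [12, 15, 14, 16, 13, 17]
group_b = [8, 9, 11, 7, 10, 9]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

A small p-value says the observed difference would rarely arise from random relabeling alone, which is what "statistically significant" is meant to capture.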
What is Data?
Data are numbers, but they’re not just simply numbers.
They’re numbers with context.
What are the different units of measurement in data sets?
The unit of measurement can be objects, dates, time units, events, et cetera.
It is basically the unit on which we are taking measurements.
Why do we study variables in data sets?
Variables are really the central focus of analysis because we want to study the variation of a variable, to gauge its trend, its randomness, and the extent of its variability, in order to generate knowledge about the population.
How many types of variables are there and what are they?
3.
- Categorical
- Quantitative
- Ordinal
What are summary statistics?
Summary statistics are summaries of numerical data.
They do not tell the whole story, but they’re useful and meaningful.
Generally how are categorical and quantitative summaries visualized?
Categorical data - Pie Charts, Bar Plots
Quantitative data - Histograms
Good to remember about how to visualize different data…
Even though technically one can make a pie chart for any numerical values, a pie chart for price values of products will not be meaningful as there are too many possible values and the values should also be arranged in an increasing order. The pie chart treats each distinct value of the variable as a category and does not use the order information of these values.
Good to remember about standard statistical notation:
For a data set, we use $n$ for sample size, or the number of individuals in the data. The variables are represented by letters that are close to the end of the alphabet, such as $X$, $Y$ and $Z$. We use the letter $i$ to index the individuals. Therefore $X_i$ would refer to the value of variable $X$ for the $i$th individual.
One important notation in statistics is the summation sign, $\sum$ (capital Greek letter sigma). For example
$$\sum_{i=1}^{n} X_i$$
would mean a sum of the $n$ values from $X_1$ to $X_n$.
If we replace $X_i$ in the sum above by $(X_i - 3)^2$, then the quantity changes to a sum of $(X_1 - 3)^2, (X_2 - 3)^2, \ldots, (X_n - 3)^2$.
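The summation notation maps directly onto a loop over the individuals. A short sketch with made-up values for X (note that Python indexes from 0, while the notation indexes from 1):

```python
# Hypothetical data: values of variable X for n = 4 individuals.
x = [2, 5, 3, 6]
n = len(x)

# Sum of X_i for i = 1..n
total = sum(x[i] for i in range(n))

# Sum of (X_i - 3)^2 for i = 1..n
shifted_squares = sum((x[i] - 3) ** 2 for i in range(n))

print(total, shifted_squares)
```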
What is the first thing to consider when summarizing numerical data?
Center of Variation.
The center of variation is the value around which the observed values are distributed.
2 commonly used methods to show center of variation:
The first one is the mean, which is the numerical average of the observed values.
The second is the median, which is the midpoint of the sorted values.
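A quick sketch of the two measures using Python's standard statistics module. The income values are invented, and they include one very large value to show that the mean is pulled toward it while the median is not.

```python
import statistics

# Hypothetical incomes (in thousands); the last value is an outlier.
incomes = [30, 32, 35, 38, 40, 45, 200]

mean_income = statistics.mean(incomes)      # numerical average
median_income = statistics.median(incomes)  # midpoint of the sorted values

print(mean_income, median_income)
```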
Why is it always a good idea to plot the data, in addition to reporting summary statistics?
- Summary statistics alone don’t necessarily provide insight into the shape of the distribution, e.g. whether it is normal or skewed.
- A visualisation such as a box plot is not only easy to read and understand, but can also show outliers in the data.
- A data plot is an image, and the human brain makes sense of images more quickly than of raw numbers.
- Sometimes plotting the data can give additional insight into data itself.
- Visuals are also more compelling to people and help communicate what the data is saying
Define association.
Two variables are associated when certain values of one variable are observed more often with certain values of the other variable.
What does a correlation of ‘0’ mean? Does it tell the whole story?
A correlation of ‘0’ means there is no linear association, but this does not mean there is no association. To get the whole picture, look at the scatter plot. There could be a U-shaped pattern, and hence some association.
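The U-shaped case can be checked numerically. In this sketch, y is completely determined by x (y = x squared), yet the correlation is zero. The pearson function is a hand-written helper for illustration, and the data points are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-2, -1, 0, 1, 2]
u_shaped = [x ** 2 for x in xs]   # perfect U-shaped relationship
linear = [2 * x + 1 for x in xs]  # perfect linear relationship

r_u = pearson(xs, u_shaped)
r_linear = pearson(xs, linear)
print(r_u, r_linear)
```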
How to determine if there are cause-effect relationships?
- Randomized Experiments
- A/B Testing
- Control Groups
- Double Blinded Studies
- Causal Inference from Observational Data
Why do we need a sample?
To derive knowledge from sample to population, we need to have a representative sample.
What happens if we do not have a good representative sample?
- Misleading Outcomes
- Biased Results
- Difficult to analyze results
- Wastage of time and money
What are the 2 characteristics of randomness?
- Unpredictability
- Trends
What is Probability?
Probability is the proportion of a certain occurrence in the long run.
It is only when you have a large number of occurrences in the long run that you can use probability to accurately describe the proportion of any certain random outcome.
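The long-run idea can be illustrated by simulating fair coin flips: with few flips the proportion of heads can be far from 0.5, and with many flips it settles close to it. This is a sketch using a fixed random seed so that runs are repeatable; proportion_of_heads is a hypothetical helper name.

```python
import random

def proportion_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```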
Probability Rules: Very Important
Specific Addition Rule: Mutually Exclusive Events
Only valid when the events are mutually exclusive.
P(A or B) = P(A) + P(B)
General Addition Rule: Non-Mutually Exclusive Events
P(A or B) = P(A) + P(B) - P(A and B)
Specific Multiplication Rule: Independent Events
P(A and B) = P(A) * P(B)
General Multiplication Rule: Conditional Probability
P(A and B) = P(A) * P(B|A)
OR
P(B|A) = P(A and B) / P(A)
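The four rules can be checked on a small example. The sketch below assumes one roll of a fair six-sided die (an illustrative choice, not from the cards) and uses exact fractions so that the equalities hold without rounding.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a set of faces) when all faces are equally likely."""
    return Fraction(len(set(event) & OUTCOMES), len(OUTCOMES))

one = {1}
even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
at_most_2 = {1, 2}

# Specific addition rule: "one" and "even" are mutually exclusive.
specific_addition_holds = prob(one | even) == prob(one) + prob(even)

# General addition rule: "even" and "greater_than_3" overlap on {4, 6}.
general_addition_holds = prob(even | greater_than_3) == (
    prob(even) + prob(greater_than_3) - prob(even & greater_than_3)
)

# Specific multiplication rule: "even" and "at_most_2" are independent.
specific_multiplication_holds = prob(even & at_most_2) == prob(even) * prob(at_most_2)

# Conditional probability and the general multiplication rule.
p_gt3_given_even = prob(even & greater_than_3) / prob(even)
general_multiplication_holds = prob(even & greater_than_3) == prob(even) * p_gt3_given_even

print(p_gt3_given_even)
```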
How can we study sampling distributions?
- by simulation
- by experiment
- by mathematical models
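A sketch of the simulation approach: draw many samples from a known population, compute the mean of each, and look at how those sample means are distributed. The population (faces of a fair die), the sample size, and the number of samples are all illustrative assumptions.

```python
import random
import statistics

def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

population = list(range(1, 7))  # faces of a fair die, population mean 3.5
sample_means = simulate_sample_means(population, sample_size=30, n_samples=2000)

center = statistics.mean(sample_means)   # should be close to the population mean
spread = statistics.stdev(sample_means)  # variability of the sample mean

print(center, spread)
```

The list sample_means is an empirical approximation of the sampling distribution of the mean; plotting it as a histogram would show its shape.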