Data Science Terms and Techniques Flashcards
Hypothesis
an assumption made about the world that can be tested using data; an educated guess that needs to be validated or disproved by experiment and data
Statistical Inference
a branch of statistics dedicated to drawing conclusions about the world using smaller data samples
Confidence intervals
an interval estimate used to express the degree of uncertainty associated with a sample statistic
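As a rough illustration, a confidence interval for a sample mean can be sketched with a normal approximation. The sample values and the function name `mean_confidence_interval` are made up for this card; for small samples a t-distribution would usually be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the sample mean, assuming a normal approximation."""
    m = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    # z is the two-sided critical value, roughly 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return m - z * standard_error, m + z * standard_error


heights_cm = [170, 165, 180, 175, 160, 172, 168, 178]  # hypothetical sample
low, high = mean_confidence_interval(heights_cm)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```

A higher confidence level gives a wider interval, because more uncertainty has to be covered.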
Statistical Significance
a measure of how likely an observed result (or a more extreme one) would be if only random chance were at work, usually reported as a p-value; the smaller the number, the stronger the evidence that the result reflects a real effect rather than chance
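One way to estimate how likely a difference is to occur randomly is a permutation test. The sketch below is one possible approach, not the only one; the two groups and the function name `permutation_p_value` are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often a random relabeling gives a difference in means
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


control = [12, 14, 15, 13, 16]    # hypothetical measurements
treatment = [20, 22, 21, 23, 19]  # hypothetical measurements
p_value = permutation_p_value(control, treatment)
print(f"estimated p-value: {p_value:.3f}")
```

A small estimated p-value means random shuffling rarely reproduces a gap as large as the observed one.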
Big Data
a collective term used for technology to analyze large amounts of data to unearth insights, typically into human behavior and patterns
Data Set
a collection of data to be analyzed
Analytics
a collective term for techniques used to analyze data, mostly to draw business insights
Algorithm
a well-defined set of steps to solve a specific problem
Technology Stack
the collective set of tools and programs used in an organization or team
Pre-packaged distribution
a package that bundles all of the required Python tools and libraries, e.g. numpy, scipy, pandas, scikit-learn, jupyter, matplotlib, seaborn and statsmodels. In the Python world, Anaconda and Canopy are popular distributions for scientific computing and data science.
Regular Expressions
a technique to quickly search for or substitute complex patterns in strings
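A small sketch of both uses, searching and substituting, with Python's built-in `re` module. The sample text and the simplified e-mail pattern are assumptions made for this card; real e-mail validation is considerably more involved.

```python
import re

text = "Contact: alice@example.com, bob@example.org; phone 555-0100"

# Simplified pattern: name characters, "@", a domain label, a dot, then the rest
email_pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"

emails = re.findall(email_pattern, text)         # search for every match
masked = re.sub(email_pattern, "<email>", text)  # substitute every match

print(emails)
print(masked)
```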
Jupyter
formerly known as IPython Notebook, this tool enables data scientists to prototype code rapidly and combine it with useful documentation
Raw data
data from original or secondary sources that may be unstructured or corrupted and needs more work performed on it before it can be analyzed
Data Wrangling
process of taking data in its raw form and manipulating it in various ways into a useful form
Messy or Dirty data
data can be messy or dirty in the sense that it might contain values that are invalid, missing, corrupted, inconsistent or non-uniform
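A minimal wrangling sketch in plain Python, assuming a few invented records that show non-uniform spacing and casing plus invalid values. The field names and the `clean_row` helper are hypothetical.

```python
raw_rows = [
    {"name": " Alice ", "age": "34", "city": "new york"},
    {"name": "BOB", "age": "n/a", "city": "New York"},
    {"name": "carol", "age": "-5", "city": "NEW YORK "},
]


def clean_row(row):
    """Trim whitespace, normalize casing, and mark unusable ages as None."""
    age_text = row["age"].strip()
    # Only plain non-negative integers are accepted; anything else becomes None
    age = int(age_text) if age_text.isdigit() else None
    return {
        "name": row["name"].strip().title(),
        "age": age,
        "city": row["city"].strip().title(),
    }


clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```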
Storytelling
this highly effective art of communication is not limited to entertainment; it is a crucial skill needed to communicate answers to questions that data scientists ask
Visualization
a picture is worth a thousand words, and this rings true when you are trying to present data in an easy-to-understand manner. Data visualization is increasingly used to depict data analysis. Think about how data is visualized during election times.
Supervised Learning
algorithms that create a model of the world by looking at labeled examples
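To make "labeled examples" concrete, here is a sketch of a nearest-neighbor classifier, chosen only because it is short. The points, labels, and the `predict` function are invented for illustration.

```python
from math import dist

# Each example pairs a feature vector with a known label
labeled_examples = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]


def predict(point, examples):
    """Return the label of the labeled example closest to the given point."""
    _, nearest_label = min(examples, key=lambda example: dist(point, example[0]))
    return nearest_label


print(predict((2.0, 1.5), labeled_examples))
print(predict((8.5, 8.0), labeled_examples))
```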
Unsupervised Learning
algorithms that create a model of the world using examples without labels
Bayesian Analysis
algorithms based on Bayes' theorem, which makes inferences about the world by combining domain knowledge or assumptions and observed evidence
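A worked sketch of Bayes' theorem for a yes/no hypothesis: the prior stands in for domain knowledge and the test result is the observed evidence. All of the numbers and the `posterior` function name are assumptions chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(hypothesis | evidence) via Bayes' theorem for a binary hypothesis."""
    # Total probability of seeing the evidence, whether or not the hypothesis holds
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% base rate, 95% sensitivity, 5% false positives
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"probability the hypothesis is true given a positive test: {p:.3f}")
```

With these assumed numbers the posterior is only about 16%, because the low prior outweighs the fairly accurate test.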
Clustering
a family of unsupervised learning algorithms used to automatically find groups in data sets
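As one example from this family, k-means can be sketched in a few lines for one-dimensional data. The values, the starting centers, and the `k_means_1d` name are made up; in practice a library such as scikit-learn would typically be used instead.

```python
from statistics import mean


def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous center
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


values = [1, 2, 3, 10, 11, 12]  # hypothetical unlabeled data
centers, clusters = k_means_1d(values, centers=[0, 5])
print(centers)
print(clusters)
```

No labels are supplied; the two groups emerge from the distances between the values alone.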
What does data science entail?
processing, analyzing and visualizing data.
Why is Python a popular language for data science?
Because it can:
handle large datasets
work with common mathematical functions
create powerful data visualizations
What is Python Jupyter Notebook?
It’s built around a typical data analysis workflow and is very different from an integrated development environment such as PyCharm, which focuses more on just working with code. In Jupyter, you work with notebooks, which mix plain text, code and code outputs in one view. You can interleave code with markdown text explanations, which enables you to easily explore data, create visualizations and share your results.
What is a kernel in Jupyter Notebook?
The kernel defines the programming language that the code in the notebook will be written in. This is displayed in the top right corner of the notebook. When you run code, it’s executed inside the kernel session.
What strategies should one use to address missing data?
- Remove any rows that contain missing data.
- Populate the empty fields with a specified value.
- Populate the empty fields with a calculated value.
- Use analysis techniques that work with missing data.
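The four strategies above can be sketched in plain Python, using `None` to stand for a missing value. The `ages` list and the variable names are invented for this card; with pandas the same ideas are usually expressed through `dropna` and `fillna`.

```python
from statistics import mean

ages = [34, None, 29, None, 41]  # hypothetical column with missing entries

# 1. Remove entries that are missing
dropped = [age for age in ages if age is not None]

# 2. Populate missing entries with a specified value (0 is an arbitrary choice)
filled_constant = [age if age is not None else 0 for age in ages]

# 3. Populate missing entries with a calculated value (here, the mean of the rest)
average = mean(dropped)
filled_mean = [age if age is not None else average for age in ages]

# 4. Use an analysis that tolerates missing data (here, a count that skips None)
observed_count = sum(1 for age in ages if age is not None)

print(dropped, filled_constant, filled_mean, observed_count)
```

Which strategy fits depends on how much data is missing and why; filling with a constant such as 0 can distort later statistics.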