Data Science Terms and Techniques Flashcards
Hypothesis
an assumption made about the world that can be tested using data; an educated guess that needs to be validated or disproved by experiment and data
Statistical Inference
a branch of statistics dedicated to drawing conclusions about a larger population from smaller data samples
Confidence intervals
an interval estimate used to express the degree of uncertainty associated with a sample statistic
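As a rough illustration of an interval estimate, the sketch below computes an approximate 95% confidence interval for a sample mean using only Python's standard library. The function name `mean_confidence_interval` and the `heights_cm` values are made up for this example, and the normal approximation used here is an assumption; for small samples a t-distribution is usually more appropriate.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


# Hypothetical sample data, for illustration only.
heights_cm = [168.2, 171.5, 165.0, 180.3, 174.8, 169.9, 177.1, 172.4]
low, high = mean_confidence_interval(heights_cm)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

Asking for a higher confidence level (for example 0.99) widens the interval, which is one way to see that the interval expresses uncertainty about the sample statistic.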
Statistical Significance
a measure of how likely a result at least as extreme as the one observed would be if only random chance were at work (the p-value); the smaller the number, the stronger the evidence that the observed event reflects a real effect rather than chance. Statistical significance alone does not establish real-world importance.
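One way to estimate how often a difference could occur randomly is a permutation test. The sketch below is a minimal version using only the standard library; the function name `permutation_p_value` and the two groups of numbers are hypothetical, and a permutation test is just one of several ways to obtain such a number.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate how often random relabeling gives a mean difference
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values, then split them into two new groups.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_permutations


# Hypothetical measurements for two groups, for illustration only.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treated = [101.0, 102.0, 103.0, 104.0, 105.0]
p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value:.3f}")
```

A small value here suggests the gap between the groups is hard to explain by random relabeling alone. It says nothing about whether the gap is large enough to matter in practice.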
Big Data
a collective term for the technology used to analyze large amounts of data to unearth insights, typically into human behavior and patterns
Data Set
a collection of data to be analyzed
Analytics
a collective term for techniques used to analyze data, mostly to draw business insights
Algorithm
a well-defined set of steps to solve a specific problem
Technology Stack
the collective set of tools and programs used in an organization or team
Pre-packaged distribution
a package that bundles the commonly required Python tools and libraries, e.g. numpy, scipy, pandas, scikit-learn, jupyter, matplotlib, seaborn and statsmodels. In the Python world, Anaconda and Canopy are popular distributions for scientific computing and data science.
Regular Expressions
a technique to quickly search for or substitute complex patterns in strings
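A small sketch of both uses, searching and substituting, with Python's built-in `re` module. The sample sentence and the date pattern are invented for this example and assume dates written as YYYY-MM-DD.

```python
import re

# Hypothetical text to search, for illustration only.
text = "Orders shipped on 2021-03-04 and 2021-11-30."

# Three capture groups: year, month, day.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Search: collect every match as a (year, month, day) tuple.
dates = date_pattern.findall(text)

# Substitute: rewrite each date as day/month/year using backreferences.
reformatted = date_pattern.sub(r"\3/\2/\1", text)

print(dates)
print(reformatted)
```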
Jupyter
an interactive notebook tool that grew out of the IPython project (the notebook was formerly called the IPython Notebook); it enables data scientists to prototype code rapidly and combine it with useful documentation
Raw data
data from original or secondary sources that may be unstructured or corrupted and needs further processing before it can be analyzed
Data Wrangling
the process of taking data in its raw form and manipulating it in various ways into a useful form
Messy or Dirty data
data that contains values that are invalid, missing, corrupted, inconsistent or non-uniform
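To make the idea concrete, the sketch below wrangles a few messy records into a uniform shape using plain Python. The rows, the `COUNTRY_ALIASES` table, the 0 to 120 age range, and the helper names `clean_age` and `clean_row` are all assumptions made up for this example; real projects often use a library such as pandas for the same job.

```python
# Hypothetical raw records with stray whitespace, missing values,
# invalid values and inconsistent spellings.
raw_rows = [
    {"name": " Alice ", "age": "34", "country": "us"},
    {"name": "BOB", "age": "", "country": "US"},
    {"name": "carol", "age": "-5", "country": "U.S."},
    {"name": "Dave", "age": "forty", "country": "usa"},
]

# Assumed mapping from inconsistent spellings to one canonical code.
COUNTRY_ALIASES = {"us": "US", "u.s.": "US", "usa": "US"}


def clean_age(value):
    """Return the age as an int, or None when it is missing or invalid."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None
    # Treat ages outside an assumed plausible range as invalid.
    return age if 0 <= age <= 120 else None


def clean_row(row):
    """Normalize one record into a consistent format."""
    country = row["country"].strip().lower()
    return {
        "name": row["name"].strip().title(),
        "age": clean_age(row["age"]),
        "country": COUNTRY_ALIASES.get(country, country.upper()),
    }


clean_rows = [clean_row(row) for row in raw_rows]
for row in clean_rows:
    print(row)
```

Marking bad ages as `None` instead of guessing a value keeps the missing data visible, so a later analysis step can decide how to handle it.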