Lecture 1 Flashcards
Data
A set of discrete, objective facts about events
Dataset
a collection of data with a defined structure
Data point
a single instance in the dataset
Attribute
A single property (column) shared by the data points in a dataset
Data science
a collection of techniques used to extract value from data
process of building a representative model that fits the observational data
Model
representation of a relationship between variables in a dataset
modeling
process in which a representative abstraction is built from the observed dataset
A data science model serves two purposes
- it predicts the output (e.g., an interest rate) from a new and unseen set of input variables
- the model can be used to understand the relationship between the output variable and all the input variables
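The two purposes above can be sketched with a least-squares line, as a minimal illustration only. The credit scores, the interest rates, and the helper names `fit_line` and `predict` are hypothetical and not taken from the lecture.

```python
# Hypothetical toy data: credit score (input) and interest rate in % (output).
credit_scores = [500, 600, 700, 800]
interest_rates = [9.0, 8.0, 7.0, 6.0]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(credit_scores, interest_rates)

def predict(credit_score):
    """Purpose 1: predict the output for a new, unseen input."""
    return slope * credit_score + intercept

predicted_rate = predict(650)

# Purpose 2: the fitted slope describes the relationship. In this toy data
# it is negative, so a higher credit score goes with a lower interest rate.
relationship_is_negative = slope < 0
```

The slope is the part that "explains" and `predict` is the part that "predicts"; real models usually have many input variables rather than one.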
techniques used in the steps of a data science process
- descriptive statistics, exploratory visualization, dimensional slicing, hypothesis testing, data engineering, business intelligence
Supervised model
supervised data science tries to infer a function or relationship from labeled training data and uses this function to map new, unlabeled data to an output
Unsupervised model
uncovers hidden patterns in unlabeled data
Classification and regression techniques
predicting a target variable based on input variables
Clustering
the process of identifying the natural groupings in a dataset
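One way to picture "natural groupings" is a tiny one-dimensional k-means loop. This is a sketch under assumptions: the lecture does not name k-means, and the values, starting centers, and the function name `kmeans_1d` are made up for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Repeatedly assign each value to its nearest center, then move each
    center to the mean of the values assigned to it."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if no value was assigned to it.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements that visibly fall into two groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
```

No labels are supplied; the grouping comes only from how close the values are to each other.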
recommendation engines
systems that recommend items to users based on individual user preferences
anomaly or outlier detection
identifies the data points that are significantly different from other data points in a dataset
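"Significantly different" needs a yardstick. A common simple choice, assumed here rather than stated in the lecture, is the z-score: how many standard deviations a point lies from the mean. The readings, the threshold of 2.0, and the name `find_outliers` are illustrative.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical sensor readings with one obviously unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = find_outliers(readings)
```

A large outlier also inflates the mean and standard deviation it is measured against, which is one reason other detection methods exist.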
time-series forecasting
the process of predicting the future value of a variable based on past historical values that may exhibit a trend and seasonality
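Trend and seasonality can be illustrated with a seasonal naive forecast plus a simple trend adjustment. This is a deliberately simple stand-in, not a method the lecture names; the sales figures and the function name `seasonal_naive_forecast` are hypothetical.

```python
def seasonal_naive_forecast(history, season_length):
    """Forecast the next value as the value from one season ago plus the
    average change between the last two seasons.
    Assumes history holds at least two full seasons."""
    last_season = history[-season_length:]
    previous_season = history[-2 * season_length:-season_length]
    trend = sum(a - b for a, b in zip(last_season, previous_season)) / season_length
    return history[-season_length] + trend

# Hypothetical quarterly sales: the same seasonal shape, shifted up by 4 each year.
quarterly_sales = [10, 20, 30, 40, 14, 24, 34, 44]
next_quarter = seasonal_naive_forecast(quarterly_sales, season_length=4)
```

The value one season back carries the seasonality, and the season-over-season change carries the trend.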
text mining
a data science application where the input data is text, which can be in the form of documents, messages, emails, or web pages
feature selection
A process in which attributes in a dataset are reduced to a few attributes that really matter
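One simple way to decide which attributes "really matter" is to rank them by the strength of their correlation with the target. This is only one of many approaches and is an assumption here; the attribute names, data, and helpers `pearson` and `select_features` are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def select_features(features, target, keep):
    """Keep the `keep` attributes most strongly correlated with the target."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:keep]

# Hypothetical attributes and target (interest rate in %).
features = {
    "credit_score": [500, 600, 700, 800],
    "shoe_size": [9, 7, 10, 8],
}
target = [9.0, 8.0, 7.0, 6.0]
selected = select_features(features, target, keep=1)
```

Correlation only captures linear, one-attribute-at-a-time relationships, so this ranking is a rough first filter.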
association analysis
identifying pairs of items that are purchased together, so that specific items can be bundled or placed next to each other
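Finding items that are purchased together can start with plain pair counting, as sketched below. The baskets are made-up data, and "support" (the share of baskets containing the pair) is a standard term that this flashcard does not itself use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in baskets:
    # Sorting makes ("bread", "butter") and ("butter", "bread") the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
support = top_count / len(baskets)  # fraction of baskets containing the pair
```

Pairs with high support are the candidates for bundling or placing next to each other.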
deep learning
neural networks with many layers, increasingly used for classification and regression problems
Big data
High-volume, high-velocity, and/or high-variety information that requires new forms of processing to enable enhanced decision making, insight discovery, and process optimization
Big data characteristics (5 Vs)
Volume, velocity, variety, veracity, and value
volume
increase in data size coming from a large and growing number of sources
velocity
- increase in the speed of input and output data and the ability to quickly incorporate new data
- ability to quickly add new data sources
Variety
increase in the range and diversity of data types and structures
- structured data
- semi-structured data
- unstructured data
Veracity
valid and truthful data that provides the right direction for future decisions and actions
- data freshness
- quality dimensions (challenges)
- trust, quality, and validity of data
Value
data that has high veracity provides higher value
- usefulness of data for an enterprise
Data science tends to fall into three broad categories
investigating, predicting, and optimizing
Data science tasks
regression, clustering, association analysis, anomaly detection, recommendation engines, deep learning, time-series forecasting, text mining, feature selection, classification