Prelims - honpritz Flashcards
Data encoders, gatherers
Collector
Treat, prepare data
Data engineer
Performs the modeling, testing, and validation
Modeler or Data Scientist
Does the decision making
Business analyst
Data Steward
Collector
Modeler
Data Scientist
It is a multidisciplinary field that uses scientific methods, processes, algorithms, computations, and systems to extract understanding and insights from structured and/or unstructured data.
Data Science
_____ is the mother of invention.
NECESSITY
What era:
REPORT WRITING
Goal: Automation
1970s
What era:
CENTRALIZED SYSTEM
Goal: ERP (Enterprise Resource Planning)/ MIS (Management Info System)
1980s
Goals of the 1980s Centralized system
- Enterprise Resource Planning
- Management Info System
What era:
Business Intelligence
Goal: Apps for everyone
Applications for personal use were invented and made to share (not YET to analyze)
1990s
Goal: Apps for everyone
Applications for personal use were invented and made to share (not YET to analyze)
1990s Business Intelligence
What era:
INTERNET & DATA MINING
2000s
What era:
BIG DATA & Data Science (used for real-time analysis)
2010s
The value in the data haystack is guided by your knowledge of the ____ - not the ___ or ____
domain; tools or techniques
The combination of all skill sets needed to find the value in the data
Analytics
Data under Business Intelligence
- Standard reports (What happened?)
- Ad Hoc, Drill down (Where exactly is the problem?)
- Alerts (What needs attention?)
Data under Predictive analytics
Predictive modeling
“What is the next best action?”
Data under Prescriptive analytics
Optimization
“What is the best thing that can happen?”
Evolution of analytics
Descriptive → Diagnostic → Predictive → Prescriptive → Cognitive
What happened? Describes historical data: Helps understand how things are going
Descriptive
Why did it happen?
Helps understand unique drivers; Segmentation, Statistical, & Sensitivity analysis
Diagnostic
What could happen? Forecast future performance, events, and results
Predictive
How to make it happen?
Analysis that suggests a prescribed action
Prescriptive
What to do, why & how?
Proactive action
Learn at scale
Reason with purpose
Interact naturally
Cognitive
Data Science & Analytics:
in health care
- Medical Image analysis
- Machine Learning in Disease Diagnosis
- Genetics & Genomics
- Drug Development
- Virtual assistance for patients and customer support
Finding useful patterns in data.
Data Mining
It is the process of knowledge discovery, machine learning, and predictive analytics.
Data Mining
Data Mining
- Extracting Meaningful Patterns.
- Building Representative Models.
- Combination of Statistics, Machine Learning, and Computing
- Algorithms
DATA MINING: Types of Learning Models
- Supervised
- Unsupervised
Data Mining is NOT about:
- Descriptive statistics.
- Exploratory visualization.
- Dimensional slicing
- Hypothesis testing
- Queries
directed data mining
Supervised Learning Model
The model generalizes the relationship between the input and output variables.
Supervised Learning Model
Undirected data mining
Unsupervised Learning Model
The objective of this class of data mining techniques is to find patterns in data based on the relationship between data points themselves
Unsupervised Learning Model
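The contrast between the two learning models can be sketched in plain Python. This is only an illustration under assumed toy data: `fit_line` (a least-squares line) stands in for a supervised model that generalizes from inputs to a known output, and `two_means` (one-dimensional k-means with k = 2) stands in for an unsupervised model that groups points using only their closeness to one another. Both function names and all numbers are hypothetical and do not come from the cards.

```python
from statistics import mean

# Supervised: labeled pairs (x, y); learn y ~ slope * x + intercept.
def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

# Unsupervised: no labels; split points into two groups by distance alone.
def two_means(points, iterations=10):
    centers = [min(points), max(points)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
centers, groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(slope, intercept)
print(groups)
```

The supervised function needs the output column `ys` to learn from; the unsupervised one receives only the points themselves.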
DATA MINING: Groups of Learning Models
- Classification Models
- Regression Models
- Clustering Models
- Anomaly Detection
- Time Series Forecasting
- Association
- Text and Sentiment Analysis
DATA MINING: Steps
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Testing and Evaluation
- Deployment
the process of preparing data for analysis by removing or modifying incorrect, incomplete, irrelevant, duplicated, or improperly formatted data.
Data Cleaning
variables of a given data set; Represented by columns
Attributes
Cases or observations of a given data set
Represented by rows
Examples
Functions or building blocks that create processes for data analysis
Operators
Parts of the RapidMiner Interface
- Canvas / Process Panel
- Repository / Source Tabs
- Operators / Analysis Tabs
- Parameter Tabs
- Description Tabs
Working area for building processes
Canvas or the Process Panel
Storage within RapidMiner Studio for data and RapidMiner processes
Repository / Source Tabs
Building blocks used to create RapidMiner processes
Operators / Analysis Tabs
Settings that modify operator behavior
Parameter Tabs
context-sensitive help for selected operator
Help
work area for accessing specific functionality
Views
Methods of Importing Data
- From Repository
- “Read Excel” Operator
many different string values (for example: red, green, blue, yellow)
polynominal
exactly two values (for example: true/false, yes/no)
binominal
a fractional number (for example: 11.23 or -0.0001).
real
a whole number (for example: 23, -5, or 11,024,768).
integer
both date and time (for example: 23.12.2014 17:59).
date_time
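As a rough illustration of how these value types differ, the sketch below assigns a type name to a column of Python values. `guess_value_type` is a hypothetical helper written for this example; it is a simplification and not how RapidMiner Studio infers types. RapidMiner Studio spells its two nominal types `binominal` and `polynominal`, and the sketch uses those spellings.

```python
def guess_value_type(values):
    """Hypothetical helper: map a column of Python values to a type name."""
    def is_number(v, kinds):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, kinds) and not isinstance(v, bool)

    if all(is_number(v, int) for v in values):
        return "integer"      # whole numbers only
    if all(is_number(v, (int, float)) for v in values):
        return "real"         # at least one fractional number
    # Non-numeric column: the number of distinct values decides the type.
    return "binominal" if len(set(values)) == 2 else "polynominal"

print(guess_value_type([23, -5, 11024768]))
print(guess_value_type([11.23, -0.0001]))
print(guess_value_type(["yes", "no", "yes"]))
print(guess_value_type(["red", "green", "blue", "yellow"]))
```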
Operator used for filtering cases
Filter Examples
Operator used for removing all cases with missing values
Filter Examples
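The two uses of filtering above (keeping cases that meet a condition, and dropping cases with missing values) can be sketched in plain Python. This is an analogy only, not RapidMiner code; the sample rows, the attribute names, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
rows = [
    {"name": "Ana", "age": 34, "city": "Cebu"},
    {"name": "Ben", "age": None, "city": "Davao"},
    {"name": "Cara", "age": 19, "city": None},
    {"name": "Dan", "age": 52, "city": "Cebu"},
    {"name": "Eli", "age": 25, "city": "Manila"},
]

# Filter by a condition: keep cases whose age is known and at least 30.
age_30_plus = [r for r in rows if r["age"] is not None and r["age"] >= 30]

# Remove every case that has a missing value in any attribute.
complete_cases = [r for r in rows if all(v is not None for v in r.values())]

print([r["name"] for r in age_30_plus])
print([r["name"] for r in complete_cases])
```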
Operator used for imputing missing data
Replace Missing Values
To remove “white spaces” in the encoding, use the _____ operator.
TRIM
To remove “duplicates” in the encoding, use the _________ operator.
Remove Duplicates
To recode miscoded values, use the ______ operator.
REPLACE
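The three cleaning steps above (trimming white space, removing duplicates, recoding miscoded values) can be sketched in plain Python. This is an analogy to what the operators do, not RapidMiner code; the sample rows and the `recode` mapping are hypothetical. The sketch recodes before removing duplicates, on the assumption that rows differing only in spelling should count as the same case.

```python
# Each row is (name, sex) as it might arrive from manual encoding.
raw = [("Ana ", "F"), (" Ben", "Male"), ("Ana", "female"), ("Cara", " M")]

# Trim: strip leading and trailing white space from every field.
trimmed = [tuple(field.strip() for field in row) for row in raw]

# Replace: recode miscoded values to one agreed spelling.
recode = {"F": "Female", "female": "Female", "M": "Male"}
recoded = [(name, recode.get(sex, sex)) for name, sex in trimmed]

# Remove Duplicates: keep the first occurrence of each identical row.
unique_rows = list(dict.fromkeys(recoded))

print(unique_rows)
```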
Use the ________ operator to select the attributes that you need for analysis.
Select Attributes
Use the _____ operator to tag the attribute that will be used as the label (Target Variable) or any other role it will play in the analysis.
Set Role
If two data sets need to be merged for an analysis, use the ____ operator.
Join
Joining Two Data Sets:
In the parameter tab, use _____ as join type.
Inner
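An inner join keeps only the cases whose key appears in both data sets. The sketch below shows that behavior in plain Python; `inner_join`, the two sample tables, and the key name `id` are hypothetical and serve only to illustrate the idea, not RapidMiner's implementation.

```python
customers = [
    {"id": 1, "name": "Ana"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cara"},
]
orders = [
    {"id": 1, "total": 250},
    {"id": 3, "total": 90},
    {"id": 4, "total": 40},
]

def inner_join(left, right, key):
    # Index the right-hand rows by key so each lookup is a dict access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Emit one merged row per matching pair; unmatched rows are dropped.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

joined = inner_join(customers, orders, "id")
print(joined)
```

Ben (id 2) has no order and order 4 has no customer, so neither appears in the result.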
graphical representation of data
Data Visualization
techniques used to communicate insights from data through visual representation.
Data Visualization
to distill large datasets into visual graphics to allow for easy understanding of complex relationships within the data
Data Visualization
to analyze massive amounts of information and make data-driven decisions.
Data Visualization
Visualization Technique:
to compare counts, percentages, or other measures (such as averages) for different discrete categories of data
Bar Graph
Visualization Technique: to observe trend
Line Graph
Visualization Technique:
shows the relative contribution of different categories to an overall total
Pie Graph
Visualization Technique:
the frequency distribution of a continuous attribute
Histogram
(Bar vs Histo)
represents a categorical attribute
Bar graph
(Bar vs Histo)
represents a numerical attribute
histogram
(Bar vs Histo)
have spaces between bars
Bar graph
(Bar vs Histo)
do not have spaces between bars
Histogram
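The bar graph versus histogram distinction comes down to what is being counted. A minimal sketch with made-up data: a bar graph counts occurrences of each discrete category, while a histogram counts how many numeric values fall into each adjacent interval (bin). The `histogram` helper, the bin width, and the sample values are assumptions chosen for the example.

```python
from collections import Counter

colors = ["red", "blue", "red", "green", "blue", "red"]                # categorical
heights_cm = [150.5, 152.0, 158.3, 161.2, 164.9, 167.0, 171.4, 179.9]  # numerical

# Bar graph data: one count per distinct category.
bar_counts = Counter(colors)

# Histogram data: one count per contiguous interval [start + i*width, start + (i+1)*width).
def histogram(values, start, width, bins):
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:
            counts[i] += 1
    return counts

hist_counts = histogram(heights_cm, start=150, width=10, bins=3)

print(dict(bar_counts))
print(hist_counts)
```

Because the bins are contiguous intervals on one numeric axis, histogram bars touch; the categories of a bar graph have no such ordering, so its bars are drawn apart.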
Visualization Technique:
plots two numerical attributes
Scatterplot
Visualization Technique:
graphical representation of the quartiles
Boxplots
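The quartiles a boxplot draws can be computed with Python's standard `statistics.quantiles`. The data below are made up, and the choice of `method="inclusive"` is an assumption: plotting tools differ in how they interpolate quartiles, so another tool may show slightly different box edges for the same data.

```python
from statistics import quantiles

values = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]

# n=4 splits the data into four parts and returns the three cut points.
q1, median, q3 = quantiles(values, n=4, method="inclusive")

# Five-number summary: the values a basic boxplot is built from.
summary = (min(values), q1, median, q3, max(values))
print(summary)
```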
process performed to decide which examples are kept and which are removed
Data filtering
Visualization Technique:
a graphical representation of data where the individual values contained in a matrix (map) are represented as colors.
Heat maps
replaces missing values by the attribute’s minimum, maximum, or average value.
Missing Value Imputation
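The three imputation choices named above can be sketched in plain Python. `impute` is a hypothetical helper, the sample column is made up, and `None` is assumed to mark a missing value; this illustrates the idea and is not RapidMiner's implementation.

```python
from statistics import mean

def impute(values, method="average"):
    """Fill missing entries (None) with the column's minimum, maximum, or average."""
    known = [v for v in values if v is not None]
    fill = {"minimum": min, "maximum": max, "average": mean}[method](known)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]

print(impute(ages, "average"))
print(impute(ages, "minimum"))
print(impute(ages, "maximum"))
```

The fill value is computed from the known values only, so the missing entries do not influence it.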
Imputation method is selected in the ?
Default
Use the ______ operator to create a RapidMiner data set from the process
Store
Use the ______ operator to store the data in a format you want.
Write ***