Week 2 Flashcards
Data Science
the exploration and quantitative analysis of all available structured and unstructured data to develop understanding, extract knowledge and formulate actionable results
Business Intelligence
strategies and technologies used by enterprises for the data analysis of business information.
CRISP-DM
provides useful input on ways to frame analytics problems and is a popular approach for data mining. Its six phases are: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Framing a Decision
outline what decision is being considered, why it is important, what data is needed, and who will provide input. Corresponds to the Business Understanding, Data Understanding, and Data Preparation phases of CRISP-DM.
Analyzing a Decision
what kind of analytical approach is needed, what does it show, and what does it mean. Corresponds to the Modeling phase of CRISP-DM.
Implementing a Decision
how do I make use of the decision, what can I expect, what else should be considered, and how do I “sell” the result. Corresponds to the Evaluation and Deployment phases of CRISP-DM.
Data Modeling Blocks
1. Data, 2. Build Model, 3. Infer hidden variables, 4. Predict & Explore
Interpretation Error and Inconsistencies
Taking the values in your data for granted, and differences between data sources and the company’s standardized values.
Cleansing Data
Fixing interpretation errors and inconsistencies: data entry errors, redundant whitespace, capital letter mismatches, outliers, missing values, different units of measurement, different levels of aggregation, deviations from a code book, and impossible values (sanity checks).
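A few of these cleansing steps can be sketched with pandas. The column names and values below are made up for illustration; mean imputation is just one of several strategies for missing values.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with common quality problems
df = pd.DataFrame({
    "city": ["  Boston", "boston ", "NEW YORK", "New York"],
    "sales": [120.0, np.nan, 95.0, 88.0],
})

# Redundant whitespace: strip leading/trailing spaces
df["city"] = df["city"].str.strip()

# Capital letter mismatching: normalize casing
df["city"] = df["city"].str.title()

# Missing values: impute with the column mean (one simple strategy)
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Sanity check: impossible values (negative sales)
assert (df["sales"] >= 0).all()
```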
Integrating Data
Combining data from different data sources. Joining/Appending Data, Appending Tables, Using Views to Simulate Data Joins and Appends, Enriching Aggregated Measures.
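Joining and appending can be sketched in pandas; the `orders`/`customers` tables here are hypothetical. `merge` enriches observations from one table with another (joining), while `concat` stacks rows (appending).

```python
import pandas as pd

# Hypothetical tables from two data sources
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})

# Joining: enrich each order with customer information
enriched = orders.merge(customers, on="customer_id", how="left")

# Appending/stacking: add this month's orders below last month's
orders_feb = pd.DataFrame({"customer_id": [4], "amount": [30]})
all_orders = pd.concat([orders, orders_feb], ignore_index=True)
```

With `how="left"`, orders without a matching customer (id 3 here) are kept, with `region` left missing.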
Transforming Data
shaping data into the form a model requires, e.g. reducing the number of variables or turning categorical variables into dummy variables.
Data Retrieval
data stored within the company, data from outside the organization, and data quality checks.
Data Preparation
fix problems in the data; create derived variables.
Exploratory Data Analysis
the use of graphical techniques to gain an understanding of your data and the interactions between variables.
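A minimal first pass at EDA, with made-up data: summary statistics describe each variable, and the correlation matrix hints at interactions between variables (a scatter plot would typically follow to inspect the relationship graphically).

```python
import pandas as pd

# Hypothetical dataset for a first exploratory pass
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 71, 80],
})

# Summary statistics: distribution of each variable
print(df.describe())

# Interactions between variables: pairwise correlation
print(df.corr())
```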
Joining
enriching an observation from one table with information from another.
Appending/Stacking
Adding the observations of one table to those of another table.
Dummy Variables
Can only take two values: true (1) and false (0). Used to indicate the presence or absence of a categorical effect that may explain the observation.
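One common way to create dummy variables is pandas' `get_dummies`, which produces one 0/1 indicator column per category; the `color` column here is a hypothetical example.

```python
import pandas as pd

# Hypothetical categorical variable to encode
df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# One indicator (dummy) column per category: 1/True if present, 0/False if absent
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies)
```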
Unsupervised Learning
Algorithm does not have past data cases with inputs and the output of interest identified; it “attempts” to learn something interesting about the data on its own.
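Clustering is a typical unsupervised example: no labels are given, yet the algorithm finds structure. A minimal sketch with scikit-learn's k-means and fabricated 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two obvious groups, but no output variable is provided
X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])

# k-means "attempts" to discover the grouping on its own
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```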
Data Partitioning
Training 60%, Validation 30%, Test 10%.
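A 60/30/10 split can be produced with two calls to scikit-learn's `train_test_split`: first carve off 60% for training, then split the remaining 40% into validation and test (0.75 × 40% = 30%, 0.25 × 40% = 10%). The feature matrix below is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # hypothetical feature matrix, 100 rows
y = np.arange(100)                 # hypothetical target

# First split off the 60% training set...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=0)

# ...then divide the remaining 40% into validation (30%) and test (10%)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.75, random_state=0)
```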
Technical Data Scientist
designs solution from scratch.
Business Data Scientist
monitors and applies the solution rather than designing it from scratch. Not as technically deep as a Technical Data Scientist.
Databases
structured with defined schema. Items are organized as a set of tables with columns and rows. Transactional.
Data marts
stores data from a data warehouse; a subject-oriented, partitioned segment of an enterprise data warehouse.
Data Warehouses
exists on top of databases and is used for business intelligence. Consumes data from databases and creates a layer optimized for data analytics. Schema is applied on import.
Data Lakes (Big Data)
centralized repository of structured and unstructured data. Stores raw data without structure (schema); no ETL or transformation jobs are required before loading.