Exploratory Data Analysis for Machine Learning Flashcards
In this overview, we will discuss:
- Define Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
- Explain how DL helps solve classical ML limitations.
- Explain key historical developments and the cycle of hype and AI winters.
- Differentiate modern AI from prior AI.
- Relate sample applications of AI.
Artificial Intelligence (AI)
A program that can sense, reason, act, and adapt.
Machine Learning
Algorithms whose performance improves as they are exposed to more data over time.
Deep Learning
Subset of machine learning in which multilayered neural networks learn from vast amounts of data.
Artificial Intelligence (dictionary definition)
A branch of computer science dealing with the simulation of intelligent behavior in computers. - Merriam-Webster
Machine Learning
The study and construction of programs that are not explicitly programmed, but learn patterns as they are exposed to more data over time.
Two types of Machine Learning
- Supervised Learning
- Unsupervised Learning
Supervised Learning
- Dataset: has a target column
- Goal: make predictions
- Example: fraud detection
Unsupervised Learning
- Dataset: does not have a target column
- Goal: find structure in the data
- Example: customer segmentation
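As a minimal sketch of unsupervised learning, the following uses scikit-learn's KMeans to segment hypothetical customers (all numbers invented for illustration) with no target column:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data (invented): [annual spend, visits per month].
customers = np.array([
    [200, 2], [250, 3], [220, 2],
    [900, 10], [950, 12], [880, 9],
])

# No target column: KMeans finds structure (here, two segments) on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(segments)  # e.g., [0 0 0 1 1 1] -- each customer's segment label
```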
Machine Learning (example)
- Suppose you wanted to identify fraudulent credit card transactions.
- You could define the features to be:
  - Transaction time
  - Transaction amount
  - Transaction location
  - Category of purchase
- The algorithm could learn what feature combinations suggest unusual activity, as in the sketch below.
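A minimal supervised sketch of this idea, with invented feature values and labels (a real fraud model would need far more data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features (invented): [hour of day, amount, km from home, category id].
X = np.array([
    [14, 35.0, 2, 1],
    [15, 12.5, 1, 0],
    [3, 900.0, 500, 2],   # odd hour, large amount, far from home
    [13, 20.0, 3, 1],
])
y = np.array([0, 0, 1, 0])  # target column: 1 = fraudulent

# The supervised model learns which feature combinations suggest unusual activity.
model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict([[2, 850.0, 480, 2]]))  # a similar transaction is likely flagged
```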
Deep Learning
Machine learning that involves using very complicated models called deep neural networks.
- The models determine the best representation of the original data; in classic machine learning, humans must do this.
Deep Learning (example)
- Classic machine learning: (1) feature detection, then (2) a machine learning classifier algorithm, producing the output (e.g., recognizing "Arjun" in an image).
- Deep learning: steps 1 and 2 are combined into one step, using a complex neural network model.
History of AI
- AI has experienced several hype cycles, oscillating between periods of excitement (AI booms) and disappointment (AI winters).
AI solutions include speech recognition, computer vision, assisted medical diagnosis, robotics, and others.
Learning Goals
In this section, we will cover:
- Background and tools used in this course.
- Machine learning workflow
- Machine learning vocabulary
Background and Tools
- Examples assume familiarity with:
  - Python libraries (e.g., NumPy and Pandas) and Jupyter Notebooks.
  - Basic statistics, including probability, calculating moments, and Bayes' rule.
Examples use IPython (via JupyterLab/Notebook), with the following libraries:
- NumPy
- Pandas (we will usually read data into a Pandas DataFrame)
- Matplotlib
- Seaborn
- Scikit-learn
- TensorFlow
- Keras
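A typical import block for such examples might look like the following (a sketch; exact imports vary by example, and scikit-learn is imported per module):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression  # one of many scikit-learn modules
import tensorflow as tf
from tensorflow import keras
```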
Machine Learning Workflow
- Problem statement: What problem are you trying to solve?
- Data collection: What data do you need to solve it?
- Data exploration and preprocessing: How should you clean your data so your model can use it?
- Modeling: Build a model to solve the problem.
- Validation: Did the model solve the problem?
- Decision making and deployment: Communicate the results to stakeholders, or put the model into production.
Machine Learning Vocabulary
- Target: the category or value that we are trying to predict.
- Features: properties of the data used for prediction (explanatory variables).
- Example/Observation: a single data point within the data (one row).
- Label: the target value for a single data point.
Modern AI
Factors that have contributed to the current state of machine learning include bigger datasets, faster computers, open-source packages, and a wide range of neural network architectures.
Learning Goals
In this section, we will cover:
- Retrieving data from multiple data sources:
- SQL databases
- NoSQL databases
- APIs
- Cloud data sources
- Understanding common issues that arise when importing data.
Reading CSV Files
Comma-separated values (CSV) files consist of rows of data whose values are separated by commas.
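A minimal sketch of reading a CSV with Pandas, assuming a local file named data.csv exists:

```python
import pandas as pd

# Default behavior: comma separator, first row treated as the header.
df = pd.read_csv("data.csv")
print(df.head())

# Other delimiters (e.g., tabs) can be specified explicitly.
df_tabs = pd.read_csv("data.tsv", sep="\t")
```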
JSON Files
JavaScript Object Notation (JSON) files are a standard way to store data across platforms.
JSON files are very similar in structure to Python dictionaries.
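A minimal sketch, assuming a local file named data.json exists:

```python
import json
import pandas as pd

# json.load mirrors the file's structure as Python dictionaries and lists.
with open("data.json") as f:
    data = json.load(f)

# Tabular JSON can also be read straight into a Pandas DataFrame.
df = pd.read_json("data.json")
```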
SQL Databases
Structured Query Language (SQL) databases are relational databases with fixed schemas.
There are many types of SQL databases, which function similarly (with some subtle differences in syntax).
Examples of SQL databases:
- Microsoft SQL Server
- PostgreSQL
- MySQL
- AWS Redshift
- Oracle DB
- Db2 family
Not-only SQL (NoSQL)
NoSQL databases are not relational and vary more in structure. Depending on the application, they may perform more quickly or reduce technical overhead. Most NoSQL databases store data in JSON format.
Examples of NoSQL databases:
- Document databases: MongoDB, CouchDB
- Key-value stores: Riak, Voldemort, Redis
- Graph databases: Neo4j, HyperGraphDB
- Wide-column stores: Cassandra, HBase
APIs and Cloud Data Access
A variety of data providers make data available via Application Programming Interfaces (APIs), which make it easy to access such data via Python.
- There are also a number of datasets available online in various formats.
- One example available online is the UC Irvine (UCI) Machine Learning Repository.
- Here, we read one of its datasets into Pandas directly via the URL, as in the sketch below.
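A sketch of that pattern; the Iris dataset URL and column names below are assumptions for illustration:

```python
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# The file has no header row, so we supply column names ourselves.
df = pd.read_csv(url, header=None, names=cols)
print(df.head())
```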
Reading SQL Data
- While this example uses sqlite3, there are several other packages available.
- The sqlite3 module creates a connection with the database.
- Data is read into Pandas by combining a query with this connection.
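A minimal sketch of those steps, assuming a local SQLite file data.db containing a hypothetical transactions table:

```python
import sqlite3
import pandas as pd

# Create a connection with the database.
conn = sqlite3.connect("data.db")

# Read data into Pandas by combining a query with this connection.
query = "SELECT * FROM transactions"
df = pd.read_sql(query, conn)
conn.close()
```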
Reading NoSQL Data
- This example uses the pymongo module to read files stored in MongoDB, although there are several other packages available.
- We first make a connection with the database (MongoDB needs to be running).
- Data is read into Pandas by combining a query with this connection.
- Here, query should be replaced with a MongoDB query document (or {} to select all).
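A minimal sketch of those steps; the database and collection names are hypothetical:

```python
import pandas as pd
from pymongo import MongoClient

# Make a connection with the database (MongoDB needs to be running).
client = MongoClient("mongodb://localhost:27017/")
collection = client["my_database"]["my_collection"]

# Replace query with a MongoDB query document, or {} to select all.
query = {}
df = pd.DataFrame(list(collection.find(query)))
```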
Data Cleaning for Machine Learning
Learning Goals:
In this section, we will cover:
- Why data cleaning is important for machine learning.
- Issues that arise with messy data.
- How to identify duplicate or unnecessary data.
- Policies for dealing with outliers.
Why is data cleaning so important?
- Decisions and analytics are increasingly driven by data and models.
Key aspects of the machine learning workflow depend on cleaned data:
- Observations: an instance of the data (usually a point or row in a dataset).
- Labels: the output variable(s) being predicted.
- Algorithms: computer programs that estimate models based on available data.
- Features: information we have for each observation (variables).
- Model: the hypothesized relationship between observations and data.
Why is data cleaning so important?
Messy data can lead to a garbage-in, garbage-out effect and unreliable outcomes.
The main data problems companies face:
- Too much data
- Lack of data
- Bad data
Having data ready for ML and AI ensures you are ready to infuse AI across your organization.
How can data be messy?
- Duplicate or unnecessary data
- Inconsistent text and typos
- Missing data
- Outliers
- Data sourcing issues:
  - Multiple systems
  - Different database types
  - On premises vs. in the cloud
  - And more
Duplicate or unnecessary data
- Pay attention to duplicate values and research why there are multiple values.
- It's a good idea to look at the features you are bringing in and filter the data as necessary (be careful not to filter too much if you may use those features later), as in the sketch below.
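A short Pandas sketch of spotting and removing duplicates (data invented for illustration):

```python
import pandas as pd

# Hypothetical data with one exact duplicate row.
df = pd.DataFrame({"id": [1, 2, 2, 3], "amount": [10, 20, 20, 30]})

print(df.duplicated())     # flags the repeated row -- research why it exists
df = df.drop_duplicates()  # remove exact duplicates once understood
```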
Policies for Missing Data
- Remove the data: remove the affected rows entirely.
- Impute the data: replace missing values with substituted values, e.g., fill in the missing data with the most common value or the average value.
- Mask the data: create a category for missing values.
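A minimal Pandas sketch of the three policies, using an invented age column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value.
df = pd.DataFrame({"age": [25, 31, np.nan, 40]})

# Remove: drop rows with missing values entirely.
removed = df.dropna()

# Impute: fill in the missing data with a substituted value (here, the average).
imputed = df.fillna(df["age"].mean())

# Mask: create an explicit indicator/category for missing values.
masked = df.assign(age_missing=df["age"].isna())
```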
What are the pros and cons for each of these approaches?
Remove the data: pros and cons
- Pros
  * It will quickly clean your dataset without having to guess an appropriate replacement value.
- Cons
  * If certain columns are missing values for many rows, we may end up losing too much information, or end up with a dataset that is biased by whatever reason the data was not collected.
Impute the data: pros and cons
- Pros
  * We don't lose full rows or columns that may be important for our model, as we would when removing full rows.
- Cons
  * We add another level of uncertainty to our model, as it is now based on estimates of what we think the true missing values would have been.
Outliers
- An outlier is an observation in data that is distant from most other observations.
- Typically, these observations are aberrations and do not accurately represent the phenomenon we are trying to explain through the model.
- If we do not identify and deal with outliers, they can have a significant impact on the model.
- It is important to remember that some outliers are informative and provide insights into the data.
How to find outliers?
- Plots (histogram, density plot, box plot)
- Statistics (interquartile range, standard deviation)
- Residuals (standardized, deleted, studentized)
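For example, a minimal sketch of the interquartile-range rule (data invented; 1.5 x IQR is the common convention):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(s[(s < lower) | (s > upper)])  # 95 is flagged as an outlier
```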
Residuals
Residuals (the differences between the actual and predicted values of the outcome variable) indicate where the model fails; unusually large residuals can flag outliers.
Approaches to calculating residuals:
- Standardized: the residual divided by its standard error.
- Deleted: the residual from fitting the model on all data excluding the current observation.
- Studentized: the deleted residual divided by the residual standard error (based on all data, or all data excluding the current observation).
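A minimal NumPy sketch of standardized residuals from a simple linear fit (data invented for illustration):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0])  # the last point is suspicious

# Fit a line and compute residuals (actual minus predicted).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Standardized residuals: residual divided by the residual standard error
# (ddof=2 because the simple linear fit estimates two parameters).
standardized = residuals / residuals.std(ddof=2)
print(standardized)  # the outlier stands out with a large absolute value
```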
Policies for outliers
- Remove them.
- Assign the mean or median value.
- Transform the variable.
- Predict what the value should be:
  - Using similar observations to predict likely values.
  - Using regression.
- Keep them, but focus on models that are resistant to outliers.
Learning Goals
In this section, we will cover:
- Approaches to conducting exploratory data analysis (EDA)
- EDA techniques
- Sampling from DataFrames
- Producing EDA visualizations
What is Exploratory Data Analysis?
Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods.
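A minimal EDA sketch using a sample dataset bundled with Seaborn, combining summary statistics, sampling from a DataFrame, and a visualization:

```python
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("iris")  # sample dataset shipped with Seaborn

print(df.describe())  # summary statistics of the main characteristics
print(df.sample(5))   # random sample of five rows from the DataFrame

sns.pairplot(df, hue="species")  # visual summary of pairwise relationships
plt.show()
```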