Week 2 Flashcards
5 Steps in the Machine Learning Process
- Data Collection
- Data Exploration and Preparation
- Model Training
- Model Evaluation
- Model Improvement
Data Collection
Involves gathering the learning material (the data set) that an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source, such as a spreadsheet or data set.
Data Exploration and Preparation
Exploring the data generally involves getting a basic understanding of a dataset through numerous variable summaries and visual plots. Through this exploration, we will often identify problems with the data, including missing values, noise, erroneous data, and skewed distributions.
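A minimal sketch of this kind of exploration, assuming the built-in `airquality` data set as a stand-in (it is not part of these notes, but it ships with R and contains missing values):

```r
# airquality ships with base R's datasets package (used here only as a stand-in)
data(airquality)

# Structure: variable names, types, and the first few values
str(airquality)

# Per-variable summaries: min, quartiles, mean, max, and NA counts
summary(airquality)

# Count missing values in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

For the visual side, base R functions such as `hist()` and `boxplot()` can be applied to a single column to look for skew or outliers.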
Model Training
A subset of the entire data set, the training data, is used to build the model.
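One common way to obtain such a subset is a random split. The sketch below assumes the built-in `iris` data set and a 75% training share; both are illustrative choices, not something these notes prescribe.

```r
set.seed(42)                       # make the random split reproducible

n <- nrow(iris)                    # iris is a built-in data set with 150 rows
train_idx <- sample(n, size = floor(0.75 * n))  # assumed 75% training share

train_data <- iris[train_idx, ]    # rows used to build the model
test_data  <- iris[-train_idx, ]   # held-out rows kept for evaluation

nrow(train_data)
nrow(test_data)
```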
Model Evaluation
- Because each machine learning model results in a biased solution to the learning problem, it is important to evaluate how well the algorithm has learned from its experience, and to re-evaluate it as more data becomes available.
- It is the stage in the machine learning process where you test how well the algorithm has learned from the data.
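For a classification task, one simple way to test this is to compare predicted labels with the actual ones. The labels below are hypothetical and exist only to illustrate the calculation; accuracy is just one of several possible measures.

```r
# Hypothetical labels, for illustration only
actual    <- c("flu", "healthy", "flu", "healthy", "healthy")
predicted <- c("flu", "healthy", "healthy", "healthy", "flu")

# Confusion table: rows are actual classes, columns are predicted classes
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy: the share of predictions that match the actual label
accuracy <- mean(predicted == actual)
print(accuracy)
```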
Model Improvement
If better performance is needed, you might need to supplement the data with additional data or perform additional preparation steps to obtain more accurate data.
Vectors
- The fundamental data structure of R
- All the elements in a vector must be of the same type
- Either all elements are character, or all are numeric, or all are logical
- Two special values: NULL indicates the absence of any value, and NA indicates a missing value
- R vectors are ordered, so an element is accessed by its position in the vector
- Indexing always begins at 1
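The properties above can be checked directly in R. This is a small sketch with hypothetical values:

```r
# Hypothetical readings; the third one is missing
temperature <- c(98.1, 98.6, NA)

temperature[1]        # indexing starts at 1, so this is the first element
is.na(temperature)    # flags which elements are missing

# Mixing types forces every element to a single common type
mixed <- c(1, "a", TRUE)
class(mixed)

# NULL is the absence of a value, so it adds nothing to a vector
no_null <- c(1, NULL, 3)
length(no_null)
```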
How are Vectors created?
Vectors are created using the c() (combine) function.
Example :
subjects
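A sketch of what creating vectors with `c()` can look like. The variable names and values here are hypothetical and are not taken from the original example:

```r
# One vector per basic type (all values are hypothetical)
subjects <- c("Math", "Physics", "History")   # character
scores   <- c(90, 85.5, 77)                   # numeric
passed   <- c(TRUE, TRUE, FALSE)              # logical

typeof(subjects)
typeof(scores)
typeof(passed)

length(subjects)   # number of elements in the vector
```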
Factors
- A special case of the vector that is solely used to represent categorical or ordinal variables.
- Example: Creating a factor from a vector
- subjects
- Faculty
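A minimal sketch of turning a character vector into a factor with `factor()`, assuming hypothetical subject names:

```r
# Hypothetical character vector with a repeated value
subjects <- c("Math", "Physics", "Math", "History")

# factor() stores each distinct value once as a level
subjects_factor <- factor(subjects)

levels(subjects_factor)       # the distinct categories, sorted alphabetically
table(subjects_factor)        # how often each category occurs
as.integer(subjects_factor)   # the integer codes stored internally
```

Storing the category labels once and keeping integer codes for the elements is what makes a factor more compact than a plain character vector.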
Categorical vs Ordinal Variables
- A categorical (nominal) variable has two or more categories with no intrinsic ordering (female, male)
- An ordinal variable has a clear ordering of its categories (high, medium, low)
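In R the difference shows up as an unordered versus an ordered factor. The sketch below uses hypothetical values and the `levels` and `ordered` arguments of `factor()`:

```r
# Nominal: categories without an order
gender <- factor(c("female", "male", "female"))

# Ordinal: levels listed from lowest to highest, with ordered = TRUE
severity <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)

is.ordered(gender)
is.ordered(severity)

# Comparisons are only meaningful for ordered factors
below_high <- severity < "high"
print(below_high)
```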
List
- Used for storing an ordered set of elements
- Unlike a vector, a list can hold elements of different types
- subject1 <- list(
- temperature = temperature[1],
- flu_status = flu_status[1],
- gender = gender[1],
- blood = blood[1]
- )
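A runnable sketch of the list example. The source vectors are not shown in these notes, so the values assigned to `temperature`, `flu_status`, `gender`, and `blood` below are hypothetical:

```r
# Hypothetical source vectors (values are assumptions)
temperature <- c(98.1, 98.6, 101.4)
flu_status  <- c(FALSE, FALSE, TRUE)
gender      <- factor(c("MALE", "FEMALE", "MALE"))
blood       <- factor(c("O", "AB", "A"), levels = c("A", "B", "AB", "O"))

# A list can mix numeric, logical, and factor elements
subject1 <- list(
  temperature = temperature[1],
  flu_status  = flu_status[1],
  gender      = gender[1],
  blood       = blood[1]
)

subject1$temperature       # access an element by name
subject1[["flu_status"]]   # equivalent access with double brackets
str(subject1)              # compact view of every element and its type
```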
Data Frame
- Most used data structure in R
- List of vectors or factors, each having exactly the same number of values
- syntax: data.frame()
- example:
- emp.data <- data.frame(
- emp_id = c(1:3),
- emp_name = c("Pat", "John", "Mike"),
- stringsAsFactors = FALSE
- )
Data Frame (Contd.)
- Analogous to spreadsheet
- A data frame is two-dimensional and is often displayed like a matrix (rows and columns)
- In the language of machine learning, columns are often called 'features' or 'attributes', while rows are called 'examples'
To extract a column:
emp.data$emp_name
emp.data[c("emp_name", "salary")]
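A self-contained sketch of building a data frame and extracting columns from it. The `salary` column and its values are hypothetical; they are added only so that the two-column extraction has something to select.

```r
emp.data <- data.frame(
  emp_id   = 1:3,
  emp_name = c("Pat", "John", "Mike"),
  salary   = c(52000, 61000, 58000),   # hypothetical values
  stringsAsFactors = FALSE             # keep names as character, not factor
)

emp.data$emp_name                      # one column, returned as a vector
emp.data[c("emp_name", "salary")]      # several columns, returned as a data frame

nrow(emp.data)   # number of examples (rows)
ncol(emp.data)   # number of features (columns)
```

Note that R is case-sensitive, so `emp.data` and `Emp.data` would be two different objects.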
Matrix
- Two dimensional table with rows and columns
- A matrix can contain only one type of data element
- The syntax for creating a matrix: matrix()
- You also need to specify the number of rows (nrow) and/or columns (ncol)
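A short sketch of `matrix()` with illustrative values, showing the default column-by-column fill and the `byrow` argument:

```r
# Six values arranged in 2 rows; R fills column by column by default
m <- matrix(1:6, nrow = 2)
print(m)
dim(m)       # rows, then columns

m[1, 2]      # element in row 1, column 2

# byrow = TRUE fills row by row instead
m_by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
print(m_by_row)
```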