lecture 6 Flashcards
What are the attributes of a NumPy array?
shape: the dimensions of the NumPy array (its length along each axis)
size: the total number of elements in the NumPy array
ndim: the number of dimensions of the array
dtype: the data type of the elements in the array
itemsize: the length of a single array element in bytes
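A quick way to check these attributes is to print them for a small array. This is a minimal sketch; the array A and the explicit dtype=np.int64 are choices made here for illustration, so that the reported itemsize is the same on every platform.

```python
import numpy as np

# 2x3 array; the dtype is fixed explicitly so itemsize is predictable
A = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(A.shape)     # (2, 3)  -> 2 rows, 3 columns
print(A.size)      # 6       -> total number of elements
print(A.ndim)      # 2       -> number of dimensions
print(A.dtype)     # int64   -> data type of each element
print(A.itemsize)  # 8       -> bytes per element (64 bits / 8)
```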
How do you create a NumPy array?
First, import numpy as np
For 1D, use np.array() and pass a list as the argument
e.g.
import numpy as np
# create a NumPy array from a list of 3 integers
a = np.array([1,2,3]) # Don’t forget the []
For 2D
Follow the same first steps: import numpy as np, then call np.array() with a list of lists (one inner list per row)
e.g.
# 2-d array or a 2x3 matrix
A = np.array([[1,2,3],[4,5,6]])
# another way of creating a 2-d array
a = np.array([1,2,3,4,5,6]).reshape([2,3])
For 3D arrays
The dimensions of a 3D array are described by the number of layers the array contains, and the number of rows and columns in each layer.
# 3-d array with 2 layers, each with 2 rows and 3 columns
A = np.array([[[1,2,3],
               [4,5,6]],
              [[7,8,9],
               [10,11,12]]])
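To see how layers, rows and columns map onto the shape, the sketch below builds the same 3D array, inspects it, and rebuilds it with reshape. The names A3 and B3 are introduced here only for illustration.

```python
import numpy as np

# 2 layers, each layer has 2 rows and 3 columns
A3 = np.array([[[1, 2, 3],
                [4, 5, 6]],
               [[7, 8, 9],
                [10, 11, 12]]])

print(A3.shape)     # (2, 2, 3) -> (layers, rows, columns)
print(A3.ndim)      # 3
print(A3[1, 0, 2])  # 9 -> layer 1, row 0, column 2

# The same array built from a flat range and reshaped
B3 = np.arange(1, 13).reshape(2, 2, 3)
print(np.array_equal(A3, B3))  # True
```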
How many types of array are there?
1D: for example, a vector
2D: for example, a matrix
3D: a 3rd-order tensor
ND: an N-dimensional array
How can the items of an array be accessed?
Items of an array can be accessed and assigned in the same way as in other Python sequences (e.g. lists). Indexes in NumPy arrays start at 0.
a = np.arange(10)
a[0], a[2], a[-1] # output (0, 2, 9)
a[2:9:3] # array([2,5,8]); the form is [start:end:step], end is excluded; by default start is 0, end is the end of the array and step is 1
a[:4] # array([0,1,2,3])
a[3:] # array([3,4,5,6,7,8,9])
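The indexing and slicing rules above, plus assignment through an index or slice, can be run together as a short sketch. The values assigned here are arbitrary and only for illustration.

```python
import numpy as np

a = np.arange(10)  # the integers 0 to 9

print(a[0], a[2], a[-1])    # 0 2 9
print(a[2:9:3].tolist())    # [2, 5, 8]  (start 2, stop before 9, step 3)
print(a[:4].tolist())       # [0, 1, 2, 3]
print(a[3:].tolist())       # [3, 4, 5, 6, 7, 8, 9]

# Assignment works the same way as with lists
a[0] = 100          # single element
a[-2:] = [-1, -2]   # a slice of two elements
print(a.tolist())   # [100, 1, 2, 3, 4, 5, 6, 7, -1, -2]
```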
What is data?
Data is a collection of examples. Each row is an example and each column is a feature. Examples are also called samples.
Explain data processing
Data can be incomplete, noisy and inconsistent. Data processing resolves those issues and transforms raw data into an understandable form.
Processing is key to good model performance and is often the most time-consuming step.
What are the two steps of data processing?
Two steps: 1) understanding the data 2) preparing the data
What is pandas?
pandas: a Python library for working with data sets (analysing, cleaning and manipulating data)
how do you import pandas ?
import pandas as pd
What are data sets in pandas?
Data sets in pandas are usually multi-dimensional tables, called DataFrames
A pandas DataFrame is a 2-dimensional data structure, like a 2D array or a table with rows and columns
What is the most common method for getting a quick overview of a DataFrame?
It is head(), e.g. print(df.head(10)) prints the first 10 rows (the default is 5)
# df is the DataFrame, which contains the data read from the CSV file
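The sketch below shows head() on a small DataFrame. To keep it self-contained, the CSV content is an in-memory string rather than a file on disk; the column names height_cm and weight_kg and all values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would come from a file path
csv_text = """height_cm,weight_kg
150,52
160,58
165,63
170,68
175,74
180,80
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))  # first 3 rows
print(df.head())   # first 5 rows (the default)
print(df.shape)    # (6, 2) -> 6 rows, 2 columns
```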
What is the correlation coefficient?
The Pearson correlation coefficient (also known as Pearson's r) is a statistical measure that quantifies the strength and direction of the linear relationship between two variables
What is the range of the correlation coefficient?
-1 to 1, with 0 indicating no linear correlation, -1 a perfect negative and 1 a perfect positive linear relationship
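As an illustration of the two ends of that range, the sketch below computes Pearson's r for one perfectly increasing and one perfectly decreasing linear relationship, using np.corrcoef and DataFrame.corr() (whose default method is Pearson). The variables x, y_pos and y_neg are invented for this example.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0    # increases linearly with x
y_neg = -3.0 * x + 10.0  # decreases linearly with x

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
print(r_pos, r_neg)  # approximately 1.0 and -1.0

# The same with pandas: pairwise correlations between all columns
df = pd.DataFrame({"x": x, "y_pos": y_pos, "y_neg": y_neg})
corr_matrix = df.corr()
print(corr_matrix)
```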
How do you deal with missing values in data?
Solutions
-replace the missing values with the column mean
-remove the rows that contain NaN
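Both options can be sketched with pandas, assuming the data is held in a DataFrame. The columns age and score and their values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [20.0, np.nan, 40.0],
    "score": [1.0, 2.0, np.nan],
})

# Option 1: replace each NaN with the mean of its column
filled = df.fillna(df.mean(numeric_only=True))
print(filled)  # age NaN -> 30.0, score NaN -> 1.5

# Option 2: remove every row that contains at least one NaN
dropped = df.dropna()
print(dropped)  # only the first row remains
```

Filling keeps all rows but changes the data; dropping keeps only observed values but can discard many rows when missing values are common.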
How do you deal with duplicated data and outliers?
Keep only the first occurrence of a duplicated row by removing the later copies; outliers can also be removed
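A minimal sketch of both steps with pandas follows. The flashcard does not say how outliers are identified, so the 1.5 x IQR rule used here is an assumption, and the columns id and value are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 2, 3, 4, 5, 6],
    "value": [10, 12, 12, 11, 13, 12, 300],
})

# Keep the first occurrence of each fully duplicated row
deduped = df.drop_duplicates(keep="first")
print(len(deduped))  # 6

# Assumed outlier rule: drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = deduped["value"].quantile(0.25)
q3 = deduped["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

cleaned = deduped[deduped["value"].between(lower, upper)]
print(cleaned)  # the row with value 300 is removed
```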
How do you detect missing values?
isna() and isnull() (isnull() is an alias of isna())
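For example, on a small DataFrame with hypothetical columns age and score, the detection methods can be used like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [20.0, np.nan, 40.0],
    "score": [1.0, 2.0, np.nan],
})

mask = df.isna()  # True where a value is missing
print(mask)

counts = df.isna().sum()  # number of missing values per column
print(counts)

# isnull() gives the same result as isna()
print(df.isnull().equals(df.isna()))  # True
```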