VA Session 5: NumPy & Pandas Flashcards
NumPy
Python library for working with arrays of data (all elements share one data type); faster than Python lists
Creating a NumPy array
np.array([1,2,3])
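A minimal sketch of the "only one data type" rule, assuming NumPy is installed; the variable names are illustrative only. Mixed numeric inputs are upcast to a common type.

```python
import numpy as np

# All elements of an array share one dtype.
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])   # the integers are upcast to float

print(ints.dtype.kind)   # 'i' = some integer type (exact width depends on the platform)
print(mixed.dtype)       # float64
```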
Shapes:
- 1D array
- 2D array
- 3D array
- shape (x,) -> axis 0
- shape (x, y) -> axes 0 & 1
- shape (x, y, z) -> axes 0, 1, 2
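The shapes and axes above can be sketched as follows (illustrative arrays, not from the session). Reducing along an axis removes that axis from the shape.

```python
import numpy as np

a1 = np.array([1, 2, 3])                 # 1D, shape (3,)
a2 = np.array([[1, 2, 3], [4, 5, 6]])    # 2D, shape (2, 3)
a3 = np.zeros((2, 3, 4))                 # 3D, shape (2, 3, 4)

print(a1.shape, a2.shape, a3.shape)

# axis=0 runs down the rows, axis=1 runs across the columns.
col_sums = a2.sum(axis=0)   # one value per column
row_sums = a2.sum(axis=1)   # one value per row
print(col_sums.tolist())    # [5, 7, 9]
print(row_sums.tolist())    # [6, 15]
```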
Pandas
- built on top of NumPy
- the de facto standard Python library for data analysis (not part of the Python standard library), centered on DataFrames
- supports efficiently reading & writing data between in-memory data structures & different formats (e.g. CSV, text files, SQL databases, Excel)
DataFrame
2-dimensional table in which different columns can hold different data types
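A small sketch of a DataFrame with a different data type per column; the column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["pen", "book"],      # text
    "price": [1.5, 12.0],         # floating point
    "in_stock": [True, False],    # boolean
})

# Each column keeps its own dtype.
print(df.dtypes)
```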
Pandas: Loading data
pd.read_csv("file.csv")
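A runnable sketch of the same call. To avoid depending on a file on disk, it reads from an in-memory string via io.StringIO; the CSV content is invented for illustration.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data).
csv_text = "id,name\n1,pen\n2,book\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```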
Pandas: Connecting to a database (to read data from the database directly into a DataFrame)
- Connect to database: db = sqlite3.connect("path…")
- Querying database: df = pd.read_sql_query("SELECT * FROM Product", db)
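The two steps can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the `Product` table layout and its rows are hypothetical, and ":memory:" replaces a real database path.

```python
import sqlite3
import pandas as pd

# ":memory:" creates a temporary database instead of opening a file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Product (id INTEGER, name TEXT)")
db.executemany("INSERT INTO Product VALUES (?, ?)", [(1, "pen"), (2, "book")])
db.commit()

# The query result arrives directly as a DataFrame.
df = pd.read_sql_query("SELECT * FROM Product", db)
db.close()
print(df)
```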
Pandas: check the first or last 5 rows
df.head(); df.tail()
Pandas: check basic info
df.info()
Pandas: check shape of DataFrame
df.shape
Pandas: check column names
df.columns
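The inspection calls above, tried on a small made-up DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": range(1, 11), "score": range(10, 110, 10)})

first = df.head()      # first 5 rows by default
last = df.tail(3)      # pass n to change the number of rows
df.info()              # prints column dtypes and non-null counts
print(df.shape)        # (10, 2) -> (rows, columns)
print(list(df.columns))
```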
Pandas: check number of missing values - count how many missing (NaN/None) values per column; a value of 0 is not missing
df.isnull().sum()
Pandas: count of each distinct value in a column incl. missing values
df[feature].value_counts(dropna=False) (the default, dropna=True, leaves out missing values)
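A short sketch of both checks on invented data. Note that zeros are ordinary values, and that value_counts() skips missing values unless dropna=False is passed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],
    "qty": [0, 3, np.nan, 0],
})

# Missing values per column; the zeros in "qty" are not counted.
missing = df.isnull().sum()

counts_default = df["color"].value_counts()               # excludes missing
counts_with_nan = df["color"].value_counts(dropna=False)  # includes missing

print(missing)
print(counts_with_nan)
```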
Pandas vs sqlite3 in Python when querying a database
- Pro Pandas: instantly get a DataFrame -> easier to work with than the list of rows returned by sqlite3
- Pro sqlite3: the sqlite3 module is faster & the returned list can easily be turned into a DataFrame
- Loading data in Python -> in most cases Pandas (especially when working with flat files)
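For comparison, a sketch of the sqlite3-only route: the query returns a list of tuples, which is then turned into a DataFrame. The `Product` table and its rows are hypothetical.

```python
import sqlite3
import pandas as pd

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Product (id INTEGER, name TEXT)")
db.executemany("INSERT INTO Product VALUES (?, ?)", [(1, "pen"), (2, "book")])

cur = db.execute("SELECT id, name FROM Product")
rows = cur.fetchall()                      # plain list of tuples
cols = [d[0] for d in cur.description]     # column names from the cursor

# Turning the list into a DataFrame is a single call.
df = pd.DataFrame(rows, columns=cols)
db.close()
print(df)
```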
Differences NumPy vs. Pandas
- arrays vs. DataFrames (labeled 2-dim tables); both are powerful tools
- memory efficient vs. more memory consuming
- tends to perform better with fewer than ~50k rows vs. tends to perform better with more rows (rule of thumb)
- numerical computations and processing on single- and multi-dimensional array elements vs. processing & analysing data in DataFrames
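A minimal side-by-side sketch of the two styles on the same invented data: element-wise arithmetic on a plain array versus a labeled result from a DataFrame.

```python
import numpy as np
import pandas as pd

# NumPy: unlabeled numeric array, element-wise computation.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
doubled = arr * 2

# Pandas: the same values with column labels.
df = pd.DataFrame(arr, columns=["a", "b"])
col_means = df.mean()    # one mean per column, indexed by column name
back = df.to_numpy()     # convert back to a NumPy array when needed

print(doubled.tolist())  # [[2.0, 4.0], [6.0, 8.0]]
print(col_means)
```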