FA5 + M5 - Sheet1 Flashcards
Which of the following libraries are used for mathematical and statistical operations on multi-dimensional arrays and matrices in Python?
Group of answer choices
Matplotlib
NumPy
Pandas
NumPy
Which of the following libraries are used for data visualization in Python?
Group of answer choices
NumPy
Matplotlib
SciPy
Matplotlib
Which of the following libraries are used for deep learning in Python?
Group of answer choices
TensorFlow
Scikit-learn
Keras
TensorFlow
Which of the following libraries are used for natural language processing in Python?
Group of answer choices
NLTK
Scrapy
Scikit-learn
NLTK
Which of the following libraries are used for creating spider bots that scan website pages and collect structured data in Python?
Group of answer choices
Scrapy
Pandas
SciPy
Scrapy
Which of the following libraries are used for object identification, speech recognition, and more in Python?
Group of answer choices
PyTorch
Keras
Dist-keras
It should be TensorFlow, but PyTorch is the answer marked correct in Canvas
Which of the following libraries are used for reading data, selecting and filtering data, and data manipulation in Python? There are two correct answers in the options, just choose one.
Group of answer choices
PyTorch
Pandas
NumPy
SciPy
Pandas
NumPy
Which of the following libraries are used for creating interactive and scalable visualizations in a browser using JavaScript widgets in Python? There are two correct answers from the choices, just select one.
Group of answer choices
SciPy
Bokeh
NumPy
Plotly
Bokeh
Plotly
Which Python libraries are built on NumPy? There are two correct answers from the choices, just select one.
Group of answer choices
Pandas
Seaborn
Scikit-Learn
Matplotlib
Pandas
Scikit-Learn
Which Python library provides machine learning algorithms?
Group of answer choices
Pandas
Scikit-Learn
NumPy
Matplotlib
Scikit-Learn
Data Wrangling:
SciPy
NumPy
pandas
Statistics
StatsModels
NLP
Natural Language Toolkit
SpaCy
gensim
Machine Learning
scikit-learn
xgboost
lightgbm
catboost
eli5
Deep Learning
TensorFlow
PyTorch
Keras
Distributed Deep Learning
dist-keras
elephas
spark-deep-learning
Visualization
matplotlib
Bokeh
plotly
Seaborn
pydot
it is intended for processing large multidimensional arrays and matrices, and an extensive collection of high-level mathematical functions and implemented methods makes it possible to perform various operations with these objects
NumPy (numpy.org)
it is based on NumPy and therefore extends its capabilities. SciPy's main data structure is again a multidimensional array, implemented by NumPy.
SciPy (scipy.org/scipylib)
The package contains tools that help with solving linear algebra, probability theory, integral calculus and many more tasks
SciPy
provides high-level data structure and a vast variety of tools for analysis. The great feature of this package is the ability to translate rather complex operations with data into one or two commands.
Pandas (pandas.pydata.org)
contains many built-in methods for grouping, filtering, and combining data, as well as the time-series functionality
Pandas
is a low-level library for creating two-dimensional diagrams and graphs.
Matplotlib (matplotlib.org)
With its help, you can build diverse charts, from histograms and scatterplots to non-Cartesian coordinate graphs.
Matplotlib (matplotlib.org)
Moreover, many popular plotting libraries are designed to work in conjunction with ____
matplotlib
is essentially a higher-level API based on the matplotlib library.
Seaborn (seaborn.pydata.org)
It contains more suitable default settings for processing charts.
Seaborn
Also, there is a rich gallery of visualizations including some complex types like time series, jointplots, and violin diagrams
Seaborn
is a popular library that allows you to build sophisticated graphics easily.
Plotly (plot.ly/python/)
The package is adapted to work in interactive web applications.
Plotly
Among its remarkable visualizations are contour graphics, ternary plots, and 3D charts
Plotly
The ____ library creates interactive and scalable visualizations in a browser using JavaScript widgets.
Bokeh (bokeh.pydata.org/en/latest/)
The library provides a versatile collection of graphs, styling possibilities, interaction abilities in the form of linking plots, adding widgets, and defining callbacks, and many more useful features.
Bokeh
is a popular framework for deep and machine learning, developed in Google Brain.
TensorFlow (tensorflow.org)
It provides abilities to work with artificial neural networks with multiple data sets.
TensorFlow
Among the most popular TensorFlow applications are _____ and more.
object identification, speech recognition,
is a large framework that allows you to perform tensor computations with GPU acceleration, create dynamic computational graphs and automatically calculate gradients.
PyTorch (pytorch.org)
On top of this, ____ offers a rich API for building applications related to neural networks
PyTorch
is a high-level library for working with neural networks, running on top of TensorFlow or Theano and, as a result of newer releases, other backends such as CNTK.
Keras (keras.io)
It simplifies many specific tasks and greatly reduces the amount of monotonous code. However, it may not be suitable for some complicated things.
Keras (keras.io)
These packages allow you to train neural networks based on the Keras library directly with the help of Apache Spark
Dist-keras (joerihermans.com/work/distributed-keras/)
dist-keras and others are gaining popularity and developing rapidly, and it is very difficult to single out one of the libraries since they are all designed to ______
solve a common task.
This Python module based on NumPy and SciPy is one of the best libraries for working with data.
Scikit-learn (scikit-learn.org/stable)
It provides algorithms for many standard machine learning and data mining tasks such as clustering, regression, classification, dimensionality reduction, and model selection
Scikit-learn
is an extension module that makes several frequent item set mining implementations available as functions.
PyFim
In PyFim, currently _______ are available as functions, although the interfaces do not offer all of the options of the command line program
apriori, eclat, fpgrowth, sam, relim, carpenter, ista, accretion and apriacc
Often the results of machine learning models predictions are not entirely clear, and this is the challenge that ___ library helps to deal with.
eli5
it is a package for visualization and debugging machine learning models and tracking the work of an algorithm step by step.
Eli5 (eli5.readthedocs.io/en/latest/)
It provides support for scikit-learn, XGBoost, LightGBM, lightning, and sklearn-crfsuite libraries and performs the different tasks for each of them
eli5
is a set of libraries, a whole platform for natural language processing.
NLTK (nltk.org)
With the help of ____, you can process and analyze text in a variety of ways, tokenize and tag it, extract information, etc.
NLTK
is also used for prototyping and building research systems
NLTK
is a Python library for robust semantic analysis, topic modeling and vector-space modeling, and is built upon Numpy and Scipy.
Gensim (radimrehurek.com/gensim)
Gensim provides an implementation of popular NLP algorithms, such as _____.
word2vec
Although gensim has its own models.wrappers.fasttext implementation, the ____ can also be used for efficient learning of word representations.
fasttext library
is a library used to create spider bots that scan website pages and collect structured data.
Scrapy (scrapy.org)
In addition, Scrapy can extract data from the ___
API
The library happens to be very handy due to its extensibility and portability
Scrapy
Introduces objects for multi-dimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
NumPy
Provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance
NumPy
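A minimal sketch of what vectorization means in practice. The array values and variable names below are made up for illustration; only standard NumPy calls are used.

```python
import numpy as np

# Hypothetical values, invented for this illustration.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Vectorized arithmetic: one expression acts on every element, no Python loop.
total = a + b        # element-wise sum
scaled = a * 2.5     # the scalar is applied to each element

# Statistical operations on a 2-D array (matrix).
m = np.array([[1, 2, 3],
              [4, 5, 6]])
col_means = m.mean(axis=0)   # mean of each column
row_sums = m.sum(axis=1)     # sum of each row
```

The same results could be produced with explicit for-loops, but the vectorized form runs in compiled code and is usually much faster on large arrays.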
Many other python libraries are built on ____
NumPy
adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
Pandas
Provides tools for data manipulation: reshaping, sorting, slicing, aggregation etc.
Pandas
Allows handling of missing data
Pandas
provides machine learning algorithms: classification, regression, clustering, and model validation
Scikit-Learn
Built on NumPy, SciPy, and matplotlib
Scikit-Learn
Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats
matplotlib
A set of functionalities similar to those of MATLAB
matplotlib
Line plots, scatter plots, bar charts, histograms, pie charts etc
matplotlib
Relatively low-level; some effort needed to create advanced visualization
matplotlib
based on matplotlib. Provides high level interface for drawing attractive statistical graphics
Seaborn
Seaborn is similar (in style) to the popular ___ library in R
ggplot2
Loading Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
Press ____ to execute jupyter cell
Shift+Enter
There are numerous commands to read other data formats:
pd.read_excel('myfile.xlsx', sheet_name='Sheet1', index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
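The reader functions share several parameters. As a sketch that runs without any file on disk, the example below uses pd.read_csv on in-memory text (the text and column names are invented for illustration) with the same index_col and na_values options shown for read_excel.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file.
csv_text = "rank,salary\nProf,100000\nAsstProf,NA\nProf,120000\n"

# na_values lists the strings that should be read as missing (NaN).
df = pd.read_csv(io.StringIO(csv_text), index_col=None, na_values=["NA"])

n_missing = df["salary"].isna().sum()   # number of missing salaries
```

Because one salary is missing, pandas stores the salary column as float64 rather than int64.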
List first 5 records
df.head()
To view the first 10 records
df.iloc[:10]
To view the last few records
df.tail(10)
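A short sketch of the record-viewing calls above, using a made-up 20-row data frame so the row counts are easy to check.

```python
import pandas as pd

# Hypothetical data frame with 20 rows, invented for illustration.
df = pd.DataFrame({"salary": range(1, 21)})

first5 = df.head()       # head() defaults to 5 rows
first10 = df.iloc[:10]   # position-based slice: rows 0 through 9
last10 = df.tail(10)     # last 10 rows
```

df.head(10) returns the same rows as df.iloc[:10].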
The most general dtype. Will be assigned to your column if column has mixed type numbers and strings
object (string)
Whole numbers. 64 refers to the number of bits of memory allocated to hold each value
int64 (int)
Numbers with decimals. If a column contains numbers and NaNs, pandas will default to float64, because NaN is stored as a floating-point value
float64 (float)
Values meant to hold time data. Look into these for time series experiments
datetime64, timedelta[ns] (N/A)
Check a particular column type
df['salary'].dtype
Check types for all the columns
df.dtypes
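A sketch of how the dtypes described above come about. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical columns chosen to trigger each dtype.
df = pd.DataFrame({
    "mixed": [1, "two", 3.0],                # numbers and strings -> object
    "service": [20, 3, 15],                  # whole numbers -> int64
    "salary": [100000, np.nan, 120000],      # numbers plus NaN -> float64
    "hired": pd.to_datetime(["2001-05-01", "2015-09-01", "2008-01-15"]),
})

column_type = df["salary"].dtype   # type of one column
all_types = df.dtypes              # types of all columns
```

The time unit shown for the datetime column (for example ns) may differ between pandas versions.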
list the types of the columns
dtypes
list the column names
columns
list the row labels and column names
axes
number of dimensions
ndim
number of elements
size
return a tuple representing the dimensionality
shape
numpy representation of the data
values
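A sketch of the data frame attributes listed above, on a small made-up data frame. Attributes are read without parentheses.

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns.
df = pd.DataFrame({"service": [20, 3, 15], "salary": [100.0, 80.0, 120.0]})

col_names = list(df.columns)   # column names
axes = df.axes                 # [row labels, column names]
n_dims = df.ndim               # number of dimensions
n_elems = df.size              # number of elements (rows * columns)
dims = df.shape                # (rows, columns) tuple
data = df.values               # NumPy representation of the data
```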
Unlike attributes, python methods have ___
parentheses.
All attributes and methods can be listed with a ____
dir() function
first/last n rows
head( [n] ), tail( [n] )
generate descriptive statistics (for numeric columns only)
describe()
return max/min values for all numeric columns
max(), min()
return mean/median values for all numeric columns
mean(), median()
standard deviation
std()
returns a random sample of the data frame
sample([n])
drop all the records with missing values
dropna()
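A sketch of the data frame methods listed above. The data is invented for illustration, and random_state is passed to sample() only to make the draw repeatable.

```python
import pandas as pd

# Hypothetical data with one missing salary.
df = pd.DataFrame({
    "salary": [100.0, 80.0, None, 120.0],
    "service": [20, 3, 8, 15],
})

stats = df.describe()                  # count, mean, std, min, quartiles, max
max_vals = df.max()                    # max of each numeric column
means = df.mean()                      # mean of each numeric column
medians = df.median()                  # median of each numeric column
spread = df.std()                      # standard deviation of each column
picked = df.sample(2, random_state=0)  # 2 random rows
cleaned = df.dropna()                  # drop rows with missing values
```

Unlike attributes, each of these is called with parentheses.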
Using “group by” method we can:
Split the data into groups based on some criteria
Calculate statistics (or apply a function) to each group
Similar to the group_by() function of the dplyr package in R
group data using rank
df_rank = df.groupby(['rank'])
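A sketch of the split-then-calculate idea behind group by, using made-up rank and salary values.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AsstProf"],
    "salary": [120000, 80000, 100000, 90000],
})

# Split the rows into groups by rank.
df_rank = df.groupby(["rank"])

# Calculate a statistic for each group.
mean_salary = df_rank["salary"].mean()
```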
To subset the data we can apply ____
Boolean indexing.
To subset the data we can apply Boolean indexing.
This indexing is commonly known as a ____
filter
Any ____ can be used to subset the data:
Boolean operator
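A sketch of Boolean indexing (filtering) on a small made-up data frame. When conditions are combined, each one needs its own parentheses, and & or | are used instead of and / or.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AsstProf"],
    "salary": [120000, 80000, 100000, 90000],
})

# Keep only rows where the condition is True.
high = df[df["salary"] > 95000]

# Combine two conditions with a Boolean operator.
prof_high = df[(df["rank"] == "Prof") & (df["salary"] > 110000)]
```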
There are a number of ways to subset the Data Frame:
one or more columns
one or more rows
a subset of rows and columns
Rows and columns can be selected by their position or label
Slicing
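A sketch of selecting by position (iloc) and by label (loc), on a made-up data frame with the default integer row labels. Note that a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import pandas as pd

# Hypothetical data: 5 rows labeled 0 through 4.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AsstProf", "Prof"],
    "service": [20, 3, 15, 5, 30],
    "salary": [120000, 80000, 100000, 90000, 150000],
})

rows_by_pos = df.iloc[1:3]                        # positions 1 and 2
rows_by_label = df.loc[1:3, ["rank", "salary"]]   # labels 1, 2 and 3
corner = df.iloc[:2, :2]                          # first 2 rows, first 2 columns
```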
When selecting one column, it is possible to use single set of brackets, but the resulting object will be a ____(not a DataFrame):
Series
When we need to select more than one column and/or make the output to be a DataFrame, we should use ____
double brackets:
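A sketch of the difference between single and double brackets, on made-up data.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf"],
    "salary": [120000, 80000],
})

one_col = df["salary"]             # single brackets -> Series
one_col_df = df[["salary"]]        # double brackets -> DataFrame
two_cols = df[["rank", "salary"]]  # several columns -> DataFrame
```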
When summing the data, missing values will be treated as ___
zero
If all values are missing, the sum will be equal to____
NaN (in pandas 0.22 and later the result is 0, unless min_count=1 is passed)
methods ignore missing values but preserve them in the resulting arrays
cumsum() and cumprod()
Missing values in ___ method are excluded (just like in R)
GroupBy
Many descriptive statistics methods have ___ option to control if missing data should be excluded. This value is set to True by default (unlike R)
skipna
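A sketch of the missing-value rules above, on made-up values. The all-missing case uses min_count=1, which asks for NaN when there are no valid values; without it, recent pandas versions return 0.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

total = s.sum()                  # missing value is skipped (treated as zero)
running = s.cumsum()             # NaN is ignored but kept in its position
mean_skip = s.mean()             # skipna=True by default
mean_keep = s.mean(skipna=False) # NaN because the missing value is included

# All values missing: NaN is returned when at least one valid value is required.
all_missing = pd.Series([np.nan, np.nan]).sum(min_count=1)

# Rows whose group key is missing are excluded from the groups.
g = pd.DataFrame({"key": ["a", np.nan, "a"], "val": [1, 2, 3]})
group_sums = g.groupby("key")["val"].sum()
```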
computing a summary statistic about each group, i.e.
compute group sums or means
compute group sizes/counts
Aggregation
Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad
std, var
are useful when multiple statistics are computed per column
agg()
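A sketch of agg() computing several statistics at once, per group and per column. The data is invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf", "Prof", "AsstProf"],
    "salary": [120000, 80000, 100000, 90000],
    "service": [20, 3, 15, 5],
})

# Several statistics for one column, per group.
per_group = df.groupby("rank")["salary"].agg(["min", "max", "mean"])

# Several statistics for each numeric column.
per_column = df[["salary", "service"]].agg(["min", "mean", "max"])
```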
Basic statistic (count, mean, std, min, quantiles, max)
describe()
Minimum and maximum values
min, max
Arithmetic average, median, and mode
mean, median, mode
Variance and standard deviation
var, std
Standard error of mean
sem
Sample skewness
skew
kurtosis
kurt
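A sketch of the descriptive statistics methods above on made-up values. The standard error of the mean equals the sample standard deviation divided by the square root of the number of values.

```python
import pandas as pd

# Hypothetical values, invented for illustration.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

variance = s.var()    # sample variance (divides by n - 1)
std_dev = s.std()     # sample standard deviation
std_err = s.sem()     # standard error of the mean
modes = s.mode()      # most frequent value(s)

# A symmetric series has zero sample skewness.
sym = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
skewness = sym.skew()
kurtosis = sym.kurt()
```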
histogram
displot
estimate of central tendency for a numeric variable
barplot
similar to boxplot, also shows the probability density of the data
violinplot
Scatterplot
jointplot
regression plot
regplot
Pairplot
pairplot
Boxplot
boxplot
categorical scatterplot
swarmplot
general categorical plot
factorplot (renamed catplot in seaborn 0.9)
both have a number of functions for statistical analysis
statsmodels and scikit-learn
mostly used for regular analysis using R style formulas
statsmodels
is more tailored for Machine Learning
scikit-learn
statsmodels:
linear regressions
ANOVA tests
hypothesis tests
many more
scikit-learn:
kmeans
support vector machines
random forests
many more