Chapter 10 Flashcards by Mike L

What is big data?

Big data refers to data sets that are too large to be processed through conventional methods, characterized by high volume, high velocity of collection, and variety in type and quality.

Nominal Variables

A nominal variable’s categories have no ordering; they exist in name only (e.g. ‘grapes’, ‘oranges’, ‘bananas’).

Ordinal variables

An ordinal variable’s categories have a specific ordering (e.g. ‘agree’, ‘neutral’, ‘disagree’).

What are the two broad types of variables? What is the difference?

Quantitative: numerical
Categorical: name/category

What are the two types of categorical variables?

Nominal: unordered, exist in name only
Ordinal: part of a set order
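
One way the nominal/ordinal distinction might be expressed in pandas is with an unordered versus an ordered categorical. This is only a sketch; the variable names and sample values are illustrative assumptions, not part of the chapter.

```python
import pandas as pd

# Nominal: categories exist in name only, with no ordering.
fruit = pd.Categorical(["grapes", "oranges", "bananas", "grapes"])

# Ordinal: categories follow a set order, declared explicitly here.
response = pd.Categorical(
    ["agree", "disagree", "neutral", "agree"],
    categories=["disagree", "neutral", "agree"],
    ordered=True,
)

print(fruit.ordered)     # whether the fruit categories are ordered
print(response.ordered)  # whether the response categories are ordered
print(response.min())    # lowest category under the declared order
```

Order-based operations such as min() are only meaningful for the ordinal variable.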

What are the two types of quantitative variables?

Continuous: can take any value along a continuum; typically real numbers, usually measurements.

Discrete: a finite number of values within a range, typically integers; usually counts of items or groups of items.

What is data visualization? Why does it matter?

Data visualization is the display of data in a visual format, such as a chart or graph, meant to convey information to people more easily.

Data shown in a text-only format often does not convey ideas or information very clearly.

What is cardinality?
Why must you consider this?

The number of unique elements in a data set.

Consider cardinality when choosing a visual method to represent data: some methods (such as pie charts) are suited only to low-cardinality data, while others (such as scatter plots and histograms) may be better for high-cardinality data.
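
As a rough illustration, cardinality can be checked before picking a chart type by counting the distinct values in each column. The data frame below is hypothetical sample data made up for this sketch.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "fruit": ["grapes", "oranges", "bananas", "grapes", "oranges"],
    "weight_g": [5.1, 130.2, 118.0, 4.8, 127.5],
})

# nunique() counts the distinct values in each column.
cardinality = df.nunique()
print(cardinality)
```

Here the "fruit" column has few distinct values, while every "weight_g" value is distinct.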

What is an example of a data visualization method suitable for low-cardinality?
What about high-cardinality?

Pie charts are an example of data visualization suitable for low-cardinality.

Scatter plots and histograms are examples of visualizations suitable for higher-cardinality data.

What is a good visual method for representing categorical data?

Bar graphs
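
A minimal sketch of a bar graph for categorical data using matplotlib. The category names and counts are invented for illustration, and the "Agg" backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical category counts.
categories = ["grapes", "oranges", "bananas"]
counts = [12, 7, 9]

plt.figure()
bars = plt.bar(categories, counts)  # one bar per category
plt.xlabel("Fruit")
plt.ylabel("Count")
```

In an interactive session, plt.show() would display the figure.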

What command would import the “pandas” module using the alias “pd”?

import pandas as pd

What is matplotlib?

What does it replicate?

matplotlib is a module used for plotting data in Python.

It replicates the plotting capabilities of MATLAB, an engineering-oriented programming language.

This matplotlib function specifies the title of a plot:

plt.title()

This matplotlib function specifies the x-axis label of a plot:

plt.xlabel()

This matplotlib function specifies the y-axis label of a plot:

plt.ylabel()

This matplotlib function creates a text label:

plt.text()

This matplotlib function creates a text label for a specific data point:

plt.annotate()

This matplotlib function creates a legend for the plot:

plt.legend()
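
The labeling functions on the preceding cards can be combined in one plot. The sketch below uses made-up data and text; the "Agg" backend is an assumption so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical data points.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.figure()
plt.plot(x, y, label="y = x squared")  # label is picked up by plt.legend()
plt.title("Squares")                   # plot title
plt.xlabel("x")                        # x-axis label
plt.ylabel("y")                        # y-axis label
plt.text(1.5, 12, "A free-standing note")  # text at data coordinates (1.5, 12)
plt.annotate(                          # text tied to a specific data point
    "largest value",
    xy=(4, 16),
    xytext=(2.5, 15),
    arrowprops={"arrowstyle": "->"},
)
plt.legend()
```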

What is a data frame / DataFrame?

What are its 3 components?

A data frame is a two-dimensional tabular data structure with labeled columns and rows, similar to a spreadsheet.

It has 3 components:
-Index
-Columns
-Values
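
A short sketch showing how the three components can be inspected on a pandas DataFrame. The row labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical two-row data frame with explicit row labels.
df = pd.DataFrame(
    {"age": [22, 38], "pclass": [3, 1]},
    index=["row_a", "row_b"],
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.values)   # underlying two-dimensional array of values
```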

What is pandas?

When is this especially useful?

pandas is a Python library that provides tools for working with data frames (reading, writing, subsetting, and reshaping data, etc.)

It is especially useful when dealing with a data set that has missing or misaligned data.

This dataframe method outputs summary statistics for numerical columns:

dataframe.describe()

This dataframe method outputs the first / last 5 rows in the dataframe:

dataframe.head()
dataframe.tail()

This dataframe method outputs the minimum / maximum value in a numerical column:

dataframe.min()
dataframe.max()

This dataframe method outputs the mean / median value in a numerical column:

dataframe.mean()
dataframe.median()

This dataframe method outputs a random row:

***dataframe.sample()***

This dataframe method outputs the standard deviation of values in a numerical column:

dataframe.std()
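
The summary methods above can be tried together on a small data frame. The "age" values below are invented for illustration.

```python
import pandas as pd

# Hypothetical single-column data frame.
df = pd.DataFrame({"age": [22, 38, 26, 35, 35, 54, 2]})

print(df.describe())                         # count, mean, std, min, quartiles, max
print(df.head())                             # first 5 rows by default
print(df.tail())                             # last 5 rows by default
print(df["age"].min(), df["age"].max())      # smallest and largest value
print(df["age"].mean(), df["age"].median())  # mean and median
print(df["age"].std())                       # sample standard deviation
print(df.sample())                           # one randomly chosen row
```

Called on the whole data frame rather than a single column, methods such as min() and mean() return one result per column.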

To select a column or columns of a data frame...

...use the command: ***data_frame["column"]*** for one column, or ***data_frame[["column1", "column2"]]*** for several.

What command returns the first 10 rows of a data frame?

dataframe[0:10]
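
A small sketch of column selection and row slicing with square brackets. The data frame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical 12-row data frame.
df = pd.DataFrame({
    "name": [f"p{i}" for i in range(12)],
    "age": list(range(20, 32)),
    "pclass": [1, 2, 3] * 4,
})

ages = df["age"]                # a single column, returned as a Series
subset = df[["age", "pclass"]]  # a list of columns, returned as a DataFrame
first_ten = df[0:10]            # rows at positions 0 through 9
```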

How would you read a list of tabulated values into a data frame using ***pandas***?

import pandas as pd
ex_data_frame = pd.read_csv('ex_data.csv')
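
To try read_csv without a file on disk, an in-memory text buffer can stand in for 'ex_data.csv'. The CSV contents below are invented for this sketch.

```python
import io
import pandas as pd

# Stand-in for the contents of 'ex_data.csv' (hypothetical values).
csv_text = "name,age\nAda,36\nGrace,45\n"

# read_csv accepts a file path or a file-like object such as StringIO.
ex_data_frame = pd.read_csv(io.StringIO(csv_text))
print(ex_data_frame)
```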

What is ***subsetting***?

***Subsetting*** is the process of retrieving parts of a data frame.

What is ***loc[]*** used for? What is ***iloc[]*** used for?

***loc[]*** is used to select a range of rows and/or a subset of columns by label (e.g. *titanic.loc[0:5, ["pclass", "age"]]*).

***iloc[]*** is used to select a range of rows and/or columns by integer position (e.g. *exdatfrm.iloc[0:5, 0:5]*).
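
A brief sketch contrasting label-based and position-based selection. The data frame is a hypothetical stand-in with "pclass", "age", and "fare" columns, not the actual Titanic data set.

```python
import pandas as pd

# Hypothetical stand-in data with a default integer index.
titanic = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "age": [22, 38, 26, 35, 35, 27, 54, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08],
})

# Label-based: the slice 0:5 includes the end label, so 6 rows are returned.
by_label = titanic.loc[0:5, ["pclass", "age"]]

# Position-based: the slice 0:5 excludes the end position, so 5 rows are returned.
by_position = titanic.iloc[0:5, 0:2]

print(by_label.shape, by_position.shape)
```

The differing row counts come from the slicing rules: label slices are inclusive at both ends, while position slices follow ordinary Python slicing.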

What are the two typical forms of data frames?

***Long form***: each column is a variable and each row gives non-repeated data.

***Wide form***: each data variable is in a different column.

What is ***data reshaping***?

This is the process of converting a data frame from ***long*** form to ***wide*** form or vice-versa.

What is pivoting? What is melting?

***Pivoting*** is the process of converting a data frame from ***long form to wide form***. ***Melting*** is the process of converting a data frame from ***wide form to long form***.
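
A minimal sketch of reshaping in both directions with pandas, using DataFrame.pivot and DataFrame.melt. The city, year, and sales values are hypothetical.

```python
import pandas as pd

# Hypothetical long-form data: one row per (city, year) pair.
long_df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Pivot: long form to wide form, one column per year.
wide_df = long_df.pivot(index="city", columns="year", values="sales")

# Melt: wide form back to long form.
back_to_long = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="sales"
)

print(wide_df)
print(back_to_long)
```

The round trip returns the same four observations, though the row order may differ from the original.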