Chapter 10 Flashcards

1
Q

What is big data?

A

Big data refers to data sets too large or complex to be processed through conventional methods. They are characterized by high volume, high velocity of collection, and variety in type and quality.

2
Q

Nominal Variables

A

A nominal variable's categories have no ordering; they exist in name only (e.g. 'grapes', 'oranges', 'bananas').

3
Q

Ordinal variables

A

An ordinal variable's categories have a specific ordering (e.g. 'agree', 'neutral', 'disagree').

4
Q

What are the two broad types of variables? What is the difference?

A

Quantitative: numerical
Categorical: name/category

5
Q

What are the two types of categorical variables?

A

Nominal: unordered, exist in name only
Ordinal: categories follow a set order
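
As a rough illustration, pandas can represent both kinds of categorical variable with pd.Categorical. The fruit and opinion values below are made up for the example; the only difference between the two objects is whether an order is declared.

```python
import pandas as pd

# Nominal: categories exist in name only, with no order (made-up data).
fruit = pd.Categorical(["grapes", "oranges", "bananas", "grapes"])

# Ordinal: categories are listed from lowest to highest and marked as ordered.
opinion = pd.Categorical(
    ["agree", "disagree", "neutral", "agree"],
    categories=["disagree", "neutral", "agree"],
    ordered=True,
)

print(fruit.ordered)                 # False
print(opinion.ordered)               # True
print(opinion.min(), opinion.max())  # disagree agree
```

Because the ordinal version carries an order, min() and max() are meaningful for it; calling them on the nominal version raises an error.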

6
Q

What are the two types of quantitative variables?

A

Continuous: can take infinitely many values along a continuum; typically real numbers that represent measurements.

Discrete: values are finite within a range, typically integers. They usually represent countable items or groups of items.

7
Q

What is data visualization? Why does it matter?

A

Data visualization is the display of data in a visual format, such as a chart or graph, meant to convey information to people more easily.

Data shown in a text-only format often does not convey ideas or information very clearly.

8
Q

What is cardinality?
Why must you consider this?

A

The number of unique elements in a data set.

Consider cardinality when choosing a visual method to represent data: some methods (such as pie charts) are only suited to low cardinality, while others (such as scatter plots and histograms) may be better for high cardinality.
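
One way to check cardinality before picking a chart type is to count distinct values per column. This is a minimal sketch with a made-up data frame; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical data: a low-cardinality category and a high-cardinality measurement.
df = pd.DataFrame({
    "fruit": ["grapes", "oranges", "bananas", "grapes", "oranges"],
    "weight_g": [5.1, 131.0, 118.2, 4.8, 140.5],
})

# nunique() returns the number of distinct values in each column.
cardinality = df.nunique()
print(cardinality)
```

Here "fruit" has only three unique values, while every "weight_g" value is distinct.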

9
Q

What is an example of a data visualization method suitable for low-cardinality?
What about high-cardinality?

A

Pie charts are an example of data visualization suitable for low-cardinality.

Scatter plots and histograms are examples of visualizations suitable for higher cardinality.

10
Q

What is a good visual method for representing categorical data?

A

Bar graphs
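
A minimal sketch of a bar graph for categorical data, assuming matplotlib and pandas are installed. The fruit values are made up, and the "Agg" backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Made-up categorical observations.
fruit = pd.Series(["grapes", "oranges", "bananas", "grapes", "oranges", "grapes"])

# Count how often each category occurs.
counts = fruit.value_counts()

# One bar per category, with bar height equal to the count.
fig, ax = plt.subplots()
bars = ax.bar(list(counts.index), list(counts.values))
ax.set_xlabel("Fruit")
ax.set_ylabel("Count")
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.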

11
Q

What command would import the "pandas" module using the alias "pd"?

A

import pandas as pd

12
Q

What is matplotlib?

What does it replicate?

A

matplotlib is a module used for plotting data in Python.

It replicates the plotting capabilities of MATLAB, an engineering-oriented programming language.

13
Q

This matplotlib function specifies the title of a plot:

A

plt.title()

14
Q

This matplotlib function specifies the x-axis label of a plot:

A

plt.xlabel()

15
Q

This matplotlib function specifies the y-axis label of a plot:

A

plt.ylabel()

16
Q

This matplotlib function creates a text label:

A

plt.text()

17
Q

This matplotlib function creates a text label for a specific data point:

A

plt.annotate()

18
Q

This matplotlib function creates a legend for the plot:

A

plt.legend()
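
The labeling functions from the cards above can be combined on a single plot. This is a small sketch with made-up data, assuming matplotlib is imported as plt; the "Agg" backend is used only so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Made-up data points.
x = [1, 2, 3, 4]
y = [2, 4, 8, 16]

plt.figure()
plt.plot(x, y, label="growth")   # label is what the legend displays
plt.title("Example plot")        # plot title
plt.xlabel("Step")               # x-axis label
plt.ylabel("Value")              # y-axis label
plt.text(1.2, 14, "a free-floating note")  # text at data coordinates (1.2, 14)
plt.annotate("last point", xy=(4, 16), xytext=(2.8, 15),
             arrowprops={"arrowstyle": "->"})  # text with an arrow to a data point
plt.legend()                     # legend built from the label above

ax = plt.gca()  # current axes, kept here for inspection
```

plt.text() places text at a position, while plt.annotate() ties the text to a specific data point and can draw an arrow to it.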

19
Q

What is a data frame / DataFrame?

What are its 3 components?

A

A data frame is a two-dimensional tabular data structure with labeled columns and rows, similar to a spreadsheet.

It has 3 components:
-Index
-Columns
-Values
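
The three components can be inspected directly on a pandas DataFrame. The passenger labels and values below are invented for illustration.

```python
import pandas as pd

# A tiny data frame with explicit row labels (made-up data).
df = pd.DataFrame(
    {"age": [22, 38], "pclass": [3, 1]},
    index=["passenger_a", "passenger_b"],
)

print(df.index)    # row labels
print(df.columns)  # column labels
print(df.values)   # the underlying 2-D array of values
```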

20
Q

What is pandas?

When is this especially useful?

A

pandas is a Python library that provides tools for working with data frames (reading, writing, subsetting, and reshaping data, etc.)

It is especially useful when dealing with a data set that has missing or misaligned data.

21
Q

This dataframe method outputs summary statistics for numerical columns:

A

dataframe.describe()

22
Q

This dataframe method outputs the first / last 5 rows in the dataframe:

A

dataframe.head()
dataframe.tail()

23
Q

This dataframe method outputs the minimum / maximum value in a numerical column:

A

dataframe.min()
dataframe.max()

24
Q

This dataframe method outputs the mean / median value in a numerical column:

A

dataframe.mean()
dataframe.median()

25
Q

This dataframe method outputs a random row:

A

dataframe.sample()

26
Q

This dataframe method outputs the standard deviation of values in a numerical column:

A

dataframe.std()
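
The summary methods from the preceding cards can be tried together on a small example. The "score" column is hypothetical; the column-level calls are applied to a single selected column.

```python
import pandas as pd

# Made-up numerical data.
df = pd.DataFrame({"score": [10, 20, 30, 40, 50, 60]})

print(df.describe())          # count, mean, std, min, quartiles, max
print(df.head())              # first 5 rows by default
print(df.tail())              # last 5 rows by default
print(df["score"].min(), df["score"].max())
print(df["score"].mean(), df["score"].median())
print(df["score"].std())      # sample standard deviation (divides by n - 1)
print(df.sample())            # one randomly chosen row
```

head(), tail(), and sample() each accept a row count, for example df.head(3).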

27
Q

To select a column or columns of a data frame…

A

…use the command:

data_frame["column"] (one column)
data_frame[["column_a", "column_b"]] (several columns; the names are placeholders)

28
Q

What command returns the first 10 rows of a data frame?

A

dataframe[0:10]
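
Column selection and row slicing can be compared side by side. This sketch uses a made-up 12-row data frame with hypothetical column names.

```python
import pandas as pd

# Made-up data frame with 12 rows.
df = pd.DataFrame({
    "name": [f"row{i}" for i in range(12)],
    "x": list(range(12)),
    "y": list(range(100, 112)),
})

one_column = df["x"]              # a single column comes back as a Series
two_columns = df[["name", "y"]]   # a list of names comes back as a DataFrame
first_ten = df[0:10]              # rows at positions 0 through 9
```

A plain slice inside the brackets selects rows, while a name or list of names selects columns.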

29
Q

How would you read a list of tabulated values into a data frame using pandas?

A

import pandas as pd

ex_data_frame = pd.read_csv('ex_data.csv')
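
To try this without a file on disk, the same call can read from an in-memory string. The CSV text below is a made-up stand-in for the contents of ex_data.csv.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (made-up values).
csv_text = "name,age\nAda,36\nGrace,45\n"

# read_csv accepts a file path or any file-like object.
ex_data_frame = pd.read_csv(io.StringIO(csv_text))
print(ex_data_frame)
```

The first line of the text is treated as the header row, so the result has the columns "name" and "age".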

30
Q

What is subsetting?

A

Subsetting data is the process of retrieving parts of a data frame.

31
Q

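What is loc[] used for?
What is iloc[] used for?

A

loc[] is used to select rows and/or columns by label, such as a range of row labels and a subset of named columns (e.g. titanic.loc[0:5, ["pclass", "age"]]). A label slice includes its end label.

iloc[] is used to select rows and/or columns by integer position (e.g. exdatfrm.iloc[0:5, 0:5]). A position slice excludes its end position.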
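
The difference between the two indexers shows up in how slices are bounded. This sketch uses a made-up data frame loosely modeled on the titanic example; the values are invented.

```python
import pandas as pd

# Made-up data with a default integer index (labels 0 through 6).
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2, 3],
    "age": [29, 22, 35, 4, 58, 41, 19],
    "fare": [211.3, 7.3, 13.0, 16.7, 26.6, 10.5, 8.1],
})

# Label-based: row labels 0 through 5 inclusive, columns chosen by name.
by_label = df.loc[0:5, ["pclass", "age"]]

# Position-based: row positions 0 through 4, column positions 0 and 1.
by_position = df.iloc[0:5, 0:2]

print(by_label.shape)     # (6, 2)
print(by_position.shape)  # (5, 2)
```

Both are indexers used with square brackets rather than methods called with parentheses.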

32
Q

What are the two typical forms of data frames?

A

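Long form: each column is a variable and each row is a single observation, so one column holds all the measured values and the other columns identify them.

Wide form: the measured values are spread across several columns, one per data variable.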

33
Q

What is data reshaping?

A

Data reshaping is the process of converting a data frame from long form to wide form or vice versa.

34
Q

What is pivoting?
What is melting?

A

Pivoting is the process of converting a data frame from long form to wide form.

Melting is the process of converting a data frame from wide form to long form.
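
A round trip between the two forms might look like the following in pandas, using DataFrame.pivot() and DataFrame.melt(). The city, year, and temp columns and their values are made up for the example.

```python
import pandas as pd

# Long form: one row per (city, year) observation (made-up values).
long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "year": [2020, 2021, 2020, 2021],
    "temp": [6.1, 6.4, 19.2, 19.5],
})

# Pivot: long to wide, one row per city and one column per year.
wide_df = long_df.pivot(index="city", columns="year", values="temp")

# Melt: wide back to long, restoring the city, year, and temp columns.
back_to_long = wide_df.reset_index().melt(
    id_vars="city", var_name="year", value_name="temp"
)

print(wide_df)
print(back_to_long)
```

reset_index() turns the city index back into an ordinary column so that melt() can use it as the identifier.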
