Data Science Notes Flashcards

1
Q

What is a Python list?

A

A Python list is an ordered sequence of values. It can hold items of any type, including a mix of types.
Lists are mutable, meaning that you can reorder the items, reassign an item to a new value, and add or remove items in place.
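
A minimal sketch of list mutability. The variable name and values are made up for illustration:

```python
# A list can mix types and be changed in place.
items = [3, "apple", 2.5]

items[1] = "banana"   # reassign an existing item
items.append(True)    # add an item to the end
items.reverse()       # reverse the order in place
```

After these three steps the same list object holds four items in reversed order; no new list was created.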

2
Q

What is NumPy?

A

NumPy provides the ndarray object for efficient storage and manipulation of dense data arrays in Python.
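
A small sketch of what working with an ndarray looks like. The array contents are arbitrary example values:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])   # one typed, dense block of floats

doubled = values * 2    # elementwise arithmetic, no explicit Python loop
total = values.sum()    # aggregate over the whole array
```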

3
Q

What is pandas?

A
  • This library provides the DataFrame object for efficient storage and manipulation of labeled/columnar data in Python.
  • pandas is a high-level tool for data manipulation and transformation.
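
A short sketch of labeled, columnar data in a DataFrame. The column names and values are invented for the example:

```python
import pandas as pd

# Each column has a label; rows can be grouped and summarized by those labels.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "temp_c": [4.0, 19.0, 6.0],
})

mean_by_city = df.groupby("city")["temp_c"].mean()
```

Here `mean_by_city` is a Series indexed by city name, so a value can be looked up by label rather than by position.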
4
Q

What is a distribution?

A

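A distribution describes all the values a random variable can take, together with how likely each value is. A random variable assigns a number to the outcome of a random process, for example 1 for heads and 0 for tails on each flip of a coin.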

5
Q

What is a binomial distribution?

A

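A binomial distribution gives the probability of each possible number of successes in a fixed number of independent trials that all share the same probability of success. It is called binomial because each trial has exactly two possible outcomes, such as heads or tails.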
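
As a rough illustration, the probability of each possible number of heads in a few coin flips can be computed directly. `binomial_pmf` is a hypothetical helper name chosen for this sketch, not something from a library:

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Distribution of the number of heads in 3 flips of a fair coin.
pmf = [binomial_pmf(k, 3, 0.5) for k in range(4)]
```

The four probabilities (for 0, 1, 2, and 3 heads) add up to 1, as any distribution must.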

6
Q

What is a discrete distribution?

A

A discrete distribution describes a variable that can only take separate, countable values, such as heads or tails, rather than any real number in a range.

7
Q

What is broadcasting?

A

Broadcasting is a set of rules for applying binary ufuncs (addition, subtraction, etc.) to arrays of different shapes.

For example, with a NumPy array: [1, 2, 3] + 5 = [6, 7, 8]
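
The same example written as runnable NumPy code. Note that it relies on a NumPy array; a plain Python list would raise a TypeError on `+ 5`:

```python
import numpy as np

a = np.array([1, 2, 3])
result = a + 5   # the scalar 5 is broadcast across every element of a
```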

8
Q

What are the rules of broadcasting?

A
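  1. If the arrays differ in their number of dimensions, left-pad the shape of the one with fewer dimensions with 1s.
  2. If the shapes do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the other.
  3. If the sizes disagree in any dimension and neither is 1, an error is raised.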
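
The rules above can be sketched with two small cases, one that broadcasts and one that fails. The shapes are chosen only for illustration:

```python
import numpy as np

col = np.ones((3, 1))    # shape (3, 1)
row = np.arange(3)       # shape (3,)

# Rule 1: row is padded to shape (1, 3).
# Rule 2: (3, 1) and (1, 3) are both stretched to (3, 3).
out = col + row

# Rule 3: (3, 2) against (3,) -> (1, 3); the last dimensions are 2 and 3,
# neither is 1, so NumPy raises a ValueError.
try:
    np.ones((3, 2)) + np.ones(3)
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```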
9
Q

What is fancy indexing?

A

Fancy indexing means passing an array of indices to access multiple array elements at once.
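
A minimal sketch with a NumPy array; the values and index list are arbitrary:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
idx = [0, 3, 4]     # positions to pull out, in the order wanted

picked = x[idx]     # one indexing operation returns all three elements
```

The result has the shape of the index array, not the shape of the array being indexed.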

10
Q

What is central tendency?

A

Central tendency refers to the central position of the data (mean, median, mode), while deviation describes how far the data are spread out from the mean.

11
Q

What is deviation?

A

Deviation is most commonly measured with the standard deviation. A small standard deviation indicates the data are close to the mean, while a large standard deviation indicates that the data are more spread out from the mean.

12
Q

What are descriptive statistics?

A

Descriptive statistics identify and summarize patterns in the data at hand, but they do not support conclusions or hypothesis tests that reach beyond that data.

13
Q

What are inferential statistics?

A

Inferential statistics allow us to use a sample to test hypotheses and make inferences about the population the sample was drawn from.

14
Q

What is a correlation matrix, as computed with pandas' corr method?

A

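A correlation matrix is a table of the pairwise correlations between columns. The values in the table will be between -1 and 1. A value of -1 indicates the strongest possible negative correlation, meaning that as one variable increases the other decreases. A value of 1 indicates the strongest possible positive correlation, meaning the two variables increase together, and a value near 0 indicates little or no linear relationship.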
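
A small sketch using `DataFrame.corr` on made-up columns, one that rises with `x` and one that falls as `x` rises:

```python
import pandas as pd

df = pd.DataFrame({
    "x":    [1, 2, 3, 4],
    "up":   [2, 4, 6, 8],   # increases exactly in step with x
    "down": [8, 6, 4, 2],   # decreases exactly as x increases
})

corr = df.corr()   # pairwise Pearson correlation between the columns
```

In this constructed case `corr.loc["x", "up"]` is 1 and `corr.loc["x", "down"]` is -1 (up to floating-point rounding); real data will usually fall somewhere in between.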

15
Q

Descriptive Statistics

A

Descriptive statistics are a collection of statistical tools which are used to quantitatively describe or summarize a collection of data. Descriptive statistics aim to summarize, and as such can be distinguished from inferential statistics, which are more predictive in nature.

16
Q

Population

A

A population is the full set of members of a certain group of interest.

17
Q

Sample

A

A sample is a subset drawn from a larger population. If this drawing is done in such a manner that each member of the population has an equal chance of selection, the result is referred to as a random sample.

18
Q

Statistic

A

A statistic is a value which is generated from a sample. If I calculated the mean age of a subset of humans on planet Earth (much more feasible than measuring every human), this value would be a statistic. Hence, the discipline of statistics.
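
A rough sketch of a statistic computed from a random sample. The population here is simulated with made-up ages, and the seed is fixed only so the sketch is repeatable:

```python
import random
import statistics

random.seed(0)

# Simulated "population" of ages; purely illustrative, not real data.
population = [random.randint(0, 90) for _ in range(10_000)]

# Random sample: every member has the same chance of being drawn.
sample = random.sample(population, 100)

population_mean = statistics.mean(population)   # usually unknown in practice
sample_mean = statistics.mean(sample)           # the statistic
```

The sample mean is typically close to, but not exactly equal to, the population mean.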

19
Q

Generalizability

A

Generalizability refers to the ability to draw conclusions about the characteristics of the population as a whole based on the results of data collected from a sample. This ability is not a given, and depends heavily on the nature of sample collection, sample size, and various other factors.

20
Q

Distribution

A

A distribution is the arrangement of data by the values of one variable in order, from low to high. This arrangement, and its characteristics such as shape and spread, provide information about the underlying sample.

21
Q

Mean

A

Mean, along with median and mode, is one of the 3 major measures of central tendency, which collectively evaluate an important and basic aspect of a distribution. The simple arithmetic average of a distribution of variable values (or scores), the mean provides a single, concise numerical summary of a distribution. The mean is also likely the most common statistic encountered in general research. Population mean is denoted μ, while sample mean is denoted x̄.

22
Q

Median

A

The median is the score of a distribution residing at the 50th percentile, separating the top and bottom 50 percent of scores. The median is useful for both splitting a set of distribution scores in half and helping to identify the skew of a distribution.

23
Q

Mode

A

The mode is simply the score which appears most frequently in the distribution. Multimodal refers to a distribution with more than one mode; bimodal refers to a distribution with 2 modes.
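
The three measures of central tendency can be sketched with Python's standard `statistics` module. The scores are arbitrary example values:

```python
import statistics

scores = [1, 2, 2, 3, 7]

mean = statistics.mean(scores)       # arithmetic average
median = statistics.median(scores)   # middle score once sorted
mode = statistics.mode(scores)       # most frequent score

# multimode returns every score tied for most frequent (Python 3.8+).
modes = statistics.multimode([1, 1, 2, 2, 3])   # a bimodal example
```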

24
Q

Skew

A

When there are more scores toward one end of the distribution than the other, this results in skew. When the scores of a distribution are more clustered at the high end, the relatively few scores on the low end form a tail, and the scenario is referred to as negative skew. Positive skew is when a distribution shows a tail at its high end.

In general, in a negatively skewed distribution, we would expect the mean to be less than the median, while in a positively skewed distribution, we would expect the mean to be greater than the median.
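
A quick check of the mean versus median relationship on two small made-up data sets, one with a tail at the high end and one with a tail at the low end:

```python
import statistics

high_tail = [1, 2, 2, 3, 3, 3, 20]        # tail at the high end (positive skew)
low_tail = [1, 18, 18, 19, 19, 19, 20]    # tail at the low end (negative skew)

# The extreme score pulls the mean toward the tail; the median barely moves.
positive_mean_above_median = statistics.mean(high_tail) > statistics.median(high_tail)
negative_mean_below_median = statistics.mean(low_tail) < statistics.median(low_tail)
```

This is a tendency rather than a guarantee; unusual distributions can break the pattern.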

25
Q

Range

A

One of the most important measures of dispersion, the range is the difference between the maximum and minimum values of a distribution.

26
Q

Variance

A

Variance is the average of the squared deviations of scores from the mean of a distribution. Variance is not often used on its own, but can be a useful calculation on the way to a more descriptive statistical measurement, such as standard deviation.

27
Q

Standard Deviation

A

The standard deviation of a distribution is the square root of the variance and can be read as the typical deviation between individual distribution scores and the distribution’s mean. Individually, the standard deviation provides a good measure of how spread out a distribution’s scores are. When considered alongside the mean, these 2 measures provide a good overview of the distribution of scores.
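
A small sketch relating variance and standard deviation, using the population versions from Python's `statistics` module. The scores are example values chosen so the numbers come out round:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(scores)            # 5
variance = statistics.pvariance(scores)   # average squared deviation from the mean
std_dev = statistics.pstdev(scores)       # square root of the variance
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by n - 1 instead of n and give slightly larger values.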

28
Q

Interquartile Range (IQR)

A

The IQR is the difference between the score delineating the 75th percentile and the 25th percentile, the third and first quartiles, respectively.
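
A minimal sketch of the IQR using `numpy.percentile` with its default linear interpolation. Other quartile conventions exist and can give slightly different values on small data sets:

```python
import numpy as np

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

q1, q3 = np.percentile(scores, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # spread of the middle 50% of scores
```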