Week 1- Analyzing Categorical Data, Displaying Quantitative Data With Graphs, and Describing Quantitative Data With Numbers Flashcards
Individual
The objects described by a set of data. Individuals can be people, animals, or things.
(Think NOUNS!) Individual data is the data that can be associated with a single element in a sample. In a survey that collects data from 26 people (Person A, Person B, Person C, and so on up to Person Z), the individual data for Person A is everything recorded about Person A, and likewise for each other person.
Variable
A characteristic of an individual. Variables can have different values for different individuals. A variable is any characteristic or attribute that can be measured or counted and that can vary between the individuals or observations in the population being studied.
Categorical Variable
Places individuals in groups or categories. (Can also be called a Qualitative Variable.) The values are usually labels or names rather than numbers. Some categorical variables have no inherent order between the categories, such as gender (male, female) or eye color (blue, brown, green); others have a natural order, such as shirt size (small, medium, large).
Quantitative Variable
Takes numerical values for which it makes sense to find an average. A quantitative variable can be measured and describes a quantity rather than a quality; examples include height, weight, age, and temperature.
Distribution
What values a variable takes and how often it takes these values. (Think domain or range.) A distribution can be read as a function that gives the possible values of a variable and how often each one occurs.
Frequency Table
Displays the counts of categorical data. The table lists each value or category and the number of times it occurs in the data set, giving a clear summary of the distribution.
Relative Frequency Table
Displays the percent of categorical data in each category. Each entry is the proportion (usually expressed as a percentage) of observations in a category, calculated by dividing the frequency of that category by the total number of data points. Because it shows how often something occurs relative to the whole data set, it allows comparison between categories even when sample sizes differ.
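The two table definitions above can be sketched in a few lines of Python. This is only an illustration: the `eye_colors` list is made-up data, and the variable names are not from the flashcards.

```python
from collections import Counter

# Hypothetical sample (made-up data): eye colors of 10 individuals.
eye_colors = ["brown", "blue", "brown", "green", "brown",
              "blue", "brown", "green", "blue", "brown"]

# Frequency table: the count of individuals in each category.
frequency = Counter(eye_colors)

# Relative frequency table: each count divided by the total number of observations.
total = len(eye_colors)
relative_frequency = {category: count / total
                      for category, count in frequency.items()}

for category, count in frequency.items():
    print(f"{category:<6} count={count}  percent={relative_frequency[category]:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no category was missed.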
Bar Graph
A graph which displays the frequency or relative frequency of categorical data. Each category is displayed as a rectangular bar whose length (or height) is proportional to the value it represents, which allows for easy comparison between categories.
Pie Chart
A chart which displays the relative frequency of categorical data as a portion of a circle. The size of each slice is proportional to the share of the whole that its category represents, so the chart shows how a total amount is divided among categories and lets you compare the parts of a whole.
Pictograph
A chart which is similar to a bar graph but which displays the data using an icon or picture, rather than a bar. It uses picture symbols to illustrate statistical information. Because it is often harder to read values precisely from a pictograph, pictographs should be used carefully to avoid misrepresenting data, either accidentally or deliberately.
Two-Way Table
Describes two categorical variables. It displays the frequencies for two categorical variables, with one variable represented by the rows and the other by the columns, so you can see how the categories of each variable relate to each other. It is also known as a contingency table.
Marginal Distribution
For one of the variables in a two-way table, the distribution of the values of that variable among all of the individuals included in the table. (Generally, it will be the subtotals on the side or bottom of the table.) It describes a single variable on its own, without reference to the other variable, and is calculated by summing the frequencies across the categories of the other variable, which is why it sits “on the margin” of the table.
Conditional Distribution
Describes the values of one variable for the individuals who have a specific value of another variable. It shows how the first variable is distributed within the “sub-population” defined by that condition, for example among only the individuals in a single row or column of a two-way table.
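As a rough sketch of how a two-way table, its marginal distribution, and a conditional distribution relate, the Python below works through a made-up survey. The data, the variable names, and the helper `conditional_distribution` are all hypothetical.

```python
from collections import Counter

# Hypothetical survey (made-up data): (class year, answer) for 8 individuals.
responses = [
    ("freshman", "yes"), ("freshman", "no"), ("freshman", "no"), ("freshman", "no"),
    ("senior", "yes"), ("senior", "yes"), ("senior", "yes"), ("senior", "no"),
]

# Two-way table: the count for each (row category, column category) pair.
two_way = Counter(responses)

# Marginal distribution of class year: sum the counts across all answers.
marginal_year = Counter(year for year, _answer in responses)

def conditional_distribution(year):
    """Distribution of the answer among only the individuals in one class year."""
    total = marginal_year[year]
    return {answer: two_way[(year, answer)] / total for answer in ("yes", "no")}

print(conditional_distribution("freshman"))  # {'yes': 0.25, 'no': 0.75}
print(conditional_distribution("senior"))    # {'yes': 0.75, 'no': 0.25}
```

In this made-up data the conditional distributions differ between the two class years, which is the kind of pattern that the term association describes.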
Association
When specific values of one variable tend to occur with specific values of another variable. The two variables are related: when one changes, the other tends to change as well, so knowing the value of one provides information about the value of the other. An association does not necessarily imply causation.
Dotplot
A graph in which each data value is displayed as a dot above its location on a number line. Each dot represents a single data point, and stacked dots indicate repeated values. It makes the distribution easy to see, including clusters, gaps, and outliers, and is particularly useful for smaller data sets.
Symmetric
A distribution is symmetric if the right and left sides are approximately mirror images of each other. The values on either side of the center are roughly equally balanced, so if you split the data down the middle, both sides would look almost identical. In a symmetric distribution the mean and median are very close in value, and in one with a single peak the mode is close to them as well.
Skewed
A distribution is skewed if one side is much longer than the other side. The distribution is not symmetrical: the data points cluster toward one side, and one “tail” is longer than the other, giving the graph a lopsided appearance. A positively skewed (right-skewed) distribution has the longer tail on the right, while a negatively skewed (left-skewed) distribution has the longer tail on the left.
Unimodal
Having a single peak. A unimodal distribution or data set has only one peak or mode: a single highest point on a graph of the data, around which the values are most concentrated.
Bimodal
Having two clear peaks. (Multimodal would be having many clear peaks.) A bimodal distribution has two distinct peaks, which often indicates the presence of two different groups with different characteristics within the data. The prefix “bi” means “two” and “modal” refers to the mode (most frequent value).
Stemplot
A graph in which the digits of each value are separated, with the higher-order digits forming a vertical axis and the last digit listed as individual elements, similar to a dotplot. Also called a stem-and-leaf plot, it splits each data point into a “stem” (usually the first digit or digits) and a “leaf” (usually the last digit), which makes it easy to see the overall shape of the data set, including clusters and outliers.
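A minimal sketch of building a stemplot in Python, assuming non-negative whole numbers where the stem is everything except the last digit. The function `stemplot` and the `scores` list are hypothetical, made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative integers (stem = all but the last digit)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)  # split into stem and leaf
    rows = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows

# Hypothetical test scores (made-up data).
scores = [72, 85, 91, 68, 85, 77, 94, 60, 88]
for row in stemplot(scores):
    print(row)
# 6 | 08
# 7 | 27
# 8 | 558
# 9 | 14
```

Sorting the values first keeps the leaves in increasing order within each stem, which is the usual convention.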
Histogram
A graph for numerical data in which values are grouped and the count for each group is graphed with adjacent bars. The data is divided into ranges called “bins” along the x-axis, and the height of each bar represents the frequency of data points falling within that bin. It is a visual way to understand the shape and spread of a continuous data set.
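The grouping step behind a histogram can be sketched as follows, assuming equal-width bins that include their left edge and exclude their right edge. The function `histogram_counts`, its parameters, and the `heights` data are hypothetical.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [left, left + bin_width)."""
    counts = {}
    for value in values:
        # Left edge of the bin that contains this value.
        left = start + ((value - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical heights in inches (made-up data).
heights = [61, 64, 65, 67, 68, 70, 70, 72, 73, 79]
bins = histogram_counts(heights, bin_width=5, start=60)

# Text version of the histogram: one row per bin, bar length = frequency.
for left, count in bins.items():
    print(f"[{left}, {left + 5}) {'#' * count}")
```

Changing `bin_width` changes how the same data looks, so it is worth trying more than one width before describing the shape.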
Mean
The average of the observations: the sum of the values divided by the number of values. It is one of the measures of central tendency and can be read as the value each observation would have if the total were shared out equally. The mean is not necessarily the most common value in a collection of numbers; that is the mode.
Median
The midpoint of a distribution. Half of the observations will be larger, and half will be smaller, than the median. It is the middle value in a set of data arranged in ascending order. To find the median: Arrange the data points from smallest to largest. If the number of data points is odd, the median is the middle data point in the list. If the number of data points is even, the median is the average of the two middle data points.
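The steps for finding the median can be sketched in Python, covering both an odd and an even number of data points. The function and the two data lists are made up for illustration; Python's standard `statistics.median` gives the same results.

```python
import statistics

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = [7, 3, 9, 1, 5]        # sorted: 1 3 5 7 9
even_data = [7, 3, 9, 1, 5, 11]   # sorted: 1 3 5 7 9 11

print(median(odd_data))            # 5
print(median(even_data))           # 6.0
print(statistics.mean(odd_data))   # 5, the mean, for comparison
```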
Quartile
Each quartile represents the boundary line between successive quarters of the observations. For example, 25% of the observations are smaller than the 1st quartile. With the data arranged in ascending order, the quartiles divide it into four parts that each contain roughly 25% of the data points: the first (lower) quartile cuts off the lowest 25%, the second quartile is the median, and the third (upper) quartile cuts off the highest 25%.
Interquartile Range (IQR)
The distance between the first quartile and the third quartile of a set of observations (IQR = Q3 - Q1). It tells you the spread of the middle half of your distribution. Quartiles split any distribution that is ordered from low to high into four equal parts, and the IQR spans the second and third of those parts, the middle half of your data set.
Five-Number Summary
The minimum, 1st quartile, median, 3rd quartile, and maximum. Together these five numbers summarize the center and spread of a data set. A boxplot is a graphical device based on the five-number summary.
Boxplot
A graph which depicts the five-number summary. Also called a box-and-whisker plot, it displays the distribution of a data set by showing the minimum value, first quartile, median, third quartile, and maximum value, allowing for easy visual comparison of spread and potential outliers between different groups. The “box” represents the interquartile range (IQR) with the median line within it, and the “whiskers” extend to the minimum and maximum values, excluding outliers, which are often plotted as individual points beyond the whiskers.
Variance
The average of the squared distances between each observation and the mean. It measures how spread out the data points are around the mean, indicating how much variation exists within a data set. Dividing the sum of the squared differences by the number of observations n gives the population variance; the sample variance divides by n - 1 instead.
Standard Deviation
The square root of the variance. It measures how spread out a set of data is, indicating roughly how far a typical data point lies from the mean. A low standard deviation means the data points are clustered close to the mean, while a high standard deviation means the data is dispersed across a wider range.
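As a worked example of the variance and standard deviation definitions, the Python below follows the "average of squared distances" wording, which divides by the number of observations (the population form). The data is made up; the sample form, which divides by n - 1, is shown for comparison using the standard `statistics` module.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = sum(data) / len(data)                        # 40 / 8 = 5.0
squared_distances = [(x - mean) ** 2 for x in data]  # 9, 1, 1, 1, 0, 0, 4, 16

# Population variance: the average of the squared distances from the mean.
population_variance = sum(squared_distances) / len(data)  # 32 / 8 = 4.0

# Standard deviation: the square root of the variance.
population_sd = math.sqrt(population_variance)             # 2.0

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)                # 32 / 7

print(population_variance, population_sd)
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.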