Chapter 1 - Part 1 (Data Analysis) Flashcards
What tools are commonly used for summarizing data?
- Graphs: Bar charts, pie charts, histograms, scatterplots, and time series graphs.
- Numeric Summary Measures: Counts, percentages, averages, and measures of variability (e.g., variance, standard deviation).
- Tables: Summary tables that display totals, averages, and counts grouped by categories.
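As a rough illustration of the numeric summary measures and summary tables listed above, here is a minimal Python sketch. The sales figures, region labels, and variable names are made up for the example; only the standard library is used.

```python
import statistics
from collections import Counter

# Hypothetical data: monthly sales for six stores and each store's region.
sales = [120, 150, 90, 200, 150, 130]
regions = ["East", "West", "East", "West", "East", "West"]

# Counts and percentages summarize a categorical variable.
counts = Counter(regions)
percentages = {region: 100 * n / len(regions) for region, n in counts.items()}

# An average and measures of variability summarize a numeric variable.
mean_sales = statistics.mean(sales)
variance_sales = statistics.variance(sales)  # sample variance (divides by n - 1)
std_dev_sales = statistics.stdev(sales)      # sample standard deviation

# A small summary table: count, total, and average sales grouped by region.
grouped = {}
for region, value in zip(regions, sales):
    grouped.setdefault(region, []).append(value)

table = {
    region: {
        "count": len(values),
        "total": sum(values),
        "average": statistics.mean(values),
    }
    for region, values in grouped.items()
}

print(counts, percentages)
print(mean_sales, variance_sales, std_dev_sales)
print(table)
```

Note that statistics.variance and statistics.stdev treat the data as a sample; statistics.pvariance and statistics.pstdev are the population versions.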
What are the four typical steps in the data analysis process?
- Recognizing a Problem: Identifying issues that need to be addressed (e.g., decreased sales).
- Gathering Data: Collecting relevant data through surveys, existing systems, or other sources.
- Analyzing Data: Using analytical tools to interpret the collected data.
- Acting on Analysis: Implementing changes based on the analysis, which may include policy adjustments or new initiatives.
What is a population in statistics?
A population is the complete set of all entities of interest in a study. It includes every possible member that fits the criteria you’re studying.
Studying an entire population is often impractical because of time, cost, and logistics.
What is a sample, and why do we use samples instead of populations?
A sample is a subset of the population that is used to collect data. Sampling allows researchers to make estimates or generalizations about the population without needing to gather data from every member.
A sample must be representative of the population to avoid biased results.
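The idea of estimating a population value from a sample can be sketched in Python. This assumes a simple random sample, where every member has the same chance of being chosen; the population of customer ages is generated purely for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same result each run

# Hypothetical population: ages of 10,000 customers.
population = [random.randint(18, 80) for _ in range(10_000)]

# Simple random sample of 200 members, drawn without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

# The sample mean is an estimate of the population mean.
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will usually be close but not identical; the gap is sampling error, and it tends to shrink as the sample size grows.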
What is a dataset?
A dataset is an organized collection of data, typically arranged in a rectangular table format:
- Columns represent variables (characteristics like age, income, etc.).
- Rows represent observations (individual members, like a person or a company).
What are variables in a dataset?
Variables are the characteristics or attributes that are measured in a study. In a dataset, they are represented by columns.
What are observations in a dataset?
An observation is a single instance or row in a dataset. Each observation contains data for all the variables measured on one subject.
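One way to picture a dataset, its variables, and its observations is a list of records in Python. The names and values below are hypothetical.

```python
# Hypothetical dataset: each dict is one observation (a row),
# and each key is a variable (a column).
dataset = [
    {"name": "Ana", "age": 34, "income": 52000},
    {"name": "Ben", "age": 45, "income": 61000},
    {"name": "Cho", "age": 29, "income": 48000},
]

variables = list(dataset[0].keys())       # the column names
n_observations = len(dataset)             # the number of rows
ages = [row["age"] for row in dataset]    # one variable across all observations
second_observation = dataset[1]           # all variables for one subject

print(variables)
print(n_observations)
print(ages)
print(second_observation)
```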
Why is the distinction between population and sample not always important when analyzing datasets?
The distinction between population and sample becomes crucial when you want to generalize findings beyond your dataset (such as making predictions or decisions). However, if your focus is strictly on analyzing the data at hand without making broader generalizations, whether the data comes from a full population or a sample doesn’t immediately affect the analysis process.
What are the 3 types of data?
- Numeric
- Categorical
- Date
A variable is numeric if meaningful arithmetic can be performed on it.
What are numeric variables?
Numeric Variables are numbers that represent quantities or measurements where mathematical operations like addition, subtraction, or averaging make sense.
For example, age, income, or temperature are numeric because it makes sense to perform calculations on them.
What are categorical variables?
Categorical Variables represent categories or groups that label data points. These can be numbers, but their purpose is to identify rather than quantify.
For example, eye color, brands, or types of fruit are categorical variables. Even when categories are coded as numbers, mathematical operations on those codes don't make sense.
Why are phone numbers, zip codes, Social Security numbers, and similar identifiers categorical?
- Phone numbers are used to identify specific lines, not to measure or quantify anything.
- Zip codes represent geographic locations, not quantities.
- SSNs (Social Security Numbers) are unique identifiers for individuals within a country.
- Credit card numbers uniquely identify an account, but they are not for calculations.
- ISBNs (International Standard Book Number) are used to identify books.
- IP addresses identify devices on a network.
- Flight numbers label specific airline routes.
Their primary role is identification, not measurement, so they are categorical data.
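A short Python sketch shows why identifiers like these behave as labels. The zip codes are sample values chosen for illustration, stored as strings.

```python
from collections import Counter

# Hypothetical zip codes, stored as strings because they are labels.
zip_codes = ["02139", "10001", "02139", "94105"]

# Counting how often each category occurs is meaningful.
zip_counts = Counter(zip_codes)

# Treating the codes as numbers is not meaningful: the "average zip code"
# does not describe any place, and converting to int drops the leading
# zero of "02139".
as_numbers = [int(z) for z in zip_codes]
meaningless_average = sum(as_numbers) / len(as_numbers)

print(zip_counts)
print(as_numbers[0])          # the leading zero is gone
print(meaningless_average)    # a number with no interpretation
```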
What are dummy variables?
Dummy variables allow categorical data to be used in numerical models by converting categories into 0s and 1s (a series of binary values).
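A minimal Python sketch of the conversion described above: each distinct category becomes its own column of 0s and 1s. The make_dummies helper and the eye-color data are hypothetical, written for this example rather than taken from any library.

```python
def make_dummies(values):
    """Return one 0/1 column per distinct category (hypothetical helper)."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }

eye_color = ["brown", "blue", "green", "brown"]
dummies = make_dummies(eye_color)

for category, column in dummies.items():
    print(category, column)
```

Each observation has a 1 in exactly one of the columns, the one matching its category.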
What is binning?
Binning is a data preprocessing technique where continuous numerical data is grouped into discrete intervals, or “bins.” This helps to simplify data, reduce noise, and make patterns more visible in analyses, especially when dealing with large ranges of values.
Binning turns continuous data into categorical data by grouping values into ranges such as 0-10.
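Binning can be sketched in a few lines of Python. The bin edges, labels, and bin_value helper are assumptions made for the example, and the values are assumed to be non-negative.

```python
import bisect

# Assumed bin edges: [0, 10), [10, 20), [20, 30), and 30 or more.
edges = [10, 20, 30]
labels = ["0 to <10", "10 to <20", "20 to <30", "30+"]

def bin_value(x):
    """Return the label of the bin that x falls into (hypothetical helper)."""
    # bisect_right counts how many edges are <= x, which is the bin index.
    return labels[bisect.bisect_right(edges, x)]

values = [3, 10, 19.5, 27, 42]
binned = [bin_value(v) for v in values]

print(binned)
```

A value that sits exactly on an edge, such as 10, goes into the bin that starts at that edge.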
What is VLOOKUP (Vertical Lookup)?
VLOOKUP (Vertical Lookup) is a function in Excel that allows you to search for a specific value in the first column of a table and return a corresponding value from a specified column in the same row.
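To make the lookup logic concrete, here is a rough Python sketch of what an exact-match VLOOKUP does. The vlookup function and the product table are hypothetical and only imitate the behavior; this is not how Excel implements it.

```python
def vlookup(lookup_value, table, col_index):
    """Exact-match sketch: search the first column of each row and
    return the value from column col_index (1-based, as in Excel)."""
    for row in table:
        if row[0] == lookup_value:
            return row[col_index - 1]
    return None  # Excel would show #N/A when no match is found

# Hypothetical table: product ID, product name, unit price.
products = [
    ("P001", "Widget", 4.50),
    ("P002", "Gadget", 12.00),
    ("P003", "Gizmo", 7.25),
]

price = vlookup("P002", products, 3)
name = vlookup("P003", products, 2)
missing = vlookup("P999", products, 2)

print(price, name, missing)
```

If the same table occupied cells A2:C4 in a worksheet, the first lookup would correspond to the formula =VLOOKUP("P002", A2:C4, 3, FALSE), where FALSE requests an exact match.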