Chapter 1- PART 1 (Data Analysis) Flashcards by Maryam Fatima

What tools are commonly used for summarizing data?

Graphs: Bar charts, pie charts, histograms, scatterplots, and time series graphs.
Numeric Summary Measures: Counts, percentages, averages, and measures of variability (e.g., variance, standard deviation).
Tables: Summary tables that display totals, averages, and counts grouped by categories.
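
As a minimal sketch of the numeric summary measures listed above, the following uses Python's standard `statistics` module. The income values and variable names are made up for illustration.

```python
import statistics

# Hypothetical incomes (in thousands) for five observations.
incomes = [42, 55, 61, 48, 54]

count = len(incomes)
average = statistics.mean(incomes)
# Sample variance and standard deviation (both divide by n - 1).
variance = statistics.variance(incomes)
std_dev = statistics.stdev(incomes)

print(count, average, variance, round(std_dev, 2))
```

If the data were a full population rather than a sample, `statistics.pvariance` and `statistics.pstdev` would be the matching functions.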

What are the four typical steps in the data analysis process?

Recognizing a Problem: Identifying issues that need to be addressed (e.g., decreased sales).
Gathering Data: Collecting relevant data through surveys, existing systems, or other sources.
Analyzing Data: Using analytical tools to interpret the collected data.
Acting on Analysis: Implementing changes based on the analysis, which may include policy adjustments or new initiatives.

What is a population in statistics?

A population is the complete set of all entities of interest in a study. It includes every possible member that fits the criteria you’re studying.

Studying an entire population is often impractical because of time, cost, and logistics.

What is a sample, and why do we use samples instead of populations?

A sample is a subset of the population that is used to collect data. Sampling allows researchers to make estimates or generalizations about the population without needing to gather data from every member.

A sample must be representative of the population to avoid biased results.
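
One way to picture sampling is a simple random sample drawn from a known population. The sketch below is only an illustration: the population of ages is generated at random, and the sample size of 50 is an arbitrary assumption.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up values).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 90) for _ in range(1000)]

# Simple random sample of 50 members, drawn without replacement.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)  # estimate of the population mean

print(f"population mean: {population_mean:.1f}, sample estimate: {sample_mean:.1f}")
```

Because every member has the same chance of selection, the sample mean is usually close to the population mean, though it will rarely match it exactly.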

What is a dataset?

A dataset is an organized collection of data, typically arranged in a rectangular table format:

Columns represent variables (characteristics like age, income, etc.).
Rows represent observations (individual members, like a person or a company).
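
A small sketch of this rectangular layout, using a plain Python list of dictionaries. The names and values are hypothetical.

```python
# Hypothetical dataset: each dict is one observation (row),
# and each key is one variable (column).
dataset = [
    {"name": "Ana", "age": 34, "income": 52000},
    {"name": "Ben", "age": 45, "income": 61000},
    {"name": "Cleo", "age": 29, "income": 48000},
]

variables = list(dataset[0].keys())      # column names
n_observations = len(dataset)            # number of rows
ages = [row["age"] for row in dataset]   # one column, across all rows

print(variables, n_observations, ages)
```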

What are variables in a dataset?

Variables are the characteristics or attributes that are measured in a study. In a dataset, they are represented by columns.

What are observations in a dataset?

An observation is a single instance or row in a dataset. Each observation contains data for all the variables measured on one subject.

Why is the distinction between population and sample not always important when analyzing datasets?

The distinction between population and sample becomes crucial when you want to generalize findings beyond your dataset (such as making predictions or decisions). However, if your focus is strictly on analyzing the data at hand without making broader generalizations, whether the data comes from a full population or a sample doesn’t immediately affect the analysis process.

What are the 3 types of data?

Numeric
Categorical
Date

A variable is numeric if meaningful arithmetic can be performed on it.

What are numeric variables?

Numeric Variables are numbers that represent quantities or measurements where mathematical operations like addition, subtraction, or averaging make sense.

For example, age, income, or temperature are numeric because it makes sense to perform calculations on them.

What are categorical variables?

Categorical Variables represent categories or groups that label data points. These can be numbers, but their purpose is to identify rather than quantify.

For example, eye color, brands, or types of fruit are categorical variables. Even when categories are coded as numbers, mathematical operations on those codes don’t make sense.

Why are phone numbers, zip codes, and Social Security numbers categorical?

Phone numbers are used to identify specific lines, not to measure or quantify anything.
Zip codes represent geographic locations, not quantities.
SSNs (Social Security Numbers) are unique identifiers for individuals within a country.
Credit card numbers uniquely identify an account, but they are not for calculations.
ISBNs (International Standard Book Number) are used to identify books.
IP addresses identify devices on a network.
Flight numbers label specific airline routes.

Their primary role is identification, not measurement, so they are categorical data.
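
A short sketch of why such identifiers are treated as labels rather than quantities. The zip codes below are only sample values for illustration.

```python
from collections import Counter

# Hypothetical zip codes, stored as text because they are labels.
zip_codes = ["02139", "10001", "02139", "94105"]

# Counting is meaningful for categorical data.
counts = Counter(zip_codes)

# Storing a zip code as a number would also lose information:
# int("02139") drops the leading zero.
as_number = int("02139")

print(counts["02139"], as_number)
```

Averaging these values would run without error if they were stored as numbers, but the result would not describe anything real.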

What are dummy variables?

Dummy variables allow categorical data to be used in numerical models by converting categories into 0s and 1s (a series of binary values).
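
A minimal sketch of dummy coding in plain Python, assuming a hypothetical `regions` column. The `region_` prefix is an illustrative naming choice.

```python
# Hypothetical categorical column.
regions = ["North", "South", "East", "South"]

# One 0/1 column per category, in a fixed (sorted) order.
categories = sorted(set(regions))

dummies = [
    {f"region_{c}": int(value == c) for c in categories}
    for value in regions
]

print(dummies[0])
```

In regression models, one of the dummy columns is commonly dropped so that the remaining columns are not perfectly collinear.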

What is binning?

Binning is a data preprocessing technique where continuous numerical data is grouped into discrete intervals, or “bins.” This helps to simplify data, reduce noise, and make patterns more visible in analyses, especially when dealing with large ranges of values.

Binning turns continuous data into categorical data by grouping values into ranges such as 0-10.
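
The idea can be sketched with the standard `bisect` module. The ages, bin edges, and labels below are assumptions chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical ages and bins; each edge is the first value of the next bin.
ages = [3, 17, 25, 42, 67, 10]
edges = [11, 21, 41]
labels = ["0-10", "11-20", "21-40", "41+"]

def bin_label(value):
    """Return the label of the interval that contains value."""
    return labels[bisect_right(edges, value)]

binned = [bin_label(a) for a in ages]
print(Counter(binned))
```

Once binned, the values can be summarized like any other categorical variable, for example by counting how many observations fall in each bin.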

What is VLOOKUP (Vertical Lookup)?

VLOOKUP (Vertical Lookup) is a function in Excel that allows you to search for a specific value in the first column of a table and return a corresponding value from a specified column in the same row.

How does VLOOKUP work?

VLOOKUP stands for Vertical Lookup because it searches for values vertically down the first column of the table.

VLOOKUP syntax:
VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

lookup_value: The value you want to find (e.g., a specific product name or ID).
table_array: The range of cells where VLOOKUP will search for the lookup_value and where the data is stored (e.g., a range starting at cell A1).
col_index_num: The column number in the table_array from which to retrieve the value (e.g., if you want the value from the second column, you’d enter 2).
range_lookup: A logical value (TRUE or FALSE) that specifies whether you want an exact match (FALSE) or an approximate match (TRUE). Usually, you use FALSE to ensure you get the exact value.
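
To make the four arguments concrete, here is a rough Python analogue of the lookup logic. It is a sketch, not how Excel implements the function: the `vlookup` helper and both tables are hypothetical, and it returns `None` where Excel would show #N/A.

```python
from bisect import bisect_right

def vlookup(lookup_value, table, col_index_num, range_lookup=True):
    """Rough Python analogue of VLOOKUP over a list of rows.

    col_index_num is 1-based, as in Excel. Returns None when nothing
    matches (Excel would show #N/A instead).
    """
    if not range_lookup:
        # Exact match: first row whose first column equals lookup_value.
        for row in table:
            if row[0] == lookup_value:
                return row[col_index_num - 1]
        return None
    # Approximate match: assumes the first column is sorted ascending and
    # picks the last row whose key is <= lookup_value.
    keys = [row[0] for row in table]
    position = bisect_right(keys, lookup_value) - 1
    return table[position][col_index_num - 1] if position >= 0 else None

# Hypothetical tables.
products = [["P001", "Widget", 4.50], ["P002", "Gadget", 12.00]]
grades = [[0, "F"], [60, "D"], [70, "C"], [80, "B"], [90, "A"]]

print(vlookup("P002", products, 3, range_lookup=False))
print(vlookup(85, grades, 2))
```

The approximate branch shows why the first column needs to be sorted in ascending order when range_lookup is TRUE, which is also the value Excel assumes when the argument is omitted.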

What are the two types of datasets?

Cross-sectional
Time series

What is cross-sectional data?

Cross-sectional data is data collected from multiple subjects (such as people, companies, or countries) at a single point in time or over a very short period. It provides a “snapshot” of the situation at that specific moment.

It is used to compare and analyze differences between subjects.

What is time series data?

Time series data involves tracking one or more variables over a sequence of time periods. It captures how something changes over time, focusing on temporal trends rather than comparing different subjects.

It is used to study trends, make forecasts, and understand how variables evolve.

What are the key differences between cross-sectional and time series data?

Data Collection:
* Cross-Sectional: Collected at a single point in time from multiple subjects.
* Time Series: Collected at multiple time points, focusing on the same variable(s) over time.
Focus of analysis:
* Cross-Sectional: Analyzes differences or relationships between subjects at one point in time.
* Time Series: Analyzes patterns, trends, and changes in data over time.
Applications:
* Cross-Sectional: Useful for comparing groups, identifying relationships, or describing a population at a specific moment.
* Time Series: Useful for identifying trends, detecting seasonality, and making predictions about the future.
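
The contrast above can be sketched with two tiny made-up datasets. The company names and revenue figures are hypothetical.

```python
# Hypothetical cross-sectional data: several companies, one point in time.
cross_section = {"Acme": 120, "Globex": 95, "Initech": 143}  # 2023 revenue

# Hypothetical time series: one company, several periods in order.
time_series = [("2021", 100), ("2022", 110), ("2023", 120)]  # Acme revenue

# Cross-sectional question: which subject is largest at that moment?
largest = max(cross_section, key=cross_section.get)

# Time series question: how did the value change between periods?
changes = [
    later[1] - earlier[1]
    for earlier, later in zip(time_series, time_series[1:])
]

print(largest, changes)
```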

How can categorical variables be summarized, and why is counting important?

Summarizing categorical variables involves counting the occurrences of each category and presenting these counts as raw numbers or percentages. Since categorical variables represent groups or labels rather than quantities, arithmetic operations are inappropriate, making counting the most straightforward and meaningful way to summarize these variables.

What Are Categorical Variables?

Categorical variables represent data that can be divided into distinct groups or categories. They do not hold numerical values that have a meaningful order or scale but instead describe characteristics or attributes like gender, region, or opinion.

What are the steps for summarizing categorical variables?

Count the Number of Categories: Determine how many unique categories exist, such as two for Gender (Male, Female) or more for Region (North, South, East, West).
Name the Categories: Assign descriptive labels to each category. If numerical codes are used (e.g., 1 for Male, 2 for Female), provide text descriptions to enhance clarity.
Count Observations in Each Category: Count how many observations fall into each category. For example, in a dataset with 1,000 survey responses, there might be 560 Males and 440 Females.
Report Counts and Percentages:
* Raw Counts: Simply list the number of observations in each category.
* Percentages: Convert counts into percentages to show the relative frequency (e.g., 56% Male, 44% Female).
Display Data Graphically:
* Column Chart: Shows frequency with bars representing counts or percentages for each category.
* Pie Chart: Visualizes proportions of each category as slices of a circle, illustrating their share of the whole.
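
The counting steps above can be sketched with `collections.Counter`. The coded responses are generated to match the 560 / 440 example, and the `code_labels` mapping is a hypothetical codebook.

```python
from collections import Counter

# Hypothetical coded survey responses: 1 = Male, 2 = Female,
# matching the 560 / 440 split used in the example above.
responses = [1] * 560 + [2] * 440
code_labels = {1: "Male", 2: "Female"}

# Name the categories, then count the observations in each one.
counts = Counter(code_labels[code] for code in responses)
total = sum(counts.values())
percentages = {label: 100 * n / total for label, n in counts.items()}

for label in counts:
    print(f"{label}: {counts[label]} ({percentages[label]:.0f}%)")
```

The resulting counts or percentages are what a column chart or pie chart would then display.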

Why is it important to use descriptive labels when summarizing coded categorical data?

Descriptive labels enhance the clarity of the data by replacing numeric codes with meaningful descriptions. For instance, instead of showing “1” for “Strongly Agree,” using the actual term helps readers quickly understand the data without needing a codebook. This practice is especially helpful in reports and presentations where clear communication is essential.

Why is it useful to report both counts and percentages?

Reporting both counts and percentages provides a fuller picture of the data. Counts show the exact number of observations in each category, while percentages make it easy to compare the relative sizes of the groups, especially when dealing with varying totals. This dual reporting helps readers understand both the scale and the distribution of the data.
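
A brief sketch of why both figures help when totals differ. The two surveys and their numbers are invented for illustration.

```python
# Hypothetical counts of "Yes" answers in two surveys of different sizes.
surveys = {"Survey A": (150, 200), "Survey B": (300, 1000)}  # (yes, total)

shares = {name: 100 * yes / total for name, (yes, total) in surveys.items()}

for name, (yes, total) in surveys.items():
    print(f"{name}: {yes} of {total} ({shares[name]:.0f}%)")
```

Survey B has the larger count of "Yes" answers, but Survey A has the larger share, so reporting only one of the two figures would hide part of the picture.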

Chapter 1- PART 1 (Data Analysis) Flashcards

(26 cards)