Intro to Statistics Flashcards

1
Q

Numerical Data

A

Also known as quantitative data. Consists of numbers and can be divided into two types: discrete and continuous.

2
Q

Discrete data

A

Type of numerical or quantitative data. Includes countable values, often integers.

Examples: Number of patients in a hospital, number of defective products in a batch, number of books on a shelf.

Characteristics: Discrete data often arise from counting.

3
Q

Continuous Data

A

Includes measurable values that can take any value within a given range.

Examples: Height, weight, temperature, time, distance.

Characteristics: Continuous data often arise from measurements and can have an infinite number of possible values within a range.

4
Q

Categorical Data

A

Also known as qualitative data. Consists of categories or groups. It can be divided into two types: nominal and ordinal.

5
Q

Nominal Data

A

Represents categories that do not have a specific order or ranking.

Examples: Gender, blood type, types of animals.

Characteristics: Nominal data are purely labels and do not imply any sort of order.

6
Q

Ordinal Data

A

Represents categories that have a specific order or ranking.

Examples: Education level, satisfaction rating, severity of pain.

Characteristics: Ordinal data indicate a meaningful order among categories, but the intervals between categories are not necessarily equal.

7
Q

Key Differences of Numerical and Categorical Data: Nature of Data

A

Numerical: Involves numbers and quantifiable data

Categorical: Involves categories and qualitative distinctions

8
Q

Key Differences of Numerical and Categorical Data: Subtypes

A

Numerical: Discrete and Continuous.

Categorical: Nominal and Ordinal.

9
Q

Key Differences of Numerical and Categorical Data: Operations

A

Numerical: Can perform mathematical operations (addition, subtraction, etc.)

Categorical: Typically summarized by counts and proportions.
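
A minimal Python sketch of this contrast, using made-up values (the `ages` and `blood_types` lists are illustrative assumptions, not data from this deck): arithmetic is meaningful for numerical data, while categorical data is summarized with counts and proportions.

```python
from collections import Counter

# Hypothetical example data (not from the deck).
ages = [34, 45, 29, 51]                      # numerical
blood_types = ["A", "O", "O", "B", "O"]      # categorical (nominal)

# Numerical data supports arithmetic such as sums and averages.
mean_age = sum(ages) / len(ages)

# Categorical data is summarized by counts and proportions.
counts = Counter(blood_types)
proportions = {k: v / len(blood_types) for k, v in counts.items()}

print(mean_age)
print(counts["O"], proportions["O"])
```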

10
Q

Population

A

The complete set of all possible observations or data points.

Examples: All patients in a hospital, every student in a school, every product produced by a factory.

Use: Populations are used when the goal is to understand or make statements about the entire group.

Parameters: Characteristics of a population (such as mean, standard deviation).

11
Q

Sample

A

A portion of the population selected for analysis.

Examples: A group of 100 patients from the hospital, 50 students from the school, 200 products from the factory.

Use: Samples are used to make estimates or test hypotheses about the population.

Statistics: Characteristics of a sample (such as sample mean, sample standard deviation).

12
Q

Key Differences of Population vs. Sample: Scope

A

Population: Includes all members of a specified group.

Sample: Includes a part of the population.

13
Q

Key Differences of Population vs. Sample: Size

A

Population: Generally large or infinite.

Sample: Manageable or finite.

14
Q

Key Differences of Population vs. Sample: Notation

A

Population: Parameters often denoted by Greek letters (e.g., μ for mean, σ for standard deviation).

Sample: Statistics often denoted by Latin letters (e.g., x̄ for mean, s for standard deviation).
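
To make the notation concrete, here is a small sketch using Python's standard `statistics` module. The `values` list is invented for illustration. The same numbers give a different standard deviation depending on whether they are treated as the whole population (σ, dividing by N) or as a sample (s, dividing by n - 1).

```python
import statistics

# Hypothetical measurements (illustrative only).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating `values` as the whole population: parameters mu and sigma.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)   # divides by N

# Treating `values` as a sample: statistics x-bar and s.
x_bar = statistics.mean(values)
s = statistics.stdev(values)        # divides by n - 1

print(mu, sigma)
print(x_bar, s)
```

The sample version is slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.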

15
Q

Why use Samples?

A

Practicality: Studying an entire population can be time-consuming, expensive, and logistically challenging.

Feasibility: Sometimes it’s impossible to access every member of a population.

Efficiency: Properly selected samples can provide accurate and reliable insights about the population.

16
Q

Sampling Methods

A

Random Sampling: Every member of the population has an equal chance of being selected. This helps ensure the sample is representative of the population.

Stratified Sampling: The population is divided into subgroups (strata) based on specific characteristics, and samples are taken from each stratum.

Systematic Sampling: Every nth member of the population is selected.

Convenience Sampling: Samples are selected based on ease of access. This method is less reliable but often used for exploratory research.
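
The four methods can be sketched in Python as follows. This is only an illustration under assumed inputs: the `population` of 100 members, the `ward` strata, the 10% sampling fraction, and the interval `n = 10` are all hypothetical choices, not part of the deck.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 member IDs, each tagged with a stratum.
population = [{"id": i, "ward": "A" if i < 60 else "B"} for i in range(100)]

# Simple random sampling: every member has an equal chance of selection.
simple = rng.sample(population, 10)

# Stratified sampling: sample within each stratum, here proportionally (10%).
stratified = []
for ward in ("A", "B"):
    stratum = [p for p in population if p["ward"] == ward]
    stratified.extend(rng.sample(stratum, len(stratum) // 10))

# Systematic sampling: every nth member after a random start.
n = 10
start = rng.randrange(n)
systematic = population[start::n]

# Convenience sampling: whoever is easiest to reach, here the first 10.
convenience = population[:10]
```

Note that the convenience sample contains only ward A members, which illustrates why that method is less reliable than the others.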

17
Q

Importance of Sampling in Research

A

Using a sample to infer about a population allows researchers to draw conclusions and make predictions without needing to collect data from every member of the population. However, it is crucial that the sample is representative of the population to ensure the validity and reliability of the inferences made.

18
Q

Descriptive Statistics

A

Summarize and organize data to make it easily understandable. These statistics provide simple summaries about the sample and measures. They form the basis of virtually every quantitative analysis of data.

Example: In a study of patients’ blood pressure readings, this type of statistics might report the average (mean) blood pressure, the most common (mode) reading, and the range of readings.

Summary: Focus on summarizing and describing the features of a dataset. Useful for getting a clear picture of the data at hand.

19
Q

Inferential Statistics

A

Make inferences about a population using sample data drawn from it. Instead of summarizing the data itself, inferential statistics help make predictions or generalizations about a population based on a sample of data.

Example: In a clinical trial, this type of statistics might be used to determine whether a new medication significantly lowers blood pressure compared to a placebo, using data from a sample of patients.

Summary: Focus on making generalizations from a sample to a population. Useful for hypothesis testing and estimating population parameters.

20
Q

Predictive Statistics (aka Predictive Analytics)

A

Use statistical models and machine learning techniques to predict future events or outcomes based on historical data. These statistics are often used in data mining, business forecasting, and machine learning applications.

Example: In healthcare, predictive statistics might be used to predict the likelihood of a patient developing a certain disease based on their medical history and other risk factors.

Summary: Focus on using models to predict future outcomes based on past data. Useful for forecasting and making informed decisions.

21
Q

Characteristics of Descriptive Statistics: Purpose

A

To describe and summarize the main features of a dataset.

22
Q

Characteristics of Inferential Statistics: Purpose

A

To infer conclusions about a population based on a sample.

23
Q

Characteristics of Predictive Statistics: Purpose

A

To make predictions about future outcomes based on patterns in historical data.

24
Q

Characteristics of Descriptive Statistics: Methods

A

Include measures of central tendency and measures of variability or spread.

25
Q

Characteristics of Inferential Statistics: Methods

A

Use probability theory to estimate population parameters, test hypotheses, and make predictions.

26
Q

Characteristics of Predictive Statistics: Methods

A

Utilize statistical models and algorithms to forecast future events.

27
Q

Characteristics of Descriptive Statistics: Common Techniques

A

Central Tendency: Mean, median, mode

Dispersion: Range, variance, standard deviation, interquartile range

Data Visualization: Charts, graphs, tables
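
As a rough illustration, the central tendency and dispersion measures above can be computed with Python's standard `statistics` module (`quantiles` needs Python 3.8 or later). The `readings` values are invented for the example, and the quartile method shown is only one of several conventions in use.

```python
import statistics

# Hypothetical systolic blood pressure readings in mmHg (illustrative only).
readings = [118, 120, 120, 125, 130, 135, 142]

# Central tendency
mean = statistics.mean(readings)
median = statistics.median(readings)
mode = statistics.mode(readings)

# Dispersion
data_range = max(readings) - min(readings)
variance = statistics.variance(readings)   # sample variance (n - 1)
std_dev = statistics.stdev(readings)       # sample standard deviation

# Quartiles; the default "exclusive" method is one of several conventions.
q1, q2, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

print(mean, median, mode)
print(data_range, variance, std_dev, iqr)
```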

28
Q

Characteristics of Inferential Statistics: Common Techniques

A

Hypothesis Testing: T-tests, chi-square tests, ANOVA, regression analysis.

Confidence Intervals: Range within which a population parameter is expected to lie.

Sampling Methods: Random sampling, stratified sampling, etc.
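
A hedged sketch of a confidence interval for a mean, using only the Python standard library (`NormalDist` needs Python 3.8 or later). The `sample` values are invented, and the interval uses the normal approximation; with small samples a t-based interval is usually preferred, which this sketch does not implement.

```python
import math
import statistics

# Hypothetical sample of blood pressure reductions in mmHg (illustrative only).
sample = [8.1, 5.4, 7.9, 6.3, 9.0, 4.8, 7.2, 6.6, 8.4, 5.9,
          7.5, 6.1, 8.8, 5.2, 7.0, 6.9, 8.0, 5.7, 7.7, 6.4,
          7.3, 6.0, 8.6, 5.5, 7.1, 6.8, 8.2, 5.8, 7.6, 6.2]

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)
standard_error = s / math.sqrt(n)

# Two-sided 95% interval using the normal approximation (z is about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)
lower = x_bar - z * standard_error
upper = x_bar + z * standard_error

print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

The interval is centered on the sample mean x̄ and narrows as the sample size grows, because the standard error shrinks with the square root of n.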

29
Q

Characteristics of Predictive Statistics: Common Techniques

A

Regression Analysis: Linear regression, logistic regression

Machine Learning Models: Decision trees, neural networks, support vector machines

Time Series Analysis: ARIMA, exponential smoothing.
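
Two of these techniques are simple enough to sketch in plain Python: a least squares line fit (linear regression) and simple exponential smoothing. The `months` and `sales` data, the smoothing factor `alpha=0.5`, and the helper name `exponential_smoothing` are all hypothetical choices made for illustration.

```python
# Hypothetical monthly sales history (illustrative only).
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Ordinary least squares fit of sales = intercept + slope * month.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x
forecast_month_7 = intercept + slope * 7


# Simple exponential smoothing: each level blends the new value with the old level.
def exponential_smoothing(series, alpha):
    level = series[0]
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        smoothed.append(level)
    return smoothed


smoothed_sales = exponential_smoothing(sales, alpha=0.5)

print(slope, intercept, forecast_month_7)
print(smoothed_sales)
```

In practice these models are usually fit with a dedicated library; the hand-written version here is only meant to show what the calculation does.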

30
Q

Example of Descriptive Statistics

A

In a study of patients’ blood pressure readings, descriptive statistics might report the average (mean) blood pressure, the most common (mode) reading, and the range of readings.

31
Q

Example of Predictive Statistics

A

In healthcare, predictive statistics might be used to predict the likelihood of a patient developing a certain disease based on their medical history and other risk factors.

32
Q

Example of Inferential Statistics

A

In a clinical trial, inferential statistics might be used to determine whether a new medication significantly lowers blood pressure compared to a placebo, using data from a sample of patients.

33
Q

Raw Data

A

Also known as primary data. Raw data has not been altered, cleaned, or processed after it has been collected. It is the original, unmodified data gathered directly from a source.

Summary: Unprocessed and original, requiring cleaning and organizing.

34
Q

Processed Data

A

Data that has been cleaned, transformed, and organized to make it suitable for analysis. Involves steps like validation, sorting, aggregation, and normalization.

Summary: Cleaned and transformed, ready for analysis.
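
The processing steps named above (validation, sorting, aggregation, normalization) might look like this in Python. The `raw` survey records, the 1 to 5 score range, and the min-max scaling choice are assumptions made for the sketch.

```python
# Hypothetical raw survey records (illustrative only); scores should be 1-5.
raw = [
    {"respondent": "r3", "score": "4"},
    {"respondent": "r1", "score": "5"},
    {"respondent": "r2", "score": ""},      # missing value
    {"respondent": "r4", "score": "9"},     # out of range
    {"respondent": "r5", "score": "2"},
]

# Validation: keep only records whose score parses to an integer from 1 to 5.
valid = []
for record in raw:
    text = record["score"].strip()
    if text.isdigit() and 1 <= int(text) <= 5:
        valid.append({"respondent": record["respondent"], "score": int(text)})

# Sorting: order records by respondent ID.
valid.sort(key=lambda r: r["respondent"])

# Aggregation: average score across valid responses.
average_score = sum(r["score"] for r in valid) / len(valid)

# Normalization: rescale scores from the 1-5 range to 0-1 (min-max scaling).
for r in valid:
    r["score_normalized"] = (r["score"] - 1) / (5 - 1)

print(valid)
print(average_score)
```

Two of the five raw records are dropped during validation, which is an example of detail being lost in processing.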

35
Q

Raw Data: Characteristics

A

Unprocessed: Data in its original form, without any modifications.

Detailed: Contains all the collected details, including potential errors and irrelevant information.

Unorganized: Often unstructured and may need cleaning or organizing before analysis.

36
Q

Raw Data: Examples

A

Responses from survey participants.

Sensor readings from scientific instruments.

Transaction logs from an online store.

37
Q

Raw Data: Advantages

A

Authentic: Provides a true representation of the collected information.

Comprehensive: Contains all the nuances and details of the original data.

38
Q

Raw Data: Disadvantages

A

Cumbersome: Can be large and unwieldy, requiring significant preprocessing.

Error-Prone: Likely to contain errors, outliers, and irrelevant information.

39
Q

Processed Data: Characteristics

A

Organized: Data is structured and ready for analysis.

Cleaned: Errors and irrelevant information are removed.

Transformed: Data may be aggregated, summarized, or converted into a different format.

40
Q

Processed Data: Examples

A

Average scores of survey responses.

Monthly sales reports.

Cleaned and formatted datasets used in machine learning models.

41
Q

Processed Data: Advantages

A

Usable: Ready for analysis, visualization, and interpretation.

Accurate: Errors and irrelevant information have been removed.

42
Q

Processed Data: Disadvantages

A

Abstracted: Some details of the raw data may be lost in the processing.

Dependent on Method: Quality and usefulness depend on the processing methods used.

43
Q

Primary Data

A

Data collected directly from the source for a specific research purpose. It is original data gathered firsthand by the researcher.

Summary: Collected firsthand for a specific research purpose, used by the researcher for new analysis.

44
Q

Primary Data: Characteristics

A

Original: Collected directly from the source by the researcher.

Specific Purpose: Gathered for a specific research question or objective.

Control: Researcher has control over the data collection process.

45
Q

Primary Data: Examples

A

Data collected from experiments or clinical trials.

Surveys and questionnaires filled out by participants.

Observations recorded by researchers in the field.

46
Q

Primary Data: Advantages

A

Relevant: Specifically collected for the research purpose.

Current: Data is up-to-date and reflects current conditions.

47
Q

Primary Data: Disadvantages

A

Time-Consuming: Can take a lot of time to collect.

Costly: Often requires significant resources to gather.

48
Q

Secondary Data

A

Data that was collected by someone else for a different purpose but is being used by a researcher for a new analysis. It is not collected firsthand by the current researcher.

Summary: Collected by someone else for a different purpose, used by the researcher for new analysis.

49
Q

Secondary Data: Characteristics

A

Pre-Existing: Already collected and available for use.

Broad Purpose: Originally gathered for a different research question or objective.

Accessibility: Researcher uses data collected by others.

50
Q

Secondary Data: Examples

A

Data from government reports and censuses.

Research articles and academic journals.

Historical records and archived data.

51
Q

Secondary Data: Advantages

A

Time-Saving: Readily available, saving time on data collection.

Cost-Effective: Often free or less expensive than collecting primary data.

52
Q

Secondary Data: Disadvantages

A

Less Control: Researcher has no control over how the data was collected.

Potentially Outdated: May not reflect the current conditions or context.

53
Q

Cross-Sectional Data

A

Data collected at a single point in time from multiple subjects or entities. This type of data provides a snapshot of a particular phenomenon at a specific moment.

Summary: Snapshot of multiple subjects at a single time point.

54
Q

Cross-Sectional Data: Characteristics

A

Single Time Point: Data is collected at one specific point in time.

Multiple Subjects: Includes data from different entities, such as individuals, companies, or countries.

Snapshot: Provides a snapshot view of the situation at the moment.

55
Q

Cross-Sectional Data: Examples

A

A survey conducted on a group of people to gather their opinions on a specific topic at a certain point in time.

The demographic information of a population collected in a census.

The financial statements of different companies for a particular fiscal year.

56
Q

Cross-Sectional Data: Advantages

A

Quick to Collect: Since data is collected at one point in time, it is often faster to gather.

Simple Analysis: Often simpler to analyze compared to time series or panel data.

57
Q

Cross-Sectional Data: Disadvantages

A

Limited Insight: Does not provide information about changes over time.

Snapshot View: Only provides a single view of the phenomenon, potentially missing temporal dynamics.

58
Q

Time Series Data

A

Data collected over time, capturing how a particular variable changes at different time points. This type of data helps in understanding trends, patterns, and forecasting future values.

Summary: Data on a single subject collected at multiple time points.

59
Q

Time Series Data: Characteristics

A

Multiple Time Points: Data is collected at regular intervals over a period.

Single Subject: Usually focuses on one subject or entity.

Temporal Order: The order of data points is important as it represents changes over time.

60
Q

Time Series Data: Examples

A

Daily stock prices of a company over a year.

Monthly unemployment rates over several years.

Annual rainfall data for a specific region over decades.

61
Q

Time Series Data: Advantages

A

Trend Analysis: Useful for identifying trends, cycles, and seasonal patterns.

Forecasting: Can be used to make predictions about future values.

62
Q

Time Series Data: Disadvantages

A

Complex Analysis: Requires more sophisticated statistical techniques.

Time-Consuming: Collecting data over a long period can be time-consuming.

63
Q

Panel Data

A

Also known as longitudinal data. It combines elements of both cross-sectional and time series data and consists of multiple subjects measured repeatedly over time.

Summary: Combines cross-sectional and time series data, tracking multiple subjects over multiple time points.

64
Q

Panel Data: Characteristics

A

Multiple Subjects and Time Points: Data is collected from several subjects over multiple time periods.

Rich Information: Provides both cross-sectional and temporal insights.

Complex Structure: Each subject has its own time series of data points.
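
One way to picture this structure is a nested mapping from subject to time point. The sketch below uses invented household income figures (the `panel` dictionary and its values are hypothetical) and shows how a cross-sectional slice and a time series slice both fall out of the same panel.

```python
# Hypothetical panel: annual income (in thousands) per household (illustrative only).
panel = {
    "household_1": {2021: 48, 2022: 50, 2023: 53},
    "household_2": {2021: 61, 2022: 60, 2023: 64},
    "household_3": {2021: 39, 2022: 42, 2023: 41},
}

# Cross-sectional slice: many subjects at a single time point.
cross_section_2022 = {name: series[2022] for name, series in panel.items()}

# Time series slice: one subject across time, in temporal order.
time_series_h1 = [panel["household_1"][year] for year in sorted(panel["household_1"])]

# Panel view: change over time within each subject.
change_2021_to_2023 = {name: s[2023] - s[2021] for name, s in panel.items()}

print(cross_section_2022)
print(time_series_h1)
print(change_2021_to_2023)
```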

65
Q

Panel Data: Examples

A

Annual income and expenditure data for a sample of households over several years.

Health records of patients measured at regular intervals over time.

Employee performance metrics tracked quarterly over several years.

66
Q

Panel Data: Advantages

A

Comprehensive Analysis: Allows for the study of dynamics and causal relationships.

Control for Variability: Can control for individual heterogeneity, improving the robustness of statistical analysis.

67
Q

Panel Data: Disadvantages

A

Data Collection: Collecting panel data can be challenging and resource-intensive.

Complexity: Analyzing panel data often requires advanced statistical methods and software.

68
Q

Structured Data

A

Data that is organized and formatted in a way that makes it easily searchable and analyzable by traditional databases and tools. It follows a predefined schema and is usually stored in tabular form with rows and columns.

69
Q

Structured Data: Characteristics

A

Format: Organized in a clear, predictable structure, often in tables.

Schema: Follows a predefined schema or model.

Ease of Access: Easily searchable and analyzable using traditional database tools such as SQL.
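
As a small illustration of why a predefined schema makes data easy to search, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical `customers` table (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lisbon"), ("Ben", "Leeds"), ("Chen", "Lisbon")],
)

# Because every row follows the same schema, a query can filter by column.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lisbon",)
).fetchall()
conn.close()

print(rows)
```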

70
Q

Structured Data: Examples

A

Relational Database: Data stored in tables within a database, such as customer information in a CRM system.

Spreadsheets: Data organized in rows and columns in applications like Microsoft Excel or Google Sheets.

Sensor Data: Time-stamped readings from IoT devices that are stored in a structured format.

71
Q

Structured Data: Advantages

A

Efficiency: Easy to enter, store, query, and analyze.

Consistency: Structured format ensures data integrity and consistency.

Automation: Can be processed using automated tools and algorithms.

72
Q

Structured Data: Disadvantages

A

Flexibility: Limited flexibility in terms of the types of data that can be stored.

Scalability: May become complex and less efficient with very large datasets.

73
Q

Unstructured Data

A

Information that doesn’t have a predefined format or organization. It is more complex to analyze and search because it doesn’t fit neatly into tables of rows and columns.

74
Q

Unstructured Data: Characteristics

A

Format: Lacks a predefined structure; can come in various formats such as text, images, audio, and video.

Schema: Does not follow a predefined schema or model.

Complexity: Requires advanced tools and techniques for analysis and processing.

75
Q

Unstructured Data: Examples

A

Text Documents: Word documents, PDF files, and text files.

Multimedia: Images, videos, and audio recordings.

Emails and Social Media: Messages, posts, tweets, and comments.

Web Pages: HTML pages and web content.

76
Q

Unstructured Data: Advantages

A

Richness: Can capture a wide variety of information, providing richer insights.

Flexibility: No need for a predefined structure, allowing for more diverse data types.

77
Q

Unstructured Data: Disadvantages

A

Analysis: More challenging to analyze and process; requires advanced techniques such as natural language processing (NLP) and machine learning.

Storage: Often requires more storage space and sophisticated management.

78
Q

Data Quality

A

The degree to which data serves its intended purpose. It refers to the degree of:
- Timeliness
- Accuracy
- Completeness
- Reliability
- Consistency

79
Q

Factors that Drive Data Quality

A
  • Quality of Measuring Devices, Questionnaires Used, and Approaches to Data Collection
  • Clarity of the Information Needed and Its Communication to Personnel Involved in Data Collection
  • Elimination of Outliers and Non-Representative Data
  • Use of Appropriate Formats
  • The Expertise of Individuals Involved in Data Collection
  • Willingness to Provide Data to Concerned Parties
80
Q

Steps to Minimize Poor Data Collection

A
  • Allocate adequate time for data collection
  • Track and eliminate outliers
  • Train data collection personnel
  • Pretest and evaluate questionnaires
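
For the outlier-tracking step, one common convention (an assumption here, not a rule stated in this deck) is to flag values more than 1.5 times the interquartile range beyond the quartiles. A minimal Python sketch with invented `readings`:

```python
import statistics

# Hypothetical sensor readings with one suspicious value (illustrative only).
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 35.0, 20.2]

# Fences at 1.5 times the IQR beyond the first and third quartiles.
q1, _, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Track flagged values separately instead of deleting them silently.
outliers = [x for x in readings if x < low or x > high]
kept = [x for x in readings if low <= x <= high]

print(outliers)
print(kept)
```

Keeping the flagged values in their own list makes it possible to review them before deciding whether they are errors or genuine extreme observations.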