Intro to Statistics Flashcards
Numerical Data
Also known as quantitative data. Consists of numbers and can be divided into two types: discrete and continuous.
Discrete data
Type of numerical or quantitative data. Includes countable values, often integers.
Examples: Number of patients in a hospital, number of defective products in a batch, number of books on a shelf.
Characteristics: Discrete data often arise from counting.
Continuous Data
Includes measurable values that can take any value within a given range.
Examples: Height, weight, temperature, time, distance.
Characteristics: Continuous data often arise from measurements and can have an infinite number of possible values within a range.
Categorical Data
Also known as qualitative data. Consists of categories or groups. It can be divided into two types: nominal and ordinal.
Nominal Data
Represents categories that do not have a specific order or ranking.
Examples: Gender, blood type, types of animals.
Characteristics: Nominal data are purely labels and do not imply any sort of order.
Ordinal Data
Represents categories that have a specific order or ranking.
Examples: Education level, satisfaction rating, severity of pain.
Characteristics: Ordinal data indicate a meaningful order among categories, but the intervals between categories are not necessarily equal.
Key Differences of Numerical and Categorical Data: Nature of Data
Numerical: Involves numbers and quantifiable data
Categorical: Involves categories and qualitative distinctions
Key Differences of Numerical and Categorical Data: Subtypes
Numerical: Discrete and Continuous.
Categorical: Nominal and Ordinal.
Key Differences of Numerical and Categorical Data: Operations
Numerical: Can perform mathematical operations (addition, subtraction, etc.)
Categorical: Typically summarized by counts and proportions.
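Illustration: A minimal Python sketch of the difference in operations, using small hypothetical lists (`ages` and `blood_types` are invented for this card). Arithmetic is applied to the numerical values, while the categorical values are only counted and turned into proportions.

```python
from collections import Counter
from statistics import mean

# Hypothetical data, invented for illustration only.
ages = [34, 45, 29, 61, 45]                # numerical (discrete)
blood_types = ["A", "O", "O", "B", "O"]    # categorical (nominal)

# Numerical data supports arithmetic such as sums and averages.
mean_age = mean(ages)

# Categorical data is summarized by counts and proportions instead.
counts = Counter(blood_types)
proportions = {label: n / len(blood_types) for label, n in counts.items()}

print("Mean age:", mean_age)
print("Counts:", dict(counts))
print("Proportions:", proportions)
```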
Population
The complete set of all possible observations or data points.
Examples: All patients in a hospital, every student in a school, every product produced by a factory.
Use: Populations are used when the goal is to understand or make statements about the entire group.
Parameters: Characteristics of a population (such as mean, standard deviation).
Sample
A portion of the population selected for analysis.
Examples: A group of 100 patients from the hospital, 50 students from the school, 200 products from the factory.
Use: Samples are used to make estimates or test hypotheses about the population.
Statistics: Characteristics of a sample (such as a sample mean, sample standard deviation)
Key Differences of Population vs. Sample: Scope
Population: Includes all members of a specified group.
Sample: Includes a part of the population.
Key Differences of Population vs. Sample: Size
Population: Generally large or infinite.
Sample: Manageable or finite.
Key Differences of Population vs. Sample: Notation
Population: Parameters often denoted by Greek letters (e.g., μ for mean, σ for standard deviation).
Sample: Statistics often denoted by Latin letters (e.g., x̄ for mean, s for standard deviation).
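Illustration: A small Python sketch of the notation, assuming a tiny hypothetical population of eight values and a hand-picked sample of four. The variable names `mu`, `sigma`, `x_bar`, and `s` mirror the symbols above. The population standard deviation divides by N, while the sample version divides by n - 1.

```python
from statistics import mean, pstdev, stdev

# Hypothetical values, invented for illustration only.
population = [2, 4, 4, 4, 5, 5, 7, 9]
sample = [2, 4, 5, 9]  # a subset of the population

# Parameters describe the whole population.
mu = mean(population)
sigma = pstdev(population)   # population formula: divides by N

# Statistics describe the sample and estimate the parameters.
x_bar = mean(sample)
s = stdev(sample)            # sample formula: divides by n - 1

print("mu =", mu, "sigma =", sigma)
print("x_bar =", x_bar, "s =", round(s, 3))
```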
Why use Samples?
Practicality: Studying an entire population can be time-consuming, expensive, and logistically challenging
Feasibility: Sometimes it’s impossible to access every member of a population.
Efficiency: Properly selected samples can provide accurate and reliable insights about the population
Sampling Methods
Random Sampling: Every member of the population has an equal chance of being selected. This helps ensure the sample is representative of the population.
Stratified Sampling: The population is divided into subgroups (strata) based on specific characteristics, and samples are taken from each stratum.
Systematic Sampling: Every nth member of the population is selected.
Convenience Sampling: Samples are selected based on ease of access. This method is less reliable but often used for exploratory research.
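Illustration: The four methods above can be sketched in Python on a hypothetical population of 100 patient records. The record layout, the ward labels, the 10% sampling fraction, and the proportional allocation across strata are all assumptions made for this example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 patient records: (id, ward).
population = [(i, "A" if i < 60 else "B") for i in range(100)]

# Random sampling: every member has an equal chance of selection.
random_sample = rng.sample(population, 10)

# Stratified sampling: split by ward, then sample within each stratum
# (assumed here: 10% of each stratum, i.e. proportional allocation).
strata = {}
for record in population:
    strata.setdefault(record[1], []).append(record)
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, len(members) // 10))

# Systematic sampling: every nth member after a random starting point.
step = 10
start = rng.randrange(step)
systematic_sample = population[start::step]

# Convenience sampling: whatever is easiest to reach, here the first 10.
convenience_sample = population[:10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```

Note that the convenience sample contains only ward A records, which shows why this method can be unrepresentative.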
Importance of Sampling in Research
Using a sample to infer about a population allows researchers to draw conclusions and make predictions without needing to collect data from every member of the population. However, it is crucial that the sample is representative of the population to ensure the validity and reliability of the inferences made.
Descriptive Statistics
Summarize and organize data to make it easily understandable. These statistics provide simple summaries about the sample and measures. They form the basis of virtually every quantitative analysis of data.
Example: In a study of patients’ blood pressure readings, this type of statistics might report the average (mean) blood pressure, the most common (mode) reading, and the range of readings.
Summary: Focus on summarizing and describing the features of a dataset. Useful for getting a clear picture of the data at hand.
Inferential Statistics
Make inferences about a population using a sample of data drawn from it. Instead of summarizing the data itself, inferential statistics help make predictions or generalizations about a population based on a sample of data.
Example: In a clinical trial, this type of statistics might be used to determine whether a new medication significantly lowers blood pressure compared to a placebo, using data from a sample of patients.
Summary: Focus on making generalizations from a sample to a population. Useful for hypothesis testing and estimating population parameters.
Predictive Statistics (aka Predictive Analytics)
Use statistical models and machine learning techniques to predict future events or outcomes based on historical data. These statistics are often used in data mining, business forecasting, and machine learning applications.
Example: In healthcare, predictive statistics might be used to predict the likelihood of a patient developing a certain disease based on their medical history and other risk factors.
Summary: Focus on using models to predict future outcomes based on past data. Useful for forecasting and making informed decisions.
Characteristics of Descriptive Statistics: Purpose
To describe and summarize the main features of a dataset.
Characteristics of Inferential Statistics: Purpose
To infer conclusions about a population based on a sample.
Characteristics of Predictive Statistics: Purpose
To make predictions about future outcomes based on patterns in historical data.
Characteristics of Descriptive Statistics: Methods
Include measures of central tendency and measures of variability or spread.
Characteristics of Inferential Statistics: Methods
Use probability theory to estimate population parameters, test hypotheses, and make predictions.
Characteristics of Predictive Statistics: Methods
Utilize statistical models and algorithms to forecast future events.
Characteristics of Descriptive Statistics: Common Techniques
Central Tendency: Mean, median, mode
Dispersion: Range, variance, standard deviation, interquartile range
Data Visualization: Charts, graphs, tables
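Illustration: A minimal sketch of the central tendency and dispersion measures listed above, using Python's standard `statistics` module on seven hypothetical blood pressure readings. The sample formulas (dividing by n - 1) and the default quartile method of `statistics.quantiles` are assumed here; other conventions give slightly different quartiles.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

# Hypothetical systolic blood pressure readings (mmHg).
readings = [118, 122, 122, 125, 130, 135, 141]

# Central tendency
center = {
    "mean": mean(readings),
    "median": median(readings),
    "mode": mode(readings),
}

# Dispersion
data_range = max(readings) - min(readings)
q1, _, q3 = quantiles(readings, n=4)   # quartile cut points
iqr = q3 - q1
spread = {
    "range": data_range,
    "variance": variance(readings),    # sample variance
    "std_dev": stdev(readings),        # sample standard deviation
    "iqr": iqr,
}

print(center)
print(spread)
```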
Characteristics of Inferential Statistics: Common Techniques
Hypothesis Testing: T-tests, chi-square tests, ANOVA, regression analysis.
Confidence Intervals: Range within which a population parameter is expected to lie.
Sampling Methods: Random sampling, stratified sampling, etc.
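Illustration: One way to sketch a confidence interval for a mean in Python. This version assumes a normal approximation (a z-based interval) and a hypothetical sample of twelve readings. For a sample this small a t-based interval is the more common choice and would be somewhat wider.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of blood pressure readings (mmHg).
sample = [128, 131, 119, 125, 140, 122, 135, 127, 130, 124, 133, 126]

n = len(sample)
x_bar = mean(sample)              # sample mean
s = stdev(sample)                 # sample standard deviation
standard_error = s / sqrt(n)

# Critical value for a 95% interval under the normal approximation.
z = NormalDist().inv_cdf(0.975)

lower = x_bar - z * standard_error
upper = x_bar + z * standard_error

print(f"95% CI for the mean: ({lower:.1f}, {upper:.1f})")
```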
Characteristics of Predictive Statistics: Common Techniques
Regression Analysis: Linear regression, logistic regression
Machine Learning Models: Decision trees, neural networks, support vector machines
Time Series Analysis: ARIMA, exponential smoothing.
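Illustration: A minimal sketch of simple linear regression used for prediction, written out by hand with ordinary least squares. The data points and the `predict` helper are hypothetical; real applications would typically use a dedicated library and check model fit.

```python
from statistics import mean

# Hypothetical historical data: x is the period, y is the observed value.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_mean, y_mean = mean(x), mean(y)

# Ordinary least squares estimates for the line y = intercept + slope * x.
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean


def predict(new_x):
    """Predict y for a new x using the fitted line (hypothetical helper)."""
    return intercept + slope * new_x


print("slope:", slope, "intercept:", intercept)
print("prediction for period 6:", predict(6))
```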
Example of Descriptive Statistics
In a study of patients’ blood pressure readings, descriptive statistics might report the average (mean) blood pressure, the most common (mode) reading, and the range of readings.
Example of Predictive Statistics
In healthcare, predictive statistics might be used to predict the likelihood of a patient developing a certain disease based on their medical history and other risk factors.
Example of Inferential Statistics
In a clinical trial, inferential statistics might be used to determine whether a new medication significantly lowers blood pressure compared to a placebo, using data from a sample of patients.
Raw Data
Also known as primary data. Data that has not been altered, cleaned, or processed after it has been collected. It is the original, unmodified data gathered directly from a source.
Summary: Unprocessed and original, requiring cleaning and organizing.
Processed Data
Data that has been cleaned, transformed, and organized to make it suitable for analysis. Involves steps like validation, sorting, aggregation, and normalization.
Summary: Cleaned and transformed, ready for analysis.
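Illustration: A small Python sketch of turning raw data into processed data through the steps named above (validation, sorting, aggregation, normalization). The raw sensor strings are hypothetical, and min-max scaling is only one of several possible normalization choices.

```python
# Hypothetical raw sensor readings, stored as text with some bad entries.
raw_readings = ["36.6", "37.1", "", "error", "38.4", "36.9"]

# Validation: keep only entries that parse as numbers.
valid = []
for entry in raw_readings:
    try:
        valid.append(float(entry))
    except ValueError:
        continue  # drop blank or malformed entries

# Sorting: put the cleaned values in order.
ordered = sorted(valid)

# Aggregation: reduce the values to a single summary number.
average = sum(ordered) / len(ordered)

# Normalization (assumed: min-max scaling to the range 0 to 1).
low, high = ordered[0], ordered[-1]
normalized = [(value - low) / (high - low) for value in ordered]

print("valid:", valid)
print("average:", round(average, 2))
print("normalized:", [round(v, 2) for v in normalized])
```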
Raw Data: Characteristics
Unprocessed: Data in its original form, without any modifications.
Detailed: Contains all the collected details, including potential errors and irrelevant information.
Unorganized: Often unstructured and may need cleaning or organizing before analysis.
Raw Data: Examples
Responses from survey participants.
Sensor readings from scientific instruments.
Transaction logs from an online store.
Raw Data: Advantages
Authentic: Provides a true representation of the collected information.
Comprehensive: Contains all the nuances and details of the original data.
Raw Data: Disadvantages
Cumbersome: Can be large and unwieldy, requiring significant preprocessing.
Error-Prone: Likely to contain errors, outliers, and irrelevant information.
Processed Data: Characteristics
Organized: Data is structured and ready for analysis.
Cleaned: Errors and irrelevant information are removed.
Transformed: Data may be aggregated, summarized, or converted into a different format.
Processed Data: Examples
Average scores of survey responses.
Monthly sales reports.
Cleaned and formatted datasets used in machine learning models.
Processed Data: Advantages
Usable: Ready for analysis, visualization, and interpretation.
Accurate: Errors and irrelevant information have been removed.
Processed Data: Disadvantages
Abstracted: Some details of the raw data may be lost in the processing.
Dependent on Method: Quality and usefulness depend on the processing methods used.
Primary Data
Data collected directly from the source for a specific research purpose. It is original data gathered firsthand by the researcher.
Summary: Collected firsthand for a specific research purpose, used by the researcher for new analysis.
Primary Data: Characteristics
Original: Collected directly from the source by the researcher.
Specific Purpose: Gathered for a specific research question or objective.
Control: Researcher has control over the data collection process.
Primary Data: Examples
Data collected from experiments or clinical trials.
Surveys and questionnaires filled out by participants.
Observations recorded by researchers in the field.
Primary Data: Advantages
Relevant: Specifically collected for the research purpose.
Current: Data is up-to-date and reflects current conditions.
Primary Data: Disadvantages
Time-Consuming: Can take a lot of time to collect.
Costly: Often requires significant resources to gather.
Secondary Data
Data that was collected by someone else for a different purpose but is being used by a researcher for a new analysis. It is not collected firsthand by the current researcher.
Summary: Collected by someone else for a different purpose, used by the researcher for new analysis.
Secondary Data: Characteristics
Pre-Existing: Already collected and available for use.
Broad Purpose: Originally gathered for a different research question or objective.
Accessibility: Researcher uses data collected by others
Secondary Data: Examples
Data from government reports and censuses.
Research articles and academic journals
Historical records and archived data.
Secondary Data: Advantages
Time-Saving: Readily available, saving time on data collection.
Cost-Effective: Often free or less expensive than collecting primary data.
Secondary Data: Disadvantages
Less Control: Researcher has no control over how the data was collected.
Potentially Outdated: May not reflect the current conditions or context.
Cross-Sectional Data
Is collected at a single point in time from multiple subjects or entities. This type of data provides a snapshot of a particular phenomenon at a specific moment.
Summary: Snapshot of multiple subjects at a single time point.
Cross-Sectional Data: Characteristics
Single Time Point: Data is collected at one specific point in time.
Multiple Subjects: Includes data from different entities, such as individuals, companies, or countries.
Snapshot: Provides a snapshot view of the situation at the moment.
Cross-Sectional Data: Examples
A survey conducted on a group of people to gather their opinions on a specific topic at a certain point in time.
The demographic information of a population collected in a census.
The financial statements of different companies for a particular fiscal year.
Cross-Sectional Data: Advantages
Quick to Collect: Since data is collected at one point in time, it is often faster to gather.
Simple Analysis: Often simpler to analyze compared to time series or panel data.
Cross-Sectional Data: Disadvantages
Limited Insight: Does not provide information about changes over time.
Snapshot View: Only provides a single view of the phenomenon, potentially missing temporal dynamics.
Time Series Data
Data collected over time, capturing how a particular variable changes at different time points. This type of data helps in understanding trends, patterns, and forecasting future values.
Summary: Data on a single subject collected at multiple time points.
Time Series Data: Characteristics
Multiple Time Points: Data is collected at regular intervals over a period.
Single Subject: Usually focuses on one subject or entity.
Temporal Order: The order of data points is important as it represents changes over time.
Time Series Data: Examples
Daily stock prices of a company over a year.
Monthly unemployment rates over several years.
Annual rainfall data for a specific region over decades.
Time Series Data: Advantages
Trend Analysis: Useful for identifying trends, cycles, and seasonal patterns.
Forecasting: Can be used to make predictions about future values.
Time Series Data: Disadvantages
Complex Analysis: Requires more sophisticated statistical techniques.
Time-Consuming: Collecting data over a long period can be time-consuming.
Panel Data
AKA longitudinal data, combines elements of both cross-sectional and time series data. It consists of multiple subjects measured repeatedly over time.
Summary: Combines cross-sectional and time series data, tracking multiple subjects over multiple time points.
Panel Data: Characteristics
Multiple Subjects and Time Points: Data is collected from several subjects over multiple time periods.
Rich Information: Provides both cross-sectional and temporal insights.
Complex Structure: Each subject has its own time series of data points.
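Illustration: A minimal Python sketch of how panel data relates to cross-sectional and time series data. The household names and income figures are hypothetical. Each subject holds its own time series, and slicing the panel one way or the other recovers the two simpler data types.

```python
# Hypothetical annual household income (in thousands), by household and year.
panel = {
    "household_1": {2021: 50, 2022: 52, 2023: 55},
    "household_2": {2021: 40, 2022: 41, 2023: 45},
    "household_3": {2021: 65, 2022: 63, 2023: 70},
}

# Cross-sectional slice: every subject at a single time point.
cross_section_2022 = {name: series[2022] for name, series in panel.items()}

# Time series slice: a single subject across all time points.
time_series_h1 = panel["household_1"]

# Panel data also allows within-subject change over time to be compared.
growth = {name: series[2023] - series[2021] for name, series in panel.items()}

print("2022 cross-section:", cross_section_2022)
print("household_1 series:", time_series_h1)
print("2021 to 2023 change:", growth)
```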
Panel Data: Examples
Annual income and expenditure data for a sample of households over several years.
Health records of patients measured at regular intervals over time.
Employee performance metrics tracked quarterly over several years.
Panel Data: Advantages
Comprehensive Analysis: Allows for the study of dynamics and causal relationships.
Control for Variability: Can control for individual heterogeneity, improving the robustness of statistical analysis.
Panel Data: Disadvantages
Data Collection: Collecting panel data can be challenging and resource-intensive.
Complexity: Analyzing panel data often requires advanced statistical methods and software.
Structured Data
Data that is organized and formatted in a way that makes it easily searchable and analyzable by traditional databases and tools. It follows a predefined schema and is usually stored in tabular form with rows and columns.
Structured Data: Characteristics
Format: Organized in a clear, predictable structure, often in tables.
Schema: Follows a predefined schema or model.
Ease of Access: Easily searchable and analyzable using traditional database systems like SQL
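Illustration: A short sketch of structured data using Python's built-in `sqlite3` module. The `customers` table, its columns, and the rows are hypothetical. The schema is declared up front, and the data can then be searched with an ordinary SQL query.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Predefined schema: every row must fit these columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)

# Hypothetical rows, invented for illustration only.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lisbon"), ("Ben", "Oslo"), ("Cara", "Lisbon")],
)

# Because the structure is known, searching is a one-line query.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lisbon",)
).fetchall()
conn.close()

print(rows)
```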
Structured Data: Examples
Relational Database: Data stored in tables within a database, such as customer information in a CRM system.
Spreadsheets: Data organized in rows and columns in applications like Microsoft Excel or Google Sheets.
Sensor Data: Time-stamped readings from IoT devices that are stored in a structured format.
Structured Data: Advantages
Efficiency: Easy to enter, store, query, and analyze.
Consistency: Structured format ensures data integrity and consistency.
Automation: Can be processed using automated tools and algorithms.
Structured Data: Disadvantages
Flexibility: Limited flexibility in terms of the types of data that can be stored.
Scalability: May become complex and less efficient with very large datasets.
Unstructured Data
Information that doesn’t have a predefined format or organization. It is more complex to analyze and search because it doesn’t fit neatly into tables of rows and columns.
Unstructured Data: Characteristics
Format: Lacks a predefined structure; can come in various formats such as text, images, audio, and video.
Schema: Does not follow a predefined schema or model.
Complexity: Requires advanced tools and techniques for analysis and processing.
Unstructured Data: Examples
Text Documents: Word documents, PDF files, and text files.
Multimedia: Images, videos, and audio recordings.
Emails and Social Media: Messages, posts, tweets, and comments.
Web Pages: HTML pages and web content.
Unstructured Data: Advantages
Richness: Can capture a wide variety of information, providing richer insights.
Flexibility: No need for a predefined structure, allowing for more diverse data types.
Unstructured Data: Disadvantages
Analysis: More challenging to analyze and process; requires advanced techniques such as natural language processing (NLP) and machine learning.
Storage: Often requires more storage space and sophisticated management.
Data Quality
The degree to which data serves its intended purpose. It refers to the degree of:
- Timeliness
- Accuracy
- Completeness
- Reliability
- Consistency
Factors that Drive Data Quality
- Quality of Measuring Devices, Questionnaires Used, and Approaches to Data Collection
- Clarity of the Information Needed and Its Communication to Personnel Involved in Data Collection
- Elimination of Outliers and Non-Representative Data
- Use of Appropriate Formats
- The Expertise of Individuals Involved in Data Collection
- Willingness to Provide Data to Concerned Parties
Steps to Minimize Poor Data Collection
- Allocate adequate time for data collection
- Track and eliminate outliers
- Train data collection personnel
- Pretest and evaluate questionnaires
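Illustration: One possible way to track outliers, sketched in Python. The 1.5 × IQR rule used here is an assumption (the cards above do not name a specific method), and the readings are hypothetical. Flagged values should be reviewed before removal, since an extreme value is not always an error.

```python
from statistics import quantiles

# Hypothetical readings; the last value looks like a recording error.
readings = [118, 122, 122, 125, 130, 135, 141, 300]

# Quartiles and interquartile range (default method of statistics.quantiles).
q1, _, q3 = quantiles(readings, n=4)
iqr = q3 - q1

# Assumed rule: flag anything more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [r for r in readings if r < lower_fence or r > upper_fence]
kept = [r for r in readings if lower_fence <= r <= upper_fence]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
print("kept:", kept)
```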