Good to know Flashcards
Define the term Data Mining briefly
Data Mining is the process of discovering patterns, correlations, and insights from large datasets using techniques from statistics, machine learning, and database systems. It focuses on extracting useful information and making it actionable.
Define the term Data Science briefly
Data Science is an interdisciplinary field that combines domain expertise, programming, statistics, and data analysis to derive meaningful insights and solutions from data. It involves data preparation, modeling, visualization, and communication of results.
Define the term Big Data briefly
Big Data refers to extremely large datasets that cannot be easily managed, processed, or analyzed using traditional data-processing tools. These datasets are characterized by their volume, velocity, variety, and veracity (the “4 Vs”).
Briefly describe the six phases of the CRISP-DM (Cross-Industry Standard Process for Data Mining) process
Business Understanding:
Define the project objectives and requirements from a business perspective. Identify the key questions and the success criteria.
Data Understanding:
Collect the initial data and examine it to understand its structure, quality, and relevance. Identify potential data quality issues or patterns.
Data Preparation:
Clean, format, and transform the data into a usable form. This phase includes handling missing values, feature selection, and creating new variables if necessary.
Modeling:
Apply various modeling techniques (e.g., regression, classification, clustering). Select and optimize algorithms based on the data and business objectives.
Evaluation:
Assess the models to ensure they meet the project goals. Compare results against business objectives and validate the model’s accuracy and utility.
Deployment:
Implement the results into real-world operations, such as decision-making processes or automated systems. Create reports or dashboards to share findings with stakeholders.
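As a rough illustration of the Data Preparation phase only, the sketch below fills in missing values and creates one new variable. The records, field names, and helper functions are hypothetical, and mean imputation is just one of several possible ways to handle missing values.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"customer_id": 1, "age": 34, "purchase_amount": 120.0, "visits": 4},
    {"customer_id": 2, "age": None, "purchase_amount": 80.0, "visits": 2},
    {"customer_id": 3, "age": 46, "purchase_amount": None, "visits": 5},
]


def impute_mean(records, field):
    """Replace missing values in `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


def add_spend_per_visit(records):
    """Create a new variable: purchase amount divided by number of visits."""
    return [
        {**r, "spend_per_visit": r["purchase_amount"] / r["visits"]}
        for r in records
    ]


# Handle missing values first, then derive the new variable.
prepared = impute_mean(raw_records, "age")
prepared = impute_mean(prepared, "purchase_amount")
prepared = add_spend_per_visit(prepared)

for row in prepared:
    print(row)
```

In a real project the choice of imputation strategy would itself come out of the Data Understanding phase, so this should be read as a sketch rather than a recommended default.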
Define “data” in the context of data mining
In the context of data mining, data refers to a collection of raw facts, measurements, or observations that can be structured (like databases), semi-structured (like JSON or XML files), or unstructured (like text, images, or videos). This data serves as the foundation for analysis and is used to extract patterns, insights, and useful knowledge.
Key characteristics of data in data mining include:
Volume: The size of the dataset.
Variety: The different types and formats of data.
Quality: The completeness, accuracy, and relevance of the data.
Context: The domain or setting from which the data originates, which influences its interpretation.
Describe the four main scales of measurement in data and provide an example for each
Nominal Scale - Categorizes data without any order or ranking. The values are simply labels or names.
- Characteristics - No numerical or quantitative significance; only categories.
- Examples - Gender (Male, Female, Other); types of fruit (Apple, Banana, Orange).
Ordinal Scale - Categorizes data with a meaningful order or rank, but the intervals between ranks are not uniform or measurable.
- Characteristics - Shows relative positioning, but differences between values are not meaningful.
- Examples - Customer satisfaction (Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied); education level (High School, Bachelor’s, Master’s, PhD).
Interval Scale - Represents data with ordered categories where the intervals between values are meaningful and equal, but there is no true zero point.
- Characteristics - Allows for meaningful addition and subtraction, but not ratios.
- Examples - Temperature in Celsius or Fahrenheit (e.g., 20°C, 30°C, 40°C); time of day on a 24-hour clock (e.g., 10:00, 14:00).
Ratio Scale - Contains all properties of the interval scale, but also has a true zero point, allowing for meaningful ratios.
- Characteristics - Enables all mathematical operations, including multiplication and division.
- Examples - Weight (e.g., 50 kg, 100 kg); distance (e.g., 5 km, 10 km).
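One way to remember the four scales is by which summary operations make sense on each. The sketch below uses made-up sample values and only the Python standard library; the variable names are illustrative, not part of any standard.

```python
from statistics import median_low, mode

# Nominal: labels only, so counting and the mode are meaningful.
fruits = ["Apple", "Banana", "Apple", "Orange"]
most_common_fruit = mode(fruits)

# Ordinal: the order is meaningful, so a median makes sense after ranking.
levels = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rank = {label: i for i, label in enumerate(levels)}
responses = ["Satisfied", "Neutral", "Very Satisfied"]
median_response = levels[median_low([rank[r] for r in responses])]

# Interval: differences are meaningful, ratios are not (0 °C is not "no temperature").
low_c, high_c = 20.0, 40.0
difference_c = high_c - low_c
# Converting to Kelvin (a ratio scale) shows 40 °C is not "twice as hot" as 20 °C.
kelvin_ratio = (high_c + 273.15) / (low_c + 273.15)

# Ratio: a true zero exists, so ratios are meaningful.
weight_ratio = 100 / 50

print(most_common_fruit, median_response, difference_c, round(kelvin_ratio, 3), weight_ratio)
```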
Explain the difference between continuous and discrete data, providing an example of each
Continuous Data
- Refers to data that can take any value within a given range, including fractions and decimals.
- It is measured on a scale and represents quantities.
- Continuous data can be infinitely divided into smaller parts.
Example
- Height of a person (e.g., 5.8 feet).
- Temperature in Celsius (e.g., 22.5°C).
Discrete Data
- Refers to data that consists of distinct, separate values and cannot take fractional values.
- It is counted, not measured, and represents categories or whole numbers.
Example
- Number of students in a class (e.g., 30 students).
- Rolls of a die (e.g., outcomes are 1, 2, 3, 4, 5, 6).
Difference - Continuous data represents measurements and is infinitely divisible, while discrete data represents counts or categories and consists of distinct values.
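The distinction matters in practice when summarizing data: discrete values can be counted directly, while continuous values usually need to be grouped into intervals first. The following is a minimal sketch with invented sample values; `bin_values` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Discrete: exact values repeat, so counting each outcome is meaningful.
die_rolls = [1, 3, 3, 6, 2, 3]
roll_counts = Counter(die_rolls)

# Continuous: exact values rarely repeat, so group them into intervals (bins).
heights_ft = [5.1, 5.8, 5.85, 6.2, 5.4]


def bin_values(values, edges):
    """Count how many values fall in each half-open interval [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


height_bins = bin_values(heights_ft, [5.0, 5.5, 6.0, 6.5])

print(roll_counts)
print(height_bins)
```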
Identify and explain two different data types commonly used in data mining, with an example for each
Structured Data
- Structured data is organized and formatted in a way that makes it easily searchable in relational databases and spreadsheets. It consists of rows and columns, where each column represents a specific attribute, and each row is a record.
- Follows a predefined schema.
- Typically numeric or categorical.
- Example: A customer database with columns for Customer ID, Name, Age, and Purchase Amount.
Unstructured Data
- Unstructured data is not organized in a predefined format or schema. It is often text-heavy but may include images, videos, and other non-tabular data.
- Lacks a defined model or schema.
- Requires more complex processing (e.g., Natural Language Processing, image recognition).
- Example: Social media posts, such as tweets or Facebook comments, where the content, hashtags, and user reactions are free-form.
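To make the contrast concrete, the sketch below queries a small structured table directly by column name, then shows that an unstructured post first needs some text processing before anything can be extracted. All records and the sample post are invented, and the regular expressions are a deliberately simple stand-in for real Natural Language Processing.

```python
import re

# Structured: every record follows the same schema, so columns can be queried directly.
customers = [
    {"customer_id": 1, "name": "Ada", "age": 36, "purchase_amount": 120.0},
    {"customer_id": 2, "name": "Ben", "age": 29, "purchase_amount": 45.5},
]
total_over_30 = sum(c["purchase_amount"] for c in customers if c["age"] > 30)

# Unstructured: free-form text has no schema, so structure must be extracted first.
post = "Loving the new phone! #tech #gadgets Battery life could be better though #tech"
hashtags = re.findall(r"#(\w+)", post)        # pull out hashtag names
words = re.findall(r"[a-z']+", post.lower())  # crude lowercase tokenization

print(total_over_30)
print(hashtags)
print(words[:5])
```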
Define supervised and unsupervised learning
Supervised learning
Supervised learning is a type of machine learning where the model is trained on labeled data. The data includes input-output pairs, and the goal is to learn a mapping function from inputs to outputs.
Key Characteristics - The output (or target variable) is known during training. It is used for tasks like classification (e.g., predicting categories) and regression (e.g., predicting continuous values).
Example - Predicting house prices based on features like size, location, and number of rooms (where the training data includes house prices as labels).
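A minimal sketch of the house-price idea, assuming a single feature (size) and a straight-line model fitted by ordinary least squares. The data points are invented and `fit_line` is a hypothetical helper; real projects would normally use more features and a library implementation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Labeled training data: inputs (size in square metres) paired with known outputs (price).
sizes_m2 = [50, 80, 100, 120]
prices_k = [150, 240, 300, 360]  # invented prices, in thousands

slope, intercept = fit_line(sizes_m2, prices_k)

# Use the learned mapping to predict the price of an unseen 90 m2 house.
predicted_price_k = slope * 90 + intercept
print(slope, intercept, predicted_price_k)
```

The key point is that the known prices (the labels) drive the fit; without them there would be nothing to learn the mapping from.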
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model is trained on data without labeled outputs. The goal is to discover hidden patterns or structures in the data.
Key Characteristics - There are no predefined labels or target variables. It is used for tasks like clustering (e.g., grouping similar items) and dimensionality reduction (e.g., reducing data complexity).
Example - Grouping customers based on purchasing behavior for targeted marketing (clustering).
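The customer-grouping example can be sketched with a very small k-means clustering on one feature. The spend values and starting centers are invented, `kmeans_1d` is a hypothetical helper, and k-means is only one of many clustering methods that could be used here.

```python
def kmeans_1d(values, centers, iterations=10):
    """Simple k-means on one-dimensional data, starting from the given centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled data: monthly spend per customer, with no target variable.
monthly_spend = [20, 25, 30, 200, 220, 210]

# Assumption: start from the smallest and largest values as initial centers.
centers, clusters = kmeans_1d(monthly_spend, [20, 220])
print(centers)
print(clusters)
```

No labels are supplied anywhere; the low-spend and high-spend groups emerge from the data alone, which is what distinguishes this from the supervised case.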