1. From Data Analysis to Data Mining Flashcards
What is the first step in the data mining process?
Data Collection: gathering raw data from various sources.
What is the purpose of Data Cleaning?
To remove noise and correct inconsistencies in the data.
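A minimal cleaning sketch using only the Python standard library. The records, the field names, and the valid age range are hypothetical and chosen purely for illustration.

```python
# Hypothetical raw records: inconsistent casing/whitespace and one implausible age.
raw = [
    {"city": " Lisbon ", "age": 34},
    {"city": "lisbon", "age": 29},
    {"city": "PORTO", "age": 410},  # noise: implausible value
]

def clean(records, min_age=0, max_age=120):
    """Normalize city labels and drop records whose age is out of range."""
    cleaned = []
    for rec in records:
        if not (min_age <= rec["age"] <= max_age):
            continue  # treat out-of-range ages as noise (an assumption of this sketch)
        cleaned.append({"city": rec["city"].strip().title(), "age": rec["age"]})
    return cleaned

cleaned = clean(raw)
print(cleaned)
```

Here the inconsistency (two spellings of the same city) is corrected, while the noisy record is dropped; in a real project the choice between correcting and dropping depends on the domain.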
What does Data Integration involve?
Combining multiple data sources into a cohesive dataset.
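A small integration sketch, assuming two hypothetical sources (customers and orders) that share a customer id. It performs a simple inner join in plain Python; a library such as pandas would normally be used for this on real data.

```python
# Two hypothetical sources that share a customer identifier.
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Rui"}]
orders = [
    {"customer_id": 1, "total": 20.0},
    {"customer_id": 1, "total": 5.5},
    {"customer_id": 3, "total": 9.0},  # no matching customer
]

def integrate(customers, orders):
    """Inner-join orders to customers on the customer id."""
    by_id = {c["id"]: c for c in customers}
    merged = []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is not None:  # skip orders without a matching customer
            merged.append(
                {"id": customer["id"], "name": customer["name"], "total": order["total"]}
            )
    return merged

merged = integrate(customers, orders)
print(merged)
```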
What is Data Selection?
Retrieving relevant data for analysis from databases.
What is the goal of Data Transformation?
To modify and consolidate data into appropriate formats for analysis.
What is Data Mining?
Applying intelligent methods to extract patterns from the data.
What is Pattern Evaluation?
Identifying significant patterns that represent knowledge based on specific measures.
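As an illustration of such measures, support and confidence are two commonly used ones for association rules. They are offered here only as an example and are not necessarily the measures these cards have in mind; the transactions are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

s = support({"bread", "milk"}, transactions)       # 2 of 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 bread transactions
print(s, c)
```

A rule would then count as significant only if both values exceed thresholds chosen by the analyst.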
What does Knowledge Presentation involve?
Using visualization and representation techniques to present the mined knowledge to users.
What is the iterative nature of the data mining process?
It often requires revisiting previous steps for refinement.
Why is domain knowledge important in data mining?
It helps in understanding the context and relevance of the data.
What role does feature selection play in data mining?
It identifies the most relevant variables for analysis.
What is the significance of data visualization in data mining?
It helps in interpreting complex data patterns and results.
How does data mining contribute to decision-making?
By providing insights and patterns that inform strategic choices.
What is the importance of model validation in data mining?
To ensure the accuracy and reliability of the extracted patterns.
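One common validation scheme is k-fold cross-validation. The sketch below only generates the train/test index splits, without shuffling; the helper name `k_fold_indices` is hypothetical, and fitting and scoring a model on each split is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(7, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging the model's score over the folds gives a more reliable estimate than a single train/test split.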
What is the final step in the data mining process?
Knowledge presentation: communicating the results to stakeholders.
What is an interesting pattern?
A pattern is interesting if it is:
1. easily understood by humans;
2. valid on new or test data with some degree of certainty;
3. potentially useful;
4. novel;
or if it validates a hypothesis the user sought to confirm.
An interesting pattern represents knowledge.
Identify the steps of the data science process.
- Data Collection
- Data Cleaning
- Data Integration
- Data Selection (retrieving the data relevant to the analysis)
- Data Transformation
- Data Mining
- Pattern Evaluation
- Knowledge Presentation
Name the main data mining tasks.
1. Anomaly Detection
2. Association Rule Mining
3. Clustering
4. Classification
5. Regression
6. Summarization
The quality of a data source is based on:
- Completeness
- Correctness: how relevant it is to the problem being solved
Identify some Python machine learning packages for data mining.
- scikit-learn
- Kedro
- TensorFlow
- Keras
- PyTorch
Correctness will depend on:
- Validity
- Accuracy
- Consistency
- Uniformity
Transformations can be made through:
- Normalizing and Scaling
- Encoding Categorical Variables
- Discretization
- Domain-related transformations
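The first three transformation types listed above can be sketched in plain Python as follows. The function names and sample values are hypothetical; min-max scaling, one-hot encoding, and equal-width binning are just one possible choice for each category.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode each label as a 0/1 vector over the sorted distinct labels."""
    categories = sorted(set(labels))
    return [[1 if label == cat else 0 for cat in categories] for label in labels]

def equal_width_bins(values, n_bins):
    """Discretize numbers into n_bins equal-width intervals (bin index 0..n_bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, 30, 40, 60]
scaled = min_max_scale(ages)                # values between 0.0 and 1.0
encoded = one_hot(["red", "blue", "red"])   # columns ordered as ["blue", "red"]
binned = equal_width_bins(ages, 2)          # two bins: [20, 40) and [40, 60]
print(scaled, encoded, binned)
```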
What to do if there are missing values?
- Do nothing
- Drop the missing values
- Replace them with other values (mean, median, mode, interpolation); this is known as imputation.
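A short sketch of the drop and imputation options, using the standard `statistics` module. Missing values are represented by `None`, and the helper names and sample column are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical column with two missing entries.
values = [4.0, None, 6.0, 4.0, None, 10.0]

def drop_missing(values):
    """Keep only the observed (non-missing) values."""
    return [v for v in values if v is not None]

def impute(values, strategy=mean):
    """Replace each missing value with a statistic of the observed values."""
    fill = strategy(drop_missing(values))
    return [fill if v is None else v for v in values]

dropped = drop_missing(values)
mean_filled = impute(values, mean)
median_filled = impute(values, median)
mode_filled = impute(values, mode)
print(dropped, mean_filled, median_filled, mode_filled, sep="\n")
```

Dropping shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values.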