Data Mining Flashcards
What is Data Mining?
Data mining is the process of extracting knowledge from data. It aims to identify correlations in data, find patterns and variations, understand trends, and predict probabilities.
Patterns
A pattern is a variable that changes in a repeating or predictable way.
Trends
A trend is a general change in one variable compared to another over time.
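One simple way to quantify a trend is the slope of a least-squares line fitted to one variable against another. The sketch below uses only the Python standard library; the function name `trend_slope` and the monthly sales figures are hypothetical and chosen purely for illustration.

```python
def trend_slope(xs, ys):
    """Return the least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical monthly sales figures (illustrative data only).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 108, 115, 121, 130, 138]

slope = trend_slope(months, sales)
direction = "upward" if slope > 0 else "downward" if slope < 0 else "flat"
print(f"slope = {slope:.2f} units per month ({direction} trend)")
```

A positive slope suggests an upward trend and a negative slope a downward one; the magnitude is the average change in the second variable per unit of the first.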
Data Mining Techniques
- Classification.
- Clustering.
- Anomalies.
- Association Rule Mining.
- Sequential Patterns.
- Affinity grouping.
- Decision Trees.
- Regression.
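As an illustration of one technique from the list above, association rule mining typically scores a rule such as "bread -> milk" by its support (how often the items occur together) and its confidence (how often the consequent appears when the antecedent does). The following is a minimal sketch using only the standard library; the transactions and the helper names `support` and `confidence` are hypothetical.

```python
# Hypothetical shopping baskets (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"bread -> milk: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Dedicated algorithms such as Apriori search for all rules above chosen support and confidence thresholds; this sketch only scores a single rule.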
Commonly used software tools for data mining:
- Spreadsheets.
- R-Language.
- Python.
- IBM SPSS Statistics.
- IBM Watson Studio.
- SAS.
Spreadsheets (Excel and Google Sheets)
Used for hosting data that has been exported from other systems, so that it is accessible, easy to read, and usable for drawing comparisons between sets of data.
Excel add-ins: Data Mining Client, XLMiner, and KnowledgeMiner.
Google Sheets add-ins: Text Analysis, Text Mining, and Google Analytics.
R-Language packages:
- tm: a framework for text mining applications within R.
- twitteR: a framework for mining tweets.
R-Language
Commonly used by statisticians and data miners for statistical modeling and computation. With R libraries we can perform data mining operations such as:
- Regression.
- Classification.
- Data Clustering.
- Association Rule Mining.
- Text Mining.
- Outlier Detection.
- Social Network Analysis.
Python libraries:
- Pandas.
- NumPy.
- Jupyter.
Pandas
Data in many formats can be loaded and then organized, sorted, and manipulated. We can:
- Perform basic numerical computations such as mean, median, mode, and range.
- Calculate statistics and examine correlations between variables and the distribution of data.
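The operations above can be sketched with a small DataFrame. This is a minimal example that assumes pandas is installed; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical study data (illustrative values only).
df = pd.DataFrame({
    "hours_studied": [1, 2, 2, 4, 5],
    "exam_score": [52, 58, 60, 71, 80],
})

hours = df["hours_studied"]

# Basic numerical computations on one column.
summary = {
    "mean": hours.mean(),
    "median": hours.median(),
    "mode": hours.mode().tolist(),  # mode() returns a Series; there may be ties
    "range": hours.max() - hours.min(),
}

# Pearson correlation between the two columns.
correlation = hours.corr(df["exam_score"])

# Sort rows by score, highest first.
df_sorted = df.sort_values("exam_score", ascending=False)

print(summary)
print(f"correlation = {correlation:.3f}")
print(df_sorted)
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

`describe()` gives a quick view of each column's distribution, which is often a useful first step before applying any mining technique.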
NumPy
A tool for mathematical computing and data preparation that offers a host of built-in functions and capabilities for data mining.
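A typical data preparation step with NumPy is filling in missing values and rescaling a feature before mining. The sketch below assumes NumPy is installed; the sample values, and the choice of median imputation followed by min-max scaling, are illustrative assumptions rather than a prescribed workflow.

```python
import numpy as np

# Hypothetical measurements with one missing value (illustrative data only).
values = np.array([12.0, 15.0, np.nan, 14.0, 13.0, 95.0])

# Replace missing entries with the median of the observed values.
median = np.nanmedian(values)
cleaned = np.where(np.isnan(values), median, values)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
scaled = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

print("median of observed values:", median)
print("cleaned:", cleaned)
print("scaled:", np.round(scaled, 3))
```

Min-max scaling is sensitive to extreme values such as the 95.0 above, so in practice an outlier check is often done before rescaling.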
Jupyter
Data scientists and data analysts use this tool to perform data mining and statistical analysis.
IBM SPSS (Statistical Package for the Social Sciences) Statistics
Popularly used for advanced analytics, text analytics, trend analysis, validation of assumptions, and translation of business problems into data science solutions.
IBM Watson Studio
Leverages a collection of open-source tools, such as Jupyter notebooks, and extends them with closed-source IBM tools that make it a powerful environment for data analysis and data science.
SAS Enterprise Miner
A powerful graphical workbench for interactive data exploration. It can be used to mine and transform data, identify anomalies, analyze big data, and identify patterns and relationships within data.