09 - Data Analysis Flashcards
What is data analysis?
Data analysis is the process of collecting, transforming, cleaning, and modeling data with the goal of discovering the required information.
The idea is to separate a whole into its component parts. For example, in criminal data analysis:
- Build criminal patterns to detect and prevent crimes
- Break down the patterns of crimes to evaluate when, where, and why they occur
What are data limitation issues?
- Missing data
  - Missing altogether
  - Specific items
- Altered data
  - At rest, in transit, or in possession
  - Intentional vs. unintentional
- Different definitions of the same data (possibility of misunderstanding)
- Non-existent data (data of interest does not exist or is not in a usable form)
- Data is available in different forms
What are the four steps of data analysis?
- Planning
  - Identification of data, data availability & quality
- Data Collection (Inventory / Pre-Processing)
  - Improving data quality
  - Information gathering
- Data Preparation
  - Structure data for the analysis tool
  - Data cleaning
  - Depends on the type of data
- Data Analysis
  - Visualization
  - Statistics
  - Relationship linking
What are the major tasks of data pre-processing?
- Data summarisation
  - Understand the distribution of the data
- Data cleaning
  - Fill in missing values, remove outliers / noisy data
- Data integration
  - Integrate multiple data sources
- Data transformation
  - Aggregation of data
- Data reduction
  - Keep only what is relevant
- Data discretisation
  - Reduction with a focus on numerical data
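Two of the tasks above, cleaning and discretisation, can be sketched in a few lines. This is only an illustration: the readings, the choice of the median as fill value, and the interval boundaries are all assumptions made for the example, not part of the notes.

```python
from statistics import median

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, None, 15.0, 14.0, None, 13.0, 16.0]

# Data cleaning: fill missing values with the median of the observed values
# (one possible strategy; the mean or a default value would also work).
observed = [x for x in raw if x is not None]
fill = median(observed)
cleaned = [fill if x is None else x for x in raw]

# Data discretisation: reduce each number to a labelled interval.
# The boundaries 13.0 and 15.0 are arbitrary choices for this sketch.
def discretise(value, low=13.0, high=15.0):
    if value < low:
        return "low"
    if value <= high:
        return "medium"
    return "high"

labels = [discretise(x) for x in cleaned]
print(cleaned)
print(labels)
```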
What are the four types of central tendency?
- Mean (sensitive to extreme values)
- Median (robust to outliers)
- Midrange ((min + max) / 2)
- Mode (value that occurs most frequently)
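The four measures can be compared on a small sample using Python's standard `statistics` module. The sample values are made up for illustration and include one extreme value to show how differently the measures react.

```python
from statistics import mean, median, multimode

# Hypothetical sample with one extreme value (100).
values = [2, 3, 3, 4, 5, 100]

avg = mean(values)                          # pulled upward by the extreme value
mid = median(values)                        # middle of the sorted data
midrange = (min(values) + max(values)) / 2  # depends only on the two extremes
modes = multimode(values)                   # list of the most frequent value(s)

print(avg, mid, midrange, modes)
```

Here the mean and the midrange are dragged toward 100, while the median and the mode stay close to the bulk of the data.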
What is the five number summary?
The five-number summary is a set of descriptive statistics that provide information about a dataset. It consists of the five most important sample percentiles:
minimum, Q1 (first quartile), median, Q3 (third quartile), maximum
For example, the five-number summary can be used to examine the distribution of credit card data and to flag outliers as potentially fraudulent values.
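A minimal sketch of the summary and an outlier check follows. Two things here are assumptions not stated in the notes: the quartiles use the "inclusive" method of `statistics.quantiles` (other methods give slightly different values), and outliers are flagged with the common 1.5 × IQR rule. The transaction amounts are made up.

```python
from statistics import quantiles

# Hypothetical credit card transaction amounts.
amounts = [10, 12, 14, 15, 16, 18, 20, 22, 400]

# Quartiles via the "inclusive" method (an assumption; methods differ).
q1, med, q3 = quantiles(amounts, n=4, method="inclusive")
summary = (min(amounts), q1, med, q3, max(amounts))

# Assumed rule of thumb: values beyond 1.5 * IQR from the quartiles are outliers.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in amounts if a < lower or a > upper]

print(summary)
print(outliers)
```

A flagged value is only a candidate for review; an unusually large amount is not necessarily fraud.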
What is a data warehouse?
A data warehouse (DW), also known as an enterprise data warehouse, is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources.
What is behaviour profiling?
Behaviour profiling finds suspicious behaviour by analyzing databases. It provides the capability to recognize patterns of criminal activity and to predict when and where crimes are likely to take place.
What is data mining and what is the ultimate goal?
Data mining, also called knowledge discovery in databases, is the extraction of interesting (hidden) information or patterns from data in large databases.
The ultimate goal of data mining is the prediction of human behaviour.
What are potential applications for data mining?
- Database analysis and decision support
- Market analysis and management
- Risk analysis and management
- Fraud detection and management
- Text mining (newsgroups, email, documents)
- Web analysis
- Intelligent query answering
How can data mining help to detect fraud?
Data mining can help detect fraud by using historical data to build models of fraudulent behaviour and then applying data mining techniques to identify similar instances.
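One very simple way to "identify similar instances" is a nearest-neighbour lookup against labelled historical cases. The sketch below swaps in that technique purely for illustration; the features (amount, hour of day), the data, and the function name `looks_fraudulent` are all hypothetical.

```python
import math

# Hypothetical labelled history: (amount, hour of day) -> was it fraud?
history = [
    ((20.0, 14), False),
    ((35.0, 10), False),
    ((15.0, 18), False),
    ((900.0, 3), True),
    ((1200.0, 2), True),
]

def looks_fraudulent(transaction):
    """Label a new transaction like its most similar historical instance."""
    _, label = min(history, key=lambda item: math.dist(item[0], transaction))
    return label

print(looks_fraudulent((1000.0, 4)))  # closest to a known fraud case
print(looks_fraudulent((25.0, 12)))   # closest to a normal purchase
```

In this toy version the amount dominates the distance because the features are not scaled; a realistic model would normalise the features and use far more history.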
Name a few data mining techniques.
- Link analysis
- Intelligent agents
- Text mining
- Neural Networks
- Machine-learning algorithms
What is clustering?
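Clustering is a technique to group a set of data objects into clusters, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.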
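As an illustration, here is a minimal one-dimensional k-means sketch. K-means is just one of many clustering algorithms and is not named in these notes; the points and the starting centroids are chosen by hand for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group numbers around the nearest centroid, then recompute the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the index of its closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids)
print(clusters)
```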
What are the three main components of link analysis?
- Entities
- Events
- Associations
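One possible reading of how the three components fit together: events involve entities, and an association exists between two entities that appear in the same event. The sketch below follows that assumption; the event names and people are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical events, each listing the entities involved.
events = {
    "call_1": ["Alice", "Bob"],
    "transfer_1": ["Bob", "Carol"],
    "meeting_1": ["Alice", "Bob", "Dave"],
}

# Associations: for each pair of entities, the set of events they share.
associations = defaultdict(set)
for event, entities in events.items():
    for a, b in combinations(sorted(entities), 2):
        associations[(a, b)].add(event)

# The pair that shares the most events is the strongest link.
strongest = max(associations, key=lambda pair: len(associations[pair]))
print(strongest, sorted(associations[strongest]))
```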
Name a few data mining tools.
- Open-Source
  - Weka
  - Orange
  - RapidMiner
  - NLTK
- Commercial
  - SAS
  - IBM SPSS
- Web site
  - kdnuggets.com