Unit 1 - What is Data Science? Flashcards
Data Science
Data Science is an interdisciplinary field concerned with
- collecting,
- preparing, and
- processing
available data in order to obtain insight from it.
Data mining
Data mining is the process of gaining insight into a data set by recognising hidden patterns within it, known as pattern recognition.
This is done through analysis and model fitting: trying to find a model that represents the data, or the process that generates the data.
CRISP
The Cross-Industry Standard Process (CRISP, often written CRISP-DM) model of the data mining process.
6 Phases of the CRISP model
- Business Understanding: understanding the business scenario that the data mining process will be performed for
- Data Understanding: understanding the data involved in the task and defining what is and isn’t needed
- Data Preparation: preparing the data in order to make the data mining task less cumbersome and easier to achieve
- Modelling: fitting a model that performs the required task
- Evaluation: evaluating the model using metrics suitable to the task in hand (evaluating classification is different to evaluating regression or clustering)
- Deployment: deploying the model to be utilised by the business.
System testing
This is testing whether a system is working as intended. It normally investigates the integration of the different components and whether any issues arise from their interaction.
4 Types of System Testing Faults
- Accidental
- Logical
- Flow
- Implementational
Unit testing
Unit testing is testing an individual module or component of a system.
As with system testing, it normally looks for logical or implementational errors regarding the intended functionality of the unit.
This type of testing is more prevalent and occurs several times in the life of a component: whenever there is a change to its functionality or code.
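For instance, a minimal sketch of a unit test in Python using the standard unittest module; the add function and its expected behaviour are hypothetical:

    import unittest

    def add(a, b):
        # The unit under test: a hypothetical helper function.
        return a + b

    class TestAdd(unittest.TestCase):
        def test_adds_two_numbers(self):
            # Check the intended functionality of the unit.
            self.assertEqual(add(2, 3), 5)

        def test_handles_negatives(self):
            # A second case guarding against logical errors.
            self.assertEqual(add(-1, 1), 0)

    if __name__ == "__main__":
        unittest.main()

Such a test would be rerun whenever the unit's functionality or code changes.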
Model testing
Model testing depends on the task in hand: whether it is classification, clustering or regression.
In model testing the prediction performance of the model and its level of accuracy in performing the required task are tested.
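As an illustration, a minimal sketch of testing a classification model's accuracy in plain Python; the labels are made up for the example:

    # Compare a model's predictions against the true labels.
    y_true = [0, 1, 1, 0, 1]   # ground-truth class labels (hypothetical)
    y_pred = [0, 1, 0, 0, 1]   # the model's predictions (hypothetical)

    # Accuracy: the fraction of predictions that match the truth.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    print(f"accuracy = {accuracy:.2f}")  # 0.80

For regression or clustering, different metrics (e.g. mean squared error) would be used instead.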
Discriminative models
Models can be categorised by addressing their intrinsic capabilities.
Discriminative models are those capable only of discriminating between the different classes; they do not model how the data itself is generated.
Generative models
These are another type of model, capable of generating synthetic data that resembles the data arising in the tasks being dealt with.
These are more powerful than discriminative models, but they are more difficult to build and often need more computational power.
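As a rough illustration, a minimal Python sketch of a very simple generative model: fitting a Gaussian to observed data for one class and sampling synthetic data from it (the numbers are invented):

    import random
    import statistics

    # Observed data for one class (hypothetical measurements).
    observed = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2]

    # "Fit" the generative model: a Gaussian with the sample mean
    # and standard deviation of the observed data.
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)

    # Generate synthetic data that should resemble the observed data.
    synthetic = [random.gauss(mu, sigma) for _ in range(5)]
    print(synthetic)

A discriminative model, by contrast, would only learn a rule for telling the classes apart.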
Bayes Theorem
Bayes Theorem is the backbone of Bayesian Statistics and Bayesian models. As opposed to Frequentist Statistics, Bayes Theorem defines the probability (P) of an event (H) conditioned on another event (E) as follows:
P(H | E) = P(E | H) · P(H) / P(E)
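For example, a small numeric check of the formula in Python; the probabilities are invented for illustration:

    # Hypothetical numbers: P(H) prior, P(E | H) likelihood, P(E) evidence.
    p_h = 0.01          # prior probability of the hypothesis H
    p_e_given_h = 0.9   # probability of the evidence E if H holds
    p_e = 0.05          # overall probability of the evidence E

    # Bayes Theorem: posterior = likelihood * prior / evidence.
    p_h_given_e = p_e_given_h * p_h / p_e
    print(f"P(H | E) = {p_h_given_e:.2f}")  # 0.18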
Precision of an attribute
Precision of a feature (taken from a measurement) is the closeness of repeated measurements to one another.
Bias of an attribute
Bias is a systematic variation of measurements from the actual quantity being measured.
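As a quick illustration of precision versus bias, a minimal Python sketch with invented measurements of a quantity whose true value is 10.0:

    import statistics

    true_value = 10.0
    # Hypothetical repeated measurements: tightly clustered (precise)
    # but systematically offset from the true value (biased).
    measurements = [10.52, 10.49, 10.51, 10.50, 10.48]

    bias = statistics.mean(measurements) - true_value   # systematic offset
    spread = statistics.stdev(measurements)             # (im)precision

    print(f"bias = {bias:.2f}")      # ~0.50: systematic variation
    print(f"spread = {spread:.3f}")  # small: measurements close together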
Stratification
Sampling in a way that maintains the distribution (for example, the class proportions) of the underlying data.
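For instance, a minimal sketch of a stratified train/test split, assuming scikit-learn is available; the data is invented:

    from sklearn.model_selection import train_test_split

    X = [[0], [1], [2], [3], [4], [5], [6], [7]]   # hypothetical samples
    y = [0, 0, 0, 0, 0, 0, 1, 1]                   # imbalanced class labels

    # stratify=y makes both splits keep the 3:1 class ratio of y.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0
    )
    print(sorted(y_test))  # [0, 0, 0, 1]: same class proportions as y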
Standardisation
Standardisation involves calculating the mean x̄ and standard deviation sx of a feature x.
These statistical measures are calculated by treating the values that reside inside the feature as samples.
The following transformation is then applied to the data: x′ = (x − x̄)/sx.
The new feature x′ calculated from x has a mean of 0 and a standard deviation of 1, i.e., it is standardised.
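A minimal sketch of this transformation in Python, with made-up feature values:

    import statistics

    # Hypothetical values of a feature x.
    x = [2.0, 4.0, 6.0, 8.0]

    x_bar = statistics.mean(x)   # sample mean x̄
    s_x = statistics.stdev(x)    # sample standard deviation sx

    # Apply x' = (x - x̄) / sx to every value of the feature.
    x_prime = [(v - x_bar) / s_x for v in x]

    print(round(statistics.mean(x_prime), 10))   # 0.0
    print(round(statistics.stdev(x_prime), 10))  # 1.0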