DataCamp: Data Science for Business Flashcards
What is data science?
A set of methodologies for taking in the many forms of data available and drawing meaningful conclusions from them.
What can data do?
describe current state of an organisation or process
detect anomalous events
diagnose the cause of events and behaviours
predict future events
Data Science Workflow
Data Collection
Exploration and Visualization
Experimentation and Prediction
Applications of Data Science:
Traditional Machine Learning
Internet of Things - Refers to gadgets that are not standard computers but have the ability to transmit data, such as smart watches and home appliances.
Image classification using Deep Learning
Data Pseudonymization
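Replacing personally identifying fields with artificial identifiers (pseudonyms) to protect privacy. Often loosely referred to as anonymization.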
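A minimal sketch of the idea, assuming a keyed hash is an acceptable way to derive the replacement identifier. The record fields, the `pseudonymize` helper and the secret key are hypothetical and only for illustration:

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token that stands in for an identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened only for readability

# Made-up customer records containing PII (the email address).
records = [
    {"email": "ana@example.com", "purchase": 30.0},
    {"email": "bob@example.com", "purchase": 12.5},
    {"email": "ana@example.com", "purchase": 7.0},
]

# Drop the email and keep a pseudonymous customer_id instead.
pseudonymized = [
    {"customer_id": pseudonymize(r["email"]), "purchase": r["purchase"]}
    for r in records
]
```

The same email always maps to the same token, so purchases by one customer can still be grouped without exposing who the customer is.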
PII - Personally Identifiable Information
Data that can be linked to an individual
Data Sources
Web events, logistics data, financial transactions, customer data
Solicited Data (SD)
Obtained by requesting opinions from customers, for example through surveys, in-app questionnaires, focus groups and customer reviews. SD are used to: 1. De-risk decision making 2. Monitor quality and 3. Create marketing collateral.
Types of Solicited Data
Qualitative (subjective): conversations, open-ended questions. Helps to generate hypotheses.
Quantitative: multiple choice, rating scale. Used to validate hypotheses.
Net Promoter Score
A quantitative method of measuring stated preference: customers rate how likely they are to recommend a product.
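A sketch of the usual calculation, assuming the standard 0-10 scale where ratings of 9-10 count as promoters and 0-6 as detractors. The function name and the ratings are made up:

```python
def net_promoter_score(ratings):
    """Compute NPS: percentage of promoters minus percentage of detractors."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 to 6
    return 100.0 * (promoters - detractors) / len(ratings)

sample_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]  # made-up survey responses
nps = net_promoter_score(sample_ratings)  # 4 promoters, 3 detractors -> 10.0
```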
Other data sources
APIs (a way of requesting data from third parties over the internet), such as Twitter, Wikipedia, Google Maps, Yahoo Finance
Public records
Mechanical Turks - Manual data input by humans
Unstructured data
Data such as emails, text, video and audio files, web pages and social media. Stored in document databases. Use NoSQL for data retrieval.
Tabular data
Stored in relational databases. Use SQL for data retrieval.
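A small illustration of retrieving tabular data with SQL, using Python's built-in `sqlite3` module and an in-memory database. The `customers` table and its rows are invented for the example:

```python
import sqlite3

# In-memory relational database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Ana", "Lisbon"), ("Bob", "Nairobi"), ("Chen", "Lisbon")],
)

# Retrieve the names of all customers in one city.
rows = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("Lisbon",)
).fetchall()
conn.close()
```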
Dashboards
A dashboard is a set of metrics, usually in the form of graphs, that update at a pre-defined frequency such as real-time, daily or weekly. Dashboards help to visualize and explore collected data. A common dashboard element is the time series plot, which tracks a value over time.
Ad hoc request
Requests for data that do not need to be repeated on a daily or weekly basis. Such requests should be specific and include context and priority.
A/B Testing
A/B testing is a type of experiment for de-risking a choice between two options, such as a change to a website, the addition of a new feature or the wording of an email subject line.
A/B Testing steps
- Picking a metric to track
- Calculating the sample size using the baseline metric and test sensitivity
- Running the experiment
- Determining the significance
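The sample size and significance steps can be sketched as follows, assuming the tracked metric is a conversion rate and that a two-sided two-proportion z-test is an acceptable significance test. The function names, the 5% significance level, the 80% power and all of the numbers are assumptions for illustration; "test sensitivity" is read here as the smallest lift worth detecting:

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, min_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift in a rate."""
    p1 = baseline
    p2 = baseline + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a z-test comparing two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Baseline conversion of 10%, and we care about a lift of 2 percentage points.
n_needed = sample_size_per_group(baseline=0.10, min_lift=0.02)

# Made-up experiment results: variant B converts 480 of 4000, A converts 400 of 4000.
p_value = two_proportion_p_value(conv_a=400, n_a=4000, conv_b=480, n_b=4000)
significant = p_value < 0.05
```

A smaller minimum lift or a lower baseline rate increases the required sample size, which is why both are needed before the experiment starts.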
Machine Learning
A set of methods for making predictions based on existing data.
Supervised learning
A subset of machine learning in which the data has both features and labels. A model learns from the labeled examples to predict labels for new data. Used to solve problems such as recommendation systems, email subject optimization and churn prediction.
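A toy churn-prediction sketch, assuming a 1-nearest-neighbor rule as the model (one of many possible choices). The features (monthly logins, support tickets), the labels and the data are hypothetical:

```python
import math

# Hypothetical labeled training data: (monthly_logins, support_tickets) -> label.
training_data = [
    ((2, 5), "churned"),
    ((1, 4), "churned"),
    ((20, 0), "stayed"),
    ((15, 1), "stayed"),
]

def predict(features):
    """Predict the label of the closest training example (1-nearest neighbor)."""
    _, nearest_label = min(
        training_data, key=lambda example: math.dist(example[0], features)
    )
    return nearest_label

prediction = predict((3, 3))  # few logins, several tickets
```

The labels are what make this supervised: the prediction for a new customer is copied from the most similar customer whose outcome is already known.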
Clustering
ML algorithms that divide data into clusters. Applied in customer segmentation (where customers are divided into groups with common attributes), image categorisation and anomaly detection. Clustering is part of unsupervised learning, which uses data with features only and no labels.
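A minimal customer segmentation sketch, assuming k-means as the clustering algorithm with a naive initialization (the first k points). The customer features (orders per year, average spend) and values are made up:

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters; returns (labels, centroids)."""
    centroids = list(points[:k])  # naive initialization: first k points
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return labels, centroids

# Made-up customers: (orders per year, average spend). No labels are provided.
customers = [(1, 20), (2, 25), (1, 22), (12, 200), (11, 210), (13, 190)]
labels, centroids = k_means(customers, k=2)
```

The algorithm is never told which customers belong together; the two segments emerge from the features alone.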
Special topics in Machine Learning
Time Series
Time series forecasting is any type of ML where time is an important feature. Time series data often shows periodic patterns, and modelling it can help spot seasonality.
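To illustrate seasonality, here is a sketch of a seasonal naive forecast, a simple baseline rather than a full ML model: each future value is predicted to equal the value one season earlier. The function name and the daily sales figures are invented:

```python
def seasonal_naive_forecast(series, season_length, horizon):
    """Forecast each future step as the value observed one season earlier."""
    if len(series) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = series[-season_length:]
    return [last_season[i % season_length] for i in range(horizon)]

# Two weeks of made-up daily sales with a weekend peak (weekly seasonality).
daily_sales = [100, 120, 130, 125, 160, 220, 210,
               105, 118, 135, 128, 158, 230, 215]
forecast = seasonal_naive_forecast(daily_sales, season_length=7, horizon=3)
```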
Natural Language Processing (NLP)
Refers to ML problems where the input data is text. Possible data sources include customer reviews, tweets, medical records and email subjects. NLP can be used, for example, to classify the sentiment of reviews or to cluster medical records.
Deep Learning (Neural Networks)
An area of ML used to solve more complex problems. Requires more data than traditional ML. Best used where inputs are less structured, such as large amounts of text or images. It can make highly accurate predictions, but its main drawback is the lack of explainability of those predictions.
Explainable AI
Refers to methods that allow humans to understand the factors behind each prediction.