Introduction to Data Analytics Flashcards
What are the 4 Vs of Big Data?
Volume
Velocity
Variety
Veracity
What paradigm do big data scientists use?
Retrospective data mining with multiple hypotheses
Looking for patterns without a particular hunch
Types of Data?
Structured Data
- SQL Database
- Excel Spreadsheet / CSV File
Unstructured Data
- Free Text Responses
- Doctor’s notes
Typical Data Structures
Typical Datasets
- CSV
- eXtensible Markup Language (XML)
- JavaScript Object Notation (JSON)
- Nested JSON in CSV
- SQL
- Excel Data Formats
Other
- .txt files for text
- RGB data for images
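The "nested JSON in CSV" case is easier to picture with a small example. The sketch below uses only the Python standard library; the sample data, the `profile` column name, and the `load_rows` helper are all hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Hypothetical sample: a CSV file whose "profile" column holds a JSON object.
# Inside a quoted CSV field, a literal double quote is written as "".
RAW_CSV = '''id,name,profile
1,Ada,"{""age"": 36, ""tags"": [""maths"", ""computing""]}"
2,Alan,"{""age"": 41, ""tags"": [""logic""]}"
'''

def load_rows(text):
    """Parse CSV text and decode the nested JSON held in the 'profile' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["profile"] = json.loads(row["profile"])  # JSON string -> dict
        rows.append(row)
    return rows

rows = load_rows(RAW_CSV)
print(rows[0]["name"], rows[0]["profile"]["tags"])
```

Note that `csv.DictReader` returns every plain field as a string (so `id` is `"1"`, not `1`); only the decoded JSON column carries real types.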
What is Quantitative Data?
Numbers, such as:
- Temperature
- Heart rate
- Likert ratings
What is Qualitative Data?
Text, such as:
- Surveys
- Interviews
- App comments / User Feedback
What is Quantitative Analysis?
Quantitative Statistical Analysis:
- Descriptive statistics: Mean and SD
- Inferential Statistics: t-tests
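A minimal sketch of both kinds of statistic, assuming two small made-up samples of heart rates. The descriptive part uses the standard `statistics` module; the inferential part computes Welch's t statistic by hand (one common variant of the t-test, chosen here as an assumption). Turning the statistic into a p-value needs the t distribution, which is left out.

```python
import math
import statistics

# Hypothetical resting heart rates (beats per minute) for two groups.
group_a = [62, 65, 70, 68, 64]
group_b = [72, 75, 71, 78, 74]

def describe(values):
    """Descriptive statistics: return (mean, sample standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)

def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    mean_a, sd_a = describe(a)
    mean_b, sd_b = describe(b)
    standard_error = math.sqrt(sd_a**2 / len(a) + sd_b**2 / len(b))
    return (mean_a - mean_b) / standard_error

mean_a, sd_a = describe(group_a)
t_stat = welch_t(group_a, group_b)
print(f"mean={mean_a:.1f}, sd={sd_a:.2f}, t={t_stat:.2f}")
```

A t statistic far from zero suggests the group means differ by more than the within-group spread would explain.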
What is Qualitative Analysis?
Thematic Analysis
Advanced methods:
- Text analytics: word embeddings
- Review mining
Gartner Analytic Continuum
- Descriptive Analytics: Hindsight
- Diagnostic Analytics: Insight
- Predictive Analytics: Foresight
- Prescriptive Analytics
Difficulty and value both increase from descriptive to prescriptive analytics
Typical Data Analytics Process
Data gathering/wrangling/linking -> data cleansing -> exploratory data analysis [EDA] -> supervised machine learning
EDA
- Data visualisation
- Association mining
- Unsupervised machine learning
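Association mining can be illustrated with the two basic rule measures, support and confidence. This is a rough sketch on made-up shopping baskets; the `support` and `confidence` helpers are hypothetical names, not a library API.

```python
# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

Here the rule "bread -> milk" holds in two of the three baskets that contain bread.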
Supervised Machine Learning
- Feature engineering
- Model building
- Model optimisation
- Model evaluation
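The four supervised steps can be sketched end to end with a tiny k-nearest-neighbours classifier. Everything here is an assumption made for illustration: the data is invented, min-max scaling stands in for feature engineering, and comparing a few values of k stands in for optimisation.

```python
import math
from collections import Counter

# Hypothetical labelled data: (temperature in °C, heart rate in bpm) -> label.
train = [((36.5, 62), "well"), ((36.8, 70), "well"), ((37.0, 66), "well"),
         ((38.9, 98), "ill"), ((39.4, 105), "ill"), ((38.6, 92), "ill")]
test = [((36.7, 68), "well"), ((39.0, 100), "ill")]

def fit_scaler(rows):
    """Feature engineering: learn each feature's min and max from the training set."""
    columns = list(zip(*(features for features, _ in rows)))
    return [(min(col), max(col)) for col in columns]

def scale(features, scaler):
    """Rescale each feature so the training range maps to [0, 1]."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(features, scaler))

def predict(scaled_train, point, k):
    """Model building: majority vote among the k nearest training points."""
    nearest = sorted(scaled_train, key=lambda row: math.dist(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(scaled_train, scaled_test, k):
    """Model evaluation: share of test points classified correctly."""
    hits = sum(predict(scaled_train, x, k) == y for x, y in scaled_test)
    return hits / len(scaled_test)

scaler = fit_scaler(train)
scaled_train = [(scale(x, scaler), y) for x, y in train]
scaled_test = [(scale(x, scaler), y) for x, y in test]

# Model optimisation: compare a few values of k. A real project would tune on
# a separate validation set rather than on the test set used here.
scores = {k: accuracy(scaled_train, scaled_test, k) for k in (1, 3, 5)}
print(scores)
```

Scaling matters for k-NN because otherwise the feature with the larger numeric range (heart rate) would dominate the distance.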
Machine Learning
Supervised ML
- Labelled data
- Deep learning (DL), support vector machines (SVM), logistic regression (logit), decision trees, k-nearest neighbours (k-NN)
Unsupervised ML
- Unlabelled data
- Clustering, association rule mining
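As one example of clustering, here is a minimal sketch of k-means (Lloyd's algorithm) on one-dimensional data. The heart-rate values and starting centroids are made up, and `k_means_1d` is a hypothetical helper written for this card.

```python
import statistics

def k_means_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data: assign to nearest centroid, then update."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [statistics.mean(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

# Hypothetical unlabelled heart rates; the starting centroids are arbitrary guesses.
heart_rates = [60, 62, 65, 95, 98, 102]
centroids, clusters = k_means_1d(heart_rates, centroids=[50, 110])
print(centroids, clusters)
```

No labels are supplied: the two groups emerge purely from how close the values are to each other.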
Semi-supervised ML
- Some labelled data
The 5 Tribes of ML & the No free lunch theorem
Symbolists | Structure Inference | Production Rule System & Inverse Deduction
Connectionists | Estimating Parameters | Backpropagation & Deep Learning
Bayesians | Weighing Evidence | HMM & Graphical Models
Evolutionaries | Structure Learning | Genetic Algorithms & Evolutionary Programming
Analogisers | Mapping to Novelty | k-NN & SVM
The Neat and Scruffy Data Scientist
Neat: they care about the details and the ML methods
Scruffy: they care about the results and are somewhat ignorant of details and the methods