Lecture 1 Flashcards
data science
What is the role of a data scientist? (responsible data analytics from a data scientist perspective)
data scientist:
technical tools:
* has statistical tools for data analytics
* has the fundamentals of machine learning for data analytics
* makes the design choices
Responsible analysis
* accounts for data bias and bias mitigation
* accounts for other stakeholders and “non-customers”
* decides on which design choices are made
What are the four flavours/parts in data analytics?
Descriptive analytics
Diagnostic analytics
Predictive analytics
Prescriptive analytics
What is descriptive analytics?
(Main question and tools)
Main question: What is happening?
Tools: Visualization, Statistics
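A minimal sketch of answering "What is happening?" with basic statistics, using only the Python standard library. The `daily_sales` figures are invented for illustration and are not from the lecture.

```python
import statistics

# Hypothetical daily sales figures, invented for illustration.
daily_sales = [12, 15, 11, 18, 15, 20, 14]

# Descriptive analytics: summarise what the data looks like right now.
summary = {
    "count": len(daily_sales),
    "mean": statistics.mean(daily_sales),
    "median": statistics.median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}
print(summary)
```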
What is Diagnostic analytics?
(Main question and tools)
Main question: Why did it happen?
Tools: Advanced statistics, clustering
What is Predictive analytics?
(Main question and tools)
Main question: What is likely to happen?
Tools: Supervised, unsupervised machine learning
What is Prescriptive analytics?
(Main question and tools)
Main question: What should I do about it?
Tools: Monitoring, Stakeholder analysis
What are the data science aspects?
- Proper Data Usage
- Data Nature
- Data Type
- Data Visualization
- Modelling
- Validation
What are the goals of data science?
- To have an overview and terminology
- to know where to look for answers
- to ask the “right” questions
- to give the “right” answers
- data value
- opportunities
- challenges
I think to understand the data value and add value to the data
What does data science consist of? (Data as integral part)
- collecting
- curating
- cleaning
of the data
Collecting: gathering the data
Curating: selecting, organizing, and looking after the data
Cleaning: fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within the data set (after it has been collected and curated)
The collected, curated, and cleaned data can then be visualized, analysed, and modelled (slide 37, week 1)
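The cleaning step described above can be sketched as follows. This is only an illustration under assumptions: the `raw_records` list, the field names, and the `clean` helper are hypothetical and not part of the lecture material.

```python
# Hypothetical raw records, invented for illustration.
raw_records = [
    {"name": "Alice", "age": "34"},
    {"name": "Bob", "age": ""},        # incomplete: missing age
    {"name": "Alice", "age": "34"},    # duplicate of the first row
    {"name": " Carol ", "age": "41"},  # incorrectly formatted name
    {"name": "Dave", "age": "abc"},    # incorrect: age is not a number
]


def clean(records):
    """Return records with formatting fixed and bad or duplicate rows removed."""
    seen = set()
    cleaned = []
    for record in records:
        name = record["name"].strip()
        age_text = record["age"].strip()
        if not name or not age_text.isdigit():
            continue  # drop incomplete or incorrect rows
        key = (name, int(age_text))
        if key in seen:
            continue  # drop duplicates
        seen.add(key)
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
```

Whether a bad row should be removed or repaired is a design choice; this sketch simply removes it.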
Modelling questions (to ask yourself)
- Why do I want to model?
- what is useful to model?
- what can i model?
- how will the model be used?
- who is going to use the model?
Data questions (to ask yourself)
- What data do I need?
- What data do i have?
- How hard is it to get the data?
What is the essence of data science?
To refine the questions.
(Slide 48 is a bit vague, but I think it means: as a data scientist you ask questions, get responses (from customers who may not know much about data science), and based on those responses you refine the question and ask new, more specific questions.)
What is data?
- “Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation
- Information in digital form that can be transmitted or processed
- Information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful” – Merriam-Webster Dictionary
Types of Data
There are many different types:
* Transport
* Geographical
* cultural
* scientific
* financial
* statistical
* meteorological (about weather)
* natural (nature)
Types of data structures
- Structured data
- semi-structured data
- unstructured data
What is the difference between structured, semi-structured and unstructured data ?
Structured data is stored in a predefined format and is highly specific. Unstructured data is a collection of many varied data types that are stored in their native formats. Semi-structured data does not follow the tabular models associated with relational databases or other data tables, but it still contains tags or markers (as in JSON or XML) that separate and label its elements.
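A small sketch of the three kinds of data, using only the Python standard library. The CSV, JSON, and text contents are invented for illustration; using JSON as the semi-structured example is an assumption, not something stated in the lecture.

```python
import csv
import io
import json

# Structured: a predefined tabular format (hypothetical CSV content).
csv_text = "name,age\nAlice,34\nCarol,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: no fixed table, but keys label and nest the content.
json_text = '{"name": "Alice", "tags": ["student"], "address": {"city": "Delft"}}'
document = json.loads(json_text)

# Unstructured: free text kept in its native format.
note = "Alice called on Monday and asked about her order."

print(rows[0]["age"])               # value looked up by column name
print(document["address"]["city"])  # value looked up by nested key
print(len(note.split()))            # only generic processing is possible
```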
Differences between structured and unstructured data
Structured data:
1. displayed in rows, columns and relational databases
2. consists of numbers, dates, and strings
3. estimated to be 20% of enterprise data
4. requires less storage
5. easier to manage and protect with legacy solutions
Unstructured data:
1. cannot be displayed in rows, columns and relational databases
2. images, audio, video, word processing files, e-mails, spreadsheets
3. estimated 80% of enterprise data
4. requires more storage
5. more difficult to manage and protect with legacy solutions
Unstructured (digital) data examples
An image
* An image is basically a matrix of numbers
* each element of the matrix (a pixel) is identified by three values: R, G, B (red green blue)
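As a rough sketch, an image can be written as a nested list in which every pixel is an (R, G, B) triple. The 2x2 `image` below is invented for illustration, and the 0 to 255 value range is an assumption (the common 8-bit convention).

```python
# A hypothetical 2x2 image: each pixel is an (R, G, B) triple in 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

height = len(image)             # number of rows in the matrix
width = len(image[0])           # number of columns in the matrix
red, green, blue = image[1][0]  # pixel in row 1, column 0
print(height, width, (red, green, blue))
```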
Signal/sound/speech
* a signal is represented as a vector (array)
* time corresponds to the index of the array
* the different values represent the content of the array
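A minimal sketch of a signal stored as an array, where the index stands for time. The sine wave and the `sample_rate` of 8 samples per second are assumptions chosen only to keep the example small.

```python
import math

sample_rate = 8  # samples per second (hypothetical)

# One second of a 1 Hz sine wave; index n corresponds to time n / sample_rate.
signal = [math.sin(2 * math.pi * n / sample_rate) for n in range(sample_rate)]

index = 2
time_seconds = index / sample_rate  # convert an array index back to a time
print(time_seconds, signal[index])
```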
Text
* A text is represented as a vector (array)
* the position of a letter in the text corresponds to the index of the array
* the different values represent the content of the array.
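The same idea for text, as a short sketch: each index holds one character, which can be stored as a number. Using Unicode code points via `ord` is one possible encoding, assumed here for illustration.

```python
text = "data"

# Each position (index) holds one character, stored here as its code point.
codes = [ord(character) for character in text]
print(codes)
print(text[1], codes[1])  # the letter at index 1 and its numeric value
```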
Structured data, specifically tabular data: what are example file formats?
Excel files, CSV files for example
Structured data types
(what are the two that it can be broken down in?)
Quantitative (also called numerical) data can be directly represented as a number (integer or floating point). It can be broken down further into:
* continuous (height, weight, age)
* discrete (number of cards, number of patients, number of books)
Categorical data is generally textual and (often) needs further processing to be analyzed. It can be broken down further into:
* ordinal (grades, size of clothing, study level)
* nominal (hair colour, gender, marital status)
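A sketch of what "further processing" of categorical data might look like. The `record`, the `size_rank` mapping, and the list of hair colours are hypothetical; rank mapping and one-hot encoding are common options, not necessarily the ones used in the course.

```python
# Hypothetical record mixing the four kinds of structured data.
record = {
    "height_cm": 172.5,      # quantitative, continuous
    "books_owned": 12,       # quantitative, discrete
    "clothing_size": "M",    # categorical, ordinal
    "hair_colour": "brown",  # categorical, nominal
}

# Ordinal values have an order, so they can be mapped to ranks.
size_rank = {"S": 0, "M": 1, "L": 2}

# Nominal values have no order; one-hot encoding avoids implying one.
hair_colours = ["black", "blonde", "brown"]
one_hot = [1 if record["hair_colour"] == colour else 0 for colour in hair_colours]

encoded = [
    record["height_cm"],
    record["books_owned"],
    size_rank[record["clothing_size"]],
] + one_hot
print(encoded)
```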
What is metadata? and why is it there?
Data about the data.
It is usually employed for administrative/archival purposes.
Different types of metadata
- descriptive
- technical
- administrative
- structural
- rights
- presentation
What is descriptive metadata?
Defines or describes an information resource to aid identification, recovery, and retrieval at any and all levels of aggregation
for example:
* publication-level metadata
* citation metadata
* subject indexing
* linking metadata
What is technical metadata?
Describes objective technical information about an information resource
for example:
* file size
* pixel height
* duration
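As an illustration, file size is a piece of technical metadata that can be read without opening the content. The sketch below creates its own temporary file so it is self-contained; the file contents and the `technical_metadata` dictionary are invented for the example.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"hello metadata")
    path = handle.name

# os.stat returns information about the file, not its content.
info = os.stat(path)
technical_metadata = {
    "file_size_bytes": info.st_size,
    "extension": os.path.splitext(path)[1],
}
print(technical_metadata)

os.remove(path)  # clean up the temporary file
```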
What is Administrative metadata?
Supports the general management and use of an information resource
for example:
* identification metadata
* content lifecycle metadata
* versioning metadata
What is structural metadata?
defines what the component objects are and how they relate to each other
for example:
* product definition metadata
* product organization, assembly metadata
What is ‘rights’ metadata?
Supports the management of an information resource’s intellectual property.
* geographic scope metadata
* timeframe metadata
* rights holder metadata
What is presentation metadata?
Defines how information will be formatted for a particular object.
for example:
* tagging for browser display
What is the ‘Role of data Nature’?
The role of data nature defines the:
* type of questions
* type of model
* data collection approach
What implications do data bring?
- how to store it?
- how to process it?
- what ML approach?
- what computational power?
- what are the inputs and outputs?