Data Science for Business Leaders Flashcards
https://www.datacamp.com/courses/data-science-for-business-leaders
What is data science?
Data science is a set of methodologies for taking in the thousands of forms of data available to us today and using them to draw meaningful conclusions.
What can data do?
- Describe the current state of an organization or process
- Detect anomalous events
- Diagnose the causes of events and behaviors
What are the three steps of the data science workflow?
- Data collection
- Exploration and visualization
- Experimentation and prediction
What do we need for machine learning?
- A well-defined question
- A set of example data
- A new set of data to use our algorithm on
What are some applications of data science?
Fraud detection, IoT, image recognition…
What are some common jobs in a data science team?
Data engineer, data analyst, machine learning scientist…
What are the responsibilities of data engineers?
- Information architects: control the flow of information
- Build the storage solutions and infrastructure
- Maintain data access: ensure the data is easy to access and process
What tools do data engineers use?
- SQL, to store and manage big data
- Java, Scala, or Python to process data and automate data-related tasks
What are the responsibilities of data analysts?
- Create dashboards
- Hypothesis testing
- Data visualization
What tools do data analysts use?
- Spreadsheets for simple storage and analysis
- SQL for large scale analysis
- BI Tools (Tableau, Power BI, Looker) for dashboarding and sharing information
What are the responsibilities of a machine learning scientist?
- Make predictions and extrapolations
- Classify data
- Predict stock prices
- Process images
- Automate text analysis
What tools do machine learning scientists use?
- Python or R for creating predictive models
What are three types of team structures for a data science team?
- Isolated
- Embedded
- Hybrid
What are the characteristics of an isolated data science team?
An isolated data science team contains one or multiple types of data employees, without engineering or product members.
What are the characteristics of an embedded data science team?
Each data employee is part of a squad containing engineers and product managers.
What are the characteristics of a hybrid data science team?
The hybrid structure is similar to the embedded structure, but includes an additional sync for all data employees across all squads, allowing uniform data processes.
What are some common sources of data?
- Web events
- Customer data
- Logistics data
- Customer transactions
What does PII mean?
Personally Identifiable Information
What information does PII include?
- Name
- Location
- Email address
- Any other piece of information that can be used to tie a web event back to a real human
What is data pseudonymization?
Assigning each user a user ID and storing the mapping between user IDs and user names in a separate table with restricted access and regularly audited logs. Events are then identified by the user ID rather than the user name.
What is data anonymization?
Assigning the user a user ID, then destroying the table with the actual user names.
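The difference between the two cards above can be sketched in a few lines of Python. This is a minimal illustration, not a compliance-grade implementation: the event fields (`user_name`, `action`), the sample names, and the `pseudonymize` helper are all hypothetical, and a real system would keep the lookup table in a separate, access-controlled store.

```python
import itertools

def pseudonymize(events):
    """Replace user names with user IDs.

    Returns the rewritten events and the name -> ID lookup table.
    (Hypothetical helper; field names are assumptions for illustration.)
    """
    lookup = {}                    # would live in a restricted-access table
    next_id = itertools.count(1)
    rewritten = []
    for event in events:
        name = event["user_name"]
        if name not in lookup:
            lookup[name] = next(next_id)
        rewritten.append({"user_id": lookup[name], "action": event["action"]})
    return rewritten, lookup

events = [
    {"user_name": "Ada", "action": "login"},
    {"user_name": "Bob", "action": "purchase"},
    {"user_name": "Ada", "action": "logout"},
]

# Pseudonymization: events carry only IDs; the lookup table is kept separately.
pseudonymized, lookup = pseudonymize(events)

# Anonymization: same events, but the lookup table is destroyed.
anonymized = pseudonymized
lookup_after_anonymization = None
```

In both cases events from the same user still share an ID, so behavior can be analyzed without exposing names.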
What does GDPR mean?
General Data Protection Regulation
What does the GDPR do?
- Applies to the personal data of individuals in the EU
- Gives individuals control over their personal data
- Regulates how long data can be stored
- Mandates appropriate anonymization
- Requires disclosing data collection and gaining consent
What is solicited data?
Solicited data is data gathered when asking customers about their opinion.
What is solicited data useful for?
- Create marketing collateral
- Reduce decision-making risk
- Monitor quality
What are some common types of solicited data?
- Surveys
- Customer reviews
- In-app questionnaires
- Focus groups
What does NPS stand for?
Net Promoter Score
What does the NPS measure?
The Net Promoter Score measures how likely users are to recommend a product.
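The card above does not give the formula, so the following is a sketch based on the commonly used NPS definition (an assumption not stated in these notes): respondents rate likelihood to recommend on a 0-10 scale, ratings of 9-10 count as promoters, 0-6 as detractors, and the score is the percentage of promoters minus the percentage of detractors. The function name and sample ratings are hypothetical.

```python
def net_promoter_score(ratings):
    """Return % promoters (9-10) minus % detractors (0-6) for 0-10 ratings."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

# Made-up survey answers: 5 promoters, 2 passives, 3 detractors.
sample_ratings = [10, 9, 8, 7, 6, 3, 10, 9, 5, 10]
score = net_promoter_score(sample_ratings)
```

The score ranges from -100 (all detractors) to 100 (all promoters).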
What are the types of solicited data?
- Qualitative (very subjective, requires a lot of analysis)
  - Conversations
  - Open-ended questions
  - Good for generating hypotheses
- Quantitative (can be easily summarized in a graph)
  - Multiple choice
  - Rating scale
  - Good for validating hypotheses
What are the two types of preferences?
Stated and revealed
What is a stated preference?
A stated preference is what a user says they want or believe.
What is a revealed preference?
A revealed preference is a preference made visible by a user’s actions or purchasing decisions.
What is an example where stated and revealed preference differ?
People will state they prefer to go to the gym and exercise, but their behavior reveals that they prefer to go to the beach and relax.
What are some best practices when soliciting data?
- Be specific
- Avoid loaded language
- Calibrate (compare to known quantities)
- Require actionable results (have a hypothesis for each question)
What are some common ways to collect external data?
- APIs
- Public records
- Mechanical Turk
What does API stand for?
Application Programming Interface
What are some notable APIs?
- Wikipedia
- Yahoo Finance
- Google Maps
What are some notable public records?
- data.gov
- data.europa.eu
What does Mechanical Turk consist of?
Mechanical Turk consists of getting humans to complete a task we eventually plan to computerize, such as labeling pictures for image recognition. Each person labels a few images, and each image is labeled by several people to check label quality. Amazon Mechanical Turk (MTurk) can be used to recruit such workers.
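One simple way to combine several people's labels for the same image is a majority vote. The sketch below assumes that approach; the notes do not say how labels are reconciled, and the image IDs, labels, and `majority_label` helper are made up for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label and the share of votes it received."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical results: each image was labeled by three different workers.
worker_labels = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}

consensus = {
    image: majority_label(labels) for image, labels in worker_labels.items()
}
```

A low vote share (for example 2 out of 3) can flag images that may need another round of labeling.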
What can Mechanical Turk be used for?
- Label customer reviews
- Extract text from a form
- Highlight key words in a sentence
What are some types of data storage?
- Unstructured (document database)
- Tabular (relational database)
What are some examples of unstructured data?
- Text
- Video and audio files
- Web pages
- Social media
What query language do document databases use?
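NoSQL (an umbrella term for non-relational databases; each document database has its own query language or API)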
What query language do relational databases use?
SQL
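To make the contrast concrete, here is a small sketch using Python's built-in `sqlite3` module for the relational side and plain dictionaries to stand in for documents. The table, column, and field names are invented for illustration, and a real document database would expose its own query interface rather than a list comprehension.

```python
import sqlite3

# Tabular (relational): fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ada", 20.0), ("Bob", 5.0), ("Ada", 12.5)],
)
totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

# Document style: each record is free-form and may have different fields.
documents = [
    {"customer": "Ada", "review": "Great product", "tags": ["fast", "cheap"]},
    {"customer": "Bob", "review": "Arrived late"},
]
customers_with_tags = [d["customer"] for d in documents if "tags" in d]
```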
What is a dashboard?
A dashboard is a set of metrics, usually in the form of graphs, that update on a schedule.
What are some common dashboard elements?
- Tracking a value over time
- Tracking composition over time
- Categorical comparison
- Highlighting a single number
- Displaying text
Where can you build dashboards?
- Spreadsheets: Excel or Google Sheets
- BI Tools: Power BI, Looker
- Customized tools: R Shiny or d3.js
What is an ad hoc analysis request?
- Not repeated on a weekly or daily basis
- Can come from many places
  - Product
  - Finance
  - Engineering