The Data Analytics Journey Flashcards
WGU Class D596
Quantitative data
Quantitative data represents numerical values that can be measured or counted. It answers questions like “How many?” or “How much?”
Discrete data
Countable values. Distinct and separate; they cannot take on values between the defined points.
Examples: number of students in a class, number of pets in a home
Continuous data
Continuous data is a type of quantitative data that can take on any value within a range.
Height, temperature, time
Categorical data
Categorical data represents categories or labels rather than numerical values.
Can be nominal or ordinal
Nominal data
Categories have no natural order.
Examples: colors, types of pets, marital status, favorite sports, car brands
Ordinal data
Categories have a meaningful order, but the intervals between them are not necessarily equal. Examples: ratings (poor, fair, good, excellent), educational levels (high school, bachelor’s, master’s).
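The nominal/ordinal distinction can be sketched in a few lines of Python. This is only an illustration: the sample ratings and pets are made up, and `RATING_ORDER` is a hypothetical name for the order an analyst would have to define.

```python
from collections import Counter

# Hypothetical order for an ordinal variable; the analyst must define it.
RATING_ORDER = ["poor", "fair", "good", "excellent"]
ratings = ["good", "poor", "excellent", "fair", "good"]

# Ordinal data: sort by position in the defined order, not alphabetically.
rank = {label: position for position, label in enumerate(RATING_ORDER)}
sorted_ratings = sorted(ratings, key=lambda label: rank[label])
print(sorted_ratings)

# Nominal data: there is no order, so counting each category is the useful summary.
pets = ["dog", "cat", "dog", "fish"]
pet_counts = Counter(pets)
print(pet_counts.most_common(1))
```

Sorting the pets would be possible alphabetically, but that order would carry no meaning, which is the point of calling them nominal.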
Part to whole
Shows how individual parts contribute to the whole. Useful for displaying proportions or percentages.
Distribution
Shows how values in a dataset are spread across a range. Use it to understand the spread, skewness, or patterns in your data.
Nominal Comparison
Compares values for categorical (nominal) variables that have no specific order. Use it to compare quantities between categories.
Time Series
Data collected over time (e.g., daily, monthly, yearly) to track trends or patterns. Use it to analyze how data changes over time.
Correlation
Shows the relationship between two variables, indicating whether they move together (positive correlation), move oppositely (negative correlation), or show no relationship.
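One common way to put a number on this relationship is the Pearson correlation coefficient, which runs from -1 to +1. The sketch below is a minimal hand-rolled version; the card does not name a specific coefficient, so Pearson is an assumption, and the study-hours data is invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical data: hours studied, test scores, and mistakes made.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
mistakes = [9, 7, 6, 4, 2]

print(pearson_r(hours, scores))    # close to +1: the variables move together
print(pearson_r(hours, mistakes))  # close to -1: the variables move oppositely
```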
Ranking
Compares items in a dataset by sorting them in ascending or descending order. Use it to highlight the relative positions or hierarchy of categories.
Deviation
Shows how data deviates from a baseline, expected value, or the mean. Use it to highlight differences or anomalies in the data.
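As a rough illustration, deviation can be computed by subtracting a baseline from every value. The sketch below assumes the baseline is the mean and uses made-up daily order counts.

```python
from statistics import mean

# Hypothetical daily order counts.
daily_orders = [100, 102, 98, 130, 100]

# Use the mean as the baseline (an assumption; a target or forecast also works).
baseline = mean(daily_orders)
deviations = [value - baseline for value in daily_orders]

# The value with the largest absolute deviation is a candidate anomaly.
largest = max(range(len(daily_orders)), key=lambda i: abs(deviations[i]))
print(deviations, daily_orders[largest])
```

These signed deviations are what a diverging bar chart would plot on either side of the baseline.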
What charts are good for deviation?
Diverging bar chart
Line chart (with baseline or reference line)
Error bars.
What charts are good for ranking?
Bar chart (sorted by value)
Column chart
Dot plot.
What charts are good for correlation?
Scatter plot
Bubble chart
Heatmap.
What charts are good for time series?
Line chart
Area chart.
What charts are good for nominal comparison?
Bar chart
Column chart
What charts are good for distribution?
Histogram
Box plot
Violin plot.
What charts are good for part-to-whole?
Pie chart
Donut chart
Stacked bar chart (with percentages).
What are visual elements to use when designing charts?
Similarity & Contrast
Dominance & Emphasis
Scale & Proportion
Hierarchy
Balance & Symmetry
Regression
Regression is a technique that allows an analyst to predict an outcome (either numerical or categorical) based on a set of predictor variables.
Regression analysis
A statistical method that identifies the relationship between variables.
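The simplest case is a straight-line fit with one predictor. The sketch below uses ordinary least squares, which is one common choice rather than the only regression method; `fit_line` is a hypothetical helper name and the ad-spend data is invented so that it lies exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend (predictor) and sales (outcome).
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5 + intercept  # predict the outcome for a new input
print(slope, intercept, predicted_sales)
```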
Classification
A type of supervised machine learning task where the goal is to predict a categorical label for a given input based on a set of features. It involves assigning items to predefined classes or categories based on their characteristics.
Clustering
An unsupervised learning algorithm used to group data points into clusters based on their similarity without prior knowledge of labels.
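To make the idea concrete, here is a minimal one-dimensional k-means sketch. It assumes fixed starting centroids so the result is repeatable; real implementations usually pick starting points randomly and work in many dimensions. The function name and the age data are illustrative only.

```python
def k_means_1d(points, centroids, iterations=10):
    """Group numbers around centroids; no labels are used at any step."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each point joins its nearest centroid.
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: abs(point - centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customer ages with two visible groups.
ages = [18, 19, 21, 45, 47, 50]
centroids, clusters = k_means_1d(ages, centroids=[18, 50])
print(clusters)
```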
T or F: Decision Trees are an example of Clustering
False. A decision tree is a supervised learning algorithm used for classification and regression tasks. It creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data.
What is the classification process?
The process typically involves: Data preparation, feature selection/extraction, model training, prediction, evaluation
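The steps above can be sketched end to end with a tiny k-Nearest Neighbors classifier. This is a toy under stated assumptions: the student data is invented, the two features are already chosen, and "training" for k-NN amounts to storing the labeled rows.

```python
from collections import Counter
from math import dist

# Data preparation: hypothetical rows of (hours_studied, hours_slept) -> label.
training = [
    ((1.0, 4.0), "fail"), ((2.0, 5.0), "fail"), ((1.5, 6.0), "fail"),
    ((6.0, 7.0), "pass"), ((7.0, 8.0), "pass"), ((8.0, 6.5), "pass"),
]
# Rows held back for evaluation; the model never sees these labels.
held_out = [((1.2, 5.0), "fail"), ((7.5, 7.0), "pass")]

def knn_predict(features, labeled_rows, k=3):
    """Prediction: majority label among the k closest training rows."""
    nearest = sorted(labeled_rows, key=lambda row: dist(features, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Prediction and evaluation: compare predictions with the held-out labels.
predictions = [knn_predict(features, training) for features, _ in held_out]
correct = sum(p == actual for p, (_, actual) in zip(predictions, held_out))
accuracy = correct / len(held_out)
print(predictions, accuracy)
```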
Market Basket Analysis
A data mining technique used to understand purchasing behavior by identifying relationships between items in a transaction
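Two measures commonly used in this kind of analysis are support (how often an itemset appears) and confidence (how often a rule holds when its left side appears). The card does not mention them by name, so treat the sketch below as one possible illustration with invented transactions.

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    """Share of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))
print(confidence({"bread"}, {"butter"}))
```

Here two of the four baskets contain both bread and butter, and two of the three bread baskets also contain butter.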
Process Mining
A technique that analyzes event logs from business processes to identify inefficiencies, bottlenecks, or opportunities for optimization.
T-Test
A statistical test used to determine if there is a significant difference between the means of two groups.
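The test statistic itself is short to compute. The sketch below uses Welch's unequal-variance form, which is an assumption since the card does not say which variant is meant, and it stops at the t statistic: deciding significance still requires comparing it against a t distribution, which is omitted here. The group values are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error

# Hypothetical task completion times for two groups.
group_a = [5.0, 6.0, 7.0, 8.0, 9.0]
group_b = [1.0, 2.0, 3.0, 4.0, 5.0]

t_statistic = welch_t(group_a, group_b)
print(t_statistic)
```

A t statistic far from zero suggests the group means differ by more than the within-group spread would explain.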
Text Mining
The process of extracting meaningful information from unstructured text data.
Neural Networks
A machine learning model inspired by the human brain, consisting of layers of interconnected “neurons” that learn patterns in data.
Principal Component Analysis (PCA)
A dimensionality reduction technique that simplifies data by converting it into principal components (uncorrelated variables).
Supervised Learning
A machine learning approach where the model is trained on labeled data to predict outcomes for new data.
Regression ML
A supervised learning technique used to predict a continuous output (numerical value) based on input features.
Unsupervised learning
A machine learning approach where the model works with unlabeled data to discover patterns or structures.
Time Series Model
A model designed to analyze and predict values that change over time.
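A simple moving average gives a feel for the idea. It is a smoothing technique, not a full forecasting model, and is used here only as a stand-in; the monthly sales figures are invented.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical monthly sales, in time order.
monthly_sales = [10, 12, 14, 13, 15, 17]

smoothed = moving_average(monthly_sales, window=3)
naive_forecast = smoothed[-1]  # assume next month resembles the recent average
print(smoothed, naive_forecast)
```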
What are K-Means, DBSCAN, and hierarchical clustering examples of?
Clustering, a type of unsupervised machine learning used to group data points into clusters.
What model uses forecasting and detecting seasonal patterns?
Time Series Model
What are decision trees, support vector machines, and k-Nearest Neighbors examples of?
Classification
What uses image and/or speech recognition, or predictive analytics?
Neural networks
Google Sheets, MySQL, and sales data are examples of?
Structured data
What is semi-structured data?
Data that does not follow a rigid structure but still has some level of organization, typically using tags or markers to separate elements
What are examples of semi-structured data?
JSON files, XML files, emails
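A short sketch shows why JSON counts as semi-structured: keys mark each element, but records do not have to share the same fields. The records below are invented, and flattening them into uniform rows is just one possible way to prepare them for a table.

```python
import json

# Hypothetical JSON: the second record lacks "email" and adds "tags".
raw = '[{"name": "Ana", "email": "ana@example.com"}, {"name": "Ben", "tags": ["vip"]}]'
records = json.loads(raw)

# Collect every field seen in any record, then fill the gaps with None.
fields = sorted({key for record in records for key in record})
rows = [{field: record.get(field) for field in fields} for record in records]
print(fields)
print(rows)
```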
What is unstructured data?
Data that does not follow any predefined format or structure, making it difficult to store and analyze in traditional databases.
T or F: You use SQL on semi-structured and unstructured data.
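False. These data types are usually handled with NoSQL or big-data tools such as MongoDB or Apache Hive (Hive does offer a SQL-like query language, HiveQL).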
What is AutoML?
Automated Machine Learning (AutoML) is a framework or set of tools that automates the process of developing, training, tuning, and deploying machine learning (ML) models.
What are keys to managing stakeholders in a project?
Obtain a project sponsor
Identify project stakeholders, group them by power, influence, and need
Survey other stakeholders and create an engagement map
Pinpoint stakeholder frustrations and visions of success when talking and interviewing stakeholders
What are keys to communicating the data effectively?
Continue learning about the business.
Tie the data to the business question asked
Avoid excessive granularity
Make data easy to consume
Ask for feedback
Don’t discuss technical details unless they matter to the business question
What is the key to persuasion?
Communication, emotional intelligence, active listening, logic and reasoning, interpersonal skills, and negotiation
What questions should we ask ourselves when posing a question?
Does the receiver understand what is asked, have you phrased the question based on what the receiver knows or may not know, is the question logical, and is your tone neutral?
How do you summarize what you hear?
Restate it in your own words, capturing the intent the speaker is trying to express and reflecting their words and actions in a way that shows you understood the feeling accurately.
T or F: Discrete data can be decimals
False. Discrete data are counts, so they take whole-number values.
T or F: Continuous data can be fractions and decimals
True.
What is a data analytics plan?
A data analytics project plan outlines the steps and processes involved in conducting a data analytics project from start to finish.
Define an EDARP
An Exploratory Data Analysis Research Plan. It is meant to convince the organization of the potential value of your work by stating the objectives and detailing the path to reaching them.
What makes a good data analytics plan?
Scoping meetings, aligning the list of requirements, building a mockup, avoiding committing to deadlines until processing data, creating a UAT document, avoiding feature creep, hosting regular meetings with end-users/stakeholders, releasing a minimum viable product, conducting demo/training, scheduling regroups & adoption, obtaining feedback, building a contingency plan
What are examples of classification in ML?
Spam detection - classify email as spam or not spam
Sentiment analysis - determine the sentiment of text as positive, negative, or neutral
Image recognition - identify images as cats vs. dogs