Google Data Analysis Flashcards
A/B testing
The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
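One common way to judge which variation is more successful is to compare conversion rates with a two-proportion z-test. The following is a minimal sketch under that assumption; the visitor counts and the function name two_proportion_z_test are hypothetical, and the flashcard itself does not prescribe this particular test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both pages convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: page A converts 200 of 5000 visitors, page B 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two pages is unlikely to be due to random chance alone.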
Compatibility
How well two or more datasets are able to work together
Data analysis process
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
Data analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
Data life cycle
The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy
First-party data
Data collected by an individual or group using their own resources
Gap analysis
A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future
Problem types
The various problems that data analysts encounter, including categorizing things, discovering connections, finding patterns, identifying themes, making predictions, and spotting something unusual
Statistical power
The probability that a test of significance will recognize an effect that is present
Statistical significance
The probability that sample results are not due to random chance
Wide data
A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject
Administrative metadata
Metadata that indicates the technical source of a digital asset
Descriptive metadata
Metadata that describes a piece of data and can be used to identify it at a later point in time
Structural metadata
Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection
3 types of metadata
- descriptive
- structural
- administrative
Foreign key
A field within a database table that is a primary key in another table (Refer to primary key)
Primary key
An identifier in a database that references a column in which each value is unique (Refer to foreign key)
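The relationship between the two kinds of key can be sketched with Python's built-in sqlite3 module. The table and column names (customers, orders, customer_id) are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(customer_id),"
    " total REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# An order pointing at a customer that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# A second customer with an already used primary key value is also rejected.
try:
    conn.execute("INSERT INTO customers VALUES (1, 'Grace')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(rejected, duplicate_rejected)
```

Here orders.customer_id is the foreign key, and it refers to customers.customer_id, which is the primary key of the other table.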
Metadata
Data about data
Elements of metadata
- title and description
- tags and categories
- who created it and when
- who last modified it and when
- who can access or update it
Metadata repository
A database created to store metadata
Metadata repositories
- describe the state and location of the metadata
- describe the structures of the tables inside
- describe how the data flows through the repository
- keep track of who accesses the metadata and when
Hypothesis testing
A process to determine if a survey or experiment has meaningful results
Confidence level
The probability that a sample size accurately reflects the greater population
Margin of error
The maximum amount that sample results are expected to differ from those of the actual population
to calculate margin of error you need
- population size
- sample size
- confidence level
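The three inputs listed above can be combined as in the following sketch. It assumes the quantity being estimated is a proportion, uses the normal approximation, and applies a finite population correction. The function name margin_of_error and the worst-case proportion of 0.5 are illustrative assumptions, not something the source specifies.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population_size, sample_size, confidence_level, proportion=0.5):
    """Approximate margin of error for a sample proportion."""
    # z-score for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    standard_error = sqrt(proportion * (1 - proportion) / sample_size)
    # Finite population correction: shrinks the error when the sample
    # is a noticeable fraction of the population.
    fpc = sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


# Hypothetical survey: 500 responses from a population of 10,000 at 95% confidence.
moe = margin_of_error(population_size=10_000, sample_size=500, confidence_level=0.95)
print(f"margin of error: {moe:.2%}")
```

Increasing the sample size or lowering the confidence level reduces the margin of error; the population size matters mainly when the sample is a large share of it.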
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Clean data
Data that is complete, correct, and relevant to the problem being solved
Data engineer
A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
Data warehousing specialist
A professional who develops processes and procedures to effectively store and organize data
Confidence interval
A range of values that conveys how likely a statistical estimate reflects the population
Why a minimum sample of 30?
This recommendation is based on the central limit theorem (CLT) from probability and statistics. A sample of 30 is a common rule of thumb for the smallest sample size at which the CLT approximation is considered to hold reasonably well.
Central limit theorem
The central limit theorem is an important theorem of probability theory. It states that the sum of independent and identically distributed random variables approaches a particular distribution when the number of summands is large enough.
More precisely, the central limit theorem states that the distribution of the sum of independent and identically distributed random variables with finite variance approaches a normal distribution when the number of summands is large enough. This means that many random variables that occur in practice can be approximated by a normal distribution.
This theorem is of great importance in statistics because it makes it possible to carry out many statistical tests and make estimates even when the underlying distribution is unknown. The central limit theorem is an important component of many statistical methods and has applications in many fields, such as financial mathematics, physics, and engineering.
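The effect described by the central limit theorem can be seen in a small simulation. This sketch draws repeated samples from a strongly skewed exponential distribution and looks at how the sample means behave for sizes 2 and 30. The choice of distribution, the number of repetitions, the seed, and the helper names are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev, stdev

random.seed(0)  # fixed seed only so the sketch is reproducible


def sample_means(sample_size, n_samples=5000):
    """Means of repeated samples from an exponential distribution with mean 1."""
    return [
        mean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Sample skewness; values near 0 indicate a symmetric, bell-like shape."""
    m = mean(values)
    s = pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)


means_by_n = {n: sample_means(n) for n in (2, 30)}
skew_by_n = {n: skewness(values) for n, values in means_by_n.items()}

for n, values in means_by_n.items():
    print(n, round(mean(values), 3), round(stdev(values), 3), round(skew_by_n[n], 3))
```

With the larger sample size, the means cluster more tightly around the population mean of 1 and their distribution is far less skewed, which is what the normal approximation relies on.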
Random sampling
A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
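A simple random sample without replacement can be drawn with Python's standard random module, as in this sketch. The population of 100 member IDs, the sample size of 10, and the seed are hypothetical values used only for illustration.

```python
import random

# Hypothetical population of 100 member IDs.
population = list(range(1, 101))

# A seeded generator is used only to make the sketch reproducible.
rng = random.Random(42)

# Each possible group of 10 distinct members is equally likely to be drawn.
sample = rng.sample(population, k=10)
print(sample)
```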
Sampling bias
Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
Sample
In data analytics, a segment of a population that is representative of the entire population
types of insufficient data
- data from only one source
- data that keeps updating
- outdated data
- geographically limited data
SMART methodology
A tool for determining a question’s effectiveness based on whether it is specific, measurable, action-oriented, relevant, and time-bound
critical questions about predictive analytics models
- why is it taking so long to put new or updated models into production?
- who created the model and why?
- what input variables are used to make predictions and to make decisions?
- how are models used?
- how are models performing and when were they last updated?
- where is the supporting documentation?
no answers - no real value
making predictions
using data to make an informed decision about how things may be in the future
6 problem types that data analysts typically face
- making predictions
- categorizing things
- spotting something unusual
- identifying themes
- discovering connections
- finding patterns
Data-driven decision-making
Using facts to guide business strategy
Algorithm
A process or set of rules followed for a specific task
how data is collected
- interviews
- observations
- forms
- questionnaires
- surveys
- cookies
Metric
A single, quantifiable type of data that is used for measurement
Problem domain
The area of analysis that encompasses every activity affecting or affected by a problem
Structured thinking
The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options
Scope of work (SOW)
An agreed-upon outline of the tasks to be performed during a project
Report
A static collection of data periodically given to stakeholders
Quantitative data
A specific and objective measure, such as a number, quantity, or range
Quantitative data tools
- structured interviews
- surveys
- polls
Qualitative data
A subjective and explanatory measure of a quality or characteristic
Qualitative data tools
- focus groups
- social media
- in-person interviews
Best practices when organizing data
- naming conventions
- foldering
- archiving older files
data life cycle
5) archive
keep relevant data stored for long-term and future reference
Bias
A conscious or subconscious preference in favor of or against a person, group of people, or thing
Confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Interpretation bias
The tendency to interpret ambiguous situations in a positive or negative way
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Data replication
The process of storing data in multiple locations
Data transfer
The process of copying data from a storage device to computer memory or from one computer to another
Data manipulation
The process of changing data to make it more organized and easier to read
Data bias
When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
types of data bias
- observer bias
- interpretation bias
- confirmation bias
- sampling bias
Continuous data
Data that is measured and can have almost any numeric value
Discrete data
Data that is counted and has a limited number of values
Ordinal data
Qualitative data with a set order or scale
External data
Data that lives, and is generated, outside of an organization
Nominal data
A type of qualitative data that is categorized without a set order
Internal data
Data that lives within a company’s own systems
Second-party data
Data collected by a group directly from its audience and then sold