Google Data Analysis Flashcards
A/B testing
The process of testing two variations of the same web page to determine which page is more successful at attracting user traffic and generating revenue
Compatibility
How well two or more datasets are able to work together
Data analysis process
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
Data analysis
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making
Data life cycle
The sequence of stages that data experiences, which include plan, capture, manage, analyze, archive, and destroy
First-party data
Data collected by an individual or group using their own resources
Gap analysis
A method for examining and evaluating the current state of a process in order to identify opportunities for improvement in the future
Problem types
The various problems that data analysts encounter, including categorizing things, discovering connections, finding patterns, identifying themes, making predictions, and spotting something unusual
Statistical power
The probability that a test of significance will recognize an effect that is present
Statistical significance
The probability that sample results are not due to random chance
Wide data
A dataset in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject
Administrative metadata
Metadata that indicates the technical source of a digital asset
Descriptive metadata
Metadata that describes a piece of data and can be used to identify it at a later point in time
Structural metadata
Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection
3 types of metadata
- descriptive
- structural
- administrative
Foreign key
A field within a database table that is a primary key in another table (Refer to primary key)
Primary key
An identifier in a database that references a column in which each value is unique (Refer to foreign key)
Metadata
Data about data
Elements of metadata
- title and description
- tags and categories
- who created it and when
- who last modified it and when
- who can access or update it
Metadata repository
A database created to store metadata
Metadata repositories
- describe the state and location of the metadata
- describe the structures of the tables inside
- describe how the data flows through the repository
- keep track of who accesses the metadata and when
Hypothesis testing
A process to determine if a survey or experiment has meaningful results
Confidence level
The probability that a sample size accurately reflects the greater population
Margin of error
The maximum amount that sample results are expected to differ from those of the actual population
to calculate margin of error you need
- population size
- sample size
- confidence level
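Given those three inputs, the calculation can be sketched in Python. This is a minimal sketch assuming the common normal-approximation formula with a finite population correction; the function name and the hard-coded z-scores are illustrative, not from the course.

```python
import math

def margin_of_error(population_size, sample_size, confidence_level=0.95):
    """Estimate a survey's margin of error.

    Uses the worst-case proportion of 0.5, a normal-approximation
    z-score for the chosen confidence level, and a finite population
    correction for the population size.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence_level]
    proportion = 0.5  # worst case, gives the widest margin
    standard_error = math.sqrt(proportion * (1 - proportion) / sample_size)
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc

# A sample of 384 from a population of 100,000 at 95% confidence
# gives a margin of error of roughly +/- 5%.
moe = margin_of_error(100_000, 384)
```

Note how a larger sample shrinks the margin of error while a higher confidence level widens it.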
Dirty data
Data that is incomplete, incorrect, or irrelevant to the problem to be solved
Clean data
Data that is complete, correct, and relevant to the problem being solved
Data engineer
A professional who transforms data into a useful format for analysis and gives it a reliable infrastructure
Data warehousing specialist
A professional who develops processes and procedures to effectively store and organize data
Confidence interval
A range of values that conveys how likely a statistical estimate reflects the population
Why a minimum sample of 30?
This recommendation is based on the central limit theorem (CLT) from probability and statistics. A sample of 30 is generally considered the smallest size for which the CLT still holds.
Central limit theorem
The central limit theorem (CLT) is an important theorem of probability theory. It states that the sum of independent and identically distributed random variables approaches a particular distribution when the number of summands is large enough.
More precisely, the central limit theorem states that the distribution of the sum of independent, identically distributed random variables approaches a normal distribution as the number of summands grows large enough. This means that many random variables occurring in practice can be approximated by a normal distribution.
This theorem is of great importance in statistics because it makes it possible to run many statistical tests and make estimates even when the underlying distribution is unknown. The central limit theorem is a key building block of many statistical methods and has applications in many fields, such as financial mathematics, physics, and engineering.
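The theorem can be seen in a quick simulation, sketched here with only the Python standard library: sample means of a skewed distribution still cluster in a bell shape around the true mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Draw 2,000 samples of size 30 from a skewed (exponential) distribution.
# By the central limit theorem, the sample means are approximately
# normally distributed around the true mean (1.0 for expovariate(1.0)).
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(30))
    for _ in range(2000)
]

center = statistics.mean(sample_means)   # close to the true mean of 1.0
spread = statistics.stdev(sample_means)  # close to 1/sqrt(30), about 0.18
```

Even though each individual draw is far from bell-shaped, the means of size-30 samples are already close to normal, which is why 30 is the usual rule-of-thumb minimum.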
Random sampling
A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
Sampling bias
Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
Sample
In data analytics, a segment of a population that is representative of the entire population
types of insufficient data
- data from only one source
- data that keeps updating
- outdated data
- geographically limited data
SMART methodology
A tool for determining a question’s effectiveness based on whether it is specific, measurable, action-oriented, relevant, and time-bound
critical questions about predictive analytical models
- why is it taking so long to put new or updated models into production?
- who created the model and why?
- what input variables are used to make predictions?
- how are models used?
- how are models performing and when were they last updated?
- where is the supporting documentation?
no answers - no real value
making predictions
using data to make an informed decision about how things may be in the future
6 problem types that data analysts typically face
- making predictions
- categorizing things
- spotting something unusual
- identifying themes
- discovering connections
- finding patterns
data analysis process (google)
The six phases of ask, prepare, process, analyze, share, and act whose purpose is to gain insights that drive informed decision-making
Data-driven decision-making
Using facts to guide business strategy
Algorithm
A process or set of rules followed for a specific task
how data is collected
- interviews
- observations
- forms
- questionnaires
- surveys
- cookies
Metric
A single, quantifiable type of data that is used for measurement
Problem domain
The area of analysis that encompasses every activity affecting or affected by a problem
Structured thinking
The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options
Scope of work (SOW)
An agreed-upon outline of the tasks to be performed during a project
Report
A static collection of data periodically given to stakeholders
Quantitative data
A specific and objective measure, such as a number, quantity, or range
Quantitative data tools
- structured interviews
- surveys
- polls
Qualitative data
A subjective and explanatory measure of a quality or characteristic
Qualitative data tools
- focus groups
- social media
- in-person interviews
Best practices when organizing data
- naming conventions
- foldering
- archiving older files
data life cycle 5) archive
keep relevant data stored long-term for future reference
Bias
A conscious or subconscious preference in favor of or against a person, group of people, or thing
Confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Interpretation bias
The tendency to interpret ambiguous situations in a positive or negative way
Data integrity
The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Data replication
The process of storing data in multiple locations
Data transfer
The process of copying data from a storage device to computer memory or from one computer to another
Data manipulation
The process of changing data to make it more organized and easier to read
Data bias
When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
types of data bias
- observer bias
- interpretation bias
- confirmation bias
- sampling bias
Continuous data
Data that is measured and can have almost any numeric value
Discrete data
Data that is counted and has a limited number of values
Ordinal data
Qualitative data with a set order or scale
External data
Data that lives, and is generated, outside of an organization
Nominal data
A type of qualitative data that is categorized without a set order
Internal data
Data that lives within a company’s own systems
Second-party data
Data collected by a group directly from its audience and then sold
Population
In data analytics, all possible data values in a dataset
Third-party data
Data provided from outside sources who didn’t collect it directly
Structured data
Data organized in a certain format such as rows and columns
Unstructured data
Data that is not organized in any easily identifiable manner
Long data
A dataset in which each row is one time point per subject, so each subject has data in multiple rows
Dataset
A collection of data that can be manipulated or analyzed as one unit
Attribute
A characteristic or quality of data used to label a column in a table
Fairness
A quality of data analysis that does not create or reinforce bias
Query
A request for data or information from a database
Data governance
A process for ensuring the formal management of a company’s data assets
Naming conventions
Consistent guidelines that describe the content, creation date, and version of a file in its name
Data-inspired decision-making
Exploring different data sources to find out what they have in common
Data science
A field of study that uses raw data to create new ways of modeling and understanding the unknown
Formula
A set of instructions used to perform a calculation using the data in a spreadsheet
Observation
The attributes that describe a piece of data contained in a row of a table
Data ecosystem
The various elements that interact with one another in order to produce, manage, store, organize, analyze, and share data
Data
A collection of facts
Data validation
A tool for checking the accuracy and quality of data
Analytical skills
Qualities and characteristics associated with using facts to solve problems
Observer bias
The tendency for different people to observe things differently (also called experimenter bias)
Unbiased sampling
When the sample of the population being measured is representative of the population as a whole
Data interoperability
The ability to integrate data from multiple sources and a key factor leading to the successful use of open data among companies and governments
Data anonymization
The process of protecting people’s private or sensitive data by eliminating identifying information
Openness
The aspect of data ethics that promotes the free access, usage, and sharing of data
Currency
The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
Consent
The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
Transaction transparency
The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data
Ownership
The aspect of data ethics that presumes individuals own the raw data they provide and have primary control over its usage, processing, and sharing
Data ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data model
A tool for organizing data elements and how they relate to one another
Data element
A piece of information in a dataset
Open data
Data that is available to the public
threats to data integrity
- human error
- viruses
- malware
- hacking
- system failures
Data formats
- internal - external
- continuous - discrete
- structured - unstructured
- nominal - ordinal
- qualitative - quantitative
- primary - secondary
Primary data
Data collected by a researcher from first-hand sources
Data type
An attribute that describes a piece of data based on its values, its programming language, or the operations it can perform
2 common methods to develop data models
- entity relationship diagram (ERD)
- unified modeling language (UML)
5) Share DA-Process
- understand visualization
- create effective visuals
- bring data to life
- use data storytelling
- communicate to help others understand results
Sorting
The process of arranging data into a meaningful order to make it easier to understand, analyze, and visualize
decision intelligence
A combination of applied data science and the social and managerial sciences
data life cycle 6) destroy
remove data from storage and delete any shared copies of the data
3) Process DA-Process
- create and transform data
- maintain data integrity
- test data
- clean data
- verify and report on cleaning results
4) Analyze DA-Process
- use tools to format and transform data
- sort and filter data
- identify patterns and draw conclusions
- make predictions and recommendations
- make data-driven decisions
2) Prepare DA-Process
- understand how data is generated and collected
- identify and use different data formats, types, and structures
- make sure data is unbiased and credible
- organize and protect data
Step 1 - Ask
- define the problem you are trying to solve
- make sure you fully understand the stakeholders' expectations
- focus on the actual problem and avoid any distractions
- collaborate with stakeholders and keep an open line of communication
- take a step back and see the whole situation in context
1) Ask DA-Process
- ask effective questions
- define the problem
- use structured thinking
- communicate with others
dashboards pros & cons
pros:
- dynamic, automatic, and interactive
- more stakeholder access
- low maintenance
cons:
- labor-intensive design
- can be confusing
- potentially uncleaned data
reports pros & cons
pros:
- high-level historical data
- easy to design
- pre-cleaned and sorted data
cons:
- continual maintenance
- less visually appealing
- static
Step 4 - Analyze
think analytically about my data
- perform calculations
- combine data from multiple sources
- create tables with your results
Q:
1) what story is my data telling me?
2) how will my data help me solve this problem?
Step 3 - Process
clean data of any possible errors, inaccuracies, or inconsistencies
- using spreadsheet functions to find incorrectly entered data
- using SQL functions to check for extra spaces
- removing repeated entries
- checking for bias in data
Q:
1. what data errors or inaccuracies might get in the way of finding the best possible answer to the problem I'm trying to solve?
2. how can I clean my data so the information I have is more consistent?
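Several of these cleaning steps can be sketched in plain Python. The `clean_entries` helper below is a hypothetical illustration, not a tool from the course: it trims extra spaces, normalizes case, and drops repeated entries.

```python
def clean_entries(raw_entries):
    """Trim extra spaces, normalize case, and drop repeated or empty
    entries, preserving the original order of first appearance."""
    seen = set()
    cleaned = []
    for entry in raw_entries:
        # Collapse internal runs of whitespace and lowercase the value,
        # so "  New York" and "new york " count as the same entry.
        value = " ".join(entry.split()).lower()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned

clean_entries(["  New York", "new york ", "Boston", ""])
# -> ["new york", "boston"]
```

In practice the same idea is applied with spreadsheet functions (TRIM, remove duplicates) or SQL (TRIM, DISTINCT).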
Data analytics
The science of data
Access control
Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
Accuracy
The degree to which data conforms to the actual entity being measured or described
Action-oriented question
A question whose answers lead to change
Analytical thinking
The process of identifying and defining a problem, then solving it by using data in an organized, step-by-step manner
Bad data source
A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Big data
Large, complex datasets typically involving long periods of time, which enable data analysts to address far-reaching business problems
Boolean data
A data type with only two possible values, usually true or false
Changelog
A file containing a chronologically ordered list of modifications made to a project
Completeness
The degree to which data contains all desired components or measures
Consistency
The degree to which data is repeatable from different points of entry or collection
Context
The condition in which something exists or happens
Cross-field validation
A process that ensures certain conditions for multiple data fields are satisfied
Dashboard
A tool that monitors live, incoming data
Data analyst
Someone who collects, transforms, and organizes data in order to draw conclusions, make predictions, and drive informed decision-making
Data constraints
The criteria that determine whether a piece of a data is clean and valid
Data design
How information is organized
Data mapping
The process of matching fields from one data source to another
Data merging
The process of combining two or more datasets into a single dataset
Data range
Numerical values that fall between predefined maximum and minimum values
Data security
Protecting data from unauthorized access or corruption by adopting safety measures
Data strategy
The management of the people, processes, and tools used in data analysis
Data visualization
The graphical representation of data
Estimated response rate
The average number of people who typically complete a survey
Experimenter bias
The tendency for different people to observe things differently (Refer to Observer bias)
General Data Protection Regulation of the European Union (GDPR)
A regulation enacted in the European Union to help protect people and their data
Good data source
A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
Incomplete data
Data that is missing important fields
Inconsistent data
Data that uses different formats to represent the same thing
Incorrect/inaccurate data
Data that is complete but inaccurate
Mandatory
A data value that cannot be left blank or empty
Normalized database
A database in which only related data is stored in each table
Outdated data
Any data that has been superseded by newer and more accurate information
Redundancy
When the same piece of data is stored in two or more places
Reframing
The process of restating a problem or challenge, then redirecting it toward a potential resolution
Regular expression (RegEx)
A rule that says the values in a table must match a prescribed pattern
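For example, a hypothetical constraint that dates in a table must match the YYYY-MM-DD pattern can be checked with Python's `re` module (the pattern and function name are illustrative):

```python
import re

# Hypothetical constraint: date values must match the YYYY-MM-DD pattern.
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid_date_format(value):
    """Return True if the value matches the prescribed pattern."""
    return bool(DATE_PATTERN.match(value))
```

A value like "2023-01-15" passes, while "15/01/2023" fails and would be flagged during data cleaning.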
Small data
Small, specific data points typically involving a short period of time, which are useful for making day-to-day decisions
Stakeholders
People who invest time and resources into a project and are interested in its outcome
Technical mindset
The ability to break things down into smaller steps or pieces and work with them in an orderly and logical way
Transferable skills
Skills and qualities that can transfer from one job or industry to another
Typecasting
Converting data from one type to another
Validity
The degree to which data conforms to constraints when it is input, collected, or created
Verification
A process to confirm that a data-cleaning effort was well executed and the resulting data is accurate and reliable
aspects of data ethics
-ownership
-transaction transparency
-consent
-currency
-privacy
-openness
PII
personally identifiable information
information that can be used by itself or with other data to track down a person’s identity
privacy
Preserving a data subject's information and activity any time a data transaction occurs
structured data
- defined data types
- most often quantitative data
- easy to organize
- easy to search
- easy to analyze
- stored in relational databases & data warehouses
- contained in rows and columns
De-identification
A process used to wipe data clean of all personally identifiable information
confidence level
The confidence level is targeted before you start your study, because it will affect how big your margin of error is at the end of your study.
It expresses how confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times.
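As a sketch of how confidence level, sample, and interval fit together, here is a normal-approximation confidence interval for a sample mean (a simplification that assumes a sample of at least about 30 values, per the central limit theorem; the function name and z-score table are illustrative):

```python
import math
import statistics

def confidence_interval(sample, confidence_level=0.95):
    """Approximate confidence interval for the population mean,
    using a normal (z) approximation to the sampling distribution."""
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[confidence_level]
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * standard_error, mean + z * standard_error
```

A higher confidence level uses a larger z-score, so the interval widens; a larger sample shrinks the standard error, so it narrows.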
data collection considerations
- how the data will be collected
- choose data sources
- decide what data to use
- how much data to collect
- select the right data type
- determine the time frame
6) Act DA-Process
- apply your insights
- solve the problem
- make decisions
- create something new
good data sources
Reliable
Original
Comprehensive
Current
Cited
spotting something unusual
identifying data that is different from the norm
4) analyse DA process
use data to solve problems, make decisions, and support business goals
data analyst skills and qualities
- curiosity
- understanding context
- having a technical mindset
- data design
- data strategy
SAS’s iterative life cycle
ask-prepare-explore-model-implement-act-evaluate