general analysis Flashcards
data analysis
CTO CPD
using tools to collect, transform, and organize information to draw useful conclusions, make predictions, drive informed decision making
analytics
the science of data, a very broad concept that encompasses everything from the job of managing and using data to the tools and methods that data workers use every day; this contains data ecosystems and data analysis
business task
the question or problem data analysis addresses for a business
data strategy
the management of people (ensuring they know how to use the right data to address the problems they're working on), processes (keeping the path to that solution clear and accessible), and tools (using the right technology for the job) in data analysis
decision intelligence
formalizes the process of selecting between options; a combination of applied data science and the social and managerial sciences
business analytics
the use of math and statistics to derive meaning from data in order to make better business decisions
types:
- descriptive analytics–the interpretation of historical data to identify trends and patterns
- predictive analytics–centers on taking that information and using it to forecast future outcomes
- diagnostic analytics–can be used to identify the root cause of a problem
- prescriptive analytics–testing and other techniques are employed to determine which outcome will yield the best result in a given scenario
metric
single quantifiable type of data that can be used for measurement; may be an aggregation of attributes in the data
data validation
a tool for checking the accuracy and quality of data before adding or importing it; a form of data cleansing or cleaning
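As a minimal sketch (in Python, with hypothetical fields "id" and "age"), a validation check before import might look like:

```python
def validate_row(row):
    """Check one record before import; return a list of problems found."""
    errors = []
    if not row.get("id"):
        errors.append("missing id")
    try:
        age = int(row.get("age", ""))
    except (TypeError, ValueError):
        errors.append("age is not an integer")
    else:
        if not 0 <= age <= 120:  # plausible-range check
            errors.append("age out of range")
    return errors
```

In a spreadsheet, the same idea is usually applied via built-in data validation rules (dropdown lists, number ranges) rather than code.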
data mapping
process of matching fields from one database to another; important to data migration and data integration
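A tiny illustrative sketch (field names are made up): mapping source fields to destination fields can be expressed as a rename table applied to each record.

```python
# hypothetical source-to-destination field mapping
FIELD_MAP = {"cust_name": "customer_name", "zip": "postal_code"}

def map_record(record, field_map):
    """Rename each field per the mapping; unmapped fields pass through unchanged."""
    return {field_map.get(key, key): value for key, value in record.items()}
```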
schema
a way of describing how something is organized (this came up in context of data mapping, and foreign and primary keys)
spotlighting
scanning through the data quickly to identify the most important insights
statement of work
a document that clearly identifies the products and services a vendor or contractor will provide to an organization; similar to a scope of work, but a statement of work is fully client-facing (vs a scope of work's being more internal and project-facing)
profit margin
a percentage indicating how many units of profit have been generated for each unit of sale: 100*unit_profit/unit_revenue
return on investment (ROI)
formula that uses the metrics of investment and profit to evaluate the success of an investment; net profit over time of an investment, divided by cost of investment (so a proportion or percentage)
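Both formulas are simple ratios; as a quick sketch in Python (with made-up numbers):

```python
def profit_margin(unit_profit, unit_revenue):
    """Profit margin as a percentage: 100 * unit_profit / unit_revenue."""
    return 100 * unit_profit / unit_revenue

def roi(net_profit, cost_of_investment):
    """Return on investment as a percentage: net profit over cost of investment."""
    return 100 * net_profit / cost_of_investment

# e.g. $2 profit on a $10 sale -> 20.0% margin
# e.g. $500 net profit on a $2,000 investment -> 25.0% ROI
```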
data source types (1st, 2nd, 3rd)
- first party data–data collected by an individual or group using their own resources
- second party data–data collected by a group directly from its audience and then sold; this is aka “someone else’s first-party data”; data collected from a trusted partner
- third-party data–data provided by an entity that did not collect the data themselves; eg data aggregators
qualitative data value types
nominal–a type of qualitative data that is categorized without a set order (so un-orderable); eg have you watched a certain movie? (yes/no/not sure)
ordinal–qualitative data with a set order or scale (eg rating a movie 1 to 5)
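A small sketch of the difference (the 1–4 scale and labels are hypothetical): ordinal categories can be sorted meaningfully by their scale, while nominal categories cannot.

```python
# ordinal: categories have a set order, so sorting by that order is meaningful
SCALE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
responses = ["good", "poor", "excellent", "fair"]
ranked = sorted(responses, key=SCALE.get)

# nominal: categories like "yes"/"no"/"not sure" have no inherent order,
# so any sort order for them would be arbitrary
```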
mental model
thought process and the way you approach a problem
changelog
a chronological list of the notable changes made to a project over its lifetime, across all of its tasks; records dates along with added, improved, and removed features; typically curated so that the changes it records are listed chronologically across all versions of the project
data aggregation
gathering data from multiple sources in order to combine it into a single, summarized collection; helps identify trends, make comparisons, and gather insights that would not otherwise be possible when looking at each piece of data on its own
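A minimal sketch in Python (regions and amounts are made up): records from several sources are combined into one summarized collection.

```python
from collections import defaultdict

# sales records gathered from several regional sources (hypothetical values)
sales = [("north", 100), ("south", 80), ("north", 50), ("south", 20)]

# aggregate into one summary: total sales per region
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount
```

The per-region totals reveal a comparison (north vs south) that no single record shows on its own.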
data: internal, external, structured, unstructured
internal data–data that lives within a company's own systems, and may well be collected by the organization itself; aka primary data; may be easier to collect and more reliable than external data
external data–data that lives and is generated outside an organization; aka secondary data; this can be valuable when the analysis depends on as many sources as possible
structured data–data that is organized in a certain format, such as rows and columns in a spreadsheet
unstructured data–data that is not structured in any identifiable manner; eg audio and video data might be considered “unstructured”
composite key
a primary key formed by using multiple columns / variables / fields in a relational database table
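A small illustration using Python's built-in sqlite3 module (table and columns are hypothetical): the two columns together form the primary key, so a repeat of the same pair is rejected while repeating either value alone is fine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollment (
        student_id INTEGER,
        course_id  INTEGER,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)  -- composite key: both columns together
    )
""")
conn.execute("INSERT INTO enrollment VALUES (1, 101, 'A')")
conn.execute("INSERT INTO enrollment VALUES (1, 102, 'B')")  # same student, new course: ok
try:
    conn.execute("INSERT INTO enrollment VALUES (1, 101, 'C')")  # same pair: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```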
normalization (data)
a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database
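A sketch of the idea in plain Python (names and values are made up): repeated customer details are split out into their own table, and orders reference customers by id, eliminating the redundancy.

```python
# denormalized rows repeat the customer's details on every order
orders = [
    {"order_id": 1, "customer": "Ada",   "city": "London", "total": 20},
    {"order_id": 2, "customer": "Ada",   "city": "London", "total": 35},
    {"order_id": 3, "customer": "Grace", "city": "Boston", "total": 15},
]

customer_ids = {}    # name -> assigned id
customer_table = []  # one row per customer (redundancy removed)
order_table = []     # orders now reference customers by id

for row in orders:
    if row["customer"] not in customer_ids:
        customer_ids[row["customer"]] = len(customer_ids) + 1
        customer_table.append({"customer_id": customer_ids[row["customer"]],
                               "name": row["customer"], "city": row["city"]})
    order_table.append({"order_id": row["order_id"],
                        "customer_id": customer_ids[row["customer"]],
                        "total": row["total"]})
```

A customer's city is now stored once, so updating it can't leave some orders with stale values (better integrity).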
data security
protecting data from unauthorized access or corruption by adopting safety measures
data analysis lifecycle
APP ASA
ask
prepare
process
analyze
share
act
data ecosystem
PMS OASis
various elements that interact with one another to produce, manage, store, organize, analyze, and share data
a given data ecosystem is typically in context of a given purpose or organization, like a particular business, like a retail store or a farm, and can include hardware and software
data driven decision making
the use of facts to guide business strategy; enables companies to use data analytics to find the best possible solution to a problem, complement observation with objective data, and get a complete picture of a problem and its causes
benefits:
gain valuable insights
verify theories or assumptions
better understand opportunities and challenges
support objectives
make a plan
analytical skills
CCuT DiS
the qualities and characteristics of solving problems using facts; identifying and defining a problem, then solving it using data in an organized, step-by-step manner
components:
- curiosity
- understanding context; eg the ability to identify context and identify out-of-context elements; includes recognizing and adding or creating structure (such as naming columns in a data table)
- having a technical mindset–involves the ability to break things down into smaller pieces and work with them in an orderly and logical way
- data design–how one organizes information
- data strategy–the management of people (ensuring they know how to use the right data to address the problems they're working on), processes (keeping the path to that solution clear and accessible), and tools (using the right technology for the job) in data analysis (ppt)
analytical thinking
ViS PCP
identifying and defining a problem, then solving it using data in an organized, step-by-step manner
aspects:
- visualization–the graphical representation of information; eg graphs, maps, and other design elements
- strategy–having a strategic mindset helps the analyst see what they want to achieve with the data and how they can get there; strategy also helps with the quality and usefulness of data
- being problem oriented–helps identify, describe, and solve problems
- correlation–identify correlations or relationships between two or more pieces of data
- big picture and detail-oriented thinking–being able to see the big picture as well as the details
data lifecycle
PCMAAD
the path of data from “birth to death”
main stages:
- plan–a business decides what kind of data it needs, how it will be managed along its lifecycle, who will be responsible for it, and the optimal outcomes
- capture–data is collected from a variety of sources, and is brought into the organization
- manage–how and where data is stored, the tools used to keep it safe and secure, and actions taken to make sure it’s maintained properly; this phase is important to data “cleansing”
- analyze–data is used to solve problems, make decisions, and support business goals; during this phase, the analytics team may use formulas to perform calculations, create a report from the data, or use spreadsheets to aggregate data
- archive–storing data in a place where it remains available, even though it may never be used again
- destroy–erasing data, after it’s certain it’s no longer needed; this also relates to privacy and security
structured thinking
POGO
the process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options
structured thinking also involves having a clear list of what you are expected to deliver, a timeline for major tasks and activities, and checkpoints so the team knows you’re making progress
problem domain
the specific area of analysis that encompasses every activity affecting or affected by the problem
the problem domain is your problem, plus everything else related that might lead to a solution (eg if creating firmware for military aircraft, the problem domain may be weapons, sensors, and control systems)
SMART method for questions
- specific–simple, significant, and focused on a single topic or a few closely related ideas (SSF)
- measurable–can be quantified and assessed (eg “how many” or “what percentage/proportion”; note the course includes yes/no responses as “measurable”)
- action–action-oriented questions encourage change; they look at the current state and how to transform it into the desired future state–frame the question so that its answers are actionable
- relevant–matter, are important, and have significance to the problem being solved
- time–time bound questions specify the time period to be studied (ie the time period to pull data from to help answer the problem-as-question)
three useful questions for analysis (and the five whys)
what is the root cause?
- this is a good starting question, eg as part of the initial “ask” portion of the data analysis process
- the method of the “five whys”–ask “why?” five times, each time about the previous answer (like a child iteratively asking why); the last answer may be the most valuable and get to the root of the matter
where are the gaps in our process?
- “understand where you are now vs where you want to be”–identify the “gaps” that exist between the current and future state, and figure out how to bridge them
- gap analysis–examines and evaluates how a process works currently, in order to get the enterprise where you want it in the future
what did we not consider before?
what information or procedure might be missing from a process; this helps identify ways to make better decisions and form better strategies moving forward
main analysis problem types
making predictions–use data to make an informed decision about how things may be in the future
eg a hospital might use remote monitoring (from patient’s home) to help make predictions about upcoming adverse events
categorizing things–assign information to different groups based on clusters or common features
re categorizing vs finding themes–categorizing things involves assigning items to categories; identifying themes takes those categories a step further by grouping them into broader themes
spotting something unusual–identify data different from the norm
eg a school system with a sudden increase in registrations, maybe linked to several new apartment complexes being built locally
identifying themes–a step further for categorization by grouping information into broader concepts
Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes, while user predicted income bracket might be a user category
discovering connections–allows finding similar challenges faced by different entities (so kind of across different domains), then combining data and insights to address them
eg a scooter supplier has problems with the wheels it purchases, while the rubber supplier has trouble finding materials for those wheels; if these entities got together, they might mutually benefit each other (a connection between these vertical levels of the process)
finding patterns–use historical data to understand (eg) what happened in the past and so what might be likely to happen again
eg customer buying habits are sampled throughout the year, possibly finding upticks in hat/glove purchases in colder months, or canned-goods demand spikes around forecast hurricanes
scope of work
SPuDD TuMoR
an agreed-upon outline of the work you’re going to perform on a project; sets the expectations and boundaries of a project, eg for the project team to work off of and plan from
this is an essential, industry-standard tool; a well-defined SOW keeps you, your team, and everyone involved with a project on the same page, and ensures that all contributors, sponsors, and stakeholders share the same understanding of the relevant details
often tied to the ask phase: preparing to write an SOW is about asking questions to learn the necessary information about the project, but it’s also about clarifying and defining what you’re being asked to accomplish, and what the limits or boundaries of the “ask” are
usual components:
scope–what is and what isn’t in the project’s scope
problem statement–the business question or business problem
data–anything data-centric; might include data preparation, validation, and various analysis aspects
deliverables
- items or tasks that will be completed before the project can be finished
- What work is being done, and what things are being created as a result of this project? When the project is complete, what is expected to be delivered to the stakeholders?
- be as specific as possible, and use quantitative statements and measurable objectives whenever possible
timeline–a granular way of mapping expectations for how long each step of the process should take; includes due dates for when deliverables, milestones, and/or reports are due; helps ensure project is running on schedule
milestones–significant tasks (even, in a sense, sub-projects) you will confirm along your timeline to help everyone know the project is on track
reports–how and when you’ll give status updates to the team and stakeholders, including what they will contain, and when/why they will be issued; reports help notify everyone as you finalize deliverables and meet milestones
the four V’s
usually in context of big data
components:
volume–the amount of data
variety–the different kinds of data
velocity–how fast the data can be processed
veracity–the quality and reliability of the data
considerations for data collection
SQFT
- source–first party (yourself / your company)? or second-party? or third-party?
- quantity–if it’s unfeasible to fetch data from a whole population, consider sampling, and have to decide how much of a sample to take
- format–what format is the data in
- time frame–if the time frame is short, may be restricted to historical data
considerations for data quality
ROCCCiou
- reliable–good data sources are reliable; ie trusted to provide accurate and complete, unbiased information that has been vetted for use
- original–validate data with the original source of the data
- cited–who created the dataset, is it part of a credible organization, when was the data last refreshed
- comprehensive–contain all critical information needed to answer the question or find the solution
- current–the usefulness of data decreases as time passes; the best data sources are current and relevant to the task at hand
data bias
SOIC
a type of error that systematically skews results in a certain direction, making them unreliable
- sample bias / selection bias–the sample is not representative of the population as a whole
- observer bias, aka experimenter bias or research bias–tendency for different people to observe things differently
- interpretation bias–tendency to always interpret ambiguous situations in a positive or negative way (possibly based on different backgrounds and experiences)
- confirmation bias–tendency to search for or interpret information in a way that confirms preexisting beliefs
data ethics
OTC CPO
- ownership–who owns data; may not be the organization that collected, processed, etc–it’s usually the individuals the data is drawn from
- transaction transparency–all data processing activities and algorithms should be completely explainable and understood by the people the data is drawn from (this is to overcome bias)
- consent–an individual’s right to know exact details about how and why their data is being used; why collected, how used, how long stored (related to USE)
- currency–individuals should be aware of what kinds of financial transactions their data is involved in, and at what scale, when they submit their data (so in one sense keeping in mind what profit motives are behind the data collection and use)
- privacy–
- data privacy, aka information privacy aka data protection, preserving a data subject’s information and activity whenever a data transaction occurs; eg personally identifiable information (PII)
- includes:
- protection from unauthorized access
- freedom from inappropriate use of our data
- the right to inspect, update or correct our data (related to STORAGE)
- ability to give consent to use our data
- legal right to access our data
- openness–free access, usage, and sharing of data
open data
ARUI
- access and availability–the data must be available as a whole, preferably online, and downloadable in a modifiable form
- reuse and redistribution–allowed for reuse and redistribution, including use with other data sets
- universal participation–anyone should be able to use, reuse, and redistribute the data; no restrictions limiting use to certain groups or industries
- interoperability–ability of data systems and services to openly connect and share data
transforming data
separate and combine data, as well as create new variables; eg joining data sets, or standardizing entries over various tables
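A small sketch of both ideas in plain Python (keys and values are made up): joining two data sets on a shared key while standardizing inconsistent text entries.

```python
# two sources sharing a customer_id key, with inconsistent status entries
customers = {1: "Ada Lovelace", 2: "Grace Hopper"}
orders = [
    {"order_id": 10, "customer_id": 1, "status": " SHIPPED "},
    {"order_id": 11, "customer_id": 2, "status": "shipped"},
]

joined = [
    {**order,
     "status": order["status"].strip().lower(),          # standardize entries
     "customer_name": customers[order["customer_id"]]}   # join on the shared key
    for order in orders
]
```

In SQL the join step would be a `JOIN ... ON` clause; the comprehension above is just the same idea in miniature.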
tidying data
variables organized into columns, observations organized into rows, each value must have its own cell
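As a sketch (countries and values are hypothetical): a "wide" table with one column per year is reshaped so that each observation gets its own row, with year and value as their own variables.

```python
# wide format: one column per year (year is buried in the column names)
wide = [
    {"country": "Norway", "2019": 10, "2020": 12},
    {"country": "Chile",  "2019": 7,  "2020": 9},
]

# tidy/long format: each variable is a column, each observation is a row
tidy = [
    {"country": row["country"], "year": year, "value": row[year]}
    for row in wide
    for year in ("2019", "2020")
]
```

This is the same reshaping that `pivot_longer()` in R's tidyr performs on data frames.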
cleaning data
preview and rename data so it's easier to work with; check types, ranges, and missing values; rename variables/cols; this also seems to include eg summary(df), str(df) and associated dataframe-viewing functions in R
organizing data
sort, filter, and summarize data, including groupings, and dealing with missing values (overlap with clean, perhaps)
data integrity
ACCT
- accuracy–the degree of conformity of a measure to a standard or true value
- completeness–the degree to which all required measures are known; related to missing values
- consistency–the degree to which a set of data-derived measures is equivalent across all systems
- trustworthiness–aka validity; ensure data-derived measures conform to defined business rules or constraints
data cleaning (brief points)
TR P/U KF M/D
consider data type, range, presence / uniqueness, keys, formatting, missing / duplicates
resolve “dirty data”–incomplete, incorrect, or irrelevant to the problem being solved
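A minimal sketch of two of these checks in Python (the key name and rows are hypothetical): flagging duplicate keys and missing values before analysis.

```python
def find_issues(rows, key="id"):
    """Flag duplicate keys and missing values ahead of analysis."""
    issues, seen = [], set()
    for i, row in enumerate(rows):
        if row.get(key) in seen:
            issues.append((i, "duplicate key"))
        seen.add(row.get(key))
        if any(value in ("", None) for value in row.values()):
            issues.append((i, "missing value"))
    return issues
```

Type, range, and formatting checks would follow the same pattern: one pass over the rows, collecting (row index, problem) pairs to resolve.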
dealing with insufficient data
CoPMoC
data can be missing, wrong, insufficient, or not aligned with business objectives
try to identify trends with existing data, wait for more data, talk with stakeholders and adjust objective, gather new data
correct / omit, proxy, collect, modify objective
post-cleaning steps
VRDC
verification, reporting on processing / cleaning phase, documenting, and changelogs