general analysis Flashcards

1
Q

data analysis

A

CTO CPD

using tools to collect, transform, and organize information in order to draw useful conclusions, make predictions, and drive informed decision-making

2
Q

analytics

A

the science of data; a very broad concept that encompasses everything from the job of managing and using data to the tools and methods that data workers use every day; it covers both data ecosystems and data analysis

3
Q

business task

A

the question or problem data analysis addresses for a business

4
Q

data strategy

A

the management of the people (making sure they know how to use the right data to address the problems they're working on), processes (the path to a solution is clear and accessible), and tools (the right technology is used for the job) used in data analysis

5
Q

decision intelligence

A

formalizes the process of selecting between options; a combination of applied data science and the social and managerial sciences

6
Q

business analytics

A

the use of math and statistics to derive meaning from data in order to make better business decisions

types:

  • descriptive analytics–the interpretation of historical data to identify trends and patterns
  • predictive analytics–centers on taking that information and using it to forecast future outcomes
  • diagnostic analytics–can be used to identify the root cause of a problem
  • prescriptive analytics–testing and other techniques are employed to determine which outcome will yield the best result in a given scenario
7
Q

metric

A

single quantifiable type of data that can be used for measurement; may be an aggregation of attributes in the data

8
Q

data validation

A

a tool for checking the accuracy and quality of data before adding or importing it; a form of data cleansing or cleaning
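a minimal sketch of the idea in R (the data frame, column names, and allowed values are hypothetical), checking entries against expected values and ranges before they are imported:

  new_rows <- data.frame(
    survey_response = c("yes", "no", "maybe"),
    age             = c(34, 29, 210)
  )

  allowed_responses <- c("yes", "no", "not sure")

  # flag rows that fail the checks before they are imported
  bad_response <- !(new_rows$survey_response %in% allowed_responses)
  bad_age      <- new_rows$age < 0 | new_rows$age > 120

  new_rows[bad_response | bad_age, ]   # rows to fix or reject before import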

9
Q

data mapping

A

process of matching fields from one database to another; important to data migration and data integration

10
Q

schema

A

a way of describing how something is organized (this came up in context of data mapping, and foreign and primary keys)

11
Q

spotlighting

A

scanning through the data quickly to identify the most important insights

12
Q

statement of work

A

a document that clearly identifies the products and services a vendor or contractor will provide to an organization; similar to a scope of work, but a statement of work is fully client-facing (whereas a scope of work is more internal, facing the project team)

13
Q

profit margin

A

a percentage indicating how many units of profit have been generated for each unit of sale: 100*unit_profit/unit_revenue
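a quick worked example in R, using hypothetical unit figures:

  unit_revenue <- 25                        # hypothetical sale price per unit
  unit_cost    <- 15
  unit_profit  <- unit_revenue - unit_cost

  profit_margin <- 100 * unit_profit / unit_revenue
  profit_margin                             # 40, ie a 40% profit margin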

14
Q

return on investment (ROI)

A

formula that uses the metrics of investment and profit to evaluate the success of an investment; net profit over time of an investment, divided by cost of investment (so a proportion or percentage)
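a quick worked example in R, using hypothetical figures:

  cost_of_investment <- 5000
  net_profit         <- 1200                # profit generated over the period

  roi <- net_profit / cost_of_investment
  roi                                       # 0.24 as a proportion
  roi * 100                                 # 24 as a percentage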

15
Q

data source types (1st, 2nd, 3rd)

A
  • first party data–data collected by an individual or group using their own resources
  • second party data–data collected by a group directly from its audience and then sold; this is aka “someone else’s first-party data”; data collected from a trusted partner
  • third-party data–data provided by an entity that did not collect the data themselves; eg data aggregators
16
Q

qualitative data value types

A

nominal–a type of qualitative data that is categorized without a set order (so un-orderable); eg have you watched a certain movie? (yes/no/not sure)
ordinal–qualitative data with a set order or scale (eg rating a movie 1 to 5)
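a small sketch of the distinction in R, using the movie examples above (unordered vs ordered factors):

  # nominal: categories with no inherent order
  watched <- factor(c("yes", "no", "not sure"))

  # ordinal: categories with a set order or scale
  rating <- factor(c(3, 5, 1), levels = 1:5, ordered = TRUE)

  rating[1] < rating[2]   # TRUE: order comparisons are meaningful for ordinal data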

17
Q

mental model

A

thought process and the way you approach a problem

18
Q

changelog

A

a chronological list of changes made to an existing project, typically recording the date and the features added, improved, or removed; a document used to record the notable changes made to a project over its lifetime, across all of its tasks, and typically curated so that the changes are listed chronologically across all versions of the project

19
Q

data aggregation

A

gathering data from multiple sources in order to combine it into a single, summarized collection; helps identify trends, make comparisons, and gather insights that would not otherwise be possible when looking at each piece of data on its own
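a minimal sketch in R, assuming hypothetical daily sales pulled from two store systems:

  store_a <- data.frame(store = "A", day = c("Mon", "Tue"), sales = c(100, 120))
  store_b <- data.frame(store = "B", day = c("Mon", "Tue"), sales = c(80, 95))

  # combine the sources, then summarize into a single collection
  all_sales <- rbind(store_a, store_b)
  aggregate(sales ~ day, data = all_sales, FUN = sum)   # total sales per day across stores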

20
Q

data: internal, external, structured, unstructured

A

internal data–data that lives within a company's own systems, and may well be collected by the organization itself; aka primary data; may be easier to collect and be more reliable than external data
external data–data that lives and is generated outside an organization; aka secondary data; this can be valuable when the analysis depends on as many sources as possible
structured data–data that is organized in a certain format, such as rows and columns in a spreadsheet
unstructured data–data that is not structured in any identifiable manner; eg audio and video data might be considered “unstructured”

21
Q

composite key

A

a primary key formed by using multiple columns / variables / fields in a relational database table
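a small sketch in R with a hypothetical order-items table: neither column alone is unique, but the pair is, so together they can serve as a composite primary key:

  order_items <- data.frame(
    order_id   = c(1, 1, 2),
    product_id = c("p1", "p2", "p1"),
    quantity   = c(2, 1, 5)
  )

  # FALSE: the (order_id, product_id) pair uniquely identifies each row
  any(duplicated(order_items[c("order_id", "product_id")]))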

22
Q

normalization (data)

A

a process of organizing data in a relational database. For example, creating tables and establishing relationships between those tables. It is applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database
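a minimal sketch in R (hypothetical data): a flat table with repeated customer details is split into two related tables, so customer attributes are stored once and linked back by a key:

  flat <- data.frame(
    order_id      = c(1, 2, 3),
    customer_id   = c("C1", "C1", "C2"),
    customer_name = c("Ada", "Ada", "Grace"),
    amount        = c(10, 25, 40)
  )

  # customer attributes live in their own table (no redundancy)...
  customers <- unique(flat[c("customer_id", "customer_name")])
  # ...and orders keep only the key that points back to it
  orders <- flat[c("order_id", "customer_id", "amount")]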

23
Q

data security

A

protecting data from unauthorized access or corruption by adopting safety measures

24
Q

data analysis lifecycle

A

APP ASA

ask
prepare
process
analyze
share
act

25
Q

data ecosystem

A

PMS OASis

various elements that interact with one another to produce, manage, store, organize, analyze, and share data

a given data ecosystem is typically in context of a given purpose or organization, like a particular business, like a retail store or a farm, and can include hardware and software

26
Q

data driven decision making

A

the use of facts to guide business strategy; enables companies to use data analytics to find the best possible solution to a problem, complement observation with objective data, and get a complete picture of a problem and its causes

benefits:
gain valuable insights
verify theories or assumptions
better understand opportunities and challenges
support objectives
make a plan

27
Q

analytical skills

A

CCuT DiS

the qualities and characteristics of solving problems using facts; identifying and defining a problem, then solving it using data in an organized, step-by-step manner

components:

  • curiosity
  • understanding context; eg the ability to identify context and identify out-of-context elements; includes recognizing and adding or creating structure (such as naming columns in a data table)
  • having a technical mindset–involves the ability to break things down into smaller pieces and work with them in an orderly and logical way
  • data design–how one organizes information
  • data strategy–the management of the people (they know how to use the right data to address the problems they're working on), processes (the path to a solution is clear and accessible), and tools (the right technology is used for the job) used in data analysis (ppt)
28
Q

analytical thinking

A

ViS PCP

identifying and defining a problem, then solving it using data in an organized, step-by-step manner

aspects:

  • visualization–the graphical representation of information; eg graphs, maps, and other design elements
  • strategy–having a strategic mindset helps the analyst see what they want to achieve with the data and how they can get there; strategy also helps with the quality and usefulness of data
  • being problem oriented–helps identify, describe, and solve problems
  • correlation–identify correlations or relationships between two or more pieces of data
  • big picture and detail-oriented thinking–being able to see the big picture as well as the details
29
Q

data lifecycle

A

PCMAAD

the path of data from “birth to death”

main stages:

  • plan–a business decides what kind of data it needs, how it will be managed along its lifecycle, who will be responsible for it, and the optimal outcomes
  • capture–data is collected from a variety of sources, and is brought into the organization
  • manage–how and where data is stored, the tools used to keep it safe and secure, and actions taken to make sure it’s maintained properly; this phase is important to data “cleansing”
  • analyze–data is used to solve problems, make decisions, and support business goals; during this phase, the analytics team may use formulas to perform calculations, create a report from the data, or use spreadsheets to aggregate data
  • archive–storing data in a place where it remains available, though it may not be used again
  • destroy–erasing data, after it’s certain it’s no longer needed; this also relates to privacy and security
30
Q

structured thinking

A

POGO

the process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options

structured thinking also involves having a clear list of what you are expected to deliver, a timeline for major tasks and activities, and checkpoints so the team knows you’re making progress

31
Q

problem domain

A

the specific area of analysis that encompasses every activity affecting or affected by the problem

the problem domain is your problem, plus everything else related that might lead to a solution (eg if creating firmware for military aircraft, the problem domain may be weapons, sensors, and control systems)

32
Q

SMART method for questions

A
  • specific–simple, significant, and focused on a single topic or a few closely related ideas (SSF)
  • measurable–can be quantified and assessed (eg “how many” or “what percentage/proportion”; note the course includes yes/no responses as “measurable”)
  • action–action-oriented questions encourage change; see the current state and how to transform it into the desired future state, and form the question so that its answers are actionable
  • relevant–matter, are important, and have significance to the problem being solved
  • time–time bound questions specify the time period to be studied (ie the time period to pull data from to help answer the problem-as-question)
33
Q

three useful questions for analysis (and the five whys)

A

what is the root cause?

  • this is a good starting question, eg as part of the initial “ask” portion of the data analysis process
  • the method of the “five whys”–ask “why” five times, each time of the previous answer (like a kid iteratively asking why); the last answer may be the most valuable and get to the root of the matter

where are the gaps in our process?

  • “understand where you are now vs where you want to be”–identify the “gaps” that exist between the current and future state, and figure out how to bridge them
  • gap analysis–examines and evaluates how a process works currently, in order to get the enterprise where you want it in the future

what did we not consider before?

what information or procedure might be missing from a process; this helps identify ways to make better decisions and form better strategies moving forward

34
Q

main analysis problem types

A

making predictions–use data to make an informed decision about how things may be in the future
eg a hospital might use remote monitoring (from patient’s home) to help make predictions about upcoming adverse events

categorizing things–assign information to different groups based on clusters or common features

re categorizing vs finding themes–categorizing things involves assigning items to categories; identifying themes takes those categories a step further by grouping them into broader themes

spotting something unusual–identify data different from the norm

eg a school system with a sudden increase in registrations, maybe linked to several new apartment complexes being built locally

identifying themes–a step further for categorization by grouping information into broader concepts

Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes, while a user's predicted income bracket might be a category

discovering connections–allows finding similar challenges faced by different entities (so kind of across different domains), then combining data and insights to address them

eg a scooter supplier has problems with the wheels it purchases, where the rubber supplier had trouble finding the materials for the wheels, so if these entities got together, they might mutually benefit each other (a connection between these vertical levels of the process)

finding patterns–use historical data to understand (eg) what happened in the past and so what might be likely to happen again

eg customer buying habits sampled throughout the year, possibly showing upticks in hat/glove purchases in colder months, or spikes in canned-goods demand around forecast hurricanes

35
Q

scope of work

A

SPuDD TuMoR

an agreed-upon outline of the work you’re going to perform on a project; sets the expectations and boundaries of a project, eg for the project team to work off of and plan from
this is an essential, industry-standard tool; a well-defined SOW keeps you, your team, and everyone involved with a project on the same page, and ensures that all contributors, sponsors, and stakeholders share the same understanding of the relevant details

often tied to the ask phase: preparing to write an SOW is about asking questions to learn the necessary information about the project, but it’s also about clarifying and defining what you’re being asked to accomplish, and what the limits or boundaries of the “ask” are

usual components:

scope–what is and what isn’t in the project’s scope

problem statement–the business question or business problem

data–anything data-centric; might include data preparation, validation, and various analysis aspects

deliverables

  • items or tasks that will be completed before the project can be finished
  • What work is being done, and what things are being created as a result of this project? When the project is complete, what is expected to be delivered to the stakeholders?
  • be as specific as possible, and use quantitative statements and measurable objectives whenever possible

timeline–a granular way of mapping expectations for how long each step of the process should take; includes due dates for when deliverables, milestones, and/or reports are due; helps ensure project is running on schedule

milestones–significant tasks (even, in a sense sub-projects) you will confirm along your timeline to help everyone know the project is on track

reports–how and when you’ll give status updates to the team and stakeholders, including what they will contain, and when/why they will be issued; reports help notify everyone as you finalize deliverables and meet milestones

36
Q

the four V’s

A

usually in context of big data

components:
volume–the amount of data
variety–the different kinds of data
velocity–how fast the data can be processed
veracity–the quality and reliability of the data

37
Q

considerations for data collection

A

SQFT

  • source–first-party (yourself / your company), second-party, or third-party?
  • quantity–if it's unfeasible to fetch data from a whole population, consider sampling, and decide how large a sample to take
  • format–what format is the data in
  • time frame–if the time frame is short, may be restricted to historical data
38
Q

considerations for data quality

A

ROCCCiou

  • reliable–good data sources are reliable; ie trusted to provide accurate and complete, unbiased information that has been vetted for use
  • original–validate data with the original source of the data
  • cited–who created the dataset, is it part of a credible organization, when was the data last refreshed
  • comprehensive–contain all critical information needed to answer the question or find the solution
  • current–the usefulness of data decreases as time passes; the best data sources are current and relevant to the task at hand
39
Q

data bias

A

SOIC

a type of error that systematically skews results in a certain direction, making them unreliable

  • sample bias / selection bias–the sample is not representative of the population as a whole
  • observer bias, aka experimenter bias or research bias–tendency for different people to observe things differently
  • interpretation bias–tendency to always interpret ambiguous situations in a positive or negative way (possibly based on different backgrounds and experiences)
  • confirmation bias–tendency to search for or interpret information in a way that confirms preexisting beliefs
40
Q

data ethics

A

OTC CPO

  • ownership–who owns data; may not be the organization that collected, processed, etc–it’s usually the individuals the data is drawn from
  • transaction transparency–all data processing activities and algorithms should be completely explainable and understood by the people the data is drawn from (this is to overcome bias)
  • consent–an individual’s right to know exact details about how and why their data is being used; why collected, how used, how long stored (related to USE)
  • currency–individuals should be aware of what kinds of financial transactions are tied to their data, and at what scale, when they submit it (so, in one sense, keeping in mind what profit motives are behind the data collection and use)
  • privacy–
    • data privacy, aka information privacy aka data protection, preserving a data subject’s information and activity whenever a data transaction occurs; eg personally identifiable information (PII)
    • includes:
      • protection from unauthorized access
      • freedom from inappropriate use of our data
      • the right to inspect, update or correct our data (related to STORAGE)
      • ability to give consent to use our data
      • legal right to access our data
  • openness–free access, usage, and sharing of data
41
Q

open data

A

ARUI

  • access and availability–must be available as a whole, preferably online, and downloadable in a modifiable form
  • reuse and redistribution–allowed for reuse and redistribution, including use with other data sets
  • universal participation–anyone should be able to use, reuse, and redistribute the data; no restrictions for use only by certain groups, or certain industries
  • interoperability–ability of data systems and services to openly connect and share data
42
Q

transforming data

A

separate and combine data, as well as create new variables; eg joining data sets, or standardizing entries over various tables
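a minimal sketch in R (the tables and exchange rate are hypothetical): standardizing entries, joining two data sets, and creating a new variable:

  orders    <- data.frame(order_id = 1:3, cust_id = c("c1", "c2", "c1"), amount = c(10, 25, 40))
  customers <- data.frame(cust_id = c("C1 ", "c2"), region = c("East", "West"))

  # standardize entries so the keys match across tables
  customers$cust_id <- tolower(trimws(customers$cust_id))

  # combine the data sets by joining on the shared key
  combined <- merge(orders, customers, by = "cust_id")

  # create a new variable from existing ones
  combined$amount_usd <- combined$amount * 1.1   # hypothetical exchange rate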

43
Q

tidying data

A

variables organized into columns, observations organized into rows, each value must have its own cell
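a minimal sketch using the tidyr package (assuming it is installed; the quarterly-scores data is hypothetical), pivoting wide columns so each variable is a column and each observation is a row:

  library(tidyr)

  scores_wide <- data.frame(student = c("A", "B"), q1 = c(90, 85), q2 = c(88, 91))

  # each variable (student, quarter, score) becomes a column;
  # each observation (one student's score in one quarter) becomes a row
  pivot_longer(scores_wide, cols = c(q1, q2), names_to = "quarter", values_to = "score")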

44
Q

cleaning data

A

preview and rename data so it’s easier to work with; check types, ranges, missing values, rename variables/cols; this seems also to include eg summary(df), str(df) and associated dataframe viewing functions
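eg, made concrete in R (the data frame and column name are hypothetical):

  df <- data.frame("Cust Name" = c("Ada", "Grace", NA), amount = c(10, 25, 40))

  str(df)        # column types
  summary(df)    # ranges and counts of missing values
  head(df)       # quick preview of the first rows

  # rename an awkward column so it's easier to work with
  names(df)[names(df) == "Cust.Name"] <- "customer_name"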

45
Q

organizing data

A

sort, filter, and summarize data, including groupings, and dealing with missing values (overlap with clean, perhaps)
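eg, in base R (hypothetical sales data):

  sales <- data.frame(region = c("East", "West", "East", "West"),
                      amount = c(200, 150, NA, 300))

  sales[order(-sales$amount), ]     # sort by amount, descending (the NA ends up last)
  subset(sales, region == "East")   # filter to one group
  tapply(sales$amount, sales$region, sum, na.rm = TRUE)   # summarize by group, ignoring missing values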

46
Q

data integrity

A

ACCT

  • accuracy–the degree of conformity of a measure to a standard or true value
  • completeness–the degree to which all required measures are known; related to missing values
  • consistency–the degree to which a set of data-derived measures is equivalent across all systems
  • trustworthiness–aka validity; ensure data-derived measures conform to defined business rules or constraints
47
Q

data cleaning (brief points)

A

TR P/U KF M/D

consider data type, range, presence / uniqueness, keys, formatting, missing / duplicates

resolve “dirty data”–incomplete, incorrect, or irrelevant to the problem being solved
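a few quick checks in R covering those points (hypothetical data):

  df <- data.frame(id = c(1, 2, 2, 4), score = c(10, NA, 15, 999))

  class(df$score)                 # type
  range(df$score, na.rm = TRUE)   # range (999 stands out as suspect)
  sum(is.na(df$score))            # missing values
  sum(duplicated(df$id))          # duplicate keys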

48
Q

dealing with insufficient data

A

CoPMoC

data can be missing, wrong, insufficient, or not aligned with business objectives

try to identify trends with existing data, wait for more data, talk with stakeholders and adjust objective, gather new data
correct / omit, proxy, collect, modify objective

49
Q

post-cleaning steps

A

VRDC

verification, reporting on processing / cleaning phase, documenting, and changelogs