Data Analysis Concepts Flashcards

1
Q

What are the Data Analysis phases (in this course)?

A

AP-PASA
Ask: understand problem, goals, stakeholders - plan project
Prepare: get the data
Process: clean, organize, transform
Analyze: explore, visualize, stats
Share: communicate, report, story
Act: solutions

2
Q

What is Data Analysis?

A

Turning data into insights for informed action - reduce risk of wasted efforts

3
Q

What is the SMART question methodology?

A

Specific: simple, focused
Measurable: quantifiable
Action-oriented: encourages change
Relevant
Time-bound

4
Q

What’s the difference between a Data Analyst, Data Engineer, & Data Scientist?

A

Analyst: answers questions with existing data – SQL, spreadsheets, DB’s, BI, dashboards
Engineer: turns raw data into actionable pipelines
Scientist: creates new ways of modeling and using data

5
Q

What’s the difference between data-driven vs data-inspired decision-making?

A

Data-driven: using facts to guide strategy… requires quality & quantity… over-reliance can result in historical bias, ignoring qualitative insight
Data-inspired: adds in other sources of info - feelings/experience, difficult to measure qualities, related concepts

6
Q

Quantitative vs Qualitative data: explain differences and give examples

A

Quantitative: specific & measurable. Often gives WHAT of a problem.
* Structured interviews, surveys/polls

Qualitative: subjective or explanatory - can’t be quantified. Often gives WHY of a problem.
* Focus groups, social media text/review analysis, in-person interviews
Powerful when combined

7
Q

Report vs Dashboard: explain differences, strengths/weaknesses

A

Report: Static, distributed periodically
+/- High level, historical
+. Quick to build, easy IF maintained
+. Static data - no cleaning
-. Continual maintenance
-. Less interactive

Dashboard: Real-time data, multiple datasets in one place
+. Dynamic, automated, interactive
+. User exploration
-. Labor-intensive design
-. Can be confusing/overwhelming (requires training)
-. More initial effort, and may need fixes
-. Potentially unclean data

8
Q

3 Types of Common Dashboard focus

A

Strategic: long term goals - highest level metrics over time frame
Operational: short-term performance and goals (most common - real-time status)
Analytical: datasets and mathematics

9
Q

Small Data vs Big Data: define and explain the differences in use

A

**Small Data:** specific, short time-period, day-to-day decisions
- usually spreadsheets
- small/mid-size businesses
- simple to collect, store, manage, sort, visualize
- usually manageable size for analysis

**Big Data:** larger, less-specific, longer time period, big decisions
- usually database, queried
- larger businesses
- takes effort to collect, store, manage, sort, visualize
- usually needs to be broken down for analysis
- often more data than needed - challenge is to sift for gems

10
Q

What is structured thinking?

A

Process - recognize problem, organize available info, reveal gaps/opportunities, identify options for action.

11
Q

Scope of Work & Statement of Work: Define and explain difference

A

Scope of Work: agreed-upon outline of the work to be performed, including deliverables, timeline, milestones, and reports
Statement of Work: identifies the products/services a vendor or contractor will provide to an organization (objectives, guidelines, deliverables, schedule, cost)

12
Q

What are the W questions to explore possible bias in data?

A

**Who:** person/organization who collected/funded the data
**What:** things in the world that could have impacted the data
**Where:** origin of data
**When:** time data was created/collected
**Why:** motivation behind creation/collection
**How:** methods used to create/collect

Important to include context/possible bias when presenting/reporting data

13
Q

What are some tips when dealing with Executive team stakeholders?

A
  • Strategic
  • Headlines first
  • Limited time
  • Details in appendix
14
Q

1st vs 2nd vs 3rd party data sources: what’s the difference?

A

First Party: collected by an individual/group themselves, for their own use
Second Party: collected by a group directly from its own audience, then sold
Third Party: provided by outside sources that didn’t collect it directly themselves - requires more checking

15
Q

Discrete vs Continuous?
Ordinal vs Nominal?
Internal vs External?

A

Discrete: whole numbers only
**Continuous:** any numeric value
Ordinal: qualitative data with set order/sequence
Nominal: qualitative data with no order/sequence
**Internal:** lives in org’s systems (more reliable, easier to collect)
**External:** lives outside org’s systems

16
Q

Structured vs Unstructured data?

A

Structured: defined data types; usually quantitative; easy to organize/search/analyze; rows/columns; stored in DB or warehouse
Unstructured: varied data types; usually qualitative; difficult to search; more freedom for analysis; stored in data lakes, warehouses, NoSQL DBs; can’t be put in rows/columns. Examples: text messages, social media comments, phone call transcripts, images/audio/video

17
Q

What is Data Modeling?

A

Diagramming how data is organized/structured.
Data analysts don’t create data models, but they must be able to read/understand them.

18
Q

What are the three types of data modeling (from lowest to highest level of detail)?

A

**1. Conceptual** (business concepts): high level view of data, key entities
2. Logical (data entities): relationships, attributes, entities (not actual table/column names)
3. Physical (physical tables): specific definitions of all tables, attributes, relationships, columns, data types, ACTUAL names

19
Q

ERD vs UML

A

**ERD:** Entity Relationship Diagram - visual display of relationships between entities in a database
**UML:** Unified Modeling Language - detailed diagrams that show ERD contents PLUS system behaviors and workflows (more detailed)

20
Q

Data Types:
Number
String
Boolean

A

Number: numeric values only, including decimals
String: text characters, punctuation
Boolean: T or F (formulas use Boolean operators AND, OR, NOT)

21
Q

Wide Data vs Long Data

A

Wide Data: each subject has single row with multiple columns for attributes. Easier to compare specific attribute across different subjects. Analysts often transform Long data into Wide data for analysis/visualizations
**Long Data:** each row is one time point per subject. Each subject may have multiple rows. Versatile way to store data. More advanced, more detail on each subject.
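
Example (a minimal sketch, not a standard recipe): the same table converted between wide and long form in plain Python. The table contents, column names, and the helper names `wide_to_long` / `long_to_wide` are hypothetical, chosen only for illustration.

```python
# Hypothetical wide table: one row per subject, one column per year.
wide = [
    {"country": "A", "2020": 10, "2021": 12},
    {"country": "B", "2020": 7, "2021": 9},
]


def wide_to_long(rows, id_col, var_name, value_name):
    """Turn each non-id column of a wide row into its own long row."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append(
                    {id_col: row[id_col], var_name: col, value_name: val}
                )
    return long_rows


def long_to_wide(rows, id_col, var_name, value_name):
    """Collect all long rows for a subject back into a single wide row."""
    by_id = {}
    for row in rows:
        wide_row = by_id.setdefault(row[id_col], {id_col: row[id_col]})
        wide_row[row[var_name]] = row[value_name]
    return list(by_id.values())


long = wide_to_long(wide, "country", "year", "value")
for row in long:
    print(row)  # one row per (country, year) pair

round_trip = long_to_wide(long, "country", "year", "value")
```

Converting to long form and back returns the original wide rows, which shows that the two layouts hold the same information.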

22
Q

What is data transformation?

A

Changing data format, structure, values (adding/copying/deleting/renaming/combining/joining datasets/reformatting).
Goal: reorganize data for easier use/analysis; improve compatibility/portability; merge datasets; enhance with more detail/fields; compare data

23
Q

What are the three common types of metadata?

A

Descriptive: used to identify a record later on (name, ID, title, author, etc.)
Structural: how a piece of data is organized and how it relates to other data/collections
Administrative: technical source of an asset

24
Q

What 3 elements are needed to calculate a sample size? What do they each represent?

A

Population size (total)
Confidence Level (%): how sure you are that results are representative (if you did the survey again, likelihood you’d get similar results). Standard 95% or 90%
Margin of Error (+/- %): how much your results might vary from the ACTUAL value. Standard for surveys: 3%-5%
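
Example (a sketch under stated assumptions): one common way to combine these three inputs is Cochran's formula with a finite population correction. The card does not name a formula, so this choice, the function name `sample_size`, and the default `p=0.5` (maximal variation) are assumptions.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Estimate the sample size needed for a survey proportion.

    Assumes Cochran's formula with finite population correction;
    p=0.5 is the most conservative (largest-sample) assumption.
    """
    # z-score for a two-sided confidence level, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Sample size for an effectively infinite population
    n0 = z**2 * p * (1 - p) / margin**2
    # Shrink it for a finite population
    n = n0 / (1 + (n0 - 1) / population)
    return ceil(n)


print(sample_size(500, confidence=0.95, margin=0.05))  # prints 218
```

A tighter margin of error or a higher confidence level both increase the required sample size.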

25
Q

Explain differences between
- Margin of Error
- Confidence Level
- Confidence Interval

A
  • Margin of Error: % by which results may vary from the actual population result. Standard for surveys: 3%-5%
  • Confidence Level: % confident that results are representative (likelihood of similar results if repeated). Standard 90%-95%
  • Confidence Interval: range of values the ACTUAL value might fall in at the current confidence level
    (If confidence level = 95%, margin of error is +/- 3%, and value was 1,000: “We are 95% confident that the actual value is between 970 and 1,030.”)
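
Example (a minimal sketch of the arithmetic in this card): the interval is the estimate plus or minus the margin of error. Following the card's example, the margin is treated as a percentage of the estimate; the function name `confidence_interval` is hypothetical.

```python
def confidence_interval(estimate, margin_pct):
    """Return (low, high) for an estimate and a +/- margin given in percent."""
    delta = estimate * margin_pct / 100
    return estimate - delta, estimate + delta


low, high = confidence_interval(1000, 3)
print(low, high)  # prints 970.0 1030.0
```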
26
Q

What is statistical significance?
What is statistical power?

A

Statistical significance: whether an observed outcome is unlikely to be due to chance alone (for an experiment or a relationship between two variables). Assessed with a p-value between 0.0 and 1.0: the probability of seeing a result at least this extreme if there were no real effect. A p-value less than .05 (5%) is conventionally considered statistically significant (calculated after the study).
Statistical power: probability that a test detects a real effect, i.e. returns a statistically significant result when an effect exists (determined before the study). Expressed as 0.0 to 1.0. A power of at least 0.8 is the usual target when planning a study.
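
Example (a sketch, assuming a coin-flip experiment that the card does not mention): an exact two-sided binomial test of whether a coin is fair. The p-value is the chance of a result at least this lopsided if the coin really is fair. The function name is hypothetical.

```python
from math import comb


def fair_coin_p_value(heads, flips):
    """Two-sided exact p-value under the null hypothesis P(heads) = 0.5."""
    total = 2**flips
    # Probability of at least this many heads, and of at most this many
    upper = sum(comb(flips, k) for k in range(heads, flips + 1)) / total
    lower = sum(comb(flips, k) for k in range(0, heads + 1)) / total
    # Double the smaller tail; cap at 1.0
    return min(1.0, 2 * min(upper, lower))


p_60 = fair_coin_p_value(60, 100)  # slightly above 0.05: not significant
p_61 = fair_coin_p_value(61, 100)  # below 0.05: statistically significant
print(p_60 < 0.05, p_61 < 0.05)  # prints False True
```

One extra head out of 100 flips moves the result across the 0.05 line, which shows that the threshold is a convention rather than a sharp boundary.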

27
Q

What is response distribution and how does it relate to margin of error?

A

Response distribution reflects the degree of variation in answers, which in turn affects margin of error. Expressed as a % where 50% is maximal variation and 0% and 100% are uniform/unanimous. (A highly skewed sample reflects a highly skewed population.)
For this reason, the actual margin of error should be calculated for EACH QUESTION rather than for the survey as a whole.
PLAN for maximal variation (50% response distribution) but PUBLISH margin of error based on actual response distribution.
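
Example (a sketch, assuming the standard normal-approximation formula for a proportion and a population large enough to ignore the finite population correction; the function name and the sample size of 1,000 are illustrative):

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Margin of error for a proportion p observed in a sample of size n."""
    # z-score for a two-sided confidence level, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


# Same sample size, different response distributions
for p in (0.5, 0.7, 0.9):
    print(p, round(margin_of_error(p, 1000), 4))
```

With the sample size held fixed, the margin is widest at a 50% response distribution and shrinks as answers become more one-sided, which is why planning at 50% is the conservative choice.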