midterm Flashcards
DATA
big data
extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time
DATA
5 V’s of big data
volume → scale of data
variety → different forms of data
veracity → uncertainty of data
can you trust it?
are you required to clean data?
velocity → analysis of streaming data
value → what we get out of the data
to answer questions
DATA
5 principles of data ethics for business professionals
ownership → individuals own their personal info
* get consent through signed written agreements, digital privacy policies, pop-ups with checkboxes
transparency → subjects have a right to know how you plan to collect, store, and use their data
privacy → safeguard personally identifiable information (PII) via dual authentication, file encryption
de-identify datasets → remove PII
intention → why are you collecting the data?
outcomes → disparate impacts (harmful even if intentions are good)
ex. arrest-record ads shown disproportionately on searches for Black-identifying names
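The de-identify step under privacy can be sketched roughly as below. This is only an illustration: the field names (`name`, `email`, `phone`), the `subject_id` key, and the salt are all hypothetical, and hashing an identifier is pseudonymization, which is weaker than full anonymization.

```python
import hashlib

# Hypothetical PII column names; a real dataset will have its own.
PII_FIELDS = {"name", "email", "phone"}

def deidentify(records, id_field="email", salt="replace-with-secret-salt"):
    """Drop PII fields and add a salted pseudonymous ID in their place."""
    cleaned = []
    for row in records:
        # Hash the chosen identifier so rows for the same person still link up.
        digest = hashlib.sha256((salt + str(row[id_field])).encode("utf-8")).hexdigest()
        kept = {k: v for k, v in row.items() if k not in PII_FIELDS}
        kept["subject_id"] = digest[:12]
        cleaned.append(kept)
    return cleaned

# Made-up records for illustration only.
records = [
    {"name": "A. Example", "email": "a@example.com", "phone": "555-0100", "purchase": 42.0},
    {"name": "B. Example", "email": "b@example.com", "phone": "555-0101", "purchase": 17.5},
]
deidentified = deidentify(records)
```

Each output row keeps only the non-PII fields plus `subject_id`, so analysis can still group by person without exposing who they are.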
DATA
data analysis process
- define why you need data analysis
- begin collecting data from sources
- clean the data by removing unnecessary or erroneous records
- begin analyzing the data
- interpret results and apply them
DATA
exploratory data analysis techniques
summary stats: mean, median, mode, min, max, SD
data visualization: charts, graphs
outlier detection: Z-score, box plot, scatter plots
correlation analysis: correlation matrices, scatter plots
data distribution assessment: histograms, density plots
dimensionality reduction: PCA
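A few of the techniques above (summary stats, Z-score outlier detection, correlation) can be sketched with only the Python standard library. The sample values are made up, and the 2.5 cutoff is an assumption: with a sample of 10 points no Z-score can reach 3, so the common cutoff of 3 would flag nothing here.

```python
import statistics

# Made-up sample; 95 is a planted outlier.
values = [12, 15, 14, 13, 15, 16, 14, 15, 13, 95]

# Summary stats: mean, median, mode, min, max, SD (sample standard deviation).
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "min": min(values),
    "max": max(values),
    "sd": statistics.stdev(values),
}

def zscore_outliers(data, threshold=2.5):
    """Return the points whose absolute Z-score exceeds the threshold."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mu) / sd > threshold]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

outliers = zscore_outliers(values)
r = pearson([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear, so r is about 1
```

Note how the single outlier pulls the mean (22.2) far above the median (14.5), which is why both are worth checking.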
DATA
business intelligence
tools and techniques that process data and conduct statistical analysis for insight and discovery
used to discover meaningful relationships in the data, detect trends, identify opportunities and risks
DATA
data ethics
moral obligations of gathering, protecting, and using personally identifiable info and how it affects individuals
to protect customers’ safety, save org from legal issues
DATA
where can algorithms have bias?
ethical use of algorithms → bias:
1. training — unrepresentative datasets = favors some outcomes
2. code — might have been written to produce biased results
3. feedback — can be influenced by biased feedback
DATA
data network effect
ex. of companies
growth cycle in which data is used to acquire customers, who create more data, which attracts more customers
* common growth model for ecommerce
* smart companies use the data to inform investment in their operations + build defensible business models
* have to cultivate cultures that facilitate the data network effect
Netflix, Tesla
DATA
do you start with building the infrastructure of the data? what are the issues involved?
start with infrastructure: but where will the data come from?
start with data: build the infrastructure over time → hard to store the data initially
DATA
data integrity
accuracy, consistency, and reliability of data throughout its lifecycle
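One possible way to check integrity in practice, as a rough sketch: validate records against simple rules (accuracy, consistency) and fingerprint the dataset so unexpected changes are detectable later (reliability). The `id` and `amount` fields and the rules themselves are assumptions for illustration.

```python
import hashlib
import json

def checksum(records):
    """Stable SHA-256 fingerprint of a list of dict records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def integrity_problems(records):
    """List rule violations: duplicate ids, missing or negative amounts."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(records):
        if row.get("id") in seen_ids:
            problems.append(f"row {i}: duplicate id {row.get('id')}")
        seen_ids.add(row.get("id"))
        if row.get("amount") is None:
            problems.append(f"row {i}: missing amount")
        elif row["amount"] < 0:
            problems.append(f"row {i}: negative amount")
    return problems

# Made-up orders with three deliberate problems.
orders = [
    {"id": 1, "amount": 20.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": -5.0},
]
problems = integrity_problems(orders)
fingerprint = checksum(orders)  # store it; recompute later to detect changes
```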
DATA
data exploration
data analytics process where analysts investigate the dataset to gain insights, identify patterns, and understand the underlying structure of the data
helps understand the data, assess the data quality, select important features of data, detect outliers, and identify relationships and patterns.
DATA
statistics, probability
statistics — branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data
probability — a mathematical tool used to study randomness; the chance of an event occurring
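A small worked example of "the chance of an event occurring": the probability that two fair dice sum to 7, computed exactly by counting outcomes and then estimated by simulation. The dice scenario, seed, and trial count are illustrative choices, not from the course material.

```python
import random
from fractions import Fraction
from itertools import product

# Exact probability: favorable outcomes divided by all equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))   # 36 ordered pairs
favorable = [o for o in outcomes if sum(o) == 7]  # 6 pairs sum to 7
exact = Fraction(len(favorable), len(outcomes))   # 6/36 = 1/6

# Simulation: the relative frequency approaches the exact value as trials grow.
rng = random.Random(0)  # fixed seed so reruns give the same estimate
trials = 100_000
hits = sum(1 for _ in range(trials) if rng.randint(1, 6) + rng.randint(1, 6) == 7)
estimate = hits / trials
```

The simulated estimate is a statistic computed from data, while the exact value comes from the probability model; comparing the two is the link between the two terms on this card.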
DATA
simple random sampling, stratified sampling, cluster sampling
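SRS: draw one random sample in which every member of the population has an equal chance of being selected
SS: sort the population into homogeneous strata, then sample from each stratum in proportion to its share of the population
CS: sort the population into heterogeneous clusters, randomly select some clusters, and sample from the selected clusters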
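The three sampling methods can be sketched as below using the standard library. The student population, the `year` strata, and the section clusters are invented for illustration, and the cluster version shown is the one-stage variant (everyone in a chosen cluster is included).

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every member has an equal chance of selection."""
    return rng.sample(population, n)

def stratified_sample(population, key, n, rng):
    """SS: sample each stratum in proportion to its population share."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        # Rounding can make the total differ slightly from n in general.
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """CS (one-stage): pick whole clusters at random, keep all their members."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(42)  # fixed seed for repeatable draws

# Made-up population: 60 first-years and 40 seniors.
population = [{"id": i, "year": "first-year" if i < 60 else "senior"} for i in range(100)]
clusters = {f"section {s}": population[s * 20:(s + 1) * 20] for s in range(5)}

srs = simple_random_sample(population, 10, rng)
ss = stratified_sample(population, lambda p: p["year"], 10, rng)
cs = cluster_sample(clusters, 2, rng)
```

With a 60/40 population split, the stratified sample of 10 always contains 6 first-years and 4 seniors, whereas the simple random sample only matches that split on average.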
DATA
direct network effects
more users/usage of a product leads to a direct increase in its value to existing users
ex. telephones, Facebook