Midterm Flashcards
DATA
big data
extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time
DATA
5 V’s of big data
volume → scale of data
variety → different forms of data
veracity → uncertainty of data
can you trust it?
are you required to clean data?
velocity → speed at which data arrives and must be analyzed (streaming data)
value → what we get out of the data
to answer questions
DATA
5 principles of data ethics for business professionals
ownership: individuals own their personal info
* consent through signed written agreements, digital privacy policies, pop-ups with checkboxes
transparency: subjects have a right to know how you plan to collect, store, and use their data
privacy: safeguard personally identifiable information (PII) via dual authentication, file encryption
de-identify datasets → remove PII
intention: why are you collecting the data?
outcomes: disparate impact (outcomes can be harmful even if intentions are good)
ex. arrest ads
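A minimal sketch of the de-identification step from the privacy principle, assuming records are plain dictionaries. The field names in `PII_FIELDS` are hypothetical; which columns count as PII depends on the dataset.

```python
# Hypothetical set of PII column names (an assumption for illustration).
PII_FIELDS = {"name", "email", "phone"}


def deidentify(records, pii_fields=PII_FIELDS):
    """Return copies of the records with the PII fields dropped."""
    return [
        {key: value for key, value in record.items() if key not in pii_fields}
        for record in records
    ]


# Made-up customer records.
customers = [
    {"name": "Ada", "email": "ada@example.com", "zip": "02139", "spend": 120.5},
    {"name": "Bo", "email": "bo@example.com", "zip": "94105", "spend": 80.0},
]

clean = deidentify(customers)
print(clean)  # only "zip" and "spend" remain in each record
```

Dropping direct identifiers is only a first step: remaining fields such as a zip code can still help re-identify someone when combined with other data.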
DATA
data analysis process
- define why you need data analysis
- begin collecting data from sources
- clean the data (remove unnecessary or erroneous data)
- begin analyzing the data
- interpret results and apply them
DATA
exploratory data analysis techniques
summary stats: mean, median, mode, min, max, SD
data visualization: charts, graphs
outlier detection: Z-score, box plot, scatter plots
correlation analysis: correlation matrices, scatter plots
data distribution assessment: histograms, density plots
dimensionality reduction: PCA
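The summary stats and Z-score outlier detection listed above can be sketched with Python's standard library. The sample values are made up, and the cutoff of 2.5 is an arbitrary assumption chosen for this small sample, not a fixed rule.

```python
import statistics

# Made-up sample with one suspicious value at the end.
values = [12, 15, 14, 13, 15, 16, 14, 15, 13, 60]

# Summary stats: mean, median, mode, min, max, SD (sample standard deviation).
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "min": min(values),
    "max": max(values),
    "sd": statistics.stdev(values),
}


def zscore_outliers(data, threshold=2.5):
    """Return values whose Z-score (distance from the mean in SDs) exceeds the threshold."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mu) / sd > threshold]


outliers = zscore_outliers(values)
print(summary)
print(outliers)  # [60]
```

Note how the single outlier pulls the mean (18.7) well above the median (14.5), which is one reason to report both.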
DATA
business intelligence
tools and techniques that process data and conduct statistical analysis for insight and discovery
used to discover meaningful relationships in the data, detect trends, identify opportunities and risks
DATA
data ethics
moral obligations of gathering, protecting, and using personally identifiable info and how it affects individuals
to protect customers’ safety, save org from legal issues
DATA
where can algorithms have bias?
ethical use of algorithms → bias:
1. training: algorithms trained on unrepresentative datasets favor some outcomes
2. code: might have been written to produce biased results
3. feedback: the algorithm can be influenced by biased feedback
DATA
data network effect
ex. of companies
growth cycle in which data is used to acquire customers, who create more data, which attracts more customers
* common growth model for ecommerce
* smart companies use the data to inform investment in their operations + build defensible business models
* have to cultivate cultures that facilitate the data network effect
ex. of companies: Netflix, Tesla
DATA
do you start with building the data infrastructure or with the data? what are the issues involved?
start with infrastructure: issue is where the data will come from
start with data: build the infrastructure over time; issue is that the data is hard to store initially
DATA
data integrity
accuracy, consistency, and reliability of data throughout its lifecycle
DATA
data exploration
data analytics process where analysts investigate the dataset to gain insights, identify patterns, and understand the underlying structure of the data
helps understand the data, assess the data quality, select important features of data, detect outliers, and identify relationships and patterns.
DATA
statistics, probability
statistics — branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data
probability — a mathematical tool used to study randomness; the chance of an event occurring
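A small illustration of "the chance of an event occurring", assuming a fair six-sided die (the die and the event are chosen here only as an example). It computes the exact probability by counting outcomes, then estimates it by simulation.

```python
import random
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]                      # fair die (assumption)
event = [o for o in outcomes if o % 2 == 0]        # event: roll an even number

# Exact probability = favorable outcomes / all equally likely outcomes.
exact = Fraction(len(event), len(outcomes))

# Estimate the same probability from repeated random trials.
rng = random.Random(0)                             # fixed seed for repeatability
trials = 10_000
hits = sum(1 for _ in range(trials) if rng.choice(outcomes) % 2 == 0)
estimate = hits / trials

print(exact)     # 1/2
print(estimate)  # close to 0.5
```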
DATA
simple random sampling, stratified sampling, cluster sampling
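SRS: take a single random sample in which every member of the population has an equal chance of being selected
SS: sort into homogeneous strata, then take random samples from each stratum in proportion to its share of the population
CS: sort into heterogeneous clusters, randomly select some clusters, and sample from the selected clusters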
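A rough sketch of the three sampling methods using only the standard library. The population, the `region` strata, and the `store` clusters are all made up for illustration; the cluster version shown is the one-stage variant, which keeps everyone in the chosen clusters.

```python
import random

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical population: 100 people, 60 in "north" and 40 in "south",
# spread evenly across 5 stores.
population = [
    {"id": i, "region": "north" if i < 60 else "south", "store": i % 5}
    for i in range(100)
]


def simple_random_sample(pop, n):
    """SRS: every member has an equal chance of being picked."""
    return rng.sample(pop, n)


def stratified_sample(pop, key, n):
    """SS: sample each stratum in proportion to its share of the population."""
    strata = {}
    for person in pop:
        strata.setdefault(person[key], []).append(person)
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(pop))  # rounding may shift the total slightly
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(pop, key, n_clusters):
    """CS (one-stage): pick whole clusters at random and keep all their members."""
    clusters = sorted({person[key] for person in pop})
    chosen = set(rng.sample(clusters, n_clusters))
    return [person for person in pop if person[key] in chosen]


srs = simple_random_sample(population, 10)
strat = stratified_sample(population, "region", 10)
clus = cluster_sample(population, "store", 2)
print(len(srs), len(strat), len(clus))  # 10 10 40
```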
DATA
direct network effects
more users or usage of a product directly increases its value to existing users
ex. telephones, Facebook
DATA
cloud databases vs warehouses
warehouses: expensive and time-consuming to build, hard to scale, analytics depends on hardware, intensive interactions between IT staff and data scientists
modern cloud solutions: easy setup, minimal upfront cost, extremely scalable, analytics can be done in web browsers anywhere anytime, minimal interactions between IT staff and data scientists
DATA
ETL process
data marts vs warehouses?
extract data from different sources
transform the extracted data into the desired format for storage
load the transformed data into a data warehouse or data mart for analytics purposes
data warehouses are larger and centralized (whole org), while data marts are usually department-specific
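The extract, transform, load steps can be sketched end to end as below. This is a toy version under stated assumptions: an in-memory CSV stands in for the real sources, an in-memory SQLite database stands in for the warehouse or data mart, and the `sale` table and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read rows from a source (an in-memory CSV stands in for a real file or API).
raw_csv = io.StringIO("order_id,amount,country\n1, 19.99 ,us\n2,5.00,CA\n3,,us\n")
rows = list(csv.DictReader(raw_csv))


# Transform: trim whitespace, convert types, standardize country codes,
# and drop rows with a missing amount.
def transform(row):
    amount = row["amount"].strip()
    if not amount:
        return None
    return (int(row["order_id"]), float(amount), row["country"].strip().upper())


cleaned = [t for t in (transform(r) for r in rows) if t is not None]

# Load: write the transformed rows into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?)", cleaned)
conn.commit()

# The loaded data is now ready for analytics queries.
total_by_country = conn.execute(
    "SELECT country, SUM(amount) FROM sale GROUP BY country ORDER BY country"
).fetchall()
print(total_by_country)  # [('CA', 5.0), ('US', 19.99)]
```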
DATA
database, relational database
formatting practices?
database — any collection of related information
relational databases — organize data into 1 or more tables
* each table has columns (fields, attributes) and rows (records, obs)
* a unique key identifies each row
formatting practices: names should be lowercase, have no spaces, and be singular; column names should be unique and different from the table name
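A minimal sketch of one relational table that follows these practices, using Python's built-in `sqlite3`. The `customer` table and its columns are hypothetical; the point is that the unique key identifies each row and the database rejects a duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Lowercase, no spaces, singular names; no column shares the table's name.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- unique key for each row
        first_name  TEXT NOT NULL,
        city        TEXT
    )
""")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ada", "Boston"), (2, "Bo", "Austin")],
)

# Reusing an existing key violates uniqueness, so the insert fails.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Cy', 'Denver')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(duplicate_rejected, row_count)  # True 2
```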
DATA
relational database management systems (RDBMS)
help users create and maintain a relational database
* ex. MySQL, Oracle, PostgreSQL
* provide access to data using a declarative language, like SQL
SQL
types of joins
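left join = all rows of x, plus matching info from y (NULL where there is no match)
right join = all rows of y, plus matching info from x (NULL where there is no match)
inner join = only rows with a match in both x and y; the default when JOIN is written without a type
full outer join = all rows from both x and y, matched where possible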
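The four joins can be compared on two tiny made-up tables, `x` and `y`, using Python's built-in `sqlite3`. One assumption to note: SQLite only supports RIGHT JOIN and FULL OUTER JOIN from version 3.39, so this sketch emulates them with LEFT JOIN and UNION; in most other RDBMSs you would write them directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER, x_val TEXT);
    CREATE TABLE y (id INTEGER, y_val TEXT);
    INSERT INTO x VALUES (1, 'x1'), (2, 'x2');
    INSERT INTO y VALUES (2, 'y2'), (3, 'y3');
""")


def run(sql):
    return conn.execute(sql).fetchall()


# Plain JOIN is an inner join: only id 2 appears in both tables.
inner = run("SELECT x.id, x_val, y_val FROM x JOIN y ON x.id = y.id")

# Left join: every row of x; y_val is NULL (None) where y has no match.
left = run("SELECT x.id, x_val, y_val FROM x LEFT JOIN y ON x.id = y.id ORDER BY x.id")

# Right join, emulated by swapping the tables in a left join.
right = run("SELECT y.id, x_val, y_val FROM y LEFT JOIN x ON x.id = y.id ORDER BY y.id")

# Full outer join, emulated as the union of the two left joins.
full = run("""
    SELECT x.id, x_val, y_val FROM x LEFT JOIN y ON x.id = y.id
    UNION
    SELECT y.id, x_val, y_val FROM y LEFT JOIN x ON x.id = y.id
    ORDER BY 1
""")

print(inner)  # [(2, 'x2', 'y2')]
print(left)   # [(1, 'x1', None), (2, 'x2', 'y2')]
print(right)  # [(2, 'x2', 'y2'), (3, None, 'y3')]
print(full)   # [(1, 'x1', None), (2, 'x2', 'y2'), (3, None, 'y3')]
```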
SQL
data types
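VARCHAR: variable-length character string; VARCHAR(n) stores strings of up to n characters, big or small
INT: whole numbers (integers)
NUMERIC: exact decimal number with flexible precision and scale, e.g. NUMERIC(10, 2); unlike a float, it is not an approximation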
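In standard SQL, NUMERIC is an exact decimal type, which is why it suits values like money. Python's `decimal` module is used below purely as an analogy (an assumption for illustration, not how any particular database implements NUMERIC) to show the difference from a binary float.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats are approximations, so simple decimal sums can be slightly off.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Exact decimals behave the way NUMERIC values are expected to.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Fixing the scale to 2 decimal places, similar in spirit to NUMERIC(10, 2).
price = Decimal("19.999").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(float_sum_is_exact)    # False
print(decimal_sum_is_exact)  # True
print(price)                 # 20.00
```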
TABLEAU
what is visualization? what is a good chart
visualization: intermingling of scientific and design traditions
good chart: high contextual effectiveness, good design execution
TABLEAU
item hierarchy + how to create
item hierarchy: shows the organizational structure of the objects within the dashboard
* drag an attribute under another
TABLEAU
when would you use the following chart types?
1. scatterplot
2. histogram
3. bar chart
4. line chart
5. treemap
1. show the relationship between 2 measures
2. show the distribution of 1 measure
3. compare measures across dimension categories
4. show changes in a measure over time or another continuous measure
5. show the relative size of measures that make up parts of a whole