midterm Flashcards
DATA
big data
extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time
DATA
business intelligence
tools and techniques that process data and conduct statistical analysis for insight and discovery
used to discover meaningful relationships in the data, detect trends, identify opportunities and risks
DATA
data ethics
moral obligations of gathering, protecting, and using personally identifiable info and how it affects individuals
to protect customers’ safety, save org from legal issues
DATA
where can algorithms have bias?
ethical use of algorithms → bias:
1. training — unrepresentative datasets = favors some outcomes
2. code — might have been written to produce biased results
3. feedback — can be influenced by biased feedback
DATA
data network effect
ex. of companies
growth cycle in which data is used to acquire customers, who create more data, which attracts more customers
* common growth model for ecommerce
* smart companies use the data to inform investment in their operations + build defensible business models
* have to cultivate cultures that facilitate the data network effect
Netflix, Tesla
DATA
do you start with building the infrastructure of the data? what are the issues involved?
start with infrastructure: where do you get the data?
start with data: build the infrastructure over time → hard to store the data initially
DATA
data integrity
accuracy, consistency, and reliability of data throughout its lifecycle
DATA
data exploration
data analytics process where analysts investigate the dataset to gain insights, identify patterns, and understand the underlying structure of the data
helps understand the data, assess the data quality, select important features of data, detect outliers, and identify relationships and patterns.
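ex. (sketch): a first pass over a dataset might compute summary statistics and flag outliers. The `orders` values and both helper names are made up for illustration, and the 1.5 × IQR rule is one common convention for outliers, not the only one.

```python
import statistics

# Hypothetical sample: daily order counts; the last value looks suspicious.
orders = [12, 15, 14, 13, 16, 15, 14, 90]

def summarize(values):
    """Return basic descriptive statistics for a first look at the data."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

summary = summarize(orders)
outliers = iqr_outliers(orders)
print(summary)
print("outliers:", outliers)
```

The mean is pulled well above the median by the single large value, which is the kind of data-quality signal exploration is meant to surface.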
DATA
statistics, probability
statistics — branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data
probability — a mathematical tool used to study randomness; the chance of an event occurring
DATA
simple random sampling, stratified sampling, cluster sampling
SRS: draw one random sample from the whole population; every member has an equal chance of being selected
SS: sort into homogeneous strata, then randomly sample from each stratum in proportion to its share of the population
CS: sort into heterogeneous clusters, then randomly select whole clusters and sample from within them
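ex. (sketch): the three sampling schemes on a made-up population. The `population` records, the field names, and the function names are all hypothetical. The cluster version here keeps every member of each chosen cluster (one-stage cluster sampling), which is only one possible variant.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 students, each with a class year and a dorm.
population = [
    {"id": i, "year": ["first", "second"][i % 2], "dorm": f"dorm_{i % 5}"}
    for i in range(100)
]

def group_by(units, key):
    """Group units into lists keyed by the value of one field."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit[key]].append(unit)
    return groups

def simple_random_sample(units, k):
    """SRS: every unit has the same chance of being picked."""
    return rng.sample(units, k)

def stratified_sample(units, key, fraction):
    """SS: sample the same fraction from every stratum (proportionate)."""
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def cluster_sample(units, key, n_clusters):
    """CS: pick whole clusters at random, then keep all of their members."""
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, "year", 0.1)
clust = cluster_sample(population, "dorm", 2)
```

With a 10% fraction, the stratified sample always holds 5 first-year and 5 second-year students, while the simple random sample only matches those proportions on average.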
DATA
direct network effects
increased users/usage of a product lead to direct increase in the value to existing users
ex. telephones, facebook
DATA
cloud databases vs warehouses
warehouses: expensive and time-consuming to build, hard to scale, analytics depends on hardware, intensive interactions between IT and data scientists
modern cloud solutions: easy setup, minimal upfront cost, extremely scalable, analytics can be done in web browsers anywhere anytime, minimal interactions between IT and data scientists
DATA
ETL process
data marts vs warehouses?
extract data from different sources
transform the extracted data into desirable formats for further storage
load the transformed data into a data warehouse or data mart for analytics purposes
data warehouses are larger and centralized (whole org), while data marts are usually department-specific
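ex. (sketch): the three ETL steps in miniature. The CSV text, the `sales_order` table, and the function names are invented for illustration, and an in-memory SQLite database stands in for a real warehouse or data mart.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export with inconsistent formatting.
RAW_CSV = """order_id,customer,amount
1, Alice ,10.50
2,BOB,3
3,carol,7.25
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: fix types and normalize customer names."""
    return [
        (int(r["order_id"]), r["customer"].strip().lower(), float(r["amount"]))
        for r in rows
    ]

def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_order "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_order VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse or data mart
load(conn, transform(extract(RAW_CSV)))

customers = [row[0] for row in conn.execute(
    "SELECT customer FROM sales_order ORDER BY order_id")]
total = conn.execute("SELECT SUM(amount) FROM sales_order").fetchone()[0]
print(customers, total)
```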
DATA
database, relational database
formatting practices?
database — any collection of related information
relational databases — organize data into 1 or more tables
* each table has columns (fields, attributes) and rows (records, obs)
* a unique key identifies each row
formatting: column names should be lowercase, have no spaces, be singular, and be unique + different from the table name
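ex. (sketch): one table with columns, rows, and a unique key. The `customer` table and its columns are made up, and SQLite (via Python's built-in `sqlite3` module) is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column names: lowercase, no spaces, singular, and different from the table name.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- unique key identifying each row
        full_name   TEXT NOT NULL,
        email       TEXT UNIQUE
    )
""")

# Each tuple is one row (record) of the table.
conn.executemany(
    "INSERT INTO customer (customer_id, full_name, email) VALUES (?, ?, ?)",
    [(1, "Ada Lovelace", "ada@example.com"),
     (2, "Alan Turing", "alan@example.com")],
)

# Reusing an existing key is rejected, which is what keeps rows identifiable.
try:
    conn.execute("INSERT INTO customer VALUES (1, 'Duplicate', 'dup@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(duplicate_rejected, row_count)
```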
DATA
relational database management systems (RDBMS)
help users create and maintain a relational database
* ex. MySQL, Oracle, PostgreSQL
* provide access to data using a declarative language, like SQL
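ex. (sketch): "declarative" means the query states what result is wanted and the RDBMS decides how to compute it. The `purchase` table and its data are hypothetical, and SQLite is used here as a convenient stand-in for the systems listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase "
    "(purchase_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchase (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 5.0), ("east", 2.5)],
)

# The query describes the desired result (total per region);
# no loops or explicit access paths are written by the user.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM purchase GROUP BY region ORDER BY region"
).fetchall()
print(rows)
```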