Midterm 1 Flashcards
Data Science Lifecycle Step 1
Frame the problem
Data Science Lifecycle Step 2
Collect raw data
Data Science Lifecycle Step 3
Process the data
Data Science Lifecycle Step 4
Explore the data
Data Science Lifecycle Step 5
Perform in-depth analysis
Data Science Lifecycle Step 6
Communicate results
Association
any relation or link
Causality
One thing causes the other
Categorical
Each value is from a fixed inventory
Numerical
Each value is a number (not a code)
Values
can be numerical or categorical, and of many subtypes within these
Distribution
For each different value of the variable, the frequency of individuals that have that value
Randomize
If you assign individuals to treatment and control at random, then the two groups are likely to be similar apart from the treatment
Random ≠ haphazard (in probability theory), regardless of what the dictionary says
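A minimal sketch of random assignment, assuming the individuals are simply split in half after a shuffle. The function name `assign_groups`, the `seed` parameter, and the participant labels are made up for illustration and do not come from the course materials.

```python
import random


def assign_groups(individuals, seed=None):
    """Randomly split individuals into a treatment group and a control group."""
    rng = random.Random(seed)      # a seed makes the shuffle repeatable
    shuffled = list(individuals)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)          # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participants, used only for illustration.
participants = [f"person_{i}" for i in range(10)]
treatment, control = assign_groups(participants, seed=0)

print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in which group, no trait of the individuals (or preference of the experimenter) can systematically favor one group.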
Assignment statements
Statements don’t have a value; they perform an action
f(27)
f is the function to call; 27 is the argument to the function. Read as “Call f on 27”
t.select(label)
constructs a new table with just the specified columns
t.drop(label)
constructs a new table in which the specified columns are omitted
t.sort(label)
constructs a new table with rows sorted by the specified column
t.where(label, condition)
constructs a new table with just the rows that match the condition
Integers
an integer of any size
an int never has a decimal point
Floats
has an optional fractional part
always has a decimal point
may use scientific notation
they have limited size (but the limit is huge)
they have limited precision of about 15-16 significant digits
after arithmetic the final few decimal places can be wrong
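The int and float cards above can be checked directly in plain Python. This is a small sketch; the variable names are chosen only for illustration.

```python
# ints: any size, never a decimal point
big = 2 ** 100
print(big)                  # exact, no rounding

# floats: always a decimal point, may use scientific notation
small = 1.5e-3              # same value as 0.0015
print(type(3), type(3.0))   # int vs. float

# after arithmetic, the final few digits can be wrong
total = 0.1 + 0.2
print(total)
print(total == 0.3)         # False: the last digits differ

# so compare floats with a tolerance instead of ==
print(abs(total - 0.3) < 1e-9)   # True
```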
string
a sequence of characters of any length, e.g. ‘A’
Arrays
A collection of things
sequence of values of the same type (Arrays -> Columns)
Ranges
A range is an array of consecutive numbers
np.arange(end)
An array of increasing integers from 0 up to (but not including) end
np.arange(start,end)
An array of increasing integers from start up to (but not including) end
np.arange(start,end,step)
A range with step between consecutive values
NOTE: The range always includes start but excludes end
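A short sketch of the three forms of `np.arange`, assuming NumPy is installed and imported as `np`. The variable names are illustrative.

```python
import numpy as np

first_five = np.arange(5)          # 0, 1, 2, 3, 4  (end is excluded)
three_to_seven = np.arange(3, 8)   # 3, 4, 5, 6, 7  (start included, end excluded)
by_twos = np.arange(0, 10, 2)      # 0, 2, 4, 6, 8  (step of 2)

print(first_five)
print(three_to_seven)
print(by_twos)
```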
Table.read_table(filename)
reads a table from a spreadsheet
Numerical Attribute types
Each value is from a numerical scale
Numerical measurements are ordered
Differences are meaningful
Categorical Attribute types
Each value is from a fixed inventory
May or may not have an ordering
Categories are the same or different
Use line plots for sequential data if
Your x axis has an order
Sequential differences in y values are meaningful
There’s only one y-value for each x-value
Usually x-axis is time or distance
Use scatter plots for non-sequential data
When you’re looking for associations
Binning
counting the number of numerical values that lie within ranges, called bins
Bins
defined by their lower bounds (inclusive)
The upper bound is the lower bound of the next bin
Histogram
Chart that displays the distribution of a numerical value / attribute
Uses bins; there is one bar corresponding to each bin
Uses the area principle
The area of each bar is the percent of individuals in the corresponding bin
Height formula
(% in bin/width of bin)
Area of bar formula
% in bin = Height x width of bin
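The height and area formulas can be worked through on a small example. This is a sketch in plain Python; the counts and bin edges are made-up numbers, and `bar_heights` is a hypothetical helper, not part of any course library.

```python
def bar_heights(counts, bin_edges):
    """Return the histogram height (percent per unit) of each bin."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        percent = 100 * count / total            # % of individuals in this bin
        width = bin_edges[i + 1] - bin_edges[i]  # upper bound = next lower bound
        heights.append(percent / width)          # height = % in bin / width
    return heights


# Hypothetical data: bins [0, 10), [10, 20), [20, 50)
counts = [10, 30, 60]
bin_edges = [0, 10, 20, 50]

heights = bar_heights(counts, bin_edges)
widths = [bin_edges[i + 1] - bin_edges[i] for i in range(len(counts))]
areas = [h * w for h, w in zip(heights, widths)]  # area = height x width

print(heights)   # [1.0, 3.0, 2.0]
print(areas)     # [10.0, 30.0, 60.0]
```

The last bin holds the most individuals (60%) but is not the tallest bar, because it is three times as wide. The areas, not the heights, show the percents.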
Scatter plot
relation between numerical variables
Line graph
sequential data (over time)
Bar chart
distribution of categorical data
Histogram
distribution of numerical data
Grouped Table
One combo of grouping variables per row
Any number of grouping variables
Aggregate values of all other columns in table
Missing combos absent
Pivot table
One combo of grouping variables per entry
Two grouping variables: columns and rows
Aggregate values of values column
Missing combos = 0 (or empty string)
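The difference between the two layouts can be sketched in plain Python. This is an analogue built on `collections.Counter`, not the table library's own group or pivot methods, and the flavor/container data is invented for illustration.

```python
from collections import Counter

# Hypothetical rows: (flavor, container)
rows = [
    ("chocolate", "cone"),
    ("chocolate", "cup"),
    ("vanilla", "cone"),
    ("chocolate", "cone"),
]

# Grouped layout: one entry per combination that actually occurs.
grouped = Counter(rows)

# Pivot layout: a full grid. The first variable forms the column labels,
# the second forms the row labels, and missing combinations are filled with 0.
flavors = sorted({flavor for flavor, _ in rows})
containers = sorted({container for _, container in rows})
pivot = {
    container: {flavor: grouped.get((flavor, container), 0) for flavor in flavors}
    for container in containers
}

print(dict(grouped))
for container, counts in pivot.items():
    print(container, counts)
```

Here no one ordered vanilla in a cup, so that combination is absent from the grouped result but appears as 0 in the pivot grid.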
Probability
Lowest value: 0
Chance of event that is impossible
Highest value: 1 (or 100%)
Chance of event that is certain
Complement: if an event has chance 70%, then the chance that it doesn’t happen is
100% - 70% = 30%
1 - 0.7 = 0.3
Equally likely outcomes
P(A) = (number of outcomes that make A happen) / (total number of outcomes)
Multiplication Rule
Chance that two events A and B both happen
P(A and B) = P(A happens) x P(B happens given that A has happened)
Addition Rule
If event A can happen in exactly one of two ways, then
P(A) = P(first way) + P(second way)
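The complement, multiplication, and addition rules can be checked with exact fractions. This is a sketch with examples chosen for illustration (a standard 52-card deck and a fair coin); they are not taken from the course materials.

```python
from fractions import Fraction

# Complement: if an event has chance 7/10, the chance it doesn't happen is 3/10.
p_not = 1 - Fraction(7, 10)

# Multiplication rule: chance of drawing two aces in a row without replacement.
p_first_ace = Fraction(4, 52)                 # P(A happens)
p_second_ace_given_first = Fraction(3, 51)    # P(B happens given that A has happened)
p_both_aces = p_first_ace * p_second_ace_given_first

# Addition rule: exactly one head in two fair tosses happens in exactly
# one of two ways, HT or TH.
p_ht = Fraction(1, 2) * Fraction(1, 2)
p_th = Fraction(1, 2) * Fraction(1, 2)
p_one_head = p_ht + p_th

print(p_not)         # 3/10
print(p_both_aces)   # 1/221
print(p_one_head)    # 1/2
```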
Pivot
Cross-classified according to two categorical variables
Produces a grid of counts or aggregated values
Two required arguments
First (A): variable that forms column labels of grid
Second (B): variable that forms row labels of grid
Two optional arguments (include both or neither)
values = column_label_to_aggregate
collect = function_to_aggregate_with
Lists
sequence of values of different types (Lists -> Rows)
Groups
collect rows by some column