PART II: Python For Basic Data Analysis Flashcards
What are the functions you have to import to plot/show a graph?
From pylab (or matplotlib) and numpy
plot, show, xlabel, title, legend, xlim, etc.
-import data using loadtxt
-use linspace to get x values and then calc y
-assign x and y manually with lists
-can also use errorbar()
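A minimal sketch of the options above, assuming matplotlib and numpy are installed. The data values and the output file name are made up for illustration, and loadtxt is left out because it needs a data file on disk.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import numpy as np
from matplotlib.pyplot import plot, errorbar, xlabel, ylabel, title, legend, xlim, savefig

# Option 1: use linspace to get x values, then calculate y
x = np.linspace(0.0, 10.0, 50)
y = np.sin(x)
plot(x, y, "b-", label="sin(x)")  # "b-" = blue solid line

# Option 2: assign x and y manually with lists (made-up numbers),
# drawn with errorbar() so each point gets a vertical error bar
x_pts = [1.0, 3.0, 5.0, 7.0, 9.0]
y_pts = [0.8, 0.2, -0.9, 0.6, 0.4]
y_err = [0.1, 0.1, 0.2, 0.1, 0.15]
errorbar(x_pts, y_pts, yerr=y_err, fmt="ro", label="made-up data")

xlabel("x")
ylabel("y")
title("Line plus data with error bars")
xlim(0, 10)
legend()
savefig("example_plot.png")  # hypothetical file name
```

In an interactive session, drop the two backend lines and call show() instead of savefig().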
How can we modify the lines and colours on a plot?
Colours: r,g,b,c,m,y,k,w
Line styles: - (solid), -- (dashed), : (dotted); o draws circle markers instead of a line
Scatter plots
Use scatter() from pylab if you don't want to connect the dots
Density plots
- for chi^2 mostly
- use imshow() from pylab
- y axis top to bottom, x left to right
- adjust y with origin="lower"
- parameter extent=[xl,xu,yl,yu] gives range of x and y values
- aspect=# specifies aspect ratio of x and y axes
- colorbar() shows range of colours to help read density plot
- also different colour schemes exist! (Ex. spectral() in older pylab versions; newer matplotlib calls this colour map nipy_spectral)
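A rough sketch of a density plot using the parameters above, assuming matplotlib and numpy are installed. The grid function, the axis ranges, and the file name are made up; in practice the grid would hold chi^2 values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import numpy as np
from matplotlib.pyplot import imshow, colorbar, xlabel, ylabel, savefig

# Lower and upper limits for x and y (made-up ranges)
xl, xu, yl, yu = 0.0, 2.0, 0.0, 1.0
x = np.linspace(xl, xu, 40)
y = np.linspace(yl, yu, 20)

# 2D array of values: rows follow y, columns follow x
grid = np.zeros((len(y), len(x)))
for i in range(len(y)):
    for j in range(len(x)):
        grid[i, j] = (x[j] - 1.0) ** 2 + (y[i] - 0.5) ** 2

# origin="lower" puts the first row at the bottom of the picture,
# extent labels the axes with the real x and y ranges
imshow(grid, origin="lower", extent=[xl, xu, yl, yu], aspect=2.0)
colorbar()  # colour scale for reading off the values
xlabel("x")
ylabel("y")
savefig("example_density.png")  # hypothetical file name
```

Without origin="lower" the same array is drawn with the first row at the top, so the y axis appears flipped.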
What kinds of errors or uncertainties are there?
Systematic: shifts every measurement the same way (a bias in one direction)
Random: scatters measurements in both directions at random
What is discrepancy?
Refers to the difference between results
When is discrepancy significant?
If it’s larger than both error ranges combined (as in, there’s no overlap of error bars)
When are the two measurements not consistent?
If the discrepancy is significant
Which means if it’s larger than both error ranges combined
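The no-overlap rule above can be sketched as a small check. The function name and the measurement values are made up for illustration.

```python
def is_significant(a, da, b, db):
    """Return True if the discrepancy |a - b| is larger than
    both error ranges combined (da + db), i.e. the error bars do not overlap."""
    return abs(a - b) > da + db


# Two made-up measurements of the same quantity: 9.8 +/- 0.1 and 10.1 +/- 0.1
not_consistent = is_significant(9.8, 0.1, 10.1, 0.1)

# Same central values with larger errors: 9.8 +/- 0.2 and 10.1 +/- 0.2
still_not_consistent = is_significant(9.8, 0.2, 10.1, 0.2)

print(not_consistent, still_not_consistent)
```

The first pair has a discrepancy of 0.3 against a combined range of 0.2, so it is significant; the second pair overlaps, so the measurements are consistent.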
Error propagation: sums, differences, products, quotients, how???
Sums and differences: ABSOLUTE errors add in quadrature (square root of the sum of squares)
Products and quotients: RELATIVE errors add in quadrature
Follow BEDMAS
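A short sketch of the two quadrature rules, assuming the errors are independent. The function names and the numbers are made up for illustration.

```python
import math


def err_sum(da, db):
    """Error on a + b or a - b: absolute errors added in quadrature."""
    return math.sqrt(da ** 2 + db ** 2)


def err_product(value, a, da, b, db):
    """Error on value = a * b or a / b: relative errors added in quadrature,
    then converted back to an absolute error on the result."""
    rel = math.sqrt((da / a) ** 2 + (db / b) ** 2)
    return abs(value) * rel


# Made-up measurements: a = 10.0 +/- 0.3, b = 5.0 +/- 0.4
a, da = 10.0, 0.3
b, db = 5.0, 0.4

total = a + b
d_total = err_sum(da, db)

product = a * b
d_product = err_product(product, a, da, b, db)

print(total, d_total)
print(product, d_product)
```

For a longer expression, work through it in BEDMAS order and propagate the error one operation at a time.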
Error propagation: two correlated or dependent conditions
Condition with lowest error is considered in error calculations
What is probability?
The chances of getting one of a subset of N outcomes out of a total of T equally likely possible outcomes:
P = N/T
Properties of probability?
- 0 ≤ p ≤ 1
What do we do if we have two independent conditions? (Probability)
Neither can affect the probability of the other and so probabilities are multiplied
How do we represent statistical significance?
- we use n*sig (corresponds to probability of having a result at least n standard deviations away from mean in Gaussian dist)
- 1sig = 0.317
- 2sig = 0.0455
- outcomes are significant if p-value is less than or equal to sig level
- assume two-sided probabilities
Define p-value
Probability of getting data at least as extreme as ours if the null hypothesis is true
Higher = greater match to “nothing happening”
Lower = more different and significant results compared to control/null hypothesis
How do we calculate mean/average?
List: sum(x)/len(x)
Array: x.sum()/x.shape[0] (from numpy), or numpy.mean(x)
What is the median?
Middle value of the sorted set of values: numpy.median()
Mode
Most common value of x
Have to count occurrence of each quantity
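A quick sketch of mean, median and mode on a made-up list, assuming numpy is installed. Using collections.Counter to count occurrences is one possible choice, not something the cards prescribe.

```python
from collections import Counter

import numpy as np

values = [2, 3, 3, 5, 3, 2, 7]  # made-up data

# Mean of a list: sum/len
mean_list = sum(values) / len(values)

# Mean of an array: sum/shape
arr = np.array(values)
mean_array = arr.sum() / arr.shape[0]

# Median: numpy sorts the values internally and picks the middle one
median = np.median(arr)

# Mode: count the occurrence of each value, then take the most common
counts = Counter(values)
mode, mode_count = counts.most_common(1)[0]

print(mean_list, mean_array, median, mode, mode_count)
```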
Formula for variance
sig^2 = (1/N) * sum_{i=1..N} (x_i - xavg)^2
What is variance?
If we’re looking at the range/spread of values, it means how far from the average we usually are
Error in the mean: what is it, what’s the formula?
Measure of how reliable your mean value is
Delta = sig/sqrt(N)
You want more N to reduce it
You’d have to increase N a lot though, since delta only falls as 1/sqrt(N)
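A small sketch of variance, standard deviation and error in the mean, using the 1/N form of the variance from the card above. The data values are made up.

```python
import math

values = [4.9, 5.1, 5.0, 5.2, 4.8]  # made-up repeated measurements
n = len(values)

xavg = sum(values) / n

# Variance: average squared distance from the mean (1/N form)
variance = sum((x - xavg) ** 2 for x in values) / n
sig = math.sqrt(variance)

# Error in the mean: sig / sqrt(N)
delta = sig / math.sqrt(n)

# With the same sig, 4 times as many measurements only halves delta
delta_4n = sig / math.sqrt(4 * n)

print(xavg, variance, sig, delta, delta_4n)
```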
Formula for normal/Gaussian distribution
p(x) = 1/(sig*sqrt(2*pi)) * exp(-(x-xavg)^2/(2*sig^2))
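The formula written as a function, as a minimal sketch. The function name and the crude sum used to check the normalisation are illustrative choices.

```python
import math


def gaussian(x, xavg, sig):
    """Normal distribution p(x) with mean xavg and standard deviation sig."""
    norm = 1.0 / (sig * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-((x - xavg) ** 2) / (2.0 * sig ** 2))


# Height of the peak for xavg = 0, sig = 1
peak = gaussian(0.0, 0.0, 1.0)

# Crude check that the total probability is close to 1:
# add up p(x) * dx from -8 sig to +8 sig
dx = 0.001
area = sum(gaussian(-8.0 + k * dx, 0.0, 1.0) * dx for k in range(16001))

print(peak, area)
```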
Central limit theorem
Distribution of sample means approaches a normal distribution as sample size gets larger
How do we calculate the probability of a given result for a normal distribution?
Integrate it over the range you care about
But the integral has no closed form, so
We have erfc and erf
P(|x - xavg| > n*sig) = erfc(n/sqrt(2))
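The erfc relation can be checked with Python's math module, as a minimal sketch. The function name is made up.

```python
import math


def two_sided_p(n):
    """Probability of landing more than n standard deviations
    from the mean (either side) for a normal distribution."""
    return math.erfc(n / math.sqrt(2.0))


for n in (1, 2, 3):
    print(n, two_sided_p(n))

# The probability of being within n sigma is the complement
within_1sig = 1.0 - two_sided_p(1)
```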
When do you use binomial distribution?
When you want the probability of a discrete event occurring k times in N independent trials, each with the same probability p (no rate or interval given)
Binomial distribution: what do the variables mean?
K is number of discrete events you’re looking for
N is total number of events
P is probability of discrete event occurring
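A sketch using the standard binomial formula P(k) = C(N,k) * p^k * (1-p)^(N-k), which the cards do not write out. The function name and the coin-flip example are made up; math.comb needs Python 3.8 or later.

```python
import math


def binomial(k, n, p):
    """Probability of exactly k events in n trials,
    each with probability p of the event occurring."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


# Made-up example: exactly 3 heads in 10 flips of a fair coin
p_three_heads = binomial(3, 10, 0.5)

# Probabilities over all possible k should add up to 1
total = sum(binomial(k, 10, 0.5) for k in range(11))

print(p_three_heads, total)
```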
When do you use Poisson?
Probability of a number of discrete events occurring in a fixed interval
Average rate is given (as lambda)
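A sketch using the standard Poisson formula P(k) = lambda^k * exp(-lambda) / k!, which the cards do not write out. The function name and the rate are made up.

```python
import math


def poisson(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Made-up example: on average 2 events per interval
lam = 2.0
p_none = poisson(0, lam)
p_two = poisson(2, lam)

# Probabilities over all k add up to 1 (the terms beyond k = 50 are negligible here)
total = sum(poisson(k, lam) for k in range(51))

print(p_none, p_two, total)
```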
When you’re picking out errors in code, what should you look for?
Float values
Syntax errors
What are the largest/smallest numbers in Python for floats and what happens if you exceed them?
Largest magnitude: about +/- 10^308
Exceeds max: inf (overflow)
Below the smallest magnitude: 0 (underflow)
Some operations with inf (e.g. inf - inf) give nan
Limited precision (about 16 significant digits) due to rounding error
Precision for int values in Python?
Arbitrary precision (as long as needed)
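A quick sketch of the float and int behaviour described above. The variable names are made up.

```python
import math
import sys

# Largest float Python can represent
largest = sys.float_info.max

# Overflow: going past the maximum gives inf
overflow = 1e308 * 10

# Underflow: a result too small to represent becomes 0
underflow = 1e-308 / 1e100

# Some operations with inf give nan
not_a_number = overflow - overflow

# Limited precision: rounding error shows up in simple sums
exact_match = (0.1 + 0.2 == 0.3)

# Ints have arbitrary precision: every digit of 2**1000 is kept
big_int = 2 ** 1000
n_digits = len(str(big_int))

print(largest, overflow, underflow, not_a_number, exact_match, n_digits)
```

Because of rounding error, compare floats with a tolerance (for example math.isclose) rather than with ==.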
What is maximum likelihood?
A way of finding the most probable model to reproduce the data you obtained
Method of fitting a curve
How good is a model?
Take the log of the probability (likelihood): log is monotonic, so the maximum stays in the same place
For Gaussian errors, maximizing it is the same as minimizing chi^2
How do we test our models?
Guess better parameter values
Use a least-squares fit to find the lowest value of chi^2 (mathematically or with a density plot)
How do we run the chi^2 process?
Import functions and data
Set different ranges of a,b using set intervals to test
Create array to store all chi^2 values for a,b
Create for loop to calc/store all chi^2
Use density plot or solve mathematically to find lowest chi^2
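The steps above can be sketched as a grid search, assuming numpy is installed. The straight-line model y = a*x + b, the data, the errors, and the parameter ranges are all made up for illustration; real data would normally come from loadtxt.

```python
import numpy as np

# Made-up data roughly following y = 2x + 1, each point with error 0.2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
sig = np.full(5, 0.2)


def model(x, a, b):
    """Hypothetical straight-line model."""
    return a * x + b


# Ranges of a and b to test, at set intervals
a_vals = np.linspace(1.0, 3.0, 201)
b_vals = np.linspace(0.0, 2.0, 201)

# Array to store chi^2 for every (a, b) pair
chi2 = np.zeros((len(a_vals), len(b_vals)))

# Loop to calculate and store all chi^2 values
for i, a in enumerate(a_vals):
    for j, b in enumerate(b_vals):
        chi2[i, j] = np.sum(((y - model(x, a, b)) / sig) ** 2)

# Locate the lowest chi^2 on the grid
i_best, j_best = np.unravel_index(np.argmin(chi2), chi2.shape)
a_best, b_best = a_vals[i_best], b_vals[j_best]

print(a_best, b_best, chi2[i_best, j_best])
```

The chi2 array can also be passed to imshow() to find the minimum by eye on a density plot.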
Solving chi^2 with math
Use chi^2 equation & plug in model for m
Differentiate chi^2 with respect to a, set =0
Solve for a
That looks simple but it’s ugly
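A worked sketch of those steps for the simplest case. The one-parameter model m(x) = a*x, the data points (x_i, y_i), and the errors sigma_i are assumptions made for illustration; the cards do not say which model is used.

```latex
\begin{align*}
\chi^2(a) &= \sum_{i=1}^{N} \frac{(y_i - a x_i)^2}{\sigma_i^2} \\
\frac{d\chi^2}{da} &= -2 \sum_{i=1}^{N} \frac{x_i \,(y_i - a x_i)}{\sigma_i^2} = 0 \\
\sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2} &= a \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} \\
a &= \frac{\sum_{i=1}^{N} x_i y_i / \sigma_i^2}{\sum_{i=1}^{N} x_i^2 / \sigma_i^2}
\end{align*}
```

With a second parameter (such as an intercept b) there is one such equation per parameter, and they have to be solved together, which is where it gets ugly.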
What do you do if it isn’t a linear model?
Use a density plot