Thursday, July 2, 2015

Data Fitting

Oh hai there, folks. This one's going to be super heavy on the statistics.  But this blog is supposed to be a record of what I do this summer, and I've spent at least 1/3 of my time this week learning and practicing statistics with a coding twist.  By the end of this blog post, hopefully you'll know more about statistics, too (or, at least a kind of stats that astronomers use a lot).

First, I'll say that for some background on the material covered in this post, you should check out my earlier "Sigmas" post.  It talks about sigma and normal distributions, which have been really integral to the work I've been doing this past week.

Now let's get to the new stuff.  I said it had a coding twist, which meant that I had to write Python (learn more about Python here) functions that took in a few input variables and returned useful values and plots. In this post, I'm going to describe those functions and maybe even a little bit of the frustration I felt while I was writing them.

The first function I wrote is used to generate data sets. It does this by returning the dependent values of a polynomial function of any degree. To make this work, the user has to input an array of independent variables (or x values) and the coefficients of each term in the polynomial, and they have the option of including a sigma value.  That all sounds kind of abstract, even to me, so let's use a concrete example.

Let's say you want to make fake data that follows the function \(y = x^2+4x-3\), and because you know from the "Sigmas" post that no measurement is perfect, you want to add some "noise" or uncertainty to that data.  Let's say that a typical measurement is about 2 off from the "true" value (in other words, a sigma of 2).

To do this, the user would input an array of x values: 0, 2, 4, 6, 8, 10
and the coefficients for each term (in order of increasing degree): -3, 4, 1
and the sigma value: 2
and the function produces these (noisy) y values: -2.380, 6.168, 28.457, 56.444, 94.166, 134.499
(for comparison, the noise-free values would have been -3, 9, 29, 57, 93, 137)
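If you're curious what that looks like in code, here's a rough sketch of a function like my first one (the names here, like make_data, are just placeholders for the blog, not necessarily what my actual code calls things):

import numpy as np

def make_data(x, coeffs, sigma=0.0):
    """Evaluate a polynomial (coefficients in increasing degree) at x,
    then optionally add Gaussian noise with standard deviation sigma."""
    x = np.asarray(x, dtype=float)
    # y = c0 + c1*x + c2*x**2 + ...
    y = sum(c * x**power for power, c in enumerate(coeffs))
    if sigma > 0:
        y = y + np.random.normal(0.0, sigma, size=x.shape)
    return y

x = np.array([0, 2, 4, 6, 8, 10])
y_obs = make_data(x, [-3, 4, 1], sigma=2)   # noisy version of y = x^2 + 4x - 3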


Yay! But now we have to actually use that data, which is what the other functions are for.

The second function is used to calculate how well modeled data matches observed data. In statistics, this is known as the likelihood (the function actually works with its natural log, \(\ln(L)\), which is much easier to deal with).  The function takes as input the data (either real or the set you generated using the first function), the array of x values you used, and coefficient values like you used before.

The function works by generating a list of \(\mu\)s (the model's predicted y values) using the x values and the coefficients. Those \(\mu\)s are then used in the following equation:

$\ln(L) = -\frac{1}{2}\sum_i \ln\left ( 2\pi\sigma_i^2 \right )-\frac{1}{2}\sum_i\left ( \frac{D_i-\mu_i}{\sigma_i} \right )^2$

Because there are sums, this function returns a single value that basically just tells you how close your model is to your observation. That value is used in the next function.
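Continuing the sketch from above, here's roughly what the likelihood function looks like (I'm also passing in the sigmas, since they show up in the equation; again, the names are just for illustration):

def log_likelihood(data, x, coeffs, sigma):
    """Return ln(L): how well the polynomial model matches the data."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), data.shape)
    # The mus: the model's predicted y value at each x
    mu = sum(c * x**power for power, c in enumerate(coeffs))
    return (-0.5 * np.sum(np.log(2 * np.pi * sigma**2))
            - 0.5 * np.sum(((data - mu) / sigma)**2))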

The third function is my favorite.  It was one of those things that wouldn't work for the loooooongest time, but once I figured it out and looked at it, it was almost embarrassingly simple. It takes as input the "observed" data you have, the x values, and arrays of values that you want to test for each coefficient.  For example, with the values above, I would test arrays spanning [-4,-2], [3,5], and [0,2].  This way, it's guaranteed that when the function tests out all these different values, the true ones are among them.

All it does is create a grid or cube whose axes are the arrays of coefficients you want to test, and calculate the likelihood at each point.  By the end, if you make a contour plot of the final grid, you can literally see which coefficients have the highest likelihood of matching your observed data. If you put those coefficients through the first function, you end up with a line that claims to be the best fit to your observed data. It's so cool!
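A rough sketch of that brute-force search, for the three coefficients in the example above (it reuses the make_data and log_likelihood sketches from earlier; the 21-point spacing of the test arrays is just something I picked for illustration):

c0_vals = np.linspace(-4, -2, 21)   # test values for the constant term
c1_vals = np.linspace(3, 5, 21)     # test values for the x term
c2_vals = np.linspace(0, 2, 21)     # test values for the x^2 term

grid = np.zeros((len(c0_vals), len(c1_vals), len(c2_vals)))
for i, c0 in enumerate(c0_vals):
    for j, c1 in enumerate(c1_vals):
        for k, c2 in enumerate(c2_vals):
            grid[i, j, k] = log_likelihood(y_obs, x, [c0, c1, c2], sigma=2)

# The point with the highest ln(L) gives the most likely coefficients
i, j, k = np.unravel_index(np.argmax(grid), grid.shape)
best_coeffs = [c0_vals[i], c1_vals[j], c2_vals[k]]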
But with more than 3 coefficients (higher-degree polynomials), you can probably imagine that the grid or cube gets really big, and calculating the likelihood that many times would cost a lot of computer time.  Lucky for us, there's another, simpler, more mathematical way to find the best fit to the data.

 Thank you, Adventure Time, for expressing my exact feelings about times like this so well. 

The fourth function is just a lot of matrix math. This isn't a math blog, so I'm not going to write it all out for you, but I'll give you the gist.

The second term in the long equation above?  The bit after the summation symbol is Chi Squared (\(\chi ^2\)).  Chi Squared is really important to statisticians, but the most important thing I learned about it this week is that we want it to be minimized.  How do you minimize things in math? You take the derivative and set it equal to 0.  So that's what we did.  We differentiated \(\chi ^2\) (with respect to the coefficients we're trying to find), translated the summation terms into matrices, and set the result equal to 0.
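If you want just a taste of the matrix version: write the model as \(\mu = Aa\), where \(A\) is a "design matrix" whose columns are the powers of \(x\) (1, \(x\), \(x^2\), ...), \(a\) is the vector of coefficients, \(D\) is the data vector, and \(N\) is a diagonal matrix holding the \(\sigma_i^2\)s.  Setting the derivative of \(\chi ^2\) to zero then boils down to the standard weighted least-squares result (the way it's set up in my actual code may look a little different):

$\frac{\partial \chi ^2}{\partial a} = 0 \;\;\Rightarrow\;\; \left ( A^{T}N^{-1}A \right )a = A^{T}N^{-1}D \;\;\Rightarrow\;\; a = \left ( A^{T}N^{-1}A \right )^{-1}A^{T}N^{-1}D$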

The function returns the best-fit values for the coefficients, just like the last one.  But this function can do polynomials of any degree and it's so much faster than the last. 
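And here's a rough sketch of what that looks like in code, using the same fake data from earlier (np.vander builds the design matrix of powers of x; as before, the names are just my placeholders, not necessarily the real function):

def fit_poly(data, x, sigma, degree):
    """Weighted least-squares fit: solve (A^T N^-1 A) a = A^T N^-1 D."""
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), data.shape)
    A = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ...
    Ninv = np.diag(1.0 / sigma**2)
    lhs = A.T @ Ninv @ A
    rhs = A.T @ Ninv @ data
    return np.linalg.solve(lhs, rhs)   # coefficients, in increasing degree

best_coeffs = fit_poly(y_obs, x, sigma=2, degree=2)   # should land near [-3, 4, 1]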


So, that was my week.  Well, that and a little bit of research, but I'll save those stories for another blog post. 
