Wednesday, June 24, 2015

Sigmas, and Errors, and Mus, Oh My!

If you do science or read about it, you're (probably) familiar with the concept of error bars, but do you really know what they mean?  I'm going to try to explain them in this blog post.

Let's say you're reading a scientific paper and you see this value: \( 37.9 \pm 1.5\).  What does that mean? What does that little plus/minus tell you and why is it there?

That plus/minus exists because no measurement is ever exact. If someone were to measure your height several times, they wouldn't get the same result every time.  You're not growing and shrinking; there's just a source of error (in this case, human error) in the measurement.  The plus/minus, or rather the number after it, tells you how uncertain that particular measurement is.  In fact, it's sometimes called the uncertainty, or the sigma (\(\sigma\)), or the standard deviation.
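To make that concrete, here's a minimal sketch of how you'd get a value like \( 37.9 \pm 1.5\) from repeated measurements: the reported number is the mean of the measurements, and the plus/minus is their standard deviation. (The height values below are made up for illustration.)

```python
# A minimal sketch: estimating a value and its uncertainty from
# repeated measurements of the same quantity. The numbers are made up.
import numpy as np

heights_cm = np.array([188.1, 187.5, 188.4, 187.9, 188.0, 187.6, 188.3])

mu = heights_cm.mean()           # the mean, mu
sigma = heights_cm.std(ddof=1)   # the (sample) standard deviation, sigma

print(f"height = {mu:.1f} +/- {sigma:.1f} cm")
```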

A sigma is a standard unit used in statistics to say how far away a certain measurement is from the mean, or mu (\(\mu\)).
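For example (with made-up numbers), take the \( 37.9 \pm 1.5\) value from above. A hypothetical new measurement of 41.2 would sit a bit more than two sigmas above the mean:

```python
# How many sigmas away from the mean is a given measurement?
# mu and sigma are the example value from the post; the new
# measurement is hypothetical.
mu, sigma = 37.9, 1.5
measurement = 41.2

n_sigmas = (measurement - mu) / sigma
print(f"{measurement} is {n_sigmas:+.1f} sigma from the mean")  # +2.2 sigma
```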

Does this look familiar?

[Figure: a gaussian curve, with the fraction of the data that falls within each \(\sigma\) of \(\mu\) marked.]

This is a gaussian curve, sometimes called a normal distribution. I'll tell you why it probably looks so familiar in a bit, but first I'll tell you what it means. 

The y-axis of the curve above is frequency: it tells you how many times a specific value was recorded (the different measurement values run along the x-axis).  Going back to the height example, the mean, or \(\mu\), is the average of all of the height measurements taken.  If someone asked you how tall you are, you would probably tell them this value, because it's the one that comes up most often.  Similarly, when scientists report a value without an error, they're likely reporting \(\mu\).  

I said before that \(\sigma\) refers to how far from the mean a value is, which you can see in the plot above.  The interval within one \(\sigma\) of \(\mu\) (on either side) contains 68.3% of the data; move more than one \(\sigma\) away from \(\mu\), and you're outside that 68.3%.  That is the definition of 1 \(\sigma\): the step away from the mean that encloses 68.3% of the data. The more sigmas away from the mean you go, the more data you're cutting out.
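If you want to convince yourself of that 68.3% figure, here's a quick sketch (not from the post) that draws a large number of gaussian samples and counts what fraction land within 1, 2, and 3 \(\sigma\) of the mean:

```python
# Numerical check of the 68.3% rule: sample from a gaussian and count
# the fraction of samples within 1, 2, and 3 sigma of the mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
samples = rng.normal(mu, sigma, size=1_000_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(samples - mu) < k * sigma)
    print(f"within {k} sigma: {frac:.3f}")   # ~0.683, ~0.954, ~0.997
```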

There is an equation that relates \(\mu\), \(\sigma\), and the probability (p) of recording a particular measurement:

 $p = \frac{1}{\sigma \sqrt{2\pi}}e^{\left [ -\frac{1}{2}\left ( \frac{A-\mu}{\sigma} \right )^{2} \right ]}$

where A is a specific measurement of, for example, your height. 

In a specific form of this equation, where \(\mu = 0\) and \(\sigma = 1\), adding up (integrating) the values of p between \(-1\) and \(+1\) gives you the 68.3% illustrated in the figure above, and going out to \(\pm 2\) or \(\pm 3\) gives the rest of the percentages. 

In its most general form, it tells you how likely you are to record a certain measurement, given a specific \(\mu\) and \(\sigma\).  For example, the probability of a nurse telling you that you're 4' tall when you're usually measured to be 6'2" (\(\pm\) an inch or two) is really low.  
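Here's a sketch of the equation above written out as a function, plugged into the height example. The specific numbers are just for illustration: 6'2" is 74 inches, 4' is 48 inches, and "an inch or two" of uncertainty is taken to be 1.5 inches.

```python
# The gaussian probability density from the equation above.
import numpy as np

def gaussian(A, mu, sigma):
    """Likelihood of recording the value A, given mu and sigma."""
    return np.exp(-0.5 * ((A - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print(gaussian(74.0, mu=74.0, sigma=1.5))  # the usual reading: the most likely value
print(gaussian(48.0, mu=74.0, sigma=1.5))  # the 4-foot reading: vanishingly small
```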

You might be able to tell by looking at the plot that the values of \(\mu\) and \(\sigma\) really affect the shape of the curve.  (Increasing \(\sigma\) makes the curve fatter, which makes sense: the data are more spread out, so you have to step farther from the mean before you've enclosed 68.3% of them.)  Because of this, and because the area under any piece of a gaussian (and therefore the fraction of the data it contains) is so easy to work out, gaussians are often used to "fit" or model data sets. 
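As a sketch of what "fitting" means in practice (with fake data, since the post doesn't include any): estimate \(\mu\) and \(\sigma\) from the measurements, then overlay the resulting gaussian on a histogram of the data.

```python
# Fit a gaussian to a (simulated) data set by estimating mu and sigma,
# then compare the curve to a histogram of the data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=188.0, scale=1.5, size=5000)   # fake height data, in cm

mu, sigma = data.mean(), data.std()
x = np.linspace(data.min(), data.max(), 200)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

plt.hist(data, bins=50, density=True, alpha=0.5, label="measurements")
plt.plot(x, pdf, label=f"gaussian: mu={mu:.1f}, sigma={sigma:.1f}")
plt.xlabel("height (cm)")
plt.ylabel("frequency (normalized)")
plt.legend()
plt.show()
```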

This might seem kind of naive, assuming that a shape so simple can be used to model complex natural systems, but it's really not!  Gaussians actually occur all over the place in nature because of something called the Central Limit Theorem.  The theorem states that, under certain conditions, the arithmetic mean (or sum) of a sufficiently large number of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed.  

In other, simpler words: let's say a nurse measures your height 50,000 times, and each measurement is independent of all the others.  Each individual measurement is really your true height plus the sum of lots of small, independent errors, and by the theorem, a sum like that is approximately normally distributed.  So if you were to plot those height measurements against their frequency (how many times each specific value was recorded), the result would look like a gaussian.
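Here's a sketch of that thought experiment (the "true" height and the sizes of the error sources are made up): each simulated measurement is the true height plus the sum of many small, independent errors drawn from a decidedly non-gaussian (uniform) distribution, and the histogram of 50,000 such measurements still comes out looking like a gaussian.

```python
# Central Limit Theorem demo: sums of many small independent errors
# produce an approximately gaussian spread of measurements.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
true_height = 188.0                            # cm, made up
n_measurements, n_error_sources = 50_000, 30

# Each row is one measurement; each column is one small source of error.
errors = rng.uniform(-0.3, 0.3, size=(n_measurements, n_error_sources))
measurements = true_height + errors.sum(axis=1)

plt.hist(measurements, bins=60)
plt.xlabel("measured height (cm)")
plt.ylabel("frequency")
plt.show()
```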

That's why the gaussian probably looked so familiar to you!  Because it exists all around you.  I mean, it's actually because gaussians are in all of the books on math, science, and statistics, but I like the first reason more.
