
The Behaviors of Experimental Uncertainties

Histograms

Let’s take a look at how numerical values are distributed in most sets of measurements.  We already mentioned that several measurements of a value will produce varying results when the measuring instrument is sensitive enough.  This also occurs when we measure some particular trait occurring in individuals selected at random from a larger population.  The values that actually occur and the frequency of their occurrence depend on the type of variable being measured; however, we will concentrate on the most common case of the normal or Gaussian distribution.

To illustrate just how a sequence of measurements behaves, consider the following plots of the lengths of a fictitious set of animals.  Figure 13 shows a histogram of these lengths.  A histogram is a plot of the number of occurrences per category (frequency) versus the category.  This kind of plot allows one to see quickly how often a member of each category occurs.  In the case shown, 1000 animals were measured, and their lengths were grouped into 24 categories: bins one-half centimeter wide, ranging from 4 centimeters to 16 centimeters.  The number of occurrences of lengths belonging to each bin is then plotted against the increasing sequence of animal lengths.  (This sequence of categories is ordered because, in this case, the categories are number ranges, and numbers are ordered.)  For example, there are approximately 70 animals with lengths in the range of 8.0 to 8.5 centimeters, 87 with lengths in the range of 11.0 to 11.5 centimeters, 28 in the range of 13.5 to 14.0 centimeters, etc.  If we were to total the occurrences for all bins, we would get 1000.  The plot shows that the more the animal lengths deviate from the 9 to 10 centimeter range near the center of this distribution, the less frequently they occur in the sample of 1000.  This group of animals seems to have lengths clustered near 9 to 10 centimeters.
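The binning described above can be sketched in a few lines of Python.  This is only an illustration: the 1000 lengths are simulated, and the mean of 9.5 cm and standard deviation of 2.0 cm used to generate them are assumptions, not the values behind Figure 13.  The function name bin_lengths is likewise hypothetical.

```python
import random

def bin_lengths(lengths, low=4.0, high=16.0, width=0.5):
    """Count how many lengths fall in each bin [low + i*width, low + (i+1)*width)."""
    n_bins = int(round((high - low) / width))
    counts = [0] * n_bins
    for x in lengths:
        if low <= x < high:
            counts[int((x - low) // width)] += 1
    return counts

# Simulated data (assumption): 1000 lengths in cm, drawn from a normal
# distribution with an illustrative mean of 9.5 and standard deviation of 2.0.
rng = random.Random(0)
lengths = []
while len(lengths) < 1000:
    x = rng.gauss(9.5, 2.0)
    if 4.0 <= x < 16.0:  # keep only lengths inside the plotted range
        lengths.append(x)

counts = bin_lengths(lengths)

# Print a crude text histogram: one '#' per five animals in the bin.
for i, c in enumerate(counts):
    left = 4.0 + 0.5 * i
    print(f"{left:4.1f}-{left + 0.5:4.1f} cm: {c:3d} " + "#" * (c // 5))
```

Totaling the 24 counts recovers the 1000 animals measured, which is the check the text describes.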

Figure 14 is similar to Figure 13, with one major difference.  The number of occurrences for each bin is divided by the total number of individuals measured (1000).  This gives the relative frequency, which serves as an estimate of the probability of obtaining a length in a given range.  (Multiplying these numbers by 100 gives the percentage of individuals having lengths in a given range.)  If, for example, we want to know the probability that we would encounter “abnormal” lengths within the ranges of 4 to 7 cm or 13 to 16 cm, we simply total the probabilities in all of the bins that cover these ranges.  The dashed line on Figure 14 is a plot of what is called the normal curve.  Other names for this curve are the “bell curve” and the Gaussian distribution.
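As a rough sketch of this calculation, the Python below converts bin counts into relative frequencies and totals the bins covering 4 to 7 cm and 13 to 16 cm.  The counts are invented for illustration (they total 1000 and echo the three values quoted in the text, but they are not read from Figure 13), and the function names are hypothetical.

```python
def relative_frequencies(counts):
    """Divide each bin count by the total number of individuals measured."""
    total = sum(counts)
    return [c / total for c in counts]

def probability_in_range(rel_freq, start, stop, low=4.0, width=0.5):
    """Total the relative frequencies of the bins that cover [start, stop)."""
    first = int(round((start - low) / width))
    last = int(round((stop - low) / width))
    return sum(rel_freq[first:last])

# Invented counts (assumption) for 24 half-centimeter bins from 4 to 16 cm.
counts = [1, 2, 4, 8, 13, 22, 34, 50, 70, 80, 92, 96,
          94, 90, 87, 66, 52, 40, 33, 28, 18, 11, 6, 3]

rel_freq = relative_frequencies(counts)
p_short = probability_in_range(rel_freq, 4.0, 7.0)
p_long = probability_in_range(rel_freq, 13.0, 16.0)

print(f"Pr[4 to 7 cm]   = {p_short:.3f}")
print(f"Pr[13 to 16 cm] = {p_long:.3f}")
print(f"Pr[either]      = {p_short + p_long:.3f}")
```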

The shape of these curves is characteristic of most measurements containing random fluctuations about some average value.  It is this characteristic shape that allows experimenters to make predictions from measurements made on populations of individuals (and from measurements of other physical quantities as well).  Additional kinds of distributions include the binomial distribution, Lorentzian distribution, Poisson distribution, uniform distribution, Weibull distribution, and several others.  But the normal or Gaussian distribution is somewhat special: several of these distributions (the binomial and Poisson, for example) become increasingly similar to the normal distribution as the number of trials or the expected count grows, and the average of many independent measurements tends toward a normal distribution as the sample size (size of the measured group) increases, provided the underlying distribution has a finite variance.  Many phenomena produce measured values that follow this distribution.


Figure 13.


Figure 14.


The Gaussian or “Bell-Shaped” Normal Distribution

The Gaussian or normal distribution has a particular mathematical form called the normal probability density function.  It is given by the expression

$$ f(x) = \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] $$

(Note: exp(z) is the same as e^z.)  In this expression, x is the variable being measured, m is the mean of the distribution of the x’s (often written μ), and s is the standard deviation of the distribution (often written σ).  This function gives the probability per unit interval of x that a particular value of x will occur.  Because it defines a probability per unit interval, it is called a density function.  Figure 15 shows several curves having the same mean of 10 but each having a different standard deviation, s.  Notice that the curve gets wider and lower for larger standard deviations.  This shows that the probability per unit interval is getting smaller because the measurements are spread over a wider range of values.  The curve is also symmetrical about the mean, which is the x-value where the peak occurs.
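A minimal Python sketch of the density function follows, assuming the standard form f(x) = exp(-(x - m)^2 / (2 s^2)) / (s sqrt(2 pi)).  The function name normal_pdf is hypothetical.  The loop prints the peak height f(m) for several standard deviations, which illustrates how the curve gets lower as s grows.

```python
import math

def normal_pdf(x, m, s):
    """Normal probability density at x for mean m and standard deviation s."""
    return math.exp(-((x - m) ** 2) / (2.0 * s ** 2)) / (s * math.sqrt(2.0 * math.pi))

# Peak height at the mean (m = 10) for a range of standard deviations.
for s in (0.25, 0.5, 1.0, 2.0):
    print(f"s = {s:4.2f}: peak height f(m) = {normal_pdf(10.0, 10.0, s):.4f}")
```

Doubling s halves the peak height, because the total area under the curve must stay equal to 1.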

A very important property of any normal curve is that the standard deviation, s, measures the distance from the mean to the x-coordinates of the inflection points of the curve (no matter how wide or narrow the curve).  The inflection points are the places on the curve where the curve changes from being concave downward to being concave upward.  There are two of these on a normal curve, and they are equidistant above and below the mean.  Since the probability density function gives the probability per unit interval that a given value of x will occur, we need to multiply it by the length of the interval to find the probability that x will fall within a specified range.  Thus

$$ \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] dx $$

is the probability that a measurement takes on a value in the range of x to x + dx, where dx is an infinitesimally small length.  To find the probability of a measurement falling within a larger interval, say a to b, we must integrate the probability density function between these two limits.  Thus, the probability that a measured value of x falls within the interval (a, b) is given by

$$ \Pr[a < x < b] = \int_a^b \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] dx $$

This integral is just the area under the curve between a and b.  Figures 16 and 17 show some examples.
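To make the "area under the curve" idea concrete, the sketch below estimates the integral of the normal density from a to b with the trapezoid rule and compares it with the value from Python's standard-library statistics.NormalDist.  The function names, the step count, and the choice of m = 10, s = 2 with limits 8 and 12 are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, m, s):
    """Normal probability density at x for mean m and standard deviation s."""
    return math.exp(-((x - m) ** 2) / (2.0 * s ** 2)) / (s * math.sqrt(2.0 * math.pi))

def area_under_curve(a, b, m, s, steps=10_000):
    """Trapezoid-rule estimate of the integral of the normal density from a to b."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, m, s) + normal_pdf(b, m, s))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, m, s)
    return total * h

m, s = 10.0, 2.0
approx = area_under_curve(8.0, 12.0, m, s)

# Reference value from the cumulative distribution function.
dist = NormalDist(m, s)
exact = dist.cdf(12.0) - dist.cdf(8.0)

print(f"trapezoid estimate: {approx:.6f}")
print(f"cdf difference:     {exact:.6f}")
```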

It has become customary and useful to talk about areas under the curve (probabilities) between multiples of s, or areas (probabilities) outside of multiples of s.  For all normal curves, there is a 68.3% probability that a measured value would fall between (m - s) and (m + s).  There is a 95.4% probability that a measurement would fall between (m - 2s) and (m + 2s).  There is a 99.7% probability that a measurement would fall between (m - 3s) and (m + 3s).  We can also find the probability that a measured value would fall outside these intervals; in fact these probabilities are just 100% minus the corresponding probabilities of falling within the intervals.
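These percentages can be checked with the error function, since for any normal curve the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)).  The short Python sketch below uses only the standard library; the function name prob_within is hypothetical.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    inside = prob_within(k)
    print(f"within {k}s: {100 * inside:.1f}%   outside {k}s: {100 * (1 - inside):.1f}%")
```

The result does not depend on m or s, which is why the same three percentages apply to every normal curve.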

Figure 16 shows three normal curves with the same mean but different standard deviations.  These are shaded outside 1s of the mean to make the various regions easier to see on a single graph.  The unshaded regions for each curve represent the area (probability) within 1s of the mean.  Even though the curves have different heights and widths, this unshaded area constitutes 68.3% of the total area under each curve (the shaded area is 100% - 68.3% = 31.7% of the total area).  The narrowest curve represents the highest precision in measurement, and the widest curve represents the lowest precision measurement.  In the highest precision case, 1s is much closer to the mean than in the lowest precision case, but for each case, there is a 68.3% probability of obtaining a measurement that has a value within the unshaded area and a 31.7% probability of obtaining a value within the shaded areas.

Physicists often express important measurement uncertainties as integer multiples of s.  Thus when quoting an uncertainty, they will sometimes specify that it is a 2s or 3s uncertainty.  This is done by appending to the number and its uncertainty a parenthetical note such as (2s) or (3s), or whatever is intended.  When they do this, others implicitly understand that the value resulting from a 2s measurement has a 95.4% probability of lying within the range quoted, and that the value from a 3s measurement has a 99.7% probability of lying within the range quoted.  Simply stating the uncertainty in a measurement without specifying whether it is a 2s or 3s uncertainty usually means that the measurement is a 1s measurement.  (Some important high-precision measurements have been quoted as 6s measurements, meaning that the probability of the number being within the given range is very close to 100%.)  Another way of interpreting this kind of reporting is that, for a 2s measurement, the quoted result has a less than 5% probability of being wrong.  A 3s measurement has less than a 0.3% chance of being wrong.

Biologists, social scientists, quality control engineers, and others use a slightly different method of quoting the probabilities of a measurement being within a given range.  The most common is the 95% interval.  For a series of measurements made on a variable whose uncertainties follow a normal distribution, 95% of the measured values occur between (m - 1.96s) and (m + 1.96s).  (This is pretty close to the 95.4% that lie between (m - 2s) and (m + 2s).)  A number being quoted at this level has a 95% chance of being correct, or conversely, a 5% chance of being wrong.  Other intervals are used as well, but the main difference between biologists and physicists in their reporting is that biologists prefer the integer percentages, and physicists seem to prefer the integer multiples of s.  They are really speaking the same language.  (As an aside, physicists don’t usually think immediately of the percentages when someone quotes a 2s, 3s, or Ns measurement.  They know from experience that 2s measurements are reasonably high quality, 3s measurements are impressive, and 6s measurements are heroic!)
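The multiplier 1.96 can be recovered by going from a probability back to a number of standard deviations.  The sketch below does this with the inverse cumulative distribution function of the standard normal distribution, as provided by statistics.NormalDist in the Python standard library; the function name two_sided_multiplier is hypothetical.

```python
from statistics import NormalDist

def two_sided_multiplier(coverage):
    """Number of standard deviations on each side of the mean that encloses `coverage`."""
    tail = (1.0 - coverage) / 2.0          # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)

for coverage in (0.90, 0.95, 0.99):
    k = two_sided_multiplier(coverage)
    print(f"{100 * coverage:.0f}% interval: m - {k:.3f}s to m + {k:.3f}s")
```

This is the same conversion in reverse that connects the physicists' 2s interval to its 95.4% probability.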

Figure 17 shows several shaded areas indicating the probabilities of measurements falling within specified intervals along the x-axis.  The curves have a mean of 10 and a standard deviation of 2 (m = 10, and s = 2).  Therefore 1s  below m takes us down to 8, and 1s above m takes us up to 12.  For this distribution, there is a 68.3% probability of obtaining a measured value between 8 and 12.

Figure 15.


Notes for Figure 15:

This figure shows several normal curves, each with a mean of 10 and standard deviations ranging from 0.25 to 2.0.  The smaller the standard deviation, the taller and narrower the curve.  For narrower curves, the measurements are clustering more tightly near the mean.  The standard deviation for any normal curve measures the x-distance from the mean to the inflection points of the curve.  (The inflection points are where the curve changes from being concave downward to being concave upward.)  The area under the curve between these inflection points (from one standard deviation below the mean to one standard deviation above the mean) contains about 68.3% of the total number of measurements.  Said another way, the probability of a measurement having a value between (m - s) and (m + s) is about 68.3%.  The probability of a measurement having a value between (m - 2s) and (m + 2s) is about 95.4%.  The probability of a measurement having a value between (m - 3s) and (m + 3s) is about 99.7%.  The probability of a measurement producing a value more than two standard deviations (2s) from the mean is about 4.6%.  The probability of measuring a value that is more than three standard deviations (3s) away from the mean is about 0.3%.  These different ways of stating the probability of obtaining a given measurement indicate the important properties of the normal curve that allow experimenters to make quantitative estimates of how good their measurements are.


Figure 16.

Notes for Figure 16:

These three normal curves have the same mean but different standard deviations.  The shaded region for each curve represents the area (probability of obtaining a measured value) outside 1s of the mean (the shading was done this way to make these three plots easier to read).  The unshaded regions represent 68.3% of the area in each case.  If each of these curves represents the distribution of measurements of a length, for example, then the narrowest curve represents the highest precision set of measurements, and the widest curve represents the lowest precision set.  In the high precision case, 68.3% of the measured values lie much closer to the mean than they do for the low precision case.  Since measurements with random variations are often distributed according to a normal curve, we can see why quoting the mean and standard deviation is so descriptive of the measuring process.  For low, wide curves, the measurements are spread out over a wide range, meaning that measurements within a unit interval occur with a smaller probability for a wide distribution than they do for a narrow distribution.


Figure 17.


Area equals the probability that a measured value is more than 2s above the mean.


Area equals the probability that a measured value is more than 1s below the mean.


Area equals the probability of a measured value being between 1s and 2s above the mean.
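For the distribution used in Figure 17 (m = 10 and s = 2), the three shaded areas described above can be estimated with a short Python sketch.  It relies on statistics.NormalDist from the standard library, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Distribution from the text: mean 10, standard deviation 2.
dist = NormalDist(mu=10.0, sigma=2.0)

# More than 2s above the mean: x > 14.
above_2s = 1.0 - dist.cdf(14.0)

# More than 1s below the mean: x < 8.
below_1s = dist.cdf(8.0)

# Between 1s and 2s above the mean: 12 < x < 14.
between_1s_2s = dist.cdf(14.0) - dist.cdf(12.0)

print(f"Pr[x > 14]      = {above_2s:.4f}")
print(f"Pr[x < 8]       = {below_1s:.4f}")
print(f"Pr[12 < x < 14] = {between_1s_2s:.4f}")
```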

 

Equivalent Descriptions of Probabilities of Normal Variables

We said that, for a set of measurements following a normal distribution, the probability of a measurement taking on a value within 1s of the mean is 68.3%, and that the probability of obtaining a value within 1.96s of the mean is 95%, etc.  Let’s look at some equivalent mathematical statements of these notions.

For example, another way to say this for the 95% range symmetric about m is

Pr[(m - 1.96s) < x < (m + 1.96s)] = 0.95,

which reads, “The probability that x takes on a value between m - 1.96s and m + 1.96s equals 0.95”.  Another perspective is obtained when we subtract m from both sides of both inequalities.

Pr[ - 1.96s < x - m < + 1.96s ] = 0.95. 

This says that the probability that the difference between x and m is greater than - 1.96s and less than + 1.96s is 0.95.  A third way focuses on the distance between x and the mean, m.

Pr[|x - m| < 1.96s] = 0.95,

which says “The probability that the distance between x and m is less than 1.96s is 0.95”.  And a fourth way is obtained when we divide by s.

Pr[|x - m|/s < 1.96] = 0.95,

or

Pr[|z| < 1.96] = 0.95, where z = (x - m)/s.

This last statement introduces something new and very convenient, namely the variable z.  When z is defined in this way, it becomes a variable that has a normal distribution with a mean of zero and a standard deviation of one.  We can see this a little more easily if we express the probability in the form of an integral as follows:

$$ \Pr[m - 1.96s < x < m + 1.96s] = \int_{m - 1.96s}^{m + 1.96s} \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] dx = 0.95 $$

Now if we make the substitution z = (x - m)/s, then dz = dx/s.  At the lower limit of the integral, where x = m - 1.96s, we find that z = -1.96, and at the upper limit, where x = m + 1.96s, we find that z = +1.96.

Making these substitutions in the integral over the variable x, we get an equivalent integral over the new variable z as follows:

$$ \Pr[-1.96 < z < 1.96] = \int_{-1.96}^{1.96} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{z^2}{2}\right] dz = 0.95 $$

The last integral over the variable z is called the standard normal form of the normal distribution.  It is just the form of the normal integral with m = 0 and s = 1.  Only integrals of the standard normal form are tabulated in mathematical handbooks because, no matter what mean and standard deviation one is dealing with, one can always calculate z = (x - m)/s  for each of the x’s bounding the intervals of interest in the original x distribution.  The z values are then looked up in the tables for the standard normal distribution to find the probabilities for the corresponding intervals in the x distribution.  One can also use the tables to go in the inverse direction from specified probabilities to the intervals that give them.  We just illustrated this for a 95% probability interval between the lower boundary m - 1.96s and the upper boundary m + 1.96s, but the same technique works for any interval.   In general, for a distribution with a mean, m, and a standard deviation, s, the probability that x has a value in the interval from a to b is

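$$ \Pr[a < x < b] = \int_a^b \frac{1}{s\sqrt{2\pi}} \exp\left[-\frac{(x - m)^2}{2s^2}\right] dx = \int_{z_a}^{z_b} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{z^2}{2}\right] dz, $$

where z_a = (a - m)/s and z_b = (b - m)/s.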
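The table-lookup procedure can be sketched in Python by converting the interval limits a and b to z values and then using only the standard normal distribution (mean 0, standard deviation 1).  Here the cumulative distribution function of statistics.NormalDist stands in for a printed table; the names prob_between, z_a, and z_b are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def prob_between(a, b, m, s):
    """Probability that a normal variable with mean m and std. dev. s lies between a and b."""
    z_a = (a - m) / s   # lower limit expressed in standard deviations from the mean
    z_b = (b - m) / s   # upper limit expressed in standard deviations from the mean
    return STANDARD_NORMAL.cdf(z_b) - STANDARD_NORMAL.cdf(z_a)

# The 95% interval gives the same probability whatever m and s are.
for m, s in ((10.0, 0.6), (0.0, 1.0), (250.0, 40.0)):
    p = prob_between(m - 1.96 * s, m + 1.96 * s, m, s)
    print(f"m = {m:6.1f}, s = {s:5.1f}: Pr = {p:.4f}")
```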

As an example, suppose we have a distribution of x-values that has a mean of 10 and a standard deviation of 0.6 (m = 10, and s = 0.6).  Suppose also that we want to find the probability that x is greater than 11, which is the same as asking for the area under the distribution curve of x between the lower limit of 11 and the upper limit of ∞.  Notice that 11 is 1.67 standard deviations away from the mean [(11 - 10)/0.6 = 1.67].  This is equivalent to asking for the probability of z being between z = (11 - 10)/0.6 = 1.67 and z = ∞.  So, in effect, we are asking for Pr[z > 1.67] on the standard normal distribution.  Figure 18 shows the equivalent areas representing the probability we are seeking.  Since the entire area under the normal curve is 1 and the curve is also symmetric, this can also be expressed variously as

Pr[x > 11] = Pr[z > 1.67] = 1.0 - Pr[z < 1.67] = 1 - (0.5 + Pr[0 < z < 1.67]).

The probability that z is less than 1.67 is the area under the standard normal curve from -∞ to 1.67.  Some standard normal tables give cumulative probabilities between -∞ and z.  In this case we look up z in the table, find the cumulative probability between -∞ and z, and subtract this from 1.0.  Other standard normal tables give the cumulative probability between 0 and z, in which case, look up z to find the cumulative probability, add 0.5 (the area of the lower half of the distribution curve), and subtract the result from 1.0 (or equivalently for this problem, subtract the cumulative probability from 0.5).  For this example we get Pr[x > 11] = Pr[z > 1.67] = 0.0475 or 4.75%.
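The worked example can be reproduced with a short Python sketch, using statistics.NormalDist from the standard library in place of a printed table.  The variable names are hypothetical.  Rounding z to two decimals mimics a table lookup; using the unrounded z gives a slightly different probability.

```python
from statistics import NormalDist

m, s = 10.0, 0.6
standard = NormalDist()            # standard normal: mean 0, standard deviation 1

z_exact = (11.0 - m) / s           # 1.666...
z_table = round(z_exact, 2)        # 1.67, as it would be looked up in a table

# Table that gives the cumulative probability from -infinity to z.
p_from_cumulative = 1.0 - standard.cdf(z_table)

# Table that gives the probability from 0 to z.
p_zero_to_z = standard.cdf(z_table) - 0.5
p_from_half = 0.5 - p_zero_to_z

# Same probability without rounding z first.
p_unrounded = 1.0 - standard.cdf(z_exact)

print(f"z (rounded)             = {z_table}")
print(f"Pr[z > 1.67]            = {p_from_cumulative:.4f}")
print(f"0.5 - Pr[0 < z < 1.67]  = {p_from_half:.4f}")
print(f"Pr[x > 11], unrounded z = {p_unrounded:.4f}")
```

Both table styles lead to the same answer, and the small difference in the last line comes only from rounding z to 1.67.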

Figure 18.



Notes for Figure 18:

The upper graph shows a distribution for a variable, x, that has a mean of 10 and a standard deviation of 0.6.  The shaded portion is the area that represents Pr[x > 11].  By making the transformation z = (x - m)/s, we obtain a new variable, z, that has a distribution with a mean of 0 and a standard deviation of 1.  This distribution is called the standard normal distribution, and is shown in the lower graph.  This is the distribution that is tabulated in standard normal tables.  When x = 11, then z = (11 - 10)/0.6 = 1.67.  Thus we can get Pr[x > 11] = Pr[z > 1.67] = 0.0475 from a standard table.

© 2002 Biological Science Institute
All Rights Reserved.