Probability theory models uncertainty. Observational scientists often come across events whose outcome is uncertain. It may be physically impossible, too expensive or even counterproductive to observe all the inputs. The astronomer might want to measure the location and motions of all stars in a globular cluster to understand its dynamical state. But even with the best telescopes, only a fraction of the stars can be located in the two dimensions of sky coordinates with the third distance dimension unobtainable. Only one component (the radial velocity) of the three-dimensional velocity vector can be measured, and this may be accessible for only a few cluster members. Furthermore, limitations of the spectrograph and observing conditions lead to uncertainty in the measured radial velocities. Thus, our knowledge of the structure and dynamics of globular clusters is subject to considerable restrictions and uncertainty.
In developing the basic principles of uncertainty, we will consider both astronomical systems and simple familiar systems such as a tossed coin. The outcome of a toss, heads or tails, is completely determined by the forces on the coin and Newton's laws of motion. But we would need to measure too many parameters of the coin's trajectory and rotations to predict with acceptable reliability which face of the coin will be up. The outcomes of coin tosses are thus considered to be uncertain even though they are regulated by deterministic physical processes.
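A minimal sketch in R of treating the toss as a random event (illustrative, not drawn from the text): repeated simulated tosses let the long-run frequency of heads stand in for the probability we cannot compute from the physics.

  set.seed(1)                                    # for reproducibility
  tosses <- sample(c("H", "T"), 10000, replace = TRUE)
  mean(tosses == "H")                            # relative frequency of heads, near 0.5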
Whenever an astronomer is faced with a dataset that can be presented as a table — rows representing celestial objects and columns representing measured or inferred properties — then the many tools of multivariate statistics come into play. Multivariate datasets also arise in other situations. Astronomical images can be viewed as tables of three variables: right ascension, declination and brightness. Here the spatial variables are in a fixed lattice while the brightness is a random variable. An astronomical datacube has a fourth variable that may be wavelength (for spectro-imaging) or time (for multi-epoch imaging). High-energy (X-ray, gamma-ray, neutrino) detectors give tables where each row is a photon or event with columns representing properties such as arrival direction and energy. Calculations arising from astrophysical models also produce outputs that can be formulated as multivariate datasets, such as N-body simulations of star or galaxy interactions, or hydrodynamical simulations of gas densities and motion.
For multivariate datasets, we designate n for the number of objects in the dataset and p for the number of variables, the dimensionality of the problem. In traditional multivariate analysis, n is large compared to p; statistical methods for high-dimensional problems with p > n are now under development. The variables can have a variety of forms: real numbers representing measurements in any physical unit; integer values representing counts of some variable; ordinal values representing a sequence; binary variables representing “Yes/No” categories; or nonsequential categorical indicators.
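In R, such a dataset is naturally stored as a data frame with n rows and p columns; the sketch below uses invented values purely to illustrate the variable types just listed.

  # Toy multivariate dataset: n = 3 objects, p = 5 variables
  dat <- data.frame(
    Vmag   = c(12.3, 14.1, 9.8),                   # real-valued measurement
    counts = c(120L, 45L, 873L),                   # integer counts
    lumcl  = ordered(c("III", "V", "I"),
                     levels = c("V", "III", "I")), # ordinal sequence
    detect = c(TRUE, FALSE, TRUE),                 # binary Yes/No category
    class  = factor(c("Sy1", "QSO", "Sy2")))       # nonsequential category
  str(dat)                                         # n observations of p variables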
We address multivariate issues in several chapters of this volume. The present chapter on multivariate analysis considers datasets that are commonly displayed in a table of objects and properties.
Spatial data consist of data points in p dimensions, usually p = 2 or 3, which can be interpreted as spatial variables. The variables might give locations in astronomical units or megaparsecs, location in right ascension and declination, or pixel locations on an image. Sometimes nonspatial variables are treated as spatial analogs; for example, stellar distance moduli based on photometry or galaxy redshifts based on spectra are common proxies for radial distances that are merged with sky locations to give approximate three-dimensional locations.
The methods of spatial point processes are not restricted to spatial variables. They can be applied to any distribution of astronomical data in low dimensions: the orbital distributions of asteroids in the Kuiper Belt; mass segregation in stellar clusters; velocity distributions across a triaxial galaxy or within a turbulent giant molecular cloud; elemental abundance variations across the disk of a spiral galaxy; plasma temperatures within a supernova remnant; gravitational potential variations measured from embedded plasma or lensing distortions of background galaxy shapes; and so forth.
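As a hedged sketch of how such an analysis might begin in R, the CRAN package spatstat (our choice here, not prescribed by the text) estimates point-process summary statistics such as Ripley's K function:

  library(spatstat)             # CRAN package for spatial point processes
  set.seed(2)
  X <- rpoispp(lambda = 100)    # simulated Poisson pattern on the unit square
  K <- Kest(X)                  # Ripley's K function estimate
  plot(K)                       # departures from the Poisson curve indicate clustering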
The most intensive study of spatial point processes in astronomy has involved the distribution of galaxies in the two-dimensional sky and in three-dimensional space. One approach, pioneered by Abell (1958), is to locate individual concentrations or “clusters” of galaxies. The principal difficulty is the overlapping of foreground and background galaxies on a cluster, diluting its prominence in two-dimensional projections. The greatest progress is made when spectroscopic redshifts are obtained that, due to Hubble's law of universal expansion, allow the third dimension of galaxy distances to be estimated with reasonable accuracy.
Time-domain astronomy is a newly recognized field devoted to the study of variable phenomena in celestial objects. These variations arise from three basic causes. First, as is evident from observation of the Sun's surface, the rotation of celestial bodies produces periodic variations in their appearance. This effect can be dramatic in cases such as beamed emission from rapidly rotating neutron stars (pulsars).
Second, as is evident from observation of Solar System planets and moons, celestial bodies move about each other in periodic orbits. Orbital motions cause periodic variations in Doppler shifts and, when eclipses are seen, in brightness. One could say that the birth of modern time series analysis dates back to Tycho Brahe's accurate measurement of planetary positions and Johannes Kepler's nonlinear models of their behavior.
Third, though less evident from naked eye observations, intrinsic variations can occur in the luminous output of various bodies due to pulsations, explosions and ejections, and accretion of gas from the environment. The high-energy X-ray and gamma-ray sky is particularly replete with highly variable sources. Classes of variable objects include flares from magnetically active stars, pulsating stars in the instability strip, accretion variations from cataclysmic variable and X-ray binary systems, explosions seen as supernovae and gamma-ray bursts, accretion variations in active galactic nuclei (e.g. Seyfert galaxies, quasi-stellar objects or quasars, and blazars), and the hoped-for detection of gravitational wave signals. A significant fraction of all empirical astronomical studies concerns variable phenomena; see the review by Feigelson (1997) and the symposium New Horizons in Time Domain Astronomy (Griffin et al. 2012).
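As an illustrative sketch (simulated, evenly sampled data; real astronomical series are often unevenly sampled and call for methods such as the Lomb-Scargle periodogram), a hidden period can be recovered in base R:

  set.seed(3)
  t <- 1:1000
  y <- sin(2 * pi * t / 25) + rnorm(1000, sd = 2)  # period-25 signal buried in noise
  spec <- spectrum(y, plot = FALSE)                # raw periodogram
  1 / spec$freq[which.max(spec$spec)]              # recovered period, close to 25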
Spectrometers divide the light centered at wavelength λ into narrow spectral ranges, Δλ. If the resolution R = λ/Δλ > 10, the goals of the observation are generally different from those in photometry, including both measuring spectral lines and characterizing broad features.
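For example (a worked illustration, not from the text), a spectrometer with R = 1000 operating at λ = 5000 Å isolates spectral elements of width Δλ = λ/R = 5 Å, whereas a photometric band at the same wavelength with R = 5 spans 1000 Å.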
There are three basic ways of measuring light spectroscopically:
– Differential-refraction-based, in which the variation of refractive index with wavelength of an optical material is used to separate the wavelengths, as in a prism spectrometer.
– Interference-based, in which the light is divided so a phase delay can be imposed on a portion of it. When the light is recombined, the components interfere at phases that depend on the wavelength, allowing extraction of spectral information. The most widely used examples are diffraction grating, Fabry–Perot, and Fourier spectrometers. Heterodyne spectroscopy also falls into this category, but we will delay discussing it until we reach the submillimeter and radio regimes in Chapter 8.
– Bolometric, in which the signal is based on the energy of the absorbed photon. This method is applied in the X-ray, for example, using CCDs or bolometers, and will be discussed in Chapter 10.
Today, the term “astronomy” is best understood as shorthand for “astronomy and astrophysics”. Astronomy (astro = star and nomos = law in ancient Greek) is the observational study of matter beyond Earth: planets and bodies in the Solar System, stars in the Milky Way Galaxy, galaxies in the Universe, and diffuse matter between these concentrations of mass. The perspective is rooted in our viewpoint on or near Earth, typically using telescopes on mountaintops or robotic satellites to enhance the limited capabilities of our eyes. Astrophysics (astro = star and physis = nature) is the study of the intrinsic nature of astronomical bodies and the processes by which they interact and evolve. This is an indirect, inferential intellectual effort based on the (apparently valid) assumption that physical processes established to rule terrestrial phenomena – gravity, thermodynamics, electromagnetism, quantum mechanics, plasma physics, chemistry, and so forth – also apply to distant cosmic phenomena. Figure 1.1 gives a broad-stroke outline of the major fields and themes of modern astronomy.
The fields of astronomy are often distinguished by the structures under study. There are planetary astronomers (who study our Solar System and extra-solar planetary systems), solar physicists (who study our Sun), stellar astronomers (who study other stars), Galactic astronomers (who study our Milky Way Galaxy), extragalactic astronomers (who study other galaxies), and cosmologists (who study the Universe as a whole).
The development of R (R Development Core Team, 2010) as an independent public-domain statistical computing environment was started in the early 1990s by two statisticians at the University of Auckland, Ross Ihaka and Robert Gentleman. They decided to mimic the S system developed at AT&T during the 1980s by John Chambers and colleagues. By the late 1990s, R development was expanded to a larger core group, and the Comprehensive R Archive Network (CRAN) was created for specialized packages. The group established itself as a non-profit R Foundation based in Vienna, Austria, and began releasing the code biannually as a GNU General Public License software product (Ihaka & Gentleman 1996).
R grew dramatically, both in content and in widespread usage, during the 2000s. CRAN increased exponentially with ∼100 packages in 2001, ∼600 in 2005, ∼2500 in 2010, and ∼3300 by early 2012. The user population is uncertain but was estimated to be ∼2 million people in 2010.
R consists of base software infrastructure together with about 25 core packages providing a variety of data analysis, applied mathematics, statistics, graphics and utility functions. The CRAN add-on packages are mostly supplied by users, sometimes individual experts and sometimes significant user communities in biology, chemistry, economics, geology and other fields. Tables B.1 and B.2 give a sense of the breadth of methodology in R as well as CRAN packages (up to mid-2010).
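In practice, a contributed package is fetched from a CRAN mirror and attached to the session with two commands; the package name below is only an example:

  install.packages("spatstat")   # download and install from a CRAN mirror
  library(spatstat)              # attach the package to the current session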
As demonstrated throughout this volume, astronomical statistical problems are remarkably varied, and no single dataset can exemplify the range of methodological issues raised in modern research. Despite the range and challenges of astronomical data analysis, few astronomical datasets appear in statistical texts or studies. The Zurich (or Wolf) sunspot counts over ∼200 years, showing the 11-year cycle of solar activity, are the most commonly seen (Section C.13).
We present 19 datasets in two classes drawn from contemporary research. Thirteen datasets are used for R applications in this volume; they are listed in Table C.1 and described in Sections C.1–C.13. The full datasets are available on-line at http://astrostatistics.psu.edu/MSMA formatted for immediate use in R. Six additional datasets that, as of this writing, are dynamically changing due to continuing observations are listed in Table C.2 and described in Sections C.14–C.19. Most of these are time series of variable phenomena in the sky.
Tables C.1–C.2 provide a brief title and summary of statistical issues treated in each dataset. Here Nd is the number of datasets, n is the number of datapoints, and p is the dimensionality. In the sections below, for each dataset we introduce the scientific issues, describe and tabulate a portion of the dataset, and outline appropriate statistical exercises.
The datasets presented here can be used for classroom exercises involving a wide range of statistical analyses. Some problems are straightforward, others are challenging but within the scope of R and CRAN, and yet others await advances in astrostatistical methodology and can be used for research purposes.
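A classroom session might begin by reading one of the on-line tables directly into R; the file name below is hypothetical, standing in for whichever dataset is chosen:

  # Hypothetical file name under the MSMA site; substitute the actual dataset
  dat <- read.table("http://astrostatistics.psu.edu/MSMA/datasets/mydata.dat",
                    header = TRUE)
  dim(dat)       # n rows (objects) by p columns (variables)
  summary(dat)   # quick univariate summaries of each column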
For many years, X-ray astronomy depended on gaseous detectors: basically, capacitors or series of capacitors with a voltage across them and filled with gas. Depending upon the value of the voltage, these devices either (1) just collect the charge freed when an energetic particle interacts with the gas (ionization chamber), or (2) provide gain. In order of increasing voltage, gain is obtained by exciting the gas to produce ultraviolet light (scintillation proportional counter); creating modest-sized ionization avalanches with signals still in proportion to the original number of freed electrons (proportional counter); providing gain to saturation (Geiger counter); or yielding a visible spark along the path of the avalanche (spark chamber). The absorbing gas is typically argon or xenon, for which high absorption efficiency in the 0.1–10 keV range requires a path of order 1 cm. Very thin windows are required to admit the X-rays to the sensitive volume – for example, 1 μm of polypropylene to provide > 80% transmission for energies above 0.9 keV. Proportional counters with multiple anode wires provide spatial resolution of a few hundred microns.
Because the atmosphere is opaque to them, X-rays and gamma rays require telescopes and detectors to operate from balloons or, more commonly, from space (with the exception of the highest-energy gamma rays). Initially, the detectors were used without collecting optics; the large detector areas then resulted in high spurious detection rates due to cosmic rays. Anti-coincidence counters were required to identify charged particles coming from random directions, allowing probable X-ray events to be isolated. The necessity of operating at the top of, or above, the atmosphere, together with these requirements on the detector systems, severely limited the angular resolution and sensitive areas that could be achieved.
Most of what we know about astronomical sources comes from measuring their spectral energy distributions (SEDs) or from taking spectra. We can distinguish the two approaches in terms of the spectral resolution, defined as R = λ/Δλ, where λ is the wavelength of observation and Δλ is the range of wavelengths around λ that are combined into a single flux measurement. Photometry refers to the procedures for measuring or comparing SEDs and is typically obtained at R ~ 2–10. It is discussed in this chapter, while spectroscopy (with R ≥ 10) is described in the following one.
In the optical and near-infrared, nearly all the initial photometry was obtained on stars, whose SEDs are a related family of modified blackbodies with relative characteristics determined primarily by a small set of free parameters (e.g., temperature, reddening, composition, surface gravity). Useful comparisons among stars can be obtained relatively easily by defining a photometric system, which is a set of response bands for the [(telescope)-(instrument optics)-(defining optical filter)-(detector)] combination. Comparisons of measurements of stars with such a system, commonly called colors, can reveal their relative temperatures, reddening, and other parameters. Such comparisons are facilitated by defining a set of reference stars whose colors have been determined accurately and that can be used as transfer standards from one unknown star to another. This process is called classical stellar photometry. It does not require that the measurements be converted into physical units; all the results are relative to measurements of a network of stars. Instead, its validity depends on the stability of the photometric system and the accuracy with which it can be reproduced by other astronomers carrying out comparable measurements.
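A minimal sketch (invented magnitudes, ignoring zero-point calibration) of how a color index is formed from measurements in two bands:

  # Magnitudes of three stars in the B and V bands (invented values)
  B <- c(13.2, 10.9, 15.4)
  V <- c(12.5, 10.8, 14.1)
  BV <- B - V     # the B-V color index; larger values indicate redder stars
  BV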
Statistical inference helps the scientist to reach conclusions that extend beyond the obvious and immediate characterization of individual datasets. In some cases, the astronomer measures the properties of a limited sample of objects (often chosen to be brighter or closer than others) in order to learn about the properties of the vast underlying population of similar objects in the Universe. Inference is often based on a statistic, a function of random variables. At the early stages of an investigation, the astronomer might seek simple statistics of the data such as the average value or the slope of a heuristic linear relation. At later stages, the astronomer might measure in great detail the properties of one or a few objects to test the applicability, or to estimate the parameters, of an astrophysical theory thought to underlie the observed phenomenon.
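Both of the simple statistics just mentioned are one-liners in R; the data here are simulated purely for illustration:

  set.seed(4)
  x <- runif(50, 0, 10)
  y <- 2 + 0.5 * x + rnorm(50)   # simulated linear relation with scatter
  mean(y)                        # the average value
  coef(lm(y ~ x))["x"]           # slope of the fitted linear relation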
Statistical inference is so pervasive throughout these astronomical and astrophysical investigations that we are hardly aware of its ubiquitous role. It arises when the astronomer:
– smooths over discrete observations to understand the underlying continuous phenomenon (a sketch follows this list)
– seeks to quantify relationships between observed properties
– tests whether an observation agrees with an assumed astrophysical theory
– divides a sample into subsamples with distinct properties
– tries to compensate for flux limits and nondetections
– investigates the temporal behavior of variable sources
– infers the evolution of cosmic bodies from studies of objects at different stages
– characterizes and models patterns in wavelength, images or space
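For the first item above, a common tool is kernel density estimation; a minimal sketch in base R with simulated data and an illustrative bandwidth:

  set.seed(5)
  obs <- c(rnorm(100, mean = 0), rnorm(50, mean = 4))  # discrete observations
  dens <- density(obs, bw = 0.5)   # kernel-smoothed estimate of the underlying density
  plot(dens)                       # continuous curve inferred from discrete points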