Hopefully by now you've run Rivet a few times and got the hang of the command
line interface and viewing the resulting analysis data files. Maybe you've got
some ideas of analyses that you would like to see in Rivet's library. If so,
then you'll need to know a little about Rivet's internal workings before you can
start coding: with any luck by the end of this section that won't seem
particularly intimidating.
The core objects in Rivet are ``projections'' and ``analyses''. Hopefully
``analyses'' isn't a surprise --- that's just the collection of routines that
will make histograms to compare with reference data, and the only things that
might differ there from experiences with HZTool\cite{Bromley:1995np} are the new histogramming system
and the fact that we've used some object orientation concepts to make life a bit
easier. The meaning of ``projections'', as applied to event analysis, will
probably be less obvious. We'll discuss them soon, but first a
semi-philosophical aside on the ``right way'' to do physics analyses on and
involving simulated data.
\section{The science and art of physically valid MC analysis}
The world of MC event generators is a wonderfully convenient one for
experimentalists: we are provided with fully exclusive events whose most complex
correlations can be explored and used to optimise analysis algorithms and some
kinds of detector correction effects. It is absolutely true that the majority of
data analyses and detector designs in modern collider physics would be very
different without MC simulation.
But it is very important to remember that it is just simulation: event
generators encode much of known physics and phenomenologically explore the
non-perturbative areas of QCD, but only unadulterated experiment can really tell
us about how the world behaves. The richness and convenience of MC simulation
can be seductive, and it is important that experimental use of MC strives to
understand and minimise systematic biases which may result from use of simulated
data, and to not ``unfold'' imperfect models when measuring the real world. The
canonical example of the latter effect is the unfolding of hadronisation (a
deeply non-perturbative and imperfectly-understood process) at the Tevatron (Run
I), based on MC models. Publishing ``measured quarks'' is not physics --- much
of the data thus published has proven of little use to either theory or
experiment in the following years. In the future we must be alert to such
temptation and avoid such gaffes --- and much more subtle ones.
These concerns on how MC can be abused in treating measured data also apply to
MC validation studies. A key observable in QCD tunings is the \pT of the \PZ
boson, which has no phase space at exactly $\pT = 0$ but a very sharp peak at
$\mathcal{O}(\unit{1-2}{\GeV})$. The exact location of this peak is mostly
sensitive to the width parameter of a nucleon ``intrinsic \pT'' in MC
generators, plus some soft initial state radiation and QED
bremsstrahlung. Unfortunately, all the published Tevatron measurements of this
observable have either ``unfolded'' the QED effects to the ``\PZ \pT'' as
attached to the object in the HepMC/HEPEVT event record with a PDG ID code of
23, or have used MC data to fill regions of phase space where the detector could
not measure. Accordingly, it is very hard to make an accurate and portable MC
analysis to fit this data, without similarly delving into the event record in
search of ``the boson''. While common practice, this approach intrinsically
limits the precision of measured data to the calculational order of the
generator --- often not analytically well-defined. We can do better.
Away from this philosophical propaganda (which nevertheless we hope strikes some
chords in influential places\dots), there are also excellent pragmatic reasons
for MC analyses to avoid treating the MC ``truth'' record as genuine truth. The
key argument is portability: there is no MC generator which is the ideal choice
for all scenarios, and an essential tool for understanding sub-leading
variability in theoretical approaches to various areas of physics is to use
several generators with similar leading accuracies but different sub-leading
formalisms. While the HEPEVT record as written by HERWIG and PYTHIA has become
familiar to many, there are many ambiguities in how it is filled, from the
allowed graph structures to the particle content. Notably, the Sherpa event
generator explicitly elides Feynman diagram propagators from the event record,
perhaps driven by a desire to protect us from our baser analytical
instincts. The Herwig++ event generator takes the almost antipodal approach of
expressing different contributing Feynman diagram topologies in different ways
(\emph{not} physically meaningful!) and seamlessly integrating shower emissions
with the hard process particles. The general trend in MC simulation is to blur
the practically-induced line between the sampled matrix element and the
Markovian parton cascade, challenging many established assumptions about ``how
MC works''. In short, if you want to ``find'' the \PZ to see what its \pT or
$\eta$ spectrum looks like, many new generators may break your honed PYTHIA
code\dots or silently give systematically wrong results. The unfortunate truth
is that most of the event record is intended for generator debugging rather than
physics interpretation.
Fortunately, the situation is not altogether negative: in practice it is usually
as easy to write a highly functional MC analysis using only final state
particles and their physically meaningful on-shell decay parents. These are,
since the release of HepMC 2.5, standardised to have status codes of 1 and 2
respectively. \PZ-finding is then a matter of choosing decay lepton candidates,
windowing their invariant mass around the known \PZ mass, and choosing the best
\PZ candidate: effectively a simplified version of an experimental analysis of
the same quantity. This is a generally good heuristic for a safe MC analysis!
Note that since it's known that you will be running the analysis on signal
events, and there are no detector effects to deal with, almost all the details
that make a real analysis hard can be ignored. The one detail that is worth
including is summing momentum from photons around the charged leptons, before
mass-windowing: this physically corresponds to the indistinguishability of
collinear energy deposits in trackers and calorimeters and would be the ideal
published experimental measurement of Drell-Yan \pT for MC tuning. Note that
similar analyses for \PW bosons have the luxury over a true experiment of being
able to exactly identify the decay neutrino rather than having to mess around
with missing energy. Similarly, detailed unstable hadron (or tau) reconstruction
is unnecessary, due to the presence of these particles in the event record with
status code 2. In short, writing an effective analysis which is automatically
portable between generators is no harder than trying to decipher the variable
structures and multiple particle copies of the debugging-level event
objects. And of course Rivet provides lots of tools to do almost all the
standard fiddly bits for you, so there's no excuse!\\[\lineskip]
\noindent
Good luck, and be careful!
% While the event record "truth" structure may look very
% compellingly like a history of the event processes, it is extremely important to
% understand that this is not the case. For starters, such a picture is not
% quantum mechanically robust: it is impossible to reconcile such a concept of a
% single history with the true picture of contributing and interfering
% amplitudes. A good example of this is in parton showers, where QM interference
% leads to colour coherence. In the HERWIG-type parton showers, this colour
% coherence is implemented as an angular-ordered series of emissions, while in
% PYTHIA-type showers, an angular veto is instead applied. The exact history of
% which particles are emitted in which order is not physically meaningful but
% rather an artefact of the model used by the generator --- and is primarily
% useful for generator authors' debugging rather than physics analysis. This is in
% general true for all particles in the event without status codes of 1, 2 or 4.
% Another problem is that the way in which the event internals is documented is
% not well defined: it is for authors' use and as such they can do anything they
% like with the "non-physics" entities stored within. Some examples:
% * Sherpa does not write matrix element particles (i.e. W, Z, Higgs, ...) into the event record in most processes
% * Herwig++ uses very low-mass Ws in its event record to represent very off-shell weak decay currents of B and D mesons (among others)
% * In Drell-Yan events, Herwig++ sometimes calls the propagating boson a Z, and sometimes a photon, probabilistically depending on the no-mixing admixture terms
% * Sherpa events (and maybe others) can have "bottleneck" particles through which everything flows. Asking if a particle has e.g. a b-quark (NB. an unphysical degree of freedom!) ancestor might always give the answer "yes", depending on the way that the event graph has been implemented
% * Different generators do not even use the same status codes for "documentation" event record entries: newer ones tend to represent all internals as generator-specific particles, and emphasise their lack of physical meaning
% * The generator-level barcodes do not have any reliable meaning: any use of them is based on HEPEVT conventions whihc may break, especially for new generators which have never used HEPEVT
% * Many (all?) generators contain multiple copies of single internal particles, as a bookkeeping tools for various stages of event processing. Determining which (if any) is physically meaningful (e.g. which boosts were or weren't applied, whether QED radiation was included, etc.) is not defined in a cross-generator way.
% * The distinction between "matrix element" and "parton shower" is ill-defined: ideally everything would be emitted from the parton shower and indeed the trend is to head at least partially in this direction (cf. CKKW/POWHEG). You probably can't make a physically useful interpretation of the "hard process", even if particular event records allow you to identify such a thing.
% * Quark and gluon jets aren't as simple as the names imply on the "truth" level: to perform colour neutralisation, jets must include more contributions than a single hard process parton. When you look at event graphs, it becomes hard to define these things on a truth level. Use an observable-based heuristic definition instead.
% Hence, any truth-structure assumptions need to checked and probably modified
% when moving from one generator to another: clearly this can lead to
% prohibitively large maintenance and development hurdles. The best approach,
% whenever possible, is to only use truth information to access the particles
% with status code 1 (and those with status = 2, if decays of physically
% meaningful particles (i.e. hadrons) are being studied. In practice, this
% adds relatively little to most analyses, and the portability of the analyses
% is massively improved, allowing for wider and more reliable physics
% studies. If you need to dig deeper, be very careful!
% A final point of warning, more physical than the technicalities above, is that
% the bosons or similar that are written in the event record are imperfect
% calculational objects, rather than genuine truth. MC techniques are improving
% all the time, and you should be extremely careful if planning to do something
% like "unfolding" of QED radiation or initial state QCD radiation on data, based
% on event record internals: the published end result must be meaningfully
% comparable to future fixed order or N^kLL resummation calculations, particularly
% in precision measurements. If not handled with obsessive care, you may end up
% publishing a measurement of the internals of an MC generator rather than
% physical truth! Again, this is not an absolute prohibition --- there are only
% shades of grey in this area --- but just be very careful and stick to statuses