ARCHITECTURE AND ALGORITHMS FOR TRACKING
FOOTBALL PLAYERS WITH MULTIPLE CAMERAS
Ming Xu, Liam Lowey, James Orwell
{m.xu, j.orwell, l.lowey}@king.ac.uk,
Digital Imaging Research Centre,
Kingston University, UK.
Abstract

A system architecture and method for tracking people is presented for a sports application. The system input is video data from static cameras with overlapping fields-of-view at a football stadium. The output is the real-world, real-time positions of football players during a match. The system comprises two processing stages, operating on data from first a single camera and then multiple cameras. The organisation of processing is designed to achieve sufficient synchronisation between cameras, using a request-response pattern invoked by the second-stage multi-camera tracker. The single-view processing includes change detection against an adaptive background and image-plane tracking to improve the reliability of measurements of occluded players. The multi-view process uses Kalman trackers to model the player position and velocity, to which the multiple measurements input from the single-view stage are associated. Results are demonstrated on real data.

1. Introduction

This paper presents an architecture, and method, to allow multiple people to be tracked with multiple cameras. The application output is the positions of players, and ball, during a football match. This output can be used for entertainment – augmenting digital TV, or low-bandwidth match-play animations for web or wireless display; and also for analysis of the fitness and tactics of the teams and players.
Several research projects on tracking soccer players have published results. Intille and Bobick [4] track players, using the concept of closed-world, in the broadcast TV footage of American football games. A monocular TV sequence is also the input data for [6], in which panoramic views and player trajectories are computed. The SoccerMan [1] project analyses two synchronised video sequences of a soccer game and generates an animated virtual 3D view of the given scene. Similarly to TV data, these projects use one (or two) pan-tilt-zoom cameras to improve the image resolution of players, and the correspondence between frames has to be made on the basis of matching field lines or arcs. An alternative approach to improving the players' resolution is to use multiple stationary cameras. Although this method requires dedicated static cameras, it increases the overall field of view, minimizes the effects of dynamic occlusion, provides 3D estimates of ball location, and improves the accuracy and robustness of estimation due to information fusion. There are different ways to use multi-view data, e.g. hand-off between best-view cameras, homography transforms between the images of uncalibrated cameras, or calibrated cameras able to determine the 3D world coordinate through the cooperation of two or more cameras.

Our system uses eight digital video cameras statically positioned around the stadium, and calibrated to a common ground-plane co-ordinate system using Tsai's algorithm [7]. A two-stage processing architecture is used. Details of this architecture are provided in Section 2. The first processing stage is the extraction of information from the image streams about the players observed by each camera. This is described in Section 3. The data from each camera is input to a central tracking process, described in Section 4, to update the state estimates of the players. This includes the estimate of which of the five possible uniforms each player is wearing (two outfield teams, two goal-keepers, and the three referees; in this paper, 'player' includes the referees). The output from this central tracking process is the 25 player positions per time step. The tracker indicates the category (team) of each player, and maintains the correct number of players in each category. The identification of individual players is not possible, given the resolution of the input data, so only the team is recognised. The ball tracking methods are outside the scope of this paper.
2. System Architecture

In the video processing stage, a three-step approach is used to generate the Features. Each Feature consists of a 2D ground-plane position, its spatial covariance, and a category estimate. Every camera is connected to a processing unit called a 'Feature Server', reflecting its position in the overall architecture.

Features are collected and synchronised by the centralised 'Tracker' and are duly processed to generate a single model of the game (state) at a given time. This game-state is finally passed through a mark-up phase which is responsible for generating the XML output that is used by third party applications to deliver results to their respective target audiences.

2.1 Physical location of system components

The cameras are positioned at relevant locations around the stadium and are connected to a series of eight 'Feature Servers' through a network of optical fibres, see Fig. 1. The position of the cameras is governed by the layout of the chosen stadium and the requirement to achieve an optimal view of the football pitch (good resolution of each area, especially the goal-mouths, is more important than multiple views of each area).

Each optical fibre terminates in a location which houses all of the processing hardware (eight Feature Servers and a single Tracker), where the digital video is interpreted into useable image streams. The 'Feature Servers' are interconnected with the 'Tracker' hardware using an IP/Ethernet network, which is used to communicate and synchronise the generated Features. This physical arrangement of components is influenced by the requirement to minimise the profile of the installations above the stadium. If that requirement were not so important, the overall bandwidth requirements could be considerably reduced by locating the Feature Servers alongside the cameras. Then, only the Features would need transporting over the long distance to the 'Tracker' processing stage: this could be achieved with regular or even wireless Ethernet, rather than the optical fibre presently needed.
2.2 Request-Response architecture

A 'request-response' mechanism is selected to communicate the Features from the Feature Servers to the Tracker. This solves several of the problems inherent in managing eight simultaneous streams of data across a network.

The Tracker is responsible for orchestrating the process by which the Feature Servers generate their Features. Each iteration (frame) of the process takes the form of a single (broadcast) request issued by the Tracker at a given time. The Feature Servers then respond by taking the latest frame in the video stream, processing it, and transmitting the resultant Features back to the Tracker. Synchronisation of the Features is implied, as the Tracker records the time at which the request was made.

The Request-Response action repeats continually while the time-stamped Features are passed on to be processed by components running in parallel inside the Tracker. The results of the tracking components are then marked up (into XML) and delivered through a peer-to-peer connection to any compliant third party application.
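As an illustration of this request-response cycle, the sketch below shows how a broadcast request and the collection of Feature replies over UDP might be organised. The port numbers, packet layout and per-frame time budget are assumptions made for the sketch; they are not the wire format used by the deployed system.

```python
# Illustrative sketch of the request-response cycle (not the actual wire format).
# Assumed: Feature Servers listen on REQUEST_PORT; the Tracker collects UDP
# replies on REPLY_PORT until a per-frame deadline expires.
import socket, struct, time

REQUEST_PORT, REPLY_PORT = 9000, 9001   # hypothetical port numbers
FRAME_BUDGET = 0.04                     # seconds allowed per iteration (assumed)

def broadcast_request(frame_id: int) -> float:
    """Broadcast a single request to all Feature Servers; return the request time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    t_request = time.time()
    # A request only needs to identify the frame; servers reply with Features.
    sock.sendto(struct.pack("!Id", frame_id, t_request),
                ("255.255.255.255", REQUEST_PORT))
    sock.close()
    return t_request

def collect_features(t_request: float) -> list:
    """Gather serialised Feature packets from the eight servers until the deadline."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", REPLY_PORT))
    sock.settimeout(FRAME_BUDGET)
    packets = []
    deadline = t_request + FRAME_BUDGET
    try:
        while time.time() < deadline:
            data, _addr = sock.recvfrom(65536)
            packets.append(data)          # time-stamped with t_request by the caller
    except socket.timeout:
        pass                              # late servers are simply skipped this frame
    finally:
        sock.close()
    return packets
```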
2.3 Format of Feature Data

The second-stage ('Tracker') process does not have access to the video data processed in the first stage. Therefore, the 'Feature' data must include all that is required by the second-stage components to generate reliable position estimates for the people (and ball) present on the football pitch. The composition of the Feature is thus dictated by the requirements of the second-stage process. The process described in Section 4 requires the bounding box, the estimated ground-plane location and covariance, and the category estimate (defined as a seven-element vector, the elements summing to one, and corresponding to the five different uniforms, the ball, and 'other'). Further information is included, e.g. the single-view tracker ID tag, so that the multi-view tracker can implement track-to-track data association [2]. Common software development design patterns are used to manage the process of transmitting these Features to the 'Tracker' hardware, by serialising to and from a byte stream (UDP socket). This includes the task of ensuring compatibility between different platforms.
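As a concrete illustration of such a Feature, the sketch below defines a record with the fields listed above and a fixed-layout, network-byte-order serialisation. The field ordering and byte layout are assumptions for illustration, not the format actually transmitted by the Feature Servers.

```python
# Sketch of a Feature record and its (assumed) fixed-layout serialisation.
from dataclasses import dataclass
import struct
import numpy as np

@dataclass
class Feature:
    track_id: int                 # single-view tracker ID tag
    bbox: tuple                   # (top, left, bottom, right) in image pixels
    ground_xy: np.ndarray         # 2D ground-plane position [X_w, Y_w]
    ground_cov: np.ndarray        # 2x2 spatial covariance on the ground plane
    category: np.ndarray          # 7-element probability vector (5 uniforms, ball, other)

    # "!" = network byte order, so all platforms agree on the layout.
    _FMT = "!I4f2f4f7f"

    def to_bytes(self) -> bytes:
        return struct.pack(self._FMT, self.track_id, *self.bbox,
                           *self.ground_xy, *self.ground_cov.ravel(), *self.category)

    @classmethod
    def from_bytes(cls, data: bytes) -> "Feature":
        vals = struct.unpack(cls._FMT, data)
        return cls(track_id=vals[0],
                   bbox=tuple(vals[1:5]),
                   ground_xy=np.array(vals[5:7]),
                   ground_cov=np.array(vals[7:11]).reshape(2, 2),
                   category=np.array(vals[11:18]))
```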
2.4 Configuring the Feature Server

Each of the many software components requires a degree of configuration, whether it is a simple instruction or a complex data file (e.g. the camera-to-ground calibration). Combined with the presence of eight Feature Servers, eight cameras and a single Tracker, this process would be difficult to manage by hand, so a centrally managed system is necessary. A message-based network protocol has been devised to control and configure the operations of the various components. This protocol also provides support for other operations, such as image retrieval for camera calibration.

Figure 1: System Architecture. [Diagram: each camera feeds a Feature Server (background modelling, region segmentation, shadow suppression; single-view box tracker; player classification; ball filtering) over FireWire (IEEE 1394) and optical fibre; the eight Feature Servers communicate over an IP data layer with the central Tracker (multi-view player tracker, multi-view ball tracker, game-state mark-up to XML) via broadcast requests and Feature responses.]
3. Feature Server processing steps

The Feature Server uses a three-step approach to generate the Features, as indicated in Fig. 1. Each Feature consists of a 2D ground-plane position, its spatial covariance, and a category estimate.

3.1 Foreground Detection

The first step is 'Change Detection' based on image differencing: its output is connected foreground regions (Fig. 2, top). An initial background is modelled using a mixture of Gaussians and learned beforehand. The initial background is then used by a running-average algorithm for fast updating. If $F_k$ is the foreground binary map at time $k$ (and $\bar{F}_k$ its complement), then the background $\mu_k$ is updated with image $y$ as:

$$\mu_k = [\alpha_L\, y + (1-\alpha_L)\,\mu_{k-1}]\, F_k + [\alpha_H\, y + (1-\alpha_H)\,\mu_{k-1}]\, \bar{F}_k$$

where $0 < \alpha_L \ll \alpha_H \ll 1$.
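A minimal NumPy sketch of this selective running-average update is given below, assuming the binary foreground map is supplied by the change-detection step; the two learning-rate values are illustrative only.

```python
# Sketch of the selective running-average background update of Section 3.1.
# The foreground mask F_k comes from change detection; alpha values are assumed.
import numpy as np

ALPHA_L = 0.001   # slow adaptation where foreground was detected (assumed value)
ALPHA_H = 0.05    # faster adaptation in background regions (assumed value)

def update_background(mu_prev: np.ndarray, frame: np.ndarray,
                      fg_mask: np.ndarray) -> np.ndarray:
    """Blend the new frame into the background model, per pixel.

    mu_prev : current background estimate (H x W [x 3]), float
    frame   : current image y, same shape as mu_prev
    fg_mask : binary foreground map F_k (H x W), 1 where a player was detected
    """
    fg = fg_mask.astype(float)
    if frame.ndim == 3:                     # broadcast the mask over colour channels
        fg = fg[..., None]
    slow = ALPHA_L * frame + (1.0 - ALPHA_L) * mu_prev   # applied inside foreground
    fast = ALPHA_H * frame + (1.0 - ALPHA_H) * mu_prev   # applied in the background
    return slow * fg + fast * (1.0 - fg)
```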
3.2 Single View Tracking

The second step is a local tracking process [8] to split the Features of grouped people. The bounding box and centroid co-ordinates of each player are used as the state and measurement variables in a Kalman filter:

$$\mathbf{x} = [r_c\ \ c_c\ \ \dot{r}_c\ \ \dot{c}_c\ \ \Delta r_1\ \ \Delta c_1\ \ \Delta r_2\ \ \Delta c_2]^T$$
$$\mathbf{z} = [r_c\ \ c_c\ \ r_1\ \ c_1\ \ r_2\ \ c_2]^T$$

where $(r_c, c_c)$ is the centroid and $r_1, c_1, r_2, c_2$ represent the top, left, bottom and right bounding edges, respectively. $(\Delta r_1, \Delta c_1)$ and $(\Delta r_2, \Delta c_2)$ are the relative positions of the bounding box corners with respect to the centroid.

As in [8], it is assumed that each target has a slowly varying height and width. Once some bounding edge of a target is decided to be observable, its opposite, unobservable bounding edge can be roughly estimated (Fig. 2). Because the estimate is updated using partial measurements whenever available, it is more accurate than using prediction only.
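The sketch below illustrates this partial-measurement update: only the rows of the measurement that correspond to an observable centroid or edge are used in the Kalman correction. The constant-velocity model, noise levels and frame period are assumed values, and the decision of which edges are observable (made as in [8]) is taken as given.

```python
# Sketch of the single-view box tracker update with partial measurements.
# State x = [r_c, c_c, dr_c, dc_c, dR1, dC1, dR2, dC2]; measurement
# z = [r_c, c_c, r1, c1, r2, c2].  Noise levels and dt are assumed values.
import numpy as np

DT = 0.04                              # frame period (assumed)

# Constant-velocity model for the centroid; box offsets assumed slowly varying.
A = np.eye(8)
A[0, 2] = A[1, 3] = DT

# Full measurement matrix: centroid, then the four edges = centroid + offset.
H_FULL = np.zeros((6, 8))
H_FULL[0, 0] = H_FULL[1, 1] = 1.0            # r_c, c_c
H_FULL[2, 0] = H_FULL[2, 4] = 1.0            # r1 = r_c + dR1
H_FULL[3, 1] = H_FULL[3, 5] = 1.0            # c1 = c_c + dC1
H_FULL[4, 0] = H_FULL[4, 6] = 1.0            # r2 = r_c + dR2
H_FULL[5, 1] = H_FULL[5, 7] = 1.0            # c2 = c_c + dC2

Q = np.eye(8) * 1e-2                         # process noise (assumed)
R_FULL = np.eye(6) * 4.0                     # measurement noise (assumed)

def update(x, P, z, observable):
    """Kalman predict/update using only the observable measurement rows.

    observable : boolean array of length 6, True for the rows of z that come
                 from an unoccluded centroid/edge (decided as in [8]).
    """
    # Predict.
    x = A @ x
    P = A @ P @ A.T + Q
    # Keep only the observable rows of the measurement.
    rows = np.flatnonzero(observable)
    if rows.size:
        H = H_FULL[rows]
        R = R_FULL[np.ix_(rows, rows)]
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (z[rows] - H @ x)
        P = (np.eye(8) - K @ H) @ P
    return x, P
```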
Figure 2: Two players merge and split: (top) foreground regions and (bottom) single view tracking. In the middle of the lower diagram, the bounding boxes are Kalman estimates.

Figure 3: Single-view tracker output for all eight cameras.

For an isolated player, the image measurement comes from the foreground region directly, which prevents estimation errors accumulating from a hierarchy of Kalman filters. We assume the measurement covariance is constant, because foreground detection is a pixelwise operation:

$$\mathbf{u} = [r_c\ \ c_c]^T, \qquad \Lambda_u = \Lambda_0$$

For a grouped player, the measurement is calculated from the estimate and the covariance increases to $\Lambda_u = \beta \Lambda_0$ ($\beta > 1$). Writing the homography transform from the $i$-th image plane to the ground plane as $H_i$, and the ground-plane measurement as $\mathbf{z}^i = [X_w\ \ Y_w]^T$, the homogeneous coordinates of the ground plane and the $i$-th image plane can be written as $\tilde{\mathbf{X}} = [(\mathbf{z}^i)^T\ \ W]^T$ and $\tilde{\mathbf{x}} = [\mathbf{u}^T\ \ 1]^T$ respectively, with $\tilde{\mathbf{X}} = H_i \tilde{\mathbf{x}}$. The measurement covariance in the ground plane, in homogeneous coordinates, is [3]:

$$\Lambda_{\tilde{X}} = B \Lambda_h B^T + H \Lambda_{\tilde{x}} H^T$$

where $B$ is the matrix form of $\tilde{\mathbf{x}}$, $\mathbf{h}$ is the vector form of $H$ with covariance $\Lambda_h$, and $\Lambda_{\tilde{x}}$ is the conversion of $\Lambda_u$ to homogeneous coordinates. While the second term is the propagation of the image measurement errors, the first term accounts for the errors in the homography matrix, which depend on the accuracy, number and distribution of the landmark points used to compute the matrix. The conversion of $\Lambda_{\tilde{X}}$ to inhomogeneous coordinates is the ground-plane covariance $R^i$. Fig. 4 shows the ground-plane measurement covariance.
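As an illustration of the second term (the propagation of image measurement errors), the sketch below maps an image measurement and its covariance to the ground plane through a known homography using a standard first-order (Jacobian) approximation; the homography-uncertainty term $B \Lambda_h B^T$ is omitted, so this is a simplified stand-in for the homogeneous-coordinate formulation of [3], not a transcription of it.

```python
# Sketch of propagating the image-plane measurement covariance to the ground
# plane through a homography (first-order / Jacobian approximation).  Only the
# image-error term of the equation above is propagated here.
import numpy as np

def image_to_ground(H: np.ndarray, u: np.ndarray, cov_u: np.ndarray):
    """Map an image point u=[row, col] and its 2x2 covariance to the ground plane.

    H     : 3x3 homography from the image plane to the ground plane
    u     : image measurement (e.g. the player's centroid)
    cov_u : 2x2 image-plane covariance (Lambda_u)
    Returns the ground-plane point z=[X_w, Y_w] and its 2x2 covariance.
    """
    x_h = np.array([u[0], u[1], 1.0])        # homogeneous image point
    a, b, w = H @ x_h                        # homogeneous ground-plane point
    z = np.array([a / w, b / w])
    # Jacobian of the inhomogeneous mapping (X_w, Y_w) with respect to (u1, u2).
    J = np.array([[H[0, 0] - z[0] * H[2, 0], H[0, 1] - z[0] * H[2, 1]],
                  [H[1, 0] - z[1] * H[2, 0], H[1, 1] - z[1] * H[2, 1]]]) / w
    cov_z = J @ cov_u @ J.T
    return z, cov_z
```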
3.3 Category Estimation

The final step adds to each measurement an estimate of the category (the player's uniform). This is implemented using a histogram-intersection method [5]. The result for each player is a seven-element vector $\mathbf{c}^i_j(k)$, indicating the probability that the Feature is a player wearing one of the five categories of uniform, or the ball, or clutter.
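A minimal sketch of the histogram-intersection comparison is given below; the colour space, bin counts and the normalisation into a seven-element probability vector are assumptions for illustration, and the reference histograms are taken as learned beforehand.

```python
# Sketch of category estimation by histogram intersection [5].
# Reference histograms for the seven categories (five uniforms, ball, clutter)
# are assumed to be learned beforehand from labelled image patches.
import numpy as np

def colour_histogram(patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Joint RGB histogram of a foreground patch, normalised to sum to one."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3),
                             bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def category_estimate(patch: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Seven-element category vector for one detected region.

    references : array of shape (7, bins**3), one reference histogram per category
    """
    h = colour_histogram(patch)
    # Histogram intersection: sum of elementwise minima with each reference.
    scores = np.minimum(references, h).sum(axis=1)
    total = scores.sum()
    return scores / total if total > 0 else np.full(7, 1.0 / 7.0)
```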
4. Multi View, Multi Person Tracking

For the multi-view tracking process, a three-step player tracking method is proposed. The first step is to associate measurements to established tracks, and to update these tracks. The second step is to initiate tracks for the measurements unmatched to any existing track. Finally, the fixed population of each category of players (ten outfield players and one goalkeeper per team, three referees) is used to recognize the members in each category.
4.1 Associating Features and Targets

Each player is modelled as a target $\mathbf{x}_t$, represented by the state $\mathbf{x}_t = [X_w\ \ Y_w\ \ \dot{X}_w\ \ \dot{Y}_w]^T$ at time $k$, as well as a covariance $P_t(k)$ and a category estimate $\mathbf{e}_t(k)$. The state is updated, if possible, by a measurement $\mathbf{m}_t$: this is the fusion of at most one Feature from each camera. The measurement $\mathbf{m}_t$ comprises a measured position $\mathbf{z}_t = [X_w\ \ Y_w]^T$, an overall covariance $R_t(k)$, and an overall category measurement $\mathbf{c}_t(k)$. If no fused measurement is available, then the state is updated using only its prior estimate.

The state transition and measurement matrices are:

$$A_g = \begin{bmatrix} I_2 & T I_2 \\ O_2 & I_2 \end{bmatrix}, \qquad H_g = [\,I_2 \ \ O_2\,]$$
The creation of the fused measurement is as follows. The set of players $\{\mathbf{x}_t\}$ is associated with the measurements $\{\mathbf{m}^i_j\}$ from the $i$-th camera, the result of which is expressed as an association matrix $\beta^i$. From the several association methods available, e.g. Nearest Neighbour and Joint Probabilistic data association [2], the first method is used here: each element $\beta^i_{jt}$ is 1 or 0, according to the Mahalanobis distance between the measurement and the target prediction:

$$d_{jt} = [\mathbf{z}^i_j - H_g \hat{\mathbf{x}}_t^-]^T (H_g P_t^- H_g^T + R^i_j)^{-1} [\mathbf{z}^i_j - H_g \hat{\mathbf{x}}_t^-]$$

Then a single measurement for each target integrates the individual camera measurements, weighted by the measurement uncertainties, as shown in Fig. 4:

$$R_t = \Big[ \sum_i \sum_j \beta^i_{jt} (R^i_j)^{-1} \Big]^{-1}, \qquad
\mathbf{z}_t = R_t \sum_i \sum_j \beta^i_{jt} (R^i_j)^{-1} \mathbf{z}^i_j$$

$$\mathbf{c}_t = \sum_i \sum_j \beta^i_{jt}\, w^i_t\, \mathbf{c}^i_j, \qquad
w^i_t = \frac{\sum_j \beta^i_{jt}\, \mathrm{tr}(R^i_j)}{\sum_i \sum_j \beta^i_{jt}\, \mathrm{tr}(R^i_j)}$$

where $\mathrm{tr}(\cdot)$ represents the trace of a matrix. Each target is then updated using the integrated measurement:

$$K_t = P_t^- H_g^T (H_g P_t^- H_g^T + R_t)^{-1}$$
$$\hat{\mathbf{x}}_t^+ = \hat{\mathbf{x}}_t^- + K_t (\mathbf{z}_t - H_g \hat{\mathbf{x}}_t^-)$$
$$P_t^+ = (I - K_t H_g) P_t^-$$
$$\hat{\mathbf{e}}_t(k) = (1-\eta)\, \hat{\mathbf{e}}_t(k-1) + \eta\, \mathbf{c}_t(k)$$

where $0 < \eta < 1$, and $\{\mathbf{x}_t^-, \mathbf{x}_t^+,\ \mathrm{etc.}\}$ are shorthand notation for prior and posterior estimates, respectively.
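The sketch below assembles these steps for a single target at one time step: gating of the per-camera measurements by Mahalanobis distance, inverse-covariance fusion of the position, trace-weighted fusion of the category vectors, and the Kalman correction with $A_g$ and $H_g$ as defined above. The gate value, sampling period and process noise are assumptions.

```python
# Sketch of the multi-view update for one target: gate the per-camera ground-
# plane measurements, fuse them by inverse-covariance weighting, then apply the
# Kalman update with A_g, H_g as defined above.  T and the gate are assumed.
import numpy as np

T = 0.04                                   # sampling period (assumed)
GATE = 9.21                                # chi-square 99% gate, 2 dof (assumed)

A_g = np.block([[np.eye(2), T * np.eye(2)],
                [np.zeros((2, 2)), np.eye(2)]])
H_g = np.hstack([np.eye(2), np.zeros((2, 2))])
Q_g = np.eye(4) * 0.05                     # process noise (assumed)

def update_target(x, P, e, measurements, eta=0.1):
    """measurements: list of (z_ij, R_ij, c_ij) Features, at most one per camera."""
    x_pred = A_g @ x
    P_pred = A_g @ P @ A_g.T + Q_g

    info, info_z, cats, weights = np.zeros((2, 2)), np.zeros(2), [], []
    for z_ij, R_ij, c_ij in measurements:
        nu = z_ij - H_g @ x_pred
        S = H_g @ P_pred @ H_g.T + R_ij
        if nu @ np.linalg.solve(S, nu) > GATE:      # beta_ijt = 0
            continue
        R_inv = np.linalg.inv(R_ij)                 # beta_ijt = 1
        info += R_inv
        info_z += R_inv @ z_ij
        cats.append(c_ij)
        weights.append(np.trace(R_ij))

    if not cats:                                    # no fused measurement this step
        return x_pred, P_pred, e

    R_t = np.linalg.inv(info)                       # fused covariance
    z_t = R_t @ info_z                              # fused position
    w = np.array(weights) / np.sum(weights)         # trace weights, as above
    c_t = np.sum(w[:, None] * np.array(cats), axis=0)

    S = H_g @ P_pred @ H_g.T + R_t
    K = P_pred @ H_g.T @ np.linalg.inv(S)
    x_post = x_pred + K @ (z_t - H_g @ x_pred)
    P_post = (np.eye(4) - K @ H_g) @ P_pred
    e_post = (1.0 - eta) * e + eta * c_t
    return x_post, P_post, e_post
```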
4.2 Associating Features with Features

After checking measurements against existing tracks, there may be some measurements left unmatched. Those measurements, each from a different camera, are then checked against each other to find potential new targets. Supposing measurements $\mathbf{z}_1$ and $\mathbf{z}_2$ are from different camera views and have covariances $R_1$ and $R_2$, respectively, they are associated to establish a new target if the distance

$$D_{12} = (\mathbf{z}_1 - \mathbf{z}_2)^T (R_1 + R_2)^{-1} (\mathbf{z}_1 - \mathbf{z}_2)$$

is within some threshold.
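A minimal sketch of this pairwise test over the unmatched measurements follows; the threshold value is an assumption.

```python
# Sketch of pairing unmatched measurements from different cameras to seed new
# targets.  The threshold corresponds to "within some threshold" above (assumed).
import numpy as np
from itertools import combinations

D_MAX = 9.21    # assumed chi-square-style threshold on D12

def new_target_pairs(unmatched):
    """unmatched: list of (camera_id, z, R) that no existing track claimed."""
    pairs = []
    for (cam1, z1, R1), (cam2, z2, R2) in combinations(unmatched, 2):
        if cam1 == cam2:
            continue                     # candidates must come from different views
        d = z1 - z2
        D12 = d @ np.linalg.solve(R1 + R2, d)
        if D12 < D_MAX:
            pairs.append((cam1, cam2, 0.5 * (z1 + z2)))   # rough initial position
    return pairs
```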
4.3 Selection of Tracks

If there are more than 25 targets in the model, then the 25 most likely tracks are selected to be output as the reported positions of the players. The target likelihood measure is calculated using the target longevity, the category estimate, and the duration of occlusion with other targets. A fast, sub-optimal search method gives reasonable results.
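The sketch below is a hypothetical greedy version of such a selection: tracks are scored from the three cues named above and the per-category quotas are filled in score order. The scoring weights, category indices and quota table are assumptions for illustration, not the system's actual likelihood measure.

```python
# Hypothetical greedy selection of the 25 reported tracks, respecting the fixed
# category populations (10 outfield + 1 goalkeeper per team, 3 referees).
# The scoring weights are illustrative, not the system's actual likelihood.
import numpy as np

QUOTAS = {0: 10, 1: 10, 2: 1, 3: 1, 4: 3}     # category index -> allowed count

def track_score(longevity, category_conf, occlusion_time, w=(1.0, 2.0, 0.5)):
    """Higher is better: long-lived, confidently categorised, rarely occluded."""
    return w[0] * longevity + w[1] * category_conf - w[2] * occlusion_time

def select_tracks(tracks):
    """tracks: list of dicts with 'longevity', 'e' (5-uniform part of the category
    estimate), 'occlusion_time'.  Returns at most 25 selected tracks."""
    scored = []
    for t in tracks:
        cat = int(np.argmax(t["e"]))
        s = track_score(t["longevity"], float(np.max(t["e"])), t["occlusion_time"])
        scored.append((s, cat, t))
    scored.sort(key=lambda item: item[0], reverse=True)

    remaining = dict(QUOTAS)
    selected = []
    for s, cat, t in scored:             # greedy: best score first, fill each quota
        if remaining.get(cat, 0) > 0:
            selected.append(t)
            remaining[cat] -= 1
    return selected
```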
5. Results and Conclusions

The two-stage method outlined in this paper has been successfully demonstrated on several recorded matches. A live system is being installed and will be available for testing soon. The output from the second-stage tracking process is shown in Fig. 5, given the Feature input from the eight cameras shown in Fig. 3.

The system works as planned and gives reasonably reliable and accurate results. Work is currently being undertaken to provide a quantitative evaluation of these results.

Figure 5: Trajectories of automatically tracked players.

Figure 4: Ground-plane covariance of Features (top and left) from single cameras. The dark line denotes the extent of the field of view. Bottom right: fused measurements; the dark ellipses represent the uncertainty thereof.

Match situations involving many tightly packed players give inaccurate estimates, and re-initialisation errors as the players re-disperse. In the limiting case, these situations are probably insoluble. In general, several system components, discussed below, critically affect the performance of the system. Firstly, the consistency of the camera homographies is very important. If there are systematic errors between specific cameras, the data association step is less likely to be correct. If the error cannot be removed, then an empirical correction is acceptable. Secondly, the single-view tracker is designed to output one measurement per player observed in that camera. Currently, if two players enter its field of view as an occluded group, there is no mechanism for the single-view tracker to recognise the correct number of players. This could be facilitated by feedback from the multi-view tracker, which could be integrated into our architecture quite easily. Finally, there are several ways to improve the data association step, e.g. allowing for a probabilistic association matrix $\beta^i$, more informed covariance estimates $R$, and incorporating the category estimate of the measurement, $\mathbf{c}$.

To conclude, an architecture has been presented to facilitate the modelling of football players using object detection and tracking with single and then multiple cameras.

Acknowledgements

This work is part of the INMOVE project, supported by the European Commission IST 2001-37422. The project partners are the INMOVE Consortium: Technical Research Centre of Finland, Oy Radiolinja Ab, Mirasys Ltd, Netherlands Organization for Applied Scientific Research TNO, University of Genova, Kingston University, IPM Management A/S and ON-AIR A/S.

References

[1] T. Bebie and H. Bieri, 'SoccerMan: reconstructing soccer games from video sequences', Proc. ICIP, pp. 898-02, (1998).

[2] Y. Bar-Shalom and X. R. Li, Multitarget-Multisensor Tracking: Principles and Techniques, YBS, (1995).

[3] A. Criminisi, I. Reid, and A. Zisserman, 'A plane measuring device', Proc. BMVC, (1997).

[4] S. S. Intille and A. F. Bobick, 'Closed-world tracking', Proc. ICCV, pp. 672-678, (1995).

[5] T. Kawashima, K. Yoshino, and Y. Aoki, 'Qualitative Image Analysis of Group Behaviour', Proc. CVPR, pp. 690-3, (1994).

[6] Y. Seo, S. Choi, H. Kim and K. S. Hong, 'Where are the ball and players?: Soccer game analysis with color-based tracking and image mosaick', Proc. ICIAP, pp. 196-203, (1997).

[7] R. Tsai, 'An efficient and accurate camera calibration technique for 3D machine vision', Proc. CVPR, pp. 323-344, (1986).

[8] M. Xu and T. Ellis, 'Partial observation vs. blind tracking through occlusion', Proc. BMVC, pp. 777-786, (2002).