Audio Watermarking for Monitoring and Copy Protection
Jaap Haitsma, Michiel van der Veen, Ton Kalker and Fons Bruekers
Philips Research Laboratories
Prof. Holstlaan 4
5656 AA Eindhoven, The Netherlands
[jaap.haitsma][michiel.van.der.veen][ton.kalker][fons.bruekers]@philips.com
ABSTRACT
Based on existing technology used in image and video watermarking, we have developed a robust audio watermarking technique. The embedding algorithm operates in the frequency domain, where the magnitudes of the Fourier coefficients are slightly modified. In the temporal domain, an additional scale parameter and gain function are necessary to refine the watermark and achieve perceptual transparency. Watermark detection relies on the Symmetrical Phase Only Matched Filtering (SPOMF) cross-correlation approach. Not only the presence of a watermark, but also its cyclic shift is detected. This shift supports a multi-bit payload for one particular watermark sequence. The watermarking technology proved to be very robust to a large number of signal processing "attacks" such as MP3 (64 kb/s), all-pass filtering, echo addition, time-scale modification, resampling, and noise addition. It is expected that this approach may contribute to a wide variety of existing (e.g. monitoring and copy protection) and future applications.
Keywords
audio, broadcast monitoring, copy protection, watermark
embedding, watermark detection
1. INTRODUCTION
A digital audio watermark is an information label that is embedded in an audio signal in an imperceptible manner. During the past few years a number of new audio watermarking techniques have been developed to support applications such as copy control [1] [2] or broadcast monitoring [3]. Most of these operate in the time domain and employ methods such as echo-hiding [4] or some kind of noise addition, exploiting temporal and/or spectral masking models of the human auditory system [5] [6].
Based on image and video watermarking techniques [3] [7] we have developed an alternative approach to audio watermarking. Similar to the work of Piva et al. [2], watermark embedding is performed in the frequency domain. The principles of spectral masking are exploited in a relatively simple manner by slightly modifying the magnitudes of the Fourier coefficients. The embedding algorithm is complemented with a detection procedure adapted from cross-correlation techniques used in image registration [9] and video watermarking [3] [8].
The combination of both algorithms offers several advantages in terms of robustness to some trivial signal processing "attacks" (e.g. all-pass filtering).
In this paper, we introduce both embedding and detection algorithms and briefly discuss some key aspects such as payload, perceptual transparency, robustness and detection reliability.
2. EMBEDDING
A sketch of our watermark embedding algorithm is displayed in Figure 1. A random watermark sequence $W(k)$ is drawn from a normal distribution with mean 0 and standard deviation 1. A cyclically shifted version $W_s(k)$ is used to achieve a multi-bit payload for one particular watermark sequence $W(k)$. Every possible shift may be associated with a different information label. The payload therefore grows with the logarithm of the watermark length: a watermark of $N$ samples has $N$ distinct cyclic shifts and thus carries at most $\log_2 N$ bits (e.g. a 1024-sample watermark corresponds to a payload of at most 10 bits).
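The relation between cyclic shift and payload can be illustrated with a minimal sketch. The function names, the seed and the shift direction are assumptions made for this illustration and are not taken from our implementation.

```python
import math
import random

def make_watermark(length, seed):
    """Draw W(k) from a normal distribution with mean 0 and standard deviation 1."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

def cyclic_shift(w, shift):
    """Return W_s(k) = W((k - shift) mod N). The shift direction is an assumption."""
    n = len(w)
    s = shift % n
    return w[-s:] + w[:-s] if s else list(w)

def max_payload_bits(length):
    """N distinct cyclic shifts can encode at most floor(log2(N)) bits."""
    return math.floor(math.log2(length))

W = make_watermark(1024, seed=1)   # hypothetical seed
Ws = cyclic_shift(W, 512)          # the label is encoded as the shift value 512
print(max_payload_bits(len(W)))
```

Shifting by the full sequence length returns the original sequence, so only shifts 0 to N-1 are distinguishable.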
The dominant part of the perceptually weighted watermark $w(n)$ is derived in the Fourier domain, where spectral masking is exploited in a relatively simple manner. First, the audio signal $x(n)$ is segmented into frames and transformed to the frequency domain. Here, the magnitudes of its Fourier coefficients are slightly modified by utilizing the shifted watermark sequence $W_s(k)$:
$$W'_i(k) = X_i(k)\, W_s(k), \qquad (1)$$
where $i$ indicates the frame number, $X_i(k)$ the spectral representation of the frame $x_i(n)$, and $W'_i(k)$ the resulting frequency-domain watermark. Note that the frame size involves a trade-off between perceptual transparency (small frame sizes) and detection reliability (large frame sizes). Several experiments have demonstrated that, in general, frame sizes of 2048 samples provide a good compromise in this trade-off.
The time-domain watermark $w(n)$ is obtained by applying the inverse Fourier transform $F^{-1}$ to the frames $W'_i(k)$. Constructing the watermark in the frequency domain (Equation 1) is not sufficient to assure perceptual transparency. Since fixed-length Fourier transforms do not provide accurate time localization, watermarks computed in the frequency domain will spread in time over the entire analysis window. This may result in perceptual distortions such as pre-echoes. Therefore, an additional scale parameter $\alpha$ and gain function $g(n)$ are introduced to refine the watermark in the temporal domain:
$$y(n) = x(n) + \alpha\, g(n)\, w(n), \qquad (2)$$
where $\alpha$ is the global scale parameter, $g(n)$ a data-dependent gain function with values between 0 and 1, and $y(n)$ the watermarked audio.
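The time-domain refinement of Equation 2 can be sketched as follows. The gain function used here (a local peak envelope normalized by the global peak) is purely a hypothetical stand-in, since the exact form of $g(n)$ is not specified in this paper; the names `envelope_gain` and `embed` and the window length are likewise assumptions.

```python
def envelope_gain(x, half_window=32):
    """Hypothetical g(n) in [0, 1]: local peak level of x divided by the global peak."""
    peak = max(abs(v) for v in x) or 1.0   # avoid division by zero for silent input
    n = len(x)
    gain = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        gain.append(max(abs(v) for v in x[lo:hi]) / peak)
    return gain

def embed(x, w, alpha=0.2, gain=None):
    """Equation 2: y(n) = x(n) + alpha * g(n) * w(n)."""
    g = gain if gain is not None else envelope_gain(x)
    return [xi + alpha * gi * wi for xi, gi, wi in zip(x, g, w)]

# Silence followed by a sudden onset: the gain suppresses the watermark in the
# silent part, which is where a pre-echo would otherwise become audible.
x = [0.0] * 100 + [1.0] * 100
w = [1.0] * 200                    # placeholder watermark signal w(n)
y = embed(x, w, alpha=0.2)
```

With this choice of gain, samples well before the onset are left untouched, while samples inside the loud segment receive the full $\alpha$-scaled watermark.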
Analogous to the frame size, $\alpha$ is a parameter that influences the trade-off between perceptual transparency and detection reliability: very small values of $\alpha$ give perceptual transparency but low detection reliability, whereas very large values give high detection reliability but audible distortions. Several informal adaptive up-down listening tests [10] were performed on a variety of watermarked audio excerpts to extract critical values of $\alpha$. We found that perceptual transparency was achieved by selecting $\alpha$ between 0.15 and 0.25, depending on the audio excerpt.

Figure 1: Overview of watermark embedding algorithm for digital audio. $F$ and $F^{-1}$ indicate Fourier and inverse Fourier transforms, respectively.
3. DETECTION
Figure 2 gives an overview of the watermark detection algorithm. It relies on a cross-correlation procedure between the watermark sequence $W(k)$ and the audio. Experiments revealed that filtering prior to cross-correlation may improve detection reliability significantly. In our detection algorithm, $y(n)$ is filtered with the "equalization" filter $d(n)$ according to:
$$\hat{y}(n) = y(n) \ast d(n), \qquad (3)$$
where $\ast$ denotes convolution and the filter coefficients are $d(n) = [-1\ \ 2\ \ -1]$. The filtered signal is segmented into frames and transformed to the frequency domain to obtain the magnitudes of the Fourier coefficients:
$$\hat{Y}_i(k) = \left| F(\hat{y}_i(n)) \right|, \qquad (4)$$
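where $F$ indicates a Fourier transform operation. For each individual frame, the magnitudes of the Fourier coefficients $\hat{Y}_i(k)$ need to be cross-correlated with every possible shifted version of $W(k)$ to extract the payload. Such a cross-correlation is calculated most efficiently using Fourier transformed signals:
$$\hat{Y}_{i,F} = F(\hat{Y}_i(k)), \qquad W_F = F(W(k)). \qquad (5)$$
The traditional cross-correlation may then be written, with $^{*}$ denoting complex conjugation, as:
$$C_i = F^{-1}\!\left(\hat{Y}_{i,F} \cdot W_F^{*}\right), \qquad (6)$$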
where $C_i$ is the cross-correlation function. Similar to detection procedures in video watermarking [3], the detection performance may be enhanced by using the Symmetrical Phase Only Matched Filtering approach (SPOMF; [9]). In this cross-correlation procedure, only the phase information of the signals $\hat{Y}_{i,F}$ and $W_F$ is used:
$$C'_i = F^{-1}\!\left(P(\hat{Y}_{i,F}) \cdot P(W_F)^{*}\right), \qquad (7)$$
where $P$ is a phase-only operation with $P(x) = x/|x|$ for $x \neq 0$ and $P(0) = 1$. To improve detection reliability even further, $C'_i$ is accumulated over a period of time: $C'_{\mathrm{sum}} = \sum_i C'_i$.
Since $C'_{\mathrm{sum}}$ is normally distributed, its components may be normalized to the standard deviation $\sigma$:
$$C'_n = C'_{\mathrm{sum}} / \sigma, \qquad (8)$$
where $C'_n$ is the normalized cross-correlation function. Its peak value, expressed in standard deviations $\sigma$, is related directly to the detection reliability, whereas its position corresponds to the cyclic shift (payload).
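The correlation stage described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not our implementation: it uses a naive $O(N^2)$ transform instead of an FFT, a short 64-sample sequence, synthetic noisy frames in place of real magnitude spectra, and it estimates $\sigma$ from the accumulated correlation values themselves. All function names are hypothetical.

```python
import cmath
import random

def equalize(y, d=(-1.0, 2.0, -1.0)):
    """Filter y(n) with d(n) = [-1 2 -1], keeping only fully overlapping samples."""
    m = len(d)
    return [sum(d[j] * y[n + m - 1 - j] for j in range(m))
            for n in range(len(y) - m + 1)]

def dft(seq, inverse=False):
    """Naive discrete Fourier transform (adequate for a small demonstration)."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for m, v in enumerate(seq):
            acc += v * cmath.exp(sign * 2j * cmath.pi * k * m / n)
        out.append(acc / n if inverse else acc)
    return out

def phase_only(z):
    """P(x) = x/|x| for x != 0 and P(0) = 1."""
    return z / abs(z) if z != 0 else 1 + 0j

def spomf(y_mag, w):
    """Phase-only cross-correlation of y_mag with every cyclic shift of w."""
    y_f, w_f = dft(y_mag), dft(w)
    prod = [phase_only(a) * phase_only(b).conjugate() for a, b in zip(y_f, w_f)]
    return [c.real for c in dft(prod, inverse=True)]

def normalise(c_sum):
    """Divide by the sample standard deviation (an assumption on how sigma is obtained)."""
    n = len(c_sum)
    mean = sum(c_sum) / n
    sigma = (sum((c - mean) ** 2 for c in c_sum) / n) ** 0.5
    return [c / sigma for c in c_sum]

rng = random.Random(0)
N, SHIFT, FRAMES = 64, 20, 8
W = [rng.gauss(0.0, 1.0) for _ in range(N)]
Ws = W[-SHIFT:] + W[:-SHIFT]                       # cyclically shifted watermark
c_sum = [0.0] * N
for _ in range(FRAMES):
    frame = [v + rng.gauss(0.0, 1.0) for v in Ws]  # synthetic stand-in for one frame
    c_sum = [a + b for a, b in zip(c_sum, spomf(frame, W))]
c_n = normalise(c_sum)
peak = max(range(N), key=c_n.__getitem__)
print(peak, round(c_n[peak], 1))                   # peak position and height in sigma
```

In the noise-free case the phase-only product reduces to a pure linear phase term, so the correlation is a single unit impulse located at the cyclic shift. The `equalize` helper is included only to show the time-domain pre-filter; the synthetic frames above bypass it.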
The detection reliability depends strongly on the number of accumulated frames. In general, cross-correlation functions $C'_i$ need to be added over a period of 2 to 5 sec to exceed a detection threshold of $5\sigma$. This corresponds to a false alarm probability of $2.9 \times 10^{-7}$.
Figure 3 displays a typical cross-correlation function $C'_n$. In this example, a peak value of approximately $13\sigma$ (false alarm probability of approximately $6 \times 10^{-39}$) is detected at position 512.
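The link between a threshold expressed in $\sigma$ and a false alarm probability can be checked numerically. The sketch below assumes a one-sided Gaussian tail for a single correlation value and does not account for testing all cyclic shifts at once; the function name is hypothetical.

```python
import math

def false_alarm_probability(threshold_sigma):
    """One-sided Gaussian tail probability Q(t) = 0.5 * erfc(t / sqrt(2))."""
    return 0.5 * math.erfc(threshold_sigma / math.sqrt(2.0))

for t in (5.0, 11.0, 13.0):
    print(f"{t:4.1f} sigma -> {false_alarm_probability(t):.2e}")
```

Under this assumption, a $5\sigma$ threshold evaluates to roughly $2.9 \times 10^{-7}$, and the probability falls off very rapidly as the peak height increases.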

Figure 2: Overview of watermark detection
4. EXPERIMENTAL RESULTS
In a number of experiments we have examined the robustness of our audio watermark to a wide variety of signal "attacks". The following audio excerpts were used: (i) O Fortuna by Carl Orff, (ii) Success has made a failure of our home by Sinead O'Connor, (iii) Say what you want by Texas and (iv) She works hard for the money by Donna Summer. The 20 sec. audio fragments were sampled at 44.1 kHz (16 bit, mono). Based on up-down listening tests (Section 2) we selected $\alpha = 0.2$ for watermark embedding (Equation 2). All audio excerpts were subjected to the following processing "attacks":
- MP3 Encoding/Decoding at 64 kb/s and 32 kb/s.

Figure 3: Example of cross-correlation function $C'_n$ accumulated over a period of 5 sec. Dashed line indicates detection threshold of $5\sigma$.

Processing was performed in MATLAB and CoolEdit Pro 1.2. The detection results were calculated by accumulating cross-correlation functions $C'_i$ (Equation 7) over periods of 5 sec and averaging the four detection reliabilities.
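To give a feeling for the amount of data behind one detection result, the following sketch counts the frames available in an accumulation period. It assumes non-overlapping 2048-sample frames, which is not stated explicitly in this paper, and it assumes that the correlation noise of different frames is independent, in which case the peak height in units of $\sigma$ grows with the square root of the number of accumulated frames.

```python
import math

SAMPLE_RATE_HZ = 44_100
FRAME_SIZE = 2048

def frames_in(seconds, hop=FRAME_SIZE):
    """Number of complete frames in `seconds` of audio (hop = FRAME_SIZE: no overlap)."""
    n_samples = int(seconds * SAMPLE_RATE_HZ)
    return max(0, (n_samples - FRAME_SIZE) // hop + 1)

def relative_peak_gain(n_frames_a, n_frames_b):
    """Expected peak-height ratio when accumulating b instead of a frames,
    assuming independent noise across frames (sqrt growth)."""
    return math.sqrt(n_frames_b / n_frames_a)

short, long_ = frames_in(2), frames_in(5)
print(short, long_, round(relative_peak_gain(short, long_), 2))
```

Under these assumptions, extending the accumulation period from 2 to 5 sec raises the expected peak height by a factor of about 1.6.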
The results are displayed in Table 1. Unprocessed watermarked audio excerpts result in typical detection reliabilities between approximately $13\sigma$ and $17\sigma$. MP3 compression at very low bit-rates (e.g. 32 kb/s) results in measurements close to the detection threshold of $5\sigma$. The data reveal that detection reliability is affected only marginally by other signal attacks including MP3 compression at 64 kb/s and all-pass filtering.
In general, reliabilities are in the range $11\sigma$ to $17\sigma$, corresponding to a false alarm probability of at most $1.9 \times 10^{-28}$.
Table 1: Detection reliabilities expressed in standard deviations $\sigma$.

5. CONCLUSIONS
Based on existing technology in image and video watermarking, we have developed new algorithms for embedding and detecting watermarks in digital audio. Important characteristics of this new technique were discussed. Key results of this study are:
1. Embedding: The dominant part of the perceptually weighted watermark is derived in the frequency domain by slightly modifying the magnitudes of the Fourier coefficients. An additional scale parameter and time-domain gain function were necessary to refine the watermark. The scale parameter may also be utilized to tune system characteristics such as perceptual transparency and detection reliability.
2. Detection: The SPOMF cross-correlation approach offered a robust technology for blind detection of watermarks in digital audio.
3. Robustness: Our watermark algorithm proved to be robust to a wide variety of signal processing "attacks" such as MP3 (64 kb/s), all-pass filtering, echo addition, speed change, resampling, and noise addition.
With the accomplishments described in this paper, and possible future developments, it is expected that our audio watermarking strategy can support a wide variety of existing (monitoring and copy control) and future applications.
6. REFERENCES
[1] E. Koch, and J. Zhao, 1995, "Towards robust and
hidden image copyright labeling", in Nonlinear Signal
Processing Workshop, Thessaloniki, Greece, pp.
452-455.
[2] A. Piva, M. Barni, and F. Bartolini, 1998, "Copyright
protection of digital images by means of frequency
domain watermarking", Proceedings of SPIE, vol.
3456, pp. 25-35.
[3] T. Kalker, G. Depovere, J. Haitsma, and M. Maes,
1999, "A video watermarking system for broadcast
monitoring", Proceedings of IS&T/SPIE/EI25,
Security and Watermarking of Multimedia Content,
vol. 3657, pp. 103-112.
[4] D. Gruhl, W. Bender, and A. Lu, 1996,
"Echo-hiding", Information hiding: 1st International
Workshop, R.J. Anderson, Ed., vol. 1174 of Lecture
Notes in Computer Science, Isaac Newton Institute,
England, pp. 295-315.
[5] P. Bassia, and I. Pitas, 1998, "Robust audio
watermarking in the time domain", 9th European
Signal Processing Conference (EUSIPCO98), Greece,
pp. 25-28.
[6] M.D. Swanson, B. Zhu, A.H. Tewfik, and L. Boney,
1998, "Robust audio watermarking using perceptual
masking", Signal Processing, vol. 66, pp. 337-355.
[7] I. Cox, J. Kilian, F.T. Leighton, and T. Shamoon,
1996, "A secure, robust watermark for multimedia",
In Proc. of the Information Hiding: First Int.
Workshop, Lecture Notes in Computer Sciences, vol.
1174, R. Anderson, ed., Springer-Verlag, pp. 183-206.
[8] G.F.G. Depovere, T. Kalker, and J.P.M.G. Linnartz,
1998, "Improved watermark detection reliability using
filtering before correlation", Int. Conf. on Image
Processing, ICIP, Chicago IL.
[9] L.G. Brown, 1992, "A survey of image registration
techniques", ACM Computing Surveys, vol. 24, pp.
325-376.
[10] H. Levitt, 1971, "Transformed up-down methods in
psychoacoustics", The Journal of the Acoustical
Society of America, vol. 49, pp. 467-477.