 Research
 Open Access
Incorporation of perceptually adaptive QIM with singular value decomposition for blind audio watermarking
EURASIP Journal on Advances in Signal Processing volume 2014, Article number: 12 (2014)
Abstract
This paper presents a novel approach to blind audio watermarking. The proposed scheme utilizes the flexibility of the discrete wavelet packet transformation (DWPT) to approximate the critical bands and adaptively determines suitable embedding strengths for carrying out quantization index modulation (QIM). Singular value decomposition (SVD) is employed to analyze the matrix formed by the DWPT coefficients and to embed watermark bits by manipulating singular values subject to perceptual criteria. To achieve even better performance, two auxiliary enhancement measures are attached to the developed scheme. Performance evaluation and comparison are demonstrated in the presence of common digital signal processing attacks. Experimental results confirm that the combination of the DWPT, SVD, and adaptive QIM achieves imperceptible data hiding with satisfactory robustness and payload capacity. Moreover, the inclusion of a self-synchronization capability allows the developed watermarking system to withstand time-shifting and cropping attacks.
1 Introduction
In recent years, copyright protection of multimedia data has been of great concern to content owners and service providers. Digital watermarking technology has received much attention for resolving such concerns because it can hide information in multimedia objects (e.g., images, audio, and video) for applications such as intellectual property protection, content authentication, and fingerprinting.
An audio watermarking scheme generally takes four aspects into consideration, namely, imperceptibility, security, robustness, and capacity. The developed scheme must ensure the security and inaudibility of the embedded information while still withstanding malicious attacks, and the payload capacity must be large enough to accommodate the necessary information. Different methods have been attempted in various domains, such as time [1–5], Fourier transform [6–8], cepstral transform [9–13], discrete cosine transform (DCT) [14–17], and discrete wavelet transform (DWT) [14, 16, 18–23].
Compared with transform-domain methods, the time-domain approach is easier to implement and requires less computation. The watermark is usually a pseudo-noise added to the host signal. Alternatively, binary information can be converted into a noise-like signal through the spread spectrum technique. The existence of the watermark can then be verified by measuring the correlation between the pseudo-noise and the watermarked signal. Time-domain methods are usually less robust to digital signal processing attacks unless a long segment along with adequate embedding strength is adopted. In contrast, quantization index modulation (QIM) has been proven a promising technique [24]. Time-domain data embedding is achieved by quantizing parameters derived from the time series. Although QIM generally outperforms the spread spectrum in the time domain, it still needs a long segment for reliable detection. As a consequence, time-domain QIM was mainly used for frame synchronization in many watermarking systems [14, 20, 21, 24]. Aware of the limitations of the time-domain approach, many researchers turned to transform domains, where signal characteristics can be better explored. The embedding intensity as well as the position of the watermark can be selected based upon features extracted in the transform domains [1, 14, 21].
Singular value decomposition (SVD) is a powerful tool for image processing applications [25, 26]. Because the SVD can adapt to various transform domains, it has been extensively applied in audio watermarking [5, 8, 17, 22, 27]. For instance, Abd El-Samie [5] utilized a twofold strategy to embed the watermark. After applying the first SVD to a 2D matrix formed by the audio signal, he blended the intended watermark with the diagonal matrix holding the singular values and then performed the second SVD on the modified matrix. In his design, the matrices containing the left- and right-singular vectors must be conserved in order to extract the watermark. Al-Nuaimy et al. [27] further extended the twofold strategy and applied it, on a segment-by-segment basis, to audio signals transmitted over network systems.
Bhat et al. [22] presented an SVD-based blind watermarking scheme operating in the DWT domain. The watermark bits were embedded into the audio signals using QIM, whose quantization steps were adaptively determined according to the statistical properties of the involved DWT coefficients. The authors claimed that theirs was the first adaptive audio watermarking scheme exploiting both DWT and SVD and that it offered a high payload and superior performance against MP3 compression. Lei et al. [17] embedded a binary watermark into the high-frequency band of the SVD-DCT block, attaining performance generally better than the previous SVD-based methods. Most recently, Lei et al. [28] integrated the lifting wavelet transform (LWT), SVD, and QIM to achieve a very good trade-off among robustness, imperceptibility, and payload. Apart from the above-mentioned methods, there are other audio watermarking schemes applicable to different domains in the literature [29, 30].
Audio watermarks are supposed to be transparent to human ears, meaning that the modification due to watermarking is virtually inaudible. One way to enhance the embedding efficiency is to exploit the auditory characteristics so that the embedding strength is sufficiently high to withstand attacks without introducing audible distortion. The methods presented in [16, 17, 22] demonstrated the benefit of exploiting signal characteristics, but they relied on heuristic rules to decide the embedding strength. Even though some attention was paid to tuning the relevant parameters toward optimal performance, the connection between multiple transform domains and human auditory properties has not been thoroughly addressed.
Because the DWPT possesses multiresolution capability and is more computationally efficient than the Fourier transform, it can cooperate with the psychoacoustic model to render an estimate of auditory masking thresholds [31, 32]. Hence, our aim in this study is to exploit the useful properties of the DWPT, SVD, and QIM for audio watermarking such that the issues of robustness, imperceptibility, and payload capacity can be resolved altogether. In particular, the primary interest is placed on blind watermarking, which does not require the original audio signal to extract the watermark.
2 Derivation of auditory masking threshold in the DWPT domain
Auditory masking is the effect whereby a sound is rendered inaudible by the presence of a louder sound. There are two types of auditory masking. One is spectral masking (sometimes referred to as simultaneous masking), in which a sound is masked by a masker of a different frequency. The other is temporal masking (or non-simultaneous masking), which occurs immediately before and after a sudden stimulus sound.
While studying spectral masking, critical bands are of great importance because they can be employed to elucidate the properties of frequency selectivity [32, 33]. Based upon the theory of perceptual entropy [31–35], this study derives the auditory masking threshold in terms of signal power for each critical band. The derivation begins with the utilization of the DWPT to approximate the critical bands. The procedures for deriving spectral masking thresholds are briefly summarized as follows:

1.
Segment the host audio signal into frames, each of 4,096 samples in length.

2.
Decompose the audio signal using the DWPT according to the specification given in Table 1, in which each packet node approximately corresponds to a critical band. The decomposition is carried out using the Daubechies-8 wavelet. Let c_i^(n) denote the i th DWPT coefficient in the n th band, whose length is N^(n).

3.
Compute the short-term spectrum X_i^(n) in each band by applying the fast Fourier transform (FFT) to c_i^(n), i.e., X_i^(n) = FFT{c_i^(n)}.

4.
Estimate the tonality factor τ to determine whether the band is noise-like or tone-like.
$$\tau =\min\left\{\frac{10{\log}_{10}\left(\mathrm{PM}_{\mathrm{g}}\left({\left|{X}_{i}^{\left(n\right)}\right|}^{2}\right)/\mathrm{PM}_{\mathrm{a}}\left({\left|{X}_{i}^{\left(n\right)}\right|}^{2}\right)\right)}{-25},\phantom{\rule{0.5em}{0ex}}1\right\},$$(1)
where PM_g(|X_i^(n)|^2) and PM_a(|X_i^(n)|^2) stand for the geometric and arithmetic means of |X_i^(n)|^2, respectively.

5.
Adjust the masking level according to the tonality factor.
$${D}_{z}\left(n\right)=\left(\frac{1}{{N}^{\left(n\right)}}{\displaystyle \sum _{i=0}^{{N}^{\left(n\right)}-1}{\left({c}_{i}^{\left(n\right)}\right)}^{2}}\right){10}^{\frac{a\left(n\right)}{10}},$$(2)
where a(n) signifies the permissible noise floor (in dB) relative to the signal in the n th band, formulated as a function of the tonality factor τ.

6.
Extend the masking effect to the adjacent bands by convolving the adjusted masking level with a spreading function SF(n), namely C_z(n) = D_z(n) ⊗ 10^{SF(n)/10}, with SF(n) defined as
$$\mathrm{SF}\left(n\right)=p+\frac{u+v}{2}\left(n+y\right)-\frac{v-u}{2}\sqrt{h+{\left(n+y\right)}^{2}}\phantom{\rule{1.75em}{0ex}}\left(\mathrm{expressed}\phantom{\rule{0.25em}{0ex}}\mathrm{in}\phantom{\rule{0.25em}{0ex}}\mathrm{dB}\right),$$(4)
where p = 15.242, y = 0.15, h = 0.3, u = −25, and v = 30.

7.
Compare the masking threshold C_z(n) with the absolute threshold of hearing in quiet, termed T(n) in decibels. The maximum of the two is selected as the masking threshold, i.e.,
$$\eta \left(n\right)=\mathrm{max}\left\{{C}_{z}\left(n\right),\phantom{\rule{0.5em}{0ex}}{10}^{\frac{\mathrm{T}\left(n\right)}{10}}\right\}.$$(5)
The masking threshold obtained through the above procedure is designated as η(n), which represents the noise power level not detectable by human ears in the n th band.
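As a concrete illustration of step 4, the tonality estimate for one band can be sketched in a few lines of Python (the helper name is ours, and the −25 dB normalization reflects our reading of Equation (1)):

```python
import math

def tonality_factor(power_spectrum):
    # Eq. (1): spectral flatness of |X_i^(n)|^2 (geometric over arithmetic
    # mean, expressed in dB), normalized by -25 dB and clipped at 1.
    n = len(power_spectrum)
    geo = math.exp(sum(math.log(p) for p in power_spectrum) / n)
    arith = sum(power_spectrum) / n
    return min(10.0 * math.log10(geo / arith) / -25.0, 1.0)
```

A flat spectrum gives τ = 0 (noise-like), while a spectrum dominated by a single component saturates at τ = 1 (tone-like).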
3 Frame synchronization
One weakness of existing watermarking methods lies in their vulnerability to time shifting and cropping [14]. Frame synchronization is perhaps the most prevalent countermeasure for such an issue. Many watermarking systems divide the audio signal into two sorts of segments, one for synchronization and the other for watermarking. This study instead resorts to the idea of frequency division, which uses non-overlapping frequency bands to hide the synchronous codes and the information bits separately. Figure 1 illustrates the idea, where the synchronous code is placed in the frequencies below 172 Hz and the information bits are hidden in the critical bands above 172 Hz.
To synchronize the frames, this study utilizes a time-domain QIM that was developed in [36] but is modified to suit the requirements here. The audio signal is deliberately partitioned into frames of length L_f = 8,192 (twice the length used for masking threshold derivation), and each frame is further divided into N_s = 32 subsections. A 32-bit Barker code '11111011101001110100101001001000' [37] is employed for the synchronization task because this code has low correlation with time-shifted versions of itself. Each binary bit is first converted into bipolar form, termed S_b(k) ∊ {−1, 1}, and then embedded into a subsection spanning L_s (≜ L_f/N_s = 256) samples by
where m and $\widehat{m}$ denote, respectively, the original and modified mean values of the subsection, and D is the quantization step, chosen to yield no perceptible distortion.
To achieve the goal of imperceptibility, the quantization step at sample i, designated as D_i, is obtained from the root mean square of the N_p past low-pass-filtered samples:
where x_{lp}(i) is the output of feeding the audio signal through a fourth order Butterworth lowpass filter with the cutoff frequency set at 172 Hz. N_{p} is chosen as 1,536. The scaling factor 10^{−10/20} aims at attenuating the signal power by 10 dB. The purpose of using x_{lp}(i) is twofold. First, it provides an estimate of the signal power for frequency components below 172 Hz. Second, it excludes the disturbance from highfrequency bands where the information bits are located.
Following the derivation of the new mean, the proposed time-domain QIM modifies the audio samples in each subsection using
where M(k) is a function designed to have a flat top in the middle but descend to zero at both ends, i.e.,
The variable υ in Equation (9) is a scaling factor used to attain a mean of unity for M(k), i.e., $\frac{1}{{L}_{\mathrm{s}}}{\displaystyle \sum _{k=0}^{{L}_{\mathrm{s}}-1}M\left(k\right)}=1.$
Based on the analysis given in [21], the QIM via Equation (8) introduces noise with a power level of 7D_i^2/48, which is 8.36 dB below D_i^2. The window M(k) contributes about −0.46 dB to the signal-to-noise ratio (SNR). Combined with the 10 dB reserved in Equation (7), the overall SNR resulting from the watermarking is around 17.9 dB. According to the theory of perceptual entropy [31, 34], the masking threshold for the frequency components below 172 Hz is approximately 16 dB below the signal power regardless of signal tonality. Consequently, the purposely reserved 17.9-dB SNR is sufficient to ensure the imperceptibility of the embedded synchronous code.
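The decibel bookkeeping above is easy to verify numerically (the −0.46 dB window term is taken from the text as given):

```python
import math

qim_gap = 10.0 * math.log10(48.0 / 7.0)  # noise power 7D^2/48 is ~8.36 dB below D^2
window_term = -0.46                       # contribution of the window M(k), per [21]
reserve = 10.0                            # 10-dB attenuation built into the step size
overall_snr = qim_gap + reserve + window_term  # ~17.9 dB
```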
The detection of the synchronization code requires the preparation of a bit sequence $\tilde{b}\left(i\right)$, which is of the same length as the watermarked audio signal and can be derived as
where ${\tilde{m}}_{i}$ denotes the mean computed over a subsection starting from the i th sample, and ${\tilde{D}}_{i}$ corresponds to the −10-dB RMS of the previous N_p low-pass-filtered samples. After acquiring $\tilde{b}\left(i\right)$, the existence of a synchronous code can be identified by examining the cross-correlation between the Barker code S_b(k) and a decimated version of $\tilde{b}\left(i\right)$:
As Equation (11) places the synchronous code in a backward direction, the largest r(i) over an interval of 8,192 samples indicates a salient demarcation between frames. This synchronization marker can be made more prominent by adding two other cross-correlation functions located 8,192 samples away from the current one.
The position of the marker, termed I, is identified simply by picking the largest peak of ${\widehat{r}}_{3}\left(i\right)$ in each interval:
where i_{start} denotes the starting index.
4 Watermarking via SVD
An advantage of SVD-based watermarking is that large singular values change very little under most types of attacks. The proposed watermarking scheme exploits this property by applying the QIM to the gap between the two principal singular values. For each packet node of the DWPT, the N coefficients c_i in a frame are organized as a 2 × N/2 matrix M in the following manner:
Without loss of generality, the superscript (n) previously used to signify a specific band is omitted in the expression. Taking the SVD of M results in M = USV^T, where U is a 2 × 2 real unitary matrix, S is a 2 × N/2 diagonal matrix with nonnegative real diagonal values λ_i in decreasing order, and V^T (the transpose of V) is an N/2 × N/2 real unitary matrix. Alternatively, the matrix M can be written as
where u_i and v_i are the i th columns of the matrices U and V, respectively. The total energy of the N DWPT coefficients is the squared sum of all the elements in M, i.e.,
The same result can be obtained from the singular values, i.e., ${E}_{\mathrm{c}}={\lambda}_{1}^{2}+{\lambda}_{2}^{2}.$
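Because M has only two rows, its singular values follow from the 2 × 2 Gram matrix MMᵀ, which also makes the energy identity easy to check (a sketch; the row-wise arrangement of the coefficients is one possible reading of Equation (14)):

```python
import math

def singular_values_2xn(m):
    # Singular values of a 2 x (N/2) matrix via the eigenvalues of M M^T.
    a = sum(x * x for x in m[0])                # (M M^T)[0][0]
    c = sum(x * x for x in m[1])                # (M M^T)[1][1]
    b = sum(x * y for x, y in zip(m[0], m[1]))  # off-diagonal entry
    disc = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    return (math.sqrt((a + c + disc) / 2.0),
            math.sqrt(max((a + c - disc) / 2.0, 0.0)))

coeffs = [0.9, -0.4, 0.25, 0.6, -0.1, 0.3, 0.05, -0.7]
m = [coeffs[:4], coeffs[4:]]              # a 2 x N/2 arrangement of the coefficients
l1, l2 = singular_values_2xn(m)
energy = sum(x * x for x in coeffs)       # squared sum of all elements of M
```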
Recall that the procedure described in Section 2 provides a masking threshold η, the maximum power variation unperceivable by human ears. The derived threshold can guide us in devising a robust and transparent watermarking scheme. This study proposes embedding a watermark bit w_b into the matrix M by manipulating λ_1 and λ_2 subject to three criteria. First, the overall energy shall remain unchanged. That is,
Criterion 1
where ${\lambda}_{1}^{\prime}$ and ${\lambda}_{2}^{\prime}$ denote the adjusted results of λ_1 and λ_2, respectively. Second, the gap between ${\lambda}_{1}^{\prime}$ and ${\lambda}_{2}^{\prime}$, termed ${g}^{\prime}={\lambda}_{1}^{\prime}-{\lambda}_{2}^{\prime}$, must comply with the QIM rule according to w_b:
Criterion 2
where ⌊ · ⌋ represents the floor function. As for the third criterion, the signal power variation shall not exceed the auditory masking threshold η.
Let M′ denote the matrix restored by substituting the modified singular values into S such that
Because of the constraint imposed by Equation (19), the adjustment of the two singular values satisfies the inequality
and the resulting error energy E_{error} becomes
It is readily seen from Equation (21) that
Ideally, if the error power, i.e., E_{error}/N, falls beneath the masking threshold η, the signal alteration due to watermarking will be inaudible. Such a condition can be expressed as
Criterion 3
Let ${\mathrm{\Delta}}_{max}=2\sqrt{N\eta}$ denote the maximum step size for quantizing the gap between the two singular values without causing perceptible distortion. The modifications of the singular values are written as ${\lambda}_{1}^{\prime}={\lambda}_{1}+{\delta}_{1}$ and ${\lambda}_{2}^{\prime}={\lambda}_{2}-{\delta}_{2}$. Then, the derivation of ${\lambda}_{1}^{\prime}$ and ${\lambda}_{2}^{\prime}$ based on the three criteria becomes straightforward. After substituting ∆_max for ∆ in Equation (19), an equation in the variables δ_1 and δ_2 is formed:
In combination with Equation (18), δ_1 can be solved from a quadratic equation of the form
The relationship among all involved variables is illustrated in Figure 2. After obtaining δ_1, δ_2 is obtained using Equation (25). As Equation (26) usually yields two solutions for δ_1, this study chooses the one with the smaller magnitude. Nevertheless, Equation (26) may render complex roots when (g′)^2 > E_c. Hence, a preventive measure is taken to ensure real roots. It is noted from Equation (19) that the minimum possible value of g′ is 3Δ_max/4 for w_b = 1. In the extreme case where ${\lambda}_{1}^{\prime}={g}^{\prime}=3{\mathrm{\Delta}}_{max}/4$ and ${\lambda}_{2}^{\prime}=0$, ∆_max must satisfy
Consequently, the preventive measure checks whether ${\mathrm{\Delta}}_{max}<\frac{4}{3}\sqrt{{E}_{\mathrm{c}}}$ and, if the inequality does not hold, substitutes $\frac{4}{3}\sqrt{{E}_{\mathrm{c}}}$ for ∆_max. This substitution, in turn, guarantees nonnegative ${\lambda}_{1}^{\prime}$ and ${\lambda}_{2}^{\prime}$.
With the fulfillment of the three criteria, namely Equations (18), (19), and (24), the audio signal maintains its segmental power while executing the QIM. The key factor of the entire process turns out to be η, which subsequently determines ∆_max, ${\lambda}_{1}^{\prime}$, and ${\lambda}_{2}^{\prime}$. Putting the derived ${\lambda}_{1}^{\prime}$ and ${\lambda}_{2}^{\prime}$ into Equation (20) renders a modified matrix M′ with new DWPT coefficients. Once all the involved critical bands are processed, the watermarked signal is attained by taking the inverse DWPT of the modified DWPT coefficients.
Extracting the watermark from the watermarked signal is rather simple. Analogous to the procedure adopted for watermark embedding, the extraction process starts by taking the DWPT of the watermarked audio and then deriving the masking threshold $\tilde{\eta}$ for each packet node. Following the derivation of ${\tilde{\mathrm{\Delta}}}_{max}$ from $\tilde{\eta}$, the watermark bit ${\tilde{w}}_{b}$ can be determined by first calculating
${\tilde{w}}_{b}$ is '1' if γ ≥ 0.5 and '0' otherwise.
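The embedding and extraction rules can be condensed into the following sketch. The QIM lattice (offsets 3Δ/4 and Δ/4 for bits '1' and '0') is inferred from the stated minimum gap of 3Δ_max/4, and the closed form for λ2′ follows from Criteria 1 and 2; treat this as an illustrative reconstruction rather than the authors' exact implementation:

```python
import math

def embed_bit(l1, l2, wb, delta):
    # Quantize the gap g = l1 - l2 onto the lattice for bit wb (Criterion 2)
    # while keeping l1^2 + l2^2 constant (Criterion 1).
    ec = l1 * l1 + l2 * l2
    delta = min(delta, (4.0 / 3.0) * math.sqrt(ec))  # preventive cap on the step
    frac = 0.75 if wb == 1 else 0.25
    g_new = (math.floor((l1 - l2) / delta) + frac) * delta
    l2_new = (-g_new + math.sqrt(2.0 * ec - g_new * g_new)) / 2.0
    return l2_new + g_new, l2_new

def extract_bit(l1, l2, delta):
    # gamma = fractional part of the gap in units of delta; '1' if gamma >= 0.5.
    gamma = ((l1 - l2) / delta) % 1.0
    return 1 if gamma >= 0.5 else 0
```

For example, embedding into (λ1, λ2) = (5, 2) with Δ = 0.8 preserves λ1² + λ2² = 29 and round-trips both bit values.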
5 Further enhancement
The main challenge of the adaptive QIM lies in the presupposition that the quantization steps must be accurately recoverable from the watermarked signal. As seen in Section 4, the quantization step is tied to the masking threshold, whose formulation involves the tonality and power deduced from the signal. During watermark embedding, the QIM inevitably alters the tonality and therefore causes difficulties in retrieving the quantization steps for watermark extraction. A simple way to overcome this problem is to take advantage of the SVD.
Recall from Equation (15) that the SVD decomposes the signal into two parts, namely, ${\lambda}_{1}{\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$ and ${\lambda}_{2}{\mathbf{u}}_{2}{\mathbf{v}}_{2}^{T}$. These two parts become ${\lambda}_{1}^{\prime}{\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$ and ${\lambda}_{2}^{\prime}{\mathbf{u}}_{2}{\mathbf{v}}_{2}^{T}$, respectively, after applying the QIM. As ${\lambda}_{1}^{\prime}$ is always larger than ${\lambda}_{2}^{\prime}$, ${\lambda}_{1}^{\prime}{\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$ can be regarded as the predominant part of the watermarked signal. If the tonality is derived merely from the predominant part, i.e., ${\lambda}_{1}{\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$ in the original signal and ${\lambda}_{1}^{\prime}{\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$ in the watermarked signal, the results remain identical because the scalars λ_1 and ${\lambda}_{1}^{\prime}$ do not affect the tonality. Hence, our first enhancement to the proposed DWPT-SVD scheme is to compute the tonality from ${\mathbf{u}}_{1}{\mathbf{v}}_{1}^{T}$.
Another important factor in the derivation of the masking threshold is the signal power. Although the signal power is deliberately maintained during watermark embedding, attacks such as MP3 compression and noise contamination may alter the segmental power. To alleviate this problem, our second enhancement adopts a low-pass 2D filter to smooth the quantization steps distributed over the plane formed by critical band numbers and frame indices. Figure 3 illustrates the idea of filter smoothing. The filter coefficients are obtained from a rotationally symmetric Gaussian function with a variance of 0.5. The filter size is chosen as 3 × 3 since it offers satisfactory results. Note that the quantization steps computed at the embedding stage shall also be processed by the filter when the second enhancement takes effect; this arrangement ensures an exact restoration of the quantization steps from the watermarked signal.
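The second enhancement amounts to a small 2-D convolution; a sketch in pure Python (edge replication at the borders of the band-frame plane is our assumption):

```python
import math

def gaussian_kernel_3x3(var=0.5):
    # Rotationally symmetric Gaussian with variance 0.5, normalized to unit sum.
    k = [[math.exp(-(x * x + y * y) / (2.0 * var)) for x in (-1, 0, 1)]
         for y in (-1, 0, 1)]
    s = sum(map(sum, k))
    return [[v / s for v in row] for row in k]

def smooth_steps(steps, kernel):
    # 2-D filtering of quantization steps over (critical band, frame index).
    rows, cols = len(steps), len(steps[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), rows - 1)  # replicate edges
                    cc = min(max(c + dc, 0), cols - 1)
                    acc += kernel[dr + 1][dc + 1] * steps[rr][cc]
            out[r][c] = acc
    return out
```

Since the same filter is applied to the steps at both stages, the extractor recovers the smoothed steps exactly when no attack intervenes.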
6 Integration of the entire watermarking system
Figure 4 presents the configuration of the developed watermarking system. The watermark can be an arbitrary binary bit sequence. For the purpose of illustration, we adopt a binary image W(i, j) of size 32 × 32, which contains equal numbers of 0's and 1's. The procedures for embedding the watermark are as follows:

1.
Maintain security by scrambling the image watermark using the Arnold transform [38].

2.
Convert the scrambled image into a bit stream.

3.
Partition the audio signal into frames of size 4,096 samples.

4.
Insert the synchronization codes into the audio signal using the time-domain adaptive QIM presented in Section 3.

5.
For the third to the fifteenth critical bands in each frame

a.
Compute the DWPT coefficients.

b.
Apply SVD to the matrix formed by the DWPT coefficients.

c.
Derive the quantization step.

d.
Embed one binary bit by quantizing the gap between two principal singular values of SVD.

e.
Recompose the DWPT coefficients.


6.
Perform inverse DWPT to obtain the watermarked audio signal.
The watermark extraction is a reverse process. The procedural steps are the following:

1.
Align the frame by tracing the synchronous markers.

2.
For the third to the fifteenth critical bands in each frame

a.
Compute the DWPT coefficients.

b.
Apply SVD on the matrix formed by the DWPT coefficients.

c.
Derive the quantization step.

d.
Quantize the gap between two singular values.

e.
Translate the quantized value into a binary bit.


3.
Gather the bits from all frames.

4.
Convert the bit sequence into an image matrix.

5.
Take the inverse Arnold transform to restore the watermark image, termed $\tilde{W}\left(i,j\right)$.
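The Arnold scrambling used in embedding step 1 and extraction step 5 can be sketched for an N × N binary image as follows (the standard cat-map form with a single iteration; in practice the iteration count can serve as part of the secret key):

```python
def arnold(img):
    # Arnold cat map: (i, j) -> ((i + j) mod N, (i + 2j) mod N).
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[(i + j) % n][(i + 2 * j) % n] = img[i][j]
    return out

def inverse_arnold(img):
    # Inverse map: (i, j) -> ((2i - j) mod N, (j - i) mod N).
    n = len(img)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[(2 * i - j) % n][(j - i) % n] = img[i][j]
    return out
```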
7 Performance evaluation
The test subjects comprised ten 30-s music recordings clipped from randomly chosen CD albums, including vocal arrangements and ensembles of musical instruments. All audio signals were sampled at 44.1 kHz with 16-bit resolution. The performance evaluation covers three aspects: payload capacity, quality assessment, and robustness.
To understand the influence of the two enhancements mentioned in the previous section, the test of the proposed DWPT-SVD adaptive QIM consists of three phases: the proposed scheme alone, the scheme with enhancement 1, and the scheme with enhancements 1 and 2.
Three recently developed SVD-based methods, denoted 'adaptive DWT-SVD' [22], 'SVD-DCT' [17], and 'LWT-SVD' [28], are employed for performance comparison, as they represent other ways of exploiting the SVD for audio watermarking in transform domains. The minimum and maximum quantization steps in the adaptive DWT-SVD are 0.6 and 0.9, respectively, which are the typically suggested values. The parameters α and β controlling the embedding strength in the SVD-DCT are set to 0.125 and 0.1, respectively. For the LWT-SVD method, the decomposition level of the lifting wavelet transform is 4 and the quantization step size is 0.6. The other parameters of these three methods follow their original specifications [17, 22, 28].
7.1 Payload
The theoretical payload capacities of the methods under investigation are presented in Table 2. The LWT-SVD offers the highest capacity. The capacity of the proposed scheme is 13 × 44,100/4,096 ≈ 139.97 bps, which is lower than that of the LWT-SVD but already more than three times that achieved by the adaptive DWT-SVD and the SVD-DCT. It is worth pointing out that the payload capacities listed in Table 2 are computed without considering the demand for synchronous codes. In general, these numbers drop if a watermarking method must allocate extra segments for frame synchronization. One advantage of the proposed synchronization technique is that it only affects the spectrum concentrated in the first two critical bands, leaving the remaining critical bands available for information hiding.
7.2 Quality assessment
The quality disturbance resulting from watermark embedding is assessed using the SNR and the perceptual evaluation of audio quality (PEAQ) [39, 40]. The SNR is defined as
where s(n) and $\tilde{s}\left(n\right)$ are the original and watermarked audio signals, respectively. Since auditory quality is a fundamentally subjective concept that does not necessarily correspond to the measured SNR, this study also resorts to the PEAQ to measure the perceived quality. The PEAQ algorithm simulates human perceptual properties and integrates multiple model output variables into a single metric. It renders an objective difference grade (ODG) between −4 and 0, signifying a perceptual impression ranging from 'very annoying' to 'imperceptible'.
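The SNR referred to above is the conventional ratio of signal power to watermarking-error power; as a sketch:

```python
import math

def snr_db(s, s_tilde):
    # 10 log10 of host-signal power over the power of the embedding error.
    signal = sum(x * x for x in s)
    error = sum((x - y) ** 2 for x, y in zip(s, s_tilde))
    return 10.0 * math.log10(signal / error)
```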
Table 2 also provides the measured SNRs and ODGs for all the watermarked audio signals. The SVD-DCT generally renders the largest SNR, while the proposed scheme produces the lowest. Although the SNRs do not favor the proposed scheme, the resulting ODGs suggest that our scheme indeed achieves the best perceived quality. In fact, the average ODG is around 0 for our scheme, implying that the watermarked signal is nearly indistinguishable from the original one. The average ODGs for the adaptive DWT-SVD and the SVD-DCT are slightly above −1, indicating that the distortion caused by watermarking may still be perceivable. On the other hand, the quality degradation caused by the LWT-SVD seems minor, as the corresponding average ODG is just −0.4. Nevertheless, the ODGs resulting from these three methods are not comparable with ours.
7.3 Robustness test
The robustness test consists of two categories: one is focused on frame synchronization, and the other is concerned with watermark recovery. The attack types considered in this study include the following:

A.
Resampling: conducting downsampling to 11,025 Hz and then upsampling back to 44,100 Hz.

B.
Requantization: quantizing the watermarked signal to 8 bits/sample and then back to 16 bits/sample.

C.
Amplitude scaling: scaling the amplitude of the watermarked audio signal by 0.85.

D.
Noise corruption: adding zeromean white Gaussian noise to the watermarked audio signal with SNR = 30 dB.

E.
Noise corruption: adding zeromean white Gaussian noise to the watermarked audio signal with SNR = 20 dB.

F.
Low-pass filtering: applying a low-pass filter with a cutoff frequency of 8 kHz.

G.
Echo addition: adding an echo signal with a delay of 50 ms and a decay of 5% to the watermarked audio signal.

H.
Jittering: randomly deleting or adding one sample for every 100 samples within each frame.

I.
128-kbps MPEG compression: compressing and decompressing the watermarked audio signal with an MPEG Layer III coder at a bit rate of 128 kbps.

J.
64-kbps MPEG compression: compressing and decompressing the watermarked audio signal with an MPEG Layer III coder at a bit rate of 64 kbps.

K.
Time shifting: shifting the watermarked audio signal by an amount of 50% relative to the frame length.
The efficiency of the proposed synchronization scheme is demonstrated via the statistical means and standard deviations of the ${\widehat{r}}_{3}\left(i\right)$ values discussed in Section 3, along with the misdetection counts of the synchronization markers. As revealed by the results in Table 3, the detection of the synchronous markers is always reliable, indicating that common attacks pose no threat to a watermarking system equipped with this synchronization technique.
The robustness of the proposed watermarking technique in the presence of various attacks is evaluated using the bit error rate (BER), which is defined as
where ⊕ stands for the exclusive-or operator. Table 4 gives the BERs obtained from the watermarked audio signals under the attacks.
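The BER definition amounts to counting exclusive-or mismatches between the embedded and extracted bit sequences:

```python
def bit_error_rate(w, w_tilde):
    # Fraction of positions where the two bit sequences differ (XOR count / N_w).
    return sum(a ^ b for a, b in zip(w, w_tilde)) / len(w)
```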
Generally speaking, all the SVD-based methods exhibit a degree of robustness against most attacks. However, the adaptive DWT-SVD and the LWT-SVD appear vulnerable to amplitude scaling. The reason can be ascribed to the fact that some of the controlling parameters in both methods are fixed; a minor change in amplitude can therefore have disastrous consequences. In contrast, the SVD-DCT and the proposed scheme do not exhibit such a deficiency, as both are designed to adapt to amplitude variation. Besides amplitude scaling, the adaptive DWT-SVD also suffers from resampling, owing to the altered statistical distribution of the DWT coefficients, which eventually leads to inaccurate watermark extraction.
As shown in Table 4, the proposed scheme generally retains very high accuracy under all sorts of attacks, but it seldom reaches 100% correctness. This is because the masking threshold derived from the watermarked signal may differ somewhat from the original one. To ameliorate this drawback, two enhancements were proposed in Section 5. The first enhancement rectifies the inconsistency in the derivation of tonality; as a consequence, the proposed scheme achieves perfect accuracy when no attack is present. Excellent robustness is also observed for attacks such as resampling, amplitude scaling, and low-pass filtering. The second enhancement mitigates the power alterations caused by the attacks. Equipped with the second enhancement, the proposed scheme gains noticeable improvements against all kinds of attacks. More importantly, the changes in SNR and ODG are slight, meaning that the improvement is not obtained at the cost of perceived quality.
7.4 Security
There are several possible ways to enhance watermark security. In [17, 28], the synchronization code was chaotically permuted and the watermark data were scrambled. A similar strategy is certainly applicable to our system. Here, the Arnold transform is chosen to shuffle the watermark image, since this technique has been widely utilized in digital image encryption. Aside from data scrambling, the controlling parameters (e.g., the frame length, the arrangement of the matrix in Equation (14), and/or the selected critical bands) can serve as secret keys. It would be difficult, if not impossible, to detect the watermark without knowing the exact parameters.
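The Arnold transform mentioned above shuffles the pixels of a square watermark image with the area-preserving cat map $(x, y) \mapsto (x + y,\; x + 2y) \bmod N$; the iteration count can act as part of the secret key. A minimal sketch under these assumptions (function names are illustrative, not the paper's code):

```python
import numpy as np

def arnold_scramble(img: np.ndarray, iterations: int) -> np.ndarray:
    """Apply the Arnold cat map (x, y) -> (x + y, x + 2y) mod N to an
    N x N watermark image; `iterations` serves as a secret key."""
    n = img.shape[0]
    assert img.shape == (n, n)
    out = img.copy()
    for _ in range(iterations):
        nxt = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = nxt
    return out

def arnold_unscramble(img: np.ndarray, iterations: int) -> np.ndarray:
    """Invert the map with the inverse matrix: (u, v) -> (2u - v, v - u) mod N."""
    n = img.shape[0]
    out = img.copy()
    for _ in range(iterations):
        nxt = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                nxt[(2 * x - y) % n, (y - x) % n] = out[x, y]
        out = nxt
    return out
```

Because the map matrix has determinant 1, the transform is a bijection on the pixel grid, so unscrambling with the same iteration count recovers the watermark exactly.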
8 Error analysis
Two types of errors can occur during the search for watermarks. The false-positive error (FPE) is the probability of declaring an unwatermarked audio signal to be watermarked, whereas the probability of the opposite condition (classifying a watermarked audio signal as unwatermarked) is known as the false-negative error (FNE).
Following the basic assumptions and derivation rules given in [22], the FPE $P_{fp}$ can be computed as

$P_{fp} = \Pr\left[H\left(W,\tilde{W}\right) \ge T\right] = \sum_{k=T}^{N_w} \binom{N_w}{k} P_e^{k} \left(1-P_e\right)^{N_w-k}, \qquad (31)$

where $H\left(W,\tilde{W}\right)$ denotes the number of matched bits in a total of $N_w$ bits, $T$ is the threshold for claiming the existence of the watermark, and $\binom{N_w}{k}$ stands for the binomial coefficient. $P_e$ is the probability that an extracted bit matches the corresponding original watermark bit. Since the bits extracted from an unwatermarked signal are 0 or 1 with pure randomness, $P_e$ is assumed to be 0.5. As a result, Equation (31) can be further simplified as

$P_{fp} = \frac{1}{2^{N_w}} \sum_{k=T}^{N_w} \binom{N_w}{k}.$
If $N_w = 1024$ and $T = \lceil 0.8 \times N_w \rceil = 820$, then $P_{fp} = 2.62 \times 10^{-88}$, meaning that a false positive can hardly ever occur.
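The simplified expression can be checked numerically with exact integer arithmetic, which sidesteps floating-point underflow in the binomial sum (a minimal Python sketch; the function name is illustrative):

```python
from math import comb

def false_positive_error(n_w: int, t: int) -> float:
    """P_fp = 2^(-N_w) * sum_{k=T}^{N_w} C(N_w, k), evaluated with exact
    integers before a single final division to avoid underflow."""
    tail = sum(comb(n_w, k) for k in range(t, n_w + 1))
    return tail / 2 ** n_w

p_fp = false_positive_error(1024, 820)
print(p_fp)  # on the order of 10^-88
```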
Analogous to the derivation of the FPE, the FNE $P_{fn}$ can be computed as

$P_{fn} = \Pr\left[H\left(W,\tilde{W}\right) < T\right] = \sum_{k=0}^{T-1} \binom{N_w}{k} \left(1-\mathrm{BER}\right)^{k} \mathrm{BER}^{N_w-k}.$
Taking the worst case (where BER = 0.012) in our experiments as an example, the FNE of the proposed scheme is virtually zero.
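The FNE sum can likewise be evaluated exactly; here individual terms like $\mathrm{BER}^{N_w-k}$ underflow double precision, so rational arithmetic is used throughout (a sketch with an illustrative function name, using the worst-case BER = 0.012 from our experiments):

```python
from fractions import Fraction
from math import comb

def false_negative_error(n_w: int, t: int, ber: Fraction) -> float:
    """P_fn = sum_{k=0}^{T-1} C(N_w, k) (1-BER)^k BER^(N_w-k), computed
    with exact rationals so that the tiny terms do not underflow."""
    p = 1 - ber  # per-bit probability of correct extraction
    total = sum(Fraction(comb(n_w, k)) * p**k * ber**(n_w - k)
                for k in range(t))
    return float(total)

p_fn = false_negative_error(1024, 820, Fraction(12, 1000))
print(p_fn)  # virtually zero
```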
9 Conclusion
This paper has presented an efficient audio watermarking technique that integrates the DWPT, SVD, and adaptive QIM subject to the auditory masking effect. While the DWPT decomposes the audio signal into critical bands, the exploration of perceptual entropy leads to the derivation of auditory masking thresholds. These thresholds, in turn, determine the quantization steps required by the QIM. By virtue of the robustness of the SVD technique, the proposed watermarking scheme first assembles the DWPT coefficients into a matrix and then manipulates the singular values to satisfy three criteria. As a result, the embedded watermark is guaranteed to remain below the perceptible level. To further improve the overall performance, this study introduces two auxiliary enhancement measures to ensure the recovery of the quantization steps.
Apart from the data-embedding scheme, the developed watermarking system is equipped with a competent frame synchronization technique to withstand time-shifting attacks. The experimental results reveal that the proposed DWPT-SVD-adaptive QIM scheme performs very well against many attacks, such as resampling, requantization, amplitude scaling, low-pass filtering, jittering, echo addition, white noise contamination, and MP3 compression. The comparison with other SVD-related watermarking methods indicates that our scheme is comparable to, if not better than, the selected methods. Most importantly, the resulting average ODGs of the proposed scheme are around 0, implying that the embedded watermarks and synchronization codes are virtually inaudible to human ears. All these merits can be attributed to the incorporation of the perceptually adaptive QIM with SVD in the DWPT domain.
References
 1.
Swanson MD, Zhu B, Tewfik AH, Boney L: Robust audio watermarking using perceptual masking. Signal Process. 1998, 66(3):337-355. 10.1016/S0165-1684(98)00014-0
 2.
Bassia P, Pitas I, Nikolaidis N: Robust audio watermarking in the time domain. IEEE Trans. Multimedia 2001, 3(2):232-241. 10.1109/6046.923822
 3.
Lie WN, Chang LC: Robust and high-quality time-domain audio watermarking based on low-frequency amplitude modification. IEEE Trans. Multimedia 2006, 8(1):46-59.
 4.
Lemma AN, Aprea J, Oomen W, van de Kerkhof L: A temporal domain audio watermarking technique. IEEE Trans. Signal Processing 2003, 51(4):1088-1097. 10.1109/TSP.2003.809372
 5.
Abd El-Samie FE: An efficient singular value decomposition algorithm for digital audio watermarking. Int. J. Speech Technol. 2009, 12(1):27-45.
 6.
Li W, Xue X, Lu P: Localized audio watermarking technique robust against time-scale modification. IEEE Trans. Multimedia 2006, 8(1):60-69.
 7.
Tachibana R, Shimizu S, Kobayashi S, Nakamura T: An audio watermarking method using a two-dimensional pseudo-random array. Signal Process. 2002, 82(10):1455-1469. 10.1016/S0165-1684(02)00284-0
 8.
Megías D, Serra-Ruiz J, Fallahpour M: Efficient self-synchronised blind audio watermarking system based on time domain and FFT amplitude modification. Signal Process. 2010, 90(12):3078-3092. 10.1016/j.sigpro.2010.05.012
 9.
Li X, Yu HH: Transparent and robust audio data hiding in cepstrum domain. ICME 2000, 1:397-400.
 10.
Li S, Cui L, Choi J, Cui X: An audio copyright protection schemes based on SMM in cepstrum domain. Lect. Notes Comput. Sc. 2006, 4109:923-927. 10.1007/11815921_102
 11.
Liu SC, Lin SD: BCH code-based robust audio watermarking in cepstrum domain. J. Inf. Sci. Eng. 2006, 22(3):535-543.
 12.
Lee SK, Ho YS: Digital audio watermarking in the cepstrum domain. IEEE T. Consum. Electr. 2000, 46(3):744-750. 10.1109/30.883441
 13.
Hu HT, Chen WH: A dual cepstrum-based watermarking scheme with self-synchronization. Signal Process. 2012, 92(4):1109-1116. 10.1016/j.sigpro.2011.11.001
 14.
Wang XY, Zhao H: A novel synchronization invariant audio watermarking scheme based on DWT and DCT. IEEE Trans. Signal Processing 2006, 54(12):4835-4840.
 15.
Yeo IK, Kim HJ: Modified patchwork algorithm: a novel audio watermarking scheme. IEEE Trans. Speech and Audio Processing 2003, 11(4):381-386. 10.1109/TSA.2003.812145
 16.
Wang X, Qi W, Niu P: A new adaptive digital audio watermarking based on support vector regression. IEEE T. Audio Speech 2007, 15(8):2270-2277.
 17.
Lei BY, Soon IY, Li Z: Blind and robust audio watermarking scheme based on SVD-DCT. Signal Process. 2011, 91(8):1973-1984. 10.1016/j.sigpro.2011.03.001
 18.
He X, Scordilis MS: Efficiently synchronized spread-spectrum audio watermarking with improved psychoacoustic model. Research Letters in Signal Processing 2008. 10.1155/2008/251868
 19.
Xiang S, Kim HJ, Huang J: Audio watermarking robust against time-scale modification and MP3 compression. Signal Process. 2008, 88(10):2372-2387. 10.1016/j.sigpro.2008.03.019
 20.
Wang XY, Niu PP, Yang HY: A robust digital audio watermarking based on statistics characteristics. Pattern Recognition 2009, 42(11):3057-3064. 10.1016/j.patcog.2009.01.015
 21.
Wu S, Huang J, Huang D, Shi YQ: Efficiently self-synchronized audio watermarking for assured audio data transmission. IEEE Trans. Broadcast. 2005, 51(1):69-76. 10.1109/TBC.2004.838265
 22.
Bhat KV, Sengupta I, Das A: An adaptive audio watermarking based on the singular value decomposition in the wavelet domain. Digit. Signal Process. 2010, 20(6):1547-1558.
 23.
Chen ST, Wu GD, Huang HN: Wavelet-domain audio watermarking scheme using optimisation-based quantisation. IET Signal Process. 2010, 4(6):720-727. 10.1049/iet-spr.2009.0187
 24.
Chen B, Wornell GW: Quantization index modulation: a class of provably good methods for digital watermarking and information embedding. IEEE Trans. Inform. Theory 2001, 47(4):1423-1443. 10.1109/18.923725
 25.
Liu R, Tan T: An SVD-based watermarking scheme for protecting rightful ownership. IEEE Trans. Multimedia 2002, 4(1):121-128. 10.1109/6046.985560
 26.
Bao P, Ma X: Image adaptive watermarking using wavelet domain singular value decomposition. IEEE Trans. Circuits Syst. Video Technol. 2005, 15(1):96-102.
 27.
Al-Nuaimy W, El-Bendary MAM, Shafik A, Shawki F, Abou-Elazm AE, El-Fishawy NA, Elhalafawy SM, Diab SM, Sallam BM, Abd El-Samie FE, Kazemian HB: An SVD audio watermarking approach using chaotic encrypted images. Digit. Signal Process. 2011, 21(6):764-779.
 28.
Lei B, Soon IY, Zhou F, Li Z, Lei H: A robust audio watermarking scheme based on lifting wavelet transform and singular value decomposition. Signal Process. 2012, 92(9):1985-2001.
 29.
Zezula R, Misurec J: Audio digital watermarking algorithm based on SVD in MCLT domain. ICONS 2008, 140-143.
 30.
Dhawan A, Mitra SK: Hybrid audio watermarking with spread spectrum and singular value decomposition. INDICON 2008, 11-16.
 31.
Carnero B, Drygajlo A: Perceptual speech coding and enhancement using frame-synchronized fast wavelet packet transform algorithms. IEEE Trans. Signal Processing 1999, 47(6):1622-1635. 10.1109/78.765133
 32.
He X, Scordilis MS: An enhanced psychoacoustic model based on the discrete wavelet packet transform. J. Franklin Inst. 2006, 343(7):738-755. 10.1016/j.jfranklin.2006.07.005
 33.
Painter T, Spanias A: Perceptual coding of digital audio. Proc. IEEE 2000, 88(4):451-515.
 34.
Johnston JD: Estimation of perceptual entropy using noise masking criteria. Proc. IEEE Int. Conf. Acoust. Speech Signal Process. 1988, 2524-2527.
 35.
Johnston JD: Transform coding of audio signals using perceptual noise criteria. IEEE J. Select. Areas Commun. 1988, 6(2):314-323. 10.1109/49.608
 36.
Hu HT, Yu C: A perceptually adaptive QIM scheme for efficient watermark synchronization. IEICE T. Inf. Syst. 2012, E95-D(12):3097-3100. 10.1587/transinf.E95.D.3097
 37.
Gentry SM: Detection Optimization Using Linear Systems Analysis of a Coded Aperture Laser Sensor System: Sandia Report. Albuquerque: Sandia National Laboratories; 1994.
 38.
Arnold VI, Avez A: Ergodic Problems of Classical Mechanics. New York: Benjamin; 1968.
 39.
ITU Radiocommunication Sector (ITU-R): Recommendation BS.1387: Method for Objective Measurements of Perceived Audio Quality. Geneva: International Telecommunication Union; 1998.
 40.
Kabal P: An Examination and Interpretation of ITU-R BS.1387: Perceptual Evaluation of Audio Quality. TSP Lab Technical Report. Montréal: Department of Electrical and Computer Engineering, McGill University; 2002.
Acknowledgements
This work was supported by the National Science Council, Taiwan, ROC, under grants NSC1012221E197033 and NSC1022221E197020.
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Hu, H.-T., Chou, H.-H., Yu, C. et al. Incorporation of perceptually adaptive QIM with singular value decomposition for blind audio watermarking. EURASIP J. Adv. Signal Process. 2014, 12 (2014). https://doi.org/10.1186/1687-6180-2014-12
Keywords
 Singular value decomposition
 Discrete wavelet packet transform
 Adaptive quantization index modulation
 Auditory masking threshold
 Frame synchronization