Reports on Progress:

Validation of Quality Adviser

A separate experiment was undertaken to validate the predictors and the Quality Adviser. New programme material was used to avoid any bias that would arise from reusing the material originally used to calibrate the predictors. For the same reason, a new group of listeners was used. Fig. 1 shows the results of the validation.

figure: Validation

Fig. 1 Results of validation (scores averaged across all listeners)

The predicted results match the actual results of the listening test well. The correlation between the actual and the predicted scores is high (0.923). Although the maximum absolute error between the scores is large (26 points on the 100-point scale), the average error is relatively small, at only 10 points. Given the difficulty of subjective evaluation of audio quality, which involves many human-related factors, the small average error between the predicted and actual scores supports the validity of the developed predictors and the Quality Adviser.
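The three figures of merit quoted above (correlation, maximum absolute error and average error) can be computed as in the following minimal sketch. It assumes that the correlation is Pearson's coefficient and that the "average error" is the mean absolute error; the report does not state either explicitly. The function name and the example scores are hypothetical and are not data from the experiment.

```python
from math import sqrt


def validation_metrics(actual, predicted):
    """Compare actual and predicted scores on the 100-point scale.

    Assumption: 'correlation' is Pearson's r and 'average error' is the
    mean absolute error. Neither definition is confirmed by the report.
    """
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equally long lists with at least 2 scores")

    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    # Sums of products of deviations from the means.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant scores")

    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return {
        "correlation": cov / sqrt(var_a * var_p),
        "max_abs_error": max(abs_errors),
        "mean_abs_error": sum(abs_errors) / n,
    }


# Hypothetical scores, for illustration only.
actual_scores = [100, 80, 60, 40, 20]
predicted_scores = [90, 85, 50, 45, 30]
metrics = validation_metrics(actual_scores, predicted_scores)
```

Reporting the maximum and the mean error side by side, as the text does, shows whether a high correlation hides a few large individual deviations.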

A more detailed analysis of the results shows that the predictors are more accurate for 'music' items than for 'movie', 'sport' or 'ambient' sounds. For example, Fig. 2 shows the validation results for 'music' items only (the remaining items were excluded from the analysis). In this case the results are better than for the full set: the correlation between the predicted and the actual scores is 0.96, whereas the maximum and the average errors are 18 and 7.3 points respectively.

figure: Validation

Fig. 2 Results of validation for 'music' items only (scores averaged across all listeners)

Details

The following multichannel audio items were used in the validation experiment:

The terms F-B and F-F refer to a spatial scene of a multichannel audio recording.

The evaluated programme material was processed using the following algorithms:

Unprocessed material was also included in the selection of the evaluated items (Hidden References).

The experiment was undertaken using the MUSHRA paradigm.

Seventeen trained listeners took part in the experiment. However, the data from two listeners (1 and 4) were removed in post-screening because of their poor performance, particularly in grading the Hidden Reference, as demonstrated in Fig. 3 (the Hidden Reference is expected to be graded as 100).

figure: Validation

Fig. 3 Evaluation of listeners' performance (numbers near circles are used to identify listeners)
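Post-screening on the Hidden Reference can be sketched as follows. The report does not state the rejection criterion that was actually applied, so the rule below is only an illustrative assumption: a listener is flagged when the Hidden Reference is graded below `min_score` in more than `max_fraction` of that listener's trials. The function name, both thresholds and the example scores are hypothetical.

```python
def screen_listeners(hidden_ref_scores, min_score=90.0, max_fraction=0.15):
    """Return the IDs of listeners whose Hidden Reference grades are too low.

    hidden_ref_scores maps a listener ID to the list of grades that the
    listener gave to the Hidden Reference (ideally 100 in every trial).
    The thresholds are illustrative assumptions, not the values used in
    the experiment described above.
    """
    rejected = []
    for listener, scores in hidden_ref_scores.items():
        if not scores:
            continue  # no Hidden Reference grades, nothing to judge
        low = sum(1 for s in scores if s < min_score)
        if low / len(scores) > max_fraction:
            rejected.append(listener)
    return sorted(rejected)


# Hypothetical grades, for illustration only.
example_scores = {
    1: [60, 75, 100, 80],
    2: [100, 100, 95, 100],
    4: [50, 90, 70, 85],
    5: [100, 88, 100, 100, 100, 100, 100, 100],
}
rejected_listeners = screen_listeners(example_scores)
```

With these hypothetical grades, listener 5 is kept despite one low grade, because a single lapse in eight trials stays under the assumed fraction.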