ACM Multimedia 97 - Electronic Proceedings

November 8-14, 1997

Crowne Plaza Hotel, Seattle, USA


An Improved Auditory Interface for the Exploration of Lists

Ian J. Pitt
Institut für Simulation und Graphik,
Otto-von-Guericke-Universität,
39016 Magdeburg,
Germany
+49 391 671 2189
pitt@simsrv.cs.uni-magdeburg.de
http://isgwww.cs.uni-magdeburg.de/isg/frames/pitt.html

Alistair D.N. Edwards
Department of Computer Science,
University of York,
Heslington, York YO1 5DD
UK
+44 1904 432 775
alistair@minster.york.ac.uk
http://www.cs.york.ac.uk/~alistair



Abstract

Synthetic speech is widely used to enable blind people to receive output from computer systems. However, speech is slow to use compared with vision and places far higher demands on short-term memory. These problems are particularly apparent when exploring large data structures such as lists and tables. An experiment was conducted in which subjects were asked to memorise and recite lists of filenames. Analysis showed that subjects organised the material to aid recall and used a range of prosodic devices to convey this organisation to the listener. A practical list-reading program was developed which replicates as far as possible the methods of organisation and spoken presentation used by the subjects. This program was evaluated in a practical task, in comparison with an existing speech-based system for blind users. The results showed that subjects performed the task significantly faster using the demonstration system and also reported lower levels of effort and mental demand.


Keywords

Synthetic speech, Blindness, Prosody, Intonation



Introduction

Lists are widely used in computer systems (for example, in menus of commands, directory listings, etc.), and the ability to scan a list and select items from it is therefore essential for successful computer operation. Whilst this task is comparatively simple using visual cues, it is vastly more difficult when the list is accessed through speech, as in the case of a blind person using a computer with the aid of a speech-synthesiser.

The problem is even more acute when the task is not one of simple recognition but includes an element of recall. Tasks which involve searching a list for a group of related items, or looking for a particular item among a group (such as the most recently updated) require that the user holds information on several items simultaneously. This is comparatively simple when the list is presented visually: having identified the items for comparison the user can shift the focus of attention rapidly between them. Thus the screen or printed surface acts as a form of external memory and the amount of information which has to be stored in the user's own memory is kept to a minimum. When a similar task has to be performed with a spoken list, the user is forced to hold all the necessary information in memory. This increases the difficulty of the task and imposes much tighter constraints on the number of items which can be compared, hence considerably increasing the time required for such tasks.

Serial Recall of Filenames

In order to establish a baseline for further experiments, an initial experiment was conducted to find out how the memorability of a spoken list of filenames varies with list-length.

The filenames used were modelled on those found in the DOS operating system (which, at the time the research began, was probably the operating system most widely-used by blind people). Thus each filename comprised a main section of up to eight characters followed by a full-stop and up to three more characters. Forty-eight different filenames were used; they are listed in [Pit96].

A program was written which allowed the 48 filenames to be displayed in random order on a computer screen. The filenames could be displayed in groups of either two, three or four, with the program waiting after each group until a key was pressed before displaying the next group. The computer was equipped with a Hal screen-reader and an Apollo II speech synthesiser. The Hal/Apollo combination was set to its default values, and no attempt was made to adjust its intonation or pausing.

Eighteen sighted undergraduate students took part in the experiment. All had at least some knowledge and experience of computers. Each subject sat at a table with the speech synthesiser directly in front of them, the computer keyboard nearby and a pad of paper to hand. They were told that nothing would happen until they struck a key on the computer keyboard, after which they would hear one group of filenames. They were asked to listen to each group until the voice finished speaking and then write down whatever they could remember of the speech. They could then strike a key to hear the next group whenever they were ready. They were allowed a few practice trials (using a different set of filenames) and were then asked to start the recorded trials.

The eighteen subjects were divided into three groups of six. Every subject heard all 48 filenames, but the first six subjects heard them presented in pairs, the second six heard them presented in groups of three, and the final six heard them presented in groups of four.

The filenames written down by the subjects were compared with the contents of a file generated by the program and the number of correctly-recalled filenames noted. Many of the filenames written down by the subjects were incorrectly spelt, but where they were phonologically identical to the spoken string they were counted as correct.

Number of filenames in group    Percentage of filenames correctly recalled
2                               76.7%
3                               49.7%
4                               20.5%

Table 1 Percentage of DOS-style filenames correctly recalled when presented through synthetic speech in groups of 2, 3 or 4.

The results of this experiment are shown in Table 1. It can be seen that even when the subjects heard only two filenames at a time they recalled on average only 76.7% of them correctly. When the filenames were presented in groups of three the recall rate dropped to below 50%, and when the filenames were presented in groups of four the recall rate was only just over 20%. The results were analysed using the Jonckheere trend test and it was found that these results are significant (p < .01), indicating that the fall in recall rates with increasing group size was directly related to the change in group size.

These results suggest that rendering lists of filenames directly into speech - as current speech adaptations for blind people do - places considerable demands on the user.

Recall and Recitation of Lists

It has long been recognised that organisation aids memorisation and learning, and that human beings strive to identify patterns and meaning in any information presented to them. Bartlett recognised this when he coined the phrase "Effort after meaning" to describe the way in which subjects performing many different tasks seek to organise data in order to make it more memorable and easier to work with [Bar32]. Other work has demonstrated clearly that this principle applies to spoken information too (for example, [Jen52]; [Tul66]; [Bow69]; [Bro78]).

An experiment was conducted in order to establish what kinds of organisation might aid the recall of a spoken directory-listing. A group of subjects were given a number of pieces of card, each of which contained the name of an imaginary file. They were asked to sort the cards into any order they liked, memorise them, and then recite the filenames from memory into audio recording equipment. Sixteen DOS-style filenames were used. They are listed in Table 2.

Eight sighted subjects took part in the experiment. All were undergraduate students and all had at least some experience of using computers. Most were regular users of wordprocessing packages running under DOS or Windows on PC-compatible computers.

FILESORT.EXE
FILESORT.C
FILESORT.OBJ
FILESORT.H
YELLOW.EXE
BLUE.EXE
GREEN.EXE
RED.EXE
MAIL.NEW
MAIL.OLD
THESIS.DOC
THESIS.OLD
THESIS.BAK
NOTES.DOC
REVIEW.TXT
NORMAL.EXE

Table 2 The sixteen DOS-style filenames which subjects were asked to arrange in order and then recite from memory.

The cards containing the filenames were shuffled thoroughly before each trial. The subjects were allowed as much time as they needed to sort the filenames and memorise them. No hints of any kind were given, and when any of the subjects asked questions they were told to do whatever they felt appropriate. Once subjects had memorised the list they were asked to recite it aloud, at their own pace, into a microphone which was connected to recording equipment. They were told that the aim was to recite the directory lists in such a way that they would be easily understood by a blind person who needed to learn the contents of the directory. When the recording was complete, subjects were asked what conscious decisions they had made regarding the ordering of the filenames and invited to comment on the process of memorising and recalling the filenames.

The recordings were transcribed and the order in which each subject had recited the filenames was noted. Filenames which were recalled incorrectly or forgotten entirely were noted down. The detailed results are given in [Pit96].

The speaking style used by the subjects was also analysed. For each spoken syllable, a subjective estimate was made of the relative pitch compared with the syllables which preceded and followed it. The length of the pauses was estimated and in some cases timed using a stop-watch. Having identified a number of trends in the data, a few of the recordings were examined in more detail using a CDP workstation. This is a sound-processing system which includes routines to determine the fundamental pitch and harmonic composition of sounds and to allow the extraction of timing data.

Analysis of the results produced a number of findings:

These results suggest that organisation is, as expected, an important aid in the recall of filenames, and that quite simple forms of organisation can be helpful in the memorisation of filenames. Some of the linking attributes used by the subjects (such as the linking of colour names with "normal" or "notes") were highly subjective and could not easily be incorporated into a set of rules. However, other forms of organisation used by the subjects could easily be implemented in software, for example, the grouping of filenames which share a common string.

The high degree of consistency in the prosodic structures used by the subjects suggests that prosody is also important. The prosodic features observed - notably the use of pitch and pausing to delineate groups, and the use of pitch to distinguish between recurring and unique elements within groups - are broadly in line with what we might expect in the light of previous linguistic research findings.

The tendency of speakers to divide strings of speech into shorter sections marked by pausing and pitch changes has been widely noted and studied. Analysis has shown that the pitch of the voice usually rises fairly rapidly at the start of a clause, declines slowly throughout the clause, and then falls sharply at the end. Among the first to identify this pattern were 't Hart & Cohen ['tHa73] who christened it the 'hat' pattern on account of its shape. Where several clauses follow one another to form a sentence, the pattern is repeated for each clause but with a steady reduction in the average pitch of successive clauses. A number of studies (for example, [Nak78]; [Str78]) have shown that this rise and fall in pitch helps the listener segment the speech stream into phrases, and that missing or inappropriate pitch cues increase the time required to assimilate the speech and/or increase the likelihood of errors.

The use of pitch to indicate which parts of filenames were shared by several items and which were unique corresponds with the distinction between 'new' and 'given' information in speech. This distinction was identified by, among others, Halliday ([Hal63], [Hal67a], [Hal67b], [Hal70]). Drawing upon extensive analyses of recorded speech, he argued that intonation is used not only to indicate the boundaries between spoken clauses but also to indicate the position of the most important information within each clause. According to his analysis, each clause contains one item of information which the speaker regards as being important because it is 'new' to the listener. Intonation is used to indicate the position of this item. Halliday uses the term pitch-prominence to describe the marking of 'new' information in this way.

In the light of these research findings and the results obtained from the experiment, it appears likely that organising a list of filenames into suitably-defined groups and using prosodic cues to highlight this organisation will make it easier for a listener to assimilate the information. It also seems likely that dividing a list of filenames into smaller groups and allowing users to explore the list using keyboard commands will in itself make the list more memorable. Baddeley notes that performing any action while exploring a list tends to enhance recall (Baddeley, 1993). He suggests that this is because the association between the action and the response produces a richer memory trace which is easier to recover. Therefore we might expect that making individual or small groups of items available in response to keyboard actions will make them more memorable than the same items heard as a continuous list without intervention from the user.

Investigating the Role of Prosodic Structure and New/Given Information in the Recall of Filename Lists

The use of prosody in natural speech has been extensively analysed and documented and its importance in speech perception is widely recognised. However, the role played by individual features of prosody in speech perception is still under investigation, and there is little information to guide the designer who wishes to add effective prosody to synthetic speech. A further problem is that most current speech-synthesis systems offer fairly limited prosody - control of pausing and rhythm is generally adequate but the range of discrete pitch levels which can be synthesised is often small and there is usually little scope for applying appropriate pitch-contours to individual syllables. In view of this, one might ask whether adding prosody to synthetic speech is of any real value: is it likely that crudely-implemented prosody based upon incomplete models of natural speech will be of benefit to the user?

Another issue of concern in this particular case is that of repetition. Lists of filenames represent a slightly unusual case in that they may contain a high percentage of repeated elements. Is it enough to mark these (through pitch) as 'given' information, or might it be better to remove them altogether? For example, given the list:

PAPER.OLD
PAPER.DOC
PAPER.RTF

one could recite it in full with the extensions emphasised through pitch, e.g.:

paper dot old, paper dot doc, paper dot rtf

or one could remove the repeated string 'paper' and recite the list as:

paper dot old, dot doc, dot rtf

making it clear through prosody that the three extensions share the same prefix string.
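The two readings above can be sketched in a few lines of Python. This is only an illustration of the idea, not the program used in the experiments: the function name `spoken_forms` and the `drop_given` flag are hypothetical, and the sketch assumes that the filenames have already been sorted so that names sharing a prefix are adjacent.

```python
def spoken_forms(filenames, drop_given=True):
    """Build the phrase to be spoken for a sorted run of filenames.

    When drop_given is True, a base name that repeats the previous one
    is treated as 'given' information and is not spoken again.
    """
    phrases = []
    previous_base = None
    for name in filenames:
        base, dot, ext = name.lower().partition(".")
        ext_phrase = f"dot {ext}" if dot else ""
        if drop_given and ext_phrase and base == previous_base:
            # Repeated prefix: speak only the 'new' part (the extension).
            phrases.append(ext_phrase)
        else:
            phrases.append(f"{base} {ext_phrase}".strip())
        previous_base = base
    return ", ".join(phrases)


print(spoken_forms(["PAPER.OLD", "PAPER.DOC", "PAPER.RTF"]))
print(spoken_forms(["PAPER.OLD", "PAPER.DOC", "PAPER.RTF"], drop_given=False))
```

The sketch deals only with the wording; the prosody that signals the shared prefix has to be added separately.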

An experiment was conducted to find out how recall of a group of filenames is affected by the presence or absence of prosody and by the repetition of 'given' information. Eight subjects were presented with lists of filenames in the form of synthetic speech and asked to recall as many of the filenames as possible. Two conditions were tested. In the first, filenames were presented in full with no added prosodic variation. In the second, repeated 'given' information was removed and prosody was added.

For the purposes of the experiment, a program was written which selects pre-defined groups of filenames at random and sends the filename-strings to a speech-synthesiser, adding prosody where necessary. The prosody was generated in accordance with the following rules:

  • Each filename is treated as a separate clause, in which the 'new' information (e.g., 'EXE') is presented at a higher pitch than the 'given' information (e.g., 'dot').

  • Successive 'new' syllables are alternately pitched high and low (although all are pitched higher than the associated 'given' syllables) mimicking the rise and fall often associated with lists in natural speech.

  • Each group of filenames is treated as a separate sentence characterised by steadily declining pitch. The 'new' information in the first clause is presented at the highest pitch, after which there is a steady fall in the average pitch of successive clauses and a reduction in the pitch-interval between the lowest pitch (that of the 'given' information) and the highest pitch (that of the 'new' information).

  • The rhythm is arranged such that the first (usually the only) syllable of each item of 'new' information falls on a regular strong beat.

The speech-synthesiser used was an Apollo II, as in the earlier experiment. Under the 'added prosody' condition, text strings were sent as separate syllables interleaved with pitch commands, and the timing of each syllable was controlled by the program. Under the 'no added prosody' condition, output was written direct to the computer screen from where it was copied to the Apollo by a Hal screen-reading program. The Hal/Apollo combination was set to its default values, and no attempt was made to adjust its intonation or pausing.
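As a rough illustration of how the prosody rules listed above might be turned into a per-syllable plan, the following Python sketch assigns a pitch and an onset time to each syllable of a group of extensions. Everything numeric here is an assumption made for the example (the pitch scale in semitones, the starting interval, the decay factor, the beat period), as are the names `Syllable` and `plan_prosody`. The actual pitch commands of the Apollo II are not shown, and the high/low alternation is simply placed inside a declining envelope, which is one possible reading of the rules.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str
    pitch: float  # semitones above the 'given' baseline (assumed unit)
    onset: float  # seconds from the start of the group


def plan_prosody(extensions, given_pitch=0.0, top_interval=8.0,
                 decay=0.8, low_fraction=0.6, beat=0.9, upbeat=0.3):
    """Return a list of Syllable entries for one group of extensions."""
    plan = []
    for i, ext in enumerate(extensions):
        # The interval between 'given' and 'new' pitch shrinks clause by clause.
        interval = top_interval * decay ** i
        # Successive 'new' syllables alternate high and low, but always
        # stay above the pitch of the 'given' syllable.
        if i % 2 == 1:
            interval *= low_fraction
        # The 'new' syllable falls on a regular strong beat; 'dot' is an upbeat.
        strong_beat = upbeat + i * beat
        plan.append(Syllable("dot", given_pitch, strong_beat - upbeat))
        plan.append(Syllable(ext.lower(), given_pitch + interval, strong_beat))
    return plan


for syllable in plan_prosody(["OLD", "DOC", "RTF"]):
    print(syllable)
```

A real implementation would translate each entry into whatever pitch and timing commands the synthesiser accepts.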

    One hundred DOS-style filenames were used in the experiment. They were divided into twenty groups of filenames comprising four groups each of three, four, five, six, and seven. Each group was different from the others, but within each group all the filenames were identical except for the extension. Each subject heard 50 filenames (two groups each of three, four, five, six, and seven) under each condition.

    The procedure for the experiment was as follows. Each subject sat at a table with the speech synthesiser directly in front of them, the computer keyboard nearby and a pad of paper to hand. They were told that nothing would happen until they struck a key on the computer keyboard, after which they would hear one group of filenames. They were asked to listen to each group until the voice finished speaking and then write down whatever they could remember of the speech. They could then strike a key to hear the next group whenever they were ready. They were allowed a few practice trials (using a different set of filenames) and then started the recorded trials. When they had finished they were asked to record their impressions of the difficulty of the task using a questionnaire based upon the NASA TLX (Task Load Index). The TLX comprises a small number of scales upon which subjects can be asked to indicate their subjective impression of various aspects of task load. The researchers who developed the TLX compared a large number of possible subjective scales but found that many of them produced scores which showed high correlations, suggesting that they were sampling the same impression. The scales included in the final version of the TLX were found to measure separate aspects of task load and to be both robust and sensitive. The TLX is described in detail in [Har88].

    Once the subjects had completed the procedure under one condition, they were asked to repeat the task under the other condition. Upon completion, they were again asked to complete TLX questionnaires. They were also asked to indicate which of the two conditions they had found most pleasant to use by rating them on a single scale: the scale was marked from -10 to +10 with a score of zero indicating no preference.

    The order in which subjects undertook the two conditions was varied so that half the subjects undertook them in one order while the other half undertook them in the reverse order. The two different sets of filenames were each presented under one condition for half the trials and under the other condition for the other half of the trials, thus ensuring that any differences in the inherent memorability of the filenames did not bias the results.

    The filenames written down by the subjects were compared with the contents of a file generated by the program, and the number of correctly-recalled filenames noted. Many of the filenames written down by the subjects were incorrectly spelt, but where they were phonologically identical to the spoken string they were counted as correct. The percentage of filenames correctly recalled under each condition is shown in Table 3. The full results are given in [Pit96].

    No. of filenames   Condition 1: new + given        Condition 2: new information   Averages
    in group           information, no prosody added   only, prosody added
    3                  81.3%                           84.4%                          82.9%
    4                  70.0%                           67.5%                          68.8%
    5                  72.9%                           68.75%                         70.8%
    6                  48.2%                           58.93%                         53.6%
    7                  46.9%                           50.0%                          48.5%
    Averages           63.9%                           65.9%

    Table 3 Percentage of filenames recalled when spoken either in their entirety without prosodic variation, or with given information removed and timing and intonation cues added.

    The first point to note from these results is that organisation into sorted groups clearly aids recall of filenames. This can be seen when the results obtained in this experiment are compared with those obtained in the serial-recall experiment described earlier (see Table 4). In that experiment, the highest recall rate obtained was only 76.7%, and that occurred when there were only two filenames in each group. In this experiment a higher recall rate - 82.9% (averaged across the two conditions) - was obtained even when there were three filenames in each group. Comparing groups of the same size shows dramatic differences. In this experiment the recall rate dropped only to 68.8% with groups of four filenames, whereas the recall rate for groups of four unrelated filenames in the earlier experiment was just 20.5%.

    While organising the filenames into groups clearly had a significant effect on recall, the introduction of prosody and the removal of given information seems to have had a smaller effect. It can be seen that there is little pattern to the differences between the figures for the smaller group sizes in Table 3. However, condition 2 produced higher recall rates with group sizes of six and seven filenames. In order to determine if any of the differences are significant, the results were analysed using a related-measures t-test. The score obtained for each group size under condition 2 was compared with the result obtained for the same group size under condition 1. The lower group sizes show no significant differences, but the difference in the results obtained with groups of six filenames is significant (p < .05). The difference obtained with groups of seven filenames shows a strong positive value but falls slightly short of the figure needed for significance.

    No. of filenames in group    Mixed groups    Sorted groups
    2                            76.7%           -
    3                            49.7%           82.9%
    4                            20.5%           68.8%
    5                            -               70.8%
    6                            -               53.6%
    7                            -               48.5%

    Table 4 Percentage of filenames recalled when presented either in mixed groups (no shared elements present) or in sorted groups (all filenames share a common element).

    The TLX scores shown in Table 5 reveal a clear preference for the condition in which the filenames were presented as new information only with prosody. The scores for mental demand, effort expended, time pressure and frustration are all higher under the condition in which all information was presented without prosody (although the difference in the case of time pressure is very small) while the performance score is lower under this condition.

    In order to determine if any of the differences are significant, the results were analysed using a Wilcoxon test. The score obtained for each TLX parameter under condition 2 was compared with the result obtained for the same TLX parameter under condition 1. This analysis shows that the reduction in time pressure was not significant and neither was the improvement in the performance score. However, the reductions in mental demand, effort and frustration were all found to be significant (p < .025).

    TLX Parameter    Condition 1: new + given        Condition 2: new information   Averages
                     information, no prosody added   only, prosody added
    Mental Demand    7.5                             6.4                            6.9
    Effort           5.0                             4.2                            4.6
    Time Pressure    6.0                             5.8                            5.9
    Frustration      6.5                             5.3                            5.9
    Performance      2.5                             3.3                            2.9
    Averages         5.5                             5.0

    Table 5 Subjects' assessments of task load when asked to recall filenames spoken either in their entirety without prosodic variation, or with given information removed and timing and intonation cues added.

    These results are backed up by the score for overall preference. When asked to indicate their preference for one of the two conditions on a continuous scale, a majority of subjects indicated that they preferred the condition in which the filenames were spoken as new information only with prosody. One subject indicated no preference either way and one preferred the other condition. The overall result was a score of 2.5 for the condition in which the filenames were spoken as new information only with prosody.

    These findings support the view that adding appropriate prosody and removing given information reduces the amount of effort required on the part of the listener, and may also aid recall for some types and sizes of presented string.


    Designing an Improved Directory-Listing Command for Speech Output

    The findings from the experiments described above were used to design a program which presents lists of DOS-style filenames through synthetic speech and non-speech sounds. This program was designed as a direct replacement for the standard DOS DIR command and forms part of DOStalk, a replacement DOS shell for blind computer-users (see [Pit96]). It was designed with the following principles in mind:

  • The filenames should be sorted into groups of items which share common attributes.

  • The groups should contain a maximum of six filenames and preferably fewer.

  • The groups should be structured into a hierarchy.

  • Users should be able to explore the structure interactively, moving quickly and easily from group to group and between different levels of view.

  • Sound cues should be provided to remind the user where they are within the structure at each stage of exploration.

  • The speech should exclude redundant information wherever possible and should use appropriate rhythm and prosody to indicate the positions of the most important information.

    Sorting Routines

    As noted earlier, some of the linking attributes used by the subjects were highly subjective and cannot easily be incorporated into a set of rules. However, other forms of organisation can easily be implemented in software, for example, grouping of filenames which share a common string, grouping of names by full filename first and then by extension, and the placing of groups in decreasing order of size and importance with filenames which do not fit into any of the groups left to last.

    A routine was written which sorts filenames and groups together those which share a common string of characters. A number of tests were conducted to find out the optimum length of string two filenames should share in order to be classed as members of the same group. It was found that, for DOS-style filenames with up to eight characters before the extension, the best results are produced when a string of at least four characters is required for a match. Using shorter strings tends to group together filenames which share common letter sequences (for example, 'th' or 'ion') but have no functional relationship to one another. Using longer strings fails to group together some filenames which do have a functional relationship. The routine was written in such a way that the number of characters required for a match could be varied. This made it possible to perform an initial grouping exercise using four-character strings and then, if the resulting group exceeded a certain size, to sort again using more stringent criteria in order to divide the group into two or more sub-groups. Further routines were written to sort extensions into those containing alphabetical characters, those containing numerical characters, and those containing a mixture of the two.
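The grouping step described above, with its adjustable match length, could look something like the following Python sketch. It is not the routine from DOStalk: the function names are hypothetical, and the strategy of taking each ungrouped filename in turn as a seed and collecting everything that matches it is an assumption chosen for simplicity (so members of a group are guaranteed to match the seed, not necessarily one another).

```python
def longest_common_substring(a, b):
    """Length of the longest run of characters that occurs in both a and b."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # character of a and at b[j-1].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def stem(filename):
    """The part of a DOS-style filename before the extension."""
    return filename.partition(".")[0].upper()


def group_by_shared_string(filenames, min_len=4):
    """Group filenames whose stems share at least min_len characters."""
    groups, ungrouped = [], []
    remaining = list(filenames)
    while remaining:
        seed = remaining.pop(0)
        members = [seed] + [
            f for f in remaining
            if longest_common_substring(stem(seed), stem(f)) >= min_len
        ]
        if len(members) > 1:
            groups.append(members)
            remaining = [f for f in remaining if f not in members]
        else:
            ungrouped.append(seed)
    return groups, ungrouped


names = ["SORTING.C", "THESIS.DOC", "NEWSORT.EXE",
         "SORTING.EXE", "NEWSORT.C", "MAIL.NEW"]
print(group_by_shared_string(names, min_len=4))
print(group_by_shared_string(names, min_len=7))
```

Calling the function a second time with a larger `min_len` on an oversized group corresponds to the re-sorting with more stringent criteria described in the text.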

    These routines were combined to produce a file-sorting program which uses a three-stage process:

    1a Any filenames which share a string of four or more characters are placed in a group. For example, SORTING.C, SORTING.EXE, NEWSORT.C and NEWSORT.EXE would be placed in one group since they all share the string 'SORT'.
    1b Filenames grouped at stage (1a) are further sorted and any which share the same full filename (all the characters before the extension, up to eight) are placed into sub-groups. For example, SORTING.C and SORTING.EXE would be placed into one sub-group while NEWSORT.C and NEWSORT.EXE would be placed in another.
    1c Within each sub-group, the filenames are further sorted into those which have purely alphabetical extensions, those with numerical extensions and those with alpha-numeric extensions. These are placed into separate sub-sub-groups within which they are placed in order, first by length (i.e., a .C extension would come before a .EXE extension) and then alpha-numerically.

    Stage 1 is repeated until no more groups of filenames can be found which share the same string of four or more characters.

    2 If any filenames remain, they are sorted by extension. Filenames which share the same extension are placed into a group. They are then placed in order, first by length of filename and then alphanumerically.

    Stage 2 is repeated until no more groups of filenames can be found which share the same extension.

    3 Any filenames which were not grouped at (1) and (2) are sorted into two groups, one containing filenames with extensions and the other containing filenames without extensions. They are then placed in order, first by length of filename and then alphanumerically.

    If at any stage a group is formed which contains more than five filenames, it is split into two or more sub-groups of approximately equal size. When all the filenames have been grouped, a search is carried out to see if any of the groups contain just one or two filenames. If such groups are found, an attempt will be made to find another sub-group of the same parent group which also contains a small number of filenames and the two groups will, subject to certain restrictions, be recombined.

    Thus the effect of the sorting operation is to yield a set of groups, each of which contains filenames which share certain characteristics, and each of which holds not more than five filenames and (normally) not less than two.
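Two of the smaller steps in this process, the ordering of extensions at stage 1c and the splitting of oversized groups, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the three extension classes are merged into a single sort key rather than kept as separate sub-sub-groups, a file without an extension is arbitrarily placed with the alphabetical class, and the recombination of very small groups is omitted because its restrictions are not specified here.

```python
def extension_class(ext):
    """0 = alphabetical (or empty), 1 = numerical, 2 = mixed."""
    if not ext or ext.isalpha():
        return 0
    if ext.isdigit():
        return 1
    return 2


def extension_key(filename):
    """Sort key: class first, then length of extension, then the text itself."""
    ext = filename.partition(".")[2].upper()
    return (extension_class(ext), len(ext), ext)


def split_evenly(group, max_size=5):
    """Split a group larger than max_size into near-equal sub-groups."""
    if len(group) <= max_size:
        return [list(group)]
    parts = -(-len(group) // max_size)  # ceiling division
    base, extra = divmod(len(group), parts)
    result, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        result.append(list(group[start:start + size]))
        start += size
    return result


files = ["SORTING.EXE", "SORTING.C", "SORTING.001", "SORTING.V2", "SORTING.H"]
print(sorted(files, key=extension_key))
print(split_evenly(list(range(7))))
```

With a limit of five, a group of seven is divided into sub-groups of four and three rather than five and two, which keeps the sub-groups of approximately equal size.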

    Having completed this operation on the files in a directory, the program will repeat the process with any sub-directories it finds.

    User Interface

    Once the program has sorted the files and the sub-directories in the manner described above, it can provide information to the user in various forms. The information can be viewed at four different levels which are accessed using the cursor keys. Movement up and down between the different levels is achieved using the CURSOR UP and CURSOR DOWN keys, while horizontal movement at each level is achieved using the CURSOR LEFT and CURSOR RIGHT keys.

    At each level the information 'wraps around', so using the CURSOR LEFT or CURSOR RIGHT keys to move repeatedly in either direction will cause the program to display all the information available at that level and then return to the starting point.
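The wrap-around behaviour amounts to stepping through the items at a level with modular arithmetic. A minimal Python sketch follows; the class name `LevelCursor` and its interface are hypothetical and are not taken from DOStalk.

```python
class LevelCursor:
    """Tracks the current position among the items at one level."""

    def __init__(self, items):
        if not items:
            raise ValueError("a level must contain at least one item")
        self.items = list(items)
        self.index = 0

    def move(self, step):
        """Move right (step=+1) or left (step=-1), wrapping at either end."""
        self.index = (self.index + step) % len(self.items)
        return self.items[self.index]


cursor = LevelCursor(["SORT", "THESIS", "MAIL"])
print(cursor.move(+1), cursor.move(+1), cursor.move(+1))  # wraps back to the start
print(cursor.move(-1))  # wraps from the first item to the last
```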

    An important feature of the improved directory-listing command is that the sorted information it contains is structured hierarchically. At the top level the user can choose to explore either the files or sub-directories within the current directory. Having chosen one of these paths, the user can then explore the information at a high level and, if a group of interest is found, move rapidly down the hierarchy until an individual file or sub-directory within that group is located. Pressing any of the CURSOR keys immediately terminates any operations being performed at the current level and mutes any associated speech.
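
    The navigation scheme can be pictured as a walk over a tree in which the left and right keys cycle through siblings and the up and down keys change level. The sketch below is a hypothetical model of that behaviour, not the program's own code; the class names Node and Navigator and the key labels are assumptions, and the muting of speech on a key press is not modelled.

```python
class Node:
    """One item in the hierarchy: a label plus any items beneath it."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


class Navigator:
    """Tracks the user's position; the root's children form the top level."""

    def __init__(self, root):
        # Stack of (siblings, index) pairs, one entry per level entered so far.
        self.path = [(root.children, 0)]

    def current(self):
        siblings, index = self.path[-1]
        return siblings[index]

    def press(self, key):
        siblings, index = self.path[-1]
        if key == "RIGHT":
            # Horizontal movement wraps around at either end of the level.
            self.path[-1] = (siblings, (index + 1) % len(siblings))
        elif key == "LEFT":
            self.path[-1] = (siblings, (index - 1) % len(siblings))
        elif key == "DOWN" and self.current().children:
            self.path.append((self.current().children, 0))
        elif key == "UP" and len(self.path) > 1:
            self.path.pop()
        return self.current().label
```

    Keeping the path as a stack means that moving up returns the user to the item from which they descended, rather than to the start of the level above.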

    Each time the user moves up or down a level, a brief non-speech tone sequence is used to indicate the new level. Each level has a unique pitch assigned to it, the four pitches being:

    Top Level: C above middle C
    Second Level: G above middle C
    Third Level: E above middle C
    Bottom Level: Middle C

    For movements upwards through the four information levels, middle 'C' is sounded first for 100ms, followed by the appropriate tone for the new level, also for 100ms. Thus the user hears a pair of tones which ascend in pitch. For movements downwards through the four information levels, 'C' above middle 'C' is sounded first followed by the appropriate tone for the new level. Thus the user hears a pair of tones which descend in pitch. Again, both tones sound for 100ms.
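
    The tone pairs described above can be expressed as a small lookup. The following is a sketch under stated assumptions: the text names only the notes, so the frequencies are the standard equal-tempered values (A above middle C = 440 Hz), and the function name and the numbering of levels from 0 (top) to 3 (bottom) are illustrative.

```python
# Equal-tempered frequencies in Hz; assumed, since the text names only the notes.
MIDDLE_C = 261.63
LEVEL_PITCH = {
    0: 523.25,    # top level: C above middle C
    1: 392.00,    # second level: G above middle C
    2: 329.63,    # third level: E above middle C
    3: MIDDLE_C,  # bottom level: middle C
}
TONE_MS = 100  # each tone sounds for 100 ms


def level_change_tones(old_level, new_level):
    """Return the (frequency, duration) pair sounded when the level changes."""
    if new_level == old_level:
        return []
    moving_up = new_level < old_level  # level 0 is the top of the hierarchy
    # Upward moves start on middle C; downward moves start on the C above it.
    lead = MIDDLE_C if moving_up else LEVEL_PITCH[0]
    return [(lead, TONE_MS), (LEVEL_PITCH[new_level], TONE_MS)]
```

    For example, moving up from the bottom level to the third level gives middle C followed by E, an ascending pair, while moving down from the top level to the second level gives the upper C followed by G, a descending pair.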

    Pressing F1 at any time whilst using the program causes it to repeat the last item spoken, and pressing the SPACE key mutes any speech. The ESCAPE key exits the directory-listing command and returns the user to the DOStalk prompt.

    Speech Outputs

    The improved directory-listing command structures its speech output in different ways at each level.

    At the top level, the command merely produces the statements "n files" and "n sub-directories", where n is the number of items found. These are spoken in full with the number identified as new information through pitch prominence.

    At the second level, the matching strings which identify the groups are presented. These strings are spoken in their entirety, using a different voice to the one used at the other levels and presenting the speech at a higher average pitch. This is done to emphasise the fact that these strings merely identify groups and are not themselves file or sub-directory names. Since the groups are normally listed a few at a time, they are given an overall intonation curve which rises sharply during the first string and falls gradually as each successive string is spoken. The pausing is arranged so that successive strings are spoken at brisk but regular intervals, thus giving them a rhythm in which each string falls on a strong beat.

    At the third level, file or sub-directory names are presented in groups. The way in which the names are spoken depends upon the nature of the group.

  • Groups of filenames which share the same complete filename (before the extension) are presented with the first filename spoken in full but with only the extensions of subsequent filenames spoken, for example,

    es - say dot
    old, dot
    new, dot
    doc

    The filename and the individual extensions are given pitch prominence while the spoken words 'dot' are not. The pausing is arranged so that successive extensions are spoken at regular intervals, thus placing them on strong beats of the rhythm.

  • Groups of filenames which share the same string of four or more letters are pronounced in full with the new information (a) marked by pitch prominence and (b) appearing at regular intervals so as to form a distinct rhythm, for example,

    es - say dot doc,
    es - say one dot doc,
    new es - say dot doc,

  • Groups of filenames which share the same extension but have different filenames are spoken in full with the filenames marked by pitch prominence and placed on the strong beats of the rhythm, for example,

    es - say dot doc,
    fig - ures dot doc,
    pic - tures dot doc,

  • Groups of filenames which share no features at all are spoken in full with both the filenames and the extensions marked by mild pitch prominence and placed on the strong beats of the rhythm. Thus only the 'dots' are presented as given information.

    In addition to signalling the positions of new and given information within the filenames, intonation is used in all four types of group to link the items together. The pitch rises sharply at the beginning of the group and then declines gradually before falling quite sharply again at the end. This follows the pattern discussed in Section 4.
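
    The distinction between new and given information in three of the group types can be illustrated by marking each spoken word as prominent or not. The sketch below is hypothetical: the function name, the group labels and the (word, prominent) representation are assumptions, every name is assumed to contain a dot, and groups sharing a string of four or more letters are left out because they require marking within the filename itself. Timing and the overall pitch contour are not modelled.

```python
def mark_group(names, kind):
    """Return (word, prominent) pairs for one group of 'name.ext' filenames."""
    tokens = []
    for index, name in enumerate(names):
        stem, _, ext = name.rpartition(".")
        if kind == "same_stem":
            if index == 0:
                tokens.append((stem, True))  # shared name spoken once, in full
            tokens += [("dot", False), (ext, True)]
        elif kind == "same_extension":
            # Only the differing filenames carry pitch prominence.
            tokens += [(stem, True), ("dot", False), (ext, False)]
        elif kind == "unrelated":
            # Everything except the 'dots' is new information.
            tokens += [(stem, True), ("dot", False), (ext, True)]
        else:
            raise ValueError(f"unknown group kind: {kind}")
    return tokens
```

    Applied to essay.old, essay.new and essay.doc as a "same_stem" group, this yields "essay dot old, dot new, dot doc", with every word except "dot" marked as prominent.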



    Evaluation of the Improved Directory-Listing Command

    In order to evaluate the improved directory-listing command, a group of subjects were asked to perform a task involving file identification, copying and deleting. The task was performed using both the improved directory-listing command operating within the DOStalk shell, and the DOS operating system accessed through a Hal screen-reader. Hal incorporates facilities to scan lists and other textual information sequentially or by means of search strings, but it does not attempt to reorganise the information provided by the operating system and it does not provide prosodic cues which reflect information content.

    Experimental Design

    Subjects were presented with a directory structure which comprised a main directory and three sub-directories. The sub-directories contained files with four different filenames and with extensions which were either .OBJ, .EXE, or a two-digit number (for example, .01).

    Subjects were told that the sub-directories contained the files associated with a programming project. It was explained that the four different filenames represented four different code modules within the project, and that the files with numerical extensions were different versions of the source code. Their task was to find the most recent (i.e., highest numbered) version of the source code for each module and copy it into a new directory.

    Eight subjects took part in the evaluation. All were undergraduate students with experience of DOS. A within-subjects design was used because it was felt that this would be more sensitive to differences between the two conditions, particularly when using a comparatively small number of subjects. Thus each subject undertook the task once using DOS/Hal and once using DOStalk and the improved directory-listing command. Two different sets of files and sub-directories were used so that subjects would not be able to use knowledge gained under one condition when performing under the other condition. The two file sets contained the same number of sub-directories and the same overall number of files, but the names used for the sub-directories and files were different and the number of files in each sub-directory and the balance between numerical and other extensions differed. Each file set was used with DOStalk for 50% of the trials and with DOS/Hal for the other 50% to ensure that any differences in the inherent memorability of the filenames did not bias the results. The order in which subjects undertook the two conditions was also varied so that 50% of the subjects used DOS/Hal followed by DOStalk while the other 50% undertook the two conditions in the reverse order.
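
    One way to produce an allocation that satisfies both 50% constraints is to cross the two factors and cycle through the resulting four combinations. The sketch below is an assumption about how such a plan could be generated, not a record of the allocation actually used: the text does not say whether order and file set were fully crossed, and the file-set labels "A" and "B" and the function name are hypothetical.

```python
from itertools import product

SYSTEMS = ("DOS/Hal", "DOStalk")
FILE_SETS = ("A", "B")  # hypothetical labels for the two file sets


def counterbalanced_plan(n_subjects=8):
    """Give each subject an ordered pair of (system, file set) trials."""
    # Four cells: which system comes first x which file set DOStalk receives.
    cells = list(product(SYSTEMS, FILE_SETS))
    plan = []
    for subject in range(n_subjects):
        first, dostalk_set = cells[subject % len(cells)]
        second = SYSTEMS[1 - SYSTEMS.index(first)]
        hal_set = FILE_SETS[1 - FILE_SETS.index(dostalk_set)]
        sets = {"DOStalk": dostalk_set, "DOS/Hal": hal_set}
        plan.append([(first, sets[first]), (second, sets[second])])
    return plan
```

    With eight subjects each of the four combinations occurs twice, so half of the subjects start with each system and each file set is paired with each system in half of the trials.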

    Procedure

    Subjects sat at a table with the computer monitor and keyboard directly in front of them and two small loudspeakers positioned one on either side of the monitor.

    With the computer monitor turned on and with no accompanying speech, the task was demonstrated using a set of files and sub-directories similar to the two sets prepared for the evaluation but differing in detail from both of them. This continued until the experimenter was satisfied that the subject fully understood the task.

    The subject was then given a demonstration of Hal or DOStalk and the improved directory-listing command (depending upon which condition was to be undertaken first) and invited to practise using the dummy file set. A sheet containing a brief summary of the commands available in each system was provided. The computer monitor remained on for the initial part of this practice, but after a while the subject was asked to practise with the monitor switched off, using the sounds only. The subject was shown the volume control for the speech synthesiser and, when using DOStalk, the independent volume control for the non-speech sound card, and invited to set these to whatever levels were found most comfortable.

    When both the subject and the experimenter were satisfied that the subject had received sufficient practice using the system, the trial began. The experimenter set up the system to use the chosen set of files and sub-directories, checked that no information was visible on the monitor screen, and placed the 'crib-sheet' of commands in a prominent position. The subject was then invited to begin the task. The experimenter began timing the trial from the moment the subject pressed the first key.

    During the timed period the experimenter remained on hand to provide help if the subject encountered difficulties. Any help provided was confined to explaining the operation of commands or suggesting steps that might be taken to resolve a difficulty. At no time did the experimenter convey or confirm information on the state of the system, the contents of the sub-directories, or any other matter relating to the data being manipulated as part of the task.

    When, in the judgement of the subject, the task had been completed, the experimenter noted the elapsed time and saved details of the final system state.

    The experimenter then asked the subject to complete a TLX questionnaire. After this the subject was allowed a short break, if desired, before moving on to the second condition.

    When the subject felt ready, the experimenter ran the evaluation again under the second condition. This was undertaken in the same way as for the first condition. The subject was shown the system, allowed to practise using it, then asked to undertake the timed trial. Upon completion, the subject was asked to complete a TLX questionnaire, recording subjective impressions of the task under the same headings as before.

    Finally, the subject was asked to indicate a degree of preference for one or other system by placing a mark on a scale which ran from -10 to +10, with 0 in the middle. Scores on the left side of the scale indicated a preference for DOS adapted through Hal while scores on the right side of the scale indicated a preference for DOStalk and the improved directory-listing command. A score of zero indicated no preference for either system.

    Results

    The principal findings from this study are shown in Table 6. It can be seen that subjects performed the task significantly faster when using DOStalk and the improved directory-listing command than when using the standard DOS shell adapted through Hal. They also expressed a strong preference for DOStalk.

                                                 DOStalk    DOS/Hal
    Average task-completion time (in minutes)    10.39      24.12
    Overall Preference                           7.8

    Table 6 Task completion times, error rates and overall preference scores obtained from the comparison of DOStalk against DOS/Hal.

    All the subjects performed the task faster using DOStalk and the improved directory-listing command than using DOS/Hal, and all indicated an overall preference for DOStalk over DOS/Hal. A majority of the subjects gave DOStalk an overall rating of 8, 9 or 10, although a few gave lower ratings and this brought the average figure down slightly. The detailed results are shown in [Pit96].

    The subjects' preference for DOStalk and the improved directory-listing command is reflected in the TLX scores (Table 7) which show a very clear trend. Subjects reported less frustration, less time pressure and less effort expended when using DOStalk, and also gave themselves higher performance ratings. In some cases, for example pressure of time, the difference between the scores for DOStalk and DOS/Hal exceeded 2:1. All of these results are significant. The only exception is mental demand, which was scored low under both conditions and showed no significant difference. The detailed results are shown in [Pit96].

    TLX Parameter    DOStalk    DOS/Hal    Averages
    Mental Demand    1.8        2.4        2.1
    Effort           4.0        7.2        5.6
    Time Pressure    3.9        8.2        6.1
    Frustration      6.8        8.7        7.8
    Performance      7.8        5.3        6.6
    Averages         4.9        6.4

    Table 7 TLX scores recorded by subjects taking part in the evaluation of the Improved Directory-Listing Program.
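
    The column averages in Table 7, and the 2:1 difference noted for time pressure, can be checked directly from the printed per-parameter scores. The short calculation below is only an arithmetic check on the published figures; the variable names are illustrative, and it says nothing about statistical significance.

```python
# (DOStalk, DOS/Hal) scores as printed in Table 7.
TLX = {
    "Mental Demand": (1.8, 2.4),
    "Effort": (4.0, 7.2),
    "Time Pressure": (3.9, 8.2),
    "Frustration": (6.8, 8.7),
    "Performance": (7.8, 5.3),
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


dostalk_mean = mean(d for d, _ in TLX.values())   # about 4.86, printed as 4.9
doshal_mean = mean(h for _, h in TLX.values())    # about 6.36, printed as 6.4

# Ratio of DOS/Hal to DOStalk scores for time pressure (about 2.1).
time_pressure_ratio = TLX["Time Pressure"][1] / TLX["Time Pressure"][0]
```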

    Discussion

    A few points need to be made regarding these findings. The first is that Hal is an adaptation, designed to be used in conjunction with a wide range of different software. It cannot be expected to correct all the ills of a bad interface in the same way a one-off, purpose-designed program can. Similarly, DOS itself was never intended to be used by blind people. Thus it is not surprising that a program designed from the outset with speech interaction in mind should perform better than an adaptation.

    Another factor which should be taken into account is the subjects' lack of familiarity with synthetic speech. During the evaluation, subjects often ran into problems because they could not understand the synthetic speech and hence had difficulty spelling filenames they wished to copy. This problem affected the DOS/Hal combination far more than DOStalk and the improved directory-listing command because, as described earlier, the directory-listing command allows users to select filenames and then make them the subject of another operation.

    It should be noted that the task-completion times recorded in this experiment are very long compared with those which might have been expected had the task been carried out using visual feedback. An experienced computer user faced with the same task in a visual environment would probably have taken at most a minute to identify the four most recent file versions and copy them to a new directory. The much longer task-completion times recorded here reflect both the difficulty of using speech feedback and the subjects' lack of familiarity with synthetic speech interfaces in general and with these interfaces in particular. Unfortunately, it is not possible on the basis of these findings to assess the relative contribution of these two factors, and therefore it is not possible to estimate how much more quickly the task might be completed by an experienced user.



    Conclusions

    In spite of the caveats raised in the discussion, there is strong evidence that the improved directory-listing command was both easier and more pleasant to use than the standard DIR command made accessible through the DOS/Hal combination. It is not possible without carrying out more experiments to determine to what extent this is due to the organisation of files into groups, the use of prosody and other features of natural speech, or the inclusion of other features such as the ability to select files and groups for use in subsequent operations. However, the magnitude of the speed improvement and the clear trend observed in the TLX scores suggest that, taken together, these features can significantly improve the usability of a speech-based interface.

    In its present form, the approach used here is applicable only to DOS-style filenames. If this work is to be of practical value, some means must be found of extending the technique to embrace other types of text-string. Some elements of the present approach can be used with little modification, for example, the use of intonation to distinguish between unique and repeated text elements. Many lists include items which have shared words or syllables which could be distinguished in this way. However, the grouping rules would need to be vastly more flexible in order to cope with other types of list. As was noted earlier, some of the strategies used by subjects in this study would be very difficult to emulate in software. However, this does not necessarily mean that all the rules used in such instances would be difficult to encode. It would be interesting to conduct a larger study in an attempt to find what other grouping methods subjects employ, with the aim of identifying frequently-used strategies.



    Acknowledgements

    The work described in this paper was supported by the UK Engineering & Physical Sciences Research Council under CASE Studentship 92567838 and also by British Telecom.



    Bibliography

    ['tHa73]
    't Hart, J. & Cohen, A. (1973) "Intonation by Rule: A Perceptual Quest", Journal of Phonetics, 1, 309-327.
    [Bar32]
    Bartlett, F.C. (1932) Remembering, Cambridge University Press.
    [Bow69]
    Bower, G.H., Clark, M.C., Lesgold, A.M. & Winzenz, D. (1969) "Hierarchical Retrieval Schemes in Recall of Categorized Word Lists", Journal of Verbal Learning and Verbal Behaviour, 8, 323-343.
    [Bro78]
    Broadbent, D.E., Cooper, P.J. & Broadbent, M.H. (1978) "A Comparison of Hierarchical Matrix Retrieval Schemes in Recall", Journal of Experimental Psychology: Human Learning and Memory, 4, 486-497.
    [Hal63]
    Halliday, M.A.K. (1963) "The Tones of English", Archives of Linguistics, 15, 1-28.
    [Hal67a]
    Halliday, M.A.K. (1967a) "Notes on Transitivity and Theme in English, Part 2", Journal of Linguistics, 3, 199-244.
    [Hal67b]
    Halliday, M.A.K. (1967b) Intonation and Grammar in British English, Mouton, The Hague.
    [Hal70]
    Halliday, M.A.K. (1970) A Course in Spoken English: Intonation, Oxford University Press, Oxford.
    [Har88]
    Hart, S. & Staveland, L.E. (1988) "Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research", Human Mental Workload, Hancock, P.A. & Meshkati, N. (eds.), Elsevier Science, 139-183.
    [Jen52]
    Jenkins, J.J. & Russell, W.A. (1952) "Associative Clustering during Recall", Journal of Abnormal and Social Psychology, 47, 818-821.
    [Nak78]
    Nakatani, L.H. & Schaffer, J. (1978) "Hearing Words without Words: Prosodic Cues for Word Perception", Journal of the Acoustical Society of America, 63, 234-244.
    [Pit96]
    Pitt, I.J. (1996) The Principled Design of Speech-Based Interfaces, D.Phil Thesis, Department of Computer Science, University of York, Heslington, York, UK Y01 5DD.
    [Str78]
    Streeter, L. (1978) "Acoustic Determinants of Phrase Boundary Perception", Journal of the Acoustical Society of America, 64, 1582-1592.
    [Tul66]
    Tulving, E. & Pearlstone, Z. (1966) "Availability versus Accessibility of Information in Memory for Words", Journal of Verbal Learning and Verbal Behaviour, 5, 381-391.
    CDP
    The CDP workstation is manufactured by the Composers' Desktop Project, 11 Kilburn Road, York Y01 4DF, UK
    Hal/Apollo
    The Hal screen-reader and the Apollo range of speech-synthesisers are manufactured by Dolphin Systems for People with Disabilities, Unit 96c, Black Pole Trading Estate West, Worcester WR3 8TU