Voice User Interface Design

Speech Augmentation Machine (SAM) is a conversational interface for children with Autism Spectrum Disorder (ASD). Children with ASD often struggle with basic conversation and with recognizing social norms. SAM's initial purpose is to replicate autism therapy in order to encourage appropriate conversation and provide companionship.

Over a 10-week student project, I worked to develop skills in conversational interfaces. After conducting primary and secondary research, I designed SAM with the help of a teammate. Check out the full design specifications.

SAM implements a version of ASD script therapy mediated by a voice user interface (VUI). In script therapy, a mediator such as a peer, parent, or therapist asks the child questions, and the child reads appropriate responses from a provided script. Eventually the script is faded out and the child answers without explicit direction. Scripts have been shown to be successful and generalizable to other social situations. Furthermore, machines have been shown to create successful social interactions in autistic children. We believe that a VUI is especially appropriate for an autistic child because of the machine’s non-judgemental and pragmatic nature.


  • The therapist loads conversational scripts and practice times into the web application (GUI)
  • SAM is placed in the child’s home and alerts the child when it’s time to practice
  • SAM poses questions to the child
  • The child replies to the question by reading from the script
  • SAM detects speech input via microphone
  • SAM processes the input to text
  • SAM determines whether the child’s answer matches the script
  • The Dialog Manager updates SAM’s state
  • SAM provides feedback and asks the next question
  • Continuous: SAM keeps track of correct answers and once it reaches a threshold it directs the child to put the script away
  • Continuous: SAM improves speech recognition via reinforcement learning
  • Continuous: SAM teaches itself to better detect and classify unscripted utterances as relevant/irrelevant, socially appropriate/inappropriate
  • Continuous: Data is collected and uploaded to the database
SAM - Early form prototype

Form Factor

We use the form factor of an analog (“old school”) child’s telephone to avoid the complications of a humanoid or animal form factor (see Figure 2). We do not attempt to replicate an in-person conversation because that would introduce its own complicating factors, such as body language. We use an artificial voice so that SAM does not seem deceitful, but several recorded responses serve the same function so that the conversation does not seem “canned”. According to a discussion with Shane Landry, Principal Design Lead at Microsoft Cortana, an artificial voice is an ethical design choice in line with current voice technologies.

Conversational Flow

Times for conversation practice are set via web app or by voice (“Set practice for Tuesdays and Thursdays for 20 minutes”). Once the child agrees to practice, SAM asks a question about a recent activity (“Did you swing at recess today?”), an activity that is occurring soon (“What are you doing on Friday?”), or an object in the environment.

Each time the child correctly answers a scripted question, the system adds to a tally until the tally reaches a success threshold. At this point, the child is asked to put the script away, and the system asks, without the script, questions that the child has previously answered correctly. The system then progressively introduces new questions.
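
The tallying and script-fading behavior described above could be sketched roughly as follows. This is only an illustration, not SAM's actual implementation: the class name ScriptFadeTracker, its method names, and the default threshold of 5 are all assumptions, since the design does not specify a threshold value.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptFadeTracker:
    """Tallies correct scripted answers and reports when the script can be faded.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    success_threshold: int = 5  # assumed value, not taken from the design
    correct_count: int = 0
    mastered_questions: list[str] = field(default_factory=list)

    def record_answer(self, question: str, correct: bool) -> None:
        # Only correct answers count toward the threshold.
        if not correct:
            return
        self.correct_count += 1
        if question not in self.mastered_questions:
            self.mastered_questions.append(question)

    @property
    def script_faded(self) -> bool:
        # True once enough correct answers have been tallied.
        return self.correct_count >= self.success_threshold

    def unscripted_questions(self) -> list[str]:
        # Before fading, nothing is asked without the script. After fading,
        # start from questions the child has already answered correctly.
        return list(self.mastered_questions) if self.script_faded else []
```

A session would call record_answer after each scripted exchange and check script_faded to decide when to ask the child to put the script away. Mixing in new questions over time is left out of this sketch.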


Spoken Language Understanding (SLU)

Scripted Responses
When SAM receives a scripted response, it does not need to perform spoken language understanding; it only needs to check that the response matches the expected response before passing it on to the Dialog Manager (DM).
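
One way the scripted-response check might look is sketched below. The design only says the response must match the expected one; the normalization step, the use of Python's difflib for a similarity ratio, and the 0.85 tolerance are assumptions added here to allow for small speech-to-text errors. The function names are hypothetical.

```python
import string
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def matches_script(transcript: str, expected: str, min_ratio: float = 0.85) -> bool:
    """Return True if the recognized text is close enough to the scripted line.

    min_ratio is an assumed tolerance for recognition errors, not a value
    taken from the SAM design.
    """
    ratio = SequenceMatcher(None, normalize(transcript), normalize(expected)).ratio()
    return ratio >= min_ratio
```

For example, matches_script("Yes, I swung at recess!", "yes I swung at recess") is true because both strings normalize to the same text, while an off-script reply such as "I like trains" falls below the tolerance.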

Unscripted Responses
For unscripted responses, SAM will use semantic analysis to check that the response is related to the prompt. If it is, the response then moves on to sentiment analysis. If not, it moves directly to the Dialog Manager. See Figure 5.
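
The routing decision for unscripted responses can be illustrated with a small sketch. The relatedness check here is a deliberately crude stand-in (shared content words between prompt and response), not the semantic analysis SAM would actually use; the stopword list, function names, and the NextStage enum are all hypothetical.

```python
from enum import Enum


class NextStage(Enum):
    SENTIMENT_ANALYSIS = "sentiment_analysis"
    DIALOG_MANAGER = "dialog_manager"


# Tiny assumed stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "i", "you", "am", "is", "are", "on", "at",
             "to", "what", "did", "do", "it", "and"}


def content_words(text: str) -> set[str]:
    """Lowercased words with surrounding punctuation and stopwords removed."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    words.discard("")  # a token made only of punctuation strips to empty
    return words - STOPWORDS


def is_related(prompt: str, response: str) -> bool:
    # Stand-in for semantic analysis: any shared content word counts.
    return bool(content_words(prompt) & content_words(response))


def route_unscripted(prompt: str, response: str) -> NextStage:
    """Related responses go to sentiment analysis; others go to the Dialog Manager."""
    if is_related(prompt, response):
        return NextStage.SENTIMENT_ANALYSIS
    return NextStage.DIALOG_MANAGER
```

A real system would likely replace is_related with a proper semantic similarity measure; the point of the sketch is only the two-way branch that follows it.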