Computer vision-based tracking and feature extraction for lingual ultrasound




Al-Hammuri, Khalid




Lingual ultrasound is emerging as an important tool for providing visual feedback to second language learners. In this study, ultrasound videos were recorded in the sagittal plane because it images the full tongue surface in one scan, unlike the transverse plane, which captures only a small portion of the tongue per scan. The data were collected from five Arabic speakers as they pronounced fourteen Arabic sounds in three different vowel contexts. Each sound was repeated three times, yielding 630 ultrasound videos (5 speakers x 14 sounds x 3 vowel contexts x 3 repetitions). The thesis algorithm consists of four steps. First, the ultrasound image is denoised using a combined curvelet transform and shock filter. Second, the tongue contour area is selected automatically. Third, the tongue contour is approximated and missing data are estimated. Fourth, the tongue contour is transformed from image space into a full concatenated signal, from which features are extracted. The automatic tongue tracking was validated by measuring the mean sum of distances between the automatically and manually tracked tongue contours, which gave an error of 0.9558 mm. The validation of the feature extraction showed that the average mean squared error between the tongue signatures extracted from different repetitions of the same sound was 0.000858 mm, which means that the algorithm could extract a unique signature for each sound, across different vowel contexts, with a high degree of similarity between repetitions. Unlike related work, the algorithm provides an efficient and robust approach that extracts the tongue contour and the significant features of dynamic tongue movement from all video frames, not just from a single significant static frame as in the conventional method. The algorithm did not need any training data and had no limitation on video size or number of frames. It did not fail during tongue extraction and did not need any manual re-initialization. 
Even when the ultrasound recordings missed some tongue contour information, the thesis approach could estimate the missing data with a high degree of accuracy. The approach can help linguistic researchers replace manual tongue tracking with automated tracking, which saves time, and extract dynamic features of the full speech behavior. This gives a better understanding of tongue movement during speech and supports the development of language learning tools for second language learners.
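
The two validation measures named above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the commonly used symmetric nearest-point definition of the mean sum of distances (the abstract does not state which variant was used), and the function names and the mm_per_pixel scaling parameter are hypothetical, introduced only for illustration.

```python
import numpy as np

def mean_sum_of_distances(u, v, mm_per_pixel=1.0):
    """Symmetric mean nearest-point distance between two contours.

    u, v: arrays of shape (n, 2) and (m, 2) holding (x, y) contour points.
    mm_per_pixel: hypothetical scale factor to convert pixels to millimetres.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Pairwise Euclidean distances between every point of u and every point of v.
    d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=2)
    # For each point, distance to the nearest point on the other contour,
    # summed in both directions and normalised by the total point count.
    total = d.min(axis=1).sum() + d.min(axis=0).sum()
    return mm_per_pixel * total / (len(u) + len(v))

def signature_mse(a, b):
    """Mean squared error between two equal-length tongue signatures."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("signatures must have the same shape")
    return float(np.mean((a - b) ** 2))

if __name__ == "__main__":
    # Toy data (not from the thesis): a manual contour and an automatic
    # contour offset by one unit in y.
    manual = [[0, 0], [1, 0], [2, 0]]
    automatic = [[0, 1], [1, 1], [2, 1]]
    print(mean_sum_of_distances(manual, automatic))
    print(signature_mse([0, 1, 2], [0, 1, 4]))
```

A lower value of either measure indicates closer agreement: between automatic and manual contours for the first, and between signatures of repeated productions of the same sound for the second.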



computer vision, lingual ultrasound, tracking, feature extraction, tongue