Dementia is hard to diagnose, and there is no known cure, perhaps because it is already too late by the time the diagnosis is confirmed. The development of dementia is understood to begin 10 to 15 years before symptoms first appear. Language, especially spontaneous speech, is a promising biomarker for diagnosing dementia and other cognitive disorders.
The model will be trained on English audio.
As far as I know, wav2vec2 is the best candidate.
There is a dementia classification dataset from Dementia Bank. However, it is a binary classification dataset: dementia versus no dementia.
To predict dementia 10 to 15 years before the onset of symptoms, we would need longitudinal data on individuals who go on to develop dementia. The only such dataset I know of is the Framingham Heart Study, and it is text only.
I have been building a list of public figures diagnosed with dementia and scraping videos from YouTube into different categories: after symptoms, two years before symptoms, five years before symptoms, etc. Over the next week, I will build a Streamlit app to extract 8 to 10 second audio clips of the person of interest from each video.
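The bucketing and clipping steps above could be sketched roughly as follows. This is only an illustration under my own assumptions: the function names `label_stage` and `clip_windows`, the label strings, and the 10 and 15 year boundaries are hypothetical (only the two-year and five-year categories and the 8 to 10 second clip length come from the notes above). It assumes the symptom onset date and the time span in which the person speaks are already known for each video.

```python
from datetime import date
from typing import List, Tuple

# Hypothetical bucket limits, in years before symptom onset.
# 2 and 5 come from the categories above; 10 and 15 are assumed.
STAGE_LIMITS_YEARS = (2, 5, 10, 15)


def label_stage(video_date: date, onset_date: date) -> str:
    """Assign a video to a category relative to symptom onset."""
    years_before = (onset_date - video_date).days / 365.25
    if years_before <= 0:
        return "after_symptoms"
    for limit in STAGE_LIMITS_YEARS:
        if years_before <= limit:
            return f"within_{limit}_years_before_symptoms"
    return "more_than_15_years_before_symptoms"


def clip_windows(start_s: float, end_s: float,
                 clip_s: float = 10.0,
                 min_s: float = 8.0) -> List[Tuple[float, float]]:
    """Split a speech span into consecutive clips of min_s to clip_s seconds.

    A trailing remainder shorter than min_s is dropped.
    """
    windows = []
    t = start_s
    while end_s - t >= min_s:
        stop = min(t + clip_s, end_s)
        windows.append((t, stop))
        t = stop
    return windows
```

For example, a 19 second span starting at 0 would yield one 10 second clip and one 9 second clip. The resulting windows could then be passed to whatever audio extraction tool the app ends up using.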
Maybe we can use data from Dementia Bank or other sources for the no-dementia class?
Possible links to publicly available datasets include:
- Dementia Bank
- Google Sheet with a list of public figures with dementia and YouTube URLs
The dataset may be too small and too noisy.
There is no dataset over time for the no-dementia class.
A proof-of-concept Streamlit app showing that this works?
The following links can be helpful to better understand the project and
what has previously been done.
- IBM efforts using Framingham data for dementia prediction