Hi everyone,
I’m developing an iOS (Swift) pronunciation app for deaf users learning Korean, and I need to capture the actual phonetic pronunciation as text.
The Problem
In Korean, the written form often differs from the actual pronunciation because of phonological rules such as liaison, tensification, and nasalization.
Example:
- Written: “목요일” (Thursday)
- Actual pronunciation: [모교일] (liaison: the final ㄱ of 목 carries over to the following syllable, so 목요 → 모교)
- What I need: “모교일” (phonetic text)
- What all STT outputs: “목요일” (standard orthography)
Another example:
- Written: “물고기” (fish)
- Actual pronunciation: [물꼬기]
- What I need: “물꼬기”
- What STT outputs: “물고기”
Every STT system I’ve tried outputs standard orthography, not phonetic transcription. Deaf users learning pronunciation need to see exactly how words sound (e.g., “모교일”), not the standard spelling (“목요일”).
What I’ve Tried
1. Apple Speech Framework (iOS native)
- Result: Returns standard orthography only (“목요일”)
- Provides confidence scores but not phonetic output
- No option for phonetic transcription
- Swift code tested - limited to standard spelling (see the minimal sketch after this list)
2. Wav2Vec2 (kresnik/wav2vec2-large-xlsr-korean) - Python test
- Result: Extremely poor accuracy, unusable
- Test case: Clear audio of “목요일 목요일”
- Output: “목표 일 목서위 다” (complete gibberish)
- Accuracy too low for production
- Haven’t attempted Core ML conversion
3. Text-to-Phonetic converters (g2pK, etc.)
- Limitation: These convert text → phonetic (목요일 → 모교일)
- I need speech → phonetic (audio → 모교일)
- Requires accurate speech recognition first
4. Forced Alignment
- Limitation: Requires ground truth text
- Users are practicing - I don’t know what they’ll say
- Not suitable for real-time feedback
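For reference, my Apple Speech test looked roughly like the minimal sketch below (the helper name is mine, and I'm assuming speech-recognition authorization has already been granted). However the word is pronounced, the result comes back in standard orthography such as “목요일”.

```swift
import Speech

/// Transcribe a recorded audio file with Apple's Speech framework (ko-KR locale).
/// The result is always standard Korean orthography, e.g. "목요일", never "모교일".
func transcribe(fileURL: URL, completion: @escaping (String?) -> Void) {
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ko-KR")),
          recognizer.isAvailable else {
        completion(nil)
        return
    }

    let request = SFSpeechURLRecognitionRequest(url: fileURL)
    request.shouldReportPartialResults = false

    recognizer.recognitionTask(with: request) { result, error in
        guard let result = result, result.isFinal else {
            if error != nil { completion(nil) }
            return
        }
        // Segments expose per-word confidence, but no phonetic representation.
        for segment in result.bestTranscription.segments {
            print(segment.substring, segment.confidence)
        }
        completion(result.bestTranscription.formattedString)  // "목요일"
    }
}
```

As far as I can tell, bestTranscription only ever contains standard spelling; there is no phonetic alternative anywhere in the API.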
Requirements
- Platform: iOS app (Swift/SwiftUI)
- Deployment: On-device preferred (Core ML), server-side acceptable
- Input: Audio from AVAudioRecorder
- Desired output: Phonetic Korean text representing actual sounds
- “목요일” → “모교일”
- “물고기” → “물꼬기”
- “밥먹다” → “밤먹다”
- Language: Korean phonological rules essential
- Use case: Deaf users need to see how words actually sound, not standard spelling
My Questions
- Is it possible to get phonetic transcription (not standard orthography) from speech on iOS?
- Can Wav2Vec2 or similar models output phonetic text instead of standard spelling? Can this be converted to Core ML?
- Are there Korean-specific ASR models trained to output phonetic transcription rather than standard orthography?
- Hybrid approach? Could I combine the following (rough sketch of this pipeline below the questions):
  - Standard STT (Apple Speech) → “목요일”
  - Text-to-phonetic converter (g2pK) → “모교일”
  - But how would I handle actual mispronunciations?
- Is this fundamentally impossible? Do all modern ASR systems inherently output standard orthography?
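To make the hybrid question concrete, this is the kind of pipeline I’m imagining. It is purely a sketch: the /g2p endpoint and JSON shape are hypothetical, standing in for a small server that wraps g2pK, and transcribe(fileURL:) is the helper from the Apple Speech sketch above.

```swift
import Foundation

/// Hypothetical hybrid pipeline: standard STT first, then a server-side
/// grapheme-to-phoneme step (e.g. a small service wrapping g2pK).
/// NOTE: the URL and JSON shape below are made up for illustration.
func phoneticText(for audioURL: URL, completion: @escaping (String?) -> Void) {
    // Step 1: standard orthography from Apple Speech, e.g. "목요일".
    transcribe(fileURL: audioURL) { orthography in
        guard let orthography = orthography else { return completion(nil) }

        // Step 2: send the orthographic text to a hypothetical g2pK service,
        // which would return "모교일".
        var request = URLRequest(url: URL(string: "https://example.com/g2p")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try? JSONEncoder().encode(["text": orthography])

        URLSession.shared.dataTask(with: request) { data, _, _ in
            guard let data = data,
                  let response = try? JSONDecoder().decode([String: String].self, from: data)
            else { return completion(nil) }
            // Problem: this reflects the *intended* pronunciation of the
            // recognized word, not what the learner actually said.
            completion(response["phonetic"])
        }.resume()
    }
}
```

The obvious flaw is that step 1 has already normalized whatever the learner actually said into a dictionary word, so step 2 can only show the textbook pronunciation, never the learner’s own mispronunciation.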
iOS-Specific Constraints
- AVFoundation audio input (minimal recorder setup sketched below)
- Prefer Core ML for privacy/on-device
- Willing to use server API if necessary
- Voice data from deaf users is sensitive
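For context, the capture side is nothing special, roughly the following; the settings are just typical speech-capture values (16 kHz mono PCM), not a hard requirement.

```swift
import AVFoundation

/// Minimal AVAudioRecorder setup for capturing pronunciation practice audio.
/// 16 kHz mono linear PCM is a common choice for speech models; adjust as needed.
func makeRecorder(to fileURL: URL) throws -> AVAudioRecorder {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord, mode: .spokenAudio, options: [.defaultToSpeaker])
    try session.setActive(true)

    let settings: [String: Any] = [
        AVFormatIDKey: Int(kAudioFormatLinearPCM),
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 16,
        AVLinearPCMIsFloatKey: false
    ]

    let recorder = try AVAudioRecorder(url: fileURL, settings: settings)
    recorder.prepareToRecord()
    return recorder
}
```

Calling record() on the returned recorder starts capture, and the resulting file URL is what I feed into the transcription step.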
Additional Context
This is for accessibility. Deaf users learning Korean need to understand that “목요일” is pronounced “모교일”, not “목-요-일” (syllable by syllable).
Standard STT’s conversion to orthography is exactly what I need to avoid.
If phonetic transcription from speech is impossible, what are realistic alternatives for teaching pronunciation to deaf users?
Thank you for any insights!