AI/ML-Based AI Tutor
Voice Input and AI-Powered Speaking Practice
English 4 Everyone (En4Eo) is a mobile app for non-native English speakers who want to practice speaking. The user picks a topic (Job Interview, Small Talk, Sports, etc.), gets a vocabulary warm-up and useful sentence patterns, then has a back-and-forth voice conversation with an AI tutor. Progress, streaks, and total speaking time are tracked daily so a learner can build a habit.
The screenshots below were captured from the production output build (screenshoot/output/iphone_67/) and show the core surfaces of the app.
The objective was to build an AI tutor that gives a learner the kind of low-pressure speaking practice they would normally only get from a one-on-one human teacher, but available on a phone, on demand, and at near-zero marginal cost per session.
1. Define the conversation loop
I started by mapping the loop a learner goes through every session: tap a topic → see vocabulary and patterns → tap mic and speak → AI replies in voice → repeat → end with a summary. Every box in that loop became a service responsibility.
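One way to make that loop explicit is a small state machine. The sketch below is illustrative only: the step and event names are hypothetical labels for the boxes described above, not identifiers from the app.

```typescript
// Hypothetical labels for the boxes in the session loop.
type Step = "pickTopic" | "warmUp" | "speaking" | "aiReplying" | "summary";
type LoopEvent =
  | "topicSelected"
  | "micTapped"
  | "utteranceFinished"
  | "replyPlayed"
  | "sessionEnded";

// Allowed transitions per step; any event not listed leaves the step unchanged.
const transitions: Record<Step, Partial<Record<LoopEvent, Step>>> = {
  pickTopic: { topicSelected: "warmUp" },
  warmUp: { micTapped: "speaking" },
  speaking: { utteranceFinished: "aiReplying", sessionEnded: "summary" },
  aiReplying: { replyPlayed: "speaking", sessionEnded: "summary" },
  summary: {},
};

// Returns the next step, or the current one if the event is not valid here.
function nextStep(current: Step, event: LoopEvent): Step {
  return transitions[current][event] ?? current;
}
```

Each transition then maps naturally onto one service call (speech recognition on `utteranceFinished`, voice synthesis before `replyPlayed`, and so on).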
2. Build the language brain on Amazon Bedrock
A shared BedrockService wraps Amazon Bedrock with the default model amazon.nova-micro-v1:0, retry-with-backoff on ThrottlingException, ServiceQuotaExceededException, ModelTimeoutException, and ModelErrorException, plus token accounting (inputTokens/outputTokens) so every call can be billed back to the learner's daily allowance. Bedrock also drives offline generation Lambdas: TopicVocabGeneratorLambda, TopicPatternsGeneratorLambda, and TopicImageGeneratorLambda.
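The retry and token-accounting behavior can be sketched without the SDK by injecting the model call as a function. This is a minimal sketch under stated assumptions: the helper names, the delay values, the attempt limit, and the allowance arithmetic are hypothetical; only the four retryable error names and the inputTokens/outputTokens fields come from the description above.

```typescript
// Error names the text lists as retryable.
const RETRYABLE_ERRORS = new Set<string>([
  "ThrottlingException",
  "ServiceQuotaExceededException",
  "ModelTimeoutException",
  "ModelErrorException",
]);

// An error is retried only if its name is one of the listed exceptions.
function isRetryable(error: unknown): boolean {
  return error instanceof Error && RETRYABLE_ERRORS.has(error.name);
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs (values are assumptions).
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs `call`, retrying retryable failures until maxAttempts is reached.
async function withRetry<T>(call: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (!isRetryable(error) || attempt >= maxAttempts - 1) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}

interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Deducts one call's tokens from the remaining daily allowance, never below zero.
function remainingAllowance(allowance: number, usage: TokenUsage): number {
  return Math.max(0, allowance - usage.inputTokens - usage.outputTokens);
}
```

In use, the actual Bedrock invocation would be passed as `call`, and the usage it reports would be fed to `remainingAllowance` after each successful turn.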
3. Add voice input via Amazon Transcribe Streaming
SpeechRecognitionService uses @aws-sdk/client-transcribe-streaming: mic audio is uploaded as base64, decoded into a buffer, and streamed to Amazon Transcribe; the result comes back as recognized text plus a 0–100 accuracy score and per-word confidence. The service then compares the transcript to the expected text and produces wordScores, feedback, and an isMatch flag the UI uses to highlight which words were nailed and which need another try.
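The comparison step might look roughly like the sketch below. It is not the service's actual scoring logic: the matching rule (order-insensitive, each recognized word usable once), the 80-point threshold, and the type shapes are assumptions chosen for illustration; only the names wordScores and isMatch come from the description above.

```typescript
interface RecognizedWord {
  text: string;
  confidence: number; // assumed to be in the range 0..1
}

interface WordScore {
  word: string;
  matched: boolean;
  confidence: number;
}

// Lowercases and strips punctuation so "Coffee," matches "coffee".
function normalizeWord(word: string): string {
  return word.toLowerCase().replace(/[^a-z0-9']/g, "");
}

// Compares the expected sentence with the recognized words.
// Assumption: a word counts as matched if it was heard anywhere in the transcript.
function scoreTranscript(expected: string, recognized: RecognizedWord[], matchThreshold = 80) {
  const heard = recognized.map((w) => ({ text: normalizeWord(w.text), confidence: w.confidence }));
  const used = new Set<number>(); // indices of recognized words already consumed
  const expectedWords = expected.split(/\s+/).filter((w) => normalizeWord(w) !== "");

  const wordScores: WordScore[] = expectedWords.map((word) => {
    const target = normalizeWord(word);
    const index = heard.findIndex((h, i) => !used.has(i) && h.text === target);
    const hit = index >= 0 ? heard[index] : undefined;
    if (!hit) return { word, matched: false, confidence: 0 };
    used.add(index);
    return { word, matched: true, confidence: hit.confidence };
  });

  const matchedCount = wordScores.filter((w) => w.matched).length;
  const accuracy =
    wordScores.length === 0 ? 0 : Math.round((matchedCount / wordScores.length) * 100);
  return { wordScores, accuracy, isMatch: accuracy >= matchThreshold };
}
```

Returning a score per expected word, rather than one number for the whole sentence, is what lets the UI highlight individual words.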
4. Add voice output via Amazon Polly Neural
PollyService synthesizes audio at 24 kHz with the Neural engine and the Joanna voice, returns MP3 plus speech marks (word and sentence) so captions can be highlighted in sync, and chunks long scripts at 2,800 characters with parallel synthesis and client-side stitching.
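The chunking step can be sketched as follows. The 2,800-character limit comes from the description above; splitting on sentence boundaries, the hard split for oversized sentences, and the function names are assumptions, and the synthesizer is injected so the sketch stays independent of the Polly SDK.

```typescript
const MAX_CHUNK_CHARS = 2800; // chunk size stated in the text

// Splits a script into chunks of at most maxChars, preferring sentence boundaries.
function chunkScript(script: string, maxChars: number = MAX_CHUNK_CHARS): string[] {
  // Each match is a run of text plus its trailing sentence punctuation.
  const sentences = (script.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    let rest = sentence;
    // A single sentence longer than the limit is hard-split (assumed fallback).
    while (rest.length > maxChars) {
      if (current) {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (rest.length === 0) continue;
    const candidate = current ? `${current} ${rest}` : rest;
    if (candidate.length > maxChars) {
      chunks.push(current); // current is non-empty here because rest alone fits
      current = rest;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Synthesizes all chunks in parallel; Promise.all keeps results in chunk order,
// so the client can stitch the audio back together by index.
async function synthesizeInOrder<T>(
  chunks: string[],
  synthesize: (text: string) => Promise<T>,
): Promise<T[]> {
  return Promise.all(chunks.map((chunk) => synthesize(chunk)));
}
```

Because every chunk ends on a sentence boundary where possible, the seams between stitched audio segments fall at natural pauses.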
5. Make it real-time with a dedicated WebSocket API
ai-conversation-stack.ts defines a dedicated apigatewayv2.WebSocketApi with routeSelectionExpression = $request.body.subAction, per-environment stages (dev/staging/prod), throttling (burst 50, rate 25), a $connect handler that validates the Cognito JWT and stores the connection in DynamoDB, and a $disconnect handler that cleans up. The default route is the conversational turn handler.
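To show what that route selection expression does with an incoming message, here is a local simulation of the lookup. It mirrors API Gateway's documented behavior of falling back to $default when no route key matches or the body is not JSON; the function name and the example route keys in the usage note are hypothetical.

```typescript
// Simulates evaluating `$request.body.subAction` against the configured route keys.
function selectRouteKey(body: string, routeKeys: string[]): string {
  try {
    const parsed: unknown = JSON.parse(body);
    if (typeof parsed === "object" && parsed !== null) {
      const subAction = (parsed as Record<string, unknown>)["subAction"];
      if (typeof subAction === "string" && routeKeys.indexOf(subAction) !== -1) {
        return subAction; // a custom route matches the message's subAction
      }
    }
  } catch {
    // Non-JSON messages cannot be evaluated and fall through to $default.
  }
  return "$default"; // here, the conversational turn handler
}
```

For example, with hypothetical route keys `["ping", "endSession"]`, a message `{"subAction":"ping"}` selects `ping`, while a plain conversational message with no subAction lands on `$default`.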
6. Ship it
All AWS infrastructure is deployed via CDK with one stack per environment, least-privilege IAM per Lambda, and a single-table DynamoDB design. Mobile builds target both iOS and Android from a single React Native codebase.
| Area | Tools / Technologies |
|---|---|
| Mobile | React Native (TypeScript), React Navigation, Zustand, React Query |
| Backend (IaC) | AWS CDK (TypeScript), one stack per environment |
| Compute | AWS Lambda (Node.js 20), least-privilege IAM per function |
| API | API Gateway HTTP API + dedicated WebSocket API for AI conversations |
| Auth | Amazon Cognito (JWT authorizer + custom WebSocket authorizer) |
| Data | Amazon DynamoDB (single-table design) |
| AI / LLM | Amazon Bedrock (default model: amazon.nova-micro-v1:0) |
| Voice (TTS) | Amazon Polly Neural (Joanna, 24 kHz, MP3 + speech marks) |
| Voice (STT) | Amazon Transcribe Streaming (per-word confidence) |
| Observability | CloudWatch Logs, structured logger with retry/error metrics |
Most language apps either grade typed answers or play canned audio. En4Eo combines a generative LLM, real-time speech recognition, and neural voice in one feedback loop. That combination unlocks four things at once:
The product is differentiated not by the model itself (every team can call Bedrock) but by the seams around it: real-time WebSocket, chunked Polly, per-word Transcribe, and a daily-allowance system that keeps unit economics honest.
Building En4Eo taught me that the most expensive part of an AI/ML product isn't the model; it's the orchestration around the model. Bedrock gave me capable AI cheaply, but the work that mattered was the WebSocket layer that keeps a conversation alive, the Polly chunking strategy that doesn't crash on long scripts, the Transcribe pipeline that returns confidence per word so feedback is actionable, and the daily-allowance system that protects unit economics.
I also learned how much production engineering disappears once it's working. Retry-with-backoff, IAM scoping, throttling, JWT authorization on a WebSocket, and single-table DynamoDB design are invisible when they work, and catastrophic when they don't. Treating them as first-class deliverables, not afterthoughts, is what made the app feel reliable to learners.
Finally, I learned that AI/ML products live or die by the seams between services. Mic, transcribe, model, voice, screen: each handoff has to feel instant or the illusion of a tutor breaks. That insight has reshaped how I think about every AI feature I design now: the model is the engine, but the seams are the product.
This artifact is directly relevant to the AIML program because it operationalizes the concepts covered across the workshops (generative models, supervised feedback, evaluation, deployment, and ethics) inside a single shipping product. It demonstrates that I can move from understanding what a model is to deciding which managed model to call, how to wrap it for reliability, how to measure its output, and how to deliver it to a real user on a phone.
It is also relevant to my career path as a software engineer who builds with AI. Employers want engineers who can architect AI/ML systems end to end: choosing services, designing safe and observable orchestration, controlling cost, and integrating with mobile and cloud at the same time. En4Eo is the artifact I point to when I say I have done that work, not just read about it.
More broadly, an AI tutor for English connects to real-world accessibility: English is still the dominant language of business and higher education, and tools that lower the barrier to spoken fluency have direct social impact. That is the kind of AI/ML product I want to keep building.