AI/ML-Based AI Tutor

Voice Input and AI-Powered Speaking Practice
React Native
Amazon Bedrock
Amazon Polly
Amazon Transcribe
WebSocket API
CDK
DynamoDB

This artifact documents an AI/ML-based English tutor that I designed and built, shipped as a mobile product called English 4 Everyone (En4Eo). It is an end-to-end AI/ML system, not a prototype: a learner opens the app, picks a topic, and has a real spoken conversation with an AI tutor that hears them, responds in a natural voice, and grades pronunciation in real time.

I am including it in the portfolio because it pulls together every part of the AI/ML stack covered in this program: large language models for dialogue, speech recognition for input, neural text-to-speech for output, and the production engineering required to make the whole loop feel instant on a phone.

English 4 Everyone is a mobile app for non-native English speakers who want to practice speaking. The user picks a topic (Job Interview, Small Talk, Sports, etc.), gets a vocabulary warm-up and useful sentence patterns, then has a back-and-forth voice conversation with an AI tutor. Progress, streaks, and total speaking time are tracked daily so a learner can build a habit.

The screenshots below were captured from the production output build (screenshoot/output/iphone_67/) and show the core surfaces of the app.

Track Your Progress — daily streak, study minutes, daily AI Practice and Live Talk allowances.
AI-Powered Conversations — pick a topic from Professional, Social, or Specialized.
Topic Browser — drill into a category like Sports and pick a task by difficulty.
Speaking Practice — perfect pronunciation with AI feedback.
Grammar Made Simple — task breakdown with key vocabulary before you start.
Get Conversation Tips Before You Talk — useful sentence patterns.
Practice Anytime with AI — voice input via the mic button, live transcript and AI replies.
Session Summary — duration, exchanges, and tasks completed.

The objective was to build an AI tutor that gives a learner the kind of low-pressure speaking practice they would normally only get from a one-on-one human teacher, but available on a phone, on demand, and at near-zero marginal cost per session.

Make speaking practice low-pressure: No human waiting on the other side; the AI is patient and never judgemental.
Give actionable feedback: Show the learner what they actually said, with per-word accuracy and confidence, not just a pass/fail score.
Keep cost predictable: Lean on AWS managed AI services and serverless compute so unit economics scale cleanly per learner per minute.
Cover real-life topics: Generate Professional, Social, and Specialized topics on demand so the content library stays fresh without manual curation.

1. Define the conversation loop

I started by mapping the loop a learner goes through every session: tap a topic → see vocabulary and patterns → tap mic and speak → AI replies in voice → repeat → end with a summary. Every box in that loop became a service responsibility.
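One way to picture that loop is as a small state machine. The sketch below is illustrative only: the state and event names are hypothetical and do not come from the En4Eo codebase.

```typescript
// Hypothetical states, one per box in the session loop described above.
type SessionState =
  | "pickingTopic"
  | "warmUp"
  | "listening"
  | "tutorSpeaking"
  | "summary";

// Hypothetical events that move the learner from one box to the next.
type SessionEvent =
  | "topicTapped"
  | "micTapped"
  | "speechRecognized"
  | "sessionEnded";

// Allowed transitions; any event not listed for a state is ignored.
const transitions: Record<SessionState, Partial<Record<SessionEvent, SessionState>>> = {
  pickingTopic: { topicTapped: "warmUp" },
  warmUp: { micTapped: "listening" },
  listening: { speechRecognized: "tutorSpeaking" },
  // "repeat" in the loop: tapping the mic again returns to listening.
  tutorSpeaking: { micTapped: "listening", sessionEnded: "summary" },
  summary: {},
};

function nextState(state: SessionState, event: SessionEvent): SessionState {
  return transitions[state][event] ?? state;
}
```

Keeping the transitions in a single table makes it easy to see which service owns each step and which events are simply ignored in a given state.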

2. Build the language brain on Amazon Bedrock

A shared BedrockService wraps Amazon Bedrock and defaults to the amazon.nova-micro-v1:0 model. It retries with backoff on ThrottlingException, ServiceQuotaExceededException, ModelTimeoutException, and ModelErrorException, and it records token usage (inputTokens/outputTokens) so every call can be charged against the learner's daily allowance. Bedrock also drives the offline generation Lambdas: TopicVocabGeneratorLambda, TopicPatternsGeneratorLambda, and TopicImageGeneratorLambda.
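The retry behavior can be sketched roughly as follows. This is a minimal illustration, not the actual BedrockService code: the helper names, the attempt count, and the base and cap delays are assumptions, and jitter is left out for brevity. The Bedrock call itself is passed in as a function so the sketch stays self-contained.

```typescript
// Error names the service treats as retryable (taken from the description above).
const RETRYABLE_ERROR_NAMES = new Set<string>([
  "ThrottlingException",
  "ServiceQuotaExceededException",
  "ModelTimeoutException",
  "ModelErrorException",
]);

function isRetryable(err: unknown): boolean {
  return err instanceof Error && RETRYABLE_ERROR_NAMES.has(err.name);
}

// Exponential backoff: baseMs * 2^attempt, capped at capMs.
// baseMs and capMs are assumed values, not the production settings.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `call`, retrying only on the retryable error names, up to maxAttempts total tries.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-retryable errors or when the attempt budget is spent.
      if (!isRetryable(err) || attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

A caller would wrap the model invocation, for example `withRetry(() => invokeModel(prompt))`, where `invokeModel` stands in for whatever function actually sends the request to Bedrock.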

3. Add voice input via Amazon Transcribe Streaming

SpeechRecognitionService uses @aws-sdk/client-transcribe-streaming: mic audio is uploaded as base64, decoded into a buffer, streamed to Transcribe, and returned as recognized text plus a 0–100 accuracy score and per-word confidence. The service then compares the transcript to the expected text and produces wordScores, feedback, and an isMatch flag the UI uses to highlight which words were nailed and which need another try.
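The comparison step might look something like the sketch below. It is an assumption-heavy illustration rather than the real SpeechRecognitionService logic: it aligns words purely by position, assumes confidence arrives as a number between 0 and 1, and uses a hypothetical pass threshold of 80. Only the output field names (wordScores, feedback, isMatch) come from the description above.

```typescript
// One recognized word with its confidence, assumed to be in the range 0..1.
interface RecognizedWord {
  word: string;
  confidence: number;
}

interface WordScore {
  expected: string;
  heard: string | null;
  score: number; // 0-100
  matched: boolean;
}

interface PronunciationResult {
  accuracy: number; // 0-100, average of the per-word scores
  wordScores: WordScore[];
  isMatch: boolean;
  feedback: string;
}

// Lowercase and strip punctuation so "You?" and "you" compare equal.
const normalize = (w: string): string => w.toLowerCase().replace(/[^a-z0-9']/g, "");

function scorePronunciation(
  expectedText: string,
  recognized: RecognizedWord[],
  passThreshold = 80, // hypothetical threshold
): PronunciationResult {
  const expected = expectedText.split(/\s+/).map(normalize).filter((w) => w.length > 0);

  // Naive positional alignment: word i of the transcript is compared to expected word i.
  const wordScores = expected.map((exp, i): WordScore => {
    const heard: RecognizedWord | undefined = recognized[i];
    if (heard !== undefined && normalize(heard.word) === exp) {
      return { expected: exp, heard: heard.word, score: Math.round(heard.confidence * 100), matched: true };
    }
    return { expected: exp, heard: heard?.word ?? null, score: 0, matched: false };
  });

  const total = wordScores.reduce((sum, w) => sum + w.score, 0);
  const accuracy = wordScores.length === 0 ? 0 : Math.round(total / wordScores.length);
  const missed = wordScores.filter((w) => !w.matched).map((w) => w.expected);

  return {
    accuracy,
    wordScores,
    isMatch: accuracy >= passThreshold,
    feedback: missed.length === 0 ? "All words matched." : `Try again: ${missed.join(", ")}`,
  };
}
```

Positional alignment breaks down when the learner inserts or drops a word, so a production version would more plausibly use an edit-distance alignment before scoring.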

4. Add voice output via Amazon Polly Neural

PollyService synthesizes audio at 24 kHz with the Neural engine and the Joanna voice, returns MP3 plus speech marks (word and sentence) so captions can be highlighted in sync, and chunks long scripts at 2,800 characters with parallel synthesis and client-side stitching.
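The chunking step could be approximated as below. This is a sketch under assumptions: only the 2,800-character limit and the parallel synthesis come from the description above, while the sentence-boundary splitting, the function names, and the injected `synthesize` callback (standing in for the real Polly call) are hypothetical.

```typescript
const MAX_CHUNK_CHARS = 2800;

// Packs whole sentences into chunks of at most maxChars characters.
// A single sentence longer than maxChars is hard-split.
function chunkScript(text: string, maxChars: number = MAX_CHUNK_CHARS): string[] {
  // Each match is one sentence with its closing punctuation and trailing
  // whitespace, or a trailing fragment that has no closing punctuation.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [];
  const chunks: string[] = [];
  let current = "";

  const flush = (): void => {
    const trimmed = current.trim();
    if (trimmed.length > 0) chunks.push(trimmed);
    current = "";
  };

  for (const sentence of sentences) {
    for (let i = 0; i < sentence.length; i += maxChars) {
      const piece = sentence.slice(i, i + maxChars);
      // Start a new chunk when adding this piece would exceed the limit.
      if (current.length + piece.length > maxChars) flush();
      current += piece;
    }
  }
  flush();
  return chunks;
}

// Synthesizes all chunks concurrently; Promise.all preserves input order,
// so the client can stitch the results back together in sequence.
async function synthesizeAll<T>(
  chunks: string[],
  synthesize: (text: string) => Promise<T>,
): Promise<T[]> {
  return Promise.all(chunks.map((chunk) => synthesize(chunk)));
}
```

Splitting on sentence boundaries rather than at a fixed offset keeps the voice from pausing mid-sentence where two audio segments are joined.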

5. Make it real-time with a dedicated WebSocket API

ai-conversation-stack.ts defines a dedicated apigatewayv2.WebSocketApi with routeSelectionExpression = $request.body.subAction, per-environment stages (dev/staging/prod), throttling (burst 50, rate 25), a $connect handler that validates the Cognito JWT and stores the connection in DynamoDB, and a $disconnect handler that cleans up. The default route is the conversational turn handler.
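To make the routing concrete, here is a small model of how API Gateway resolves a route for an incoming frame when the route selection expression is `$request.body.subAction`. It is a sketch of the gateway's behavior, not code from ai-conversation-stack.ts, and the route key used in the example ("endSession") is hypothetical.

```typescript
// Models route selection for routeSelectionExpression = $request.body.subAction.
// routeKeys is the set of custom route keys defined on the API.
function selectRoute(rawBody: string, routeKeys: ReadonlySet<string>): string {
  try {
    const parsed: unknown = JSON.parse(rawBody);
    if (typeof parsed === "object" && parsed !== null) {
      const subAction = (parsed as Record<string, unknown>).subAction;
      if (typeof subAction === "string" && routeKeys.has(subAction)) {
        return subAction;
      }
    }
  } catch {
    // Non-JSON frames cannot be evaluated and fall through to $default.
  }
  // No matching route key: the default route (the conversational turn handler) runs.
  return "$default";
}

// Example with a hypothetical custom route:
const routes = new Set<string>(["endSession"]);
const chosen = selectRoute(JSON.stringify({ subAction: "chat", text: "Hi!" }), routes);
// `chosen` is "$default" here because "chat" is not a defined route key.
```

Routing ordinary turns through the default route means new message types can be added later as explicit route keys without changing what the client sends for a normal conversational turn.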

6. Ship it

All AWS infrastructure is deployed via CDK with one stack per environment, least-privilege IAM per Lambda, and DynamoDB used in single-table mode. Mobile builds target both iOS and Android from a single React Native codebase.

Area: Tools / Technologies
Mobile: React Native (TypeScript), React Navigation, Zustand, React Query
Backend (IaC): AWS CDK (TypeScript), one stack per environment
Compute: AWS Lambda (Node.js 20), least-privilege IAM per function
API: API Gateway HTTP API + dedicated WebSocket API for AI conversations
Auth: Amazon Cognito (JWT authorizer + custom WebSocket authorizer)
Data: Amazon DynamoDB (single-table design)
AI / LLM: Amazon Bedrock (default model: amazon.nova-micro-v1:0)
Voice (TTS): Amazon Polly Neural (Joanna, 24 kHz, MP3 + speech marks)
Voice (STT): Amazon Transcribe Streaming (per-word confidence)
Observability: CloudWatch Logs, structured logger with retry/error metrics

Mobile (React Native) ──► API Gateway HTTP API ──► Lambda (REST) ──► DynamoDB
    └─► API Gateway WebSocket ──► Lambda (AI conv) ──► Bedrock
                                                   └─► Polly
    └─► Mic upload (REST) ──► Lambda ──► Transcribe Streaming

Most language apps either grade typed answers or play canned audio. En4Eo combines a generative LLM, real-time speech recognition, and neural voice in one feedback loop. That combination unlocks four things at once:

Open-ended speaking, not multiple choice: The learner can say anything; the LLM understands and replies in context.
Per-word feedback: Transcribe Streaming returns word-level confidence, so the UI shows exactly which words to retry.
Captions synced to audio: Polly speech marks give word/sentence timings the client uses to highlight as the tutor speaks.
Cost discipline by design: Token accounting + daily allowances + WebSocket throttling keep one user from blowing up the bill for everyone else.
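The caption-sync idea can be sketched as a lookup from the current playback position to the speech mark that should be highlighted. The mark shape below follows Polly's speech mark JSON (time in milliseconds, type, start, end, value); the function name and the binary-search approach are assumptions, not the app's actual client code.

```typescript
// Shape of a Polly speech mark; `time` is milliseconds from the start of the audio.
interface SpeechMark {
  time: number;
  type: "word" | "sentence";
  start: number;
  end: number;
  value: string;
}

// Returns the index of the last mark whose time is <= positionMs,
// or -1 if playback has not reached the first mark yet.
// Assumes `marks` is sorted by time, which is how Polly emits them.
function activeMarkIndex(marks: SpeechMark[], positionMs: number): number {
  let lo = 0;
  let hi = marks.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (marks[mid].time <= positionMs) {
      found = mid; // candidate; keep looking to the right for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

The client would call this on each playback-progress tick with the word marks for the current audio and highlight the word at the returned index.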

The product is differentiated not by the model itself (every team can call Bedrock) but by the seams around it: real-time WebSocket, chunked Polly, per-word Transcribe, and a daily-allowance system that keeps unit economics honest.

Building En4Eo taught me that the most expensive part of an AI/ML product isn't the model; it's the orchestration around the model. Bedrock gave me capable AI cheaply, but the work that mattered was the WebSocket layer that keeps a conversation alive, the Polly chunking strategy that doesn't crash on long scripts, the Transcribe pipeline that returns confidence per word so feedback is actionable, and the daily-allowance system that protects unit economics.

I also learned how much production engineering disappears once it's working. Retry-with-backoff, IAM scoping, throttling, JWT authorization on a WebSocket, and single-table DynamoDB design are invisible when they work, and catastrophic when they don't. Treating them as first-class deliverables, not afterthoughts, is what made the app feel reliable to learners.

Finally, I learned that AI/ML products live or die by the seams between services. From mic to transcription to model to voice to screen, each handoff has to feel instant or the illusion of a tutor breaks. That insight has reshaped how I think about every AI feature I design now: the model is the engine, but the seams are the product.

This artifact is directly relevant to the AIML program because it operationalizes the concepts covered across the workshops (generative models, supervised feedback, evaluation, deployment, and ethics) inside a single shipping product. It demonstrates that I can move from understanding what a model is to deciding which managed model to call, how to wrap it for reliability, how to measure its output, and how to deliver it to a real user on a phone.

It is also relevant to my career path as a software engineer who builds with AI. Employers want engineers who can architect AI/ML systems end to end: choosing services, designing safe and observable orchestration, controlling cost, and integrating with mobile and cloud at the same time. En4Eo is the artifact I point to when I say I have done that work, not just read about it.

More broadly, an AI tutor for English connects to real-world accessibility. English is still the dominant language of business and higher education, and tools that lower the barrier to spoken fluency have direct social impact. That is the kind of AI/ML product I want to keep building.