Building a Voice-Driven Technical Interview System
Over the past week, I built a full voice-driven technical interview system as part of an engineering challenge — and it was one of the most fun end-to-end AI projects I've done recently.
This wasn't just about stitching APIs together. It was about creating a natural, interactive experience that feels production-ready while handling the complexity of real-time voice processing, AI evaluation, and structured data management.
What I Built
The platform enables users to conduct complete technical interviews using only their voice. No typing, no clicking through forms — just natural conversation with an AI interviewer that asks relevant questions, listens to your answers, and provides detailed evaluation.
Key Features:
🎙️ Real-time Speech-to-Text — Using OpenAI's Whisper model for accurate transcription
🗣️ Text-to-Speech — Natural-sounding AI interviewer voice
🤖 AI-Generated Questions — Dynamic technical questions based on job requirements
📊 Structured Scoring — Intelligent evaluation with detailed feedback
🧱 Nx Monorepo Architecture — Clean separation between NestJS API and React frontend
🗄️ PostgreSQL + TypeORM — Well-structured data models for interviews, questions, and evaluations
🔐 Secure Authentication — User-level access control and session management
The Technical Approach
Architecture
I structured this as an Nx monorepo to keep the API and frontend cleanly separated while sharing TypeScript types and utilities. This made development faster and ensured type safety across the stack.
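For example, a shared library inside the workspace can export the interview types that both the NestJS API and the React app import. The library name and fields below are illustrative, not the project's actual layout:

// libs/shared-types/src/index.ts (hypothetical shared Nx library)
export interface InterviewQuestion {
  id: string;
  topic: string;   // e.g. "data structures"
  prompt: string;  // the text the AI interviewer reads aloud
}

export interface AnswerSubmission {
  questionId: string;
  transcript: string; // Whisper output for the candidate's spoken answer
}

// Both apps import from the same scope, e.g.
// import { InterviewQuestion } from '@voice-interviewer/shared-types';
// so a change to these types is a compile error everywhere it matters.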
Backend (NestJS):
- RESTful API endpoints for interview management
- Real-time WebSocket connections for voice streaming
- Integration with OpenAI APIs (Whisper, GPT-4, TTS)
- PostgreSQL database with TypeORM for data persistence
- Authentication and authorization middleware
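To make that concrete, here is a minimal sketch of one slice of the API; the route, class, and field names are illustrative rather than the exact ones in the repo:

import { Body, Controller, Injectable, Post } from '@nestjs/common';

// Hypothetical service wrapping the interview workflow.
@Injectable()
export class InterviewsService {
  async start(jobType: string) {
    // ...create a session row, generate the first question with GPT-4...
    return { sessionId: 'generated-uuid', firstQuestion: 'Tell me about...' };
  }
}

@Controller('interviews')
export class InterviewsController {
  constructor(private readonly interviews: InterviewsService) {}

  // POST /interviews starts a new voice interview session
  @Post()
  start(@Body('jobType') jobType: string) {
    return this.interviews.start(jobType);
  }
}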
Frontend (React + TypeScript):
- Voice recording and playback UI (sketched after this list)
- Real-time transcription display
- Interview progress tracking
- Detailed evaluation results view
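As a rough illustration of the recording piece, a small hook around the browser MediaRecorder API covers the capture side. This is simplified; the hook name and chunk handling are mine, not necessarily the project's:

import { useRef, useState } from 'react';

// Minimal recording hook: opens the mic, records, and hands the finished
// clip to the caller as a Blob for upload or streaming.
export function useVoiceRecorder(onAudioReady: (audio: Blob) => void) {
  const recorderRef = useRef<MediaRecorder | null>(null);
  const [recording, setRecording] = useState(false);

  async function start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks: Blob[] = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => onAudioReady(new Blob(chunks, { type: 'audio/webm' }));
    recorder.start();
    recorderRef.current = recorder;
    setRecording(true);
  }

  function stop() {
    recorderRef.current?.stop();
    recorderRef.current?.stream.getTracks().forEach((track) => track.stop());
    setRecording(false);
  }

  return { recording, start, stop };
}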
Voice Processing Pipeline
The most interesting technical challenge was building a smooth voice interaction flow:
1. Capture audio from the user's microphone in real time
2. Stream it to Whisper for transcription with minimal latency
3. Send the transcription to GPT-4 for evaluation and next-question generation
4. Convert the AI response to speech using OpenAI's TTS
5. Play back the audio while displaying text for accessibility
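The real system streams these steps over WebSockets, but a condensed, single-turn version of the loop with the official openai Node SDK looks roughly like this (model names and prompt handling are simplified for illustration):

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One simplified turn: audio in -> transcript -> interviewer reply -> audio out.
async function interviewTurn(audioPath: string, systemPrompt: string) {
  // 1. Speech-to-text with Whisper
  const transcription = await openai.audio.transcriptions.create({
    model: 'whisper-1',
    file: fs.createReadStream(audioPath),
  });

  // 2. Evaluate the answer and produce the interviewer's reply with GPT-4
  const chat = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: transcription.text },
    ],
  });
  const reply = chat.choices[0].message.content ?? '';

  // 3. Text-to-speech for the interviewer's voice
  const speech = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'alloy',
    input: reply,
  });

  return {
    transcript: transcription.text,
    reply,
    audio: Buffer.from(await speech.arrayBuffer()),
  };
}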
Getting this to feel natural required careful attention to:
- Buffering strategies to minimize latency
- Error handling for network interruptions
- State management for conversation flow (see the sketch after this list)
- Loading states that don't break immersion
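For the state management piece, a small discriminated union goes a long way; the state names below are illustrative:

// Explicit turn state keeps loading indicators honest and makes error
// recovery obvious.
type TurnState =
  | { phase: 'idle' }
  | { phase: 'listening' }              // microphone is open
  | { phase: 'transcribing' }           // waiting on Whisper
  | { phase: 'thinking' }               // waiting on GPT-4
  | { phase: 'speaking'; text: string } // TTS playing, text shown for accessibility
  | { phase: 'error'; message: string };

function statusLabel(state: TurnState): string {
  switch (state.phase) {
    case 'idle': return 'Tap the mic to answer';
    case 'listening': return 'Listening...';
    case 'transcribing': return 'Transcribing your answer...';
    case 'thinking': return 'The interviewer is thinking...';
    case 'speaking': return state.text;
    case 'error': return state.message;
  }
}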
AI Evaluation Engine
Each answer goes through a structured evaluation process:
interface Evaluation {
  score: number;         // 0-10 rating
  strengths: string[];   // What the candidate did well
  weaknesses: string[];  // Areas for improvement
  suggestions: string[]; // Specific advice
  nextTopic?: string;    // Adaptive questioning
}
The AI doesn't just score answers — it provides actionable feedback and adapts the interview based on the candidate's responses. If someone struggles with arrays, the system might ask follow-up questions on fundamentals. If they excel, it moves to more advanced topics.
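As a sketch of how that adaptation could hang off the Evaluation shape above (the prompt wording and score threshold here are illustrative, not the exact logic in the repo):

// Hypothetical follow-up logic driven by the Evaluation result.
function buildNextQuestionPrompt(evaluation: Evaluation, jobTitle: string): string {
  // Weak answers steer toward fundamentals; strong answers escalate difficulty.
  const direction =
    evaluation.score < 5
      ? `Ask an easier follow-up revisiting fundamentals of ${evaluation.nextTopic ?? 'the same topic'}.`
      : `Ask a harder question on ${evaluation.nextTopic ?? 'a more advanced, related topic'}.`;

  return [
    `You are interviewing a candidate for a ${jobTitle} role.`,
    `Their last answer scored ${evaluation.score}/10.`,
    `Known weaknesses: ${evaluation.weaknesses.join('; ') || 'none noted'}.`,
    direction,
    'Respond with the question text only.',
  ].join('\n');
}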
Data Model
I designed the database schema to support:
- Multiple job types with different question pools
- Interview sessions with full conversation history
- Question-answer pairs with timestamps
- Aggregated scoring and analytics
Using TypeORM made it easy to define entities with TypeScript decorators and maintain type safety from database to API responses.
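A stripped-down version of two of those entities might look like this; the class and field names are illustrative, and the real schema carries more relations:

import { Column, CreateDateColumn, Entity, ManyToOne, PrimaryGeneratedColumn } from 'typeorm';

// An interview session groups the full conversation for one candidate.
@Entity()
export class InterviewSession {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @Column()
  jobType: string;

  @CreateDateColumn()
  startedAt: Date;
}

// One question-answer pair within a session, with its score and timestamp.
@Entity()
export class AnswerRecord {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @ManyToOne(() => InterviewSession)
  session: InterviewSession;

  @Column('text')
  transcript: string;

  @Column('int')
  score: number;

  @CreateDateColumn()
  answeredAt: Date;
}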
Clean Code Patterns
Throughout the project, I focused on:
- Dependency injection in NestJS for testable services
- DTOs with validation using class-validator (sketched after this list)
- Error boundaries for graceful failure handling
- Environment-based configuration for different deployment environments
- Comprehensive error logging for debugging production issues
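For example, a DTO for submitting a transcribed answer might look like this (the field names are hypothetical); combined with NestJS's global ValidationPipe, malformed payloads never reach the service layer:

import { IsNotEmpty, IsString, IsUUID, MaxLength } from 'class-validator';

// Hypothetical request body for submitting an answer to a session.
export class SubmitAnswerDto {
  @IsUUID()
  sessionId: string;

  @IsUUID()
  questionId: string;

  @IsString()
  @IsNotEmpty()
  @MaxLength(10000)
  transcript: string;
}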
What I Learned
1. Voice UX is Different
Text-based interfaces are forgiving — users can edit, think, and take their time. Voice requires immediate feedback, clear state communication, and handling of awkward pauses or interruptions.
2. LLM Orchestration Needs Structure
It's tempting to send everything to GPT and hope for the best. But production LLM apps need guardrails:
- Input validation and sanitization
- Output parsing and validation (see the sketch after this list)
- Fallback behaviors for API failures
- Token usage monitoring and cost controls
- Response caching where appropriate
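For the parsing and fallback pieces, the idea is to never trust raw model output. Here is a sketch using zod for validation (zod is just one option here, and the Evaluation shape is the interface shown earlier):

import { z } from 'zod';

// Validate the model's JSON against the Evaluation shape before trusting it.
const EvaluationSchema = z.object({
  score: z.number().min(0).max(10),
  strengths: z.array(z.string()),
  weaknesses: z.array(z.string()),
  suggestions: z.array(z.string()),
  nextTopic: z.string().optional(),
});

export function parseEvaluation(raw: string) {
  try {
    const result = EvaluationSchema.safeParse(JSON.parse(raw));
    if (result.success) return result.data;
  } catch {
    // JSON.parse failed; fall through to the fallback below
  }
  // Fallback keeps the interview moving even when the model returns junk.
  return {
    score: 0,
    strengths: [],
    weaknesses: [],
    suggestions: ['Automatic evaluation failed; flag this answer for manual review.'],
  };
}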
3. Monorepos Speed Up Development
Having the API and frontend in one repo with shared types eliminated an entire class of bugs. When I changed a response shape on the backend, TypeScript immediately showed me what needed updating on the frontend.
4. Real-time Systems Need Buffering
Streaming voice data isn't like HTTP requests. You need smart buffering strategies, reconnection logic, and state synchronization to handle the messy reality of network conditions.
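A simplified client-side example of that idea: buffer audio chunks while the socket is down and flush them on reconnect (naive retry with no backoff cap; the class and field names are mine):

// Queues audio chunks while the WebSocket is down and flushes on reconnect.
class BufferedAudioSender {
  private queue: Blob[] = [];
  private socket: WebSocket | null = null;

  constructor(private url: string) {
    this.connect();
  }

  private connect() {
    this.socket = new WebSocket(this.url);
    this.socket.onopen = () => this.flush();
    this.socket.onclose = () => setTimeout(() => this.connect(), 1000); // naive retry
  }

  send(chunk: Blob) {
    this.queue.push(chunk);
    this.flush();
  }

  private flush() {
    while (this.queue.length && this.socket?.readyState === WebSocket.OPEN) {
      this.socket.send(this.queue.shift()!);
    }
  }
}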
Try It Yourself
The project is open source and available on GitHub:
🔗 github.com/ITsolution-git/voice-agent-interviewer
If you're exploring:
- Voice interface development
- LLM orchestration patterns
- Real-time speech processing
- Nx monorepo architecture
...I'd love to hear your thoughts or answer questions!
What's Next?
Some features I'm considering:
- Multi-language support (Spanish, Mandarin, etc.)
- Video recording for behavioral interviews
- Integration with ATS (Applicant Tracking Systems)
- Collaborative review mode for hiring teams
- Advanced analytics and comparison metrics
This project reminded me why I love working on AI-driven applications. There's something special about building interfaces that feel natural and intelligent — not just functional, but genuinely helpful.
Always excited to learn, iterate, and keep building. 🚀