Building a Voice-Driven Technical Interview System
Over the past week, I built a full voice-driven technical interview system as part of an engineering challenge — and it was one of the most fun end-to-end AI projects I've done recently.
This wasn't just about stitching APIs together. It was about creating a natural, interactive experience that feels production-ready while handling the complexity of real-time voice processing, AI evaluation, and structured data management.
What I Built
The platform enables users to conduct complete technical interviews using only their voice. No typing, no clicking through forms — just natural conversation with an AI interviewer that asks relevant questions, listens to your answers, and provides detailed evaluation.
Key Features:
🎙️ Real-time Speech-to-Text — Using OpenAI's Whisper model for accurate transcription
🗣️ Text-to-Speech — Natural-sounding AI interviewer voice
🤖 AI-Generated Questions — Dynamic technical questions based on job requirements
📊 Structured Scoring — Intelligent evaluation with detailed feedback
🧱 Nx Monorepo Architecture — Clean separation between NestJS API and React frontend
🗄️ PostgreSQL + TypeORM — Well-structured data models for interviews, questions, and evaluations
🔐 Secure Authentication — User-level access control and session management
The Technical Approach
Architecture
I structured this as an Nx monorepo to keep the API and frontend cleanly separated while sharing TypeScript types and utilities. This made development faster and ensured type safety across the stack.
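For example, a shared library inside the workspace can export the interview types that both the NestJS API and the React app import. The library name and fields below are illustrative, not the project's actual layout:

// libs/shared-types/src/index.ts (hypothetical shared Nx library)
export interface InterviewQuestion {
  id: string;
  topic: string;   // e.g. "data structures"
  prompt: string;  // the text the AI interviewer reads aloud
}

export interface AnswerSubmission {
  questionId: string;
  transcript: string; // Whisper output for the candidate's spoken answer
}

// Both apps import from the same scope, e.g.
// import { InterviewQuestion } from '@voice-interviewer/shared-types';
// so a change to these types is a compile error everywhere it matters.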
Backend (NestJS):
- RESTful API endpoints for interview management
- Real-time WebSocket connections for voice streaming
- Integration with OpenAI APIs (Whisper, GPT-4, TTS)
- PostgreSQL database with TypeORM for data persistence
- Authentication and authorization middleware
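To make that concrete, here is a minimal sketch of one slice of the API; the route, class, and field names are illustrative rather than the exact ones in the repo:

import { Body, Controller, Injectable, Post } from '@nestjs/common';

// Hypothetical service wrapping the interview workflow.
@Injectable()
export class InterviewsService {
  async start(jobType: string) {
    // ...create a session row, generate the first question with GPT-4...
    return { sessionId: 'generated-uuid', firstQuestion: 'Tell me about...' };
  }
}

@Controller('interviews')
export class InterviewsController {
  constructor(private readonly interviews: InterviewsService) {}

  // POST /interviews starts a new voice interview session
  @Post()
  start(@Body('jobType') jobType: string) {
    return this.interviews.start(jobType);
  }
}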
Frontend (React + TypeScript):
- Voice recording and playback UI (sketched after this list)
- Real-time transcription display
- Interview progress tracking
- Detailed evaluation results view
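As a rough illustration of the recording piece, a small hook around the browser MediaRecorder API covers the capture side. This is simplified; the hook name and chunk handling are mine, not necessarily the project's:

import { useRef, useState } from 'react';

// Minimal recording hook: opens the mic, records, and hands the finished
// clip to the caller as a Blob for upload or streaming.
export function useVoiceRecorder(onAudioReady: (audio: Blob) => void) {
  const recorderRef = useRef<MediaRecorder | null>(null);
  const [recording, setRecording] = useState(false);

  async function start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    const recorder = new MediaRecorder(stream);
    const chunks: Blob[] = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => onAudioReady(new Blob(chunks, { type: 'audio/webm' }));
    recorder.start();
    recorderRef.current = recorder;
    setRecording(true);
  }

  function stop() {
    recorderRef.current?.stop();
    recorderRef.current?.stream.getTracks().forEach((track) => track.stop());
    setRecording(false);
  }

  return { recording, start, stop };
}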
Voice Processing Pipeline
The most interesting technical challenge was building a smooth voice interaction flow:
1. Capture audio from the user's microphone in real time
2. Stream it to Whisper for transcription with minimal latency
3. Send the transcription to GPT-4 for evaluation and next-question generation
4. Convert the AI response to speech using OpenAI's TTS
5. Play back the audio while displaying text for accessibility
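The real system streams these steps over WebSockets, but a condensed, single-turn version of the loop with the official openai Node SDK looks roughly like this (model names and prompt handling are simplified for illustration):

import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One simplified turn: audio in -> transcript -> interviewer reply -> audio out.
async function interviewTurn(audioPath: string, systemPrompt: string) {
  // 1. Speech-to-text with Whisper
  const transcription = await openai.audio.transcriptions.create({
    model: 'whisper-1',
    file: fs.createReadStream(audioPath),
  });

  // 2. Evaluate the answer and produce the interviewer's reply with GPT-4
  const chat = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: transcription.text },
    ],
  });
  const reply = chat.choices[0].message.content ?? '';

  // 3. Text-to-speech for the interviewer's voice
  const speech = await openai.audio.speech.create({
    model: 'tts-1',
    voice: 'alloy',
    input: reply,
  });

  return {
    transcript: transcription.text,
    reply,
    audio: Buffer.from(await speech.arrayBuffer()),
  };
}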
Getting this to feel natural required careful attention to:
- Buffering strategies to minimize latency
- Error handling for network interruptions
- State management for conversation flow (see the sketch after this list)
- Loading states that don't break immersion
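For the state management piece, a small discriminated union goes a long way; the state names below are illustrative:

// Explicit turn state keeps loading indicators honest and makes error
// recovery obvious.
type TurnState =
  | { phase: 'idle' }
  | { phase: 'listening' }              // microphone is open
  | { phase: 'transcribing' }           // waiting on Whisper
  | { phase: 'thinking' }               // waiting on GPT-4
  | { phase: 'speaking'; text: string } // TTS playing, text shown for accessibility
  | { phase: 'error'; message: string };

function statusLabel(state: TurnState): string {
  switch (state.phase) {
    case 'idle': return 'Tap the mic to answer';
    case 'listening': return 'Listening...';
    case 'transcribing': return 'Transcribing your answer...';
    case 'thinking': return 'The interviewer is thinking...';
    case 'speaking': return state.text;
    case 'error': return state.message;
  }
}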
AI Evaluation Engine
Each answer goes through a structured evaluation process:
interface Evaluation {
  score: number;         // 0-10 rating
  strengths: string[];   // What the candidate did well
  weaknesses: string[];  // Areas for improvement
  suggestions: string[]; // Specific advice
  nextTopic?: string;    // Adaptive questioning
}
The AI doesn't just score answers — it provides actionable feedback and adapts the interview based on the candidate's responses. If someone struggles with arrays, the system might ask follow-up questions on fundamentals. If they excel, it moves to more advanced topics.
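As a sketch of how that adaptation could hang off the Evaluation shape above (the prompt wording and score threshold here are illustrative, not the exact logic in the repo):

// Hypothetical follow-up logic driven by the Evaluation result.
function buildNextQuestionPrompt(evaluation: Evaluation, jobTitle: string): string {
  // Weak answers steer toward fundamentals; strong answers escalate difficulty.
  const direction =
    evaluation.score < 5
      ? `Ask an easier follow-up revisiting fundamentals of ${evaluation.nextTopic ?? 'the same topic'}.`
      : `Ask a harder question on ${evaluation.nextTopic ?? 'a more advanced, related topic'}.`;

  return [
    `You are interviewing a candidate for a ${jobTitle} role.`,
    `Their last answer scored ${evaluation.score}/10.`,
    `Known weaknesses: ${evaluation.weaknesses.join('; ') || 'none noted'}.`,
    direction,
    'Respond with the question text only.',
  ].join('\n');
}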
Data Model
I designed the database schema to support:
- Multiple job types with different question pools
- Interview sessions with full conversation history
- Question-answer pairs with timestamps
- Aggregated scoring and analytics
Using TypeORM made it easy to define entities with TypeScript decorators and maintain type safety from database to API responses.
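A stripped-down version of two of those entities might look like this; the class and field names are illustrative, and the real schema carries more relations:

import { Column, CreateDateColumn, Entity, ManyToOne, PrimaryGeneratedColumn } from 'typeorm';

// An interview session groups the full conversation for one candidate.
@Entity()
export class InterviewSession {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @Column()
  jobType: string;

  @CreateDateColumn()
  startedAt: Date;
}

// One question-answer pair within a session, with its score and timestamp.
@Entity()
export class AnswerRecord {
  @PrimaryGeneratedColumn('uuid')
  id: string;

  @ManyToOne(() => InterviewSession)
  session: InterviewSession;

  @Column('text')
  transcript: string;

  @Column('int')
  score: number;

  @CreateDateColumn()
  answeredAt: Date;
}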
Clean Code Patterns
Throughout the project, I focused on:
- Dependency injection in NestJS for testable services
- DTOs with validation using class-validator (sketched after this list)
- Error boundaries for graceful failure handling
- Environment-based configuration for different deployment environments
- Comprehensive error logging for debugging production issues
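For example, a DTO for submitting a transcribed answer might look like this (the field names are hypothetical); combined with NestJS's global ValidationPipe, malformed payloads never reach the service layer:

import { IsNotEmpty, IsString, IsUUID, MaxLength } from 'class-validator';

// Hypothetical request body for submitting an answer to a session.
export class SubmitAnswerDto {
  @IsUUID()
  sessionId: string;

  @IsUUID()
  questionId: string;

  @IsString()
  @IsNotEmpty()
  @MaxLength(10000)
  transcript: string;
}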
What I Learned
1. Voice UX is Different
Text-based interfaces are forgiving — users can edit, think, and take their time. Voice requires immediate feedback, clear state communication, and handling of awkward pauses or interruptions.
2. LLM Orchestration Needs Structure
It's tempting to send everything to GPT and hope for the best. But production LLM apps need guardrails:
- Input validation and sanitization
- Output parsing and validation (see the sketch after this list)
- Fallback behaviors for API failures
- Token usage monitoring and cost controls
- Response caching where appropriate
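For the parsing and fallback pieces, the idea is to never trust raw model output. Here is a sketch using zod for validation (zod is just one option here, and the Evaluation shape is the interface shown earlier):

import { z } from 'zod';

// Validate the model's JSON against the Evaluation shape before trusting it.
const EvaluationSchema = z.object({
  score: z.number().min(0).max(10),
  strengths: z.array(z.string()),
  weaknesses: z.array(z.string()),
  suggestions: z.array(z.string()),
  nextTopic: z.string().optional(),
});

export function parseEvaluation(raw: string) {
  try {
    const result = EvaluationSchema.safeParse(JSON.parse(raw));
    if (result.success) return result.data;
  } catch {
    // JSON.parse failed; fall through to the fallback below
  }
  // Fallback keeps the interview moving even when the model returns junk.
  return {
    score: 0,
    strengths: [],
    weaknesses: [],
    suggestions: ['Automatic evaluation failed; flag this answer for manual review.'],
  };
}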
3. Monorepos Speed Up Development
Having the API and frontend in one repo with shared types eliminated an entire class of bugs. When I changed a response shape on the backend, TypeScript immediately showed me what needed updating on the frontend.
4. Real-time Systems Need Buffering
Streaming voice data isn't like HTTP requests. You need smart buffering strategies, reconnection logic, and state synchronization to handle the messy reality of network conditions.
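A simplified client-side example of that idea: buffer audio chunks while the socket is down and flush them on reconnect (naive retry with no backoff cap; the class and field names are mine):

// Queues audio chunks while the WebSocket is down and flushes on reconnect.
class BufferedAudioSender {
  private queue: Blob[] = [];
  private socket: WebSocket | null = null;

  constructor(private url: string) {
    this.connect();
  }

  private connect() {
    this.socket = new WebSocket(this.url);
    this.socket.onopen = () => this.flush();
    this.socket.onclose = () => setTimeout(() => this.connect(), 1000); // naive retry
  }

  send(chunk: Blob) {
    this.queue.push(chunk);
    this.flush();
  }

  private flush() {
    while (this.queue.length && this.socket?.readyState === WebSocket.OPEN) {
      this.socket.send(this.queue.shift()!);
    }
  }
}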
Try It Yourself
The project is open source and available on GitHub:
🔗 github.com/ITsolution-git/voice-agent-interviewer
If you're exploring:
- Voice interface development
- LLM orchestration patterns
- Real-time speech processing
- Nx monorepo architecture
...I'd love to hear your thoughts or answer questions!
What's Next?
Some features I'm considering:
- Multi-language support (Spanish, Mandarin, etc.)
- Video recording for behavioral interviews
- Integration with ATS (Applicant Tracking Systems)
- Collaborative review mode for hiring teams
- Advanced analytics and comparison metrics
This project reminded me why I love working on AI-driven applications. There's something special about building interfaces that feel natural and intelligent — not just functional, but genuinely helpful.
Always excited to learn, iterate, and keep building. 🚀