Why We Built a Personalization Engine (And Why It Doesn't Use an LLM)
The behavioral science, adaptive learning research, and 5-factor scoring model behind ExecReps' recommendation system
Here's a dirty secret about professional development platforms: they treat every user the same. A VP preparing for a board presentation and a first-time manager rehearsing a 1:1 see the same workout library. The same grid. The same "pick whatever looks interesting" experience.
It's like walking into a gym with 50 machines and no trainer — you wander, you try something random, you leave unsure if you did the right thing.
We know this because we watched it happen. Our beta users would complete one workout, get their AI score, return to the library, stare at 30+ options, and ask the same question:
"What should I do next?"
That question — asked by more users than any other — became the catalyst for everything that followed. So we built a personalization engine from first principles. And no, it doesn't use an LLM.
What the Research Says About Skill Development
Before writing a single line of code, we spent weeks studying the science of how people actually get better at complex skills. Not just any skill — communication, which is uniquely difficult because it involves simultaneous cognitive load (content structure, argument quality) and motor/performance load (vocal delivery, pacing, filler word suppression).
Three bodies of research shaped our approach.
Bayesian Knowledge Tracing: Knowing What You Know
Developed at Carnegie Mellon University by Albert Corbett and John Anderson in the 1990s, Bayesian Knowledge Tracing (BKT) is the mathematical backbone of adaptive learning systems like Carnegie Learning and Khan Academy. It models each skill as having a probability of mastery, updated after every attempt using Bayes' theorem.
The key insight for ExecReps: our dual-axis scoring system already generates exactly the data BKT needs. Every workout submission produces per-dimension scores across content categories (argument structure, evidence use, audience awareness) and delivery metrics (pace, fluency, filler words, confidence). These map directly to BKT skill nodes.
We don't need to guess what a user knows. We measure it — with voice AI that analyzes what humans hear.
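The BKT update itself is compact. Here's a minimal sketch, with illustrative guess/slip/learn parameters (not fitted values), treating a dimension score that clears a pass threshold as one "correct" observation — an assumption about how continuous scores map to BKT's binary observations, since ExecReps' actual mapping isn't spelled out here:

```python
def bkt_update(p_mastery, correct, p_guess=0.2, p_slip=0.1, p_learn=0.15):
    """One BKT step: Bayes' rule on the observation, then the chance
    of having learned the skill on this attempt.
    Parameter values are illustrative, not fitted."""
    if correct:
        evidence = p_mastery * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastery) * p_guess)
    else:
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - p_guess))
    return posterior + (1 - posterior) * p_learn

# Treat each scored dimension as "correct" when it clears a pass
# threshold (an assumption for illustration).
p = 0.30  # prior mastery of, say, Persuasion
for dim_score in (82, 75, 58, 90):
    p = bkt_update(p, correct=(dim_score >= 70))
```

Each correct attempt pulls the mastery estimate up; each miss pulls it down, tempered by the guess and slip rates.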
Spaced Repetition: The Forgetting Curve Is Real
Hermann Ebbinghaus demonstrated in 1885 that memory decays exponentially without reinforcement. Modern systems like Duolingo's "Birdbrain" and Anki's FSRS algorithm have turned this into a science: the optimal time to review something is just before you forget it.
But communication skills aren't vocabulary flashcards. You don't "forget" how to structure an argument overnight. What does decay is the automaticity — the ability to do it fluently, under pressure, without conscious effort.
We adapted the spaced repetition model for performance decay. Delivery skills like pace and fluency decay faster than content skills like structure and evidence, because they're more motor-dependent. A user who scored 78 on "Executive Presence" three weeks ago might have an estimated current proficiency of 62 — and the engine knows it's time to practice.
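A minimal sketch of that decay model, using exponential forgetting-curve decay with illustrative stability values (not ExecReps' production parameters):

```python
import math

def estimated_proficiency(last_score, days_since, stability):
    """Forgetting-curve decay: 'stability' is the number of days for
    estimated proficiency to fall to 1/e of the last measured score.
    Stability values below are illustrative, not production parameters."""
    return last_score * math.exp(-days_since / stability)

# Motor-dependent delivery skills get a smaller stability than
# content skills, so they decay faster.
delivery = estimated_proficiency(78, days_since=21, stability=90)   # ~61.8
content = estimated_proficiency(78, days_since=21, stability=180)   # ~69.4
```

With a 90-day stability, a 78 from three weeks ago decays to roughly 62 — the worked example above.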
Zone of Proximal Development: The Goldilocks Zone
Lev Vygotsky's 1930s concept of the Zone of Proximal Development (ZPD) has held up across decades of educational research: learning is maximized when difficulty sits just beyond current ability — not so easy it's boring, not so hard it's frustrating.
The sweet spot? Research consistently shows 60–75% success rates maximize learning velocity. Below 40%, learners disengage. Above 85%, they coast. We built this directly into the recommendation engine.
What We Actually Built: The 5-Factor Scoring Model
Here's what surprises people: the recommendation engine doesn't use a large language model. It doesn't need one. It's pure business logic — deterministic scoring built on the rich data our existing voice AI already generates.
Every candidate workout is scored across five factors:
- Skill Match (40%) — How well does this workout target your weakest skills? This is the most sophisticated component: it cross-references your per-dimension scores from completed workouts against the skill targets of available workouts.
- Difficulty Fit (25%) — Is this workout in your Zone of Proximal Development? Every workout has a difficulty tier. Every user has an estimated level. The engine preferentially selects workouts one tier above current performance — stretching without breaking.
- Freshness (20%) — Are decaying skills being resurfaced at the right time? Skills you haven't practiced recently get a boost, weighted by their decay rate.
- Role Relevance (10%) — Does this match your seniority and context? A first-time manager and a C-suite exec need different scenarios.
- Exploration (5%) — A small random boost to ensure variety and discovery, preventing the engine from becoming too narrow.
The 40/25/20/10/5 weighting wasn't arbitrary. It emerged from our Opportunity Solution Tree analysis: the #1 user problem (scored 0.82 on importance × satisfaction gap) was "I don't know what to practice next." Skill matching directly addresses that. The #3 problem (0.65) was "I don't know if this is the right difficulty for me." Hence the 25% difficulty fit weight.
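Mechanically, the scoring is just a weighted sum over normalized per-factor scores. A sketch (the factor names and data shapes are illustrative, not the actual API):

```python
WEIGHTS = {
    "skill_match": 0.40,
    "difficulty_fit": 0.25,
    "freshness": 0.20,
    "role_relevance": 0.10,
    "exploration": 0.05,
}

def relevance_score(factors):
    """Weighted sum of per-factor scores, each normalized to [0, 1]."""
    return sum(w * factors.get(name, 0.0) for name, w in WEIGHTS.items())

def rank(candidates):
    """Order candidate workouts by relevance, best first."""
    return sorted(candidates, key=lambda c: relevance_score(c["factors"]),
                  reverse=True)
```

Because the weights sum to 1.0 and each factor is already normalized, the final relevance score is itself a clean 0–1 number — which is what makes "why was this recommended?" answerable factor by factor.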
The Skill Taxonomy: Mapping Chaos to Structure
Before the engine could recommend by skill, we needed to solve a foundational problem: ExecReps had 141 unique freeform skill labels across its workout library. "Executive communication," "exec comm," "boardroom presence," and "C-suite presentation skills" were all separate strings describing overlapping competencies.
We mapped all 141 labels to 8 validated dimensions using a hybrid approach:
- O*NET Framework (U.S. Department of Labor) — The occupational taxonomy that maps skills across 900+ occupations. We selected the communication-relevant dimensions: Speaking, Active Listening, Persuasion, Social Perceptiveness, Negotiation, and Instructing.
- World Economic Forum Future of Jobs Report — Identifies the most valued professional skills through 2030. We added Analytical Thinking and Leadership/Social Influence.
- Gemini Embedding Classification — Each freeform label was embedded as a vector and classified to its nearest validated dimension using cosine similarity.
Average similarity score: 0.834 — suggesting the automated classification closely tracks what a human expert would assign.
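The classification step itself is simple once embeddings exist: nearest validated dimension by cosine similarity. A sketch, with toy 3-dimensional vectors standing in for real embedding-model output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(label_vec, dimension_vecs):
    """Assign a freeform label's embedding to its nearest validated
    dimension; returns the dimension name and the similarity score."""
    best = max(dimension_vecs, key=lambda d: cosine(label_vec, dimension_vecs[d]))
    return best, cosine(label_vec, dimension_vecs[best])

# Toy vectors stand in for real embeddings of the 8 dimensions.
DIMENSIONS = {"Speaking": [1.0, 0.0, 0.0], "Persuasion": [0.0, 1.0, 0.0]}
dim, sim = classify([0.9, 0.1, 0.0], DIMENSIONS)  # a "speaking"-flavored label
```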
The result: 8 clean dimensions that the recommendation engine can score, track, and surface. Every workout has a primary dimension and optional secondary targets. Every user accumulates a skill profile across all 8.
Cold Start: The First 60 Seconds Matter Most
New users have no score history. No skill gaps to target. No decay curves to model. This is the "cold start problem" — and how you solve it determines whether users stick around or bounce.
We solve it in onboarding with two screens that take under 30 seconds combined:
Challenge Picker: Six cards representing common communication challenges — "Thinking on my feet," "Being concise," "Executive presence," "Handling pushback," "Data storytelling," "Persuading stakeholders." Users tap their biggest challenge.
Confidence Sliders: Five sliders across core communication dimensions, each on a 1–10 scale. Research shows self-assessed confidence has only ~0.3 correlation with actual ability — so we don't trust these numbers as ground truth. Instead, they seed the engine's initial recommendations with a boosted exploration factor (0.30 vs. the standard 0.05). The engine starts broad, then converges as real performance data arrives.
This is the explore-exploit tradeoff at the heart of contextual bandit systems like those at Netflix and Spotify: with a new user, explore widely. With an experienced user, exploit what works. The transition happens automatically as the recommendation log fills up.
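One way to sketch that transition: anneal the exploration weight from its cold-start boost down to steady state as logged interactions accumulate. The linear ramp over roughly 20 interactions is an illustrative choice, not the production schedule:

```python
def exploration_weight(n_logged, cold_start=0.30, steady=0.05, ramp=20):
    """Anneal exploration from the cold-start boost (0.30) down to the
    steady-state weight (0.05) as recommendation-log entries accumulate.
    The linear ramp over ~20 interactions is an illustrative choice."""
    progress = min(n_logged / ramp, 1.0)
    return cold_start + (steady - cold_start) * progress
```

A brand-new user gets the full 0.30 boost; by the time twenty recommendations have been logged, the engine is back to the standard 0.05.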
The Data Architecture Behind It All
The engine is built on three new data structures, all additive to the existing schema:
Skill State Tracker: Per-user, per-dimension record of current score, estimated proficiency (with decay), score trend (improving/stable/declining), and memory stability parameter. Updated after every workout completion.
Recommendation Log: Every recommendation shown, its position, the explanation displayed, user action (selected/skipped/ignored), completion score if acted on, and the algorithm's computed relevance score. This table is gold — it's the training dataset for Phase 2's machine learning model.
Workout-Skill Mapping: Junction table connecting each workout to its target skill dimensions with weights and difficulty tiers. Populated using the skill taxonomy classification results.
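As Python dataclasses, the three structures might look roughly like this. Field names and types are illustrative, not the actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SkillState:
    """Per-user, per-dimension proficiency record."""
    user_id: str
    dimension: str                 # one of the 8 validated dimensions
    current_score: float           # last measured score
    estimated_proficiency: float   # current_score after decay
    trend: str                     # "improving" | "stable" | "declining"
    stability_days: float          # memory stability parameter
    updated_at: datetime

@dataclass
class RecommendationLogEntry:
    """One recommendation impression: Phase 2's training data."""
    user_id: str
    workout_id: str
    position: int
    explanation: str
    action: str                    # "selected" | "skipped" | "ignored"
    relevance_score: float
    completion_score: Optional[float] = None

@dataclass
class WorkoutSkillMapping:
    """Junction row: workout to target dimension, weighted, with tier."""
    workout_id: str
    dimension: str
    weight: float                  # 1.0 for primary, lower for secondary
    difficulty_tier: int
```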
All data is team-scoped with Row Level Security. Users see only their own data. Admins see their team. No cross-team leakage.
Why We Didn't Use an LLM (Yet)
It's tempting to throw GPT-4 at personalization. "Just ask the LLM what workout to recommend." But there are three reasons we deliberately chose rule-based scoring for Phase 1:
Determinism matters for trust. When a user asks "why was this recommended?", the answer should be traceable to specific factors — not a black-box probability. "Because your Stakeholder Alignment score decayed from 72 to 58" is verifiable. "Because the AI thought it was relevant" is not.
Cost control. At scale, running an LLM inference for every library page load would add $0.02–0.05 per recommendation set. With 80 submissions per seat per month across thousands of users, that's material. Rule-based scoring is essentially free.
We need the data first. The recommendation log is building the dataset that will make Phase 2's LLM Coach genuinely intelligent — not just a chatbot with access to your scores, but an adaptive coach that understands your learning trajectory, your skill decay patterns, and your optimal challenge level.
Phase 2 will introduce an LLM Coach as a premium feature. By then, we'll have the behavioral data to make it worth paying for.
Making Intelligence Visible
The science behind a recommendation means nothing if users don't understand why it's being shown to them. Our "Recommended for You" banner displays 2–3 workouts with natural-language explanations:
"Good difficulty match for your level" — "Targets your stakeholder alignment gap" — "Builds on your strong data storytelling"
This isn't just good UX — it's backed by Cialdini's Principle of Authority and the Endowment Effect. When users understand the intelligence behind a recommendation, they value it more and are more likely to act on it. Netflix learned this early: "Because you watched X" outperforms generic "Trending" lists by 3–4x in click-through.
Users who want something different can Skip to get alternative recommendations — logging the skip as a signal that refines future scoring.
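Mechanically, the explanation can be derived from whichever factor dominated the workout's score. A sketch; the thresholds and copy are illustrative, not the production rules:

```python
def explain(factors, gap_dimension=None):
    """Choose the banner copy from the dominant scoring factor.
    Thresholds and wording are illustrative, not the production rules."""
    if gap_dimension and factors.get("skill_match", 0.0) >= 0.7:
        return f"Targets your {gap_dimension} gap"
    if factors.get("difficulty_fit", 0.0) >= 0.8:
        return "Good difficulty match for your level"
    if factors.get("freshness", 0.0) >= 0.8:
        return "Time to resurface a skill you haven't practiced lately"
    return "Builds on your strengths"
```

Because every explanation maps back to a concrete factor score, the "why was this recommended?" answer is always traceable.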
What's Live — and What's Coming Next
As of March 2026, the full personalization stack is in production:
- Data Foundation — Skill taxonomy, feature flag infrastructure, 8-dimension classification
- Recommendation Engine API — 5-factor scoring function with cold-start handling and decay model
- Library UI — "Recommended for You" banner, smart sort, skip interaction, "Review & Repeat" section
- Enhanced Onboarding — Challenge Picker + Confidence Sliders for new users
- Retrofit Survey — Interstitial modal for existing users to seed personalization
Coming soon: a Communication Profile with radar chart visualization, a Diagnostic Mini-Workout that seeds the engine with measured performance in 60 seconds, and an LLM Coach Mode trained on all this accumulated behavioral data.
The One-Line Version
We didn't bolt AI onto a generic platform. We built a learning system from first principles — where every score generates insight, every recommendation is explainable, and every practice session makes the next one smarter.
Because the question was never "can AI assess your communication skills?"
It was always: "What should you practice next?"
ExecReps' personalization engine is grounded in research from Carnegie Mellon (Bayesian Knowledge Tracing), Duolingo's Birdbrain system (spaced repetition), Vygotsky's Zone of Proximal Development, the O*NET occupational taxonomy, the World Economic Forum Future of Jobs Report, Nir Eyal's Hook Model, BJ Fogg's Behavior Model (Stanford), Daniel Kahneman's Peak-End Rule, and Robert Cialdini's Principles of Persuasion.