The ELO rating system has been measuring skill in chess for over sixty years. The same mathematical framework turns out to be remarkably effective at measuring language proficiency — and it solves problems that XP, streaks, and lesson completion never could.
In the late 1950s, Arpad Elo — a Hungarian-American physics professor and chess master — was asked to improve the rating system used by the United States Chess Federation. The existing system was crude: it gave fixed points for wins and losses regardless of opponent strength. Beat a beginner and beat a grandmaster? Same reward.
Elo's insight was to make rating changes proportional to how surprising the outcome was. If a 1200-rated player beats an 1800-rated player, that's a big upset — the winner gains a lot of points, and the loser drops a lot. If the 1800-rated player wins, that was expected — both ratings barely move. The system was adopted by the USCF in 1960 and by FIDE (the international chess federation) in 1970.
The genius of the system is that it converges on your true ability over time. No matter where you start, after enough games against opponents of varying strength, your rating settles at a number that accurately reflects how strong you are. Today, ELO-style systems are used far beyond chess: competitive gaming, sports leagues, standardized testing, and — increasingly — language learning.
The core idea is elegant: compare your expected performance to your actual performance, and adjust your rating accordingly.
Based on your current rating and the difficulty of the challenge, the system calculates a probability of success. A 1400-rated learner facing a 1200-difficulty task is expected to do well. The same learner facing a 1600-difficulty task is expected to struggle.
After you complete the task, the system compares what actually happened to what was expected. Did you succeed at something that should have been hard for you? That's informative — you might be better than your rating suggests. Did you struggle with something easy? Also informative.
Your rating shifts in proportion to the surprise. Big surprise = big change. No surprise = small change. Over many interactions, the adjustments get smaller as the system becomes more confident in your rating. This is controlled by the K-factor — a parameter that determines how volatile the ratings are. New learners have a high K-factor (ratings move quickly to find the right level), while established learners have a lower K-factor (ratings are more stable).
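The expect-compare-adjust loop described above can be sketched with the standard Elo formulas: a logistic curve for the expected score, and a K-weighted update proportional to the surprise. The K value of 32 below is just a common chess default for illustration, not a claim about any particular app's parameters.

```python
def expected_score(rating, difficulty):
    """Predicted probability of success on the Elo logistic curve.

    A 400-point gap corresponds to roughly 10:1 odds, as in chess Elo.
    """
    return 1 / (1 + 10 ** ((difficulty - rating) / 400))

def update_rating(rating, difficulty, outcome, k=32):
    """Shift the rating in proportion to the surprise.

    outcome is 1 for success, 0 for failure; k controls volatility.
    """
    return rating + k * (outcome - expected_score(rating, difficulty))
```

A 1400-rated learner facing a 1200-difficulty task has about a 76% expected success rate, so succeeding nudges the rating up only ~8 points, while failing drops it ~24 — exactly the asymmetry that makes surprising outcomes informative.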
Most language apps measure engagement — how often you show up and how many exercises you complete. ELO measures something fundamentally different: demonstrated ability relative to task difficulty.
The fundamental difference: XP tells you how much you've practiced. ELO tells you how good you've gotten.
The idea of applying ELO to language learning isn't speculative — it's been validated in peer-reviewed research.
Hou et al. (2019)
In their paper "Modeling language learning using specialized Elo ratings," Hou and colleagues applied a modified ELO system to track the proficiency of language learners across many interaction types. They found a 0.90 correlation between ELO-predicted proficiency levels and teacher-assigned CEFR levels.
A 0.90 correlation is remarkably high in educational measurement. For context, the correlation between two human raters assessing the same student typically falls between 0.70 and 0.85. An automated system matching or exceeding human inter-rater reliability is a strong signal that ELO captures something real about language ability.
The key insight from the research is that ELO works for language because language tasks, like chess matches, have variable difficulty, and learner performance varies predictably based on the gap between their ability and the task difficulty. The mathematical framework maps cleanly from one domain to the other.
The ELO framework provides something rare in language education: a measurement that is both continuous (not just six discrete CEFR buckets) and difficulty-adjusted (not just counting correct answers).
Dialog Engine applies the ELO framework to conversational language practice, using it both to measure your proficiency and to select appropriate challenges.
In chess, your opponent has a rating. In Dialog Engine, each conversation scenario has a difficulty rating. An A1-level scene about ordering coffee might be rated at 800. A B2-level scene about negotiating an apartment lease might be rated at 1400. Your performance on each scene — evaluated across comprehensibility, grammatical accuracy, and naturalness — determines whether you "won" or "lost" the matchup, and by how much.
New learners start with a higher K-factor, which means their rating moves quickly. This lets the system find your level fast — within a handful of conversations, your rating approximates your actual ability. As you complete more conversations, the K-factor decreases, making your rating increasingly stable and resistant to random fluctuation. Your rating still moves, but it takes a consistent pattern of performance to shift it significantly.
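One way to sketch this: blend the three evaluation dimensions into a single graded "game result" (rather than a binary win/loss), and decay the K-factor as conversations accumulate. The equal weights, the K range of 40 down to 16, and the 20-conversation half-life are illustrative assumptions, not Dialog Engine's published parameters.

```python
def performance_score(comprehensibility, accuracy, naturalness):
    # Blend the three sub-scores (each in 0..1) into one Elo "game result".
    # Equal weighting is an assumption for illustration.
    return (comprehensibility + accuracy + naturalness) / 3

def k_factor(conversations_completed, k_new=40, k_stable=16, half_life=20):
    # Decay K from a volatile starting value toward a stable floor, so
    # early ratings move fast and established ratings resist noise.
    return k_stable + (k_new - k_stable) * 0.5 ** (conversations_completed / half_life)

def update(rating, scene_difficulty, score, n_completed):
    # Same Elo update as chess, but with a graded score instead of win/loss.
    expected = 1 / (1 + 10 ** ((scene_difficulty - rating) / 400))
    return rating + k_factor(n_completed) * (score - expected)
```

Because the score is continuous, a shaky-but-successful conversation moves the rating less than a flawless one, which is what lets a handful of early sessions place a learner accurately.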
Your ELO rating drives which scenarios the system offers you. The goal is to keep you in the zone of proximal development — challenged enough that you're learning, but not so far beyond your level that you're lost. As your rating rises, you automatically face more complex scenarios with more advanced vocabulary and grammar expectations.
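Scenario selection of this kind can be sketched by targeting a predicted success rate in a "productive struggle" band. The 60–80% target band, the scenario names, and the difficulty values below are all hypothetical, chosen only to illustrate the mechanism.

```python
def pick_scenario(rating, scenarios, target=(0.6, 0.8)):
    """Pick the scene whose predicted success rate best fits a target band.

    scenarios is a list of (name, difficulty) pairs; target is the
    assumed "challenged but not lost" success-probability band.
    """
    def expected(difficulty):
        return 1 / (1 + 10 ** ((difficulty - rating) / 400))

    lo, hi = target
    mid = (lo + hi) / 2
    in_band = [s for s in scenarios if lo <= expected(s[1]) <= hi]
    # Fall back to the closest available scene if nothing lands in the band.
    pool = in_band or scenarios
    return min(pool, key=lambda s: abs(expected(s[1]) - mid))
```

For a 1250-rated learner, an 800-difficulty coffee scene predicts ~93% success (too easy) and a 1400-difficulty lease negotiation ~30% (too hard), so a mid-range scene wins the selection.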
Your ELO rating maps directly to CEFR levels, giving you a universally understood measure of where you stand. The mapping uses half-levels (A1+, A2+, etc.) for finer-grained tracking within each band.
CEFR level    ELO rating
A1            800
A1+           900
A2            1000
A2+           1100
B1            1200
B1+           1300
B2            1400
B2+           1500
C1            1600
C1+           1700
C2            1800+
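Since each half-level starts at a fixed threshold, the mapping reduces to a lookup over sorted band boundaries. A minimal sketch (ratings below 800 clamp to A1):

```python
from bisect import bisect_right

# Band thresholds from the table above; each band begins at its threshold.
BANDS = [(800, "A1"), (900, "A1+"), (1000, "A2"), (1100, "A2+"),
         (1200, "B1"), (1300, "B1+"), (1400, "B2"), (1500, "B2+"),
         (1600, "C1"), (1700, "C1+"), (1800, "C2")]

def cefr_level(rating):
    """Map an ELO rating to its CEFR half-level."""
    thresholds = [t for t, _ in BANDS]
    i = bisect_right(thresholds, rating) - 1
    return BANDS[max(i, 0)][1]
```

So a rating of 1250 reads as B1: past the 1200 threshold, but not yet at B1+.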
New learners start at 800 (A1) and their rating adjusts rapidly from there. A learner who already speaks at an intermediate level might see their rating jump to the 1200–1400 range within their first few conversations as the high initial K-factor quickly corrects the starting position.
A placement test gives you a snapshot — a one-time assessment that might be affected by test anxiety, fatigue, or lucky guesses. An ELO rating is a running measure that updates after every conversation. It's self-correcting: if one bad session drops your rating below your true level, subsequent normal performance will pull it back up. Over time, it becomes an increasingly precise reflection of your demonstrated conversational ability.
There's a well-established principle in performance science: you improve what you measure, as long as the measurement is valid. XP and streaks measure the wrong thing — time and consistency, which are necessary but not sufficient for improvement. You can practice daily for years and plateau if your practice isn't challenging enough.
ELO provides the feedback loop that makes deliberate practice possible. When your rating is stable, you know you need to push harder. When it's climbing, you know your practice is working. When it dips, you know something needs attention. That information — am I actually getting better? — is the most important question a learner can answer, and it's the one that most language apps leave unanswered.
The ELO system isn't just a number. It's a mirror that shows you, honestly and continuously, where you stand. And for a self-directed learner without a teacher to provide that honest assessment, it might be the most valuable tool available.