KAR³L is a flashcard system prototype developed by the CLIP lab at the University of Maryland. KAR³L improves upon traditional flashcard systems by adapting to both individual flashcards and users. When KAR³L recommends a flashcard to learn or review, it takes into account the content of the flashcard as well as the user’s study history, incorporating psychological concepts such as active recall, category learning, and the spacing effect.

In this post, we summarize the results from the first phase of user experiments. Through an analysis of our users’ study history, we provide some insights into how KAR³L works, and discuss its advantages over traditional methods and where it can be improved. We’ll also give a sneak peek of what’s coming next.

## Where are we right now?

Since our public launch on August 24th, 424 users have signed up and produced over 75,000 study records in total. Each study record consists of the user ID, flashcard ID, date of study, and whether the user recalled the flashcard successfully. The growth in number of users and records is shown below.

Our goal is both to improve upon existing flashcard systems and to answer a scientific question: can machine learning be used to improve learning efficiency via psychological phenomena such as the spacing effect? For the latter purpose, we implemented two traditional spaced repetition models as our baselines: Leitner and SM-2. To form control groups, we randomly assign a portion of the users to these traditional methods instead of our model.

## Overview of KAR³L vs. Leitner vs. SM-2

To compare the three models, let’s first get a basic understanding of what kind of flashcards are shown to the users. We categorize each study record by whether the flashcard shown is new, and by the result of the evaluation (successful or failed). This categorization gives us a basic framework to understand how each model handles the trade-off between showing new flashcards and reviewing old ones, and how the users respond to each model’s strategy. The figures below visualize this breakdown of study records into the four categories, and how it changes with time; clicking on the legend highlights each category.
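To make the categorization concrete, here is a minimal Python sketch of how study records could be sorted into the four categories. The `StudyRecord` class, its field names, and the `categorize` function are hypothetical names chosen for illustration; they mirror the record fields described earlier but are not KAR³L's actual code, and the sample dates in any usage are made up.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StudyRecord:
    # Hypothetical record layout: user ID, flashcard ID, date, and outcome.
    user_id: str
    card_id: str
    studied_on: date
    recalled: bool


def categorize(records):
    """Count records per category: New/Old crossed with Successful/Failed."""
    seen = set()  # (user_id, card_id) pairs that were already studied
    counts = Counter()
    # sorted() is stable, so same-day records keep their input order.
    for r in sorted(records, key=lambda rec: rec.studied_on):
        key = (r.user_id, r.card_id)
        novelty = "Old" if key in seen else "New"
        outcome = "Successful" if r.recalled else "Failed"
        counts[f"{novelty}, {outcome}"] += 1
        seen.add(key)
    return counts
```

A card counts as "New" only the first time a given user sees it; every later record for that user and card is "Old".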

We start to see some differences between KAR³L and the other models. Judging from these figures, KAR³L seems to prioritize showing new flashcards over reviewing compared to Leitner and SM-2. This might explain why we see a smaller increase in successful reviews (“Old, Successful” category in the figure) from KAR³L. However, we would like more evidence to support this claim. And more importantly, is KAR³L better (or worse) at helping users make progress due to the difference in priorities? To answer this question, we need to first come up with some metrics to gauge both the progress and the effort from each user.

## Progress vs. Effort

Progress is made when a user correctly recalls a flashcard. However, not all correct recalls are equally indicative of the user’s progress: the first successful recall of a flashcard is more suggestive of progress compared to, say, the recall of a card that the user is already familiar with. We differentiate cards by their levels, i.e., how familiar each flashcard is to the user based on past evaluations. In our definition, if a user recalls a flashcard correctly $X$ times in a row, then the card is at Level $X$. We label unseen flashcards as “Initial” to differentiate them from Level 0 flashcards, which are old cards whose latest evaluation was unsuccessful. By seeing how the number of successful and failed recalls grows on each level, we can see how the user progresses as days go by. In the figure below, we further contrast the user’s progress against the effort on each day, visualized by the bars; click on the legend to highlight each level.
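The level definition can be sketched in a few lines of Python. This is an illustrative sketch, not our production code: the function name is hypothetical, and we assume here that each evaluation is attributed to the level the card had just before that evaluation.

```python
def levels_before_each_review(outcomes):
    """Given one user's chronological outcomes for a single card
    (True = recalled), return the card's level label just before
    each evaluation."""
    labels = []
    streak = None  # None means the card has never been shown to the user
    for recalled in outcomes:
        labels.append("Initial" if streak is None else f"Level {streak}")
        if recalled:
            # One more correct recall in a row.
            streak = (streak or 0) + 1
        else:
            # A failed recall resets the card to Level 0.
            streak = 0
    return labels
```

For the outcome sequence `[True, True, False, True]`, the card is evaluated at Initial, Level 1, Level 2, and Level 0 respectively.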

Perhaps a more informative view of this data is to look at the ratio of successful/total evaluations, i.e. the recall rate on each level. This is shown for the same user in the figure below.
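As a rough sketch of this ratio, assuming each evaluation has already been labeled with the card's level (the function and argument names are hypothetical):

```python
from collections import defaultdict


def recall_rate_by_level(evaluations):
    """evaluations: iterable of (level_label, recalled) pairs.
    Returns successful / total evaluations for each level."""
    successes = defaultdict(int)
    totals = defaultdict(int)
    for level, recalled in evaluations:
        totals[level] += 1
        successes[level] += int(recalled)
    return {level: successes[level] / totals[level] for level in totals}
```

Levels with no evaluations simply do not appear in the result, which avoids dividing by zero.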

Now, what would this figure above look like if we had the perfect model? Well, we might want the Initial recall rate to be lower, since currently more than 50% of the flashcards shown to this user are already known prior to study. For studied flashcards being reviewed, it’s not clear what the optimal recall rate should be to optimize study efficiency: some argue for 85%,[1] some argue for 50%.[2] Assuming the recall rate should be somewhere between 50% and 85%, we will likely want the Level 0 line to be higher, and Level 1-3 lines to be lower. Currently the recall rate is a bit too high for higher level cards, which means we are probably reviewing those cards too soon. The effort spent on excessive reviews could have been better used for less familiar flashcards.

To compare KAR³L against Leitner and SM-2 in a similar visualization, we aggregate the users assigned to each scheduler, and make the x-axis represent the number of minutes the user spent on the app. In the two figures below, the band visualizes the standard deviation of the corresponding line; click on the legend to highlight each level.
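One way such an aggregation could be computed is sketched below. This is an assumption-laden illustration rather than our actual analysis code: the function name and input layout are hypothetical, each user's curve is assumed to be sampled once per minute, and we use the population standard deviation (`statistics.pstdev`) for the band.

```python
from statistics import mean, pstdev


def aggregate_curves(curves):
    """curves: {user_id: [recall rate at minute 0, minute 1, ...]}.
    Returns a list of (mean, std) per minute, computed over the users
    who have spent at least that many minutes on the app."""
    longest = max((len(c) for c in curves.values()), default=0)
    summary = []
    for minute in range(longest):
        values = [c[minute] for c in curves.values() if len(c) > minute]
        summary.append((mean(values), pstdev(values)))
    return summary
```

Because fewer users reach the later minutes, the right end of each line is averaged over fewer people, which is one reason the band can widen over time.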

The second figure shows some interesting differences between KAR³L and traditional methods. First, zooming in on Initial flashcards, we see that the recall rate in KAR³L is higher than in the other two models, and shows lower variance both among users and over time. This is partly because KAR³L explicitly controls for the difficulty and topic of new flashcards, as opposed to randomly selecting them as done in the other two models. The recall rate of Initial cards in KAR³L might be a bit too high, but it’s something we can control; we’ll dig into this in a second.

Zooming in on the Level 0 cards, again we see lower variance in recall rate from KAR³L, but the mean is similar to Leitner and SM-2. However, if we look at Level 1 cards, the recall rate from KAR³L users is noticeably higher than in the other two models, although there is a slight dip towards the end.

Based on the analysis above, the recall rate in KAR³L needs some fine-grained adjustments. Luckily, KAR³L is designed with this kind of flexibility in mind. One of the tunable parameters of KAR³L is the recall target, which specifies the desired recall rate for a card to be reviewed. For example, if the recall target is set to 80%, the model prioritizes flashcards whose probability of recall by the user, according to the model’s prediction, is closest to 80%. The recall target is thus one of the most important factors that together control which flashcards are shown to the user.[3] The default recall target was set to 100% (a bad idea in hindsight), which partially explains why the recall rate is so high. We have also received user feedback that KAR³L is reviewing too much, which corroborates some of our findings here.
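In isolation, the recall target rule can be sketched as follows. This is a simplified illustration only: the function name and the dictionary of predictions are hypothetical, and the real scheduler combines the recall target with other factors that this sketch ignores.

```python
def pick_next_card(predicted_recall, recall_target=0.85):
    """predicted_recall: {card_id: predicted probability of recall}.
    Returns the card whose prediction is closest to the recall target."""
    if not predicted_recall:
        raise ValueError("no candidate cards")
    return min(
        predicted_recall,
        key=lambda card_id: abs(predicted_recall[card_id] - recall_target),
    )
```

With a target of 100%, this rule always favors the cards the model is most sure the user already knows, which matches the over-reviewing described above.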

In an attempt to adjust the recall rate, we created a new experimental condition with the recall target set to a lower 85%, and assigned some new users to this condition. The figure below compares the recall rate of cards on each level between the two versions of KAR³L. This change is quite recent, so there are fewer users in this group (thus higher variance), and the users haven’t spent as much time on the app yet (thus shorter lines).

This comparison offers some insight into how the recall target parameter affects the model behavior. As we lower the recall target from 100% to 85%, we see that the Initial and Level 0 recall rates become lower, as expected. However, there is a noticeable mismatch between the recall target and the actual recall rate: for Initial and Level 0 flashcards, the actual recall rate is lower than the target. Weirdly, the recall rate at Level 1 and Level 2 did not drop as significantly, but we can’t draw much of a conclusion due to limited data. We hypothesize that the inconsistency of model behavior with respect to the recall target is caused by two issues: how our model adapts to each user, and the overconfidence of the neural network.[4] Although this issue might require some smart technical solution, we see this as a positive signal: KAR³L has the potential to be much more flexible than traditional methods, and now we have data to fine-tune its parameters. The difference in learning curves also highlights the room for further optimizing learning efficiency via machine learning. This brings us to our next steps.

## Next Steps

We identify three main tasks: an in-depth analysis of KAR³L’s behavior, learning from the cognitive science and educational theory literature to improve & expand evaluation, and improving the feedback loop between the users and the models. We’ll briefly explain what we want to achieve in each of the tasks, and highlight what’s coming next.

### In-depth analysis of KAR³L’s behavior

We want to understand the inconsistency of KAR³L’s behavior with respect to different recall targets, and come up with a remedy. Another important task is to find a good recall target that balances efficiency & fun. We plan to look into educational theory for inspiration. We’ll test the model in simulation using data we have collected so far, and come up with new experimental groups for a better user experience.

### Better evaluation

Our current evaluation, especially the definition of level, is closely related to Anki’s notion of “learned” & “mature”, as well as the boxes in the Leitner system. Our progress graph is good for drawing insights, but it is not rigorous enough for a comparison between models because it’s not standardized. We would want to test the users with either the same set of flashcards, or flashcards of the same objective difficulty.

### Feedback loop between user and model

We want to provide more feedback to the users, sharing insights on their progress, similar to what we did in this report. In the update that is released with this report, the progress graphs are added to the Statistics page for each user.

We also want to allow the users to provide feedback more directly. In the next phase of experiments, we plan to give the users more control over KAR³L’s settings, such as the recall target and topical preference.

## Get involved

We are happy to discuss this project. Join our Discord, or reach out to our team directly, on Twitter: @ihsgnef, @matthewmshu, or email feet-thinking@googlegroups.com.

2. In Duolingo’s Half-life Regression by Settles & Meeder, the model tries to predict when the “student is on the verge of being unable to remember”, that is, when the recall rate is around 50%.

3. We don’t go into the details of how our model works in this report. But we hope to release a document specifically for that purpose soon. Stay tuned!

4. Neural models are known to be over-confident in their predictions without calibration. See On Calibration of Modern Neural Networks for a reference.