KA Engineering


Memcached-Backed Content Infrastructure

by Ben Kraft on May 15

Last post, I wrote about how we did profiling on App Engine's Memcached service to plan our new content-serving infrastructure that will allow us to scale our new content tools to many more languages. But that was all talk, and at Khan Academy we're all about having a bias towards action. So let's talk about how we tested the new backend, rolled it out to users, and the issues we ran into along the way!

Recap: Content serving, the old way

Tom and William have written about our content infrastructure in the past, but as a quick recap, we have a custom content versioning system that allows content creators to edit the content and then later publish it to the site. The end goal of that publish process is to build a new version of our content data, which includes a specific version of each item of content (such as a video, exercise, or topic) at Khan Academy. In most cases, these are stored as a pickled Python object with the metadata for a particular piece of content – the actual video data is served by YouTube. The goal of our content serving system is to get all that data out to any frontend servers that need it to display data about that content to users, and then atomically update all of that data when a new version is published. The frontend servers can then use this content data to find the URL of a video, the prerequisites of an exercise, or all the articles within a topic.

In principle, this is a pretty simple problem – all we need is a key-value store, mapping (global content version, content ID) pairs to pickled content item objects. Then we tell all the frontend servers which content version to look for (an interesting problem in and of itself, but outside the scope of this post), and they can fetch whatever data they need. But that doesn't make for very good performance, even if our key-value store is pretty fast. Even after we cache certain global data, we often need to look at hundreds of items to render a single request. For instance, to display a subject page we might have to load various subtopics, individual videos and articles in the subject, exercises reinforcing those concepts, as well as their prerequisites in case students are struggling.
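To make the model concrete, here's a minimal sketch of that key-value interface. The class and method names are illustrative, not our actual API:

```python
# Hypothetical sketch: content lookup as a key-value store mapping
# (content version, content ID) pairs to pickled content items.
import pickle


class ContentStore:
    def __init__(self):
        # (version, content_id) -> pickled item bytes
        self._data = {}

    def put(self, version, content_id, item):
        self._data[(version, content_id)] = pickle.dumps(item)

    def get(self, version, content_id):
        blob = self._data.get((version, content_id))
        return pickle.loads(blob) if blob is not None else None


store = ContentStore()
store.put("v42", "video:intro-to-algebra", {"title": "Intro to Algebra"})
item = store.get("v42", "video:intro-to-algebra")
```

A real deployment would back this with memcache rather than a dict, but the lookup contract is the same.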

In our old system, we solved this problem by simply bundling up all the content data for the entire site, compressing it, loading the entire blob on each server at startup, and then loading a diff from the old version any time the content version changed. This gave us very fast access to any item at any time. But it took a lot of memory – and even more so as we started to version content separately for separate languages, and thus might have to keep multiple such blobs around, limiting how many languages we could version in this way. (Between the constraints of App Engine, and the total size of data we wanted to eventually scale to, simply adding more memory wasn't a feasible solution.) So it was time for a new system.
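In rough outline, the old approach looked something like this sketch (our actual code differed; the function names and diff format here are made up for illustration):

```python
# Illustrative sketch of the old system: every server holds the whole
# content mapping in memory, and applies a diff when the version changes.
import pickle
import zlib


def load_bundle(compressed_blob):
    """Decompress and unpickle the full content mapping at startup."""
    return pickle.loads(zlib.decompress(compressed_blob))


def apply_diff(content, diff):
    """Apply a version-to-version diff: content ID -> new item, or None to delete."""
    for content_id, item in diff.items():
        if item is None:
            content.pop(content_id, None)
        else:
            content[content_id] = item
    return content


bundle = zlib.compress(pickle.dumps({"video:1": {"title": "Old title"}}))
content = load_bundle(bundle)
content = apply_diff(content, {"video:1": {"title": "New title"},
                               "video:2": {"title": "Added video"}})
```

Every lookup is then just a dict access, which is why this was so fast – and why it cost a full copy of the data in every server's memory.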

Content serving, the new way

After discussing a few options, our new plan was fairly simple: store each content item separately in memcache, and fetch it on demand, caching some set of frequently used items in memory. We simulated several caching strategies and determined that choosing a fixed set of items to cache would be a good approach, with some optimizations.

We went ahead and implemented the core of the system, including the in-memory cache and the logging to choose which items to cache. To avoid spurious reloads, we stored each item under a hash of its data; after each content update, each server would load the mapping of item IDs to hashes to find the correct version of each item. We also implemented a switch to allow us to turn the new backend on and off easily, for later testing.
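The hash-based scheme can be sketched as follows. This is a simplification with invented names, and a plain dict stands in for memcache; the point is that an item whose data didn't change hashes to the same key across versions, so servers never re-fetch it:

```python
# Hedged sketch of content-addressed storage: items are keyed by a hash
# of their pickled data, and a per-version manifest maps content IDs to hashes.
import hashlib
import pickle

cache = {}  # stands in for memcache: hash -> pickled item


def store_item(item):
    blob = pickle.dumps(item)
    digest = hashlib.sha1(blob).hexdigest()
    cache[digest] = blob  # re-storing an identical item is a no-op
    return digest


def build_manifest(items_by_id):
    """Publish step: return {content_id: hash} for one content version."""
    return {cid: store_item(item) for cid, item in items_by_id.items()}


def get_item(manifest, content_id):
    return pickle.loads(cache[manifest[content_id]])


v1 = build_manifest({"video:1": {"title": "Intro"}})
v2 = build_manifest({"video:1": {"title": "Intro"}})  # unchanged item, same hash
```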

Testing & Iterating

But things were still pretty darn slow. It turned out we had a lot of code that made the assumption that content access was instant. Everywhere we did something of the form

for key in content_keys:
    item = get_by_key(key)  # one memcache round trip per iteration
    # do something with item

was a potential performance issue – one memcache fetch is pretty fast, but doing dozens or hundreds sequentially adds up. Rather than trying to find all such pieces of code, we set up a benchmark: we'd deploy both the old and the new code to separate clusters of test servers, and run some scripts to hit a bunch of common URLs on each version. By comparing the numbers, we were able to see which routes were slower, and even run a profiler on both versions and compare the output to find the slow code.

In many cases, we could just replace the above snippet with something like the following:

content_items = get_many_by_key(content_keys)  # a single batched memcache fetch
for item in content_items:
    # do something with item

This way, we'd have to wait for a memcache fetch where before we had none, but at least it would be a single batched one rather than dozens in sequence. In some cases, refactoring the code would have been difficult, but we could at least pre-fetch the items we'd need and keep them in memory until the end of the request, similarly avoiding a sequential fetch.

This was a pretty effective method – on our first round of testing, half of the hundred URLs we hit were at least 50% slower at the 50th percentile, and khanacademy.org/math was over 7 times slower! But after several rounds of fixes and new tests, we were able to get the new version faster on average and on most specific URLs. Unfortunately, this didn't cover everything – some real user traffic is harder to simulate, and because caching is less effective on test versions with little traffic, there was a lot of noise in our small sample. Small changes in performance were therefore hard to spot, even when they affected very popular routes where a small difference would add up.

So we moved on to live testing! Using the switch we had built, we set things up to allow us to more or less instantly change what percentage of our servers were using the new version. Then we rolled out to around 1% of requests. The first test immediately showed a blind spot in our testing – we hadn't done sufficient testing of internationalized pages, and there was a showstopper bug in that code, so we had to turn off the test right away. But with that bug fixed, we ran the next test for several hours on a Friday afternoon. We still saw a small increase in latency overall; the biggest contributor was a few routes that make up a large fraction of our traffic that got just a little bit slower. After more improvements, we moved on to longer tests on around 5%, then 25%. On the last test, we had enough traffic to extrapolate how the new backend would affect load on memcache, and scale it up accordingly before shipping to all users.
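A percentage-based switch of that sort can be sketched as below. This is a generic pattern with hypothetical names, not our actual switch: hashing a stable identifier keeps each server consistently on one backend, and changing the percentage takes effect on the next check.

```python
# Illustrative percentage rollout switch: each server hashes into a stable
# bucket 0-99, and the rollout percentage decides which buckets get the
# new backend.
import hashlib


def use_new_backend(server_id, rollout_percent):
    bucket = int(hashlib.md5(server_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent
```

Because the bucket is derived from the ID rather than chosen at random per request, a given server doesn't flap between backends as traffic flows.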

Rollout & Future work

Finally, we were able to roll out to all users! Of course, there were still a few small issues after launch, but for the most part things worked pretty well, and our memory usage went down noticeably, as we had hoped. We haven't rolled out to more languages yet – but when the team working on the tooling is ready to do that, we'll be able to.

We do still have some outstanding problems with the new system, which we're working to fix. First, we've opened ourselves up to more scaling issues with memcache – if we overload the memcache server, causing it to fail to respond to requests, this sometimes causes the frontend servers to query it even more. Now that we depend on memcache more, these issues are potentially more common and more serious. And the manifest that lists the hashes of each item in the current content version is still a big pile of data we have to keep around; it's not as big as the old content bundle, but we'd love to make it even smaller, allowing us to scale even more smoothly. If you're interested in helping us solve those problems and many more, well, we're hiring!
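One standard mitigation for that kind of retry amplification (our illustration here, not necessarily the fix we shipped) is to cap retries and back off on failure, so a struggling memcache server sees less traffic rather than more:

```python
# Hedged sketch: capped retries with exponential backoff, so clients ease off
# a failing cache server instead of hammering it harder.
import time


def fetch_with_backoff(fetch, key, max_attempts=3, base_delay=0.05):
    for attempt in range(max_attempts):
        try:
            return fetch(key)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller fall back
            time.sleep(base_delay * (2 ** attempt))  # 0.05s, 0.1s, ...
```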