KA Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.

Profiling App Engine Memcached

by Ben Kraft on May 1

Last year, William wrote about how we optimized our in-memory content data to take up less space. But that was always a temporary solution: we knew that if we wanted a separate content tree for each language, we would need to break up the monolithic content blob and pull in content on demand.

In principle, doing this is simple -- we already use App Engine's Memcached, so all we need to do is change our data structures to pull in content from there whenever someone asks for it. But that could be slower by some unknown amount -- many requests access dozens or even hundreds of content items, which is plenty fast today, when everything is already in memory. We had several ideas for how to improve access patterns and speed things up, but we didn't want to spend a couple of months building one only to find it wasn't fast enough. We had some vague guesses as to how often we access content, which requests would be the worst, and how fast Memcached is, but we didn't really know. So last fall we set out to find out.

Measuring Memcached latency

First we just wanted to answer a simple question: how fast is a Memcached access on App Engine? Despite our heavy use of Memcached, we didn't really have a good estimate! So we wrote a simple API endpoint to profile a Memcached get:

import base64
import os
import time

import flask
from flask import request
from google.appengine.api import memcache

# `app` and `developer_required` come from our app setup elsewhere.
@app.route('...')
@developer_required
def profile_memcached():
    # Write a random value of the requested size under a fresh key.
    num_bytes = int(request.args.get('num_bytes'))
    data = os.urandom(num_bytes)
    key = 'profile_memcache_%s' % base64.b64encode(os.urandom(16))
    success = memcache.set(key, data)
    if not success:
        raise RuntimeError('Memcached set failed!')

    # Time just the get itself.
    get_start = time.time()
    data_again = memcache.get(key)
    get_end = time.time()

    memcache.delete(key)
    return flask.jsonify(
        time=get_end - get_start,
        correct=data == data_again,
        key=key)

A thread pool and ten thousand API calls later, we had some data!
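(The client side of that was nothing fancy. Here's a minimal sketch of what such a measurement harness might look like -- not our actual tooling; the endpoint URL, authentication, and thread/sample counts are assumptions.)

from multiprocessing.pool import ThreadPool

import requests  # assumes the `requests` library for HTTP calls

PROFILE_URL = '...'  # URL of the profiling endpoint above (route elided)

def time_one_get(num_bytes):
    """Hit the profiling endpoint once; return the reported get latency (s)."""
    resp = requests.get(PROFILE_URL, params={'num_bytes': num_bytes})
    return resp.json()['time']

def collect_samples(num_bytes, num_samples=10000, num_threads=20):
    """Collect latency samples in parallel using a simple thread pool."""
    pool = ThreadPool(num_threads)
    try:
        return pool.map(lambda _: time_one_get(num_bytes), range(num_samples))
    finally:
        pool.close()
        pool.join()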

[Figure: Memcached latency data]
Memcached latency distribution over 10,000 requests. On the x-axis, the inverted percentile (so "1" is the 99th percentile), on a log scale. On the y-axis, latency in seconds, on a log scale. Each line is a different size of value in Memcached, from 1 byte to 1 million bytes. For a 1-million-byte value (the uppermost line), the median latency is about 11ms, the 99th percentile is 21ms, and the 99.9th percentile is 43ms.

We got a few things out of this data right away. First, latency is in the 1-4ms range for most requests, and nearly always below 10ms. Second, for all but very large values, the size of the value doesn't make much difference to latency; even for very large values, latency grows far more slowly than the value size does (at the median, fetching 1MB takes only about 5x as long as fetching a single byte).

But adding as many as a hundred serial 1-4ms fetches to each request wouldn't be an acceptable latency cost. We ran another series of tests and found that the response time for a single multi-get of many keys was similar to that of a single get. So in principle we could rewrite all our code to fetch entities in bulk, but doing that across our entire codebase would be difficult and time-consuming, and we didn't know how effective it would be.
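To give a sense of what those multi-get tests looked like, here's a sketch of a multi-get variant of the endpoint above. It's an assumption about the shape of the code, not the exact endpoint we ran: it times one get_multi over a batch of fixed-size values.

import base64
import os
import time

from google.appengine.api import memcache

def profile_memcached_multi(num_keys, num_bytes=1000):
    """Time a single get_multi over `num_keys` values of `num_bytes` each."""
    values = dict(
        ('profile_multi_%s' % base64.b64encode(os.urandom(16)),
         os.urandom(num_bytes))
        for _ in xrange(num_keys))
    failed_keys = memcache.set_multi(values)
    if failed_keys:
        raise RuntimeError('Memcached set_multi failed: %s' % failed_keys)

    # Time just the batched get.
    get_start = time.time()
    results = memcache.get_multi(values.keys())
    get_end = time.time()

    memcache.delete_multi(values.keys())
    return {'time': get_end - get_start,
            'correct': results == values}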

Simulating Memcached latency

Luckily, we had another pile of data at our disposal. For a small percentage of our requests, we had set up logging to record which content items were accessed in the request. We initially used this to get simple aggregates: how many items were accessed in a particular request, for example. But along with the Memcached profiling data, we could do more: we could simulate how much slower a request would be if it had to make a particular number of Memcached hits.

We started by assuming that every content item accessed required a Memcached fetch (the first time it was accessed within a request). This was, as expected, too slow: over 10% of requests got at least 50% slower. We could hand-optimize a few problematic routes, but not all of them, so we needed a more general strategy. One natural option was to cache some data in memory on each instance.

Since we had this data, we could simulate particular strategies for caching some content items in memory. We simulated both an LRU cache and choosing in advance a fixed set of items to cache, with several sizes of each. The latter strategy did significantly better than the LRU cache. This remained true even if we used the chronologically first half of the data to decide which items to cache, and measured against the second half, which we took as validation that we could actually choose a fairly accurate set in practice.
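Conceptually, the simulation just replays each request's logged item accesses against a candidate cache and charges a sampled Memcached latency for every miss. The sketch below is illustrative rather than our actual analysis code, and picking the fixed set by access frequency is one plausible choice, not necessarily the one we used:

import collections
import random

def simulate_request(item_ids, in_cache, latency_samples):
    """Simulated extra latency for one request's logged content accesses."""
    already_fetched = set()  # items fetched earlier in the same request
    extra_latency = 0.0
    for item_id in item_ids:
        if item_id in in_cache or item_id in already_fetched:
            continue
        # Charge a latency drawn from the measured get-latency samples.
        extra_latency += random.choice(latency_samples)
        already_fetched.add(item_id)
    return extra_latency

def choose_fixed_cache(request_logs, cache_size):
    """Pick the most frequently accessed items as a fixed in-memory cache."""
    counts = collections.Counter(
        item_id for item_ids in request_logs for item_id in item_ids)
    return set(item_id for item_id, _ in counts.most_common(cache_size))

An LRU strategy drops into the same loop by replacing the fixed `in_cache` set with an LRU structure that gets updated on every access.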

[Figure: simulated fixed-cache data]
Simulated increase in request latency, over all instrumented requests. On the x-axis, the inverted percentile, on a log scale, as in the graph above. On the y-axis, the fractional increase in latency, on a log scale. Each line is a different caching strategy, with or without the multi-get of children. For example, at the 90th percentile, without a cache we see a latency increase of around 50%; a fixed cache reduces that to 10%, vs. 12% for an LRU cache.

However, our results still weren't quite as good as we wanted: even excluding a few pages we figured could be manually optimized, we were still adding 25% or more to 5% of requests. But we still hadn't made use of the multi-gets we had profiled. First we simulated pulling all of a topic's children into a request-local cache with a single multi-get whenever we requested the topic, which we knew would be fairly easy to implement. That wasn't enough on its own, but after several variations we found a strategy that was: whenever we loaded a topic page, pulling in all descendants of that topic (again with a single multi-get), down to some depth, helped a lot. While it would require code changes, we thought they should be simple enough to be manageable, with nowhere near as many unknowns as "now fix up all problematic code paths".
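For concreteness, here's roughly what that descendant-prefetch strategy looks like. This is a sketch under assumptions: `children`, `cache_key`, the request-local cache, and the depth cutoff are illustrative stand-ins, not our actual content APIs.

from google.appengine.api import memcache

PREFETCH_DEPTH = 2  # hypothetical cutoff; the real depth would be tuned

def descendant_keys(topic, depth):
    """Cache keys for the topic's descendants, down to `depth` levels."""
    if depth <= 0:
        return []
    keys = [child.cache_key for child in topic.children]
    for child in topic.children:
        keys.extend(descendant_keys(child, depth - 1))
    return keys

def prefetch_topic(topic, request_cache):
    """Warm the request-local cache with one multi-get over the subtree."""
    keys = descendant_keys(topic, PREFETCH_DEPTH)
    request_cache.update(memcache.get_multi(keys))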

Doing it for real

Since then, we've implemented the system for real, which I'll talk about in our next blog post! Not everything worked out as planned, of course. But building this system out for real took months, and testing different caching strategies in production would have had days or weeks of turnaround time, whereas in our simulations we could test a new strategy in just a few minutes. The simulations required some guessing about our codebase -- knowing that "on topic pages, pull in every descendant of the topic at the start" was something we could implement, while, say, "pull in every item we need to render the page at once" definitely wasn't. There was still plenty we didn't know, but we moved into a months-long engineering effort with much more confidence that the plan had a real chance of working. And now we know, a lot more precisely, just what performance we can expect from App Engine Memcached.

Appendix: App Engine Memcached latency data

If you also use App Engine's dedicated Memcached service, you may find the data below useful. Note that each sample set was collected over the course of an hour or so, so it doesn't account for rare events like Memcached outages, or more generally capture how latency varies over time. That's also a potential source of the differences between the two tables: they were measured at different times.

Fetch latency by size

%ile    1B      10B     100B    1KB     10KB    100KB   1MB
5       1.38    1.35    1.36    1.41    1.47    2.33    8.29
25      1.72    1.70    1.70    1.84    1.92    2.94    9.88
50      2.34    2.29    2.28    2.71    2.69    3.77    11.16
75      3.71    3.67    3.72    4.01    4.09    5.08    12.65
90      4.66    4.60    4.66    4.83    4.92    6.00    14.22
95      5.33    5.27    5.45    5.50    5.76    6.74    15.46
99      10.56   9.93    11.35   9.47    11.83   12.11   21.09
99.9    34.59   24.57   38.44   26.97   30.57   41.15   42.84
Memcache latency for a single get of a string value of the given size, over 10,000 samples for each size, in milliseconds.

Multi-get latency

%ile    1 key   3 keys  5 keys  10 keys  100 keys
50      3.06    3.87    4.00    3.70     8.29
75      4.30    4.66    4.77    5.02     9.94
90      5.17    5.46    5.61    6.30     11.88
95      6.01    6.60    6.77    7.64     14.12
99      17.22   17.64   25.23   23.01    32.20
99.9    92.18   234.90  225.22  70.10    108.81
Memcache latency for a single multi-get of the given number of 1KB string values, over 10,000 samples for each key count, in milliseconds.