KA Engineering

KA Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.

Subscribe

Latest posts

Kotlin on the server at Khan Academy

Colin Fuller on June 28

The Original Serverless Architecture is Still Here

Kevin Dangoor on May 31

What do software architects at Khan Academy do?

Kevin Dangoor on May 14

New data pipeline management platform at Khan Academy

Ragini Gupta on April 30

Untangling our Python Code

Carter J. Bastian on April 16

Slicker: A Tool for Moving Things in Python

Ben Kraft on April 2

The Great Python Refactor of 2017 And Also 2018

Craig Silverstein on March 19

Working Remotely

Scott Grant on Oct 2, 2017

Tips for giving your first code reviews

Hannah Blumberg on Sep 18, 2017

Let's Reduce! A Gentle Introduction to Javascript's Reduce Method

Josh Comeau on Jul 10, 2017

Creating Query Components with Apollo

Brian Genisio on Jun 12, 2017

Migrating to a Mobile Monorepo for React Native

Jared Forsyth on May 29, 2017

Memcached-Backed Content Infrastructure

Ben Kraft on May 15, 2017

Profiling App Engine Memcached

Ben Kraft on May 1, 2017

App Engine Flex Language Shootout

Amos Latteier on Apr 17, 2017

What's New in OSS at Khan Academy

Brian Genisio on Apr 3, 2017

Automating App Store Screenshots

Bryan Clark on Mar 27, 2017

It's Okay to Break Things: Reflections on Khan Academy's Healthy Hackathon

Kimerie Green on Mar 6, 2017

Interning at Khan Academy: from student to intern

Shadaj Laddad on Dec 12, 2016

Prototyping with Framer

Nick Breen on Oct 3, 2016

Evolving our content infrastructure

William Chargin on Sep 19, 2016

Building a Really, Really Small Android App

Charlie Marsh on Aug 22, 2016

A Case for Time Tracking: Data Driven Time-Management

Oliver Northwood on Aug 8, 2016

Time Management at Khan Academy

Several Authors on Jul 25, 2016

Hackathons Can Be Healthy

Tom Yedwab on Jul 11, 2016

Ensuring transaction-safety in Google App Engine

Craig Silverstein on Jun 27, 2016

The User Write Lock: an Alternative to Transactions for Google App Engine

Craig Silverstein on Jun 20, 2016

Khan Academy's Engineering Principles

Ben Kamens on Jun 6, 2016

Minimizing the length of regular expressions, in practice

Craig Silverstein on May 23, 2016

Introducing SwiftTweaks

Bryan Clark on May 9, 2016

The Autonomous Dumbledore

Evy Kassirer on Apr 25, 2016

Engineering career development at Khan Academy

Ben Eater on Apr 11, 2016

Inline CSS at Khan Academy: Aphrodite

Jamie Wong on Mar 29, 2016

Starting Android at Khan Academy

Ben Komalo on Feb 29, 2016

Automating Highly Similar Translations

Kevin Barabash on Feb 15, 2016

The weekly snippet-server: open-sourced

Craig Silverstein on Feb 1, 2016

Stories from our latest intern class

2015 Interns on Dec 21, 2015

Kanbanning the LearnStorm Dev Process

Kevin Dangoor on Dec 7, 2015

Forgo JS packaging? Not so fast

Craig Silverstein on Nov 23, 2015

Switching to Slack

Benjamin Pollack on Nov 9, 2015

Receiving feedback as an intern at Khan Academy

David Wang on Oct 26, 2015

Schrödinger's deploys no more: how we update translations

Chelsea Voss on Oct 12, 2015

i18nize-templates: Internationalization After the Fact

Craig Silverstein on Sep 28, 2015

Making thumbnails fast

William Chargin on Sep 14, 2015

Copy-pasting more than just text

Sam Lau on Aug 31, 2015

No cheating allowed!!

Phillip Lemons on Aug 17, 2015

Fun with slope fields, css and react

Marcos Ojeda on Aug 5, 2015

Khan Academy: a new employee's primer

Riley Shaw on Jul 20, 2015

How wooden puzzles can destroy dev teams

John Sullivan on Jul 6, 2015

Babel in Khan Academy's i18n Toolchain

Kevin Barabash on Jun 22, 2015

tota11y - an accessibility visualization toolkit

Jordan Scales on Jun 8, 2015

Meta

Kotlin on the server at Khan Academy

by Colin Fuller on June 28

At Khan Academy, we run our web application using Python 2.7 on Google’s App Engine Standard. We love Python for a lot of reasons — most notably the language’s readability and concision. On the other hand, we have some engineering problems we’re having trouble solving with Python 2.7: parts of our site are very slow, and our hosting costs are skyrocketing due both to increased traffic and the new features we’re adding to make the best possible learning experience.

As part of our 2018 healthy hackathon, I decided to test what it would look like to serve some user-facing requests using a Kotlin-based application running on App Engine Flex. (Kotlin is a modern, statically typed, compiled programming language that runs on the Java virtual machine.) After seeing some very positive efficiency results, and having had a positive developer experience during the hackathon, we decided to go ahead and adopt Kotlin as a second language for server-side development at Khan Academy.

The rest of this blog post will take you through what went into the hackathon experiment, why we chose Kotlin, the results of the experiment, the challenges we faced, and what’s the future of Kotlin at Khan Academy.

Experimenting with Kotlin during the hackathon: a service for analytics events

If you’ve ever watched your browser’s devtools while you’re clicking around khanacademy.org, you might see a bunch of requests to /api/internal/.../mark_conversions. This is our API for recording analytics events from web clients. We collect data such as how many times a person has attempted to solve a particular exercise and how that person navigates around the site. We use these data, for example, to ensure that a redesign of the subjects menu hasn’t accidentally made it impossible to find things on the site, or to verify that changes to our exercises based on the latest external pedagogy research actually helped students learn better.

The API's implementation is pretty simple: it parses out information from the JSON the client sends us, adds extra information from the HTTP headers (for instance, whether it was a phone or a desktop computer), and then sends the information to Google’s BigQuery, our analytics warehouse.

Despite being so simple, this API endpoint had recently climbed to account for more than 10% of our total server costs. We hypothesized this was due to a relatively high per-request overhead. (This API serves a very large number of very small requests.) We wondered how much we could improve our costs if we wrote a separate service optimized for serving just this one API endpoint.

Enter Kotlin, which we’d used for a previous experiment with microservices at Khan Academy. (We've also used it for some of our internal tools.) We’d found it to be both very fast and fairly pleasant to work with. We decided to proceed with writing a Kotlin service to serve this one API, direct traffic to the new API, and compare the number of servers we needed before and after the transition.

Interlude: why Kotlin?

To be frank, our Python code has a lot of room for optimization within Python, so why start over in a new language? Adding an additional language means there’s a lot of new developer training, mental overhead due to switching between languages, and infrastructure code that needs to be written twice.

To be clear, our goal is not to rewrite the entire Khan Academy site in Kotlin (or any other language). (Code rewrites can often be disastrous.) Instead, we want to be able to take a few critical (or costly) parts of our codebase and optimize them to the extreme. But there was a sense, especially on the infrastructure team, that we were sometimes being limited by our tools.

When picking a new language to complement Python, our most important requirement was that we could use existing Google cloud libraries, since we run almost entirely on Google’s cloud infrastructure. This limited us to the languages that Google cloud officially supports — Python, Java, C#, Go, Node.js, and a couple others — or a language that can easily use libraries for these platforms.

We landed on Kotlin because it has excellent interoperability with Java, and it’s very different from Python in several ways that let us better optimize a piece of code according to its requirements:

  • Insofar as it’s possible to make general statements about a language’s performance, Kotlin is fast, while Python is… not.

  • Kotlin has a static, yet expressive type system; Python is dynamically typed. We don't necessarily think static typing is always a win: in some situations static typing with a well-designed type system can help developers write code that has fewer errors and is easier to refactor, while in other situations it can decrease developer productivity by making code that would have worked fine anyway unnecessarily hard to read. But now we can choose static or dynamic typing based on a the requirements on a given piece of code.

  • Kotlin code conventionally uses a functional/immutable style, whereas Python conventionally uses a more imperative style. Each style has a place, but having both Python and Kotlin gives us a choice here too.

  • Kotlin (on the JVM) supports true parallelism within a single process, whereas Python (in the CPython implementation available to us) does not, due to the global interpreter lock. This gives us more flexibility to do asynchronous background processing, or to serve more requests simultaneously.

The implementation

To test, we set up a new service running on App Engine Flex, based upon a minimal Ubuntu image with OpenJDK 8, into which we compiled a small web application based on the Spark web framework. This application reimplemented the single API endpoint we wanted to port, along with a bunch of supporting middleware for things like Khan-compatible authentication and request annotations for analytics.

The result was actually a fair bit of code, so we didn’t want to switch everyone over at once. In order to allow this gradual transition, we passed down a feature flag to our client-side code that chose on a per-person basis whether to use the old Python version of the API or the new Kotlin version. By controlling the percentage of people for whom the feature flag was set, we could gradually roll out to more people.

We initially rolled out to Khan Academy staff only, and when that looked ok, we rolled out to 5% of people for a few days, and then gradually upped that to everyone over the course of another day. All in all this went relatively smoothly, though there were a few hiccups that we’ll discuss further in a moment.

Results

Because our old Python API was running on a set of multithreaded instances (virtual servers) that were concurrently serving a bunch of other APIs, it was not straightforward to directly compare the cost and performance of the old version to the new one. Instead, we looked at our peak daytime instance counts before and after the rollout to estimate how many instances were required to serve all the API traffic on the two versions. After the rollout, the module serving our Python API requests peaked at requiring around 400 instances fewer than it needed before the rollout, and the Kotlin module (serving only this one API) stayed below 50 instances, meaning that one of our Kotlin instances is able to serve roughly 10x the requests that the Python version could serve.

In addition, because the particular App Engine Flex instances we’re using are about 4x cheaper than the (App Engine Standard) Python instances we used before, this represents even more than a 10x cost savings on this route for us.

Server hours are down more than 10x

Challenges

During the rollout, we did encounter some unexpected challenges. The most interesting was maintaining backwards compatibility of the data we were recording for analytics. While we knew this would be an issue and wrote unit tests for our code to make sure this was as close as possible, there were some subtle differences our tests didn’t catch that caused issues.

An example of this was small changes in user agent parsing: we couldn’t use exactly the same library for user agent parsing in Python and Kotlin, and there were small differences in how they converted user agent strings into device and browser information. Uncaught user agent parsing differences propagated into our analytics code for calculating learning sessions — if you switch devices, we start counting a new session — which in turn propagated into the code that calculates total time spent learning. Total learning time is one of a few key metrics we use to evaluate how new features we deploy are affecting learning on the site; it’s very important to keep it consistent over time, and at first our rollout didn’t do that.

How did these differences slip through the cracks? While many of them did not, and were caught either by manual cross-browser testing or by automated tests, testing didn't find all of them. To catch issues that tests may not, we also have monitoring for step changes in our key metrics that alert people to the possibility of bugs being deployed. Even with this monitoring, it took us a while to notice the user agent differences because we happened to deploy the new Kotlin service during one of a few common weeks for spring break in the US public school system, so the overall changes in our traffic patterns masked the issues caused by the rollout.

It’s still an active area of effort to come up with better ways to monitor and test our key queries for analytics. A few years ago, we developed tinyquery, a Python library that’s an in-memory test stub for Google’s BigQuery. While tinyquery is very helpful for us, it still relies on us having generated and used test fixture data that is up-to-date and contains enough variety to detect any issues. The data engineering team at Khan is thinking about better ways to generate or sample fixture data for our key metrics, as well as additional query monitoring tools, and hopefully you’ll hear from us in a future blog post about our efforts here.

The future

Adding and maintaining another language in our codebase is no small effort. We’ve built up a lot of shared knowledge, practices, and tooling around our Python code, and we largely have to start from scratch with Kotlin. But we think it’s worth the effort because the performance and cost gains can be very significant. For our learners, a faster site translates directly to less time spent waiting and more time spent learning. For us, a less expensive site means less money spent running inefficient code and more money spent on making the best possible learning experience.1

What’s up next for Kotlin at Khan Academy? First and foremost, learning! We’re compiling resources and guides internally to help people onboard to Kotlin programming. We’re setting up starter projects in Kotlin to help people learn with hands-on experience in bite-sized pieces. And our analytics, content creation, and internationalization teams are trying Kotlin for some of their internal tooling, giving us experience with medium-sized projects with relatively isolated codebases. In parallel, continued efforts to untangle and refactor our existing Python codebase will yield more isolated APIs that are easier to port to Kotlin if and when the time comes to do so. All in all, with Kotlin we now have another great tool available at Khan Academy for writing the best software we possibly can, in order to provide the best possible experience for our learners.


1Reminder: we’re a non-profit, and you can donate if you want to help us achieve our mission of a free, world-class education for anyone, anywhere.