Khan Engineering

Khan Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.

Subscribe

Latest posts

Using static analysis in Python, JavaScript and more to make your system safer

Kevin Dangoor on July 26

Kotlin on the server at Khan Academy

Colin Fuller on June 28

The Original Serverless Architecture is Still Here

Kevin Dangoor on May 31

What do software architects at Khan Academy do?

Kevin Dangoor on May 14

New data pipeline management platform at Khan Academy

Ragini Gupta on April 30

Untangling our Python Code

Carter J. Bastian on April 16

Slicker: A Tool for Moving Things in Python

Ben Kraft on April 2

The Great Python Refactor of 2017 And Also 2018

Craig Silverstein on March 19

Working Remotely

Scott Grant on Oct 2, 2017

Tips for giving your first code reviews

Hannah Blumberg on Sep 18, 2017

Let's Reduce! A Gentle Introduction to Javascript's Reduce Method

Josh Comeau on Jul 10, 2017

Creating Query Components with Apollo

Brian Genisio on Jun 12, 2017

Migrating to a Mobile Monorepo for React Native

Jared Forsyth on May 29, 2017

Memcached-Backed Content Infrastructure

Ben Kraft on May 15, 2017

Profiling App Engine Memcached

Ben Kraft on May 1, 2017

App Engine Flex Language Shootout

Amos Latteier on Apr 17, 2017

What's New in OSS at Khan Academy

Brian Genisio on Apr 3, 2017

Automating App Store Screenshots

Bryan Clark on Mar 27, 2017

It's Okay to Break Things: Reflections on Khan Academy's Healthy Hackathon

Kimerie Green on Mar 6, 2017

Interning at Khan Academy: from student to intern

Shadaj Laddad on Dec 12, 2016

Prototyping with Framer

Nick Breen on Oct 3, 2016

Evolving our content infrastructure

William Chargin on Sep 19, 2016

Building a Really, Really Small Android App

Charlie Marsh on Aug 22, 2016

A Case for Time Tracking: Data Driven Time-Management

Oliver Northwood on Aug 8, 2016

Time Management at Khan Academy

Several Authors on Jul 25, 2016

Hackathons Can Be Healthy

Tom Yedwab on Jul 11, 2016

Ensuring transaction-safety in Google App Engine

Craig Silverstein on Jun 27, 2016

The User Write Lock: an Alternative to Transactions for Google App Engine

Craig Silverstein on Jun 20, 2016

Khan Academy's Engineering Principles

Ben Kamens on Jun 6, 2016

Minimizing the length of regular expressions, in practice

Craig Silverstein on May 23, 2016

Introducing SwiftTweaks

Bryan Clark on May 9, 2016

The Autonomous Dumbledore

Evy Kassirer on Apr 25, 2016

Engineering career development at Khan Academy

Ben Eater on Apr 11, 2016

Inline CSS at Khan Academy: Aphrodite

Jamie Wong on Mar 29, 2016

Starting Android at Khan Academy

Ben Komalo on Feb 29, 2016

Automating Highly Similar Translations

Kevin Barabash on Feb 15, 2016

The weekly snippet-server: open-sourced

Craig Silverstein on Feb 1, 2016

Stories from our latest intern class

2015 Interns on Dec 21, 2015

Kanbanning the LearnStorm Dev Process

Kevin Dangoor on Dec 7, 2015

Forgo JS packaging? Not so fast

Craig Silverstein on Nov 23, 2015

Switching to Slack

Benjamin Pollack on Nov 9, 2015

Receiving feedback as an intern at Khan Academy

David Wang on Oct 26, 2015

Schrödinger's deploys no more: how we update translations

Chelsea Voss on Oct 12, 2015

i18nize-templates: Internationalization After the Fact

Craig Silverstein on Sep 28, 2015

Making thumbnails fast

William Chargin on Sep 14, 2015

Copy-pasting more than just text

Sam Lau on Aug 31, 2015

No cheating allowed!!

Phillip Lemons on Aug 17, 2015

Fun with slope fields, css and react

Marcos Ojeda on Aug 5, 2015

Khan Academy: a new employee's primer

Riley Shaw on Jul 20, 2015

How wooden puzzles can destroy dev teams

John Sullivan on Jul 6, 2015

Babel in Khan Academy's i18n Toolchain

Kevin Barabash on Jun 22, 2015

tota11y - an accessibility visualization toolkit

Jordan Scales on Jun 8, 2015

Meta

The Great Python Refactor of 2017 And Also 2018

by Craig Silverstein on March 19

Our codebase was a mess. One day, we decided to do something about it: namely, move a bunch of files around. It took us two months.

This blog post is first in a series describing the Great Khan Academy Python Refactor of 2017 And Also 2018. In this post, I'll explain where our codebase went wrong, what we wanted to do to fix it, and why it was so difficult. It will include tips and code that others can use to avoid some of this difficulty themselves. The second post will describe Slicker, a tool we wrote which formed the backbone of our refactoring effort. The third post will describe how we used this refactoring as an opportunity to reduce inter-file dependencies within our codebase, and how that benefited us.

So what was wrong with our codebase? Take your pick:

  1. We had files named parent.py, coaches/parent.py, users/parent.py, and api/internal/parent.py, all holding different bits of logic pertaining to parents of students on Khan Academy.
  2. We had a method named login.login.Login.login. (If you're wondering, this method was part of the login process.)
  3. Our top-level directory had 234 python files in it. More than 10% of our codebase was in our root directory. Not surprisingly, those files were not very related to each other. There were low level utility files like ip_util.py. There were weird specialized utilities like dismissed_items.py. There was application logic like activity_summary.py, and dev-only tools like appengine_stats.py. Oh, there were also 93 Python subdirectories. (Plus a few more for non-Python code.) While the root directory was overstuffed, 30 of these subdirectories contained 5 files or less.

Now all our parent-related code is in two Python files. (One for the API and one for everything else.) The login method is now called login.handlers.ka_login.login_and_set_current. And there are half as many Python subdirectories as before, and 1/20th (!) as many Python files in the root directory.

How did it all go wrong?

The glib answer is: bit by bit. People created new subdirectories thinking they would turn into major projects, but they didn't. Or they would create a new file and put it in the root directory because the closest analogue was at the root level. Eventually the problems fed on themselves. When you can't find existing code relevant to what you're doing, you write new code in a new file. And when the organization of the codebase makes no sense -- when it doesn't relate to the company's current products or organizational structure -- you have no good place to put your new file so you put it somewhere arbitrary. Without anyone "owning" the structure of the codebase, it never got any pruning or reshaping.

What did we do about it?

Khan Academy set aside 6 weeks at the end of 2017 for the entire company to work on paying down technical debt, and three of us decided to use that time to whip (the Python portion of) our codebase into shape. [*]

We started with the subdirectories. One of our senior engineers, familiar with most of the codebase, decided on a preliminary list of directories we would have. Some of these reflected our products, such as test_prep or translation. Others reflected our major data structures and workflows, such as coaches or login. Some reflected lower-level infrastructure, such as email or pubsub. They then spent two days bucketing every file in our codebase into one of these directories. A few dozen files defied categorization and were marked TODO.

After this rough cut the real work began. Another group went through every file in the codebase again, and recategorized as appropriate. During this process, directories went away and were added, were split and merged. Files would end up moved somewhere else after more careful examination. Often, files were slated to be broken up, since they held unrelated pieces of functionality. (Some particularly large files were split up into 4 different directories!) There were no shortcuts here; in this step we scanned through almost every line of code in the codebase.

The most interesting result of this step of the process is how we rethought some parts of our codebase. For instance, historically our emails directory knew about every type of email we could ever send a user. As part of this analysis, we realized the code would be cleaner if only the email-sending framework lived in emails (now renamed email), and each specific email lived with the code most appropriate to it (SAT-related emails in sat, new-user emails in login, etc.) Similarly, we moved our API handlers out of a single api directory and into their relevant project directories. Both organizations have their merits, but we had fallen into one of them by default. This process made us think about these alternatives explicitly.

Once we knew what files we wanted to go where, we used slicker to do the actual moving. It works as a smarter mv, not only renaming the file but also fixing all imports as well as references in comments and strings (such as mock.patch('path.to.symbol').) It can also move individual functions and classes from one file to another, which we used to split up files.

Why was this hard?

  1. Code review. While slicker worked very well, there were enough ambiguous cases that every change had to be carefully reviewed. For instance, we renamed feeds.py to content_render/rss/feeds.py. Now consider code like this:

    _MODULE_WHITELIST = ['feeds', ...]
    user_data['feeds'] = _get_feeds_for_user()
    

    In the first case we want to rewrite feeds to content_render.rss.feeds, but in the second case we don't. There is no way an automated tool can know this, so we depended on code review (and, of course, tests). I spent entire days doing nothing but code reviews.

  2. Merge conflicts. This was the most unexpected time-sink. We expected merge conflicts with people who were actually working on our codebase while we did the moves, and coordinated carefully with our co-workers to minimize those. But we did not expect all the merge conflicts we had with ourselves. While we thought we would be able to do many renames in parallel, when we tried it we got tons of merge conflicts. This is because whenever you move a file, you change import statements in other files. If somefile.py has both import foo and import bar, and you rename foo.py while I rename bar.py, then we're both editing somefile.py. Whoever lands their changes second has to merge in the other change. Unfortunately, import statements tend to be clumped together, and git treats these two changes as a conflict (even though they're really not). Each such conflict has to be resolved by hand. Resolving merge conflicts took longer -- sometimes significantly longer -- than actually creating the commit in the first place.

  3. Implicit dependencies on directory structure. We had plenty of files that would do things like datafile = os.path.join(__file__, '..', 'testdata', 'whatever'). With the reorganization this code broke. Fixing all of these was tedious.

  4. Pickling. When the pickle library pickles a class or a function, it does so by full name: path.to.module.myfunc. If the class or function moves, you can no longer unpickle the data. While complaints about pickle are nothing new, there is no getting around this particular problem: if you want to store the name of a function in persistent storage so you can call it later, you need some way of saying what that function is. And in fact, we need to do just that for AppEngine's useful deferred-execution library, which we use a lot. And all those uses were slated to break when we moved files around.

Enter pickle_util. This is a wrapper around pickle that we use in Khan Academy code, that lets you register symbol renames. When the pickle_util unpickler finds a class, function, or other symbol that it cannot import, it checks a look-up table for the symbol's new location. It then tries to import the symbol from there instead. This lets you transparently unpickle symbols even after they have moved. The source code is easy to adapt for your own use (works with both pickle and cPickle, but it's only tested on python2).

But we had a problem: we didn't know what symbols we needed to register with pickle_util, because we don't have any master list of which functions and classes might be pickled somewhere. We could just register every single symbol in our codebase with pickle_util, but that's unwieldy and slow. So instead we used another solution: pickle guards.

A pickle guard is a "forwarding file" we create whenever we move a file to a new location. The pickle guard file lives at the old location and just imports files from the new location. This forwarding file should never be imported by our code (since all references should go to the new location now), but it will still be imported by pickle when unpickling symbols that reference that old location. So we have it log a message whenever it's imported saying "pickle is using a symbol from this file!" We can then examine our logs to see places we need to register symbols with pickle_util. Once the logs are all clear, we can delete the pickle-guard files and have a nice, clean codebase.

Here's an example pickle-guard file for when we renamed google_analytics.py to analytics/google_analytics.py:

logging.error("Should not be importing %s, "
              "update pickle_util.py", __file__)

from analytics.google_analytics import _construct_event_payload  # NoQA: F401
from analytics.google_analytics import _fix_payload_unicode  # NoQA: F401
from analytics.google_analytics import _send_event_to_ga_sync  # NoQA: F401
from analytics.google_analytics import google_analytics_user_id  # NoQA: F401
from analytics.google_analytics import mark_ga_activation  # NoQA: F401
from analytics.google_analytics import send_event_to_ga  # NoQA: F401

We created pickle-guard files whenever we moved a file. The code to automate that is here.

What did we get right? What did we get wrong?

We did a lot of things right, I think. We were smart to figure out exactly what we were going to do before we started doing it; the mechanics of the moves were intricate enough without having to keep in mind the semantics as well.

And we were smart not try to do all this in a single "flag day", where nobody could work on the core product while we did the move, with the goal of avoiding merge conflicts. We considered it, but flag days, however tempting, are never a good idea. And they would not have been here. There were so many unexpected gotchas that it would have been more like a flag month. By planning from the beginning to work around others' schedules, we were able to mostly avoid merge conflicts with other developers while not blocking anyone's work.

However, in retrospect I wish we had invested more in a tool to automatically resolve merge conflicts with each other, so we didn't have to continually fix up import blocks. It was time consuming and error prone, and probably could have been automated.

I also wish we had generated pickle guards in a separate commit from the one that moved the file around. As it is, we did all of it together, and now git does not recognize our file moves as moves. Instead, it sees that we edited google_analytics.py a lot (getting rid of the old content and replacing it with the pickle-guard content) and created a new, seemingly unrelated, file called analytics/google_analytics.py. Now git blame and git log <file> do not work very well. (We can use git blame -C, which works ok but is slow. There's no good solution for git log.) We could have easily avoided this by putting the move in its own commit.

Two months later

There's always a risk that the minute you've finished a cleanup like this, the weeds start growing again. But we've not found that to be the case. The clearer organization makes it more obvious where new code should go, and it's easier for teams to focus their code in just one or two directories that match their project. Of course, as the company introduces new products and sunsets old ones, the code structure will need to change to match, but now it's obvious how to do that. And it's easier for even the most long-tenured engineer to find their way around the codebase. As for our new employees, they don't know how good they have it.


[*]We focused on Python because the Javascript code was already fairly well organized: the better organized JS code is, the more compact the bundles that users download, and the faster the user experience. Thus, JS code gets cleaned up as part of performance projects, while Python code does not.