Khan Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.


Ensuring transaction-safety in Google App Engine

by Craig Silverstein on Jun 27, 2016

In last week's exciting post, I described an alternative to transactions that we use at Khan Academy, to ensure atomic datastore operations.

When used correctly, both the user-write lock and transactions are effective at avoiding a particular form of database corruption -- call it "data stomping." Data stomping happens when two requests try to modify the same datastore entity at the same time.


If request A and request B both GET the entity before either one PUTs it, request B does not see A's modifications, and its PUT overwrites A's PUT. A's modifications are entirely lost, even when they don't conflict with B's.
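
To make the failure concrete, here is a minimal sketch that replays two overlapping requests against a toy in-memory datastore. The `DATASTORE` dict and the `get`/`put` helpers are stand-ins invented for this illustration; they are not App Engine APIs.

```python
import copy

# Toy stand-in for the datastore: key -> entity (a plain dict).
DATASTORE = {"user:1": {"points": 0, "nickname": "old"}}

def get(key):
    # Each GET returns an independent copy, like deserializing from storage.
    return copy.deepcopy(DATASTORE[key])

def put(key, entity):
    # A PUT writes the whole entity, not just the fields that changed.
    DATASTORE[key] = copy.deepcopy(entity)

# Requests A and B both GET before either one PUTs.
entity_a = get("user:1")
entity_b = get("user:1")

entity_a["points"] += 5          # A's MODIFY
put("user:1", entity_a)          # A's PUT

entity_b["nickname"] = "new"     # B's MODIFY touches a different field...
put("user:1", entity_b)          # ...but B's PUT still wipes out A's points.

print(DATASTORE["user:1"])       # {'points': 0, 'nickname': 'new'}
```

Even though A and B changed different fields, A's 5 points are gone, because B wrote back the copy it read before A's PUT.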

Transactions solve this problem by noticing the contention at request B's put() time, and forcing request B to retry from the beginning. Locks solve the problem by not allowing the time-overlap at all.

Note that for both techniques, you need to follow the GET - MODIFY - PUT idiom. It is an error -- a db stomping waiting to happen -- to do the GET outside the transaction/lock!
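
A toy model of the retry behavior helps show why the GET has to live inside the retried block. This is only a sketch under stated assumptions: the version counter, `put_if_unchanged`, and `run_with_retries` are hypothetical names for this illustration, and the code does not claim to mirror how App Engine implements transactions.

```python
import copy

class ContentionError(Exception):
    """Raised when the entity changed between our GET and our PUT."""

# Toy datastore: key -> (version, entity).
STORE = {"user:1": (0, {"points": 0})}

def get(key):
    version, entity = STORE[key]
    return version, copy.deepcopy(entity)

def put_if_unchanged(key, seen_version, entity):
    current_version, _ = STORE[key]
    if current_version != seen_version:
        raise ContentionError(key)
    STORE[key] = (current_version + 1, copy.deepcopy(entity))

def run_with_retries(fn, retries=3):
    # Re-run the whole GET - MODIFY - PUT whenever contention is detected.
    for _ in range(retries):
        try:
            return fn()
        except ContentionError:
            continue
    raise ContentionError("gave up after %d attempts" % retries)

attempts = []

def add_points():
    version, entity = get("user:1")      # GET inside the retried function
    if not attempts:
        # Simulate request A sneaking in a write during our first attempt.
        put_if_unchanged("user:1", version, {"points": 5})
    attempts.append(version)
    entity["points"] += 10               # MODIFY
    put_if_unchanged("user:1", version, entity)   # PUT
    return entity["points"]

result = run_with_retries(add_points)
```

The first attempt fails at PUT time, the second attempt re-GETs and sees A's 5 points, and the final value is 15. If the GET had happened once outside `add_points`, the retry would have reused the stale entity and the check could not have helped.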

In this blog post, I describe the infrastructure we put in place at Khan Academy (which uses Google App Engine) to notice that error, and to make it easy to modify the source code to prevent it. We are making the source code available in two files:

  • a generic db/ndb hooking infrastructure
  • the specific hooks we use to detect and alert for transaction-safety violations

How do people use transactions (and locks) wrong?

The mistake people make is simple: they do the GET outside the transaction (or lock). Then when the transaction retries, it doesn't re-GET, so you end up with request B stomping out request A's changes.

You may think it's easy to remember to always do your GETs inside a transaction, but there are many ways to get this wrong:

  • You do the PUT in a function that's far removed from the GET.
  • You are given an entity and forget to run entity = entity.key.get() to "re-GET" inside the transaction.
  • There are multiple codepaths used to GET an object, and only some of them -- maybe the ones used 99% of the time, so everything seems mostly-fine -- are done inside the transaction.
  • The get() call gives a cached result.

This last cause was a big problem for us: we would cache the entity corresponding to the current user, for efficiency. Then, whenever we wanted to update the current user, we'd do get_current_user().modify().put() inside a transaction, without realizing that get_current_user() was returning some cached entity that was fetched way before the transaction started.
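
The cached-entity trap can be reproduced in a few lines. This is a simulation only: the `_request_cache` dict and this `get_current_user()` are hypothetical stand-ins, and the "transaction" is just a comment marking where the real code would have been wrapped.

```python
import copy

DATASTORE = {"user:1": {"points": 0}}
_request_cache = {}

def fetch(key):
    return copy.deepcopy(DATASTORE[key])

def put(key, entity):
    DATASTORE[key] = copy.deepcopy(entity)

def get_current_user():
    # Hypothetical per-request cache: only the first call hits the datastore.
    if "user:1" not in _request_cache:
        _request_cache["user:1"] = fetch("user:1")
    return _request_cache["user:1"]

# Early in the request, something reads the current user (and caches it).
get_current_user()

# Meanwhile, another request awards 100 points.
put("user:1", {"points": 100})

# Later, "inside the transaction", this looks like a fresh GET...
user = get_current_user()        # ...but it is the stale cached entity.
user["points"] += 5
put("user:1", user)
```

The datastore ends up with 5 points instead of 105: the GET that mattered happened long before the transaction began.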

The solution is pretty straightforward, once you realize there's a problem. The issue is finding out there's a problem in the first place, and then tracing through the code to find the problematic GET.

A Taxonomy of Data Stomping Errors

While the GET-outside-transaction error is the most common, there are many related types of data corruption. The infrastructure we put in place catches the following three types:


Doing the PUT inside a transaction or user-lock, but not the GET.

@db.transactional
def seems_ok_but_is_not(uid):
     user_data = UserData.get_from_id(uid)   # cached!
     user_data.points += 5
     user_data.put()   # the PUT is inside the transaction; the real GET was not

The problem here is that get_from_id() gets the user-data entity from a cache. So even though it looks like you're doing the GET from within the transaction, you're actually (potentially) just seeing some object that was gotten much earlier in the request, outside this transaction.

Totally unprotected stomping

Doing a GET - MODIFY - PUT entirely outside a transaction or user-lock.

def badfunc(user_data):
     user_data_again = db.get(user_data.key())
     user_data_again.points += 5
     user_data_again.put()   # no transaction or user-lock anywhere

Internal stomping

Doing two nested (or interleaved) GET - MODIFY - PUT's inside a single transaction/lock.

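def _internal_fn(uid):
   user_data1 = get_user(uid)
   user_data1.points += 5
   user_data1.put()

@db.transactional
def public_fn(uid):
   user_data2 = get_user(uid)
   _internal_fn(uid)          # nested GET - MODIFY - PUT of the same entity
   user_data2.points += 10
   user_data2.put()           # overwrites the +5 from _internal_fn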

The problem here is that user_data1 and user_data2 are totally different python objects. When we do the user_data2.put(), it totally overwrites the change made in user_data1. This is the classical db-stomping problem, but within a single request!
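
A self-contained simulation makes the lost update visible. The in-memory `DATASTORE` and the dict-based entities are assumptions made for this sketch; what matters is that each GET hands back a distinct Python object.

```python
import copy

DATASTORE = {"user:1": {"points": 0}}

def get_user(key):
    return copy.deepcopy(DATASTORE[key])   # a new Python object per GET

def put_user(key, entity):
    DATASTORE[key] = copy.deepcopy(entity)

def _internal_fn(key):
    user_data1 = get_user(key)
    user_data1["points"] += 5
    put_user(key, user_data1)

def public_fn(key):
    user_data2 = get_user(key)     # fetched before the nested call
    _internal_fn(key)              # writes points = 5
    user_data2["points"] += 10     # still based on points = 0
    put_user(key, user_data2)      # writes points = 10, losing the 5

public_fn("user:1")
print(DATASTORE["user:1"]["points"])   # 10, not 15
```

No second request is involved at all, which is why an ordinary transaction does not protect against this case.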

How To Use It

To get the benefits of transaction-safety checking, you must annotate a db/ndb model with a decorator saying what method you use to guarantee safe put()'s:

  1. @never_written_model() -- super rare!
  2. @abstract_model() -- commonly for polymodels and utility classes
  3. @structured_property_model() -- for (Local)StructuredProperty models
  4. @written_once_model() -- easiest to use correctly (no need for transactions)
  5. @written_in_transaction_model() -- you put get-modify-put in a transaction
  6. @written_with_user_lock_model(lockid_fn) -- you put get-modify-put in a user write lock
  7. @written_via_cron_model() -- appengine lets you schedule cron jobs; if an entity is only accessed via a cron job, we know two requests will never access that entity at the same time
  8. @dangerously_written_outside_transaction_model() -- for legacy code
  9. @dangerously_written_outside_transaction_model_or_user_lock() -- ditto

These tell the transaction-safety system what kinds of violations to look for. Each choice is documented in much more detail in the source code.

Note that @written_with_user_lock_model takes an argument, which should be a function that takes an entity and returns the lock_id for that entity. For instance, if the lock is protecting a single user, the lock_id might be the user-id. This is necessary because a single lock can protect many different entities. Example:

@db_decorators.written_with_user_lock_model(lambda e: e.kaid)
class UserVideo(db.Model):
    """A single user's interaction with a single video."""
    user = db.UserProperty(indexed=True)
    kaid = db.StringProperty(indexed=True)   # user's user-id
    video_key = object_property.KeyProperty(indexed=True)
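
For readers curious how a decorator like this can carry a lock-id function, here is one possible shape. It is a sketch, not the code from the released files: the `_ts_policy` and `_ts_lockid_fn` attribute names and the `required_lock_id()` helper are invented for this illustration, and the model is a plain class rather than a db.Model.

```python
def written_with_user_lock_model(lockid_fn):
    """Class decorator (sketch): record how this model must be written."""
    def decorator(cls):
        # Hypothetical attribute names, chosen for this illustration only.
        cls._ts_policy = "user_lock"
        cls._ts_lockid_fn = staticmethod(lockid_fn)
        return cls
    return decorator

@written_with_user_lock_model(lambda e: e.kaid)
class UserVideo(object):
    def __init__(self, kaid):
        self.kaid = kaid

def required_lock_id(entity):
    # At put() time, a checker could ask which lock ought to be held.
    return type(entity)._ts_lockid_fn(entity)

entity = UserVideo(kaid="kaid_123")
print(required_lock_id(entity))   # kaid_123
```

Storing the function on the class lets one generic put-hook work for every model, since each model says for itself how to derive its lock id.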

Second, you have to wrap your WSGI application in the transaction-safety middleware:

app = webapp2.WSGIApplication([...routes...])
app = txn_safety.TransactionSafetyMiddleware(app)

Then you just run your application. If there is a transaction-safety violation, the system will log it:

Did a put() of the same entity from two different python objects: <class 'user_models.UserData'>.
Other put:
File "/api/internal/", line 408, in update_user_scratchpad old_points, old_challenge_status, client_dt, time_taken)
File "/api/internal/", line 436, in add_actions_for_user_scratchpad finished=(progress == "complete"))
File "/scratchpads/", line 2775, in record_for_user_and_scratchpad scratchpad=scratchpad)
File "/rewards/", line 119, in update_with_triggers_no_put user_data, possible_badges, dry_run=dry_run, **kwargs)
File "/rewards/", line 158, in maybe_award_badges_no_put badge.award_to(user_data=user_data, **kwargs)
File "/badges/", line 450, in award_to user_data,, self.description)
File "/notifications/", line 201, in send_certificate_notifications coach.put()
File "/", line 4173, in put result = super(UserData, self).put(*args, **kwargs)
Traceback (most recent call last):
File "/api/internal/", line 408, in update_user_scratchpad old_points, old_challenge_status, client_dt, time_taken)
File "/api/internal/", line 436, in add_actions_for_user_scratchpad finished=(progress == "complete"))
File "/scratchpads/", line 2777, in record_for_user_and_scratchpad user_data.put()
File "/", line 4173, in put result = super(UserData, self).put(*args, **kwargs)
File "/", line 55, in wrapper hook(model_or_models)
File "/", line 613, in _examine_put_state _examine_tainted_put(entity)
File "/", line 605, in _examine_tainted_put % (type(entity), tb))

This is an example of "internal stomping." If you had access to the source code, these tracebacks would be enough to tell you that record_for_user_and_scratchpad does a get() + put() of some user-data, and send_certificate_notifications does a nested get() + put() of the same user-data.

For power users, the source code documents functions like disable_user_write_lock_checking_in_test().

In the last blog post I mentioned that fetch_under_user_write_lock could not be used at that time. With the functionality in this blog post, it can be! That makes it really easy to re-fetch an entity -- or not, as needed -- under the user write lock.

def update_points(user_data):
    with fetch_under_user_write_lock(user_data) as ud_again:
        ud_again.points += 5

If we are already under the write lock, this is a no-op; otherwise it re-fetches the entity under the lock. It works for both db and ndb entities.

How It Works

The basic approach of the transaction-safety infrastructure is to annotate every datastore entity with a history of when it was retrieved from the datastore and what the state of the world was at the time: in transaction X, or under user lock Y. At put() time, it examines that history to make sure the put() is happening in the same transaction or user lock (or indeed in any transaction at all) and complains if not, giving a traceback of the put() call to help with debugging. It also keeps track of whether the same entity was get()-ed multiple times, which is needed to detect internal stomping.

Here is a snippet from the source code that demonstrates how it works:

# For a newly created entity, we don't need a transaction.
if not hasattr(entity, '_ts_get_nonce'):
    return     # not created via a get()
get_transaction = getattr(
    entity, '_transaction_at_request_time', None)
put_transaction = _transaction_object()
if not get_transaction and not put_transaction:
    _ts_violation('Did not use a transaction')
elif not get_transaction:
    _ts_violation('Did the get() outside a transaction')
elif not put_transaction:
    _ts_violation('Did the put() outside a transaction')
elif get_transaction != put_transaction:
    _ts_violation('Did the get() and put() in different txns')
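
The same branch structure can be exercised outside App Engine with a toy harness. In this sketch, `record_get()` plays the role of a get-hook and `check_put()` the role of a put-hook; both names, the `_was_fetched` marker, and the module-level `_current_transaction` are assumptions made for the illustration.

```python
_current_transaction = None   # toy stand-in for "the transaction we are in"

class Entity(object):
    def __init__(self, points=0):
        self.points = points

def record_get(entity):
    # What a get-hook might do: stamp the entity with the GET-time state.
    entity._transaction_at_get_time = _current_transaction
    entity._was_fetched = True
    return entity

def check_put(entity):
    # What a put-hook might do. Returns a violation message, or None if safe.
    if not getattr(entity, "_was_fetched", False):
        return None    # newly created entity: nothing to stomp on
    get_transaction = entity._transaction_at_get_time
    put_transaction = _current_transaction
    if not get_transaction and not put_transaction:
        return "Did not use a transaction"
    elif not get_transaction:
        return "Did the get() outside a transaction"
    elif not put_transaction:
        return "Did the put() outside a transaction"
    elif get_transaction != put_transaction:
        return "Did the get() and put() in different txns"
    return None

fetched_outside = record_get(Entity())     # GET with no transaction active
_current_transaction = "txn-1"             # now "enter" a transaction
fetched_inside = record_get(Entity())      # GET inside txn-1

results = [
    check_put(fetched_outside),   # GET was outside, PUT is inside
    check_put(fetched_inside),    # both inside txn-1
    check_put(Entity()),          # brand-new entity
]
```

Only the first PUT is flagged, with "Did the get() outside a transaction"; the other two return None.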

The bulk of the complexity is actually in the code for adding get-hooks and put-hooks in App Engine db and ndb models. While there is a built-in hook system for ndb, it is not adequate for our purposes because it only hooks get() calls, not queries. And the older db library has no hooks at all. The generic hooking infrastructure provides a uniform interface for hooking all functions that get or return entities in both db libraries.
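
The general idea of hooking every entity-returning function can be sketched with ordinary function wrapping. This is an illustration of the technique only: `add_get_hook()` and the `fake_db` namespace are hypothetical, and the real db/ndb entry points are more numerous and more varied than the two shown here.

```python
import functools
import types

def add_get_hook(owner, func_name, hook):
    """Replace owner.func_name with a wrapper that passes results to hook."""
    original = getattr(owner, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        # Normalize single-entity and multi-entity results.
        entities = result if isinstance(result, list) else [result]
        for entity in entities:
            if entity is not None:
                hook(entity)
        return result

    setattr(owner, func_name, wrapper)

def _get(key):
    return {"key": key}

def _query_all():
    return [{"key": "a"}, {"key": "b"}]

# Toy stand-in for a datastore module with two ways to read entities.
fake_db = types.SimpleNamespace(get=_get, query_all=_query_all)

seen = []
for name in ("get", "query_all"):
    add_get_hook(fake_db, name, lambda e: seen.append(e["key"]))

fake_db.get("x")
fake_db.query_all()
print(seen)   # ['x', 'a', 'b']
```

Because the hook sees entities from both the direct get and the query, no read path can hand out an entity that was never stamped.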

Appendix: Non-Data Stomping Errors

Data stomping is not the only problem you can run into with db data. Here are four cases our infrastructure does not detect.

Stale reads

GET + GET - MODIFY - PUT + <use first GET>

def goodfunc(user_data):
   user_data_again = user_data.key.get()
   user_data_again.points += 5
   user_data_again.put()

def oopsfunc(user_data):
   if should_assign_points:
       goodfunc(user_data)
   if user_data.points > 100:   # stale read!
       pass   # act on the (stale) point total

The problem here is that goodfunc() updates user_data_again, but leaves user_data untouched. So the user_data.points read will never see the 5 points you just awarded!
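
A small simulation shows the stale read changing an answer. The dict-based datastore and the key-passing function signatures are simplifications made for this sketch.

```python
import copy

DATASTORE = {"user:1": {"points": 98}}

def get(key):
    return copy.deepcopy(DATASTORE[key])

def put(key, entity):
    DATASTORE[key] = copy.deepcopy(entity)

def goodfunc(key):
    user_data_again = get(key)        # correct: a fresh GET before modifying
    user_data_again["points"] += 5
    put(key, user_data_again)

def oopsfunc(key, user_data):
    goodfunc(key)
    return user_data["points"] > 100  # stale read: still sees 98

user_data = get("user:1")
stale_answer = oopsfunc("user:1", user_data)
fresh_answer = get("user:1")["points"] > 100
print(stale_answer, fresh_answer)     # False True
```

Nothing is corrupted in the datastore here, which is why the put-time checks have nothing to flag; the bug is purely in what the caller reads afterwards.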


Two PUT's that should be in a transaction together.

This (not db data stomping) is the traditional motivation for using transactions. If you are modifying both a coach and student to teach each about the other, that should happen inside a transaction. We do nothing to check that you do.


Two new-entity PUT's with the same key at the same time.

If request A does MyModel(key='foo', value=1).put() and request B does MyModel(key='foo', value=2).put(), only one will win and the other will be thrown away.

App Engine provides get_or_insert(), which you can use in lieu of put() in situations where that is a concern. Note that this is only an issue if you explicitly specify a key param. Otherwise, unique keys are assigned automatically, and it's impossible for two new-entity put()'s to conflict.


You want A's GET - MODIFY - PUT to happen before B's, but B goes before A.

API X is a call that gives a user some points. API Y is a call that sees if a user has enough points for a particular badge, and awards it if so. You want to make sure, in your request, that API X is called before API Y. But while our code guarantees those two APIs won't update the user-data at the same time, nothing guarantees which one will run first. You have to enforce that ordering constraint in your own code.