Khan Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.

New data pipeline management platform at Khan Academy

by Ragini Gupta on April 30

Data is crucial to Khan Academy and is itself an internal product for the company. Analysts, engineers and marketers consume it daily. We have various systems for managing pipelines, which are self-contained processes that analyze some data: they consume some input and produce some output. The company currently has more than two hundred data pipelines, and the number keeps growing.

We realized that we needed an efficient way to manage our growing number of pipelines, and the result was… Khanalytics.

Life before Khanalytics

There was no single place where one could find all pipelines:

  • Some were tightly coupled to our website and lived in our cloud infrastructure along with the rest of our application.
  • Some were more manual and produced data through hand-run queries.
  • Some were R scripts that were run manually on local machines.
  • Some were hosted on a separate machine and run using a cron service.

We had complex pipelines with different stages that couldn’t talk to each other:

  • Some pipelines were complex enough to have their different stages written in different languages or use different tools.
  • We had no way to use R, Python, Google BigQuery, Google Cloud Dataflow and so on in the same pipeline and pass data between those stages.

The pipeline iteration process was slow:

  • Since most pipelines lived alongside the rest of the website’s code, changing an existing pipeline or creating a new one required many of the same steps as building user-facing functionality.
  • Scheduling even a single pipeline involved multiple steps every time.
  • Debugging pipeline failures was hard, as the logs were either hard to find or not informative enough.

It was too easy to introduce bugs:

  • Since most pipelines lived in the website codebase, a bug in pipeline code could cause errors across the entire website.

Khanalytics: a platform for data pipeline lifecycle management

Khanalytics is easiest to describe through its features:

  • Completely sandboxed environment for running batch jobs.
  • A user-friendly web interface designed for non-developers as well as developers.
  • Ability to parallelize different steps in a pipeline automatically.
  • Single place to find all logs for debugging.
  • Ability to schedule pipelines individually and also with dependency on other pipelines.
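Both automatic parallelization of steps and dependencies between pipelines come down to ordering a dependency graph. The following is only a minimal sketch of that idea using Python’s standard `graphlib` module, not Khanalytics’ actual implementation; the step names and the `run_batches` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each key depends on the steps listed in its set.
DEPENDENCIES = {
    "daily_active_users": {"raw_events_export"},
    "exercise_stats": {"raw_events_export"},
    "weekly_report": {"daily_active_users", "exercise_stats"},
}


def run_batches(dependencies):
    """Group steps into batches; steps within one batch could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock its dependents
    return batches


batches = run_batches(DEPENDENCIES)
```

Here `batches` holds three groups: the export on its own, then the two steps that depend only on it, then the report that needs both.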

Architecture

Khanalytics is built on a few core principles, described below.

Everything is a container: All steps in Khanalytics (called stages), as well as the core components of the application, run in containers and are therefore completely isolated from each other. We use Kubernetes on Google Kubernetes Engine to create a cluster and manage the deployment of our containers. We pre-build images for every kind of container (Python, BigQuery, R, etc.) and start each container from the relevant image. Since everything is a container, it’s easy to customize or extend the types of pipelines we want to support.

Statelessness: All state is kept in etcd, a distributed key-value store. The individual services read and write pipeline state through etcd, whether to start a new pipeline or to update the state of a running one. This improves the reliability of the application, as there are fewer communication paths.
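To illustrate why routing every state change through one store keeps services simple, here is a minimal sketch. It deliberately uses an in-memory stand-in rather than a real etcd client, and the key layout, state names and helper functions are all assumptions made for the example. The one property it relies on is an atomic compare-and-swap, which etcd provides through its transactions.

```python
import threading


class InMemoryStateStore:
    """Stand-in for etcd: a key-value store with an atomic compare-and-swap."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        """Set key to `new` only if it currently holds `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def state_key(pipeline):
    # Hypothetical key layout, chosen for this example only.
    return f"/pipelines/{pipeline}/state"


def request_run(store, pipeline):
    """A UI or CLI records that a pipeline should start (first run only here)."""
    return store.compare_and_swap(state_key(pipeline), None, "pending")


def claim_run(store, pipeline):
    """A runner claims a pending pipeline; only one claimant can succeed."""
    return store.compare_and_swap(state_key(pipeline), "pending", "running")


def finish_run(store, pipeline, succeeded):
    """A runner records the final state of a pipeline it was running."""
    final = "succeeded" if succeeded else "failed"
    return store.compare_and_swap(state_key(pipeline), "running", final)
```

In this sketch the services never call each other directly. Each one only reads or conditionally updates a key, so a crashed service can restart and pick up from whatever the store says.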

Static configuration: We store each pipeline’s configuration as static JSON. The configuration consists of the environment, the individual stages, inputs, outputs and intermediates. We allow variable interpolation in the environment variables so that dynamic values, such as the current date, are filled in at runtime. Because the configuration is static, we can lint it as soon as it’s created, which ensures that it is always valid. It also lets us know in advance what the pipeline does and how much of it can run in parallel.
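As a rough illustration of linting and interpolation on a static configuration, consider the sketch below. The JSON schema shown here (including the `reads`, `writes` and `type` fields and the `${current_date}` placeholder syntax) is an assumption made for the example; the real Khanalytics format is not shown in this post.

```python
import json
from datetime import date
from string import Template

# Hypothetical configuration; field names are assumptions for this example.
CONFIG_JSON = """
{
  "environment": {"RUN_DATE": "${current_date}", "DATASET": "analytics"},
  "inputs": ["raw_events"],
  "intermediates": ["clean_events"],
  "outputs": ["daily_summary"],
  "stages": [
    {"name": "clean", "type": "python",
     "reads": ["raw_events"], "writes": ["clean_events"]},
    {"name": "summarize", "type": "bigquery",
     "reads": ["clean_events"], "writes": ["daily_summary"]}
  ]
}
"""

REQUIRED_KEYS = ("environment", "inputs", "intermediates", "outputs", "stages")


def lint(config):
    """Return a list of problems; an empty list means no problems were found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if problems:
        return problems
    readable = set(config["inputs"])  # data available before any stage runs
    declared = readable | set(config["intermediates"]) | set(config["outputs"])
    # This sketch assumes stages are listed in the order they should run.
    for stage in config["stages"]:
        for name in stage["reads"]:
            if name not in readable:
                problems.append(f"stage {stage['name']} reads {name} before it is written")
        for name in stage["writes"]:
            if name not in declared:
                problems.append(f"stage {stage['name']} writes undeclared {name}")
        readable |= set(stage["writes"])
    return problems


def interpolate(environment, variables):
    """Fill ${name} placeholders in environment values at run time."""
    return {key: Template(value).substitute(variables)
            for key, value in environment.items()}


config = json.loads(CONFIG_JSON)
problems = lint(config)  # can run as soon as the config is created
env = interpolate(config["environment"],
                  {"current_date": date(2018, 4, 30).isoformat()})
```

Because nothing in the configuration is computed, `lint` can run without executing any stage, and only `interpolate` needs to wait until run time.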

Scheduling: Khanalytics relies on Kubernetes to schedule jobs and run them at a specified time, once or repeatedly. This is done with Kubernetes’ built-in cron feature, called CronJob.
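For readers unfamiliar with CronJob, the sketch below builds a CronJob manifest as a plain Python dictionary. The pipeline name, image and arguments are hypothetical, and the manifest Khanalytics actually generates may differ; the field layout follows the `batch/v1` CronJob resource in current Kubernetes releases.

```python
import json


def cron_job_manifest(pipeline, schedule, image, args):
    """Build a Kubernetes CronJob manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": f"pipeline-{pipeline}"},
        "spec": {
            "schedule": schedule,  # standard five-field cron syntax
            "concurrencyPolicy": "Forbid",  # skip a run if the previous one is still going
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": "runner", "image": image, "args": args}
                            ],
                        }
                    }
                }
            },
        },
    }


# Hypothetical pipeline that runs every day at 06:00.
manifest = cron_job_manifest(
    pipeline="daily-summary",
    schedule="0 6 * * *",
    image="example.invalid/pipeline-runner:latest",
    args=["--pipeline", "daily-summary"],
)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.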

Permissions: Khanalytics uses Google’s Identity-Aware Proxy (IAP). Once a user accessing Khanalytics is authorized, they log into the application using a service account, which in turn has pre-declared access to the external services that can be reached from Khanalytics.

User-interface

While pipelines can be accessed and managed with a command-line tool, a user-friendly UI lets non-developers create, manage and view their pipelines.

A picture of the list of completed pipelines

A picture of the logs of a pipeline

Impact

Khanalytics started as a hackathon project at Khan Academy and became our single, go-to place for managing analytics pipelines. Beyond solving the problems it was created for, it has had a broader impact. It empowers non-developers to create data pipelines for things that weren’t priorities for the engineering team. We have seen its usage increase and the time spent on debugging and monitoring decrease. It has also promoted reuse of data and pipeline stages, because pipelines are now more visible and easier to understand.

What next?

We are significantly improving the platform’s reliability and usability, as well as its ability to meet all of Khan Academy’s needs. We have a lot of work ahead of us in migrating all our analytics pipelines to Khanalytics, and the migration itself helps us test and improve Khanalytics further. We don’t think we are far from Khanalytics being the single solution for data pipelining at Khan Academy, or even beyond, as we are considering open sourcing Khanalytics for others to use.

Many thanks to Colin Fuller, Kevin Dangoor and Tom Yedwab for their review of this post, and a big shout-out to everyone who has worked on Khanalytics or provided feedback to help make it what it is today.