KA Engineering

We're the engineers behind Khan Academy. We're building a free, world-class education for anyone, anywhere.

Automating Highly Similar Translations

by Kevin Barabash on February 15

Khan Academy is available in 12 languages and is in the process of being translated into many more. We also have a lot of content (videos, articles, exercises, etc.) that needs to be translated into all of those languages. To help translators find the highest-priority items to work on, we have a translator dashboard.

We recently redesigned this dashboard. The main goal of this work was to improve translator efficiency. We accomplished this by ensuring that the translation status on items was up to date and that items were organized in a way that made sense to translators as opposed to how items were stored in the database.

Old Dashboard [screenshot: old dashboard]

New Dashboard [screenshot: new dashboard]

In addition to the dashboard, we also created a tool to help with doing the translations themselves. The tool features different views of our content that translators can quickly switch between depending on their workflow. It also includes a feature called smart translations, which can be used to automate some of the translation work.

Before explaining how smart translations works, it's helpful to understand the problem it's trying to solve.

On Khan Academy we have lots of exercises. Initially we used a tool called khan-exercises to auto-generate questions (along with answers and hints). Over time we noticed limitations in the types of questions that could be auto-generated. It was also difficult for content creators and translators to work with. It was eventually replaced with another tool, perseus, which empowers content creators to write specific question variants so that a skill is fully covered, instead of auto-generating random ones.

As a result we have many exercises with lots of very similar strings that need to be translated, e.g.

Simplify $9/12$.
Simplify $8/6$.
Simplify $15/3$.
...

How it works

The process can be broken down into the following steps:

  1. Group English strings which differ only in places that don't contain any natural language text, such as formulas. "Simplify $9/12$" is grouped with "Simplify $8/6$" but not with "Square $3/4$".
  2. Within each group, check to see if any of the English strings are already translated. If they are, create a template that can be used to translate the rest of the strings in that group. If we know that "Simplify $9/12$" translates to "Implifysay $9/12$" then we can guess that "Simplify $8/6$" will translate to "Implifysay $8/6$".
  3. Update the UI to show how many strings in a group can be translated based on the groups that have translation templates.
  4. When a user clicks "Add smart translations" we use the translation template to generate suggestions for the untranslated strings in the group.

Here's a quick video of what a user sees when using smart translations:

Implementation Details

The library that implements the grouping, template creation, and translation generation is available at Khan/translation-assistant.

Grouping

To better understand the problem let's look at an example string:

"Solve for $x$.  $x - 5 = 10$"

This string is made up of some natural language (NL) text and some non-natural language (non-NL) text. In this case "Solve for " and ". " are NL text while "$x$" and "$x - 5 = 10$" are non-NL text.

As long as strings only differ by their non-NL text we could use the translation for one string as a template and then just swap out the non-NL text. We group strings by replacing all non-NL text with placeholders, e.g.

"Solve for $x$.  $x - 5 = 10$"
"Solve for $m$.  $2m + 3 = 7$"
"Solve for $p$.  $12 = p + 6$"

map to:

"Solve for __MATH__.  __MATH__"

The strings with placeholders are used as keys to a dictionary where each value is an array containing objects that can be used to access the English strings and translated strings as they're added.
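The grouping step can be sketched as follows. This is a minimal illustration, not the code in Khan/translation-assistant: the names `placeholderKey` and `groupStrings`, the item shape `{englishStr, translatedStr}`, and the regex (which assumes math is delimited by single dollar signs with no escaped `$` inside) are all assumptions made for the example.

```javascript
// Assumption: math is delimited by single "$" characters and never
// contains an escaped "$". The real library may tokenize differently.
const MATH_REGEX = /\$[^$]*\$/g;

// Replace every piece of math with a placeholder to build the group key.
function placeholderKey(str) {
    return str.replace(MATH_REGEX, "__MATH__");
}

// Group items by placeholder key. Each item is a hypothetical object of
// the form {englishStr, translatedStr}, with translatedStr set to null
// until a translation exists.
function groupStrings(items) {
    const groups = new Map();
    for (const item of items) {
        const key = placeholderKey(item.englishStr);
        if (!groups.has(key)) {
            groups.set(key, []);
        }
        groups.get(key).push(item);
    }
    return groups;
}
```

A `Map` is used here in place of a plain object so that unusual keys cannot collide with built-in property names; either works as the "dictionary" described above.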

Creating/Applying Templates

The translation template contains two things:

  • a translated string with all non-NL text replaced with placeholders
  • a mapping between where each piece of math appears in the translated string and where it came from in the English string

We need this mapping because words can be re-ordered or repeated depending on the grammar of the target language, e.g.

"Solve for $x$.  $x - 5 = 10$" 
=> "$x$ orfay olvesay $x$.  $x - 5 = 10$"

In this case the template should look like this:

{
    tmplStr: "__MATH__ orfay olvesay __MATH__.  __MATH__",
    mapping: [0, 0, 1]
}

The mapping is somewhat terse: each index in the array corresponds to a __MATH__ placeholder in the translated string, and the value at that index identifies which piece of math in the English string fills it. In this case the first piece of math appears twice, followed by the second piece once.
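One way such a template could be derived from an already-translated pair is sketched below. The function name `createTemplate` and the dollar-sign regex are assumptions for illustration, and the sketch matches math by exact string equality, so it picks the first occurrence when the English string contains the same piece of math more than once.

```javascript
// Assumption: math is delimited by single "$" characters.
const MATH_REGEX = /\$[^$]*\$/g;

// Build a template from an English string and its translation.
function createTemplate(englishStr, translatedStr) {
    const englishMaths = englishStr.match(MATH_REGEX) || [];
    const translatedMaths = translatedStr.match(MATH_REGEX) || [];

    // For each piece of math in the translation, record the index of the
    // identical piece of math in the English string.
    const mapping = translatedMaths.map((math) => englishMaths.indexOf(math));
    if (mapping.includes(-1)) {
        throw new Error("Translated math does not appear in the English string");
    }

    return {
        tmplStr: translatedStr.replace(MATH_REGEX, "__MATH__"),
        mapping: mapping,
    };
}
```

Calling `createTemplate("Solve for $x$.  $x - 5 = 10$", "$x$ orfay olvesay $x$.  $x - 5 = 10$")` yields the template object shown above.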

In order to generate a new translation, we just need to extract the bits of math from a new English string (in the same group) and then apply the mapping to make sure that the math ends up in the right place.

"Solve for $m$.  $2m + 3 = 7$"
=> maths = ["$m$", "$2m + 3 = 7$"]

"__MATH__ orfay olvesay __MATH__.  __MATH__", [0, 0, 1]
=> "$m$ orfay olvesay $m$.  $2m + 3 = 7$"
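The worked example above can be expressed as a short function. This is a sketch under the same assumptions as before (dollar-delimited math, a hypothetical function name `applyTemplate`), not the library's actual API.

```javascript
// Assumption: math is delimited by single "$" characters.
const MATH_REGEX = /\$[^$]*\$/g;

// Fill a template's placeholders with math taken from a new English string
// in the same group.
function applyTemplate(template, englishStr) {
    const maths = englishStr.match(MATH_REGEX) || [];
    let i = 0;
    // A replacer function is used so that "$" characters in the math are
    // inserted literally rather than read as replacement patterns.
    return template.tmplStr.replace(
        /__MATH__/g,
        () => maths[template.mapping[i++]]
    );
}
```

The replacer function matters: passing the math as a plain replacement string would let sequences such as `$1` or `$&` be interpreted by `String.prototype.replace`.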

Text in Math

Some of our content contains NL text inside \text{} blocks that are inside math. We'd like to be able to automatically translate the strings within the \text{} blocks in the following way:

"Find *red* if $\text{red} - 5 = 10$?"
=> "Indfay *edray* fiay $\text{edray} - 5 = 10$?"

To do so, we have to modify our original approach to differentiate between math containing \text{red} and math containing other \text{} strings. Instead of simply using the English string with non-NL text replaced as the key, we also include a list of the strings from within each of the \text{} blocks. The key is actually a stringified version of an object that looks like this:

{
    str: "Find *red* if __MATH__?",
    texts: ["red"]
}
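A key of this shape could be computed roughly as follows. The name `groupKey` and both regexes are assumptions for the sketch; in particular it assumes \text{} blocks are not nested and keeps the strings in order of appearance, which may differ from what the library does.

```javascript
// Assumptions: math is delimited by single "$" characters, and \text{...}
// blocks contain no nested braces.
const MATH_REGEX = /\$[^$]*\$/g;
const TEXT_REGEX = /\\text\{([^}]*)\}/g;

// Build the stringified grouping key: the placeholder string plus the list
// of strings found inside \text{} blocks within the math.
function groupKey(englishStr) {
    const texts = [];
    for (const math of englishStr.match(MATH_REGEX) || []) {
        for (const match of math.matchAll(TEXT_REGEX)) {
            texts.push(match[1]);
        }
    }
    return JSON.stringify({
        str: englishStr.replace(MATH_REGEX, "__MATH__"),
        texts: texts,
    });
}
```

With this key, "Find *red* if $\text{red} - 5 = 10$?" and "Find *red* if $2 = 8 - \text{red}$?" land in the same group, while a string using \text{blue} does not.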

We also create a mapping between English \text{} strings and translated ones. In this case that mapping would look like this:

{ "red": "edray" }

When the translation assistant is suggesting translations containing \text{} blocks it must perform an extra step when replacing the __MATH__ placeholders in the translated string. It must update the strings within the \text{} blocks, e.g.

// text to translate
"Find *red* if $2 = 8 - \text{red}$?"

// insert LaTeX into template translation
"Indfay *edray* fiay __MATH__?"
=> "Indfay *edray* fiay $2 = 8 - \text{red}$?"

// replace strings inside of \text{}
"Indfay *edray* fiay $2 = 8 - \text{red}$?"
=> "Indfay *edray* fiay $2 = 8 - \text{edray}$?"
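The two steps above can be sketched together. The names `translateTexts` and `suggestTranslation` are hypothetical, and for brevity the sketch assumes the math appears in the same order in both languages (an identity mapping) and translates each \text{} block as the math is inserted, which gives the same result as inserting first and replacing afterwards.

```javascript
// Assumptions: math is delimited by single "$" characters, and \text{...}
// blocks contain no nested braces.
const MATH_REGEX = /\$[^$]*\$/g;
const TEXT_REGEX = /\\text\{([^}]*)\}/g;

// Translate the contents of each \text{} block using textMapping,
// leaving blocks with no known translation untouched.
function translateTexts(math, textMapping) {
    return math.replace(TEXT_REGEX, (whole, text) =>
        Object.prototype.hasOwnProperty.call(textMapping, text)
            ? `\\text{${textMapping[text]}}`
            : whole
    );
}

// Insert math from englishStr into the translated template string,
// translating \text{} blocks along the way.
function suggestTranslation(tmplStr, englishStr, textMapping) {
    const maths = (englishStr.match(MATH_REGEX) || []).map(
        (math) => translateTexts(math, textMapping)
    );
    let i = 0;
    return tmplStr.replace(/__MATH__/g, () => maths[i++]);
}
```

For example, `suggestTranslation("Indfay *edray* fiay __MATH__?", "Find *red* if $2 = 8 - \\text{red}$?", {red: "edray"})` produces the final string shown above.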

Conclusion

Although the examples only contain math, our exercise strings can also contain links to images or widget placeholders for things like text fields, multiple choice answers, or interactive graphs. Smart translations handles these non-NL text items in much the same way.

There are some limitations with this approach. Namely, it doesn't handle plurals correctly. Translators still have to proofread the translations, but it definitely takes the tedious busy work of copy/paste out of the equation.

Also, if the translator makes a mistake in the initial translation and clicks "Add smart translations", that error will be duplicated across the group. Luckily, it's just as easy to fix mistakes as it is to make them.

We received lots of positive feedback from our translators on this feature. Here are a couple of quotes:

  • ...Smart [Translations] helps us a lot (and it is fun to see the progress). I like to feel real and fast progress, and still have the control over the strings.
  • They save a lot of time, requiring only a quick proofreading to guarantee they are correct.