Trace-based Just-in-Time Type Specialization for Dynamic Languages

Andreas Gal∗+, Brendan Eich∗, Mike Shaver∗, David Anderson∗, David Mandelin∗, Mohammad R. Haghighat$, Blake Kaplan∗, Graydon Hoare∗, Boris Zbarsky∗, Jason Orendorff∗, Jesse Ruderman∗, Edwin Smith#, Rick Reitmaier#, Michael Bebenita+, Mason Chang+#, Michael Franz+

∗Mozilla Corporation, {gal,brendan,shaver,danderson,dmandelin,mrbkap,graydon,bz,jorendorff,jruderman}@mozilla.com
#Adobe Corporation, {edwsmith,rreitmai}@adobe.com
$Intel Corporation, {mohammad.r.haghighat}@intel.com
+University of California, Irvine, {mbebenit,changm,franz}@uci.edu
Abstract

Dynamic languages such as JavaScript are more difficult to compile than statically typed ones. Since no concrete type information is available, traditional compilers need to emit generic code that can handle all possible type combinations at runtime. We present an alternative compilation technique for dynamically-typed languages that identifies frequently executed loop traces at run-time and then generates machine code on the fly that is specialized for the actual dynamic types occurring on each path through the loop. Our method provides cheap inter-procedural type specialization, and an elegant and efficient way of incrementally compiling lazily discovered alternative paths through nested loops. We have implemented a dynamic compiler for JavaScript based on our technique and we have measured speedups of 10x and more for certain benchmark programs.
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors — Incremental compilers, code generation.

General Terms Design, Experimentation, Measurement, Performance.

Keywords JavaScript, just-in-time compilation, trace trees.
1. Introduction

Dynamic languages such as JavaScript, Python, and Ruby are popular since they are expressive, accessible to non-experts, and make deployment as easy as distributing a source file. They are used for small scripts as well as for complex applications. JavaScript, for example, is the de facto standard for client-side web programming
and is used for the application logic of browser-based productivity applications such as Google Mail, Google Docs and Zimbra Collaboration Suite. In this domain, in order to provide a fluid user experience and enable a new generation of applications, virtual machines must provide a low startup time and high performance.
Compilers for statically typed languages rely on type information to generate efficient machine code. In a dynamically typed programming language such as JavaScript, the types of expressions may vary at runtime. This means that the compiler can no longer easily transform operations into machine instructions that operate on one specific type. Without exact type information, the compiler must emit slower generalized machine code that can deal with all potential type combinations. While compile-time static type inference might be able to gather type information to generate optimized machine code, traditional static analysis is very expensive and hence not well suited for the highly interactive environment of a web browser.
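For illustration, consider how a single JavaScript operator maps to different machine-level operations depending on the runtime types of its operands (a minimal example of our own; the function add is hypothetical):

// One '+' operator, three different machine-level operations,
// selected by the runtime types of the operands.
function add(a, b) {
  return a + b;
}

add(1, 2);          // integer addition        -> 3
add(0.5, 0.25);     // floating-point addition -> 0.75
add("foo", "bar");  // string concatenation    -> "foobar"

// Without type information, the compiler must emit code that checks
// the operand types and dispatches among these cases on every call.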
We present a trace-based compilation technique for dynamic languages that reconciles speed of compilation with excellent performance of the generated machine code. Our system uses a mixed-mode execution approach: the system starts running JavaScript in a fast-starting bytecode interpreter. As the program runs, the system identifies hot (frequently executed) bytecode sequences, records them, and compiles them to fast native code. We call such a sequence of instructions a trace.
Unlike method-based dynamic compilers, our dynamic compiler operates at the granularity of individual loops. This design choice is based on the expectation that programs spend most of their time in hot loops. Even in dynamically typed languages, we expect hot loops to be mostly type-stable, meaning that the types of values are invariant. (12) For example, we would expect loop counters that start as integers to remain integers for all iterations. When both of these expectations hold, a trace-based compiler can cover the program execution with a small number of type-specialized, efficiently compiled traces.
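For instance, the following minimal examples of our own contrast a type-stable loop with a type-unstable one:

// Type-stable: i and sum hold integer values on every iteration, so
// one type-specialized trace can cover the whole loop.
var sum = 0;
for (var i = 0; i < 1000; ++i)
  sum += i;

// Type-unstable: x starts as an integer but ends as a string, so no
// single type mapping holds for all iterations.
var x = 0;
for (var j = 0; j < 10; ++j)
  x = (j < 9) ? x + 1 : "done";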
Each compiled trace covers one path through the program with one mapping of values to types. When the VM executes a compiled trace, it cannot guarantee that the same path will be followed or that the same types will occur in subsequent loop iterations. Hence, recording and compiling a trace speculates that the path and typing will be exactly as they were during recording for subsequent iterations of the loop.

Every compiled trace contains all the guards (checks) required to validate the speculation. If one of the guards fails (if control flow is different, or a value of a different type is generated), the trace exits. If an exit becomes hot, the VM can record a branch trace starting at the exit to cover the new path. In this way, the VM records a trace tree covering all the hot paths through the loop.
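The following sketch suggests the shape of such a trace for the inner loop of Figure 1. It is illustrative JavaScript pseudocode only: real traces are compiled from LIR to native code, and the helpers isInt and sideExit are hypothetical names for the guard and exit machinery:

function isInt(v) { return Number.isInteger(v); }   // hypothetical type guard
function sideExit(state) { return state; }          // hypothetical: resume in the interpreter

// Illustrative shape of a compiled trace for the inner loop of Figure 1.
function innerLoopTrace(state) {
  for (;;) {
    // Guards validate the speculated types and path; a failing guard
    // side-exits back to the interpreter.
    if (!isInt(state.k) || !isInt(state.i)) return sideExit(state);
    if (!(state.k < 100)) return sideExit(state);   // loop-exit guard

    state.primes[state.k] = false;   // type-specialized body: no dispatch
    state.k += state.i;              // integer addition, no type checks
  }
}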
Nested loops can be difficult to optimize for tracing VMs. In a naïve implementation, inner loops would become hot first, and the VM would start tracing there. When the inner loop exits, the VM would detect that a different branch was taken. The VM would try to record a branch trace, and find that the trace reaches not the inner loop header, but the outer loop header. At this point, the VM could continue tracing until it reaches the inner loop header again, thus tracing the outer loop inside a trace tree for the inner loop. But this requires tracing a copy of the outer loop for every side exit and type combination in the inner loop. In essence, this is a form of unintended tail duplication, which can easily overflow the code cache. Alternatively, the VM could simply stop tracing, and give up on ever tracing outer loops.
We solve the nested loop problem by recording nested trace trees. Our system traces the inner loop exactly as the naïve version does. The system stops extending the inner tree when it reaches an outer loop, but then it starts a new trace at the outer loop header. When the outer loop reaches the inner loop header, the system tries to call the trace tree for the inner loop. If the call succeeds, the VM records the call to the inner tree as part of the outer trace and finishes the outer trace as normal. In this way, our system can trace any number of loops nested to any depth without causing excessive tail duplication.
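Schematically, and reusing the hypothetical helpers from the sketch above, an outer-loop trace for Figure 1 invokes the inner loop's entire trace tree as a single unit:

// Illustrative shape of an outer-loop trace for Figure 1: the inner
// loop's whole trace tree is called as one unit, so the outer trace
// never duplicates the inner loop's code.
function outerLoopTrace(state) {
  for (;;) {
    if (!isInt(state.i)) return sideExit(state);    // type guard
    if (!(state.i < 100)) return sideExit(state);   // loop-exit guard
    if (state.primes[state.i]) {     // in a real trace this branch is also guarded
      state.k = state.i + state.i;
      innerLoopTrace(state);         // nested call into the inner trace tree
    }
    state.i += 1;
  }
}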
These techniques allow a VM to dynamically translate a program to nested, type-specialized trace trees. Because traces can cross function call boundaries, our techniques also achieve the effects of inlining. Because traces have no internal control-flow joins, they can be optimized in linear time by a simple compiler (10). Thus, our tracing VM efficiently performs the same kind of optimizations that would require interprocedural analysis in a static optimization setting. This makes tracing an attractive and effective tool to type specialize even complex function call-rich code.
We implemented these techniques for an existing JavaScript interpreter, SpiderMonkey. We call the resulting tracing VM TraceMonkey. TraceMonkey supports all the JavaScript features of SpiderMonkey, with a 2x-20x speedup for traceable programs.

This paper makes the following contributions:
• We explain an algorithm for dynamically forming trace trees to cover a program, representing nested loops as nested trace trees.

• We explain how to speculatively generate efficient type-specialized code for traces from dynamic language programs.

• We validate our tracing techniques in an implementation based on the SpiderMonkey JavaScript interpreter, achieving 2x-20x speedups on many programs.
The remainder of this paper is organized as follows. Section 3 is a general overview of the trace-tree-based compilation technique we use to capture and compile frequently executed code regions. In Section 4 we describe our approach of covering nested loops using a number of individual trace trees. In Section 5 we describe the speculative type specialization approach we use to generate efficient machine code from recorded bytecode traces. Our implementation of a dynamic type-specializing compiler for JavaScript is described in Section 6. In Section 7 we evaluate our dynamic compiler on a set of industry benchmarks, and related work is discussed in Section 8. The paper ends with conclusions in Section 9 and an outlook on future work in Section 10.
1  for (var i = 2; i < 100; ++i) {
2      if (!primes[i])
3          continue;
4      for (var k = i + i; k < 100; k += i)
5          primes[k] = false;
6  }

Figure 1. Sample program: sieve of Eratosthenes. primes is initialized to an array of 100 true values on entry to this code snippet.
[Figure 2: a state machine with the activities Interpret Bytecodes, Monitor, Record LIR Trace, Compile LIR Trace, Enter Compiled Trace, Execute Compiled Trace, and Leave Compiled Trace. Transition conditions include: loop edge, hot loop/exit, abort recording, finish at loop header, cold/blacklisted loop/exit, compiled trace ready, loop edge with same types, side exit to existing trace, and side exit with no existing trace. The symbol key distinguishes Overhead, Interpreting, and Native activities.]
Figure 2. State machine describing the major activities of TraceMonkey and the conditions that cause transitions to a new activity. In the dark box, TM executes JS as compiled traces. In the light gray boxes, TM executes JS in the standard interpreter. White boxes are overhead. Thus, to maximize performance, we need to maximize time spent in the darkest box and minimize time spent in the white boxes. The best case is a loop where the types at the loop edge are the same as the types on entry; then TM can stay in native code until the loop is done.
2. Overview: Example Tracing Run

This section provides an overview of our system by describing how TraceMonkey executes an example program. The example program, shown in Figure 1, computes the first 100 prime numbers with nested loops. The narrative should be read along with Figure 2, which describes the activities TraceMonkey performs and when it transitions between them.
TraceMonkey always begins executing a program in the bytecode interpreter. Every loop back edge is a potential trace point. When the interpreter crosses a loop edge, TraceMonkey invokes the trace monitor, which may decide to record or execute a native trace. At the start of execution, there are no compiled traces yet, so the trace monitor counts the number of times each loop back edge is executed until a loop becomes hot, currently after 2 crossings. Note that the way our loops are compiled, the loop edge is crossed before entering the loop, so the second crossing occurs immediately after the first iteration.
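A minimal sketch of this counting policy, in illustrative JavaScript with hypothetical names (the real monitor lives inside the VM and tracks considerably more state):

// Illustrative sketch of the trace monitor's hot-loop detection;
// HOT_LOOP_THRESHOLD, loopHits, and compiledTraces are hypothetical names.
var HOT_LOOP_THRESHOLD = 2;       // a loop becomes hot after 2 crossings
var loopHits = new Map();         // loop header -> back-edge crossing count
var compiledTraces = new Map();   // loop header -> compiled native trace

function onLoopEdge(loopHeader) {
  if (compiledTraces.has(loopHeader))
    return "enter-compiled-trace";
  var hits = (loopHits.get(loopHeader) || 0) + 1;
  loopHits.set(loopHeader, hits);
  return hits >= HOT_LOOP_THRESHOLD ? "start-recording" : "keep-interpreting";
}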
Here is the sequence of events broken down by outer loop iteration: