Documentation for numba-nextgen
This document describes the core numba language, which is designed to generate efficient code for general, pythonic programs. It will allow us to implement most of the features that currently reside in the compiler directly in a runtime. On top of this small core language we can write more advanced features such as subtype polymorphism through method tables.
I believe we need the following features:
- Methods on user-defined types with specified representations (structs or otherwise)
- Careful control over allocation, mutability and ownership
- Polymorphism: generic functions, traits, overloading
  - subtyping and inheritance are left to a runtime implementation
  - dynamic dispatch for traits is left to a runtime implementation; static dispatch only requires some type checking support
- User-defined typing rules
- Careful control over inlining, unrolling and specialization
- Array-oriented computing: map/reduce/scan/etc.
- Extension of the code generator
Support for multi-stage programming would be nice, but is considered a bonus and deferred to external tools like macropy or mython for now. The control over optimizations likely provides enough functionality to generate good code.
This describes a closed environment with an optionally static, inferred language. Static typing will help provide better error messages, and can prevent unintended use.
Polymorphism is provided through:
- generic (monomorphized) functions (like C++ templates)
- overloading
- traits (like interfaces)
- subtyping ("python classes")
This language's goals are ultimate control over performance, and a well-defined, easily understood subset that works on the GPU.
This language is inspired by the following languages: Rust, Terra, RPython, Julia, Parakeet, mypy, copperhead. The traits are very similar to Rust's traits, and are related to type classes in Haskell and interfaces in Go.
However, Go interfaces do not allow type-based specialization, and hence need runtime type tagging and method dispatch through vtables. Type conversion between interfaces needs a runtime check (and new vtables built at those points, if not cached). Compile-time overloading is precluded. In Go, interfaces specify what something can do, as opposed to what something can be. This can be useful in a few situations, but it means we cannot constrain what things can be (e.g. any numeric type).
In julia we can constrain the types we operate over, which happens through subtyping. E.g.:
julia> Int <: Integer
true
julia> Int <: Real
true
julia> Int <: FloatingPoint
false
So we can define a function which operates only over numbers:
julia> function f(x :: Number)
           return x * x
       end
Here's the generated code when `x` is an `Int`:
julia> disassemble(f, (Int,))
define %jl_value_t* @f618(%jl_value_t*, %jl_value_t**, i32) {
top:
%3 = load %jl_value_t** %1, align 8, !dbg !5256
%4 = getelementptr inbounds %jl_value_t* %3, i64 0, i32 0, !dbg !5256
%5 = getelementptr %jl_value_t** %4, i64 1, !dbg !5256
%6 = bitcast %jl_value_t** %5 to i64*, !dbg !5256
%7 = load i64* %6, align 8, !dbg !5256
%8 = mul i64 %7, %7, !dbg !5263
%9 = call %jl_value_t* @jl_box_int64(i64 %8), !dbg !5263
ret %jl_value_t* %9, !dbg !5263
}
Disassembling with `Number` generates a much larger chunk of code, which uses boxed values and ultimately (runtime) multiple dispatch of the `*` function:
%15 = call %jl_value_t* @jl_apply_generic(%jl_value_t* inttoptr (i64 4316726176 to %jl_value_t*), %jl_value_t** %.sub, i32 2), !dbg !5191
However, since the implementation of a function is specialized for the supertype, it doesn't know the concrete subtype. Type inference can help prevent these situations and use subtype-specialized code. However, it's very easy to make it generate slow code:
julia> function g(c)
           if c > 2
               x = 2
           else
               x = 3.0
           end
           return f(x)
       end
julia> disassemble(g, (Bool,))
This prints a large chunk of LLVM code (using boxed values), since we are unifying an Int with a Float. Using both ints, or both floats however leads to very efficient code.
What we want in our language is full control over specialization and memory allocation, and easily-understood semantics for what works on the GPU and what doesn't. The following sections will detail how the above features will get us there.
We want to support user-defined types with:
- control over representation
- (special) methods
- control over mutability
- control over stack- vs gc-allocation
User-defined types do not support inheritance, which is left to a runtime implementation. This means that the callees of call-sites are static, and can be called directly. This further means they can be inlined (something we will exploit).
This means that we can even write the most performance-critical parts of our runtime in this way. The compiler needs to support the following types natively:
- int
- float
- pointer
- struct (with optional methods and properties)
- union
- array (constant size)
Anything else is written in the runtime:
- range
- complex
- array
- string/unicode
- etc
This means we can easily experiment with different data representations and extend functionality. For instance we can wrap and override the native integer multiply to check for overflow, and raise an exception or issue a warning, or convert to a BigInt.
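As a sketch of what such a wrapper could look like (this assumes the layout and signature machinery described below; the class itself and `checked_mul` are hypothetical, not an existing numba API):

@jit
class CheckedInt(object):
    layout = Struct([('value', 'Int')])

    @signature('CheckedInt -> CheckedInt')
    def __mul__(self, other):
        # checked_mul is assumed to lower to a multiply-with-overflow
        # intrinsic returning the product and an overflow flag
        result, overflow = checked_mul(self.value, other.value)
        if overflow:
            raise OverflowError("integer multiply overflowed")
        return CheckedInt(result)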
Type representation can be specified through a type 'layout':
@jit
class Array(object):
    layout = Struct([('data', 'Char *')])
Each individual field can be specified to be immutable, or all can be specified immutable through a decorator:
@jit(immutable=True)
class Array(object):
    ...
If all fields are immutable, the object can be stack allocated. Unless manually specified with `stack=True`, the compiler is free to decide where to allocate the object. This decision may differ depending on the target (CPU or GPU).
The `Array` above can be stack-allocated since its fields are immutable, even though the contained data may not be.
If data is mutable, it is allocated on the heap. This means that allocation of such an object is incompatible with a GPU code generator. Hence, data structures like Arrays must be passed in from the host, and things like Lists are not supported. However, one can write a List implementation with static size that supports appending a bounded number of objects.
We disallow explicit stack allocation for mutable types for the following reason:
x = mutable() # stack allocate
y = x # copy x into y
y.value = 1 # update y.value, which does not affect x.value
To make this work one would need to track the lifetimes of the object itself and all the variables the object is written into, at which point we defer you to the Rust programming language. We leave stack allocation of mutable objects purely as a compile-time optimization.
Destructors are supported only for heap-allocated types, irrespective of mutability. If a __del__ method is implemented, the object will be automatically heap-allocated (unless escape analysis can say otherwise).
Ownership is tied to mutability:
- Data is owned when (recursively) immutable
- Data is shared when it, or some field is mutable (recursively)
Owned data may be sent over a channel to another thread or task. Shared data cannot be sent, unless explicitly marked as a safe operation:
channel.send(borrow(x))
The user must guarantee that 'x' stays alive while it is consumed. This is useful for things like parallel computation on arrays.
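A hypothetical usage sketch (`channel`, `spawn` and array slicing are assumed names here; only `borrow` comes from the text above): the parent guarantees `data` outlives both tasks, and each task reduces its borrowed half:

def partial_sum(ch, xs):
    total = 0
    for x in xs:
        total += x
    ch.send(total)

def parallel_sum(data):
    ch = channel()
    mid = len(data) // 2
    spawn(partial_sum, ch, borrow(data[:mid]))
    spawn(partial_sum, ch, borrow(data[mid:]))
    return ch.recv() + ch.recv()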
User-defined types are parameterizable:
@jit('Array[Type dtype, Int ndim]')
class Array(object):
    ...
Parameters can be types or values of builtin type int. This allows specialization for values, such as the dimensionality of an array:
@jit('Array[Type dtype, Int ndim]')
class Array(object):
    layout = Struct([('data', 'Char *'), ('strides', 'Tuple[Int, ndim]')])

    @signature('Tuple[Int, ndim] -> T')
    def __getitem__(self, indices):
        ...
This specifies that we take a `Tuple` of `Int`s of size `ndim` as argument, and return an item of type `T`. The `T` and `ndim` are resolved as type parameters, which means they specify concrete types in the method signature.
The type can now be used as follows:
myarray = Array[Double, 2]()
This will mostly appear in (numba) library code, and not in user-written code, which uses higher-level APIs that ultimately construct these types. E.g.:
@overload(np.ndarray)
def typeof(array):
    return Array[typeof(array.dtype), array.ndim]

@overload(np.dtype)
def typeof(dtype):
    return { np.double: Double, ... }[dtype]
Supported forms of polymorphism are generic functions, overloading, traits, and subtyping/inheritance.
Generic functions are like `@autojit`: they provide specialized code for each unique combination of input types. They may be optionally typed and constrained (through traits).
@jit('(a -> b) -> [a] -> [b]')
def map(f, xs):
    ...
This specifies a map implementation that is specialized for each combination of type instances for type variables `a` and `b`. Type variables may be further constrained through traits, in a similar way to Rust's traits (http://static.rust-lang.org/doc/tutorial.html#traits), allowing you to operate for instance only on arrays of numbers, or arrays of floating-point values.
Traits specify an interface that value instances implement. Similarly to Rust's traits and Haskell's type classes, they are a form of bounded polymorphism, allowing users to constrain type variables ("this function operates on floating point values only").
They also specify a generic interface that objects can implement. Classes can declare they belong to a certain trait, allowing any instance of the class to be used through the trait:
@jit('(a -> b) -> Iterable[a] -> [b]')
def map(f, xs):
    ...
Our map now takes an iterable and returns a list. Written this way, a single map implementation works for any iterable, and any value implementing the Iterable trait can be used:
@jit('Array[Type dtype, Int ndim]')
class Array(Iterable['dtype']):
    ...
We can now use map() over our array. The generated code must now insert a conversion between `Array[dtype, ndim]` and the trait `Iterable[dtype]`, which concretely means packing up a vtable pointer and a boxed Array pointer. This form of polymorphism will likely be incompatible with the GPU backend. However, we can still use our generic functions by telling the compiler to specialize on input types:
@specialize.argtypes('f', 'xs')
@jit('(a -> b) -> Iterable[a] -> [b]')
def map(f, xs):
    ...
Alternatively, we can allow them to simply constrain type variables, and not actually specify the type as the trait. The type is supplied instead by the calling context:
@signature('(it:Iterable[a]) => (a -> b) -> it -> [b]')
def map(f, xs):
    ...
The constraints are specified in a similar way to Haskell's type classes. The only implementation required in the compiler to support this is the type checking feature; otherwise it's entirely the same as the generic functions above. Multiple constraints can be expressed, e.g. `(it:Iterable[a], a:Integral)`. Alternative syntax could be `'(a -> b) -> lst : Iterable[a] -> [b]'`, but this is less clear when `it` is reused elsewhere as a type variable.
Traits can further use inheritance and have default implementations. This can be trivially implemented at the Python level, requiring no special knowledge in the compiler.
These mechanisms provide compile-time selection for our language. It is required to support the compiled `convert` from section 3, and necessary for many implementations, e.g.:
@jit('Int -> Int')
def int(x):
    return x

@jit('String -> Int')
def int(x):
    return parse_int(x)
Overloading is also provided for methods:
@jit
class SomeNeatClass(object):

    @signature('Int -> Int')
    def __add__(self, other):
        return self.value + other

    @signature('String -> String')
    def __add__(self, other):
        return str(self.value) + other
We further need a way to "overload" Python functions, to provide alternative implementations or to type them. We can easily provide implementations for all builtins:
pytypedef(builtins.int, int)
I think Julia does really well here. Analogously we define three functions:
- typeof(pyobj) -> Type
- convert(Type, Value) -> Value
- unify(Type, Type) -> Type
The `convert` function may make sense as a method on the objects instead, which is more pythonic, e.g. `__convert__`. `unify` does not really make sense as a method, since it belongs to neither of the two arguments.
Unify takes two types and returns the result type of the given types. This result type can be specified by the user. For instance, we may determine that `unify(Int, Float)` is `Union(Int, Float)`, or that it is `Float`. The union will give the same result as Python would, but it is also more expensive in terms of the operations used on it (and potentially storage capacity). Unify is used on types only at control flow merge points.
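As a minimal sketch of the three hooks for a hypothetical `Rational` type (the registration mechanism is elided; only the typeof/convert/unify trio comes from the text above):

def typeof(value):
    # map a runtime value to its numba type
    return Rational

def convert(type, value):
    # convert 'value' to 'type'; here e.g. Rational -> Double
    return value.numerator / float(value.denominator)

def unify(type_a, type_b):
    # choose the merge type: unify(Rational, Double) could be Double,
    # or Union(Rational, Double), as discussed above
    return Double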
A final missing piece is a form of ad-hoc polymorphism, namely coercions. This is tricky in the presence of overloading, where multiple coercions are possible, but only a single coercion is preferable. E.g.:
@overload('Float32 -> Float32 -> Float32')
def add(a, b):
    return a + b

@overload('Complex64 -> Complex64 -> Complex64')
def add(a, b):
    return a + b
Which implementation is `add(1, 2)` supposed to pick? `Int` freely coerces to both `Float32` and `Complex64`. Since we don't want built-in coercion rules, which are not user-overridable or extensible, we need some sort of coercion function. We choose a function `coercion_distance(src_type, dst_type)` which returns the supposed distance between two types, or raises a TypeError. Since this is not compiled, we decide not to make it a method of the source type.
@overload(Int, Float)
def coercion_distance(int_type, float_type):
    return ...
These functions are used at compile time to determine which conversions to insert, or whether to issue typing errors.
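For illustration, a sketch of how that resolution could work (the helper and the `signature.args` attribute are assumptions, not an existing API): compute the total coercion distance per overload and pick the minimum, with a TypeError marking an inapplicable overload:

def best_match(overloads, argtypes):
    best, best_cost = None, None
    for signature, impl in overloads:
        try:
            cost = sum(coercion_distance(src, dst)
                       for src, dst in zip(argtypes, signature.args))
        except TypeError:
            continue  # at least one argument cannot coerce
        if best_cost is None or cost < best_cost:
            best, best_cost = impl, cost
    if best is None:
        raise TypeError("no matching overload")
    return best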
We need to allow careful control over optimizations and code specialization. This allows us to use the abstractions we need, without paying for them when we can't afford it. We propose the following intrinsics exposed to users:
for x in unroll(iterable): ...
@specialize.arg(0)
The first compiler intrinsic allows unrolling over constant iterables. For instance, the following would be a valid usage:
x = (1, 2, 3)
for i in unroll(x):
    ...
An initial implementation will likely simply recognize special container types (Tuple, List, etc). Later we may allow arbitrary (user-written!) iterables, where the result of `len()` must be ultimately constant (after inlining and register promotion).
The second intrinsic provides the ability to specialize on various things, similar to specialization in RPython (`rpython/rlib/objectmodel.py`). These decorators should also be supported as extra arguments to `@signature` etc.
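A hypothetical example in the spirit of RPython's `objectmodel.specialize` (the exact decorator semantics here are an assumption): specializing on a constant exponent turns the loop below into straight-line code:

@specialize.arg(1)
@jit('Float -> Int -> Float')
def power(x, n):
    result = 1.0
    for _ in unroll(range(n)):  # n is a compile-time constant per specialization
        result *= x
    return result

y = power(x, 3)  # compiles to result = x * x * x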
Parakeet and copperhead do this really well. We need map, reduce, zip, list comprehensions, etc.
We can support an `@opaque` decorator that marks a function or method as "opaque", which means it must be resolved by the code generator. A decorator `@codegen(thefunc)` registers a code generator function for the function or method being called:
@jit('Int[Int size]')
class Int(object):

    @opaque('Int -> Int', eval_if_const=True)
    def __add__(self, other):
        return self + other

@codegen(Int.__add__)
def emit_add(codegen, self, other):
    # 'self' and 'other' are (typed) pykit values
    return codegen.builder.add(self, other)
This can also be useful to retain high-level information, instead of expanding it out beforehand. This can enable high-level optimizations, e.g. consider the following code:
L = []
for i in range(n):
    L.append(i)
L = map(f, L)
If we expand `L = []` and `L.append(i)` into memory allocations and resizes before considering the `map`, we forgo a potential optimization where the compiler performs loop fusion and eliminates the intermediate list.
So an opaque function may have an implementation, but it may be resolved at a later stage in the pipeline if it is still needed:
@codegen(List.__init__)
def emit_new_list(codegen, self):
    return codegen.builder.new_list(self.type)

@llcodegen('new_list')
def emit_new_list(codegen, self):
    return codegen.gen_call(List.__init__)
This should be done with low-level code that doesn't need further high-level optimizations. Users must also ensure this process terminates (there must be no cycles in the call graph).
The mechanisms above allow us to easily evaluate how code will be compiled, and assess the performance implications. Furthermore, we can easily see what is GPU-incompatible, i.e. anything that:
- uses CFFI (this implies use of Object, which is implemented in terms of CFFI)
- uses traits that don't merely constrain type variables
- allocates anything mutable
Everything else should still work.
We want to fuse operations producing intermediate structures such as lists or arrays. Fusion or deforestation has been attempted in various ways; we will first cover some of the existing research in the field.
Rewrite rules can be used to specify patterns to perform fusion ([1]_, [2]_, [3]_), e.g.:
map f (map g xs) = map (f . g) xs
The dot represents the composition operator. To avoid the need for a pattern for each pair of operators, we can express fusable higher-order functions in terms of a small set of combinators. One approach is build/foldr, where `build` generates a list and `foldr` (reduce) consumes it ([3]). Foldr can be defined as follows:
foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)
`build` is the dual of `foldr`: instead of reducing a list, it generates one. Using just build and foldr, a single rewrite rule can be used for deforestation:

foldr k z (build g) = g k z
This is easy to understand considering that build generates a list, and foldr then consumes it, so there's no point in building it in the first place. Build is specified as follows:
build g = g (:) []
This means `g` is applied to the cons constructor and the empty list. We can define a range function (called `from` in [3]) as follows:
range a b = if a > b then []
else a : (range (a + 1) b)
Abstracting over cons and nil (the empty list) [3], we get:
range' a b = \ f lst -> if a > b then lst
else f a (range' (a + 1) b f lst)
It's easy to see the equivalence to `range` above by substituting `(:)` for `f` and `[]` for `lst`. We can now use `range'` with `build` ([3]):
range a b = build (range' a b)
Things like `map` can now be expressed as follows ([3]):
map f xs = build (\ cons lst -> foldr (\ a b -> cons (f a) b) lst xs)
However, some functions cannot be expressed in this framework, like zip ([4]_).
Another major approach is based on stream fusion ([4]_, [5]_). It expresses the higher-order functions in terms of streams ([4]_):
map f = unstream . map' f . stream
`unstream` converts a stream back to a list, and `stream` converts a list to a stream. Under composition, like `map f (map g xs)`, we get `unstream . map' f . stream . unstream . map' g . stream`. The fusion then relies on eliminating the composition of `stream` with `unstream`:
stream (unstream s) = s
A stream consists of a stepper function and a state. Stepper functions produce new step states. The states are `Done`, `Yield` or `Skip`. `Done` signals that the stream is consumed, `Yield` yields a new value and state, and `Skip` signals that a certain value needs to be skipped (for things like filter).
Let's see this in action ([5]):
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []     = Done
    next (x:xs) = Yield x xs
This converts a list to a Stream. It constructs a Stream with a new stepper function `next` and the initial state (the given list). The `next` stepper function produces a new step state every time it is called. Streams can be consumed as follows:
map f (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
        Done       -> Done
        Skip s'    -> Skip s'
        Yield x s' -> Yield (f x) s'
Here we specify a new stepper function `next` that, given a state, advances the stream it consumes with the new state, and yields new results. It wraps this stepper function in a new stream. [5]_ further extends this work to allow operation over various kinds of streams:
- Chunked streams for bulk memory operations
- Vector (multi) streams for SIMD computation
- Normal streams that yield one value at a time
It bundles the various streams together in a product type. The idea is that all streams are available at the same time. Hence a producer can produce in the most efficient way, and the consumer can consume in the most efficient way. These concepts don't always align, in which case fallbacks are in place: for instance, a chunked stream can be processed as a scalar stream, or vice-versa. In addition to inlining and other optimizations, it relies heavily on call-pattern specialization ([6]), allowing the compiler to eliminate pattern matching at consumer sites.
The concept of a stream encapsulating a state and a stepper function is akin to iterators in Python, where the state is part of the iterator and the stepping functionality is provided by the `__next__` method. Although iterators can be composed and specialized on a static callee destination (the `__next__` method of another iterator), they are most naturally expressed as generators:
def map(f, xs):
    for x in xs:
        yield f(x)
The state is naturally captured in the generator's stack frame. To allow fusion we need to inline producers into consumers. This is possible only if we can turn the lazy generator into a non-lazy producer, i.e. the consumer must immediately consume the result. This introduces a restriction:
The generator may not be stored, passed to other functions, or returned. We can capture this notion by having `iter(generator)` create a `stream`, and disallowing the rewrite rule `stream (unstream s) = s` from triggering when the `unstream` has multiple uses. This means the value remains `unstreamed` (which itself is lazy, but effectively constitutes a fusion boundary).
Since we can express many (all?) higher-order fusable functions as generators, we have a powerful building block (in the same way as the previously outlined research methods) that will give us rewrite rules for free. I.e., we will not need to state the following:
map(f, map(g, xs)) = map(f . g, xs)
since this automatically follows from the definition of map:
@signature('(a -> b) -> Stream a -> Stream b')
def map(f, xs):
    for x in xs:
        yield f(x)
The two things that need to be addressed are 1) how to inline generators and 2) how to specialize on argument "sub-terms".
The inlining pattern is straightforward:
- remove the loop back-edge
- promote loop index to stack variable
- inline generator
- transform 'yield val' to 'i = val'
- replace each 'yield' from the callee with a copy of the loop body of the caller
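A worked example of this pattern, using the generator-based map above (names are illustrative):

# before inlining, the consumer drives the generator:
def consumer(f, xs):
    for y in map(f, xs):
        print y

# after splicing the generator body into the consumer, each
# 'yield f(x)' becomes an assignment followed by a copy of the
# caller's loop body; no generator cell remains:
def consumer_inlined(f, xs):
    for x in xs:
        y = f(x)   # was: yield f(x)
        print y    # copied loop body of the caller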
Now consider a set of generators that have multiple yield expressions:
def f(x):
    yield x
    yield x
Inlining the producer into the consumer means duplicating the body for each yield. This can lead to code growth exponential in the depth of the terms:
for i in f(f(f(x))):
    print i
This will result in a function with 8 print statements. However, it is not always possible to generate static code without multiple yields; consider the concatenation function:
def concat(xs, ys):
    for x in xs:
        yield x
    for y in ys:
        yield y
This function has two yields. If we rewrite it to use only one yield:
def concat(xs, ys):
    for g in (xs, ys):
        for x in g:
            yield x
We have introduced dynamicity that cannot be eliminated without specialization on the values (i.e. unrolling the outer loop, yielding the first implementation). This is not special in any way; it is inherent to inlining and we treat it as such (by simply using an inlining threshold). Crossing the threshold simply means temporaries are not eliminated; in this case this means generator "cells" remain.
If this proves problematic, functions such as concat can instead always unstream their results. Even better than fully unstreaming, or sticking with a generator cell, is to use a buffering generator fused with the expression that consumes N iterations and buffers the results. This divides the constant overhead of generators by a constant factor.
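A sketch of such a buffering wrapper (illustrative only): it consumes up to N items at a time from the producer, so the constant per-item generator overhead is paid once per buffer rather than once per element:

def buffered(producer, n=64):
    buf = []
    for x in producer:
        buf.append(x)
        if len(buf) == n:
            yield buf   # the consumer iterates the buffer in a tight loop
            buf = []
    if buf:
        yield buf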
Specialization follows from inlining; there are two cases:

- internal terms: `stream (unstream s)` is rewritten, and the result is fused
- boundary terms

Internal terms are rewritten according to the `stream (unstream s)` rule. What eventually follows at a boundary is a) consumption through a user-written loop or b) consumption through the remaining unstream. In either case the result is consumed, and the inliner will start inlining top-down (reducing the terms top-down).
For simplicity we exclude support for chunked streams. Analogous to [5]_ we can expose a SIMD vector type to the user. This vector can be yielded by a producer to a consumer.
How then, does a consumer pick which stream to operate on? For instance, zip can only efficiently be implemented if both inputs are the same, not if one returns vectors and the other scalars (or worse, switching back and forth mid-way):
def zip(xs, ys):
    while True:
        try:
            yield (next(xs), next(ys))
        except StopIteration:
            break
For functions like zip, which are polymorphic in their arguments, we can simply constrain our inputs:
@overload('Stream[Vector a] -> Stream[Vector b] -> Stream[(Vector a, Vector b)]')
@overload('Stream a -> Stream b -> Stream (a, b)')
def zip(xs, ys):
    ...
Of course, this means if one of the arguments produces vectors, and the other scalars, we need to convert one to the other:
@overload('Stream[Vector a] -> Stream a')
def convert(stream):
    for vector in stream:
        for x in vector:
            yield x

This basically unpacks the values from the SIMD register.
Alternatively, a mixed stream of vectors and scalars can be consumed. [5]_ distinguishes between two vector streams:
- a producer stream, which can yield Vector | Scalar
- a consumer stream, where the consumer chooses whether to read vectors or scalars. A consumer can start with vectors, and when the vector stream is consumed read from the scalar stream.
A producer stream is useful for producers that mostly yield vectors, but sometimes need to yield a few scalars. This class includes functions like concat, which concatenates two streams, or e.g. a stream over a multi-dimensional array where the inner-contiguous dimension has a number of elements that is not 0 modulo the vector size.
A consumer stream on the other hand is useful for functions like zip, allowing them to vectorize part of the input. However, this does not seem terribly useful for multi-dimensional arrays with contiguous rows, where it would only vectorize the first row and then fall back to scalarized code.
However, neither model really makes sense for us, since we would already manually specialize our loops:
@overload('Array a 2 -> Stream a')
def stream_array(array, vector_size):
    for row in array:
        n = len(row) / vector_size
        for i in range(n):
            yield load_vector(row.data + i * vector_size)
        for j in range(n * vector_size, len(row)):
            yield row[j]
This means code consuming scalars and code consuming vectors can be matched up through pattern specialization (which is not just type-based branch pruning).
To keep things simple, we will stick with a producer stream, yielding either vectors or scalars. Consumers then pattern-match on the produced values, and pattern specialization can then switch between the two alternatives:
def sum(xs):
    zero = 0
    vzero = Vector(zero)
    for x in xs:
        if isinstance(x, Vector):
            vzero += x
        else:
            zero += x
    return zero + vreduce(add, vzero)
To understand pattern specialization, consider that `xs` is a `stream_array(a)`. This results in approximately the following code after inlining:
# stream_array inlined into sum
for row in array:
    n = len(row) / vector_size
    for i in range(n):
        x = load_vector(row.data + i * vector_size)
        if isinstance(x, Vector):
            vzero += x
        else:
            zero += x
    for j in range(n * vector_size, len(row)):
        x = row[j]
        if isinstance(x, Vector):
            vzero += x
        else:
            zero += x
It is now easy to see that we can eliminate the second pattern in the first loop, and the first pattern in the second loop.
To summarize, fusion can be supported in a general and pythonic way by modelling it on generators. To support this we need:
- generator inlining
- For SIMD and bulk operations, call pattern specialization. For us this means branch pruning and branch merging based on type.
The most important optimization is the fusion; SIMD is a useful extension. Depending on the LLVM vectorizer (or possibly our own), it may not be necessary.
This section discusses typing for numba. There is plenty of literature on type inference; most notable is the Damas-Hindley-Milner Algorithm W for the lambda calculus [1]_, and an extension for ML. The algorithm handles let-polymorphism (a.k.a. ML-polymorphism), a form of parametric polymorphism where type variables themselves may not be polymorphic. For example, consider:
def f(g, x):
    g(x)
    g(0)
We can call `f` with a function, which must accept `x` and a value of type int. Since `g` is a monotype in `f`, the second call to `g` restricts what we accept for `x`: it must be something that promotes with an integer. In other words, the type for `g` is `a -> b` and not `∀a,b. a -> b`.
Although linear in practice, the algorithm's worst-case behaviour is exponential ([2]_), since it does not share results for different function invocations. The cartesian product algorithm ([3]_) avoids this by sharing monomorphic template instantiations. It considers all possible receivers of a message send, and takes the union of the results of all instances of the cartesian product substitution. The paper does not seem to address circular type dependencies, where the receiver can change based on the input types:
def f(x):
    for i in range(10):
        x = g(x)
leading to
define void f(X0 %x0) {
cond:
    %0 = lt %i 10
    cbranch %0 body exit
body:
    %x1 = phi(x0, x2)
    %x2 = call g(%x1)
    br cond
exit:
    ret void
}
However, this can be readily solved through fix-point iteration. If we assign type variables throughout the function first, we get the following constraints:
[ X1 = Union(X0, X2), G = X1 -> T2, X2 = T2 ]
We can represent a function as a set of overloaded signatures. However, the function application is problematic, since we send X1 (which will be assigned a union type). Without using the cartesian product this would lead to exponential behaviour, since there are 2^N subsets for N types.
We use the cartesian product algorithm on a constraint network based on the dataflow graph. To understand it, we need to understand the input language. Since we put most functionality of the language in the user-domain, we desugar operator syntax through special methods, and we further support overloaded functions.
The front-end generates a simple language that can conceptually be described through the syntax below:
e = x                         variable
  | x = a                     assignment
  | const(x)                  constants
  | x.a                       attribute
  | f(x)                      application
  | jump/ret/exc_throw/...    control flow
  | (T) x                     conversion
As you'll notice, there are no operators, loops, etc. Control flow is encoded through jumps, exception raising, return, etc. Loops can be readily detected through a simple analysis (see pykit/analysis/loop_detection.py).
We take this input grammar and generate a simpler constraint network, that looks somewhat like this:
e = x.a           attribute
  | f(x)          application
  | flow(a, b)    data flow
This is a directed graph where each node classifies the constraint on the inputs. Types propagate through this network until no more changes can take place. If there is an edge `A -> B`, then whenever `A` is updated, types are propagated to `B` and processed according to the constraint on `B`. E.g. if `B` is a function call, and `A` is an input argument, we analyze the function call with the new values in the cartesian product.
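A minimal sketch of this propagation loop (the graph interface and `apply_constraint` are assumptions), where `types[node]` is the set of types inferred for a node so far:

def propagate(network, types, worklist):
    while worklist:
        node = worklist.pop()
        for succ in network.successors(node):
            # e.g. re-analyze a call with the cartesian product of
            # the updated input types
            new = apply_constraint(succ, types)
            if not new.issubset(types[succ]):
                types[succ] |= new
                worklist.append(succ)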
Coercions may happen in two syntactic constructs:
- application
- control flow merges (phi nodes)
For application we have a working implementation in Blaze that determines the best match for polymorphic type signatures, and allows for coercions. For control flow merges, the user can choose whether to promote values, or whether to create a sum-type. A post-pass can simply insert coercions where argument types do not match parameter types.
We intend to support subtyping in the runtime through Python inheritance. When a class B inherits from a class A, we check for a compatible interface for the methods (argument types are contravariant and return types covariant). When typing, the only things we need to implement are coercion and unification:
- Type B coerces to type A if B is a subtype of A
- Type A coerces to type B if B is a subtype of A, with a runtime check only
Then class types A and B unify iff A is a subtype of B or vice-versa. The result of unification is always the supertype.
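A sketch of these rules in terms of plain Python classes (a hypothetical helper; the compiler would operate on its own type objects):

def unify_classes(A, B):
    # the result of unification is always the supertype
    if issubclass(B, A):
        return A
    if issubclass(A, B):
        return B
    raise TypeError("cannot unify unrelated classes")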
Finally, parametric types will be classified invariant, to avoid unintended mistakes in the face of mutable containers. Consider e.g. superclass `A` and subclass `B`, and assume we have a function that accepts an argument typed `A[:]`. If we treat the dtype as covariant, then we may pass an array `B[:]` for that argument. However, the code can legally write `A`s into the array, violating the rule that we can only assign subtypes. The problem is that reading values is covariant, whereas writing is contravariant. In other words, the parameter must be covariant as well as contravariant at the same time, which is only satisfied when `A = B`.
The exception is perhaps function types, for which we have built-in variance rules.
Types can only be parameterized by variables and user-defined or built-in types. Type variables may be constrained through traits (type sets can readily be constructed by implementing (empty) traits).
Nearly all built-in data types are implemented in the runtime.
To support mutable heap-allocated types, we need a garbage collector. To get started quickly we can use Boehm or reference counting. We will want to port one of the available copying collectors and use a shadowstack or a lazy pointer stack (for bonus points). The GC should then be local to each thread, since there is no shared state between threads (only owned and borrowed data is allowed).
Garbage collection is abstracted by pykit.
Exceptions are also handled by pykit. We can implement several models, depending on the target architecture:
- costful (error return codes): this will be used on the GPU
- zero-cost: this should be used where supported; we will start with costful
- setjmp/longjmp: this will need to happen for every stack frame in the case of a shadow stack
Local exception handling will be translated to jumps. This is not contrived, since we intend to make heavy use of inlining:
while 1:
    try:
        i = x.__next__()
    except StopIteration:
        break
`x.__next__()` may be inlined (and will be in many instances, like range()), and the `raise StopIteration` will be translated to a jump. Control flow simplification can further optimize the extra jump (jump to break, break to loop exit).
As mentioned in the core language overview, memory is not shared unless borrowed. This process is unsafe and correctness must be ensured by the user. Immutable data can be copied over channels between threads. Due to a thread-local GC, all threads can run at the same time and allocate memory at the same time.
We will remove prange and simply use a parallel map with a closure.
Traits are mostly a compile-time type-checking detail and some simple runtime decorator support. Traits with dynamic dispatch require vtables, something we can implement in the runtime as well:
https://github.com/zdevito/terra/blob/master/tests/lib/golike.t
Extension types are currently built on top of CPython objects. This should be avoided: we need to decouple numba from anything CPython, for the sake of portability as well as pycc.
Extension types can also easily be written in the runtime:

- `unify()` needs to return the supertype or raise a type error
- `convert(obj, Object)` needs to do a runtime typecheck
- `coercion_distance` needs to return a distance for how far the supertype is up the inheritance tree
The approach is simple: generate a wrapper method for each method in the extension type that does a vtable lookup.
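A sketch of that wrapper generation (the vtable layout is an assumption): each wrapper loads the function pointer from the instance's vtable by index and calls through it:

def make_wrapper(method_name, index):
    def wrapper(self, *args):
        func = self.vtable[index]  # dynamic lookup
        return func(self, *args)
    wrapper.__name__ = method_name
    return wrapper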
This time we will start with the most common case: closures consumed as inner functions. This means we don't need dynamic binding for our cell variables, and we can do simple lambda lifting instead of complicated closure conversion. This also trivially works on the GPU, allowing one to use map, filter etc. with lambdas.
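A sketch of the lifting (names are illustrative): the cell variable `n` becomes an explicit parameter of the lifted function, and the consumer is specialized to pass it along:

# before lifting: the lambda closes over 'n'
def scale(xs, n):
    return map(lambda x: x * n, xs)

# after lifting: no closure environment remains
def _lambda0(x, n):
    return x * n

def scale_lifted(xs, n):
    # map is specialized for (_lambda0, n); every call site passes n
    return [_lambda0(x, n) for x in xs]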