Ninkasi Bug Fixes

I spent a bit of time hacking on Ninkasi code this weekend!

I got rid of one terrible idea I implemented, replaced it with something much better, fixed one crash bug, and two potential coroutine-related memory leaks!

Removing One Ugly Anti-feature

First order of business was to change the object-function-call stupidity I put in there forever ago. It's an anti-feature I never should have considered. Here's how it went:

An object can have values inside of it assigned to functions, like so:

function testFunc(testValue)
{
    print("Value passed in: ", testValue, "\n");
}

var ob = object();
ob["foo"] = testFunc;

Previously, these functions can be called like so:

ob.foo();

When called in this manner, the object itself is inserted as the first argument to the function. This is roughly equivalent to the following code:

{
    var _temp = ob.foo;
    _temp(ob);
}

And the output of this is, predictably:

Value passed in: <object:0>

The problem with this is that there's no way to distinguish between when you want the object to be inserted as the first argument to a function and when you do not.

My reasoning at the time I developed this "feature" was sketchy at best, so I finally got around to cutting it out. Now there is no special case considered for calling a function directly from an object's field using '.'.

Now this code...

function testFunc(testValue)
{
    print("Value passed in: ", testValue, "\n");
}

var ob = object();
ob["foo"] = testFunc;

ob.foo();

... is a runtime error, as it should be:

error:
test.nks:14: Incorrect argument count for function call. Expected: 1 Found: 0

To replace this feature, because still like some aspects of it, I have stolen and repurposed C's indirection operator. The following code acts the way ob.foo() acted before:

ob->foo();

Which is equivalent to:

ob.foo(ob);

This gets us back to the original result:

Value passed in: <object:0>

But now you get to decide if you want to call a function on an object without the object itself being added in as an argument, by using the -> or the ..

One Crash Bug

Callable Objects

First, an explanation of the feature this affected:

Callable objects in Ninkasi are a little oddity I thought it would be useful to throw in. It's sort of a halfway point to actual closures.

Objects may be called as functions through the use of specially-named fields.

Take the following code:

var ob = newobject;
ob._exec = function(foo) {
    print("foo: ", foo, "\n");
};
ob._data = "blah blah blah";
ob();

The output of this code is:

foo: blah blah blah

It's that simple. Set an object's _exec field to a function, set an object's _data field to a value, and you may call the object itself as though it were a function.

This code...

ob();

... is equivalent to this code...

ob._exec(ob._data);

You can even add extra arguments after the _data one, which is inserted first:

var ob = object();
ob._exec = function(foo, bar) {
    print("foo: ", foo, ", ", bar, "\n");
};
ob._data = "blah blah blah";
ob("whatever");

Output:

foo: blah blah blah, whatever

The Bug

This feature does, unfortunately, involve some weird shifting-around of stack data.

This part wasn't thought out super-well, so bare with me here.

The setup for the function-call operator involves pushing the function ID onto the stack, followed by all the function arguments, followed by an integer indicating the number of arguments there are.

Here's what the stack will look like for a normal 3-parameter function call, at the time the call opcode executes (with stack indices relative to the top of the stack):

index | value       | type
------------------------------
-6    | ...         |
-5    | function-ID | function
-4    | arg0        | any type
-3    | arg1        | any type
-2    | arg2        | any type
-1    | 3           | integer

We have the function ID, followed by the arguments, followed by an argument count.

In order for the _data field to always be first, we need the stack to look like this:

index | value       | type
------------------------------
-7    | ...         |
-6    | function-ID | function
-5    | _data       | any type
-4    | arg0        | any type
-3    | arg1        | any type
-2    | arg2        | any type
-1    | 4           | integer

That just involves inserting the _data into the stack, shifting everything forward by 1, and adding 1 to the argument count entry.

I know this is probably tickling everyone's inefficient-VM sense, but this feature was kind of an afterthought for my personal toy language, so be nice. :)

Anyway, our bug today was in the code to shift the stack data around. The argument count coming into the function call opcode could be larger than the stack, and it would happily underrun the bottom of the stack, leading to your usual C buffer overrun/underrun type of error. That code is also not the prettiest code in the world, leading me to think I was either tired or in a rush when I wrote it. Let's have a look...

// Make room for the _data contents on the stack.
nkiVmStackPush_internal(vm);

// Shift everything up the stack.
for(i = argumentCount; i >= 1; i--) {
  vm->currentExecutionContext->stack.values[
    vm->currentExecutionContext->stack.size - argumentCount - 2 + i + 1] =
    vm->currentExecutionContext->stack.values[
      vm->currentExecutionContext->stack.size - argumentCount - 2 + i];
}

Yikes.

How to Setup This Bug

So here's an interesting thing: You can't.

Well, let me be more clear, you can't if you're using the scripting language directly. This is purely a vulnerability exposed through the binary state snapshots.

I haven't actually reverse-engineered AFL's crash output for this one, to be entirely honest, but my thinking is that the bad executable just pushes a bad argument count onto the stack before doing a normal callable-object call.

The Fix

First, I know what you're going to say. How dare I not have a system in place to protect against these kinds of overrun/underruns, when it's so easy to check for!? That usually comes after "why the hell are you using C89 for this project when it's absolutely prone to these kinds of errors!?" followed by a recommendation that I switch to Rust.

The thing is, though, I do have a way of safely interacting with the stack. I'm just not using it here because... uh... reasons?

The stoage space allocated to the stack only doubles in size when it reallocates, so, given that the stack capacity is always a power of two, it's pretty easy to just mask off the lowest some-number-of bits for stack indices going into the array. I even store that stack mask, and use it like I actually know what I'm doing in most spots!

The fix (along with some much-needed cleanup) look like this:

struct NKVMStack *stack = &(vm->currentExecutionContext->stack);
struct NKValue *stackValues;
nkuint32_t stackSize;
nkuint32_t stackMask;
nkuint32_t stackSizeMinusArguments;

// Make room for the _data contents on the stack.
nkiVmStackPush_internal(vm);

stackValues = stack->values;
stackSize = stack->size;
stackMask = stack->indexMask;
stackSizeMinusArguments = stackSize - argumentCount - 2;

// Shift everything up the stack.
for(i = argumentCount; i >= 1; i--) {
    stackValues[(stackSizeMinusArguments + i + 1) & stackMask] =
      stackValues[(stackSizeMinusArguments + i) & stackMask];
}

This also throws in a few extra variables to cut down on those giant globs of indirection and indexing inside the loop, potentially offering a speedup on compilers that can't optimize out those redundant operations (not that it's a particularly fast piece of code to begin with).

Memory Leaks

I'm not going to get into too much detail with this one, because it's pretty simple. A failure during coroutine creation would make the program allocate the storage space needed for the coroutine's execution context (including stack, program counter, etc) and then fail to set the pointer to store it, causing it to leak.

The fix is just to check for those failures and clean up the mess when it does.

Memory still Tracked

So with this leak still have a backup system in place for these. Allocations inside the VM are actually tracked and easily cleaned up during deallocation of the VM object itself. It's not a C memory leak, it's just a leak of VM address space, more or less.

When the VM's address space is exhausted, it will bail out with its own internal allocation error and can still be cleaned up by the hosting application. So as far as the stability of the hosting application is concerned, and in relation to potentially hostile script code, this isn't a "real" threat. It does suck for long-running scripts, though.

Final Thoughts

AFL is great. Valgrind is awesome, too. All these issues found today were through those two tools. Thinking you can write secure code without them or some equivalent to them is hubris in its purest form.

The issues with coroutine setup here are definitely showing my lack of testing after writing the coroutine code. The callable object issues are something I wasn't expecting, because that code's been in there forever.

Ninkasi may never be perfect and vulnerability-proof, but at least it got a little bit closer today, and I'm pretty happy with that.

Posted: 2021-09-05

Tags: ninkasi, gruedorf