
2023

Considering user-defined numerical types

Originally from: https://c3.handmade.network/blog/p/8635-considering_user-defined_numerical_types

Recently, I did some work on the math libraries for C3. This involved working on vector, matrix and complex types. In the process I added some conveniences to the built-in (simd) vector types. One result of this was that rather than having user-defined Vector2 and Vector3 types, I could simply add methods to float[<2>] and float[<3>] (+ double versions). This works especially well since + - / * are all defined on vectors.

In other words, even without operator overloading this works:

float[<2>] a = get_a();
float[<2>] b = get_b();
return a + b;

Seeing as a complex number is nothing other than a vector of two elements, it seemed interesting to implement the complex type that way and get much of the arithmetic for free. Addition and subtraction are exactly the elementwise vector operations, but multiplication is not: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, so multiplication still needs its own method.

Just making a complex type a typedef of float[<2>] has problems though. Any method defined on the complex type would be defined on float[<2>]!

define Complex = float[<2>]; // Type alias
// Define multiply
fn Complex Complex.mul(Complex a, Complex b) 
{
    return {
        a[0] * b[0] - a[1] * b[1], 
        a[1] * b[0] + b[1] * a[0] 
    };
}
...
float[<2>] f = get_f();
f = f.mul(f); // Accidentally gets the Complex version!

Distinct types

Now C3 has the concept of "distinct" types: a type which is in practice identical to some other type, but has a different name and will not implicitly convert to it. For example, I can write define Id = distinct int and have the compiler complain if an int rather than an Id is used.
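
A minimal sketch of how this behaves (use_id and test are hypothetical names for illustration):

define Id = distinct int;

fn void use_id(Id id) { }

fn void test()
{
    int raw = 42;
    use_id(raw);     // Error: an int is not an Id.
    use_id((Id)raw); // OK: explicit cast.
}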

This is similar to the C trick of wrapping a type in a struct, but without the inconvenience of that method.

This solves our problem from before:

define Complex = distinct float[<2>]; // Distinct type
fn Complex Complex.mul(Complex a, Complex b) 
{
    return {
        a[0] * b[0] - a[1] * b[1], 
        a[1] * b[0] + b[1] * a[0] 
    };
}
float[<2>] f = get_f();
Complex c = get_c();
c = c.mul(c); // Works.
f = f.mul(f); // Compile time error!

At first glance this looks promising. Unfortunately the advantage of distinct types becomes a disadvantage here: the distinct type retains the operators of the original type. For the c + d case this is what we want, but c * d does not do what we expect:

Complex c = { 1, 3 };
Complex d = { 2, 7 };
Complex e = c.mul(d); // Correct! e is { -19, 13 }
e = c * d; // Inherited elementwise multiply: e is { 2, 21 }

While we can try to remember to use the right thing, it's far from ideal. Especially if this is baked into the standard library: you can't have a type whose regular operators mostly behave incorrectly!

Possible solutions

Since we're able to change the language semantics, we can try to "fix" this while still using "distinct". There are a few obvious solutions:

  1. Allow "overriding" an operator for a distinct type. In this case we would override * and / and leave the other operators. This would be a limited form of overloading (see the sketch after this list).
  2. Allow "turning off" operators. In this case we would turn off * and /, forcing the programmer to use methods like mul and div instead.
  3. Always require operators and methods to be explicitly inherited.
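
To make (1) and (2) concrete, here is a sketch in purely hypothetical syntax (the @operator and @disable attributes here are invented for illustration; nothing like this is decided):

define Complex = distinct float[<2>];

// (1) Hypothetical: override * with the complex multiply,
// while + and - stay inherited from float[<2>].
fn Complex Complex.mul(Complex a, Complex b) @operator(*)
{
    return {
        a[0] * b[0] - a[1] * b[1],
        a[1] * b[0] + b[1] * a[0]
    };
}

// (2) Hypothetical alternative: turn the misleading operators
// off entirely, forcing c.mul(d) and c.div(d).
define Complex = distinct float[<2>] @disable(*, /);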

These could work, but we need to recognize the added complexity of these solutions. And on top of that, some functionality can't quite be described this way, such as the conversion of floats to complex numbers.

Operator overloading with structs

The common solution in C++ would be a struct with operator overloading to get + - * /. C3 doesn't have operator overloading for numbers, but maybe we could add it?

However, operator overloading is not sufficient to get us conversion from floats to complex numbers. For that we need user-defined conversion operators, which interact with the type system in various ways. This raises the whole problem of custom constructors and custom conversions: is float -> Complex a conversion function on float or a construction function on Complex? All of this interacts in subtle ways with other implicit conversions.
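
To make the dilemma concrete, here are the two possible homes for such a conversion, in purely hypothetical form (the method names are invented, and C3 has no machinery to make either of them implicit):

define Complex = distinct float[<2>];

// (a) Hypothetical: a conversion function living on float.
fn Complex float.to_complex(float f)
{
    return { f, 0.0f };
}

// (b) Hypothetical: a construction function living on Complex.
fn Complex Complex.from_float(float f)
{
    return { f, 0.0f };
}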

Built-in types

Another possibility is of course to make the types built-in. After all, this is how C does complex types. But then the problem is how to limit it: ok, complex types are built-in, but then what about quaternions? Matrices? Matrices with complex numbers? Matrices with quaternions?!

Drawing a line here means some types have better support than others, and once we try to go beyond (simd) vectors, it's hard to figure out where that line should be drawn.

Worth it?

Ultimately each feature needs to be balanced against its utility. Are the benefits sufficiently big to motivate the cost? Comparing what is already in C3 against what would need to be added, it seems that the cost would be fairly high.

Even if vectors work for complex numbers, matrices are more likely to require operator overloading on structs, which is a bigger feature than overriding operators on distinct types. This means that fixing things so that Complex can be a vector is a feature with very limited use.

General overloading and user-defined conversion functions can be applied to a wider set of types, but have a much higher cost, with the primary use restricted to numeric types and string handling.

So even if it's more useful, it also costs a whole lot more in terms of language complexity, making it ultimately a net negative for the language as a whole.

So unless I come up with some other solution, user-defined numeric types will have to stick with methods and explicit conversions.


Handling parsing and semantic errors in a compiler

Originally from: https://c3.handmade.network/blog/p/8632-handling_parsing_and_semantic_errors_in_a_compiler

There was recently a question on r/ProgrammingLanguages about error handling strategies in a compiler.

For a language where compile times are long, the more correct errors a compiler can produce, the better. On the other hand, false positives are not helping anyone.

In my case, the language compiles fast enough, so my focus has been to avoid false positives. I use the following rules:

  1. Lexing errors: these are handed over to the parser, creating parser errors.
  2. Parsing errors: skip forward until there is some token it is possible to safely sync on.
  3. Parser sync: some tokens will always be the start of a top-level statement in my language: struct, import, module. Those are safe to use. For some other token types, indentation can help: for example, in C3 fn is a good sync token if it appears at zero indentation, but if it's found further in, it's likely part of a function type declaration: define Foo = fn void();. Only sync on tokens you are really sure of.
  4. No semantic analysis for code that doesn't parse: code that doesn't parse is very unlikely to pass semantic analysis anyway. Pessimistic parser sync means lots of valid code may get skipped, which would make semantic analysis fail even though the code might actually pass.
  5. Use poisoning during semantic analysis. I first saw this described by Walter Bright, the creator of the D language. It is simple and incredibly effective in avoiding incorrect error reporting during semantic analysis. The algorithm is simply this: if an AST node has an error, report it and mark the node as poisoned. Then mark the parent of this AST node as poisoned as well, stopping any further analysis of it (but without reporting any further errors). A minimal sketch follows this list.
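
Here is that sketch (the Ast struct is a stand-in for a real AST node type, not the actual compiler's):

struct Ast
{
    Ast* parent;
    bool poisoned;
}

// Mark a node and all of its ancestors as poisoned. Analysis
// checks the flag first and silently skips poisoned nodes, so
// each underlying error is reported exactly once.
fn void poison(Ast* node)
{
    while (node != null && !node.poisoned)
    {
        node.poisoned = true;
        node = node.parent;
    }
}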

I don't do anything particularly clever in regards to error reporting, but I found that these rules are sufficient to give very robust and correct error handling.


Killing off structural casts

Originally from: https://c3.handmade.network/blog/p/8630-killing_off_structural_casts

Structural casts are now gone from C3. This was the ability to do this:

struct Foo { int a; int b; }
struct Bar { int x; int y; }

fn void test(Foo f)
{
    // Actual layout of Foo is the same as Bar
    Bar b = (Bar)f;
    // This is also ok:
    int[2] x = (int[2])f;
}

Although I think that in some ways this is a good feature, it is too permissive to be good: it's not always clear that the structural cast is even intended, and yet it suddenly allows a wide range of (explicit) casts. While doing a pointer cast like (Bar*)&f would usually raise all sorts of warning flags, one would typically assume a value cast to be fairly safe and intentional. Structural casting breaks that.

The intention was a check that essentially confirms that bitcasting from one type to the other will match the internal data layout. This could then be combined with an @autocast attribute, allowing something like this:

fn void foo(@autocast Foo f) { ... }

fn void test()
{
    Bar b = { 1, 2 };
    foo(b); // implicitly foo((Foo)b) due to the @autocast
}

The canonical use for this was when an external API could be used with a structurally equivalent internal type: for example, you use a library which takes a Vector2 everywhere, and maybe there is another library in use that has its Vector2D. And finally there is a Vec2 used internally in the application. With structural casts, these could be used interchangeably as long as they were structurally equivalent.

However, there are other solutions: transparent unions (see the GCC feature), macro forwarding wrappers and user-definable conversions.
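
As an illustration of the macro forwarding wrapper approach, here is a minimal sketch (LibVec2, Vec2 and lib_draw are hypothetical names; the pointer cast relies on the two structs having identical layout):

struct LibVec2 { float x; float y; } // type demanded by the external API
struct Vec2 { float x; float y; }    // our structurally identical internal type

fn void lib_draw(LibVec2 v) { }      // stand-in for the library function

// Forwarding macro: accepts our Vec2 and reinterprets it as
// the library's type through a pointer cast.
macro draw(Vec2 v)
{
    lib_draw(*(LibVec2*)&v);
}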

Then there is the question of use cases: while this vector case is common enough, I can't think of many other uses. (We might also note that when people want operator overloading, it's collections and user-defined vector types they take as examples.)

So if the standard library defines some vector types, or the use of real vector types becomes dominant, then the interoperability use case might go away completely.

In any case, this is one more feature that seemed really cool to have, but ended up being less useful than expected.

(It might be somewhat useful to have a compile-time function that determines whether two types are structurally identical though, as this would allow you to build macros that work for a set of structurally equivalent types, should you ever want to.)
