NNLang Documentation
Welcome to the NNLang documentation. NNLang is a declarative language for defining neural network architectures, paired with the nnc compiler that produces standalone, zero-dependency native binaries with embedded weights.
Key Features
- No runtime dependencies — compiled models are self-contained static binaries
- No heap allocation — all memory is statically allocated at compile time
- Human-readable source — model architectures are defined in plain-text .nnl files
- Systems-first — nnc targets bare-metal-capable output
Documentation Sections
- Getting Started — installation and your first model
- Language Reference — complete syntax reference
- CLI Reference — all compiler commands
- Examples — complete working examples
- Code Generation — how the compiler works
Quick Links
Getting Started
This guide will help you install and run your first NNLang model.
Installation
From crates.io
cargo install nnlang
From source
git clone https://github.com/gdesouza/nnl
cd nnl
cargo install --path .
Pre-built binaries
Download the latest release from GitHub:
# Linux
curl -L https://github.com/gdesouza/nnl/releases/latest/download/nnc-*-x86_64-unknown-linux-gnu.tar.gz | tar xz
sudo mv nnc /usr/local/bin/
# macOS
curl -L https://github.com/gdesouza/nnl/releases/latest/download/nnc-*-x86_64-apple-darwin.tar.gz | tar xz
sudo mv nnc /usr/local/bin/
Quick Start
1. Create a model file
Save this as model.nnl:
version 0.2;
model my_model {
config {
weights: "./weights";
io: "stdio";
}
layer input = Input(shape: [4]);
layer fc1 = Dense(units: 3, activation: "relu");
layer fc2 = Dense(units: 2);
}
2. Create weight files
Create a weights/ directory with:
- weights/fc1.weight.npy — [4, 3] matrix
- weights/fc1.bias.npy — [3] vector
- weights/fc2.weight.npy — [3, 2] matrix
- weights/fc2.bias.npy — [2] vector
3. Compile
nnc compile model.nnl --emit exe -o model
4. Run inference
# Input: 4 floats
echo -n -e '\x00\x00\x80\x3f\x00\x00\x00@\x00\x00@@\x00\x00\x80@' > input.bin
./model < input.bin > output.bin
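If you prefer not to hand-write escape sequences, the same raw little-endian float32 byte stream can be produced with NumPy — a convenience sketch, not part of the toolchain:

```python
import numpy as np

# Pack four float32 values into the raw little-endian byte stream
# that the compiled binary reads from stdin.
x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
with open("input.bin", "wb") as f:
    f.write(x.tobytes())

# After running `./model < input.bin > output.bin`, decode the result the same way:
# y = np.frombuffer(open("output.bin", "rb").read(), dtype=np.float32)
print(x.tobytes().hex())  # 0000803f000000400000404000008040
```

The hex dump matches the escape sequence in the echo command above byte for byte.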
Or test with known input/output:
nnc test model.nnl --input test_input.npy --expected expected_output.npy
Next Steps
- Language Reference — full language syntax
- CLI Reference — all commands
- Examples — complete model examples
NNLang Language Reference (v0.2)
A practical reference for writing .nnl model files consumed by the nnc compiler.
File Structure
An NNL file has a fixed top-level structure:
version 0.2;
model <name> {
config { ... }
layer <id> = <LayerType>(<params>);
...
connections { ... } // optional
}
| Section | Required | Purpose |
|---|---|---|
version | No (warns if absent) | Declares the NNL spec version. |
model | Yes | Names the model. Determines the generated C symbols (e.g., model_name_infer). |
config | Yes | Global compilation settings (precision, weights path, target, etc.). |
| Layers | Yes | One or more layer declarations defining the network. |
connections | No | Explicit data-flow graph. Omit for simple sequential models. |
Comments
// Line comment — extends to end of line
/* Block comment —
can span multiple lines */
Config Block
The config block sets compilation and runtime parameters.
config {
precision: "float32";
weights: "./weights/mnist.npz";
target: "generic";
align: 64;
batch: 1;
preprocess: "normalize_0_1";
io: "stdio";
}
| Key | Type | Required | Default | Description |
|---|---|---|---|---|
precision | String | No | "float32" | Tensor data type. "float32", "float64", "int8". |
weights | String | Yes | — | Path to weights: directory of .npy files, .npz archive, or .onnx file. |
target | String | No | "generic" | SIMD optimization target. "generic", "avx2", "avx512", "arm_neon". |
align | Number | No | 64 | Memory alignment in bytes for weight and workspace buffers. |
batch | Number | No | 1 | Inference batch size. Determines static buffer dimensions. |
preprocess | String | No | "none" | Input preprocessing. "none", "normalize_0_1", "standardize". |
preprocess_mean | Shape | No | — | Per-channel mean for "standardize" (e.g., [0.485, 0.456, 0.406]). |
preprocess_std | Shape | No | — | Per-channel std for "standardize" (e.g., [0.229, 0.224, 0.225]). |
io | String | No | "stdio" | I/O mode for --emit exe binaries. Currently only "stdio". |
Preprocessing modes:
- "normalize_0_1" — divides each input element by 255.0.
- "standardize" — applies (x - mean) / std per channel; requires preprocess_mean and preprocess_std.
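Both modes can be expressed in a few lines of NumPy. This is a reference sketch of the arithmetic (assuming HWC layout with mean/std broadcast over the channel axis), not the compiler's generated code:

```python
import numpy as np

# "normalize_0_1": map raw [0, 255] byte values to [0.0, 1.0] floats
raw = np.array([[0, 128, 255]], dtype=np.uint8)
normalized = raw.astype(np.float32) / 255.0

# "standardize": (x - mean) / std per channel. Here: a single 1x1x3 HWC pixel,
# with the example mean/std values from the config table above.
x = np.ones((1, 1, 3), dtype=np.float32)
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
standardized = (x - mean) / std  # broadcasts over the trailing channel axis

print(normalized)           # [[0.  0.50196  1.]]
print(standardized.shape)   # (1, 1, 3)
```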
Layer Types
Every layer is declared as:
layer <id> = <LayerType>(<param>: <value>, ...);
The layer <id> is used for connections and for matching weight tensors (weights are looked up as {id}.{param_name} in the weight source).
Input
Entry point of the network. Defines the input tensor shape (excluding batch dimension).
| Parameter | Type | Required | Default |
|---|---|---|---|
shape | Shape | Yes | — |
Output shape: the declared shape.
layer input = Input(shape: [28, 28, 1]);
Dense
Fully connected layer: Y = activation(W·X + B).
| Parameter | Type | Required | Default |
|---|---|---|---|
units | Integer | Yes | — |
activation | String | No | "none" |
activation accepts "none", "relu", "sigmoid", "softmax".
Weight files: {id}.weight (shape: input_dim × units), {id}.bias (shape: units).
Output shape: [units].
layer fc1 = Dense(units: 128, activation: "relu");
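The [input_dim, units] weight orientation means the forward pass is a plain x @ W + b. A reference sketch of the computation (not the compiler's generated kernel):

```python
import numpy as np

def dense(x, W, b, activation="none"):
    """Reference Dense forward pass: W has shape [input_dim, units]."""
    y = x @ W + b
    if activation == "relu":
        y = np.maximum(y, 0.0)
    elif activation == "sigmoid":
        y = 1.0 / (1.0 + np.exp(-y))
    elif activation == "softmax":
        e = np.exp(y - y.max())  # subtract max for numerical stability
        y = e / e.sum()
    return y

x = np.array([1.0, -2.0], dtype=np.float32)
W = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]], dtype=np.float32)  # [input_dim=2, units=3]
b = np.zeros(3, dtype=np.float32)
print(dense(x, W, b, "relu"))  # [1. 0. 0.] — relu clamps the negative sums
```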
Conv2D
2D spatial convolution.
| Parameter | Type | Required | Default |
|---|---|---|---|
filters | Integer | Yes | — |
kernel | Integer or Shape | Yes | — |
stride | Integer | No | 1 |
padding | String | No | "valid" |
padding accepts "valid" (no padding) or "same" (zero-pad to preserve spatial dims).
Weight files: {id}.weight (shape: filters × in_channels × kH × kW), {id}.bias (shape: filters).
Output shape (HWC):
- "valid": [⌊(H - kH) / stride⌋ + 1, ⌊(W - kW) / stride⌋ + 1, filters]
- "same": [⌈H / stride⌉, ⌈W / stride⌉, filters]
layer conv1 = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
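The two padding formulas can be checked with a small helper — this is a sketch of the shape arithmetic only, not the compiler's implementation:

```python
import math

def conv2d_out_shape(h, w, k_h, k_w, stride, padding, filters):
    """Output shape [H', W', filters] for Conv2D, per the formulas above."""
    if padding == "valid":
        return [(h - k_h) // stride + 1, (w - k_w) // stride + 1, filters]
    elif padding == "same":
        return [math.ceil(h / stride), math.ceil(w / stride), filters]
    raise ValueError(f"unknown padding: {padding}")

# Matches the MNIST example (28x28 "valid") and the ResNet block (32x32 "same"):
print(conv2d_out_shape(28, 28, 3, 3, 1, "valid", 32))  # [26, 26, 32]
print(conv2d_out_shape(32, 32, 3, 3, 1, "same", 64))   # [32, 32, 64]
```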
MaxPool2D
Spatial max pooling.
| Parameter | Type | Required | Default |
|---|---|---|---|
kernel | Integer or Shape | Yes | — |
stride | Integer | No | kernel size |
Weight files: none.
Output shape: [⌊(H - kH) / stride⌋ + 1, ⌊(W - kW) / stride⌋ + 1, C]
layer pool1 = MaxPool2D(kernel: 2);
AvgPool2D
Spatial average pooling.
| Parameter | Type | Required | Default |
|---|---|---|---|
kernel | Integer or Shape | Yes | — |
stride | Integer | No | kernel size |
Weight files: none.
Output shape: same formula as MaxPool2D.
layer pool1 = AvgPool2D(kernel: 2, stride: 2);
Flatten
Reshapes a multi-dimensional tensor into a 1D vector.
No parameters.
Weight files: none.
Output shape: [H × W × C] (product of all input dimensions).
layer flat = Flatten();
BatchNorm
Batch normalization (inference mode — uses stored running statistics).
| Parameter | Type | Required | Default |
|---|---|---|---|
epsilon | Number | No | 1e-5 |
Weight files: {id}.gamma, {id}.beta, {id}.running_mean, {id}.running_var (each shape: channels).
Output shape: same as input.
layer bn1 = BatchNorm();
layer bn2 = BatchNorm(epsilon: 1e-6);
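Inference-mode BatchNorm applies the stored statistics per channel: y = gamma · (x − running_mean) / sqrt(running_var + epsilon) + beta. A reference sketch of that formula:

```python
import numpy as np

def batchnorm(x, gamma, beta, running_mean, running_var, epsilon=1e-5):
    """Inference-mode BatchNorm using stored running statistics (per channel)."""
    return gamma * (x - running_mean) / np.sqrt(running_var + epsilon) + beta

# One HWC pixel with 2 channels; identity statistics leave the input
# (approximately, up to epsilon) unchanged.
x = np.array([[[1.0, -2.0]]], dtype=np.float32)
out = batchnorm(x,
                gamma=np.ones(2, dtype=np.float32),
                beta=np.zeros(2, dtype=np.float32),
                running_mean=np.zeros(2, dtype=np.float32),
                running_var=np.ones(2, dtype=np.float32))
print(out.shape)  # (1, 1, 2) — same shape as the input
```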
Dropout
Identity pass-through at inference time. Exists so that models exported from training frameworks can be represented without editing.
| Parameter | Type | Required | Default |
|---|---|---|---|
rate | Number | No | 0.5 |
The rate parameter is ignored during compilation.
Weight files: none.
Output shape: same as input.
layer drop = Dropout(rate: 0.25);
Add
Element-wise addition of two or more inputs. Requires explicit connections.
No parameters.
Constraint: all inputs must have identical shapes.
Weight files: none.
Output shape: same as each input.
layer res = Add();
Concat
Channel-wise concatenation of two or more inputs. Requires explicit connections.
| Parameter | Type | Required | Default |
|---|---|---|---|
axis | Integer | No | -1 |
Constraint: all inputs must have identical shapes except along the concatenation axis.
Weight files: none.
Output shape: input shape with dimension along axis summed across inputs.
layer merged = Concat();
layer merged = Concat(axis: -1);
ReLU
Standalone activation: max(0, x).
No parameters. No weight files.
Output shape: same as input.
layer relu1 = ReLU();
Sigmoid
Standalone activation: 1 / (1 + exp(-x)).
No parameters. No weight files.
Output shape: same as input.
layer sig = Sigmoid();
Softmax
Normalized exponential activation.
| Parameter | Type | Required | Default |
|---|---|---|---|
axis | Integer | No | -1 |
No weight files.
Output shape: same as input.
layer sm = Softmax();
Connections
Implicit Sequential
When the connections block is omitted, layers are connected in declaration order — each layer receives the output of the previous layer. This is the simplest form and works for linear stacks:
model simple {
config { weights: "./weights"; io: "stdio"; }
layer input = Input(shape: [4]);
layer fc1 = Dense(units: 8, activation: "relu");
layer output = Dense(units: 2);
}
// Equivalent to: input -> fc1 -> output
Explicit Graph
When a connections block is present, it fully defines the data flow. Use this for skip connections, branches, and multi-input layers.
connections {
input -> conv1;
conv1 -> bn1;
bn1 -> relu1;
relu1 -> output;
}
Multi-Input Syntax
Layers like Add and Concat accept multiple inputs using bracket syntax:
[input, bn2] -> res; // feeds both 'input' and 'bn2' into 'res'
Complete Examples
Simple MLP
A minimal multi-layer perceptron:
version 0.2;
model mlp {
config {
weights: "./weights";
io: "stdio";
}
layer input = Input(shape: [4]);
layer fc1 = Dense(units: 16, activation: "relu");
layer fc2 = Dense(units: 8, activation: "relu");
layer output = Dense(units: 3, activation: "softmax");
}
CNN with Pooling
An MNIST digit classifier with convolution and pooling:
version 0.2;
// MNIST handwritten digit classifier
model mnist_classifier {
config {
precision: "float32";
weights: "./weights";
target: "avx2";
batch: 1;
preprocess: "normalize_0_1";
io: "stdio";
}
layer input = Input(shape: [28, 28, 1]);
layer conv1 = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
layer pool1 = MaxPool2D(kernel: 2);
layer flatten = Flatten();
layer fc1 = Dense(units: 128, activation: "relu");
layer output = Dense(units: 10, activation: "softmax");
}
ResNet Block with Skip Connections
A residual block using explicit connections and Add:
version 0.2;
model resnet_block {
config {
precision: "float32";
weights: "./weights";
target: "generic";
io: "stdio";
}
layer input = Input(shape: [32, 32, 64]);
layer conv1 = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
layer bn1 = BatchNorm();
layer relu1 = ReLU();
layer conv2 = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
layer bn2 = BatchNorm();
layer res = Add();
layer relu2 = ReLU();
connections {
input -> conv1;
conv1 -> bn1;
bn1 -> relu1;
relu1 -> conv2;
conv2 -> bn2;
[input, bn2] -> res; // skip connection
res -> relu2;
}
}
CLI Reference
nnc is the NNL compiler. It compiles .nnl neural network definitions into standalone, zero-dependency native artifacts.
nnc compile
Compile an NNL model to a native artifact.
nnc compile <source.nnl> [--emit exe|obj|lib|shared|header] [-o <output>] [--target-triple <triple>]
Flags
| Flag | Description | Default |
|---|---|---|
--emit <format> | Output format: exe, obj, lib, shared, header | exe |
-o, --output <path> | Output file path | Source stem with appropriate extension |
--target-triple <triple> | Target triple for cross-compilation | Host platform |
Emit formats
| Format | Output | Extension | Notes |
|---|---|---|---|
exe | Standalone executable with main() (reads stdin, writes stdout) | <stem> | Default |
obj | Relocatable object file | <stem>.o | Also generates <stem>.h |
lib | Static archive | lib<stem>.a | Also generates <stem>.h |
shared | Shared library | lib<stem>.so | Also generates <stem>.h |
header | C header only | <stem>.h | No compilation step |
For obj, lib, and shared, a .h header declaring the public C API is generated alongside the output.
Examples
# Compile to a standalone executable (default)
nnc compile mnist.nnl
# Compile to a static library + header
nnc compile mnist.nnl --emit lib -o build/libmnist.a
# Compile to a shared library
nnc compile mnist.nnl --emit shared
# Compile to an object file
nnc compile mnist.nnl --emit obj -o mnist.o
# Generate only the C header
nnc compile mnist.nnl --emit header
# Cross-compile for ARM Cortex-M
nnc compile mnist.nnl --emit obj --target-triple thumbv7em-none-eabi
# Cross-compile for bare-metal ARM
nnc compile model.nnl --emit lib --target-triple arm-none-eabi
nnc inspect
Print a model summary: layers, types, output shapes, parameter counts, and memory estimates.
nnc inspect <source.nnl>
Example
nnc inspect mnist.nnl
Example output:
Model: mnist_classifier (version 0.2)
Precision: float32 | Target: avx2 | Batch: 1
Layer Type Output Shape Params
──────────────────────────────────────────────────────
input Input [28, 28, 1] 0
conv1 Conv2D [26, 26, 32] 320
pool1 MaxPool2D [13, 13, 32] 0
flatten Flatten [5408] 0
fc1 Dense [128] 692,352
output Dense [10] 1,290
──────────────────────────────────────────────────────
Total params: 693,962
Weight memory: 2.65 MB
Workspace: 86.5 KB (static buffer)
Workspace: 86.5 KB (static buffer)
nnc import
Convert an ONNX model into NNL format with extracted weight files.
nnc import <model.onnx> [-o <output.nnl>] [--weights-dir <dir>]
Flags
| Flag | Description | Default |
|---|---|---|
-o, --output <path> | Output .nnl file path | Source name with .nnl extension |
--weights-dir <dir> | Directory to write extracted .npy weight files | ./weights |
Notes
- Each ONNX initializer is extracted as a separate .npy file in the weights directory.
- Unsupported ONNX operators are emitted as comments in the generated .nnl file.
Examples
# Import with defaults (resnet.nnl + ./weights/)
nnc import resnet.onnx
# Specify output path and weights directory
nnc import resnet.onnx -o models/resnet.nnl --weights-dir models/weights
nnc test
Compile a model, run inference on a given input, and compare the output element-wise against expected values.
nnc test <source.nnl> --input <input.npy> --expected <expected.npy> [--tolerance <tol>]
Flags
| Flag | Description | Default |
|---|---|---|
--input <path> | Path to input tensor (.npy, float32) | Required |
--expected <path> | Path to expected output tensor (.npy, float32) | Required |
--tolerance <tol> | Maximum allowed absolute difference per element | 1e-5 |
Behavior
- Compiles the model to a temporary executable.
- Feeds the input tensor via stdin as raw float32 bytes.
- Reads the output tensor from stdout.
- Compares each element against the expected tensor.
- Reports up to 10 individual mismatches, then a summary.
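The same stdin/stdout protocol can be driven from any language. The sketch below uses `cat` as a stand-in identity "model" so it runs without a compiled binary — substitute the path to your compiled executable in practice:

```python
import subprocess
import numpy as np

def run_model(exe, x):
    """Feed a float32 tensor over stdin and decode the float32 stdout reply."""
    result = subprocess.run([exe], input=x.astype(np.float32).tobytes(),
                            capture_output=True, check=True)
    return np.frombuffer(result.stdout, dtype=np.float32)

x = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
y = run_model("cat", x)  # 'cat' echoes stdin, so output == input here
assert np.allclose(y, x, atol=1e-5)
print("max diff:", np.abs(y - x).max())
```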
Examples
# Test with default tolerance (1e-5)
nnc test mnist.nnl --input test_input.npy --expected test_output.npy
# Test with relaxed tolerance
nnc test mnist.nnl --input test_input.npy --expected test_output.npy --tolerance 1e-3
Example pass output:
PASS: 10/10 elements within tolerance 1.0e-5 (max diff: 3.42e-7)
Example fail output:
mismatch at [3]: got 0.72341299, expected 0.72345012, diff 3.71e-5
mismatch at [7]: got 0.10002345, expected 0.10010000, diff 7.66e-5
FAIL: 2/10 elements exceed tolerance 1.0e-5 (max diff: 7.66e-5)
Exit Codes
| Code | Meaning |
|---|---|
0 | Success (compilation succeeded, test passed, import/inspect completed) |
1 | Error (syntax error, validation failure, compilation error, test mismatch, I/O error) |
Environment
| Requirement | Purpose |
|---|---|
| Rust toolchain | Building nnc from source |
C compiler (cc, gcc, or clang) on PATH | Used by nnc compile to produce native artifacts |
Cross-compiler (e.g., arm-none-eabi-gcc) | Required when using --target-triple for cross-compilation |
Examples
The examples/ directory contains complete, self-contained models with pre-generated weights and test data. Each example includes:
- A .nnl model definition
- A weights/ directory with .npy weight files
- test_input.npy and expected_output.npy for verification
Simple MLP (examples/model/)
Architecture: [4] → Dense(3) → Dense(2)
A minimal multi-layer perceptron with no activation functions — useful as a smoke test for the compiler pipeline.
Model definition
version 0.2;
model test_mlp {
config {
weights: "./weights";
io: "stdio";
}
layer input = Input(shape: [4]);
layer fc1 = Dense(units: 3);
layer fc2 = Dense(units: 2);
}
- Input: 4 floats
- fc1: Dense layer with 3 units (no activation), weights: fc1.weight.npy [4×3], fc1.bias.npy [3]
- fc2: Dense layer with 2 units (no activation), weights: fc2.weight.npy [3×2], fc2.bias.npy [2]
- Output: 2 floats
Compile and test
# Compile to a standalone executable
nnc compile examples/model/model.nnl --emit exe -o mlp
# Verify against known test data
nnc test examples/model/model.nnl \
--input examples/model/test_input.npy \
--expected examples/model/expected_output.npy
MNIST CNN (examples/mnist/)
Architecture: [28,28,1] → Conv2D(32) → MaxPool2D(2) → Flatten → Dense(128, relu) → Dense(10, softmax)
A convolutional neural network for MNIST handwritten digit classification.
Model definition
version 0.2;
// MNIST handwritten digit classifier
model mnist_classifier {
config {
precision: "float32";
weights: "./weights";
target: "avx2";
batch: 1;
preprocess: "normalize_0_1";
io: "stdio";
}
layer input = Input(shape: [28, 28, 1]);
layer conv1 = Conv2D(filters: 32, kernel: 3, stride: 1, padding: "valid");
layer pool1 = MaxPool2D(kernel: 2);
layer flatten = Flatten();
layer fc1 = Dense(units: 128, activation: "relu");
layer output = Dense(units: 10, activation: "softmax");
}
Layer breakdown
| Layer | Operation | Output shape | Notes |
|---|---|---|---|
input | Input | [28, 28, 1] | Single-channel grayscale image (HWC) |
conv1 | Conv2D | [26, 26, 32] | 32 filters, 3×3 kernel, valid padding |
pool1 | MaxPool2D | [13, 13, 32] | 2×2 pooling window |
flatten | Flatten | [5408] | 13 × 13 × 32 = 5408 |
fc1 | Dense + ReLU | [128] | Fully connected with ReLU activation |
output | Dense + Softmax | [10] | 10-class probability distribution |
Preprocessing
preprocess: "normalize_0_1" divides each input pixel by 255.0, mapping raw [0, 255] byte values to [0.0, 1.0] floats. This is applied automatically in the generated inference code.
Compile and test
nnc compile examples/mnist/mnist.nnl --emit exe -o mnist
nnc test examples/mnist/mnist.nnl \
--input examples/mnist/test_input.npy \
--expected examples/mnist/expected_output.npy
ResNet Block (examples/resnet_block/)
Architecture: A residual block with skip connection using explicit connections and Add.
This example demonstrates non-sequential layer graphs — the connections block allows arbitrary wiring between layers, including multi-input layers like Add.
Model definition
version 0.2;
model resnet_block {
config {
precision: "float32";
weights: "./weights";
target: "generic";
io: "stdio";
}
layer input = Input(shape: [32, 32, 64]);
layer conv1 = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
layer bn1 = BatchNorm();
layer relu1 = ReLU();
layer conv2 = Conv2D(filters: 64, kernel: 3, stride: 1, padding: "same");
layer bn2 = BatchNorm();
layer res = Add();
layer relu2 = ReLU();
connections {
input -> conv1;
conv1 -> bn1;
bn1 -> relu1;
relu1 -> conv2;
conv2 -> bn2;
[input, bn2] -> res;
res -> relu2;
}
}
Skip connection explained
The key line is [input, bn2] -> res; — this feeds both the original input and the output of bn2 into the Add layer, creating the residual shortcut:
input ──→ conv1 → bn1 → relu1 → conv2 → bn2 ──┐
│ │
└──────────────────────────────────────────→ Add → relu2
Without the connections block, layers are connected sequentially in declaration order. The connections block overrides this default with explicit wiring.
Weight files
BatchNorm layers require four weight files each:
- bn1.gamma.npy, bn1.beta.npy — learned scale and shift
- bn1.running_mean.npy, bn1.running_var.npy — running statistics from training
Compile and test
nnc compile examples/resnet_block/resnet_block.nnl --emit exe -o resnet_block
nnc test examples/resnet_block/resnet_block.nnl \
--input examples/resnet_block/test_input.npy \
--expected examples/resnet_block/expected_output.npy
ONNX Import (examples/import_test/)
Demonstrates the round-trip workflow: generate an ONNX model in Python, import it into NNL, compile, and verify.
Architecture: [4] → Dense(3, relu) → Dense(2)
Step 1: Generate the ONNX model
cd examples/import_test
python3 gen_mlp.py
This creates:
- model.onnx — the ONNX model with embedded weights
- input.npy — test input [1.0, 2.0, 3.0, 4.0]
- expected.npy — expected output computed from the same weights
Step 2: Import into NNL
nnc import examples/import_test/model.onnx \
-o examples/import_test/model.nnl \
--weights-dir examples/import_test/weights
This produces a .nnl file and extracts weight tensors into the weights/ directory as .npy files.
Step 3: Compile
nnc compile examples/import_test/model.nnl --emit exe -o import_mlp
Step 4: Test
nnc test examples/import_test/model.nnl \
--input examples/import_test/input.npy \
--expected examples/import_test/expected.npy
What gen_mlp.py does
The script builds a two-layer MLP with fixed weights using the ONNX helper API:
- Layer 1: Gemm (matrix multiply + bias) → Relu
- Layer 2: Gemm
It uses deterministic weights so the expected output can be computed exactly and verified after the NNL round-trip.
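With deterministic weights, computing the expected output is just two matrix products. A NumPy sketch of the idea — the weight values below are illustrative placeholders, not the values gen_mlp.py actually uses:

```python
import numpy as np

# Illustrative fixed weights (NOT the actual values in gen_mlp.py)
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9],
               [1.0, 1.1, 1.2]], dtype=np.float32)  # [4, 3]
b1 = np.zeros(3, dtype=np.float32)
W2 = np.array([[1.0, -1.0],
               [0.5,  0.5],
               [-0.5, 1.0]], dtype=np.float32)      # [3, 2]
b2 = np.zeros(2, dtype=np.float32)

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)  # matches input.npy above
expected = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2     # Gemm -> Relu -> Gemm
print(expected.shape)  # (2,)
```

Saving `expected` with np.save gives the reference tensor that nnc test compares against after the round-trip.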
Creating Your Own Model
1. Write the .nnl file
Define your architecture with layer declarations and an optional connections block:
version 0.2;
model my_model {
config {
weights: "./weights";
io: "stdio";
}
layer input = Input(shape: [784]);
layer fc1 = Dense(units: 64, activation: "relu");
layer fc2 = Dense(units: 10, activation: "softmax");
}
2. Create the weights directory
Each layer expects specific .npy files named <layer_id>.<param>.npy:
| Layer type | Weight files |
|---|---|
| Dense | <id>.weight.npy, <id>.bias.npy |
| Conv2D | <id>.weight.npy, <id>.bias.npy |
| BatchNorm | <id>.gamma.npy, <id>.beta.npy, <id>.running_mean.npy, <id>.running_var.npy |
3. Generate weights with NumPy
import numpy as np
np.save("weights/fc1.weight.npy", np.random.randn(784, 64).astype(np.float32))
np.save("weights/fc1.bias.npy", np.zeros(64, dtype=np.float32))
np.save("weights/fc2.weight.npy", np.random.randn(64, 10).astype(np.float32))
np.save("weights/fc2.bias.npy", np.zeros(10, dtype=np.float32))
4. Compile
nnc compile my_model.nnl --emit exe -o my_model
5. Test
Generate test inputs and expected outputs, then verify:
nnc test my_model.nnl --input test_input.npy --expected expected_output.npy
The default tolerance is 1e-5 (element-wise). Adjust with --tolerance if needed.
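The comparison is a plain element-wise absolute-difference check. An equivalent sketch in NumPy, mirroring the behavior described in the CLI reference (report up to 10 mismatches, then a summary):

```python
import numpy as np

def compare(got, expected, tolerance=1e-5, max_report=10):
    """Element-wise absolute-difference check, in the spirit of `nnc test`."""
    diffs = np.abs(got - expected)
    bad = np.flatnonzero(diffs > tolerance)
    for i in bad[:max_report]:
        print(f"mismatch at [{i}]: got {got[i]}, expected {expected[i]}, "
              f"diff {diffs[i]:.2e}")
    status = "PASS" if bad.size == 0 else "FAIL"
    print(f"{status}: {len(got) - bad.size}/{len(got)} elements within "
          f"tolerance (max diff: {diffs.max():.2e})")
    return bad.size == 0

got = np.array([0.1, 0.2, 0.3], dtype=np.float32)
assert compare(got, got.copy())  # identical tensors pass
```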
Code Generation
How It Works
nnc generates C source code from the NNL model, then invokes the system C compiler (cc/gcc/clang) to produce the final artifact. This approach is documented in DESIGN.md as ADR-001: C Codegen Backend.
The generated C contains:
- static const float weight arrays (placed in .rodata via const)
- Statically-allocated workspace buffers for activations
- The inference function body with kernel calls in topological order
- A .h header declaring the public API
Pipeline
.nnl source → nnc frontend → IR → C source → cc/gcc/clang → native binary
- Frontend — parses the .nnl file into an AST (src/syntax/)
- Semantic analysis — validates layer types, resolves connections, infers shapes (src/sema/)
- IR — builds a typed model graph with topological ordering (src/ir/)
- Weights — loads .npy / .npz / ONNX weight tensors (src/weights/)
- C emitter — generates a .c source file and .h header (src/codegen/emit.rs)
- Toolchain — invokes cc/gcc/clang and ar to produce the requested artifact (src/codegen/toolchain.rs)
Generated C API
For a model named my_model, nnc generates:
#ifndef MY_MODEL_H
#define MY_MODEL_H
#include <stdint.h>
int my_model_infer(const void *input, void *output);
int my_model_input_size(void); // total float elements in input tensor
int my_model_output_size(void); // total float elements in output tensor
#endif /* MY_MODEL_H */
- input/output are raw float arrays in row-major (HWC) layout
- Returns 0 on success
- No heap allocation during inference — all buffers are static
- All weights are embedded as static const float arrays in .rodata
Output Formats
--emit flag | File type | What’s generated | Use case |
|---|---|---|---|
exe | Standalone binary | Binary with main() that reads stdin / writes stdout | Quick testing, CLI inference |
obj | .o relocatable object | Object file + .h header | Linking into a larger C/C++ project |
lib | .a static archive | Static library + .h header | Distribution as a self-contained library |
shared | .so shared library | Shared object + .h header | Dynamic linking, plugins |
header | .h file only | Header with API declarations | Inspection, IDE integration |
Under the hood, these map to standard compiler/archiver invocations:
- exe → cc -O2 -o output source.c -lm
- obj → cc -O2 -c -o output.o source.c
- lib → cc -O2 -c + ar rcs output.a output.o
- shared → cc -O2 -shared -fPIC -o output.so source.c -lm
- header → direct file copy
Integration Example
Compile a model as a static library:
nnc compile my_model.nnl --emit lib -o libmy_model.a
This produces libmy_model.a and my_model.h in the same directory. Link them into your C project:
#include "my_model.h"
float input[784], output[10];
int main(void) {
// ... fill input[] with preprocessed data ...
int rc = my_model_infer(input, output);
if (rc != 0) return rc;
// ... use output[] ...
return 0;
}
Compile and link:
gcc -O2 -o app app.c -L. -lmy_model -lm
Alternatively, link a .o object directly:
nnc compile my_model.nnl --emit obj -o my_model.o
gcc -O2 -o app app.c my_model.o -lm
Cross-Compilation
When --target-triple is specified, nnc invokes the corresponding cross-compiler instead of cc:
nnc compile model.nnl --emit exe --target-triple arm-none-eabi -o model
# invokes: arm-none-eabi-gcc -O2 -o model model.c -lm
Combine with a SIMD target in the model config for architecture-specific optimizations:
config {
target: "arm_neon";
}
This adds -mfpu=neon to the compiler flags. Available targets and their flags:
Config target | Compiler flag |
|---|---|
"generic" | (none) |
"avx2" | -mavx2 |
"avx512" | -mavx512f |
"arm_neon" | -mfpu=neon |
Memory Model
- Static workspace buffers — all activation memory is statically allocated (static float arrays). No malloc is ever called.
- Liveness-based buffer reuse — the codegen performs liveness analysis on the layer graph and reuses buffer slots when a layer’s output is no longer needed, minimizing total activation memory.
- Weights in read-only data — all weight arrays are static const float with alignment attributes, placed in the .rodata section by the C compiler.
- Alignment — buffers and weight arrays use __attribute__((aligned(N))) for SIMD-friendly access patterns.
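The buffer-reuse idea can be illustrated with a greedy allocator over last-use indices. This is a conceptual sketch of liveness-based slot reuse, not the compiler's actual memory planner:

```python
def plan_buffers(layers, consumers):
    """Greedy slot assignment over a topologically ordered layer list.
    layers: layer ids in execution order; consumers: id -> ids that read it."""
    order = {name: i for i, name in enumerate(layers)}
    # A tensor stays live until its last consumer executes
    last_use = {name: max((order[c] for c in consumers.get(name, [])),
                          default=order[name]) for name in layers}
    slot_of, free, freed, n_slots = {}, [], set(), 0
    for i, name in enumerate(layers):
        # Reclaim slots whose tensor's last consumer has already run
        for prev, slot in slot_of.items():
            if last_use[prev] < i and prev not in freed:
                free.append(slot)
                freed.add(prev)
        if free:
            slot_of[name] = free.pop()
        else:
            slot_of[name] = n_slots
            n_slots += 1
    return slot_of, n_slots

# A linear chain needs only two alternating workspace slots:
chain = ["input", "fc1", "fc2", "out"]
uses = {"input": ["fc1"], "fc1": ["fc2"], "fc2": ["out"]}
slots, n = plan_buffers(chain, uses)
print(n)  # 2 — ping-pong between two buffers
```

With a skip connection (as in the ResNet block example), the input tensor stays live until the Add executes, so its slot cannot be reclaimed and a third slot is needed.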
Weight Files
Supported Formats
| Format | Description |
|---|---|
Directory of .npy files | Each file named {layer_id}.{param}.npy (e.g., fc1.weight.npy, fc1.bias.npy) |
.npz archive | Keys must match {layer_id}.{param} (e.g., fc1.weight, fc1.bias) |
Naming Convention
The weights config key points to the weight source. nnc resolves it relative to the .nnl file’s directory.
config {
weights: "weights/"; // directory of .npy files
// or
weights: "model.npz"; // single .npz archive
}
Expected Shapes Per Layer
| Layer | Parameter | Shape |
|---|---|---|
| Dense | weight | [input_dim, units] |
| Dense | bias | [units] |
| Conv2D | weight | [filters, in_channels, kH, kW] |
| Conv2D | bias | [filters] |
| BatchNorm | gamma | [channels] |
| BatchNorm | beta | [channels] |
| BatchNorm | running_mean | [channels] |
| BatchNorm | running_var | [channels] |
Data Types
| Precision | Weight dtype |
|---|---|
"float32" | float32 |
"float64" | float64 |
Generating Test Weights (Python)
import numpy as np
# Create weights matching a Dense layer with 784 inputs and 128 units
np.save("fc1.weight.npy", np.random.randn(784, 128).astype(np.float32))
np.save("fc1.bias.npy", np.zeros(128, dtype=np.float32))
# Or bundle into an .npz archive
np.savez("model.npz",
**{"fc1.weight": np.random.randn(784, 128).astype(np.float32),
"fc1.bias": np.zeros(128, dtype=np.float32)})
Error Messages
| Error | Meaning | Fix |
|---|---|---|
| E003: missing weight | A layer expects a weight file or key that was not found in the weight source. | Ensure the weight source contains an entry named {layer_id}.{param} for every parameterised layer. |
| Shape mismatch | The shape of a loaded weight does not match what the layer definition expects (e.g., expected [784, 128] but found [128, 784]). | Regenerate or transpose the weight so its shape matches the table above. |
ONNX Import
Overview
nnc import converts ONNX models to NNL format with extracted weights.
nnc import model.onnx -o model.nnl
Supported ONNX Operators
| ONNX Op | NNL Layer |
|---|---|
| Gemm / MatMul | Dense |
| Conv | Conv2D |
| MaxPool | MaxPool2D |
| AveragePool | AvgPool2D |
| Flatten | Flatten |
| BatchNormalization | BatchNorm |
| Dropout | Dropout |
| Add | Add |
| Concat | Concat |
| Relu | ReLU |
| Sigmoid | Sigmoid |
| Softmax | Softmax |
Weight Handling
- Weights are extracted from ONNX initializers and saved as individual .npy files.
- Gemm nodes with transB=1 have their weights automatically transposed to the NNL [in, out] layout.
- The batch dimension is stripped from input shapes.
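The transB handling amounts to a transpose. A minimal NumPy sketch of the layout conversion (the array names are illustrative):

```python
import numpy as np

# An ONNX Gemm with transB=1 stores B as [out_features, in_features];
# NNL's Dense expects [in, out], so the importer transposes on extraction.
B_onnx = np.arange(6, dtype=np.float32).reshape(2, 3)  # [out=2, in=3]
W_nnl = np.ascontiguousarray(B_onnx.T)                 # [in=3, out=2]

x = np.array([1.0, 1.0, 1.0], dtype=np.float32)
# Gemm computes x @ B.T when transB=1 — identical to x @ W_nnl
assert np.allclose(x @ B_onnx.T, x @ W_nnl)
print(W_nnl.shape)  # (3, 2)
```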
Unsupported Operators
Operators without a mapping are emitted as comments in the generated .nnl file:
// UNSUPPORTED: Reshape(reshape_0)
These require manual resolution — replace the comment with an equivalent NNL layer or restructure the model before export.
Round-Trip Workflow
- Train your model in PyTorch, TensorFlow, or another framework.
- Export to ONNX (e.g., torch.onnx.export(model, dummy, "model.onnx")).
- Import into NNL: nnc import model.onnx -o model.nnl
- Compile and run: nnc compile model.nnl -o model && ./model
- Test outputs against the original framework to verify correctness.
Limitations
- Only float32 weights are supported.
- External data is not supported — weights must be embedded in the .onnx file.
- Dynamic shapes are not supported; all dimensions must be fixed at export time.
NNL Design Decisions
MVP Constraints (v0.2)
These constraints scope the first end-to-end release of nnc:
- float32 only. int8 quantization requires unspecified metadata (scaling, saturation semantics) and is deferred.
- generic target first. AVX2/AVX512/ARM NEON specialization is Phase 7 work.
- Single input tensor, single output tensor. The C ABI (model_name_infer(const void*, void*)) assumes one of each. Multi-input/output is a future spec extension.
- HWC tensor convention. Activation shapes exclude batch; config.batch is prepended internally. Shapes like [28, 28, 1] follow height × width × channels ordering.
- Weight format priority: directory of .npy files → .npz archives → .onnx initializers.
- Bare .npy is valid only when exactly one tensor is expected; otherwise a clear error is emitted.
ADR-001: C Codegen Backend
Status: Accepted
Date: 2026-04-18
Context
nnc must produce standalone native artifacts (executables, object files, static libraries, shared libraries) with embedded weights, zero heap allocation, and a C-ABI inference function. Three backend strategies were evaluated:
- Direct machine code emission — write raw instructions and ELF/Mach-O/PE object files using a crate like object.
- LLVM or Cranelift — generate IR for an existing compiler backend.
- Emit C source → invoke system C compiler.
Decision
Emit C source code and invoke the host (or cross) C compiler (cc/gcc/clang) to produce final machine code.
nnc generates a .c file containing:
- static const float weight arrays (placed in .rodata via const)
- A statically-allocated workspace buffer
- The inference function body with kernel calls in topological order
- A .h header declaring the public API
Then it invokes the system C compiler to produce the requested output format.
Rationale
Free optimizations. gcc -O2 / clang -O2 provides vectorization, loop unrolling, constant folding, and register allocation. These are professional-grade optimizations that would take months to replicate.
Trivial output format support. All --emit modes map to standard compiler/archiver invocations:
- --emit obj → cc -c
- --emit exe → cc (with a generated main())
- --emit lib → cc -c + ar rcs
- --emit shared → cc -shared
Cross-compilation for free. --target-triple thumbv7em-none-eabi simply invokes arm-none-eabi-gcc. The entire cross-compilation ecosystem already exists.
Automatic C ABI compliance. The spec requires a C calling convention. Emitting actual C guarantees correct struct layout, alignment, and calling convention — no manual ABI bugs.
SIMD via intrinsics. Target-specific kernels (Phase 7) emit C intrinsic calls like _mm256_fmadd_ps(). The C compiler handles register allocation and instruction scheduling.
Debuggability. The generated .c file is human-readable and inspectable. An --emit-c debug flag can preserve it for verification.
Small implementation surface. The C emitter is ~1000–2000 lines of Rust write!() calls. An LLVM backend would be 5–10× larger.
Tradeoff
The only real downside is a runtime dependency on a C compiler on the build machine. This is the same requirement as cargo (which invokes cc for build scripts and -sys crates), Go’s cgo, and Zig’s build system. For nnc’s embedded/safety-critical audience, a C cross-toolchain is always present.
When to Reconsider
Replace the C backend only if:
- Embedding very large weight arrays as C literals exceeds C compiler memory limits.
- Fused kernels require control that C intrinsics cannot express.
- nnc must be fully self-contained with zero external tool dependencies.
At that point, only the last codegen stage is replaced — the frontend, IR, semantic analysis, and memory planner remain unchanged.