Quickstart
Build a worker, attach it from Haybarn, and call a function. About five minutes.
Prerequisites
- JDK 21+ (JDK 22+ to enable the shared-memory transport).
- Haybarn (or any DuckDB engine with the vgi extension); see the callout below.
- The examples/ project from this repo.
Don't have Haybarn yet?
Haybarn is Query Farm's DuckDB-derived engine; it ships the vgi extension in its community channel. Run its shell with whichever tool you already have — no separate install step:
npx haybarn@rc # via Node (the @rc tag is the current release)
uvx haybarn-cli # via uv (install: curl -LsSf https://astral.sh/uv/install.sh | sh)
Inside the shell, enable the extension once per session:
INSTALL vgi FROM community;
LOAD vgi;
The vgi extension currently ships for Haybarn; a DuckDB release is on the way, and a worker you write now will work with it unchanged.
1. Add the dependency
A worker needs exactly one dependency.
New to Gradle?
Gradle is the build tool most JVM projects use. You don't install it — the examples/ project ships a wrapper script (./gradlew) that downloads the right version on first run. The build.gradle.kts file below declares your project: where to fetch libraries (mavenCentral()), which ones (dependencies { … }), and how to package it (application). The coordinate farm.query:vgi:0.1.0 is group:artifact:version — Gradle resolves it from Maven Central. Running ./gradlew installDist then produces a self-contained, runnable worker.
plugins { application }
repositories { mavenCentral() }
dependencies {
    implementation("farm.query:vgi:0.1.0")
    runtimeOnly("org.slf4j:slf4j-simple:2.0.16") // any SLF4J binding
}
application {
    mainClass.set("farm.query.vgi.examples.AllInOneWorker")
    applicationDefaultJvmArgs = listOf(
        "--add-opens=java.base/java.nio=ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED",
    )
}
Those two JVM flags are required: Arrow needs java.nio access and the shared-memory transport makes native calls. The -parameters compiler flag is also required; see JVM flags.
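To make the effect of the two flags concrete, the sketch below checks for them at startup. `JvmFlagCheck` is a hypothetical helper, not part of the vgi library. It relies only on standard JDK calls (`RuntimeMXBean.getInputArguments` and `Module.isOpen`). The verbatim string match is a rough heuristic, since flags can reach the JVM in other forms; the `isOpen` check is the more direct signal for `--add-opens`.

```java
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper for illustration; not part of the vgi library.
final class JvmFlagCheck {
    // The two flags from the build file above.
    static final List<String> REQUIRED = List.of(
        "--add-opens=java.base/java.nio=ALL-UNNAMED",
        "--enable-native-access=ALL-UNNAMED");

    /** Returns the required flags that do not appear verbatim in the given JVM arguments. */
    static List<String> missingFlags(List<String> jvmArgs) {
        return REQUIRED.stream().filter(flag -> !jvmArgs.contains(flag)).toList();
    }

    /** True when java.nio is reflectively open to this class's module (the effect of --add-opens). */
    static boolean javaNioIsOpen() {
        return ByteBuffer.class.getModule().isOpen("java.nio", JvmFlagCheck.class.getModule());
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("missing flags: " + missingFlags(jvmArgs));
        System.out.println("java.nio open: " + javaNioIsOpen());
    }
}
```

Running it through the generated launch script should report no missing flags; running it with a bare `java` command should list both.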
Prefer Maven?
The dependency is the same coordinate; only the build file differs. In pom.xml:
<dependency>
<groupId>farm.query</groupId>
<artifactId>vgi</artifactId>
<version>0.1.0</version>
</dependency>
Pass the JVM flags via the exec-maven-plugin (or your run script), and enable the -parameters flag by setting <parameters>true</parameters> in the maven-compiler-plugin configuration. The Gradle examples/ project is the supported path; Maven works identically at the library level.
2. Write a worker
A worker is a class whose main method registers functions and calls runFromArgs:
// VGI-Java example: a scalar function.
//
// A scalar function maps each input row to one output row. You extend
// `ScalarFn` and write a single `compute()` method; the framework reads its
// parameter annotations to derive the SQL signature, the output type, and the
// per-batch dispatch. There is no schema boilerplate to write by hand.
//
// Run it on its own:
// ./gradlew runScalar --args="--unix /tmp/scalar.sock --idle-timeout 60"
// then from Haybarn:
// ATTACH 'demo' AS demo (TYPE vgi, LOCATION 'launch:/abs/path/bin/runScalar');
// SELECT demo.upper_case('hello'); -- HELLO
package farm.query.vgi.examples;
import farm.query.vgi.Worker;
import farm.query.vgi.scalar.ScalarFn;
import farm.query.vgi.scalar.Vector;
import org.apache.arrow.vector.VarCharVector;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
/** {@code upper_case(value VARCHAR) -> VARCHAR}: ASCII/Unicode uppercase. */
public final class ScalarExample extends ScalarFn {
    @Override public String name() { return "upper_case"; }
    @Override public String description() { return "Uppercase a string"; }

    // One `@Vector` input column + one trailing (unannotated) output vector.
    // The framework allocates `result`, sized to the batch row count, and
    // writes whatever you put into it back across the wire.
    //
    // Parameter rules in one breath:
    //   @Vector  -> a per-row input column (the Arrow vector type is the SQL type)
    //   @Const   -> a bind-time constant arg (long/double/String/boolean/byte[])
    //   @Setting -> a session setting (SET demo.foo = ...)
    //   last unannotated vector = the output (framework-allocated)
    public void compute(@Vector VarCharVector value, VarCharVector result) {
        int rows = value.getValueCount();
        result.allocateNew();
        for (int i = 0; i < rows; i++) {
            if (value.isNull(i)) { result.setNull(i); continue; }
            String up = new String(value.get(i), StandardCharsets.UTF_8).toUpperCase(Locale.ROOT);
            byte[] bytes = up.getBytes(StandardCharsets.UTF_8);
            result.setSafe(i, bytes, 0, bytes.length);
        }
    }

    public static void main(String[] args) {
        Worker.builder()
            .catalogName("demo")
            .registerScalar(new ScalarExample())
            .runFromArgs(args); // handles --unix / --http / --idle-timeout / stdio
    }
}
3. Build it
cd examples
./gradlew installDist
That produces a launch script at build/install/vgi-java-examples/bin/vgi-java-examples. (./run.sh does this and prints the SQL for you.)
4. Attach from Haybarn
INSTALL vgi FROM community;
LOAD vgi;
-- Use the ABSOLUTE path to the launch script.
ATTACH 'demo' AS demo (TYPE vgi,
LOCATION 'launch:/abs/path/build/install/vgi-java-examples/bin/vgi-java-examples');
SELECT demo.upper_case('hello'); -- HELLO
Why launch:?
A cold JVM takes seconds to start. The launch: LOCATION scheme starts the worker once behind a flock-coordinated Unix socket and reuses it across every query — and across every engine process on the machine. Without it, each query would pay the full JVM startup cost. You almost always want launch:.
Other LOCATION schemes exist (a bare path forks a subprocess per attach; http://host:port talks to a long-running HTTP worker). See CLI & environment.
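The "start once, reuse everywhere" idea behind launch: can be sketched with an advisory file lock. This is an illustration of the coordination pattern only, not the extension's actual launcher: `LaunchLockSketch`, the lock-file and socket paths, and the `startWorker` callback are all hypothetical, and Java's `FileChannel.lock` stands in for flock here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of lock-coordinated worker startup; not the vgi extension's code.
final class LaunchLockSketch {
    enum Role { STARTER, REUSER }

    /**
     * Takes an exclusive lock, then either reuses an existing socket or starts the worker.
     * The lock ensures that concurrent callers cannot both decide to start one.
     */
    static Role coordinate(Path lockFile, Path socket, Runnable startWorker) throws IOException {
        try (FileChannel channel = FileChannel.open(
                 lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) { // blocks until the exclusive lock is granted
            // Assumption: an existing socket file means a live worker. A real launcher
            // would also need to handle a stale socket left by a worker that has exited.
            if (Files.exists(socket)) {
                return Role.REUSER;
            }
            startWorker.run();
            return Role.STARTER;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("launch-sketch");
        Path lockFile = dir.resolve("worker.lock");
        Path socket = dir.resolve("worker.sock");
        Runnable start = () -> {
            try {
                Files.createFile(socket); // pretend the worker bound its socket
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        System.out.println(coordinate(lockFile, socket, start)); // first caller starts the worker
        System.out.println(coordinate(lockFile, socket, start)); // later callers reuse it
    }
}
```

The point of the pattern is that the expensive step (JVM startup) runs at most once per socket, while every other caller only pays for a lock acquisition and a file check.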
5. Try all five kinds
The AllInOneWorker from the examples registers one function of each kind:
-- VGI-Java quickstart — run in a Haybarn shell.
--
-- Prereq: build the worker first (`./gradlew installDist` in ../), then replace
-- the LOCATION path below with the absolute path printed by `../run.sh`.
--
-- The vgi extension must be available:
INSTALL vgi FROM community;
LOAD vgi;
-- 'launch:' starts the JVM worker once and pools it across queries.
ATTACH 'demo' AS demo (TYPE vgi,
LOCATION 'launch:/ABSOLUTE/PATH/TO/build/install/vgi-java-examples/bin/vgi-java-examples');
-- scalar — one row in, one row out
SELECT demo.upper_case('hello'); -- HELLO
-- table — a set-returning generator, streamed in batches
SELECT * FROM demo.numbers(5) ORDER BY n; -- 0,1,2,3,4
SELECT count(*) FROM (SELECT * FROM demo.numbers(1000000) LIMIT 7); -- 7 (LIMIT pushdown)
-- table-in-out — a streaming relation transform
SELECT n FROM demo.echo((SELECT * FROM demo.numbers(3))) ORDER BY n; -- 0,1,2
-- aggregate — parallel partial aggregation
SELECT g, demo.vgi_sum(v)
FROM (VALUES (1,10),(1,20),(2,5)) t(g,v) GROUP BY g ORDER BY g; -- 1->30, 2->5
-- buffering — must see all input before producing output
SELECT n FROM demo.collect((SELECT * FROM demo.numbers(4))) ORDER BY n; -- 0,1,2,3
DETACH demo;
What just happened
Three things, and together they're the whole protocol in miniature:
- Worker.builder()...runFromArgs(args) parsed --unix / --idle-timeout (added by launch:) and served the AF_UNIX transport.
- The engine called the worker's init/bind RPCs to learn each function's schema, then streamed Arrow batches for execution.
- Your compute() saw whole Arrow vectors and wrote whole Arrow vectors back, with no row-by-row marshalling anywhere in the path.
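The batch-at-a-time shape of compute() can be tried without Arrow or a running worker. In the sketch below, a plain `byte[][]` stands in for a `VarCharVector` (one UTF-8 value per row, `null` for SQL NULL). `WholeBatchSketch` is a hypothetical name, and the array representation is an assumption made only to keep the example free of dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the upper_case example; plain arrays replace Arrow vectors.
final class WholeBatchSketch {
    /** Uppercases every row of one batch, passing NULL rows through unchanged. */
    static byte[][] upperCaseBatch(byte[][] value) {
        byte[][] result = new byte[value.length][]; // one output slot per input row
        for (int i = 0; i < value.length; i++) {
            if (value[i] == null) {
                continue; // NULL in, NULL out
            }
            String up = new String(value[i], StandardCharsets.UTF_8).toUpperCase(Locale.ROOT);
            result[i] = up.getBytes(StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[][] batch = {
            "hello".getBytes(StandardCharsets.UTF_8),
            null,
            "stra\u00dfe".getBytes(StandardCharsets.UTF_8),
        };
        for (byte[] row : upperCaseBatch(batch)) {
            System.out.println(row == null ? "NULL" : new String(row, StandardCharsets.UTF_8));
        }
    }
}
```

The third row shows why the output value is written with its own length rather than the input's: uppercasing the sharp s yields "SS", so the result has more characters than the input.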
