Exporting Labels

As visitors answer challenges, labels accumulate. When enough answers on an item agree, the consensus engine finalizes a label. You can export finalized labels at any time, and they are ready for model training.

Quick export

# JSONL to stdout
hiveguard labels export --fmt jsonl
# Save to file
hiveguard labels export --fmt csv --output labels.csv
# JSON, pipe through jq
hiveguard labels export --fmt json | jq '.[] | select(.confidence > 0.9)'

Export formats

| Format | Best for |
| --- | --- |
| csv | pandas, scikit-learn, spreadsheets |
| jsonl | large datasets, streaming pipelines |
| json | small datasets, inspection |

Export from a specific dataset

hiveguard datasets export DATASET_ID --fmt csv --output dataset_labels.csv

This streams directly from the server — safe for datasets with millions of rows.

What each label contains

| Field | Type | Description |
| --- | --- | --- |
| item_id | UUID | The item that was labeled |
| data_ref | string | URL of the item content |
| modality | string | image, text, or audio |
| label | string | The consensus answer |
| confidence | float | Fraction of solvers who agreed (0.0–1.0) |
| solver_count | int | Number of human solvers who answered |
| created_at | ISO 8601 | When the label was finalized |
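
To make the schema concrete, here is a minimal sketch that parses one JSONL record into a typed Python object. It assumes each exported line is a JSON object whose keys match the field names listed above. The `Label` class, the `parse_label` function, and the sample record are illustrative and not part of HiveGuard.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from uuid import UUID


@dataclass(frozen=True)
class Label:
    """One finalized label (hypothetical helper type, not a HiveGuard API)."""
    item_id: UUID
    data_ref: str
    modality: str
    label: str
    confidence: float
    solver_count: int
    created_at: datetime


def parse_label(line: str) -> Label:
    """Parse one JSONL line, assuming keys match the documented field names."""
    raw = json.loads(line)
    confidence = float(raw["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalize it.
    created = datetime.fromisoformat(raw["created_at"].replace("Z", "+00:00"))
    return Label(
        item_id=UUID(raw["item_id"]),
        data_ref=raw["data_ref"],
        modality=raw["modality"],
        label=raw["label"],
        confidence=confidence,
        solver_count=int(raw["solver_count"]),
        created_at=created,
    )


# Made-up sample record for illustration only.
SAMPLE = (
    '{"item_id": "123e4567-e89b-12d3-a456-426614174000", '
    '"data_ref": "https://example.com/items/1.png", "modality": "image", '
    '"label": "cat", "confidence": 0.92, "solver_count": 25, '
    '"created_at": "2024-01-31T12:00:00Z"}'
)
print(parse_label(SAMPLE))
```

Validating the confidence range at parse time surfaces a malformed export early, before it reaches a training pipeline.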

Filtering by confidence

High-confidence labels are more reliable. Filter before feeding to a training pipeline:

# Only labels where ≥90% of solvers agreed
hiveguard labels export --fmt jsonl | \
  jq -c 'select(.confidence >= 0.9)' > high_conf.jsonl

Or in Python:

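import pandas as pd
# Load a CSV export, e.g. one written with --fmt csv --output labels.csv
df = pd.read_csv("labels.csv")
# Keep only labels where at least 80% of solvers agreed
train_df = df[df["confidence"] >= 0.8]
# Ready for model training
print(f"{len(train_df)} high-confidence labels")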
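
For exports too large to load into a DataFrame, the same filter can be applied line by line. The sketch below assumes a JSONL export with one JSON object per line, as produced by `--fmt jsonl`. The `filter_labels` function and its `min_solvers` threshold are illustrative names, not HiveGuard features.

```python
import io
import json
from typing import IO, Iterator


def filter_labels(lines: IO[str], min_confidence: float = 0.9,
                  min_solvers: int = 1) -> Iterator[dict]:
    """Yield records that meet both thresholds, reading one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if (record["confidence"] >= min_confidence
                and record["solver_count"] >= min_solvers):
            yield record


# Made-up records standing in for a real export file.
demo = io.StringIO(
    '{"item_id": "a", "label": "cat", "confidence": 0.95, "solver_count": 20}\n'
    '{"item_id": "b", "label": "dog", "confidence": 0.60, "solver_count": 20}\n'
    '{"item_id": "c", "label": "cat", "confidence": 1.00, "solver_count": 2}\n'
)
kept = list(filter_labels(demo, min_confidence=0.9, min_solvers=5))
print([r["item_id"] for r in kept])  # prints ['a']
```

With a real export, pass an open file handle in place of `demo`. Checking `solver_count` alongside `confidence` is a design choice: a confidence of 1.0 means less when only two solvers answered.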

Automating regular exports

Schedule exports in a cron job or CI pipeline:

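#!/bin/bash
# Exit on errors, unset variables, and failed pipes
set -euo pipefail
# Date stamp for the file name, e.g. 20240131
DATE=$(date +%Y%m%d)
hiveguard labels export --fmt jsonl --output "labels_${DATE}.jsonl"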

This pattern works well for nightly training runs: export the current labels, train, deploy.
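
If the nightly job is driven from Python rather than a shell script, the export step might look like the sketch below. It uses only the `labels export --fmt jsonl --output` invocation shown above and assumes the `hiveguard` CLI is on `PATH`. The names `export_command` and `run_export` and the `exports` directory are illustrative.

```python
import subprocess
from datetime import date
from pathlib import Path


def export_command(day: date, out_dir: Path) -> list[str]:
    """Build the same CLI call as the shell script, with a dated file name."""
    output = out_dir / f"labels_{day:%Y%m%d}.jsonl"
    return ["hiveguard", "labels", "export",
            "--fmt", "jsonl", "--output", str(output)]


def run_export(day: date, out_dir: Path) -> Path:
    """Run the export and return the output path; raises if the CLI fails."""
    cmd = export_command(day, out_dir)
    subprocess.run(cmd, check=True)  # non-zero exit raises CalledProcessError
    return Path(cmd[-1])


# Show the command for a fixed date without running it.
print(" ".join(export_command(date(2024, 1, 31), Path("exports"))))
```

A scheduled job would call `run_export(date.today(), Path("exports"))` and hand the returned path to the training step. Because `check=True` raises on failure, training never starts from a missing or partial file.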

Labels are living data

HiveGuard continues collecting labels after export. Labels with low confidence are automatically re-queued for re-validation, which collects additional answers. If a previously exported label later loses agreement, its confidence drops, which is a signal to re-export and re-check.

For production pipelines, export on a schedule rather than once. The dataset improves over time.
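
One way to act on that signal is to compare two exports and list the items that changed. The sketch below assumes JSONL exports keyed by `item_id`, and that a re-validated item keeps its `item_id` across exports. The `index_by_item` and `changed_labels` functions and the `min_drop` threshold are illustrative, not HiveGuard features.

```python
import json
from typing import Iterable


def index_by_item(lines: Iterable[str]) -> dict[str, dict]:
    """Map item_id -> record for one JSONL export."""
    records = (json.loads(line) for line in lines if line.strip())
    return {r["item_id"]: r for r in records}


def changed_labels(old: dict[str, dict], new: dict[str, dict],
                   min_drop: float = 0.05) -> list[dict]:
    """List items whose label changed or whose confidence fell by at least min_drop."""
    changes = []
    for item_id, before in old.items():
        after = new.get(item_id)
        if after is None:
            continue  # item absent from the newer export; handle as needed
        drop = before["confidence"] - after["confidence"]
        if after["label"] != before["label"] or drop >= min_drop:
            changes.append({
                "item_id": item_id,
                "old_label": before["label"],
                "new_label": after["label"],
                "confidence_drop": round(drop, 4),
            })
    return changes


# Made-up data standing in for yesterday's and today's export files.
yesterday = index_by_item([
    '{"item_id": "a", "label": "cat", "confidence": 0.95}',
    '{"item_id": "b", "label": "dog", "confidence": 0.90}',
])
today = index_by_item([
    '{"item_id": "a", "label": "cat", "confidence": 0.95}',
    '{"item_id": "b", "label": "dog", "confidence": 0.70}',
])
for change in changed_labels(yesterday, today):
    print(change)
```

Here only item `b` is reported, because its confidence fell from 0.90 to 0.70. Items flagged this way are candidates for removal from the training set until a later export restores their confidence.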