Uploading Training Data

Uploading unlabeled data is the first step. HiveGuard ingests it, adds it to the challenge pool, and begins serving items to real visitors as labeling tasks.

Supported formats

The CLI accepts JSONL and CSV files.

JSONL

One item per line:

{"dataset_id": "DS_ID", "data_ref": "https://example.com/cat.jpg", "modality": "image"}
{"dataset_id": "DS_ID", "data_ref": "https://example.com/dog.jpg", "modality": "image"}

Required fields:

Field       Type     Description
dataset_id  UUID     The dataset this item belongs to
data_ref    string   URL or identifier for the item content
modality    string   image, text, or audio

Optional fields:

Field           Type     Description
challenge_type  string   Type of question to ask (dataset default if omitted)
prompt          string   Override the default prompt for this item
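A JSONL file in this shape can be generated programmatically. A minimal Python sketch (the dataset ID and URLs are placeholders, not real values):

```python
import json

# Placeholder dataset ID from `hiveguard datasets create`
DATASET_ID = "00000000-0000-0000-0000-000000000001"

items = [
    {"data_ref": "https://example.com/cat.jpg"},
    {"data_ref": "https://example.com/dog.jpg",
     "prompt": "Is this a dog?"},  # optional per-item override
]

with open("my_items.jsonl", "w") as f:
    for item in items:
        # Required fields first, then any optional per-item overrides
        record = {"dataset_id": DATASET_ID, "modality": "image", **item}
        f.write(json.dumps(record) + "\n")
```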

CSV

Same fields, as columns:

dataset_id,data_ref,modality
00000000-0000-0000-0000-000000000001,https://example.com/cat.jpg,image
00000000-0000-0000-0000-000000000001,https://example.com/dog.jpg,image
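The same CSV can be written with Python's standard csv module; this sketch uses a placeholder dataset ID:

```python
import csv

DATASET_ID = "00000000-0000-0000-0000-000000000001"  # placeholder
refs = ["https://example.com/cat.jpg", "https://example.com/dog.jpg"]

with open("my_items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["dataset_id", "data_ref", "modality"])
    writer.writeheader()  # header row is required for CSV uploads
    for ref in refs:
        writer.writerow({"dataset_id": DATASET_ID,
                         "data_ref": ref,
                         "modality": "image"})
```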

Creating a dataset first

Items belong to a dataset. A dataset groups items of the same modality and configures the question to ask. Create one before uploading:

hiveguard datasets create "Product Images" --modality image
# id: 00000000-0000-0000-0000-000000000001

You can have multiple datasets — one for each distinct labeling task. A dataset for image classification, another for text sentiment, another for audio identification. Traffic is distributed across whichever datasets are active.

Uploading

hiveguard items upload my_items.jsonl

Output:

Uploading 2500 items in 3 batches...
✓ Batch 1/3: 1000 items created
✓ Batch 2/3: 1000 items created
✓ Batch 3/3: 500 items created
Total: 2500 items created

Items are sent in batches of 1,000. If a batch fails, the error is reported and remaining batches continue.
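The batch-and-continue behavior can be sketched as follows; `upload_batch` here is a hypothetical stand-in for the API call, not a real HiveGuard function:

```python
def upload_in_batches(items, upload_batch, batch_size=1000):
    """Send items in fixed-size batches; report failures but keep going."""
    created, errors = 0, []
    total = (len(items) + batch_size - 1) // batch_size
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        batch_no = i // batch_size + 1
        try:
            upload_batch(batch)  # hypothetical API call
            created += len(batch)
            print(f"Batch {batch_no}/{total}: {len(batch)} items created")
        except Exception as exc:
            # A failed batch is recorded; remaining batches still run
            errors.append((batch_no, exc))
    return created, errors
```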

Validation

The CLI validates each item before sending:

  • dataset_id must be a valid UUID
  • modality must be image, text, or audio
  • CSV files must have all required columns

Errors show the line number:

Error on line 3: invalid modality "video" (must be image, text, or audio)

Large datasets

For datasets with tens of thousands of items, the batching keeps memory usage flat. The CLI streams line-by-line and never loads the whole file into memory.
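Streaming batching of this kind can be sketched with a generator that reads line by line and never materializes the full file:

```python
import json

def stream_batches(path, batch_size=1000):
    """Yield lists of parsed items, reading the file one line at a time."""
    batch = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []  # only one batch is ever held in memory
    if batch:
        yield batch  # final partial batch
```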

Practical throughput: around 2,000–5,000 items per minute depending on network and server load.

Verifying the upload

hiveguard items list | head -20
hiveguard datasets show DS_ID

The show command reports how many items are in the dataset and how many have been labeled so far.

What happens after upload

Items enter the unknown pool immediately. HiveGuard starts serving them in challenges as traffic arrives. Each visitor who solves a challenge answers one unknown item, and their answer is recorded as a vote. When enough votes accumulate and agree, the label is finalized.
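The vote-accumulation step might look roughly like this. The thresholds here are purely illustrative, not HiveGuard's actual parameters:

```python
from collections import Counter

# Illustrative thresholds -- not HiveGuard's actual parameters
MIN_VOTES = 5
MIN_AGREEMENT = 0.8

def finalize_label(votes):
    """Return the winning answer once enough votes agree, else None."""
    if len(votes) < MIN_VOTES:
        return None  # not enough votes accumulated yet
    answer, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= MIN_AGREEMENT:
        return answer  # consensus reached; label is finalized
    return None  # votes disagree; keep collecting
```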

You don’t need to do anything else. Labeling runs passively as long as traffic flows.

Adding ground-truth items

Ground-truth items are items you already know the answer to. HiveGuard uses them to verify that visitors are paying attention. Without them, random clickers can contaminate your labels.

To mark an item as ground truth, include the known answer at upload time:

{"dataset_id": "DS_ID", "data_ref": "https://example.com/img.jpg", "modality": "image", "is_ground_truth": true, "ground_truth_answer": "cat"}
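Ground-truth items can be written the same way as regular items, with the two extra fields set. A sketch, using a placeholder dataset ID and hand-labeled examples:

```python
import json

DATASET_ID = "00000000-0000-0000-0000-000000000001"  # placeholder

# Items whose labels are already known, e.g. from a hand-checked sample
known = [
    ("https://example.com/cat1.jpg", "cat"),
    ("https://example.com/dog1.jpg", "dog"),
]

with open("ground_truth.jsonl", "w") as f:
    for url, answer in known:
        f.write(json.dumps({
            "dataset_id": DATASET_ID,
            "data_ref": url,
            "modality": "image",
            "is_ground_truth": True,
            "ground_truth_answer": answer,
        }) + "\n")
```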

Aim for at least 5–10 ground-truth items per dataset before going live. See Ground Truth Items for guidance on what makes a good GT item.