Uploading Training Data
Uploading your unlabeled data is the first step. HiveGuard ingests it, adds it to the challenge pool, and starts serving items to real visitors as labeling tasks.
Supported formats
The CLI accepts JSONL and CSV files.
JSONL
One item per line:

    {"dataset_id": "DS_ID", "data_ref": "https://example.com/cat.jpg", "modality": "image"}
    {"dataset_id": "DS_ID", "data_ref": "https://example.com/dog.jpg", "modality": "image"}

Required fields:
| Field | Type | Description |
|---|---|---|
| `dataset_id` | UUID | The dataset this item belongs to |
| `data_ref` | string | URL or identifier for the item content |
| `modality` | string | `image`, `text`, or `audio` |
Optional fields:
| Field | Type | Description |
|---|---|---|
| `challenge_type` | string | Type of question to ask (dataset default if omitted) |
| `prompt` | string | Override the default prompt for this item |
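
If your item URLs live in a list or a database export, a short script can produce a file in the JSONL shape described above. The following is a minimal sketch, not part of HiveGuard: `to_jsonl` is a hypothetical helper, and `DATASET_ID` is a placeholder for the UUID of your own dataset.

```python
import json

# Placeholder: substitute the UUID of your own dataset.
DATASET_ID = "00000000-0000-0000-0000-000000000001"


def to_jsonl(urls, dataset_id, modality="image", prompt=None):
    """Return JSONL text with one upload item per URL (hypothetical helper)."""
    lines = []
    for url in urls:
        item = {"dataset_id": dataset_id, "data_ref": url, "modality": modality}
        if prompt is not None:
            # Optional field: overrides the default prompt for this item.
            item["prompt"] = prompt
        lines.append(json.dumps(item))
    # One JSON object per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)


jsonl_text = to_jsonl(
    ["https://example.com/cat.jpg", "https://example.com/dog.jpg"],
    DATASET_ID,
)
print(jsonl_text, end="")
```

Writing `jsonl_text` to a file such as `my_items.jsonl` gives you something the CLI can upload.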
CSV
Same fields, as columns:

    dataset_id,data_ref,modality
    00000000-0000-0000-0000-000000000001,https://example.com/cat.jpg,image
    00000000-0000-0000-0000-000000000001,https://example.com/dog.jpg,image

Creating a dataset first
Items belong to a dataset. A dataset groups items of the same modality and configures the question to ask. Create one before uploading:

    hiveguard datasets create "Product Images" --modality image
    # id: 00000000-0000-0000-0000-000000000001

You can have multiple datasets, one for each distinct labeling task: for example, one for image classification, another for text sentiment, and another for audio identification. Traffic is distributed across whichever datasets are active.
Uploading

    hiveguard items upload my_items.jsonl

Output:

    Uploading 2500 items in 3 batches...
      ✓ Batch 1/3: 1000 items created
      ✓ Batch 2/3: 1000 items created
      ✓ Batch 3/3: 500 items created
    Total: 2500 items created

Items are sent in batches of 1,000. If a batch fails, the error is reported and remaining batches continue.
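
To make the batching behavior concrete, here is a rough Python sketch of the same idea: split the items into chunks of 1,000, send each chunk, and keep going when one fails. It is not the CLI's actual code. `send_batch` is a hypothetical stand-in for the real network request, and `flaky_send` exists only to simulate a failure.

```python
BATCH_SIZE = 1000  # batch size stated in this guide


def split_batches(items, size=BATCH_SIZE):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def upload_all(items, send_batch):
    """Send every batch; record failures and continue with the rest.

    `send_batch` is a hypothetical callable that raises on failure.
    """
    created, errors = 0, []
    for number, batch in enumerate(split_batches(items), start=1):
        try:
            send_batch(batch)
            created += len(batch)
        except Exception as exc:
            # Report the error, then move on to the next batch.
            errors.append((number, str(exc)))
    return created, errors


def flaky_send(batch):
    """Fake sender for demonstration: fails on the batch starting at item 1000."""
    if batch[0] == 1000:
        raise RuntimeError("simulated server error")


created, errors = upload_all(list(range(2500)), flaky_send)
print(created, errors)  # 1500 [(2, 'simulated server error')]
```

In this simulation the second batch fails, yet the third batch of 500 items is still sent, which matches the continue-on-failure behavior described above.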
Validation
The CLI validates each item before sending:
- `dataset_id` must be a valid UUID
- `modality` must be `image`, `text`, or `audio`
- CSV files must have all required columns
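
If you want to catch problems before running the CLI, the checks listed above can be approximated in a few lines of Python. This is a sketch rather than the CLI's actual implementation: `validate_item` is a hypothetical helper, and only the invalid-modality message follows the wording used in this guide. The other messages are invented for illustration.

```python
import uuid

VALID_MODALITIES = ("image", "text", "audio")
REQUIRED_FIELDS = ("dataset_id", "data_ref", "modality")


def validate_item(item, line_number):
    """Return a list of error messages for one parsed item (empty if valid)."""
    prefix = f"Error on line {line_number}: "
    errors = []
    for field in REQUIRED_FIELDS:
        if not item.get(field):
            # Assumed wording; the guide does not show this message.
            errors.append(prefix + f'missing required field "{field}"')
    dataset_id = item.get("dataset_id")
    if dataset_id:
        try:
            # uuid.UUID also accepts some non-canonical spellings (e.g. no hyphens),
            # so this check may be more lenient than the CLI's.
            uuid.UUID(str(dataset_id))
        except ValueError:
            errors.append(prefix + f'invalid dataset_id "{dataset_id}" (must be a UUID)')
    modality = item.get("modality")
    if modality and modality not in VALID_MODALITIES:
        errors.append(prefix + f'invalid modality "{modality}" (must be image, text, or audio)')
    return errors


bad_item = {
    "dataset_id": "00000000-0000-0000-0000-000000000001",
    "data_ref": "https://example.com/clip.mp4",
    "modality": "video",
}
print(validate_item(bad_item, 3))
```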
Errors show the line number:

    Error on line 3: invalid modality "video" (must be image, text, or audio)

Large datasets
For datasets with tens of thousands of items, batching keeps memory usage flat: the CLI streams the file line by line and never loads the whole file into memory.
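
The streaming idea can be illustrated with a Python generator that pulls at most one batch of lines from the file at a time. This sketch only shows the general technique and assumes nothing about how the CLI is written. `stream_batches` is a hypothetical name, and an in-memory `StringIO` stands in for a real file.

```python
import io
import json
from itertools import islice


def stream_batches(lines, size=1000):
    """Yield lists of parsed items, holding at most `size` items at a time."""
    # Generator expression: lines are parsed lazily, only when requested.
    parsed = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(parsed, size))
        if not batch:
            return
        yield batch


# A file opened with open("my_items.jsonl") iterates line by line the same way;
# StringIO is used here so the example runs without touching the disk.
fake_file = io.StringIO(
    "".join(json.dumps({"data_ref": f"item-{i}"}) + "\n" for i in range(2500))
)
sizes = [len(batch) for batch in stream_batches(fake_file)]
print(sizes)  # [1000, 1000, 500]
```

Because each batch is discarded before the next one is read, peak memory depends on the batch size, not on the file size.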
Practical throughput: around 2,000–5,000 items per minute depending on network and server load.
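
As a quick planning aid, the throughput range above translates into a rough upload-time estimate. The helper below is a hypothetical convenience, and the 50,000-item figure is only an example.

```python
def upload_minutes(item_count, low_rate=2000, high_rate=5000):
    """Return (best_case, worst_case) upload time in minutes.

    Rates are items per minute, taken from the range quoted in this guide.
    """
    return item_count / high_rate, item_count / low_rate


best, worst = upload_minutes(50_000)
print(f"{best:.0f}-{worst:.0f} minutes")  # 10-25 minutes
```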
Verifying the upload

    hiveguard items list | head -20
    hiveguard datasets show DS_ID

The show command reports how many items are in the dataset and how many have been labeled so far.
What happens after upload
Items enter the unknown pool immediately. HiveGuard starts serving them in challenges as traffic arrives. Each visitor who solves a challenge answers one unknown item, and their answer is recorded as a vote. When enough votes accumulate and agree, the label is finalized.
You don’t need to do anything else. Labeling runs passively as long as traffic flows.
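
For intuition, the "enough votes that agree" rule can be pictured as a simple consensus check. This guide does not state the actual thresholds or algorithm HiveGuard uses, so everything in the sketch below is an assumption: `finalize_label`, the minimum of 3 votes, and the 80% agreement level are illustrative values only.

```python
from collections import Counter


def finalize_label(votes, min_votes=3, min_agreement=0.8):
    """Return the winning label, or None if consensus has not been reached.

    `min_votes` and `min_agreement` are assumed values, not HiveGuard settings.
    """
    if len(votes) < min_votes:
        return None  # not enough votes yet
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label
    return None  # votes disagree too much; keep collecting


print(finalize_label(["cat", "cat"]))                       # None
print(finalize_label(["cat", "dog", "cat"]))                # None
print(finalize_label(["cat", "cat", "cat", "cat", "dog"]))  # cat
```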
Adding ground-truth items
Ground-truth items are items you already know the answer to. HiveGuard uses them to verify that visitors are paying attention. Without them, random clickers can contaminate your labels.
To mark an item as ground truth, include the known answer at upload time:

    {"dataset_id": "DS_ID", "data_ref": "https://example.com/img.jpg", "modality": "image", "is_ground_truth": true, "ground_truth_answer": "cat"}

Aim for at least 5–10 ground-truth items per dataset before going live. See Ground Truth Items for guidance on what makes a good GT item.
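
Before uploading, you may want to confirm that each dataset in your file meets the lower bound of that recommendation. The following sketch counts ground-truth items per `dataset_id` in a sequence of JSONL lines. The helper names are hypothetical, and the dataset IDs `A` and `B` are placeholders, not real UUIDs.

```python
import json
from collections import Counter

MIN_GROUND_TRUTH = 5  # lower bound of the 5–10 recommendation


def ground_truth_counts(lines):
    """Count ground-truth items per dataset_id in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        item = json.loads(line)
        # Adds 1 for ground-truth items and 0 otherwise, so every dataset appears.
        counts[item["dataset_id"]] += bool(item.get("is_ground_truth"))
    return counts


def datasets_short_on_ground_truth(lines, minimum=MIN_GROUND_TRUTH):
    """Return the dataset IDs that have fewer than `minimum` ground-truth items."""
    return sorted(ds for ds, n in ground_truth_counts(lines).items() if n < minimum)


# Sample data: dataset A has 5 ground-truth items out of 8, dataset B has none.
sample = []
for i in range(8):
    row = {"dataset_id": "A", "data_ref": f"a{i}", "modality": "image"}
    if i < 5:
        row.update(is_ground_truth=True, ground_truth_answer="cat")
    sample.append(json.dumps(row))
sample.append(json.dumps({"dataset_id": "B", "data_ref": "b0", "modality": "text"}))

print(datasets_short_on_ground_truth(sample))  # ['B']
```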