Introduction
HiveGuard is a self-hosted data labeling engine built on top of a reverse proxy. You upload your unlabeled dataset — images, text snippets, audio clips — and HiveGuard serves items from it as short verification tasks to visitors passing through your site. Visitors answer them, answers accumulate, and the consensus engine produces high-confidence labels.
The result: a continuously growing labeled dataset, produced entirely from traffic that was already happening. No separate annotation platform. No crowdsourcing fees. No idle queue.
The core idea
Traditional annotation pipelines have a split: your application runs over here, your labeling job runs over there, and you pay to connect them. HiveGuard collapses that split. The labeling task is the traffic gate. Your visitors are the annotators.
Your dataset (unlabeled)
        │
        ▼
HiveGuard serves items as challenges to real visitors
        │
        ▼
Visitor answers → label recorded
        │
        ▼
Consensus engine aggregates answers across visitors
        │
        ▼
High-confidence label finalized → ready for export

Every challenge simultaneously does two things: it verifies the visitor is human (useful for bot protection), and it collects a label for your dataset (useful for model training). You get both for free.
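To make the aggregation step in the flow above concrete, here is a minimal sketch of majority-vote consensus. It is an illustration only, not HiveGuard's actual consensus engine: the function name `consensus_label` and the `min_votes` and `min_agreement` thresholds are assumptions chosen for the example.

```python
from collections import Counter
from typing import Optional


def consensus_label(
    answers: list[str],
    min_votes: int = 3,          # hypothetical: answers required before deciding
    min_agreement: float = 0.8,  # hypothetical: share that must pick the top answer
) -> Optional[tuple[str, float]]:
    """Return (label, agreement) once enough visitors agree, else None."""
    if len(answers) < min_votes:
        return None  # too few answers so far; keep serving the item

    # Most common answer and how many visitors gave it.
    label, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)

    if agreement < min_agreement:
        return None  # visitors disagree too much; the item stays open
    return label, agreement


finalized = consensus_label(["cat", "cat", "cat", "dog", "cat"])
still_open = consensus_label(["cat", "dog", "cat"])
```

In this sketch an item stays in rotation until both thresholds are met, which is one simple way to trade labeling speed against label quality.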
What you upload
Your dataset consists of items — individual pieces of media to be labeled:
- Image items: a URL pointing to an image, plus the question to ask (“Is this a cat or a dog?”)
- Text items: a text snippet, plus the classification task (“Is this review positive or negative?”)
- Audio items: an audio URL, plus the identification task (“Is the speaker under 30?”)
Some items are ground truth — you already know the correct answer. These are used to verify that the visitor is paying attention (not just clicking randomly). Unknown items are the ones you actually want labeled.
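The distinction between ground-truth and unknown items can be sketched as a small data model. Everything here is hypothetical and for illustration: the `Item` class, its field names, and the rule that a challenge pairs at least one ground-truth item with the unknown items are assumptions, not HiveGuard's documented schema or behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    item_id: str
    kind: str                           # "image", "text", or "audio"
    content: str                        # URL for image/audio, raw snippet for text
    question: str                       # e.g. "Is this a cat or a dog?"
    known_answer: Optional[str] = None  # set only for ground-truth items

    @property
    def is_ground_truth(self) -> bool:
        return self.known_answer is not None


def labels_from_challenge(responses: list[tuple[Item, str]]) -> dict[str, str]:
    """Keep answers for unknown items only if every ground-truth check passed."""
    checks = [(item, answer) for item, answer in responses if item.is_ground_truth]

    # Assumption: a challenge with no ground-truth item cannot be trusted,
    # and a single wrong ground-truth answer discards the whole response.
    if not checks or any(answer != item.known_answer for item, answer in checks):
        return {}

    return {
        item.item_id: answer
        for item, answer in responses
        if not item.is_ground_truth
    }


known = Item("g1", "image", "https://example.com/a.jpg",
             "Is this a cat or a dog?", known_answer="cat")
unknown = Item("u1", "image", "https://example.com/b.jpg",
               "Is this a cat or a dog?")

accepted = labels_from_challenge([(known, "cat"), (unknown, "dog")])
rejected = labels_from_challenge([(known, "dog"), (unknown, "dog")])
```

Here a visitor who misses the ground-truth item contributes no labels, so random clicking does not leak into the dataset.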
What you get back
Finalized labels: a CSV, JSON, or JSONL file containing each item’s consensus answer and a confidence score. The confidence score reflects how many visitors agreed on the answer and how consistent their answers were.
Labels are ready to pipe straight into a training pipeline.
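As a rough sketch of that hand-off, the snippet below reads a JSONL export and keeps only labels above a confidence threshold before training. The field names `item_id`, `label`, and `confidence`, the sample records, and the 0.9 threshold are all assumptions for illustration; check your actual export for the real column names.

```python
import json
from typing import Iterable, Iterator


def confident_labels(lines: Iterable[str], threshold: float = 0.9) -> Iterator[dict]:
    """Yield export records whose confidence meets the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        if record["confidence"] >= threshold:
            yield record


# Hypothetical export contents, one JSON object per line.
sample_export = [
    '{"item_id": "img-001", "label": "cat", "confidence": 0.97}',
    '{"item_id": "img-002", "label": "dog", "confidence": 0.62}',
]

kept = list(confident_labels(sample_export))
```

With a real file, the same function accepts an open file handle, for example `confident_labels(open("labels.jsonl"))`, where the filename is again a placeholder.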
What you need to get started
- A dataset of items you want labeled (images, text, or audio)
- A running HTTP service to proxy (even a static site works)
- About 10 minutes
Continue to Quick Start to deploy HiveGuard and start collecting labels.