Basic Concepts

Polyphony is a platform for collaborative text annotation. It lets you upload text data, define what should be annotated, assign work to human annotators or LLM pipelines, and analyse the results. The following concepts are central to everything in the application.

Corpus

A corpus is a named collection of documents belonging to a project. You create a corpus by uploading a CSV or XLSX file. Every column in the file becomes a column in the corpus; one column is designated the primary text column and its content is used as the document text.

Document

A document is a single row in a corpus. It has a text field (the main content shown to annotators) and an arbitrary set of metadata columns taken from the uploaded file. Once created, a document changes only through an explicit edit or a bulk operation.
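To make the row-to-document mapping concrete, here is a minimal Python sketch. It is not Polyphony's implementation: the function `load_corpus`, the dictionary layout, and the column names in the sample are all assumptions chosen for illustration.

```python
import csv
import io


def load_corpus(csv_text, text_column):
    """Turn CSV text into a list of documents (hypothetical helper).

    `text_column` names the primary text column; every other column
    is kept as metadata on the document.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    if text_column not in (reader.fieldnames or []):
        raise ValueError(f"column {text_column!r} not found in file")
    documents = []
    for row in reader:
        documents.append({
            "text": row[text_column],
            "metadata": {k: v for k, v in row.items() if k != text_column},
        })
    return documents


# Illustrative data only; the column names are made up.
sample = "id,body,source\n1,Hello world,forum\n2,Second post,blog\n"
docs = load_corpus(sample, text_column="body")
print(docs[0]["text"])      # Hello world
print(docs[0]["metadata"])  # {'id': '1', 'source': 'forum'}
```

Each row yields one document, and the choice of primary text column decides which value annotators see as the document text.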

Partition

A partition is a named subset of documents within a corpus — for example train, dev, and test splits, or batch-1 and batch-2 for staggered annotation campaigns.

Partitions can be created manually or automatically from a partitions column in the uploaded CSV/XLSX file. Human workflows and LLM pipelines can be scoped to a single partition so that annotators only see the documents they are responsible for.
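Automatic partitioning from a column amounts to grouping rows by that column's value. The sketch below shows the idea in Python; the column name `partition` and the helper `group_by_partition` are assumptions, not names taken from Polyphony.

```python
from collections import defaultdict


def group_by_partition(rows, partition_column="partition"):
    """Map each partition name to the indices of the rows it contains."""
    partitions = defaultdict(list)
    for index, row in enumerate(rows):
        partitions[row[partition_column]].append(index)
    return dict(partitions)


# Illustrative rows; in practice they would come from the uploaded file.
rows = [
    {"text": "first", "partition": "train"},
    {"text": "second", "partition": "test"},
    {"text": "third", "partition": "train"},
]
groups = group_by_partition(rows)
print(groups)  # {'train': [0, 2], 'test': [1]}
```

A workflow scoped to `test` would then only need to consider row 1.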

Variable

A variable defines a single annotation dimension. It has a name and one of six types:

| Type | Description |
| --- | --- |
| Single categorical | Choose exactly one option from a list |
| Multi-categorical | Choose one or more options from a list |
| Likert scale | Rate on a numbered scale (e.g. 1–5) |
| Text | Free-form text response |
| Integer | Whole number, optionally bounded |
| Float | Decimal number, optionally bounded |

Variables are defined at the project level and can be reused across multiple annotation forms and workflows.
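One way to picture the six types is as a small data model with a validation rule per type. The following Python sketch is an assumption-laden illustration: the class names, field names, and the `is_valid` function are hypothetical and do not describe Polyphony's internal schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class VariableType(Enum):
    SINGLE_CATEGORICAL = "single_categorical"
    MULTI_CATEGORICAL = "multi_categorical"
    LIKERT = "likert"
    TEXT = "text"
    INTEGER = "integer"
    FLOAT = "float"


@dataclass
class Variable:
    name: str
    type: VariableType
    options: Sequence[str] = ()      # used by the categorical types
    minimum: Optional[float] = None  # used by Likert, integer, float
    maximum: Optional[float] = None


def _in_bounds(var, value):
    if var.minimum is not None and value < var.minimum:
        return False
    if var.maximum is not None and value > var.maximum:
        return False
    return True


def is_valid(var, value):
    """Check a submitted value against the variable's type (assumed rules)."""
    t = var.type
    if t is VariableType.SINGLE_CATEGORICAL:
        return value in var.options
    if t is VariableType.MULTI_CATEGORICAL:
        return (isinstance(value, (list, tuple, set)) and len(value) >= 1
                and all(v in var.options for v in value))
    if t is VariableType.TEXT:
        return isinstance(value, str)
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if t in (VariableType.LIKERT, VariableType.INTEGER):
        return is_number and isinstance(value, int) and _in_bounds(var, value)
    if t is VariableType.FLOAT:
        return is_number and _in_bounds(var, value)
    return False


sentiment = Variable("sentiment", VariableType.SINGLE_CATEGORICAL,
                     options=("pos", "neg"))
agreement = Variable("agreement", VariableType.LIKERT, minimum=1, maximum=5)
print(is_valid(sentiment, "pos"))  # True
print(is_valid(agreement, 6))      # False
```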

Annotation Form

An annotation form is a UI layout composed of layout blocks. Blocks can be:

  • Document viewer – displays the document text (and optionally metadata columns)
  • Input component – collects a value for one variable
  • Static text – instructions or labels for the annotator

Forms are reusable: the same form can be used by multiple human workflows.
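A form can be thought of as an ordered list of blocks, where only the input components refer to variables. The Python sketch below models that idea; every class and field name is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class DocumentViewer:
    metadata_columns: Tuple[str, ...] = ()  # optional metadata to display


@dataclass
class InputComponent:
    variable: str  # name of the variable whose value is collected


@dataclass
class StaticText:
    text: str  # instructions or a label


Block = Union[DocumentViewer, InputComponent, StaticText]


@dataclass
class AnnotationForm:
    name: str
    blocks: List[Block]

    def variables(self):
        """Names of the variables this form collects, in layout order."""
        return [b.variable for b in self.blocks
                if isinstance(b, InputComponent)]


form = AnnotationForm("sentiment-form", [
    StaticText("Read the text and pick a label."),
    DocumentViewer(metadata_columns=("source",)),
    InputComponent("sentiment"),
])
print(form.variables())  # ['sentiment']
```

Because the form holds only variable names, the same form object could be attached to several workflows without copying any variable definitions.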

Human Workflow

A human workflow links an annotation form to a corpus or partition and assigns it to one or more annotators. When a workflow is activated, the system creates individual annotation tasks. In overlap mode, every assigned annotator receives a task for every document. In split mode, the documents are distributed round-robin among the assigned annotators.

Annotators work through their tasks in the annotator interface and submit values. The workflow is complete when every task has been submitted.
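The difference between the two assignment modes is easiest to see in code. This is a minimal Python sketch under stated assumptions: the function `create_tasks`, the mode strings, and the tuple representation of a task are hypothetical, and split mode is assumed to give each document to exactly one annotator.

```python
def create_tasks(document_ids, annotators, mode):
    """Return (document, annotator) pairs for the given assignment mode."""
    if not annotators:
        raise ValueError("at least one annotator is required")
    if mode == "overlap":
        # One task per document per annotator.
        return [(doc, user) for doc in document_ids for user in annotators]
    if mode == "split":
        # Each document goes to one annotator, in round-robin order.
        return [(doc, annotators[i % len(annotators)])
                for i, doc in enumerate(document_ids)]
    raise ValueError(f"unknown mode: {mode!r}")


doc_ids = ["d1", "d2", "d3"]
users = ["ana", "ben"]
overlap_tasks = create_tasks(doc_ids, users, "overlap")
split_tasks = create_tasks(doc_ids, users, "split")
print(len(overlap_tasks))  # 6
print(split_tasks)         # [('d1', 'ana'), ('d2', 'ben'), ('d3', 'ana')]
```

Overlap mode produces the repeated judgements needed to compare annotators, while split mode covers the same documents with fewer tasks.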

LLM Pipeline

An LLM pipeline is a collection of prompts, each tied to one variable. Running a pipeline sends each document through the prompts in sequence and stores the model's response as an LLM output — the LLM equivalent of a human annotation task value.
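The run loop described above can be sketched in a few lines of Python. Nothing here reflects Polyphony's actual API: `run_pipeline`, the `{text}` placeholder, the output dictionary layout, and the stand-in `fake_model` are assumptions, and a real pipeline would call an LLM provider instead.

```python
def run_pipeline(documents, prompts, call_model):
    """Send every document through every prompt, in order.

    documents:  dict mapping document id to document text
    prompts:    list of (variable_name, template) pairs; each template
                contains a {text} placeholder (an assumed convention)
    call_model: function taking a prompt string and returning a response
    """
    outputs = []
    for doc_id, text in documents.items():
        for variable, template in prompts:
            response = call_model(template.format(text=text))
            outputs.append({"document": doc_id,
                            "variable": variable,
                            "value": response})
    return outputs


def fake_model(prompt):
    # Stand-in for a real model call, so the sketch runs offline.
    return "positive" if "great" in prompt else "negative"


documents = {"d1": "This is great", "d2": "This is bad"}
prompts = [("sentiment", "Classify the sentiment: {text}")]
outputs = run_pipeline(documents, prompts, fake_model)
print([o["value"] for o in outputs])  # ['positive', 'negative']
```

Each stored output carries the document and the variable it belongs to, which is what lets it be compared with a human annotator's value for the same pair.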

Gold Standard

A gold standard is a column derived from human annotation values by applying an aggregation strategy (majority vote, mean, median). It turns multiple annotators' responses into a single reference label per document.
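The three aggregation strategies can be illustrated with Python's standard library. This sketch is not Polyphony's code: the function names and strategy keys are hypothetical, and returning `None` on a tied majority vote is an assumption, since the tie-breaking rule is not stated here.

```python
from collections import Counter
from statistics import mean, median


def majority_vote(values):
    """Most frequent value, or None when the top two counts tie (assumed)."""
    counts = Counter(values).most_common()
    top_value, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    return top_value


STRATEGIES = {"majority": majority_vote, "mean": mean, "median": median}


def gold_standard(annotations, strategy):
    """Reduce each document's list of annotator values to one label."""
    aggregate = STRATEGIES[strategy]
    return {doc: aggregate(values) for doc, values in annotations.items()}


labels = {"d1": ["pos", "pos", "neg"], "d2": ["neg", "pos"]}
ratings = {"d1": [1, 2, 6]}
print(gold_standard(labels, "majority"))  # {'d1': 'pos', 'd2': None}
print(gold_standard(ratings, "mean"))     # {'d1': 3}
print(gold_standard(ratings, "median"))   # {'d1': 2}
```

Majority vote suits categorical variables, while mean and median only make sense for numeric ones such as Likert, integer, or float values.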

Project

A project is the top-level container. It belongs to a team and groups together corpora, variables, annotation forms, workflows, and pipelines. Access control is managed at the project level.