Corpus

The corpus view is where you upload, browse, filter, and extend your text data.

Creating a Corpus

Click New corpus on the corpus list page and upload a CSV or XLSX file. The first row must be a header row. The system picks one column as the primary text column automatically: it first looks for a column named text, content, or body, and falls back to heuristics if none of these is present.
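The selection order described above can be sketched roughly as follows. This is an illustration only, not the product's actual implementation: the function name pick_primary_text_column, the case-insensitive name matching, and the fallback heuristic (longest average cell length) are all assumptions, since the documentation does not say which heuristics are used.

```python
PREFERRED_NAMES = ("text", "content", "body")


def pick_primary_text_column(header, rows):
    """Return the header name to treat as the primary text column.

    header: list of column names from the first row of the upload.
    rows:   list of dicts mapping column name -> cell value.
    """
    if not header:
        raise ValueError("the upload must start with a header row")

    # Step 1: look for the documented names, in the documented order.
    # Matching case-insensitively is an assumption.
    lowered = {name.strip().lower(): name for name in header}
    for candidate in PREFERRED_NAMES:
        if candidate in lowered:
            return lowered[candidate]

    # Step 2: hypothetical fallback heuristic. The real heuristics are not
    # documented; here we pick the column with the longest average cell.
    def average_length(column):
        values = [str(row.get(column) or "") for row in rows]
        return sum(len(v) for v in values) / len(values) if values else 0.0

    return max(header, key=average_length)
```

For example, a file with the headers id, Body, title would resolve to Body under these assumptions.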

All columns are stored in the corpus and shown in the document table.

Partitions Column

If your upload includes a column named partitions, its values are interpreted as comma-separated partition names (e.g. train, dev). The system creates those partitions automatically and assigns each document to the listed partitions.
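As a rough sketch of how a partitions cell could be interpreted, the snippet below splits the value on commas and groups row indices by partition name. The helper names are hypothetical, and trimming whitespace, skipping empty names, and ignoring repeated names are assumptions rather than documented behaviour.

```python
def parse_partitions(cell):
    """Split a partitions cell such as "train, dev" into partition names."""
    if cell is None:
        return []
    names = []
    for part in str(cell).split(","):
        name = part.strip()  # assumption: surrounding whitespace is ignored
        if name and name not in names:  # assumption: blanks and repeats are dropped
            names.append(name)
    return names


def collect_partitions(rows):
    """Map each partition name to the indices of the rows assigned to it."""
    assignments = {}
    for index, row in enumerate(rows):
        for name in parse_partitions(row.get("partitions")):
            assignments.setdefault(name, []).append(index)
    return assignments
```

Under these assumptions a row with the value "train, dev" ends up in both the train and the dev partition, and a row with an empty cell is assigned to none.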

Browsing Documents

The document table shows paginated rows with all corpus columns. You can:

  • Search across all columns or within a specific column
  • Sort by any column (ascending or descending)
  • Filter by partition — select one or more partitions in the left panel, or show only documents not assigned to any partition

Click a cell to edit it inline. Click the row ID to open a detail dialog for that document.
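The search, sort, and partition filter options listed above can be pictured with the following minimal sketch. The data shape (a dict with columns and partitions keys), the function names, the case-insensitive substring search, the "any selected partition" filter semantics, and the page size are all assumptions made for illustration.

```python
def matches_search(doc, query, column=None):
    """Substring search across all columns, or within one column."""
    if not query:
        return True
    values = [doc["columns"].get(column, "")] if column else doc["columns"].values()
    return any(query.lower() in str(value).lower() for value in values)


def matches_partitions(doc, selected=(), unassigned_only=False):
    """Keep documents in any selected partition, or only unassigned ones."""
    if unassigned_only:
        return not doc["partitions"]
    if not selected:
        return True
    return any(name in doc["partitions"] for name in selected)


def browse(docs, query="", column=None, selected=(), unassigned_only=False,
           sort_by=None, descending=False, page=1, page_size=25):
    """Return one page of documents after filtering and sorting."""
    rows = [
        doc for doc in docs
        if matches_search(doc, query, column)
        and matches_partitions(doc, selected, unassigned_only)
    ]
    if sort_by:
        # Assumption: values are compared as strings.
        rows.sort(key=lambda d: str(d["columns"].get(sort_by, "")), reverse=descending)
    start = (page - 1) * page_size
    return rows[start:start + page_size]
```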

Managing Columns

From the Columns panel you can:

  • Rename a column — the new name is reflected across all documents immediately
  • Delete a column — removes the key from every document's metadata
  • Reorder columns — drag to change the display order

The primary text column cannot be deleted.
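The rename and delete operations, together with the rule protecting the primary text column, might look like this in outline. The sketch assumes each document's metadata is a plain dict and that display order is kept in a separate list; both the data model and the function names are hypothetical.

```python
def rename_column(docs, column_order, old, new):
    """Rename a key in every document's metadata and in the display order."""
    if new in column_order:
        raise ValueError(f"a column named {new!r} already exists")
    position = column_order.index(old)  # raises ValueError if old is unknown
    for doc in docs:
        if old in doc:
            doc[new] = doc.pop(old)
    column_order[position] = new


def delete_column(docs, column_order, name, primary_text_column):
    """Remove a key from every document's metadata."""
    if name == primary_text_column:
        raise ValueError("the primary text column cannot be deleted")
    for doc in docs:
        doc.pop(name, None)
    column_order.remove(name)
```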

Add Rows

To append new documents to an existing corpus, open the corpus and click Add rows.

Workflow

  1. Download template — Download a CSV or XLSX template containing the exact column headers the system expects. The template includes data columns, a partitions column, and optional annotation:{variable} and llm:{variable} columns for pre-loaded annotation values.
  2. Prepare your file — Fill in the template. You may include only a subset of columns; missing columns will be stored as empty values in the new documents.
  3. Validate — Upload the file to the validate step to check for errors and warnings before committing.
  4. Upload — Confirm and upload. The system appends the new rows.
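To make steps 1 and 2 concrete, here is a minimal sketch of how a template header row could be assembled and how a row that omits some columns could be padded with empty values. The function names and the column ordering (data columns, then partitions, then annotation and llm columns) are assumptions; only the column naming scheme comes from the description above.

```python
import csv
import io


def template_headers(data_columns, variables):
    """Build the header row: data columns, partitions, then variable columns."""
    headers = list(data_columns) + ["partitions"]
    headers += [f"annotation:{name}" for name in variables]
    headers += [f"llm:{name}" for name in variables]
    return headers


def template_csv(data_columns, variables):
    """Return an empty CSV template (header row only) as a string."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(template_headers(data_columns, variables))
    return buffer.getvalue()


def fill_missing(row, data_columns):
    """Store an empty value for every data column the uploaded row omits."""
    return {column: row.get(column, "") for column in data_columns}
```

For a corpus with the data columns text and source and one variable named sentiment, the template header row would be text, source, partitions, annotation:sentiment, llm:sentiment under these assumptions.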

Validation Rules

Rule                     Behaviour
Unknown column           Upload fails with an error listing the unrecognised column name
Duplicate ID             Upload fails if an id column value already exists in the corpus
Invalid variable value   Upload fails if a column matches a variable name and the value is not a valid label
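The three rules can be approximated by the validation sketch below. It is not the system's validator: the function name, the error message wording, the set of always-allowed columns (id and partitions), and the reading of "a column matches a variable name" as an annotation:{variable} or llm:{variable} column are assumptions.

```python
PREFIXES = ("annotation:", "llm:")


def validate_rows(rows, data_columns, existing_ids, variables):
    """Return a list of error strings; an empty list means the upload is valid.

    variables maps each variable name to the set of its valid labels.
    """
    errors = []
    known = set(data_columns) | {"id", "partitions"}
    known |= {prefix + name for prefix in PREFIXES for name in variables}

    # Rule 1: unknown columns, reported once per column.
    seen_columns = []
    for row in rows:
        for column in row:
            if column not in seen_columns:
                seen_columns.append(column)
    for column in seen_columns:
        if column not in known:
            errors.append(f"unknown column: {column}")

    for line_no, row in enumerate(rows, start=2):  # line 1 is the header row
        # Rule 2: the id must not already exist in the corpus.
        doc_id = row.get("id")
        if doc_id and doc_id in existing_ids:
            errors.append(f"row {line_no}: duplicate id {doc_id}")
        # Rule 3: values in variable columns must be valid labels.
        for column, value in row.items():
            if not column.startswith(PREFIXES) or not value:
                continue
            variable = column.split(":", 1)[1]
            labels = variables.get(variable)
            if labels is not None and value not in labels:
                errors.append(f"row {line_no}: {value!r} is not a valid label for {variable}")
    return errors
```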

Warnings

If the corpus has active annotation workflows or fully completed LLM pipelines, the system shows confirmation warnings before uploading:

  • "This will create new annotation tasks for your annotators" — new tasks are generated automatically for every active workflow that covers the corpus or partition.
  • "The LLM annotation will change its state from finished to in progress" — adding rows leaves new documents without LLM outputs, reverting the pipeline to incomplete.
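A small sketch of when each confirmation warning would appear, assuming workflows and pipelines are represented as dicts with hypothetical active and state fields (the real data model is not documented here). The message strings are the ones quoted above.

```python
TASKS_WARNING = "This will create new annotation tasks for your annotators"
LLM_WARNING = "The LLM annotation will change its state from finished to in progress"


def upload_warnings(workflows, llm_pipelines):
    """Return the confirmation warnings to show before appending rows."""
    warnings = []
    # Hypothetical field: "active" marks a running annotation workflow.
    if any(workflow.get("active") for workflow in workflows):
        warnings.append(TASKS_WARNING)
    # Hypothetical field: "state" is "finished" for a fully completed pipeline.
    if any(pipeline.get("state") == "finished" for pipeline in llm_pipelines):
        warnings.append(LLM_WARNING)
    return warnings
```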

Annotation Columns

Use annotation:{variable_name} columns to pre-populate human annotation values for new documents. The values must use human-readable labels (e.g. positive), not internal IDs. Tasks created for these documents are immediately marked as annotated.

Use llm:{variable_name} columns to pre-populate LLM output values for new documents.
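The following sketch shows one way an uploaded row could be separated into plain data, pre-loaded human annotations, and pre-loaded LLM outputs based on the column prefixes. The function name and the choice to skip empty cells are assumptions.

```python
ANNOTATION_PREFIX = "annotation:"
LLM_PREFIX = "llm:"


def split_row(row):
    """Split one uploaded row into (data, annotations, llm_outputs)."""
    data, annotations, llm_outputs = {}, {}, {}
    for column, value in row.items():
        if column.startswith(ANNOTATION_PREFIX):
            if value:  # assumption: an empty cell means "no pre-loaded value"
                annotations[column[len(ANNOTATION_PREFIX):]] = value
        elif column.startswith(LLM_PREFIX):
            if value:
                llm_outputs[column[len(LLM_PREFIX):]] = value
        else:
            data[column] = value
    return data, annotations, llm_outputs
```

A row with text set to "Great film", annotation:sentiment set to "positive", and an empty llm:sentiment cell would yield one data column, one human annotation (sentiment: positive), and no LLM output.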

Export

Click Export to download the entire corpus as an XLSX file. The export includes:

  • All data columns in corpus order
  • The partitions and document_id columns
  • Human annotation columns named {variable} ({annotator}) for every annotator who has submitted at least one annotation in a workflow covering this corpus
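The export header layout described in the list above can be sketched as follows. The function name, the input shape (a list of (variable, annotator) pairs for submitted annotations), and the exact position of the partitions and document_id columns relative to the data columns are assumptions.

```python
def export_headers(data_columns, submitted_annotations):
    """Build the XLSX header row for an export.

    submitted_annotations: iterable of (variable, annotator) pairs, one per
    submitted annotation in a workflow covering the corpus.
    """
    headers = list(data_columns) + ["partitions", "document_id"]
    annotation_columns = []
    for variable, annotator in submitted_annotations:
        name = f"{variable} ({annotator})"
        if name not in annotation_columns:  # one column per variable and annotator
            annotation_columns.append(name)
    return headers + annotation_columns
```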