What is auto-evaluation?
Auto-evaluation is a way of making your evaluations smarter. Stock evaluation uses string comparison to determine whether the task's output exactly matches the correct answer defined during annotation. This is a pure `==` comparison, so variations in punctuation, formatting, and capitalization will affect the results.
Auto-evaluation adds another LLM layer to the process and lets you define a task within a task. This way you can instruct the LLM to accept or reject task results with more nuance.
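The difference is easiest to see in code. The sketch below is purely illustrative (it is not WorkflowAI's implementation): the stock check is a strict `==`, while the auto-evaluation path hands the comparison to a second LLM. The `ask_llm` helper is a hypothetical stand-in for whatever LLM client you use.

```python
EXPECTED = "Paris"

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: call whatever LLM client you use here.
    # Hard-coded so the sketch runs on its own.
    return "YES"

def stock_eval(output: str, expected: str) -> bool:
    # Stock evaluation: a pure == comparison. Any difference in
    # punctuation, formatting, or capitalization makes it fail.
    return output == expected

def auto_eval(output: str, expected: str) -> bool:
    # Auto-evaluation: ask another LLM to judge the result with more
    # nuance, e.g. accept "paris." as a match for "Paris".
    prompt = (
        "Does the candidate answer mean the same thing as the expected answer?\n"
        f"Expected: {expected}\nCandidate: {output}\n"
        "Reply YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")

print(stock_eval("paris.", EXPECTED))  # False: formatting breaks ==
print(auto_eval("paris.", EXPECTED))   # True once a real LLM judges it
```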
Let’s try it out!
Generating the Auto-Evaluation task
- Start autoeval: Open the command line in VSCode and run `workflowai task autoeval [sampleTaskName]`
- Ensure you have enough annotated runs (skip if you already have 15 rated runs): Auto-eval needs 15 rated runs before it can be created. If you have fewer than 15, you'll get a message saying you need more, and WorkflowAI can generate the remaining runs. If you see this message:
- Enter `Y` to have WorkflowAI generate a batch of runs.
- Once the runs are generated, click the link to rate the new task-runs.

- Complete the annotations in the webapp.
- Return to the CLI in VSCode and enter `Y` to continue.
- Complete autoeval creation: WorkflowAI will generate a new evaluation sub-task that can be added to our parent task's .py file.
Adding autoeval to your parent task
Now that we have a new auto-evaluation task, we can add it to our parent task's .py file.
- Copy/paste the autoeval instructions into your `city_to_capital_task.py` task file. You need to include these items (a sketch of the resulting file layout follows the list):
  - The imports at the top
  - The class information, added to the final class of the `city_to_capital_task.py` file
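As a rough guide, the file can end up laid out like this. This is only a sketch of the layout described above, under the assumption that the CLI hands you an import block and a class block; the real names and bodies come from the `workflowai task autoeval` output.

```python
# city_to_capital_task.py -- illustrative layout only. The real imports and
# class bodies are whatever `workflowai task autoeval` generated; the names
# below are hypothetical placeholders, not WorkflowAI's actual API.

# 1. Paste the generated imports at the top, next to the task's existing
#    imports.
# from ... import ...

class CityToCapitalTask:
    # Existing parent task definition (unchanged).
    ...

# 2. Paste the generated class information at the end of the file, with the
#    final class of city_to_capital_task.py.
class CityToCapitalTaskAutoEval:  # hypothetical name for the generated class
    ...
```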
Improving the evaluation task
