What is auto-evaluation?

Auto-evaluation is a way of making your evaluations smarter. Stock evaluation uses string comparison to determine whether the task’s output exactly matches the correct answer defined during annotation. This is a pure == check, so variations in punctuation, formatting, and capitalization will affect the results.
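
For example, an exact-match check treats trivially different answers as failures (a minimal illustration of the problem, not WorkflowAI’s actual comparison code):

```python
# Exact string comparison is brittle: any difference in case,
# punctuation, or whitespace counts as a failure.
expected = "Paris"

print("Paris" == expected)    # True
print("paris" == expected)    # False: capitalization differs
print("Paris." == expected)   # False: trailing punctuation
print(" Paris " == expected)  # False: surrounding whitespace
```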

Auto-evaluation adds another LLM layer to the process and lets you define a task within a task. This way you can instruct the LLM to accept or reject task results with more nuance.
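
Conceptually, that extra layer is an LLM judge that decides whether the output means the same thing as the expected answer. A minimal sketch of the idea, assuming you pass in your own LLM client as a callable (this is the general pattern, not WorkflowAI’s API):

```python
from typing import Callable


def judge_output(task_output: str, expected: str, ask_llm: Callable[[str], str]) -> bool:
    """Ask an LLM judge whether the output matches the expected answer in meaning,
    ignoring differences in capitalization, punctuation, and formatting."""
    prompt = (
        "Does the candidate answer convey the same information as the expected answer? "
        "Ignore capitalization, punctuation, and formatting.\n"
        f"Expected: {expected!r}\n"
        f"Candidate: {task_output!r}\n"
        "Reply with YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")
```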

Let’s try it out!


Generating the Auto-Evaluation task

  1. Start autoeval: Open the command line in VSCode and run workflowai task autoeval [sampleTaskName]
  2. Ensure you have enough annotated runs (skip if you already have 15 rated runs): Auto-eval needs 15 rated runs before it can be created. If you have fewer than 15, you’ll get a message saying you need more and offering to have WorkflowAI generate the remaining runs. If you see this message:
    1. Enter Y to have WorkflowAI generate a batch of runs.
    2. Once the runs are generated, click the link to rate the new task-runs.
    3. Complete the annotations in the webapp.
      Learn how to annotate runs here
    4. Return to the CLI in VSCode and enter Y to continue.

Now that we’ve completed our annotations, we can finish the autoeval setup back in the CLI.

  3. Complete autoeval creation: WorkflowAI will generate a new evaluation sub-task that can be added to our parent task’s .py file.

The newly created evaluation task will look something like this:
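
The exact contents depend on your parent task, and the generated code uses WorkflowAI’s own task classes rather than the plain Pydantic models shown here; this is only a rough sketch of the shape, with assumed class and field names:

```python
from pydantic import BaseModel


class CityToCapitalTaskEvaluationInput(BaseModel):
    # The parent task's input and output, plus the correct answer from annotation.
    task_input: str
    task_output: str
    correct_answer: str


class CityToCapitalTaskEvaluationOutput(BaseModel):
    # The judge's verdict and a short explanation.
    is_correct: bool
    reason: str
```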


Adding autoeval to your parent task

Now that we have a new auto-evaluation task, we can add it to our parent task’s .py file.

  1. Copy/paste the autoeval instructions into your city_to_capital_task.py file (a rough sketch of the result follows this list). You need to include these items:
    1. The imports at the top of the file
    2. The class information, added as the final class of the city_to_capital_task.py file
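
As a sketch of where everything lands, here is roughly what the file might look like afterwards. The parent task classes and the pasted evaluation class shown here are assumptions for illustration; your generated code will differ:

```python
# city_to_capital_task.py

from pydantic import BaseModel
# 1. The imports copied from the generated autoeval code go here,
#    at the top of the file, alongside the existing imports.


class CityToCapitalTaskInput(BaseModel):
    city: str


class CityToCapitalTaskOutput(BaseModel):
    capital: str


# 2. The generated class information is pasted here, as the final
#    class of the file (see the sketch in the previous section).
class CityToCapitalTaskEvaluationOutput(BaseModel):
    is_correct: bool
    reason: str
```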

Improving the evaluation task


