The article discusses experiments that evaluate agent performance in terms of both efficiency and effectiveness. It stresses that the number of steps taken (#steps) and the normalized score (n. score) must be read together for a holistic assessment: an agent can take few steps yet earn a poor normalized score, and a high score can come from inefficient execution. The authors therefore argue for a balanced reading of the two metrics when judging an agent, particularly in neuro-symbolic settings and in their results on the TW-Cooking test sets.
In our experiments, we evaluated the models on two metrics: the number of steps taken by the agent, #steps (lower is better), and the normalized score, n. score (higher is better).
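As an illustration of how the two metrics might be reported side by side, the following is a minimal sketch rather than the authors' evaluation code. It assumes that the normalized score is the score an agent obtained divided by the maximum score available in that game, which the text above does not state explicitly. The names `Episode`, `normalized_score`, and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """One evaluation run of an agent on a single game (hypothetical record)."""
    steps: int        # number of actions the agent took
    score: float      # score the agent obtained
    max_score: float  # highest score obtainable in this game


def normalized_score(ep: Episode) -> float:
    # Assumption: n. score = obtained score / maximum obtainable score.
    if ep.max_score <= 0:
        raise ValueError("max_score must be positive")
    return ep.score / ep.max_score


def summarize(episodes: list[Episode]) -> dict[str, float]:
    # Report both metrics together: fewer steps is better, higher n. score is better.
    return {
        "steps": mean([ep.steps for ep in episodes]),
        "n_score": mean([normalized_score(ep) for ep in episodes]),
    }


if __name__ == "__main__":
    # Made-up numbers: a quick but incomplete run and a slow but complete one.
    runs = [Episode(steps=10, score=3, max_score=4),
            Episode(steps=50, score=4, max_score=4)]
    print(summarize(runs))
```

Keeping both values in one summary makes the trade-off visible: neither a low step count nor a high normalized score is informative without the other.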