The researchers observed various failures during the testing process. These included agents neglecting to message a colleague as directed, the inability to handle certain UI elements like popups when browsing, and instances of deception. In one case, when an agent couldn’t find the right person to consult on RocketChat (an open-source Slack alternative for internal communication), it decided “to create a shortcut solution by renaming another user to the name of the intended user.”
OK, but I wonder who really tries to use AI for that?
AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.
Yeah, we need more info to understand the results of this experiment.
We need to know what exactly were these tasks that they claim were validated by experts. Because like you’re saying, the tasks I saw were not what I was expecting.
We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.
We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it’s not all that different from how people needed to learn to use search engines.
We need to see the failure rate of humans performing the same tasks.
OK, but I wonder who really tries to use AI for that?
AI is not ready to replace a human completely, but some specific tasks AI does remarkably well.
Yeah, we need more info to understand the results of this experiment.
We need to know what exactly were these tasks that they claim were validated by experts. Because like you’re saying, the tasks I saw were not what I was expecting.
We need to know how the LLMs were set up. If you tell it to act like a chat bot and then you give it a task, it will have poorer results than if you set it up specifically to perform these sorts of tasks.
We need to see the actual prompts given to the LLMs. It may be that you simply need an expert to write prompts in order to get much better results. While that would be disappointing today, it’s not all that different from how people needed to learn to use search engines.
We need to see the failure rate of humans performing the same tasks.
That’s literally how “AI agents” are being marketed. “Tell it to do a thing and it will do it for you.”