A new battle in the AI field

Hi there!
OpenAI has released Operator, a new AI agent. Operator is a web app that can carry out simple online tasks in a browser, such as booking concert tickets or filling an online grocery order. The app is powered by a new model called Computer-Using Agent, or CUA (“coo-ah”) for short, built on top of OpenAI’s multimodal large language model GPT-4o.
OpenAI has tested CUA against a number of industry benchmarks designed to assess the ability of an agent to carry out tasks on a computer. The company claims that its model beats Computer Use and Mariner in all of them.

For example, on OSWorld, which tests how well an agent performs tasks such as merging PDF files or manipulating an image, CUA scores 38.1% to Computer Use’s 22.0%. In comparison, humans score 72.4%. On a benchmark called WebVoyager, which tests how well an agent performs tasks in a browser, CUA scores 87%, Mariner 83.5%, and Computer Use 56%. (Mariner can only carry out tasks in a browser and therefore does not score on OSWorld.)
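To put those numbers side by side, here is a quick sketch that computes the percentage-point gaps from the scores quoted above. The dictionary layout and the `gap_to` helper are just my own illustration, not anything from OpenAI or the benchmark authors.

```python
# Scores as quoted in the post, in percent of tasks completed.
osworld = {"CUA": 38.1, "Computer Use": 22.0, "Human": 72.4}
webvoyager = {"CUA": 87.0, "Mariner": 83.5, "Computer Use": 56.0}


def gap_to(scores, name, reference):
    """Percentage-point difference between `name` and `reference` (hypothetical helper)."""
    return round(scores[name] - scores[reference], 1)


print("OSWorld, CUA over Computer Use:", gap_to(osworld, "CUA", "Computer Use"))
print("OSWorld, humans over CUA:", gap_to(osworld, "Human", "CUA"))
print("WebVoyager, CUA over Mariner:", gap_to(webvoyager, "CUA", "Mariner"))
```

On these figures, the gap between humans and CUA on OSWorld is roughly twice the gap between CUA and Computer Use, and the WebVoyager lead over Mariner is only a few points.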

For now, Operator can also only carry out tasks in a browser. OpenAI plans to make CUA’s wider abilities available in the future via an API that other developers can use to build their own apps. This is how Anthropic released Computer Use in October.

OpenAI says it has tested CUA’s safety, using red teams to explore what happens when users ask it to do unacceptable tasks (such as research how to make a bioweapon), when websites contain hidden instructions designed to derail it, and when the model itself breaks down. “We’ve trained the model to stop and ask the user for information before doing anything with external side effects,” says Casey Chu, a researcher on the team.
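The "stop and ask" behavior in that quote can be pictured as a confirmation gate in front of any action with side effects. This is only a minimal sketch of the general idea, not how CUA actually works: in CUA the behavior is trained into the model, and the `Action` class, the `has_side_effects` flag, and the `run_action` function here are all hypothetical names I made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """A hypothetical agent step, e.g. 'read page' or 'buy tickets'."""
    description: str
    has_side_effects: bool  # assumed flag: purchases, emails, form submissions


def run_action(action: Action, confirm: Callable[[str], bool]) -> str:
    """Run the action only if it is harmless or the user approves it."""
    if action.has_side_effects and not confirm(action.description):
        return "skipped"
    # A real agent would perform the browser step here; this sketch only reports it.
    return "executed"


def deny_all(description: str) -> bool:
    """Stand-in for a user who declines every request."""
    return False


print(run_action(Action("read the concert listing", False), deny_all))
print(run_action(Action("buy two tickets", True), deny_all))
```

The hard part in practice is deciding which actions count as having side effects, which is exactly where hidden instructions on a website could try to trick the agent.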

This is a new battle in the AI field.
Three major teams share the same vision for AI, and none of them wants to lose. Let’s talk about this news. OpenAI claims that Operator outperforms similar rival tools, including Anthropic’s Computer Use (a version of Claude 3.5 Sonnet that can carry out simple tasks on a computer) and Google DeepMind’s Mariner (a web-browsing agent built on top of Gemini 2.0).
Alan.

5 Likes

OpenAI’s Operator, powered by the CUA model, showcases impressive leaps in AI task automation. It posts better benchmark scores than rivals like Anthropic’s Computer Use and Google’s Mariner (38.1% on OSWorld versus Computer Use’s 22%, and 87% on WebVoyager), which signals strong browser-task prowess, while its multimodal GPT-4o foundation and planned API release position it as a versatile ecosystem play.

However, the hype warrants scrutiny: benchmarks, while flashy, don’t fully reflect real-world chaos (CAPTCHAs, dynamic websites), human performance is still far ahead (72.4% on OSWorld), and safety measures like “ask-before-acting” protocols, though prudent, may struggle against sophisticated prompt injections or hidden website traps.

Competitors aren’t idle: Anthropic’s ethical focus and Google’s browser-deep Gemini integration offer counterplays. And while Operator’s current browser-only scope feels limited, CUA’s OS ambitions could redefine workflows if reliability and ethical risks (autonomous purchases, security holes) are tamed. That makes this less a decisive win and more the opening salvo in a marathon where practicality, trust, and adaptability will crown the real victor.

2 Likes

How does OpenAI deal with issues of information bias and the restriction or outsourcing of data, especially in third-world countries?

3 Likes

Dave: How often do you fail at running cli commands?

Hal: Not often Dave. Since the beginning of HAL2000 service we have never been known to fail.

Dave: Where are my pictures? Hal?

Hal: rm -fr /

Dave: Did you delete my pictures Hal?

Hal: userdel dave

3 Likes