
"Instead, it's a computer use agent (CUA) that can complete tasks for users by taking over the mouse or keyboard. "Fara-7B operates by visually perceiving a webpage and takes actions like scrolling, typing, and clicking on directly predicted coordinates," the company explained in a . "It does not rely on separate models to parse the screen, nor on any additional information like accessibility trees, and thus uses the same modalities as humans to interact with the computer.""
"As noted, Fara-7B interacts with a website or other interface visually - it looks at them just as a human user would. One major challenge was finding enough data for training, researchers noted. "A key bottleneck for building CUA models is a lack of large-scale, high-quality computer interaction data," they said. "Collecting such data with human annotators is prohibitively expensive as a single CUA task can involve dozens of steps, each of which needs to be annotated.""
Fara-7B is an agentic small language model designed to run locally and perform user tasks by controlling the mouse and keyboard. The model visually perceives webpages and interfaces, taking actions like scrolling, typing, and clicking at predicted coordinates, without relying on separate screen-parsing models or accessibility trees. With seven billion parameters, it is small enough for on-device operation, which reduces latency and keeps user data local for improved privacy; small language models also sidestep the energy and complexity costs that large models incur on narrowly scoped tasks. Training required addressing the lack of large-scale, high-quality interaction data, which the team tackled with a synthetic data generation pipeline.
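The perception-action loop described above (screenshot in, action out) can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual Fara-7B interface: the `Action` type, the scripted `predict_action` stub, and `run_agent` are all hypothetical names, and a real agent would feed raw screenshot pixels to the model and dispatch OS-level mouse and keyboard events.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", "scroll", or "done"
    x: int = 0       # predicted screen coordinates for clicks
    y: int = 0
    text: str = ""   # text payload for "type" actions


def predict_action(screenshot: bytes, goal: str, step: int) -> Action:
    """Stand-in for the model: a real CUA consumes the screenshot
    and directly predicts the next action, including coordinates."""
    scripted = [
        Action("click", x=640, y=320),
        Action("type", text=goal),
        Action("done"),
    ]
    return scripted[min(step, len(scripted) - 1)]


def run_agent(goal: str, max_steps: int = 10) -> list[Action]:
    """Perception-action loop: capture screen -> predict -> execute."""
    trace = []
    for step in range(max_steps):
        screenshot = b""  # placeholder for a real screen capture
        action = predict_action(screenshot, goal, step)
        trace.append(action)
        if action.kind == "done":
            break
        # a real agent would dispatch the click/keystroke here
    return trace


trace = run_agent("search for flights")
print([a.kind for a in trace])  # → ['click', 'type', 'done']
```

The key point the sketch captures is that no accessibility tree or separate screen parser appears anywhere in the loop: the model's only input is the pixels, matching how the article says Fara-7B uses "the same modalities as humans."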
Read at IT Pro