About
WebCanvas is an innovative benchmark designed to evaluate the capabilities of web agents in navigating and completing challenges (tasks) in dynamic, real-world web environments, built with the community's effort.
Previous benchmarks such as MiniWoB++, WebShop, Mind2Web, WebArena, and GAIA have played an important role in benchmarking agents' web-navigation capabilities. Building on the experience of these works, we posit that one significant barrier to realizing the value of web agents is the establishment of online evaluation, which requires a method and platform for the community to drive efforts toward real-time data gathering and web agent benchmarking. This belief is grounded in several observations:
- Rapid evolution of Web environments.
Web agents, unlike text generation tasks that leverage built-in model knowledge, require environmental observations and dynamic feedback to function effectively. The World Wide Web serves as a vast and evolving arena for agent evaluation, marked by continuous technological advancements and shifting user expectations. These changes, driven by trends such as mobile-first design and novel front-end frameworks, highlight the necessity for human-centric benchmarks. Such benchmarks must adapt to the changing digital landscape, ensuring that the tasks they include remain relevant and reflective of real-world interactions. Each action sequence undertaken by a web agent corresponds to a specific web challenge, underscoring the need to keep the research community informed about which challenges are still valid and worth exploring.
- Offline benchmarks result in contamination.
Current large models are trained on massive, inscrutable datasets, which poses a challenge because existing benchmarks may overlap with the training data, creating a risk of data contamination. Meanwhile, models' accumulated knowledge of previously seen websites also leads to the saturation of existing benchmarks, necessitating continuously updated, real-time data for more realistic evaluation. This contamination makes it increasingly difficult to reproduce previous work and to compare new models and techniques fairly and rigorously.
- Benchmarks have artifacts.
Benchmarks can suffer from a range of issues, including annotation errors, unreasonable task definitions, and a potential disconnect from human users' needs. There is a critical need for different stakeholders within the community to communicate more effectively so that issues within benchmarks can be resolved efficiently. This is especially true for online evaluations, which involve varying network conditions and can therefore show more pronounced variability in results.
Motivated by these demands, we introduce WebCanvas, a dynamic and real-time benchmark designed for online evaluation of web agents.
Q&A
1. How is a web challenge evaluated online?
We define several key nodes for a given challenge and monitor the completion status of the agent's workflow by evaluating them. Key nodes are indispensable steps in completing a specific web task: regardless of the path taken, these steps must be performed. They may involve navigating to certain web pages or performing specific actions on a page, such as filling out a form or clicking a button. This design philosophy not only reflects the dynamic nature of the web environment but also captures the diversity of paths found in real-world web pages.
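As a rough illustration of this idea (not the actual WebCanvas evaluation code), a key node can be thought of as a predicate over the agent's trajectory: the challenge is scored by how many of these path-independent checkpoints the trajectory satisfies. The `Step` and `KeyNode` structures and the scoring rule below are assumptions made purely for the sketch.

```python
# A minimal sketch of key-node scoring; the data structures and the scoring
# rule are illustrative assumptions, not the actual WebCanvas implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    url: str        # page the agent is on after the step
    action: str     # e.g. "navigate", "click", "type"
    element: str    # identifier or label of the element acted on


@dataclass
class KeyNode:
    description: str
    matches: Callable[[Step], bool]  # does this step satisfy the key node?


def key_node_score(trajectory: List[Step], key_nodes: List[KeyNode]) -> float:
    """Fraction of key nodes satisfied by at least one step, regardless of path."""
    if not key_nodes:
        return 0.0
    hit = sum(1 for node in key_nodes
              if any(node.matches(step) for step in trajectory))
    return hit / len(key_nodes)
```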
2. Can you provide a specific example of how Key Nodes are defined?
Certainly. Let's take the challenge: "Find Dota 2 game and add all DLC to cart on Steam." The steps I would follow to complete this challenge are:
- Go to the Steam website.
- Search for Dota 2 on the Steam website.
- Click on the Dota 2 result in the search results.
- On the Dota 2 page, click the “add all DLC to cart” button.
- Finish.
In these steps, reaching the Dota 2 page and clicking the button are considered key nodes. This is because one could reach this page either by the method described above or via a Google or Bing search, making the path non-unique. Therefore, steps 1, 2, and 3 cannot be designated as key nodes.
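To make the example concrete, a key-node annotation for this challenge might look like the sketch below. The schema (the `type`/`value` fields) and the URL fragment are hypothetical and serve only to illustrate how path-independent checkpoints could be recorded.

```python
# Hypothetical annotation of the two key nodes for the Dota 2 challenge.
# The field names and URL fragment are illustrative, not the WebCanvas format.
dota2_challenge = {
    "task": "Find Dota 2 game and add all DLC to cart on Steam",
    "key_nodes": [
        {
            # Key node 1: the agent reaches the Dota 2 store page,
            # no matter which path (Steam search, Google, Bing, ...) it took.
            "type": "url_include",
            "value": "store.steampowered.com/app/570",
        },
        {
            # Key node 2: the agent clicks the button that adds all DLC to the cart.
            "type": "element_click",
            "value": "add all DLC to cart",
        },
    ],
}
```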
3. Can I create my own challenge?
Absolutely. In WebCanvas, everyone can create their own channel to define challenges and annotate key nodes for those challenges. Similarly, anyone can submit a report on any challenge within any channel, making it visible to both us and the channel's creator.
4. What's the process to create a challenge?
You can view the HowToUse page for detailed creation steps.
5. Where will this lead?
- WebCanvas is committed to providing a dynamic evaluation environment, allowing researchers and developers to test and assess the performance of Web Agents in real-world web environments. This platform enables a more accurate understanding of an agent's capabilities and limitations when dealing with complex, dynamic, and varied web information. Moreover, the challenges encountered during the design and implementation of WebCanvas will inspire new research directions and advance the forefront of agent technology.
- WebCanvas aims to establish an open community, encouraging researchers, developers, and industry experts from diverse backgrounds to participate and share data and technologies. This community will not only accelerate the exchange of knowledge and the iteration of technologies but also foster the creation of innovative solutions, thereby advancing relevant industries and scientific research. We believe that this platform will better bridge the gap between academia and industry, forming effective synergy and complementarity.
6. Will this actually work?
We won't know until we try. We hope our work will promote the development of web agents within the community.
7. Can I create a challenge only for myself?
Of course, you can create a private channel so that only you can access it.
8. What are the actual benefits of using this platform?
Through WebCanvas, users can optimize their agents in a real-world web environment. This not only helps them better understand how their agents perform in dynamic environments, but also promotes the practical application and continuous improvement of AI technology. Furthermore, WebCanvas offers an end-to-end detection mechanism that promptly identifies failed workflows, ensuring stable operation in a changing environment. You can even annotate and export your own training dataset to optimize your agents' performance on specific benchmarks (coming soon), and share it with ease for the community's reference.
9. How does WebCanvas ensure the fairness and consistency of evaluations?
- Data Transparency: We adhere to the principle of open data, ensuring that all key nodes used for evaluation are fully transparent.
- Durability of Key Nodes: Although web pages may change, some key nodes are less likely to fail in the short term. Additionally, we regularly monitor and update these nodes to address potential changes in web pages.
- Anomaly Detection: Our system monitors the submitted evaluation data to promptly detect any anomalies, such as unusually high accuracy rates.
10. How will you deal with the data?
You can view the TermOfUse for details.
11. Who is on the team?
Everyone is a member of the team, and the website is operated by iMean AI. If you have any good suggestions, please feel free to contact us at email.
12. How can I help?
If you have any suggestions, you can contact us at email. We hope the entire community can work together to build this platform even better.
Related Works
Our work was inspired by the following related works:
- Mind2Web: Deng X, Gu Y, Zheng B, Chen S, Stevens S, Wang B, Sun H, Su Y. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems. 2024 Feb 13;36.
- WebArena: Zhou S, Xu FF, Zhu H, Zhou X, Lo R, Sridhar A, Cheng X, Bisk Y, Fried D, Alon U, Neubig G. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854. 2023 Jul 25.
- Dynabench: Kiela D, Bartolo M, Nie Y, Kaushik D, Geiger A, Wu Z, Vidgen B, Prasad G, Singh A, Ringshia P, Ma Z. Dynabench: Rethinking benchmarking in NLP. arXiv preprint arXiv:2104.14337. 2021 Apr 7.
- Dynaboard: Ma Z, Ethayarajh K, Thrush T, Jain S, Wu L, Jia R, Potts C, Williams A, Kiela D. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems. 2021 Dec 6;34:10351-67.
- WebShop: Yao S, Chen H, Yang J, Narasimhan K. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems. 2022 Dec 6;35:20744-57.
- MiniWoB++: Liu EZ, Guu K, Pasupat P, Shi T, Liang P. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802. 2018 Feb 24.
- GAIA: Mialon G, Fourrier C, Swift C, Wolf T, LeCun Y, Scialom T. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983. 2023 Nov 21.