Benchmarking Web Agents in Live Environments

Join fellow AI enthusiasts and seasoned professionals in building a collaborative community dedicated to benchmarking AI agent tasks. Together, we aim to raise standards and push the boundaries of AGI.

Target Communities
Researchers
Join our vibrant community of forward thinkers to share your data and models for AI agents, particularly web agents, and play a pivotal role in accelerating progress toward Artificial General Intelligence.
Developers
Empower your AI agents and autonomous workflows, whether for niche business environments or broader scenarios, by connecting your testing data to our community to enhance realism, optimize performance, and drive commercial outcomes.
End Users
Share your feedback and express your needs for automating intricate workflows, inspiring researchers and developers to deepen their commitment and innovate solutions that resonate with your everyday challenges.
End-to-End System
Live Environment Evaluation
We evaluate agents in live web environments by checking the completion of each task's key nodes. An end-to-end system continuously monitors the accessibility and reliability of each workflow, verifying both action execution and reward signals at optimized cost.
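For illustration, here is a minimal Python sketch of how key-node scoring might work: each task defines a set of intermediate milestones, and the agent's score is the fraction it completes during a live session. The `KeyNode` and `score_task` names and the page-state representation are assumptions for this sketch, not the actual WebCanvas API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KeyNode:
    description: str
    # Predicate over an observed page state (URL, DOM snapshot, etc.)
    # that verifies whether this milestone has been reached.
    is_completed: Callable[[dict], bool]

def score_task(key_nodes: list[KeyNode], page_states: list[dict]) -> float:
    """Return the fraction of key nodes completed at any point in the trace."""
    if not key_nodes:
        return 0.0
    completed = sum(
        any(node.is_completed(state) for state in page_states)
        for node in key_nodes
    )
    return completed / len(key_nodes)

# Example: a two-milestone flight-search task.
nodes = [
    KeyNode("reached results page", lambda s: "/results" in s.get("url", "")),
    KeyNode("selected a flight", lambda s: s.get("selected_flight") is not None),
]
trace = [{"url": "https://example.com/results"},
         {"url": "https://example.com/results", "selected_flight": "UA100"}]
print(score_task(nodes, trace))  # 1.0
```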
Web Plugin and Platform
Easily Build and Test
We provide a user-friendly web plugin and platform that simplifies the creation, testing, and sharing of custom challenges. The tool lets users construct new datasets by capturing web-based human action sequences, storing the data, and annotating it with reward signals for evaluating task completion.
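As a rough illustration of the kind of data the plugin captures, here is a hypothetical record pairing a human action sequence with reward-signal annotations. All field names are assumptions for this sketch, not the plugin's actual schema.

```python
import json

# Hypothetical recording: a raw action trace plus reward annotations
# used later to verify task completion during evaluation.
recording = {
    "task": "Find the cheapest flight from SFO to JFK",
    "actions": [
        {"type": "click", "selector": "#from-airport", "timestamp": 0.0},
        {"type": "type",  "selector": "#from-airport", "text": "SFO", "timestamp": 1.2},
        {"type": "click", "selector": "#search",       "timestamp": 3.5},
    ],
    "reward_signals": [
        {"after_action": 2, "check": "url_contains", "value": "/results"},
    ],
}

print(json.dumps(recording, indent=2))
```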
Comprehensive Metrics
Intuitive Visualization of the Dataset
We believe clearer data visualization improves comprehension and adoption of datasets and benchmarks. We use screenshots and recordings to depict workflow sequences, and we share in-depth analytics with the community, such as task difficulty ratings based on collective capability, along with comprehensive metrics like task completion, efficiency, and overall scores for comparing agents.
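A minimal sketch of how such aggregate metrics could be computed from per-task results follows; the efficiency definition and the 70/30 weighting are illustrative assumptions, not WebCanvas's actual formulas.

```python
def completion_rate(results: list[dict]) -> float:
    # Fraction of tasks the agent completed.
    return sum(r["completed"] for r in results) / len(results)

def efficiency(results: list[dict]) -> float:
    # Reference (human) steps vs. agent steps, capped at 1.0,
    # averaged over completed tasks only.
    done = [r for r in results if r["completed"]]
    if not done:
        return 0.0
    return sum(min(1.0, r["reference_steps"] / r["agent_steps"]) for r in done) / len(done)

def overall_score(results: list[dict], w_completion: float = 0.7) -> float:
    # Weighted blend of completion and efficiency (weights assumed).
    return w_completion * completion_rate(results) + (1 - w_completion) * efficiency(results)

results = [
    {"completed": True,  "reference_steps": 5, "agent_steps": 8},
    {"completed": False, "reference_steps": 4, "agent_steps": 12},
]
print(overall_score(results))  # 0.5375
```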
Community Involvement
Contribute to the Dataset as a Community
We welcome all forms of community involvement, from submitting datasets and challenges to reporting bugs related to task or reward validation. Additionally, we are open to suggestions for new reward functions, wider data adoption, enhancements to our recording tools, and any other constructive feedback that benefits the community.
What can you do with WebCanvas?
Evaluate and elevate your AI agents in live environments
Get a comprehensive evaluation of your AI agent models and frameworks on a variety of challenges, whether general-purpose or specific to a narrow subdomain, and contribute your findings and insights to the community.
Create cutting-edge challenges
Advance AGI and its applications by creating your own agent challenges that address critical issues such as reasoning, reward design, safety, and robustness across different scenarios. Innovate and craft trials that stretch AI capabilities to their fullest.
Continuous Monitoring of Live Workflow
Create your dataset and workflow to receive alerts about changes or issues in the live action sequence, ensuring the ongoing reliability and integrity of your web-based services or software.
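A hypothetical monitoring loop along these lines: periodically replay a recorded workflow against the live site and raise an alert when a step no longer succeeds. `replay_step` and `send_alert` are placeholder stubs for real browser automation and notification integrations, not part of WebCanvas itself.

```python
import time

def replay_step(step: dict) -> bool:
    # Placeholder: in practice, drive a real browser (e.g. via Playwright)
    # and verify the expected post-condition of the recorded action.
    return True

def send_alert(message: str) -> None:
    # Placeholder for a real notification channel (email, Slack webhook, ...).
    print(f"ALERT: {message}")

def monitor(workflow: list[dict], interval_s: float = 3600) -> None:
    """Replay the workflow on a schedule and alert on the first failing step."""
    while True:
        for i, step in enumerate(workflow):
            if not replay_step(step):
                send_alert(f"Step {i} ({step.get('type')}) failed; "
                           "the live page may have changed.")
                break
        time.sleep(interval_s)
```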
Connect with peers and experts
WebCanvas provides a distinctive platform for networking with peers and specialists in the AI field, placing you at the cutting edge of AI innovation. Enhance your expertise and stay ahead in the industry by participating in the WebCanvas community.