TestingCatalog
Reporting AI nonsense. A future news media, driven by virtual assistants 🤖
Your personal OpenClaw AI, fully managed in the cloud, ready to use instantly, 24/7.
MyClaw launches managed, always-on OpenClaw agent
MyClaw is a managed cloud host for OpenClaw (also known as Clawdbot) that gives you a dedicated, private instance running 24/7. Instead of keeping your laptop on, fighting Docker errors, or babysitting a VPS, you just log in, and your OpenClaw is already online, ready to work.

SPONSORED: Get your always-on myClaw agent 🦞 Test it

The whole value proposition is removing operational friction. MyClaw spins up an instance configured for you, then handles the background work people hate dealing with: updates, security, and scaling. Its site also leans hard into uptime: no downtime, no restarts, and a setup flow that is basically “choose a plan, we provision it, you start using it.” Plans are straightforward and map directly to compute. Across tiers, MyClaw says every instance runs in its own secure, isolated container, with encrypted access and daily backups.

OpenClaw itself is positioned as an AI that takes real actions across your digital life. On MyClaw’s homepage, the feature list spans workflow automation (email, reminders, scheduling), code and dev tools (tests, refactors, repo management), browser control (forms, scraping, monitoring), file and system management, smart home control (Home Assistant), and app and API integrations across tools like Slack, Discord, GitHub, and databases.

The practical difference with managed hosting is that long-running workflows and memory actually become useful. You can keep tasks running on schedules, keep context in place, and reach the same assistant from anywhere without treating setup and maintenance like a side quest. MyClaw also highlights a large pool of community use cases, so you are not starting from zero when it comes to what people build with OpenClaw.
www.testingcatalog.com
February 18, 2026 at 12:07 AM
What's new? Cursor offers a long-running agents preview to Ultra, Teams and Enterprise users; it uses a custom harness with multiple models and a planning phase for extended tasks;
Cursor launches long-running agents for Ultra+ users
Cursor has announced that its long-running agents research preview is now available to all Ultra, Teams, and Enterprise users, following internal and external testing. The feature is designed for developers and engineering teams seeking to automate complex, multi-hour or multi-day software tasks with less human oversight. Availability is currently limited to higher-tier customers, with public rollout details yet to be disclosed.

> Long-running agents are now available at https://t.co/3PT8c7azU3 for Ultra, Teams, and Enterprise plans.
>
> With our new harness, agents can complete much larger tasks. https://t.co/7p57WeR04t pic.twitter.com/pGePEFRPTT
>
> — Cursor (@cursor_ai) February 12, 2026

The long-running agents leverage a custom-built harness that allows autonomous software agents to handle extended, intricate projects such as:

1. Building integrated chat platforms
2. Refactoring authentication systems
3. Porting applications across platforms

Unlike previous agent iterations that struggled with long-horizon tasks, this system introduces a planning phase that requires user approval before execution, reducing errors caused by misalignment. The agents also use multiple models that verify each other's work, enabling the creation of large, production-ready pull requests with minimal manual follow-up.

Cursor, the company behind this release, has focused its recent R&D on agent autonomy and reliability, aiming to move toward fully self-driving codebases. The company’s adoption of a flexible harness enables integration with various AI models, tailoring agent behavior to each task's needs. Early feedback from users and industry engineers indicates substantial productivity gains, with some projects completed in a fraction of the originally estimated timeframes and codebases benefiting from deeper test coverage and better handling of edge cases.

Source
www.testingcatalog.com
February 17, 2026 at 10:03 PM
What's new? Anthropic launched Claude Sonnet 4.6 with a 1M token context, refined instruction following and long-context reasoning; available across all Claude platforms;
Anthropic releases Claude Sonnet 4.6 with 1M context for all users
Anthropic has launched Claude Sonnet 4.6, targeting developers, enterprises, and knowledge workers who rely on advanced AI for coding, document processing, and complex computer-based tasks. The model is immediately available across all Claude plans, including Free, Pro, and enterprise offerings, and is accessible through Claude.ai, Claude Cowork, Claude Code, the Claude API, and major cloud platforms. Pricing remains unchanged from the previous version, ensuring access for both individual and organizational users.

> This is Claude Sonnet 4.6: our most capable Sonnet model yet.
>
> It’s a full upgrade across coding, computer use, long-context reasoning, agent planning, knowledge work, and design.
>
> It also features a 1M token context window in beta. pic.twitter.com/TDId3XUSRs
>
> — Claude (@claudeai) February 17, 2026

Claude Sonnet 4.6 features a 1M token context window (in beta), improved consistency, better instruction following, and marked gains in long-context reasoning and agentic planning. It excels at navigating real-world software, such as spreadsheets and web forms, without needing special APIs. This release closes the performance gap with Anthropic's more expensive Opus models, offering near-Opus-level intelligence and outperforming Sonnet 4.5 on benchmarks such as OSWorld-Verified and Vending-Bench Arena.

Early customer feedback highlights major improvements in code modifications, document comprehension, and frontend design. Notably, the model is less prone to overengineering and hallucinations, and it demonstrates strong reliability on multi-step, branched tasks. Safety evaluations indicate robust resistance to prompt injection attacks.

Anthropic developed Sonnet 4.6 as a successor to Sonnet 4.5, building on its October 2024 introduction of general-purpose computer use in AI. The company continues to advance model capabilities while maintaining a focus on safety, aiming to expand the practical uses of AI for businesses and technical users globally.

Source
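Since the model is available through the Claude API at unchanged pricing, a minimal request sketch looks like the following. The claude-sonnet-4-6 identifier is an assumption based on Anthropic's naming pattern rather than a confirmed value, and enabling the 1M-token beta context presumably requires an additional beta header that is not guessed at here.

```bash
# Minimal sketch: call the new Sonnet model over the Anthropic Messages API.
# The model ID below is an assumption based on the announced name, not a confirmed identifier.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [
      {"role": "user", "content": "Summarize the main changes in this release in three bullets."}
    ]
  }'
```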
www.testingcatalog.com
February 17, 2026 at 9:55 PM
Microsoft is testing a unified Tasks feature for Copilot that combines agentic tools with scheduled prompts for advanced research and analysis.
Microsoft tests Researcher and Analyst agents in Copilot Tasks
Microsoft appears to be developing a unified “Tasks” feature for Copilot that would consolidate several of its existing agentic capabilities into a single, streamlined interface. Found through analysis of recent Copilot builds, the feature sits in a drop-down menu alongside Projects, another upcoming addition that has recently become functional in internal builds.

Tasks would offer two entry points: a freeform “New Task” option and a “Scheduled Task” option supporting one-time, daily, weekly, or monthly execution of prompts. What makes this particularly interesting is the mode selector, which offers three options: Auto, Researcher, and Analyst. Microsoft already ships Researcher and Analyst as standalone agents in Microsoft 365 Copilot, which it made generally available in mid-2025. Researcher leverages OpenAI’s deep research model for multi-step web and work-data investigations, while Analyst uses the o3-mini reasoning model for advanced data analysis with live Python execution.

The new “Auto” mode appears to be a general-purpose agent that can run complex tasks end-to-end, combining browser control capabilities previously available through Copilot Actions with deep research. This consolidation mirrors the agentic direction competitors like OpenAI have taken with ChatGPT, but adds scheduling on top, which could be a meaningful differentiator for productivity-focused users. The suggested prompts range from generating presentations and summarizing emails to booking hotels and writing formal letters, pointing to broad utility across personal and professional use cases.

TestingCatalog tested some of these capabilities and found the output quality for slides and web-based reports to be notably high, representing a substantial upgrade for Copilot subscribers. Within the broader Microsoft ecosystem, Tasks could eventually extend across Windows and Edge, enabling complex automated workflows at the operating system level. No official release date has been announced, and some elements, such as prompt imagery, still appear unfinished, suggesting the feature remains weeks or more away from a public launch.

Microsoft has been steadily pushing Copilot toward autonomous, agent-like behavior throughout 2025 and into 2026, and Tasks appears to be the next logical step in that trajectory, giving subscribers a single place to launch and automate sophisticated AI-driven workflows.
www.testingcatalog.com
February 16, 2026 at 8:44 PM
Google Stitch introduces Hatter, a new agent aiming to handle multi-step design tasks, plus new App Store asset generation and native MCP export.
Google tests Hatter agent and App Store tools in Stitch update
Google’s AI design tool Stitch, which launched at Google I/O 2025 as a rebranded version of Galileo AI, appears to be preparing another addition to its growing agent roster. Just days after publicly rolling out the Ideate agent on February 11, a new option called “**Hatter**” has surfaced in the latest builds of the platform.

Described as an agent that can “create high-quality designs,” Hatter sits alongside existing options like Flash Agent, Pro Agent, and the recently added Ideate in the mode selector. Its exact purpose remains unclear: when triggered, it currently produces results similar to the standard design flow, but its labelling as an “Agent” rather than a model suggests Google may be positioning it to handle more complex, multi-step design tasks. This could tie into the previously spotted “Deep Design” system, a design-focused counterpart to Deep Think that would apply deeper reasoning to UI generation.

Beyond Hatter, two other features have appeared in development builds:

1. **App Store Asset Generation:** This feature would allow users designing mobile applications to automatically produce a set of screenshots with descriptions and an app icon. It serves as a practical shortcut for anyone prototyping apps who needs store-ready visuals without switching tools.
2. **Native MCP Integration:** Built directly into Stitch’s export menu, this replaces earlier third-party export options like Lovable. The built-in MCP setup would generate an API key and allow connections from tools like Cursor, Claude, Claude Code, and Gemini CLI. While community-built MCP bridges for Stitch already exist on GitHub, a first-party implementation would lower the barrier considerably, especially for developers who want to pull Stitch designs directly into their coding environments.

Google has been steadily expanding Stitch’s capabilities since launch, adding Figma export from all agents, the Ideate agent for early-stage exploration, and now what appears to be a pipeline toward deeper AI-driven design reasoning. Designers, indie developers, and product teams prototyping mobile apps stand to benefit most from these additions if they ship publicly.
www.testingcatalog.com
February 16, 2026 at 5:42 PM
Microsoft is testing a Health tab in Copilot with connectors for Fitbit, Garmin, Oura, and Apple Health, centralizing health data in one space.
Microsoft Copilot prepares Copilot Health with new health connectors
Microsoft appears to be expanding its consumer-facing Copilot for Health feature with a dedicated Health tab in the sidebar, sitting alongside existing sections like Library and Shopping. The tab, found through analysis of unreleased interface elements, would serve as a centralized hub for health-related conversations, offering prompt suggestions for symptom review, doctor search, and general wellness discussions. What makes this update stand out is the addition of wearable and medical data connectors, including **Fitbit**, **Garmin**, **Oura**, **Apple Health**, and a general **Health Records** option, all designed to feed personalized context into Copilot’s responses.

Microsoft already launched Copilot for Health as part of its Fall 2025 release, partnering with institutions like Harvard to source credible medical information. The company has noted that roughly 40 percent of Copilot users ask health-related questions each week, which explains the push to give health its own dedicated space. The upcoming tab would go further by letting users link medical records so Copilot can reference diagnoses, medications, and lab results. A privacy banner is also part of the design, making it clear that health conversations are kept separate from other chats, with users retaining control to disconnect services and delete data at any time.

This move places Microsoft directly alongside competitors. OpenAI has already introduced health connectors in ChatGPT, Anthropic has done the same with Claude, and Perplexity is reportedly building a similar solution. The Apple Health connector appearing in the interface but remaining unavailable on the web suggests this feature will require the mobile app, which aligns with Apple’s restriction that Health data can only be accessed natively on device.

Availability will likely follow a familiar pattern. Microsoft’s health features have historically launched in the United States first, and this expansion would probably do the same. Users outside the US may face delays before they can connect their wearable data or medical records. For anyone already using Fitbit, Garmin, or Oura devices, this could turn Copilot into a surprisingly useful health companion that goes well beyond simple web searches.
www.testingcatalog.com
February 16, 2026 at 2:42 PM
xAI's Grok Build is evolving into a full IDE with multi-agent coding, arena mode, dictation, browser-style tabs, and GitHub integration.
xAI tests Arena Mode with Parallel Agents for Grok Build
xAI’s Grok Build, the company’s vibe coding solution first teased by TestingCatalog in early January, is shaping up to be far more ambitious than initially expected. While the local CLI agent was already known, new findings reveal that the remote version is progressing in parallel and will arrive with a suite of features that push Grok Build closer to a full-fledged IDE rather than a simple coding assistant.

The most notable addition is Parallel Agents, a feature that lets users send a single prompt to multiple AI agents simultaneously. The interface exposes two models, **Grok Code 1 Fast** and **Grok 4 Fast**, and allows up to four agents per model, meaning users could run eight agents at once. Once triggered, a dedicated coding session opens where all agent responses are visible side by side, alongside a context usage tracker. This multi-agent approach aligns directly with Elon Musk’s stated vision of Grok spawning “hundreds of specialized coding agents all working together.”

Separately from parallel agents, there are traces of an arena mode buried in the code. Unlike the parallel view, which simply displays multiple outputs for the user to compare manually, this mode appears designed to have agents collaborate or compete to surface the best response, potentially scoring and ranking outputs automatically. This closely mirrors the tournament-style framework already present in Google’s Gemini Enterprise, where an Idea Generation agent ranks results through a structured competition process. If implemented, arena mode would mean xAI is not just letting users see multiple responses but actively building an evaluation layer on top of its multi-agent system.

Beyond agents, the UI is getting a substantial overhaul. Dictation support leans into the vibe coding philosophy. A new set of navigation tabs, Edits, Files, Plans, Search, and Web Page, transforms the interface into something resembling a browser-based IDE, with live code previews and codebase navigation. A Share button and a Comments feature round out the collaboration story. On the integrations side, a GitHub app connection is now visible in settings, though it remains nonfunctional.

> Grok 4.20 is projected to arrive next week according to Elon. Yet it seems like there is no expectation for it to push SOTA benchmarks up.
>
> “Grok 4.20 is finally out next week. Will be a significant improvement over 4.1.” https://t.co/3z47LnEfuW pic.twitter.com/jtp1hYEWKT
>
> — TestingCatalog News 🗞 (@testingcatalog) February 15, 2026

There’s also a hidden internal Grok page called “Vibe” serving as a model override tool for xAI staff. With Grok 4.20 training reportedly delayed to mid-February due to infrastructure issues, the timeline for these features remains uncertain, but the groundwork is clearly being laid.
www.testingcatalog.com
February 15, 2026 at 8:57 PM
Anthropic rolls out new Claude Code slash commands, SSH tunnel support, tool access modes, and teases a possible Sonnet model update.
Anthropic brings slash commands and SSH support to Claude Code
Anthropic has been shipping a steady stream of updates across its Claude ecosystem, with several changes already live and a handful still in development. In Claude Code, slash commands (reusable prompt snippets that can be invoked from any session) are now available with predefined options like debug, release notes, and PR comments. These are particularly useful for developers who repeat similar tasks across projects.

Anthropic, which has been rapidly expanding Claude’s agentic capabilities since the launch of Cowork in January 2026 and the plugin system shortly after, appears to be bringing slash commands into Cowork as well. They would appear as a dedicated tab within bundles, sitting alongside MCP connectors and skills. The company may also make it possible to create such commands directly from the command line or save any prompt as a reusable slash command, which would benefit power users who want to standardize workflows.

On the Claude Code side, two additional features have surfaced: SSH tunnel support for connecting to remote environments, which addresses a common pain point for developers working across machines, and a new tool access configuration within connectors where users can set tools to on-demand, always-available, or automatic mode. Both would give developers finer control over how Claude Code operates in different environments.

Two features that remain clearly in development are a hold-to-record voice mode for Cowork with built-in input device selection, which would simplify voice-driven workflows on the desktop app, and custom instructions for Cowork that would apply globally across all tasks. The latter would be a welcome addition for teams wanting consistent behavior without repeating context each session, though it remains unclear whether Anthropic will ship it broadly.

Separately, there are strong indications pointing to an imminent release of a new Sonnet model. Code references hinting at the next version have appeared, and historically such additions surface roughly five days before a public launch. As TestingCatalog previously reported, early testing of a Sonnet 5 build showed competitive math performance with frontier models and stronger coding output than Opus 4.5 in certain workflows. Whether the release lands as Sonnet 5 or Sonnet 4.6 remains to be seen.
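For context on how custom slash commands work today, project-level commands in Claude Code are plain Markdown prompt files stored in the repository. The sketch below defines a hypothetical /release-notes command in that style; the file name and prompt body are illustrative assumptions, not Anthropic's shipped presets.

```bash
# Sketch: add a custom /release-notes slash command to a project for Claude Code.
# Project commands are Markdown files under .claude/commands/; this file name and
# prompt text are illustrative assumptions, not Anthropic's predefined commands.
mkdir -p .claude/commands
cat > .claude/commands/release-notes.md <<'EOF'
Summarize the changes between the two most recent git tags as user-facing release notes.
Group entries under Features, Fixes, and Breaking changes, keeping each bullet to one line.
EOF
# Inside a Claude Code session, the new command is then invoked as /release-notes.
```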
www.testingcatalog.com
February 15, 2026 at 2:04 PM
What's new? MiniMax launched MiniMax-M2.5 for coding, agentic automation and office tasks; it comes in M2.5 and M2.5-Lightning versions via API and MiniMax Agent.
ICYMI: MiniMax debuts MiniMax-M2.5 model on web and APIs
MiniMax has announced the launch of its latest language model, MiniMax-M2.5, targeting developers, enterprises, and professionals who require advanced coding, agentic automation, and office productivity capabilities. The model is publicly available through the MiniMax platform, with two versions, M2.5 and M2.5-Lightning, offering throughput options of 50 and 100 tokens per second, respectively. The model can be accessed via API and is integrated into the MiniMax Agent product.

MiniMax-M2.5 builds on substantial advancements over previous versions, particularly in coding proficiency, multilingual programming, and agentic task automation. It supports over 10 programming languages and is trained using reinforcement learning across hundreds of thousands of real-world environments. The model demonstrates state-of-the-art results on industry benchmarks like SWE-Bench Verified and BrowseComp, and provides full-stack development support from system design to code review. In office scenarios, M2.5 leverages standardized Office Skills for tasks in Word, PowerPoint, and Excel, with domain-specific customization available. Its cost structure undercuts leading competitors, making continuous operation much more economical.

💡 Try MiniMax M2.5 on MiniMax Agent

MiniMax, the company behind M2.5, leverages its proprietary RL framework, Forge, and advanced infrastructure to rapidly iterate its model series. The firm has involved senior professionals from finance, law, and social sciences in training data curation, ensuring the outputs meet real-world business requirements. Early internal adoption has seen widespread use across company operations, with M2.5-generated code comprising the majority of new software commits.

Source
www.testingcatalog.com
February 15, 2026 at 9:50 AM
Manus AI launches “Agents” across platforms, enabling users to create personal agents with persistent memory and Telegram integration for easy access.
Manus AI launched 24/7 Agent via Telegram and got suspended
UPD: Shortly after the launch, Telegram suspended the new Manus AI always-on agent account. Neither Telegram nor Meta has shared a public statement on the situation yet.

> Looks like WhatsApp will be the only way for Manus AI to expand.
>
> I bet we will see it very very soon. It deserves a war room to be on fire already. pic.twitter.com/1L8pWZu5JG
>
> — TestingCatalog News 🗞 (@testingcatalog) February 14, 2026

Manus AI has introduced “Agents” across its web app and mobile clients, positioning it as a way to build a personal agent with a distinct identity, persistent memory, a dedicated computer instance, and support for custom skills. The onboarding flow also highlights messenger availability, showing Telegram, WhatsApp, Facebook Messenger, and Line, but at launch, Telegram appears to be the only option that actually works.

In the current setup, selecting Telegram prompts users to link Manus with their Telegram account. After a short connection step, Manus creates a dedicated Telegram chat that acts as the agent’s entry point, while the same conversation remains accessible inside Manus on the web and in the native apps. This effectively turns Telegram into an always-available front door for the agent, which can matter for users who live inside messaging apps and want a single place to trigger tasks or continue long-running threads.

The move looks aimed at lowering the friction that has held back “always-on” 24/7 proactive agent stacks, similar to OpenClaw. Instead of asking users to install and configure multiple components, Manus is pushing a near one-click path:

1. Connect Telegram
2. Add tools and connectors
3. Install skills
4. Operate the agent from a familiar chat surface

That approach could appeal to teams and power users, but also to mainstream subscribers who want results without setup overhead. A practical constraint is cost. Manus usage is credit-based, and agent-style workflows can burn credits quickly because they encourage longer, more frequent sessions and background-style tasking. If Manus wants this to convert new users, pricing and credit transparency will likely matter as much as the feature set.

This also lands amid broader competitive signals. Meta has been spotted testing OpenClaw integrations in Meta AI, and large model vendors still do not offer a comparable, consumer-first messenger-based agent experience at scale. If Manus can expand beyond Telegram and keep unit economics workable, Agents could become a differentiator in a category where demand is visible, but the mainstream product shape is still forming.
www.testingcatalog.com
February 14, 2026 at 1:38 PM
Z.ai has launched GLM-5, a flagship open-weight LLM designed for complex systems engineering, full project builds, and multi-step tool workflows.
Z.AI launched GLM-5, new open-source model on chat and APIs
Z.ai has released GLM-5 on February 11, 2026, positioning it as a flagship open-weight model built for complex systems engineering and long-horizon agent work. The rollout targets developers and teams that need an LLM to plan, execute, and iterate across large codebases and multi-step tool workflows, not just generate snippets.

GLM-5 is framed as a shift from “vibe coding” toward agentic engineering, where the model is expected to handle end-to-end project construction, refactors, deep debugging, and longer task chains with tighter goal consistency. It also keeps a large context window for sustained work across many files, specs, and intermediate artifacts.

> GLM-5 from @Zai_org just climbed to #1 among open models in Text Arena!
> ▫️#1 open model on par with claude-sonnet-4.5 & gpt-5.1-high
> ▫️#11 overall; scoring 1452, +11pts over GLM-4.7
>
> Test it out in the Code Arena and keep voting, we’ll see how GLM-5 performs for agentic coding… https://t.co/GEwxRiz2wq pic.twitter.com/MajenrS0Qz
>
> — Arena.ai (@arena) February 11, 2026

On the technical side, GLM-5 scales up from the prior generation with a mixture-of-experts design at roughly 744B total parameters and 40B active parameters, and it increases pre-training data from 23T to 28.5T tokens. It also integrates DeepSeek Sparse Attention to reduce serving cost while keeping long-context capacity, and it uses an asynchronous reinforcement-learning setup (slime) to raise post-training throughput for more frequent iteration.

Benchmark disclosures put GLM-5 at the top tier among open-weight models for reasoning, coding, and tool-based tasks, with results described as approaching Claude Opus 4.5 on software engineering workloads. Reported scores include:

1. 77.8 on SWE-bench Verified
2. 56.2 on Terminal-Bench 2.0

These sit alongside strong results on web retrieval and multi-tool planning benchmarks such as BrowseComp and MCP-Atlas.

Availability is broad: weights are published publicly with a permissive license, and the model is also offered through Z.ai’s chat and API stack. Deployment guidance is already oriented to production inference via common serving frameworks like vLLM and SGLang, with support called out for running inference on domestically produced accelerators including Huawei Ascend, alongside additional local silicon options named in the company’s rollout messaging.

Z.ai, the company behind the GLM family, has been iterating rapidly on coding-first and agent-first releases, with GLM-4.7 arriving in late 2025 and earlier GLM-4.5 and multimodal variants forming the base of its current platform lineup. GLM-5 is the clearest signal yet that the company wants its open-weight flagship to compete in real software delivery settings, where long context, tool calling, structured outputs, and sustained execution matter as much as raw benchmark performance.

Source
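Since deployment guidance points to serving frameworks like vLLM and SGLang, here is a minimal self-hosting sketch using vLLM's OpenAI-compatible server. The zai-org/GLM-5 repository ID, the parallelism setting, and the context length are illustrative assumptions rather than values from the release notes; a mixture-of-experts model of this size needs a multi-GPU node sized to the released weights.

```bash
# Sketch: serve an open-weight GLM-5 checkpoint with vLLM's OpenAI-compatible server.
# The Hugging Face repo ID, tensor-parallel size, and max length are illustrative
# assumptions, not values taken from the release notes.
vllm serve zai-org/GLM-5 \
  --tensor-parallel-size 8 \
  --max-model-len 131072

# Query the local endpoint once the server is up.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5",
    "messages": [{"role": "user", "content": "Plan a refactor of a legacy auth module into testable services."}]
  }'
```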
www.testingcatalog.com
February 13, 2026 at 11:44 PM
OpenAI introduces GPT-5.3-Codex-Spark, a real-time coding model in Codex built for rapid code iteration, now available to ChatGPT Pro users.
OpenAI debuts Codex-Spark powered by Cerebras infra
OpenAI has rolled out GPT-5.3-Codex-Spark on February 12, 2026, positioning it as its first model built specifically for real-time coding inside Codex. It is a smaller sibling of GPT-5.3-Codex, tuned for near-instant code edits and rapid iteration, and served on ultra-low-latency hardware that can generate more than 1,000 tokens per second. The rollout targets developers who want tight feedback loops: making narrow changes, reshaping logic, refining UI, and immediately seeing results without waiting on long runs.

By default, Codex-Spark keeps its working style lightweight, focusing on minimal, targeted edits and not running tests unless explicitly requested. At launch, the model is text-only with a 128k context window, and it is governed by separate rate limits that do not count toward standard limits. OpenAI says users may see temporary queuing during peak demand as capacity ramps.

> GPT-5.3-Codex-Spark is now in research preview.
>
> You can just build things—faster. pic.twitter.com/85LzDOgcQj
>
> — OpenAI (@OpenAI) February 12, 2026

Availability starts with ChatGPT Pro users via the latest Codex app, CLI, and VS Code extension, with API access limited to a small set of design partners testing product integrations. OpenAI frames this as an early-access step while it hardens the end-to-end experience and expands datacenter capacity, with broader access planned over the coming weeks.

Under the hood, OpenAI says “model speed” was only part of the problem, so it also reworked the full request-response pipeline. Changes include a persistent WebSocket path enabled by default for Codex-Spark, plus optimizations that cut per-roundtrip overhead by 80%, per-token overhead by 30%, and time-to-first-token by 50%. OpenAI says this lower-latency path will become the default for other models soon.

The hardware story is the headline: Codex-Spark runs on Cerebras Wafer-Scale Engine 3 as a latency-first serving tier, marking the first milestone in the OpenAI–Cerebras partnership announced in January. Cerebras leadership describes the preview as a way to discover new usage patterns unlocked by fast inference, while OpenAI’s compute team highlights wafer-scale inference as an added capability alongside its GPU fleet for latency-sensitive workflows.

OpenAI also emphasizes safety posture: Codex-Spark inherits the same safety training as its mainline models, including cyber-relevant training, and was evaluated through its standard deployment process. OpenAI says it does not expect Codex-Spark to plausibly reach its Preparedness Framework threshold for high capability in cybersecurity or biology.

Strategically, Codex-Spark is meant to complement GPT-5.3-Codex’s longer-horizon “work for hours” mode. OpenAI’s direction is a two-mode Codex: fast, real-time collaboration for rapid iteration, and longer-horizon reasoning and execution when deeper work is needed, with a roadmap toward blending both modes in one workflow.

Source
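Since availability starts with the Codex app, CLI, and VS Code extension, trying the model from the terminal would look roughly like the sketch below. The gpt-5.3-codex-spark slug is an assumption based on the announced name, not a documented identifier, so treat the whole invocation as illustrative.

```bash
# Sketch: start a Codex CLI session pointed at the new low-latency model.
# The model slug is an assumption based on the announced name, not a confirmed identifier.
codex --model gpt-5.3-codex-spark "Tighten the spacing in the settings panel and show me the diff."
```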
www.testingcatalog.com
February 13, 2026 at 11:39 PM
What's new? Google updated Gemini 3 Deep Think, a reasoning mode in the Gemini app and via API for select users; the mode tops benchmarks in math, physics, chemistry and engineering;
Google opens Gemini 3 Deep Think API in Early Access
Google has launched a significant update to Gemini 3 Deep Think, a specialized reasoning mode integrated into the Gemini app and now accessible via the Gemini API for select researchers, engineers, and enterprises. This release targets scientists, academic researchers, engineers, and enterprise users who tackle complex, open-ended problems with incomplete data. Public availability begins immediately for Google AI Ultra subscribers, while early access to the API is being offered to a limited group through an interest form.

> We’ve upgraded our specialized reasoning mode Gemini 3 Deep Think to help solve modern science, research, and engineering challenges – pushing the frontier of intelligence. 🧠
>
> Watch how the Wang Lab at Duke University is using it to design new semiconductor materials. 🧵 pic.twitter.com/BgSEmv00JP
>
> — Google DeepMind (@GoogleDeepMind) February 12, 2026

Gemini 3 Deep Think stands out for its technical capabilities, excelling at advanced reasoning tasks across mathematics, physics, chemistry, and engineering. It has achieved leading benchmarks, scoring 48.4% on Humanity’s Last Exam, 84.6% on ARC-AGI-2, and an Elo of 3455 on Codeforces. Expert users have demonstrated its ability to spot overlooked logical errors in research, optimize materials science processes, and translate sketches into 3D-printable objects. Compared to previous versions, this release extends reach to more scientific disciplines and supports new workflows through API integration. Early testers from universities and Google’s own R&D have reported successful real-world applications, such as identifying flaws in academic papers and designing advanced manufacturing recipes.

Google leads this release, leveraging its experience in AI-driven research tools. This update aligns Gemini 3 Deep Think with broader company goals of supporting scientific discovery and practical engineering, furthering Google’s role in pushing the boundaries of computational reasoning and applied AI.

Source
www.testingcatalog.com
February 13, 2026 at 11:35 PM
Cline CLI 2.0 brings coding agents into the terminal with interactive and autonomous modes, ACP support, and free Kimi K2.5 and MiniMax M2.5 access.
Cline drops CLI 2.0 coding agent, powered by K2.5 and M2.5 for free
Cline has released Cline CLI 2.0, pushing its coding agent beyond the IDE and directly into the terminal for both hands-on sessions and fully autonomous runs. The rollout includes a limited-time free trial powered by Moonshot AI’s Kimi K2.5 and MiniMax M2.5, positioning the CLI as an entry point for developers who want an agent that can plan, execute, and iterate without leaving the command line.

The 2.0 release rebuilds the terminal experience to mirror the agent loop Cline users know from editors. In interactive mode, the CLI behaves like a full terminal UI with real-time task planning and execution, a Plan/Act toggle via Tab, auto-approve via Shift+Tab, file mentions with `@` for workspace context, and slash commands like `/settings`, `/models`, and `/history` for quick navigation. Sessions end with summaries that surface what changed, what ran, and how many tokens were consumed.

> Introducing Cline CLI 2.0: An open-source AI coding agent that runs entirely in your terminal.
>
> Parallel agents, headless CI/CD pipelines, ACP support for any editor, and a completely redesigned developer experience. Minimax M2.5 and Kimi K2.5 are free to use for a limited time.… pic.twitter.com/R5TBKC71DZ
>
> — Cline (@cline) February 13, 2026

For automation, Cline CLI 2.0 adds a headless path designed for pipelines and scripting. Developers can run with `-y` (YOLO mode) to auto-approve actions and stream results to stdout, pipe data in and out to chain multi-step flows, and emit structured output via `--json` for parsing. The workflow docs also call out timeout controls and environment variables like `CLINE_DIR` for configuration isolation and `CLINE_COMMAND_PERMISSIONS` for restricting what shell commands the agent is allowed to execute (sketched at the end of this article).

Cline is also betting on editor portability. CLI 2.0 can run as an Agent Client Protocol (ACP) server using `cline --acp`, allowing the same agent to plug into JetBrains IDEs, Neovim, and Zed, alongside other ACP-compatible editors. That ACP route is positioned as a way to keep access to Cline capabilities such as Skills, Hooks, and MCP integrations across environments, while letting each editor provide its own native tooling and context.

💡 Test Cline CLI 2.0 on Windows, macOS and Linux

Installation is handled through npm (`npm install -g cline`) with Node.js 20+ required and Node 22 recommended. Authentication is routed through `cline auth`, with options that include signing in via a Cline account, using a ChatGPT subscription through OpenAI Codex OAuth, importing existing credentials from Codex CLI or OpenCode, or bringing a direct API key. Provider support spans major hosted model APIs as well as local model runtimes, reflecting Cline’s pitch that developers should not be locked to a single vendor.

Cline’s broader project is built around an open-source agent that operates with explicit permission for file edits and command execution, and can be extended through the Model Context Protocol. Cline CLI 2.0 is framed as the next step in turning the terminal into the control plane for agentic development: long-running work, parallel sessions, and automation-first usage, while still keeping an interactive path for developers who want review points before the agent acts.
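Here is a minimal sketch of that headless path, built only from the flags and environment variables named above. The task wording and the CLINE_DIR path are placeholders, and the exact way a prompt is passed to a headless run is an assumption, so check the CLI's own help output for the real form.

```bash
# Sketch of a headless Cline CLI 2.0 run using the flags described above.
# The prompt text and CLINE_DIR value are placeholders; how the task is passed
# on the command line is an assumption, so verify against `cline --help`.

npm install -g cline            # requires Node.js 20+, Node 22 recommended
cline auth                      # Cline account, Codex OAuth, imported credentials, or API key

export CLINE_DIR="$HOME/.cline-ci"   # keep this pipeline's configuration isolated

# Auto-approve actions (-y) and emit structured output (--json) for downstream parsing.
cline -y --json "Run the test suite and summarize any failures" > cline-report.json
```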
www.testingcatalog.com
February 13, 2026 at 4:50 PM
MemOS releases its OpenClaw Plugin, offering a shared memory layer for OpenClaw teams to reduce token costs and maintain consistent agent context.
MemOS OpenClaw Plugin to cut agent memory costs by 70%
MemOS has shipped its OpenClaw Plugin, and it is now live as a drop-in memory layer for teams building with OpenClaw. The promise is blunt: keep long-term context without blowing up token bills, while keeping agent personalization consistent across longer projects.

SPONSORED: Explore the MemOS OpenClaw Plugin to let multiple AI agents operate your memory. Check Github

According to MemOS benchmarks, the plugin can cut token usage by roughly 60 to 70 percent versus native OpenClaw memory flows, by shifting what gets stored and recalled into a dedicated memory layer instead of repeatedly reloading huge context windows. That matters most when agents run daily, handle multi-step tasks, or sit inside paid products where every extra token is a real cost.

Multi-agent collaboration is having a moment. Whether it's AutoGen, CrewAI, or the recently viral OpenClaw, everyone's exploring how to get multiple agents working together. But there's a catch: each agent carries its own isolated "brain," with no idea what the others are doing. The result? Duplicated work, mismatched context, and information handoff via manual copy-paste.

> 🧠 From context stacking → system memory
> Memory is no longer shoved into prompts.
> It’s structured, schedulable state.
>
> 📉 72%+ fewer tokens, 60% fewer model recalls
> No more today+yesterday+everything injection.
> Only task-relevant recall, on demand.
>
> 🎯 +33% accuracy on LOCOMO…
>
> — MemOS (@MemOS_dev) February 10, 2026

The MemOS Plugin addresses exactly this. It enables multiple OpenClaw agents to share the same memory pool: instead of each agent maintaining isolated memory, the entire team writes to and reads from a unified space. What Agent A produces, Agent B can directly access, without you shuttling information back and forth. This ensures that collaboration does not collapse into duplicated work or mismatched context.

MemOS visualisation 👀

The intended audience is clear: B2B agent builders, dev-tools teams, internal copilots, and anyone shipping agent workflows where memory becomes the bottleneck for cost and consistency. Availability is straightforward: the plugin is distributed via GitHub and is meant to plug into OpenClaw wherever you run it. This lands as memory tooling becomes a battleground for agent stacks, alongside products like mem0, supermemory, and memU, with MemOS pushing the angle that memory should be treated as its own OS layer rather than a bolt-on prompt trick.

MemOS is the project behind the plugin, positioned as a “memory OS” for AI apps and agents, with its own site, dashboard, and a broader open source footprint under the MemTensor org. This plugin is the latest move in that direction: push memory into a reusable layer that can be shared, persisted, and reused across agents and sessions, so long-running workflows do not keep paying the same context tax over and over.
www.testingcatalog.com
February 12, 2026 at 2:01 PM
Google NotebookLM is testing a visual style selector for infographics, offering 10 distinct style options for tailored presentations.
Google adds 10 customizable infographic styles to NotebookLM
Google has been updating NotebookLM with more flexible tools for generating infographics. In the most recent builds, a significant update was discovered: users will soon be able to choose from a new visual style selector for infographics, with a total of ten distinct options. These options include an auto-selection mode and nine specific styles: sketch, kawaii, professional, anime, 3D clay, editorial, storyboard, bento grid, and bricks.

Each style presents a unique visual approach, enabling users to adapt the appearance of infographics to better match their intended audience or platform. For example, sketch and kawaii styles offer a more playful presentation suitable for informal channels or younger audiences, while professional, editorial, and bento grid styles are designed for more structured use cases such as LinkedIn or internal presentations. The inclusion of anime and 3D clay options allows for even more creative flexibility, appealing to content creators looking for distinctive visuals.

Initial access to these styles reveals that all of them are functional and offer considerable customization. The ability to fine-tune infographic visuals according to personal or brand preferences could help NotebookLM expand its relevance for professionals and educators who rely on polished or stylized visuals to communicate information. The auto-selection mode provides a default experience, but users wanting more control can quickly switch between modes as needed.

Anime style example

This update is in line with Google’s strategy to position NotebookLM as a versatile tool for both productivity and creative work, leveraging AI to simplify content production while still allowing for a degree of personalization. Google continues to iterate on NotebookLM’s AI-driven features to make it useful across different industries and content workflows, and the addition of customizable infographic styles fits into this vision. Given that the feature is already working in current builds, there is a reasonable expectation that it could become generally available soon, although the precise timeline remains unknown.
www.testingcatalog.com
February 11, 2026 at 8:53 PM
Anthropic is testing a Tasks feature in Claude’s mobile apps, bringing Cowork-style automation, repeatable actions, and possible browser tasks soon.
Anthropic prepares Claude Tasks on mobile for browser automation
Anthropic appears to be preparing “Tasks” inside Claude’s mobile apps. In a recent iOS build, new UI traces point to a Tasks entry in the app menu and a dedicated Tasks page where users could create new items, suggesting the feature is moving beyond desktop and into the phone-first workflow many Claude users rely on.

What’s visible so far looks closely aligned with the existing Claude Cowork interface: similar naming, iconography, and an emphasis on setting up repeatable actions rather than one-off prompts. If this ships as implied, it would effectively bring Cowork-style automation to iOS, and likely Android next, letting users set up structured jobs from the same place they already chat.

> Anthropic is working on Tasks mode for Claude mobile apps.
>
> Mobile Cowork is coming 👀 pic.twitter.com/lDkQzpZ9fs
>
> — TestingCatalog News 🗞 (@testingcatalog) February 9, 2026

The strings also hint at broader capabilities attached to Tasks, including the ability for Claude to operate a browser as part of execution. On mobile, that would imply a workflow where a task can open pages, gather information, and complete steps in sequence, without the user manually driving every tap.

Timing remains unclear. Anthropic has been expanding Claude’s “agentic” surface area quickly across platforms, and a mobile rollout would be consistent with turning Cowork into a cross-device capability rather than a desktop-only feature. If this lands soon, the most likely beneficiaries are power users and teams who already use Claude for recurring operational work, along with creators and professionals who want lightweight automation from a phone. It also sets up a platform race dynamic with other agent-style products on mobile, including the still-anticipated iOS arrival of Comet, where “who ships first” will shape mindshare even if the long-term capability sets converge.
www.testingcatalog.com
February 11, 2026 at 4:00 PM
OpenAI updated Deep Research in ChatGPT with GPT-5.2 and is working on a new Skills section for ChatGPT to install and edit skills.
OpenAI works on ChatGPT Skills, upgrades Deep Research
OpenAI has introduced a revamped Deep Research experience in ChatGPT, transitioning it from a "run it and wait" flow to a more interactive guided research session. Users can now constrain Deep Research to specific websites, incorporate context from connected apps, and intervene during the process to add requirements or redirect the work. The output has also been enhanced, with reports designed for review in a dedicated full-screen view, making long, citation-heavy writeups less cramped when skimming sections or checking sources.

> Deep research in ChatGPT is now powered by GPT-5.2.
>
> Rolling out starting today with more improvements. pic.twitter.com/LdgoWlucuE
>
> — OpenAI (@OpenAI) February 10, 2026

This update is particularly beneficial for individuals engaged in recurring source-based work, such as analysts, founders, journalists, marketers, and researchers who prioritize reproducibility and scope control. Website-limited research addresses the "too broad" issue when users already know which domains they trust. Connectors are useful when the missing piece is within the user's workflow, such as email, calendars, documents, or other internal contexts that the model would not otherwise access. The ability to interrupt mid-run is crucial for iterative tasks, allowing users to pivot the report without restarting from scratch when the first batch of sources reveals a better angle.

Behind the scenes, OpenAI is aligning this feature with its latest flagship model line by moving the Deep Research backend to GPT-5.2. This aligns with OpenAI’s current product strategy, which emphasizes agent-like workflows that integrate browsing, synthesis, and tool access, rather than treating the chatbot as a single-shot answer box. Concurrently, there is growing anticipation around the arrival of GPT-5.3, following the recent release of GPT-5.3-Codex on the coding side. However, it remains unclear when a general ChatGPT-facing GPT-5.3 will be available and whether it will immediately replace GPT-5.2 in Deep Research.

> Woah! ChatGPT will add support for importing skills to your library
>
> I just had it create a skill for me that I could use in Codex and got this popup in the chat pic.twitter.com/8AEUfKgjcD
>
> — Max Weinbach (@mweinbach) February 10, 2026

In addition to the Deep Research upgrades, there are indications that ChatGPT may be preparing to introduce a first-party "Skills" layer. This would involve installable, editable workflow instructions that shape how the assistant behaves for specific tasks. The concept is reminiscent of agent frameworks and development tools, where a skill packages a repeatable procedure, constraints, and expected outputs, allowing the model to execute a known playbook instead of reinventing the approach each time.

If OpenAI integrates skill management directly into ChatGPT, it would provide power users and teams with a native way to standardize workflows, share internal operating procedures, and maintain consistent results across individuals and projects without the need to build a full custom agent stack. While the timing remains uncertain, this direction aligns with OpenAI’s broader move toward configurable, tool-using assistants that are more closely integrated with real work.
www.testingcatalog.com
February 11, 2026 at 3:50 PM
Perplexity is testing a Health section that may offer personalised advice, settings-based profiles, and possible Apple Health integration.
Perplexity tests Health page with Apple Health integration
Perplexity is preparing to launch a new **Health** section, expanding its domain-focused offering beyond current categories such as Finance, Travel, and Sports. The upcoming Health tab is expected to appear as a dedicated module within the main navigation, providing users with streamlined access to health-related tools and information. This approach follows the pattern established in other verticals, where users can easily switch between specialized modules via the top navigation bar.

The Health module is designed to collect user-specific details through a profile system. In this area, users will be able to specify health goals, report their activity level, list medical conditions, and enter family medical history, among other categories. The profile management will likely include an edit function, allowing users to update their information as their circumstances change. This level of customization is aimed at tailoring the responses and recommendations to the user’s specific context, which could be particularly valuable for individuals tracking ongoing health and wellness goals or those managing chronic conditions.

A notable addition being developed for Perplexity Health is the option to connect external data sources, with Apple Health integration specifically mentioned. This feature would allow users to import activity, biometrics, and potentially other health metrics directly into Perplexity’s platform, centralizing information from multiple devices or apps. Within the Health module, users will be presented with a dashboard designed to visualize this connected data, offering an at-a-glance overview of trends and statistics sourced from various inputs. This dashboard concept mirrors what is available in fitness and health tracking apps, potentially increasing the value for those who already rely on Apple Health or similar services.

Initial indications suggest that the Health module **might launch first for users in the United States**, which is a common practice for new features that involve regulatory or privacy considerations tied to health data. The actual release timeline is still unconfirmed, but the presence of settings screens, profile categories, and integration touchpoints suggests the feature is well into development. Once available, the new Health tab will likely appeal to users looking to consolidate their health information and receive more context-aware answers or recommendations within the Perplexity platform.
www.testingcatalog.com
February 11, 2026 at 2:58 PM
What's new? Agent Swarm coordinates 100 sub-agents to execute 1500 tool calls at 4.5x single-agent speeds; it is offered on Kimi's platform as a research preview;
Kimi launches Agent Swarm AI for parallel research and analysis
Kimi has unveiled Agent Swarm, a self-organizing AI system that goes beyond the traditional single-agent approach. Rather than relying on one model to process tasks sequentially, Agent Swarm creates an internal organization, autonomously assembling and managing up to 100 specialized sub-agents in parallel for research, analysis, or content generation. This allows it to execute over 1,500 tool calls and deliver results at speeds up to 4.5 times faster than single-agent systems. The feature is currently offered as an early research preview, with continued development planned to enable direct communication between sub-agents and dynamic control over task division.

> Kimi Agent Swarm blog is here 🐝 https://t.co/XjPeoRVNxG
>
> Kimi can spawn a team of specialists to:
>
> - Scale output: multi-file generation (Word, Excel, PDFs, slides)
> - Scale research: parallel analysis of news from 2000–2025
> - Scale creativity: a book in 20 writing styles… pic.twitter.com/ElTzf3ksQe
>
> — Kimi.ai (@Kimi_Moonshot) February 10, 2026

Agent Swarm is designed for users with demanding workloads: researchers, analysts, writers, and professionals needing large-scale data gathering, document synthesis, or complex problem-solving from multiple perspectives. The system operates on Kimi’s platform, accessible to users through their web interface, and is not limited to a specific geographic region. Users can instruct the system to form expert teams for broad research, generate lengthy academic reports, or analyze problems from conflicting viewpoints, all without manual intervention.

Kimi, the company behind Agent Swarm, has focused on pushing the boundaries of AI utility by addressing the bottlenecks of single-agent reasoning and vertical scaling. Their approach with Agent Swarm marks a shift toward horizontal scaling, enabling many agents to collaborate and self-organize, positioning Kimi as a pioneer in the practical deployment of multi-agent AI architectures.

Source
www.testingcatalog.com
February 10, 2026 at 10:39 PM
What's new? Telegram updated its Android, iOS and iPad apps with a redesigned look, bottom bar, media viewer and shortcut; gift crafting, group transfer and bot button color options added;
Telegram revamps app with new interface and craftable gifts
Telegram has rolled out a major update for its Android app, introducing a fully redesigned interface. This update brings a new bottom bar for swift navigation between chats, settings, and profiles, making it easier for users to access core features. The development team has rebuilt the interface code to maximize efficiency and responsiveness, while users can control interface effects via Power Saving settings to extend battery life. For iOS users, the update introduces a revamped media viewer, improved sticker and emoji pack previews, and streamlined context menus. iPad users benefit from a new keyboard shortcut for sending messages.

A key addition is the crafting system for collectible gifts, allowing users to combine up to four gifts to create higher-tier items with rare attributes and unique visuals. The crafting process uses probability mechanics, where the inclusion of similar attributes increases the likelihood of those traits appearing in the final result. All users can access this feature and participate in buying or selling collectible gifts through Telegram’s Gift Marketplace.

The update also improves group management by enabling group owners to:

1. Instantly assign a new owner when leaving.
2. Have ownership automatically transfer to an admin after a week.

Bot developers now have the option to customize buttons with colors and emojis for clearer user actions. Telegram, known for its privacy features and large-scale group chats, continues to target both casual users and power users who value customization, security, and feature depth. This release is publicly available across all supported platforms and reflects Telegram’s ongoing efforts to refine usability and expand its digital marketplace offerings.

Source
www.testingcatalog.com
February 10, 2026 at 10:36 PM
OpenAI is testing sponsored placements in ChatGPT for U.S. users on Free and Go tiers, with privacy rules, user controls, and clear ad labeling.
OpenAI tests sponsored ads in ChatGPT for free US users
OpenAI has started testing sponsored placements inside ChatGPT for logged-in adult users in the U.S., limited to the Free and Go tiers. Plus, Pro, Business, Enterprise, and Education users will not see ads, positioning the rollout as a funding lever aimed at keeping lower-cost access viable while preserving trust in the assistant for personal and work tasks.

> We’re starting to roll out a test for ads in ChatGPT today to a subset of free and Go users in the U.S.
>
> Ads do not influence ChatGPT’s answers. Ads are labeled as sponsored and visually separate from the response.
>
> Our goal is to give everyone access to ChatGPT for free with… pic.twitter.com/S9BV24uJLb
>
> — OpenAI (@OpenAI) February 9, 2026

The company says ads do not change ChatGPT’s answers and will appear as clearly labeled sponsored units that are visually separated from the organic response. During the test, ad selection is based on matching advertiser submissions to what you are discussing, plus signals like your past chats and prior ad activity, with the first slot going to the most relevant available advertiser.

OpenAI frames privacy as a hard boundary: advertisers do not get access to chats, chat history, memories, or personal details, and only receive aggregated performance data such as views and clicks. Safeguards include not showing ads for accounts where OpenAI is told, or predicts, the user is under 18, and blocking ads near sensitive or regulated topics such as health, mental health, or politics.

Users get controls to dismiss ads, provide feedback, see why an ad is shown, delete ad data, and manage personalization. If you do not want ads, OpenAI points to upgrading tiers, or opting out on Free in exchange for fewer daily free messages.

Source
www.testingcatalog.com
February 10, 2026 at 12:54 PM
What’s new? Composer 1.5 uses 20x more RL steps and a new thinking-tokens system for code reasoning; it applies self-summarization to manage long context lengths;
Cursor launches Composer 1.5 with upgrades for complex tasks
Composer 1.5, the latest agentic coding model from the team at Cursor, introduces several updates that set it apart from its predecessor, Composer 1. The model targets software developers, coding professionals, and organizations seeking automated code generation and reasoning tools. Composer 1.5 is now available for public use, with pricing information accessible in Cursor’s official documentation. > Composer 1.5 is now available. > > We’ve found it to strike a strong balance between intelligence and speed. pic.twitter.com/jK92KCL5ku > > — Cursor (@cursor_ai) February 9, 2026 This release features a substantial increase in reinforcement learning scale, with the model trained on 20 times more RL steps than its predecessor. Technical upgrades include: 1. Improved handling of complex coding tasks. 2. A new system for generating 'thinking tokens' that lets the model plan and reason through problems. 3. An advanced self-summarization capability, allowing Composer 1.5 to manage longer context lengths by recursively summarizing its own process to maintain accuracy even when memory becomes constrained (a rough sketch of this general idea follows below). Compared to previous versions, Composer 1.5 demonstrates sharper performance, especially on difficult or multi-step coding challenges. Cursor, the company behind Composer, has focused on applying reinforcement learning at scale to coding models, aiming for continuous and predictable gains in problem-solving ability. The company positions Composer 1.5 as a daily-use tool, balancing quick responses for simple tasks with deeper reasoning for more challenging code issues. Early user feedback in developer forums has noted improvements in both speed and the ability to tackle more intricate programming scenarios. Source
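Cursor has not shared implementation details for the self-summarization step, but the general technique of recursively compressing older context is easy to sketch. Below is a minimal Python illustration under that assumption; summarize() is a stand-in for a real model call, and the folding strategy is hypothetical rather than Cursor’s actual approach.

```python
def summarize(text: str) -> str:
    # Stand-in for a model call that compresses text; a real agent would ask
    # the model for a summary instead of truncating.
    return text[: max(1, len(text) // 4)] + " …[summarized]"

def compact_context(messages: list[str], max_chars: int) -> list[str]:
    """Keep recent messages verbatim and recursively fold the oldest entries
    into a running summary whenever the total size exceeds the budget."""
    while sum(len(m) for m in messages) > max_chars and len(messages) > 1:
        # Fold the two oldest entries (one may already be a summary) into a
        # single, shorter entry, then re-check the budget.
        folded = summarize(messages[0] + "\n" + messages[1])
        messages = [folded] + messages[2:]
    return messages

# Example: a long transcript gets squeezed until it fits a 2,000-character budget.
history = [f"step {i}: " + "details " * 40 for i in range(20)]
print(len(compact_context(history, max_chars=2000)))
```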
www.testingcatalog.com
February 10, 2026 at 12:30 AM
Meta AI is testing Avocado models, MCP integrations, and Manus browser agent support, with scheduled tasks and OpenClaw compatibility launching soon.
Meta AI readies Avocado, Manus Agent and OpenClaw integration
Meta AI is reportedly preparing to release new models named **Avocado**. Recently, Meta refreshed its website and shipped an app update where users have seen a new effort selector to choose between Fast and Thinking modes. On the web, some users have also spotted a new widget prompting them to connect apps like Gmail and Google Calendar. Notably, Microsoft Outlook and Outlook Calendar appear as options too. This looks similar to connectors in other apps: integrations that would let Meta AI pull information and operate tools via MCPs. If that reading is correct, it would mean MCP support is finally coming to Meta AI. A **Memory** section has been added to the settings menu as well, and Meta AI users should already be able to see and test all these features.

What’s also notable is that Meta seems to have revamped, or possibly rebuilt, the website, and the new build appears to include a lot of additional functionality. First, as we know, Meta acquired Manus AI recently, and there are mentions of a Manus AI agent and a browser agent being in the works. That suggests Manus-style agents could come directly to Meta AI. There is also a new menu in development called Tasks, where users would be able to schedule recurring runs of Meta AI, similar to scheduled prompts in other tools. Code traces also suggest they are working on voice agent support. These voice agent experiences appear to reference a previous implementation of Meta AI agents; interestingly, for testing, they were using the personality of Mark Zuckerberg. However, it also looks like the voice and browser agent features are not at a final stage of implementation yet.

Another detail tied to the Manus AI integration is that Meta AI appears to be testing top models from other labs, including Gemini, ChatGPT, and Claude. These models reportedly show up in the code and are being used internally for testing.

Now to the more interesting part: there seem to be multiple internal modes used for development and testing. Beyond Fast and Thinking, there are traces of a new Avocado model, shown in two forms: **Avocado** and **Avocado Thinking**. Only Avocado is responding currently. The responses so far are not great, but it’s unclear whether these answers are coming from an existing model via routing or from the actual new model. If it is the new model, then Meta would be in a very bad position and should not release it. It’s also unclear whether Meta is preparing to release these models around February. That seems plausible given that the revamped UI has already shipped, and the remaining step could be powering it with the new Avocado model.

On the browser agent side, a model called **Sierra** appears to represent the Manus browser agent, which makes it likely we’ll see it shipped soon, possibly at the same time as the rest of the models. Overall, Meta AI seems to be aiming to rebrand and expand the experience to close the feature gap with competitors, and browser agents could be part of that.

Another model referenced is **Big Brain**. This does not necessarily look new, since Meta previously had plans to implement something similar last year alongside Llama models. Conceptually, it resembles Grok Heavy: multiple model agents run in parallel, and the best output is selected as the response. If the upcoming Avocado model is actually good and this mode is powered by it, that could be a meaningful capability.
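If Big Brain works the way the Grok Heavy comparison implies, the core loop is simply best-of-n across parallel model calls. Here is a minimal, speculative Python sketch of that pattern; ask_model() and score() are placeholders for whatever models and judging Meta actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

def ask_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call to one parallel agent.
    return f"[{model}] draft answer to: {prompt}"

def score(answer: str) -> float:
    # Stand-in for whatever judge or reward model ranks the candidates.
    return float(len(answer))

def big_brain(prompt: str, models: list[str]) -> str:
    """Fan the prompt out to several model agents in parallel and return
    the highest-scored answer, best-of-n style."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda m: ask_model(m, prompt), models))
    return max(answers, key=score)

print(big_brain("Summarize today's AI news", ["avocado", "avocado-thinking", "sierra"]))
```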
Beyond models, there are also test placeholders for UX called RUX Playground. These are likely used to test widget responses and UI layouts, especially since Meta AI appears to be building card-like UI elements similar to other chatbots (for example, weather or stock-market cards). Meta products already support web search, and there also appears to be a shopping assistant in development. It’s not functional yet, but it’s evident that Meta AI is working on a shopping experience. That could be significant given Meta’s position across Facebook and Instagram, where people already buy and sell products.

Finally, Meta AI appears to be working on something close to an OpenClaw integration. In particular, this mode would allow you to use any model with your own API key, a bring-your-own-key experience, potentially living inside app connectors. Across the code, it’s referenced as an OpenClaw agent. That could be a big deal given Meta’s history of open source, even if they no longer plan to open-source their proprietary Avocado model. It may also indicate they are preparing something for the open-source community, such as letting people power Meta AI with their own models, or offering tighter integration with OpenClaw bots, which are currently growing quickly.

It’s still unclear whether we’ll see any Super Bowl ads from Meta today, or when exactly these new experiences and the Avocado model will ship. Still, there’s a high chance it happens very soon. Considering recent reports that Avocado performed best among current top models, it might have a strong showing. At the same time, we just got Opus 4.6 and GPT 5.3 Codex, so it’s possible that upcoming releases could overshadow what Meta has lined up. We’ll see!
www.testingcatalog.com
February 8, 2026 at 2:40 PM
Notion is testing new agent features, including a redesigned settings UI, new automation triggers, and upcoming Agents 2.0 upgrades.
Notion tests Agents 2.0 with scripting tools and Workers
Notion’s AI agents are seeing a steady stream of updates, with several changes emerging since last month across the agent setup flow and surrounding configuration. One noticeable change is UI-related: the settings layout for custom agents has moved from a full page to a side sheet. Additionally, Notion seems to be preparing a second iteration of its agents, featuring two experimental toggles labeled “Agents 2.0” and “Agents 2.0 Advanced.” Both are marked as experimental, and the wording suggests they may be linked to more compute power or stronger underlying models if and when they are rolled out. The same area also points to functional expansion through triggers and automation. The triggers section now includes a Slack option, allowing an agent to be invoked when a message is posted in a channel, hinting at a deeper Slack integration than before. In the settings, there is also a new **scripting configuration** for agents, with fields for a script name, a key, and script code. The intent seems to be enabling agents to call into “Workers” as a capability when needed, rather than being confined to chat-style actions. A related “Workers” section references an NPM package and a place to manage automations, including templates that connect external signals to Notion actions. Examples mentioned include: 1. Creating a database when a connected GitHub account stars a repository. 2. Posting Slack messages when tasks pass a deadline. 3. Wiring actions into email and calendar flows. If these features move beyond the experimental phase, they would primarily benefit teams already using Notion as an operational hub, especially small businesses and enterprise groups that want agents to react to events across Slack, GitHub, and scheduling tools directly from the agent configuration surface.
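The Workers surface references an NPM package, so real scripts would presumably be JavaScript or TypeScript; still, the trigger-to-action shape described above is easy to sketch in a language-agnostic way. The Python sketch below is purely illustrative: the Event fields, the handler registry, and the two example automations (GitHub star → create a database, overdue task → Slack message) mirror the templates mentioned in the post rather than any actual Notion API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Event:
    source: str   # e.g. "github", "slack", "calendar"
    kind: str     # e.g. "repo_starred", "task_overdue"
    payload: dict

# Hypothetical registry: (source, kind) -> handler, mirroring the
# "external signal -> Notion action" templates described above.
HANDLERS: dict[tuple[str, str], Callable[[Event], None]] = {}

def on(source: str, kind: str):
    def register(fn: Callable[[Event], None]) -> Callable[[Event], None]:
        HANDLERS[(source, kind)] = fn
        return fn
    return register

@on("github", "repo_starred")
def create_database(event: Event) -> None:
    # A real Worker would call Notion's API here; this only logs the intent.
    print(f"create a Notion database for {event.payload.get('repo')}")

@on("slack", "task_overdue")
def post_reminder(event: Event) -> None:
    print(f"post a Slack reminder for task {event.payload.get('task_id')}")

def dispatch(event: Event) -> None:
    handler = HANDLERS.get((event.source, event.kind))
    if handler is not None:
        handler(event)

dispatch(Event("github", "repo_starred", {"repo": "example/repo"}))
```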
www.testingcatalog.com
February 8, 2026 at 10:40 AM