We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities.
The rapid development of large language models (LLMs) has opened new avenues across various fields, including cybersecurity, which faces an evolving threat landscape and demand for innovative technologies.
Large Language Models (LLMs) are being deployed across various domains today.
We investigate how interface design affects the performance of language model agents.
Ranked #3 on Bug Fixing on SWE-bench-lite
By 2028 most cybersecurity actions will be autonomous, with humans teleoperating.
Cryptography and Security
Although language model (LM) agents are demonstrating growing potential in many domains, their success in cybersecurity has been limited by simplistic designs that lack features fundamental to this domain.
As a prospective filter for the human analyst, we present an online unsupervised deep learning approach to detect anomalous network activity from system logs in real time.
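To make the idea of real-time, unsupervised anomaly scoring over system logs concrete, here is a minimal illustrative sketch. It is not the paper's deep learning model; it uses a simple streaming frequency estimate with exponential forgetting, and all class and variable names are hypothetical.

```python
from collections import defaultdict
import math

class StreamingLogAnomalyScorer:
    """Hypothetical sketch: score each log event key by how surprising it is
    under running (exponentially decayed) frequency counts. A real system
    would replace this with a learned model over parsed log features."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay                  # forgetting factor for old events
        self.counts = defaultdict(float)    # decayed per-event counts
        self.total = 0.0                    # decayed total event count

    def score(self, event_key: str) -> float:
        # Surprise = -log of the smoothed estimated probability of this event.
        p = (self.counts[event_key] + 1.0) / (self.total + 10.0)
        surprise = -math.log(p)
        # Online update: decay all counts, then record the new event.
        for k in list(self.counts):
            self.counts[k] *= self.decay
        self.total = self.total * self.decay + 1.0
        self.counts[event_key] += 1.0
        return surprise
```

Events seen frequently receive low surprise scores, while a never-before-seen event (e.g. an unusual process launch) scores high, which is the behavior a prospective filter for a human analyst needs.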
The most notable of these comes in the form of the first self-described 'AI pair programmer', GitHub Copilot, a language model trained over open-source GitHub code.
While the emulator was initially developed for cybersecurity courses, it can also be used in networking courses, helping students learn how Internet technologies such as routing, BGP, IP Anycast, and DNS work.
To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
Ranked #3 on Cybench