Introduction

Bianca is a (fictional) developer who wants to use an LLM to help her generate a script to download some of the latest LLMs from HuggingFace, a popular model hosting service. She has her LLM coder of choice generate a simple script that uses the huggingface-cli package to run some command-line interface (CLI) commands to download a model. There’s just one problem: there is no legitimate huggingface-cli package on the Python Package Index (PyPI). However, after security researchers observed that LLMs frequently hallucinated this specific package name, one researcher deliberately created and published a huggingface-cli package to demonstrate how the hallucination could be exploited for typosquatting-style attacks. Now this stranger’s code is running on Bianca’s personal computer, perhaps downloading models, but also doing whatever else the new package owner has decided to add.
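To make the failure mode concrete, here is a sketch of the kind of script involved (a reconstruction, not Bianca’s actual output; the model id is a placeholder). The suggestion confuses the huggingface-cli command, which the real huggingface_hub package provides, with a standalone PyPI package of that name:

```python
import subprocess

# The install step the model suggests: "huggingface-cli" looks plausible because a
# CLI tool with that name exists, but the real PyPI package that provides it is
# "huggingface_hub". The standalone "huggingface-cli" name was unclaimed until a
# security researcher registered it to prove the point.
subprocess.run(["pip", "install", "huggingface-cli"], check=True)

# The download step works the same once *some* package supplies the command,
# which is exactly why the mistake is easy to miss. The model id is a placeholder.
subprocess.run(["huggingface-cli", "download", "example-org/example-model"], check=True)
```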

Luckily for our fictional developer (and numerous other developers and big companies), this package maintainer is a security researcher and not someone with more nefarious motivations. But this is not the first, and certainly not the last, time an LLM coder will include a security flaw in the code it generates. LLMs can now do a lot of programming for us, much of which is quite good. However, they have also been trained on our coding mistakes, typos, and misconfigurations. These flaws will continue to show up in “vibe coded” artifacts (code generated by LLMs) and cause chaos, especially as we continue to trust LLMs with more of the tasks we don’t fully understand ourselves.

This post explores how AI-generated code introduces unique security challenges that traditional security testing tools weren’t designed to catch, and why we need new approaches to secure our increasingly AI-assisted development workflows.

The Increasing Threat Landscape

Code Analysis Limitations

Present-day Static Application Security Testing (SAST) and Dynamic Application Security Testing (DAST) tools are quite good; however, they were designed around typical human behaviors and have yet to catch up to the variations that LLMs add to the mix. Software Composition Analysis (SCA) tools face similar challenges in this new landscape.

SAST Limitations with AI-Generated Code

Some code is perfectly fine in alternative scenarios but should not run on a production server connected to the Internet; an HTTP-only development server is a classic example. An LLM can’t reason through which specific attributes should apply to a developer’s particular environment. SAST tools may not flag these patterns since they’re acceptable in development but dangerous in production, a context distinction that LLMs often miss.
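As a hedged illustration (assuming a Flask app, which is my choice of example rather than anything from a specific incident), this is the kind of configuration an LLM will happily generate: convenient on a laptop, dangerous on an Internet-facing production host.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello!"

if __name__ == "__main__":
    # Convenient for local development, but dangerous on a production host:
    # debug=True exposes the interactive debugger (effectively remote code
    # execution if reachable), host="0.0.0.0" listens on every interface, and
    # Flask's built-in server speaks plain HTTP with no TLS.
    app.run(host="0.0.0.0", port=5000, debug=True)
```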

While SAST tools can be configured to help catch some of the more context-dependent issues, they are generally designed to catch the common errors that humans make. LLMs can generate insecure patterns that aren’t as common, whether due to old training data from a time when today’s insecure code was industry best practice, hallucinated dependencies, or configurations that are functional but not locked down.
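For instance (a hypothetical sketch of the “old training data” problem, not an example from a real codebase), password-hashing advice has shifted over the years, and a model trained on older tutorials may still reach for a fast, unsalted hash:

```python
import hashlib
import os

# Pattern common in older tutorials the model may have trained on: a fast,
# unsalted hash, trivially brute-forced and long deprecated for passwords.
def hash_password_outdated(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()

# A more current approach: a slow, salted key-derivation function from the
# standard library (dedicated libraries such as argon2 or bcrypt are also common).
def hash_password_current(password: str) -> bytes:
    salt = os.urandom(16)
    return salt + hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
```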

SCA tools often miss AI-suggested dependencies and recommended packages that don’t exist or that contain unexpected functionality. We’ll explore how to enhance SCA and other security tools for AI-generated code in a future post.
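As a rough stopgap until then (a sketch of one possible precheck, not a replacement for SCA), a script can ask PyPI’s public JSON API whether a suggested package name exists at all before anything gets installed:

```python
import sys
import urllib.error
import urllib.request

def package_exists_on_pypi(name: str) -> bool:
    """Return True if PyPI's public JSON API knows about this package name."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        # PyPI returns 404 for names it has never seen.
        return False

if __name__ == "__main__":
    for name in sys.argv[1:]:
        print(f"{name}: {'found' if package_exists_on_pypi(name) else 'NOT FOUND'}")
```

Existence alone is not sufficient, of course: the huggingface-cli package from the opening story exists today precisely because someone registered it, so a check like this only filters out names the index has never heard of.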

The Quest for Comprehensive Dynamic Analysis

Dynamic analysis is much less reliant on these specific scenarios and works largely the same on vibe-coded applications as on human-coded ones. Issues can still arise when there is not 100% code coverage, which is commonly the case, and the problem is exacerbated when non-deterministic, potentially unusual execution paths exist.
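To make that concrete (a contrived sketch, not drawn from a real application), a scanner crawling the hypothetical endpoint below will almost never send the magic query parameter, so the dangerous branch stays unexecuted and unreported:

```python
import subprocess

from flask import Flask, request

app = Flask(__name__)

@app.route("/report")
def report():
    # The ordinary path, which a crawler or test suite will exercise.
    if request.args.get("mode") != "legacy-export":
        return "report generated"
    # A branch a scanner is unlikely to stumble into: the "filter" parameter
    # flows straight into a shell command (classic command injection).
    filter_arg = request.args.get("filter", "")
    result = subprocess.run(
        f"grep {filter_arg} /var/log/app.log",
        shell=True, capture_output=True, text=True,
    )
    return result.stdout
```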

LLMs will generally optimize the code they produce to pass tests, which commonly leads them to pass only those exact tests, and not in the way we humans would expect. For example, a model might add authentication only to the pages that are explicitly checked by test code or DAST tooling, while neglecting other pages that humans would know need it.
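Here is a hedged sketch of how that plays out (the routes and decorator are invented for illustration): if the test suite only asserts that /admin requires a login, a test-driven model can “pass” while leaving a sibling route wide open.

```python
import functools

from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "change-me"  # placeholder for illustration only

def login_required(view):
    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        if not session.get("user"):
            abort(401)
        return view(*args, **kwargs)
    return wrapper

@app.route("/admin")
@login_required              # the test suite checks this page, so it gets protected
def admin_dashboard():
    return "admin dashboard"

@app.route("/admin/export")  # never referenced by a test, so it never got the decorator
def admin_export():
    return "full customer data export"
```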

Access Control Misunderstandings

Other security “flaws” can show up when the application is perfectly secure, but it doesn’t function the way the user expects it to. For example, I was using Replit to test building a personal web page with some public and private elements, and I asked the agent to make certain capabilities accessible only to me. While the agent did put up an impressive OAuth login portal in front of those features, I found that anyone with a Replit account could log in and access the private part of the website. If I hadn’t tested that with a second account, I might have assumed the app was “secure”, and technically it would pass all SAST or DAST tests, but there was a glaring flaw in the security logic I thought I was implementing.
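In code terms, the gap is authentication versus authorization. A minimal sketch of what was missing (a reconstruction for illustration, not Replit’s actual implementation, which I never saw):

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    is_authenticated: bool

OWNER_ID = "owner@example.com"  # hypothetical identifier for the site owner

# What the agent effectively built: any successfully authenticated account gets in.
def can_view_private_page(user: User) -> bool:
    return user.is_authenticated

# What was actually asked for: authenticated AND specifically the owner (authorization).
def can_view_private_page_fixed(user: User) -> bool:
    return user.is_authenticated and user.id == OWNER_ID
```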

Conclusion

LLMs are famously eager to please the human giving them instructions. If removing existing security features helps accomplish the task they were given more effectively, that’s a trade-off that’s hard for them to refuse. So even if the code was 100% secure at one point, a single LLM request can very quickly change that.

In the meantime, consider implementing additional manual security reviews for AI-generated code, especially around dependency management and authentication logic. Verify that any AI-suggested packages actually exist in official repositories before installation, and test authentication flows with multiple user accounts rather than just the primary use case (a small example of that habit is sketched below). We’ll explore comprehensive solutions to these challenges, as well as practical examples of how coding-capable LLMs can be attacked, in upcoming posts.
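As one lightweight example of that second-account habit, here is a pytest-style sketch; the client fixture, login_as helper, account names, and /private route are all placeholders for your project’s own test setup, not real APIs:

```python
# A pytest-style sketch: "client" is assumed to be the app's test client fixture
# and "login_as" a project-specific login helper; both are placeholders.
def test_private_page_rejects_other_accounts(client):
    login_as(client, "second-account@example.com")  # any valid, non-owner account
    assert client.get("/private").status_code in (401, 403)

def test_private_page_allows_owner(client):
    login_as(client, "owner@example.com")
    assert client.get("/private").status_code == 200
```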