Why Attack Surface Management is Hard
December 18, 2020
Post by Jeremiah Grossman
Everyone agrees that attack surface management is critically important, as it is the very first step of any information security program. While enterprise interest and market traction for attack surface management are building, it’s curious why every organization doesn’t already have an up-to-date attack surface map. They should! It may sound strange, but I believe it’s because attack surface management is technologically a hard problem to solve – extremely hard. It’s harder than almost any other problem across the entire IT and IT security industry, and harder than any other problem I’ve personally worked on.
Consider that for the last 20 years, at least as long as I’ve been around, many passionate and brilliant minds have attempted different approaches and created a variety of solutions. Unfortunately, none succeeded. If they had, every organization would be able to supply a complete attack surface map for any IT or IT security tool to leverage. But we all know that’s just not the case. An attack surface map is highly valuable for vulnerability management, network segmentation, endpoint protection, incident response, event monitoring, patch management, and so on. Meanwhile, those outside the company often know important things about the company’s attack surface that those inside do not. So instead, every organization and IT security team just makes do with a hodgepodge of tools that generate incomplete and erroneous results.
Furthermore, the way the Internet and software are evolving, attack surface management is not only hard, it’s actually becoming much harder. The many challenges, which I’ll describe below, also reveal why it’s impossible for the legacy approach of on-demand scanning to solve the problem. Today, the best approach is to begin by collecting anything and everything on the entire Internet: log connections to every IP address and every DNS entry, gather every WHOIS record for every domain name on every TLD, collect records for every available TLS certificate, store all protocol headers & server banners, and save the massive volumes of HTML found along the way. Then, from this absolutely massive data lake, apply complex attribution logic to surface which organization [likely] owns what on the Internet and generate the desired attack surface.
In short, the only way to find “everything” is to first have everything — on the Internet — saved. Starting from the beginning, let’s run through some of the many challenges along the way:
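To make the “save everything” idea concrete, here is a minimal Python sketch of what one record in such a data lake might look like, with simple deduplication across repeated sightings. The field names and the `Observation` class are illustrative assumptions, not any vendor’s actual schema.

```python
from dataclasses import dataclass

# Hypothetical record shape for one observation in the data lake; the
# fields are illustrative, not an actual production schema.
@dataclass(frozen=True)
class Observation:
    ip: str
    port: int
    hostname: str = ""
    banner: str = ""       # raw protocol header / server banner
    tls_subject: str = ""  # TLS certificate subject, if any
    whois_org: str = ""    # registrant org from WHOIS, if known

def dedupe(observations):
    """Collapse repeated sightings of the same (ip, port, hostname)
    tuple, keeping the first record seen."""
    seen, unique = set(), []
    for obs in observations:
        key = (obs.ip, obs.port, obs.hostname)
        if key not in seen:
            seen.add(key)
            unique.append(obs)
    return unique
```

In practice the hard part is not the schema but the volume: billions of these records, refreshed continuously.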
1) Massive Number of Internet-Connected Devices: The number of Internet-connected devices is in the billions (~4.5B), growing incredibly fast, and not just because every new electronic device seems to connect to the Internet (e.g., the Internet of Things). You must connect to every hostname-IP-address-port combination on the Internet, on IPv6 as well as IPv4, and do so routinely and within a reasonable amount of time (i.e., 30 days or faster). Sometimes connections will be blocked, and sometimes the hosting providers will receive complaints about all the network traffic. Obviously, managing this volume of network connections requires a large, sophisticated, distributed, and purpose-built infrastructure.
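To illustrate the shape of the enumeration problem, here is a toy Python sketch: expand an IP range and a port list into connect targets, and test whether a single TCP port accepts a connection. A real system distributes this across purpose-built infrastructure; this sketch only shows why the combinatorics get so large.

```python
import ipaddress
import socket
from itertools import product

def targets(cidr, ports):
    """Expand a CIDR block and a port list into (ip, port) pairs to probe.
    Even a tiny /30 with two ports yields four targets; the real Internet
    is billions of hosts times thousands of ports."""
    hosts = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
    return list(product(hosts, ports))

def is_open(ip, port, timeout=1.0):
    """Plain TCP connect check: True if the port accepts a connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Note that full enumeration like this is only tractable for IPv4; the IPv6 space is far too large to sweep exhaustively, which is one reason passive sources (DNS, certificates) matter so much.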
2) Massive Meta-Data Collection: When connecting to an Internet-connected asset, the goal is to extract as much meta-data about it as possible. This includes which services are running, the software’s major and minor version numbers, IP geolocation, development framework, programming language, TLS certificate info, the HTML, and so on. And it’s not always easy to get an Internet-connected asset to give up meta-data about itself. Nonetheless, this meta-data is necessary both for taking action later and for determining ownership. These assets could be websites, mail servers, name servers, IoT devices, SSH servers, VPNs, RDP services, development / staging systems, databases, random security products, and who knows what else. Collectively, it’s easily hundreds of terabytes of data to process following each run.
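Two of the richest meta-data sources mentioned above are server banners and TLS certificates. Here is a minimal Python sketch of both, using only the standard library; it is a simplification, since real collectors also speak protocol-specific dialects to coax out version strings.

```python
import socket
import ssl

def grab_banner(host, port, timeout=2.0):
    """Read whatever the service volunteers on connect
    (e.g., SMTP and SSH announce themselves immediately)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(256).decode("latin-1", "replace").strip()
        except socket.timeout:
            # Many services (e.g., HTTP) wait for the client to speak first.
            return ""

def parse_cert(cert):
    """Extract the subject CN and DNS SANs from ssl.getpeercert() output."""
    cn = dict(pair for rdn in cert.get("subject", ())
              for pair in rdn).get("commonName", "")
    sans = [v for k, v in cert.get("subjectAltName", ()) if k == "DNS"]
    return {"cn": cn, "sans": sans}

def cert_summary(host, port=443, timeout=2.0):
    """Fetch and summarize a server's TLS certificate."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return parse_cert(tls.getpeercert())
```

Certificate SANs are especially valuable: a single certificate often names dozens of hostnames the scanner would otherwise never have guessed.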
3) Necessity of Third-Party Data: No single company can possibly self-collect all the data about everything on the Internet. To cover the unavoidable data gaps, it’s necessary to work with a number of third parties to gather additional DNS data, port scanning data, WHOIS records, crawl data, RBL lists, and more. Of course, identifying all these necessary data partners isn’t easy, nor are they always easy to work with, and many are costly to license data from. And finally, you’ll need to normalize their data from whatever format they provide and integrate it into a centralized location.
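The normalization step is mundane but unavoidable: every vendor delivers the same underlying fact in a different shape. A hypothetical Python sketch, where the vendor field names (`qname`, `answer`, `saddr`, `rdns`) are made up for illustration:

```python
def normalize(record, source):
    """Map a third-party record into a common {ip, hostname, source}
    shape. The per-vendor field names below are illustrative; every
    real feed has its own quirks."""
    if source == "dns_feed":
        return {
            "ip": record["answer"],
            "hostname": record["qname"].rstrip("."),  # drop trailing root dot
            "source": source,
        }
    if source == "port_scan":
        return {
            "ip": record["saddr"],
            "hostname": record.get("rdns", ""),
            "source": source,
        }
    raise ValueError(f"unknown source: {source}")
```

Keeping the provenance (`source`) on every normalized record matters later, when conflicting feeds have to be weighed against each other.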
Challenges 1, 2, and 3 demonstrate that building an attack surface management solution the right way is unquestionably a “BIG DATA” problem, and you must account for every compute issue there is: disk, memory, CPU, and bandwidth. Now, assuming an Internet’s worth of data is in hand and indexed, the next steps are about surfacing the assets belonging to a single company out of the billions.
4) Large, Distributed, and Disorganized Attack Surface: The average mid-to-large enterprise has tens or even hundreds of thousands of Internet-accessible assets, sometimes millions, strewn everywhere. Some assets are located on-premise, some in the cloud (AWS, Azure, AppEngine), some are hosted applications (Salesforce, Office 365, Google Apps, Workday, GitHub, etc.), labelled under a variety of subsidiaries & sub-brands, physically located across geographically distributed data centers, and connected through dozens of non-contiguous IP-ranges. Think about this in terms of legacy: companies, departments, and products built up and sold off, and data center and cloud migrations over years and years. It’s a mess.
5) Unknown IP-Ranges and Registered Domain Names: One might think an organization would at least have a list of all the IP-ranges that have been assigned to them, or maybe a list of all their registered domain names. But you’d be wrong. Very wrong. Most have domain names registered through multiple domain registrars, some managed by the business, and some purchased directly by an employee through a personal email account — who may have changed roles or even left the company. Many companies use a dozen or more TLS CAs. Whatever the case, the locations used to begin the search for the attack surface will always be incomplete. From incomplete seed data, it will still be necessary to find the rest of the attack surface.
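One common way to grow incomplete seed data is to pivot on shared registration details. A toy Python sketch, assuming a hypothetical in-memory table of WHOIS records keyed by domain (real attribution combines many such signals, and GDPR-redacted WHOIS makes this pivot far less reliable than it once was):

```python
def expand_seeds(seed_domains, whois_records):
    """Given incomplete seed domains and a table of WHOIS records
    (domain -> {"org": ...}), surface additional domains that share a
    registrant org with any seed. A single-signal toy pivot, not a
    production attribution engine."""
    seed_orgs = {whois_records[d]["org"]
                 for d in seed_domains if d in whois_records}
    found = set(seed_domains)
    for domain, rec in whois_records.items():
        if rec["org"] in seed_orgs:
            found.add(domain)  # likely owned by the same organization
    return sorted(found)
```

The same pivot works on registrant emails, name servers, and TLS certificate organizations; each new hit becomes a seed for the next round.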
6) Multiple Asset Owners and Inference of Ownership: Internet-connected devices may have multiple organizational owners, and at the same time there is no Internet protocol or standard that says who owns what. An asset may also have little to nothing externally visible indicating that it is owned by an organization. At the same time, if breached, it could be connected to something incredibly sensitive. For example, Target was hacked through its HVAC vendor, which granted access to the payment network.
At best, inferences can be made with varying degrees of reliability. For example, WHOIS data may provide ownership clues, but it has been far less reliable since GDPR went into effect. Data from ARIN and the other RIRs can help, as can copyright notices, logos, assets on the same IP or IP-range, keywords found in the hostname or HTML content, etc. We’ve mapped a couple of dozen indicators from which ownership can be inferred.
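One simple way to combine indicators of varying reliability is to treat each as an independent probability and score the chance that at least one is a true ownership signal. A hypothetical Python sketch; the indicator names and weights below are invented for illustration, and real systems tune dozens of them against ground truth:

```python
# Hypothetical reliability weights, one per ownership indicator.
# These numbers are illustrative, not measured values.
INDICATOR_WEIGHTS = {
    "whois_org_match": 0.5,
    "tls_cert_org_match": 0.6,
    "copyright_match": 0.3,
    "hostname_keyword": 0.2,
    "shared_ip_range": 0.25,
}

def ownership_score(indicators):
    """Naive-independence combination: 1 minus the probability that
    every observed indicator is a false signal."""
    miss = 1.0
    for name in indicators:
        miss *= 1.0 - INDICATOR_WEIGHTS.get(name, 0.0)
    return 1.0 - miss
```

The point of the sketch is the shape of the problem, not the numbers: no single indicator is trustworthy alone, so attribution is always a weighted accumulation of weak evidence.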
7) Rate of Attack Surface Change: New domain names can be registered at any time, and often are, for new product launches, marketing promotions, or just plain squatting. Internet-connected assets may be stood up and decommissioned without notice (Shadow Assets). New ports/services may be opened with little to no warning (Shadow Services). In addition, the software running on each surface may be updated as well (Shadow Software) — hopefully. Collectively, the attack surface of the average organization changes constantly. We think somewhere around 1-5% per month. And each change, each unaccounted-for and unprotected difference, is potentially an opportunity for an adversary to take advantage of. When something belonging to the company pops up somewhere randomly on the Internet, someone or something needs to find it as fast as possible.
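Detecting that change reduces to diffing successive snapshots of the mapped surface. A minimal Python sketch, assuming each asset is represented as a (hostname, port) pair:

```python
def diff_surface(previous, current):
    """Diff two snapshots of (host, port) assets.
    'new' entries are candidate shadow assets/services;
    'gone' entries were decommissioned (or went dark)."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),
        "gone": sorted(prev - curr),
    }
```

Trivial as the diff itself is, its value depends entirely on the snapshots being complete and frequent, which is exactly what challenges 1 through 6 make hard.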
These challenges, and many more, cannot be solved overnight, nor even be well understood without investing many years of dedicated research. The Bit Discovery team has been at this for a long time, which is why the industry is now finally making progress.