Crawling Is the Wrong Way to Do Attack Surface Mapping

This post is the third in a short series we have dubbed “Attack Surface Mapping the Wrong Way,” which shows the wrong ways that people, companies, and vendors attempt to do attack surface mapping. Read the first in the series here. Next up is crawling and why it is the wrong way.

Crawling alone is flawed

The next most common method of DNS enumeration is crawling. Like the old scripts that would crawl a single site looking for related subdomains, modern crawlers are set loose on the entire Internet, looking for anything they can find. This would seem to be a far better way to identify assets than brute-force DNS enumeration, and it tends to be. It also has the advantage of being useful for other things, like finding malicious software, because it sees a lot of webpages. But it is far from ideal: crawling misses a lot, for three reasons.

  1. Crawling cannot go to 100% depth: Even a company like Google, which spends millions a year on infrastructure, gets nowhere near 100% crawl depth of a company. Think about how many pages are hidden behind login pages or registration forms. For that reason alone, crawling could never come close to identifying all the linked assets. Additionally, some types of application code suffer from the “calendaring problem”: the application has effectively infinite depth, like a calendar, and cannot be perfectly enumerated. The crawler must therefore either have a hard-coded maximum depth or have some logic to detect that it has hit something with infinite depth. In either case, the crawler will miss anything that sits beyond the depth it is willing to go.
  2. Application code acts differently for crawlers: Companies like Arkose Labs make a living preventing robotic code from reaching parts of application logic that are costly or sensitive to being hit repeatedly. There is a ton of code out there that attempts to identify and thwart crawlers, and anything a crawler is prevented from reaching will not be identified.
  3. A crawler cannot crawl things it does not know about: This is a chicken-and-egg problem. How can a crawler crawl something it does not know about? Typically, crawlers start with a seed list of domains, subdomains, or both. For instance, Google started with something called DMOZ (an open directory of web links organized by humans). For a crawler to be effective, it needs an extremely accurate seed list, and anything not in the seed list must be linked to by something in the seed list or must sit in a link chain with at least one node in the seed list. Unfortunately, there are countless examples where this does not happen. For instance, who links to a printer, an IP telephone, or a firewall web interface? If it does happen, it is insanely rare; therefore, anything that is not linked to and falls outside the seed list will effectively be invisible.
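To make the first and third points concrete, here is a minimal sketch of a depth-limited crawl over a toy link graph. Everything in it is an assumption for illustration: the hostnames, the LINKS table, and the crawl function are hypothetical, and a real crawler fetches pages rather than reading a dictionary.

```python
from collections import deque

# Hypothetical link graph: each key is a host, each value lists the hosts it links to.
LINKS = {
    "www.example.com": ["blog.example.com", "shop.example.com"],
    "blog.example.com": ["www.example.com", "cal.example.com"],
    "shop.example.com": [],
    "cal.example.com": ["deep.example.com"],
    "deep.example.com": [],
    "printer.example.com": [],  # nothing links here
    "staging.example.com": [],  # nothing links here
}


def crawl(seeds, links, max_depth):
    """Breadth-first crawl from the seeds, following links up to max_depth hops."""
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        host, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth cap reached: do not follow this host's links
        for target in links.get(host, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


found = crawl(["www.example.com"], LINKS, max_depth=2)
missed = sorted(set(LINKS) - found)
print(missed)
```

With a depth cap of 2, the crawl never reaches deep.example.com, and no depth setting will ever surface the printer or the staging host, because nothing in the link chain from the seed points at them.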

Even if you turn to the best crawler in the world, Google, the fact that Google does not know about your data does not mean an attacker will not be able to find it in other ways. Consider the Target breach. It is a bit of a weird one because Target was compromised through a third-party HVAC (heating, ventilation, and air conditioning) company that retained a backdoor into the HVAC system for maintenance. Because the HVAC company was compromised, Target was compromised.

On the surface, the Target breach is not a great example of why an up-to-date attack surface map is important. But when discussing how to mitigate the risks, the point stands: Target should have known that the HVAC system was publicly accessible. At a very minimum, it should have been locked down to allow maintenance only from certain IP addresses, and ideally only during agreed-upon maintenance windows. Do you think Google would have shown the HVAC system’s web interface had you typed “” into the search bar? Absolutely not.
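The lock-down described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how Target or its vendor configured anything: the network range (taken from the reserved documentation space), the Sunday 02:00 to 04:00 window, and the maintenance_allowed function are all hypothetical.

```python
from datetime import datetime, time
from ipaddress import ip_address, ip_network

# Hypothetical values for illustration only.
ALLOWED_NETWORKS = [ip_network("203.0.113.0/28")]  # vendor's maintenance range
WINDOW_DAY = 6             # Sunday (Monday is 0)
WINDOW_START = time(2, 0)  # 02:00
WINDOW_END = time(4, 0)    # 04:00


def maintenance_allowed(source_ip, when):
    """Return True only if the source is allowlisted and the time is inside the window."""
    addr = ip_address(source_ip)
    from_allowed_network = any(addr in net for net in ALLOWED_NETWORKS)
    in_window = (
        when.weekday() == WINDOW_DAY
        and WINDOW_START <= when.time() < WINDOW_END
    )
    return from_allowed_network and in_window


# Allowlisted address, Sunday 02:30: permitted.
print(maintenance_allowed("203.0.113.5", datetime(2021, 3, 21, 2, 30)))
# Same time, unknown address: refused.
print(maintenance_allowed("198.51.100.7", datetime(2021, 3, 21, 2, 30)))
```

In practice a rule like this would live in a firewall or VPN policy rather than application code, but the logic is the same: both conditions must hold, and neither is possible to write if the asset is missing from the inventory.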

Crawlers are okay tools and even quite useful, but they are simply not the right tool for the job if your goal is to know everything about your web presence. One simple way to think about a traditional asset inventory is to say, “Find everything that Google doesn’t know about,” and then add everything Google does know about. You could say something similar about just about any of the options in this series. The problem is that Google only finds the “crawlable web,” and an asset inventory contains many assets that are not crawlable because no one links to them.

Think about how many items in the typical office environment have web interfaces but no direct links to them:

  • Routers and switches
  • IDSs and firewalls
  • VPNs
  • Webmail
  • Printers
  • HVAC systems
  • Elevator controls
  • Wi-Fi
  • VoIP
  • Etc.

But even beyond those specialized systems, there is an even worse subclass of systems that is typically ignored: the test systems, the staging systems, the QA systems, the administration consoles, and the like. These systems often have no links to them anywhere on the Internet because they are not meant to be publicly accessible. They are explicitly designed to be hidden from the public Internet, whether by network controls or by obfuscation.

If the asset inventory does not have these test, staging, QA, and admin servers in it, the inventory is almost useless. The problem is that these systems are where an enormous number of vulnerabilities live. Let us dig into a real-world case. We have already talked about Equifax, but what about the Sands casino?

Sands casino

The story of the 2014 Sands casino hack is a complex one that spans the seemingly unrelated arenas of geopolitical anti-Semitism and computer security. The Sands casino is owned by one of the wealthiest men in the world, Sheldon Adelson, a Jewish billionaire. At least two hacking teams began to probe Adelson’s casino to punish him. According to Dell SecureWorks, the teams originated from Iran and were targeting him due to his religious beliefs and his place in the Jewish community.

Leading up to the attack, at least two distinct hacking teams were probing the VPN (Virtual Private Network) servers belonging to the Sands casino. They were attempting brute-force attacks to log in and take control of user accounts. The Sands casino’s security teams realized what was happening because of the vast number of failed login attempts. As a result, Sands hardened the application and applied additional levels of security.

However, when Sands doubled down on its VPN security, the hackers redirected their efforts. This time, they looked at other assets that the Sands casino ran. Presumably, the adversaries probed around until, five days after the VPN brute-force attempt, they found a test server that the Sands casino ran. The purpose of the site was to test and review code before it went live, and it therefore had very few protections compared to the main web application used by the casino.

It was not that the casino did not know the attackers were attempting to breach them; it was that they had no idea where the next attack would come from. Because they did not treat their test site the same way they treated their production site, they were compromised. From that site, the attackers pivoted, using a tool called Mimikatz to reveal usernames and passwords. The attackers gained access to virtually every digital file within the Sands casino corporate intranet.

Eventually, they got the login credentials for a senior engineer, whose access let the attackers reach the gaming company’s main servers. Ultimately, on February 10th, 2014, the attackers released a small piece of code that permanently wiped all the computers the hackers had accessed. This was not a matter of extracting money; this was a vendetta-inspired attack aimed directly at Mr. Adelson’s bottom line.

The attackers won that round in what amounted to an attack that would have been very easy to defend against, on a target that should not have been publicly accessible in the first place. With millions of dollars at risk, and the normally strong security posture associated with a casino, there is no doubt that had the security teams known the risk associated with the site, they would have taken proactive measures. But how could they conceivably know the risks if they were not testing it?

If the asset inventory simply misses these critical assets, then you are likely missing a substantial percentage of the interesting issues. Therefore, as you analyze the different methods of identifying assets, crawling, like the brute force we discussed previously, should be one tool in the toolbox. However, if you or your vendor are using crawling exclusively, then you are likely missing a lot of assets. Crawling should not be used in a vacuum.
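One way to picture crawling as a single tool in the toolbox is to merge the output of several discovery methods and record which method found each asset. The sketch below assumes three hypothetical sources ("crawl", "dns_brute_force", and "internal_records") with made-up hostnames; the merge_inventories function is illustrative, not any particular product's approach.

```python
def merge_inventories(sources):
    """Map each asset to the sorted list of discovery methods that found it."""
    merged = {}
    for method, assets in sources.items():
        for asset in assets:
            merged.setdefault(asset, set()).add(method)
    return {asset: sorted(methods) for asset, methods in merged.items()}


# Hypothetical results from three discovery methods.
sources = {
    "crawl": {"www.example.com", "blog.example.com"},
    "dns_brute_force": {"www.example.com", "vpn.example.com"},
    "internal_records": {"staging.example.com", "vpn.example.com"},
}

inventory = merge_inventories(sources)

# Assets that a crawl-only inventory would never have contained.
invisible_to_crawl = sorted(
    asset for asset, methods in inventory.items() if "crawl" not in methods
)
print(invisible_to_crawl)
```

Keeping the provenance alongside each asset makes the gap visible: anything whose list of methods does not include the crawler is exactly what a crawl-only vendor would have left out.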

Want to talk about the right way to do attack surface management? We’ll show you. Get in touch with us here.


Post by Robert Hansen

March 23, 2021