HTML Search

June 7, 2021

Post by Robert Hansen

One of the most powerful features within Bit Discovery is also one of the most overlooked: the HTML search. It is simple, yet powerful. It gives you the unique ability to “see” what is on each homepage within your environment without having to open each page. Think of it as a Google search on steroids: it can see even the pages that Google has not indexed, and it searches the HTML of the page, not just the visible text.

Because it looks at the HTML of each page, HTML search opens up an enormous number of use cases. It allows you to find libraries, functions, API keys, text, titles, copyrights, misconfigurations, default landing pages, device signatures, application errors, and so on.

We originally built it for a particular use case: a large company wanted to find any page that set a cookie but did not have the word “GDPR” on it, a strong sign that their lawyers had not audited that asset. We were able to build that custom subscription exceptionally easily, even though it was not something we would have known to look for ourselves. Similarly, let us say you sold a company and need to find references to the old brand; while it is not something we would know to write a subscription for, it is trivial to build using the “HTML contains” filter.
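To make the logic of that GDPR check concrete, here is a minimal sketch in Python. It is not how Bit Discovery implements subscriptions; the `Page` record, the helper names, and the sample hostnames are all hypothetical, and the check is reduced to a case-insensitive substring match plus a look at the response headers.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Hypothetical record for one indexed homepage (not Bit Discovery's schema).
    hostname: str
    html: str
    headers: dict = field(default_factory=dict)


def html_contains(page, needle):
    # Case-insensitive substring match against the raw HTML.
    return needle.lower() in page.html.lower()


def sets_cookie(page):
    # HTTP header names are case-insensitive, so compare in lower case.
    return any(name.lower() == "set-cookie" for name in page.headers)


def unaudited_cookie_pages(pages):
    # Pages that set a cookie but never mention "GDPR" in their HTML.
    return [
        p.hostname
        for p in pages
        if sets_cookie(p) and not html_contains(p, "GDPR")
    ]


# Made-up sample data for illustration only.
pages = [
    Page("shop.example.com",
         "<html><body>We use cookies. See our GDPR notice.</body></html>",
         {"Set-Cookie": "sid=1"}),
    Page("promo.example.com",
         "<html><body>Summer sale</body></html>",
         {"Set-Cookie": "track=abc"}),
    Page("static.example.com",
         "<html><body>Hello</body></html>",
         {}),
]

flagged = unaudited_cookie_pages(pages)
```

With this sample data, only `promo.example.com` is flagged. The old-brand search from the same paragraph would be the same `html_contains` helper called with the old brand name as the needle.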

The HTML search lets you fingerprint devices with ease, including obscure devices and applications that would not make even the top 100 things you would naturally write a filter for. It allows you to dive into the long tail of search without worrying about how the index works or whether a page was considered important enough to crawl. It really is one of the most practical and powerful features of the product, giving you nearly infinite flexibility to build whatever you need and report on it quickly.
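As a rough illustration of fingerprinting by HTML content, the sketch below matches a page against a small table of marker strings. The `SIGNATURES` table and the `fingerprint` function are assumptions made for this example, not part of the product; the nginx and Apache markers are text that appears on those servers' stock welcome pages, and the router entry is invented.

```python
# Hypothetical signature table: label -> marker string expected in the HTML.
SIGNATURES = {
    "nginx default page": "<title>Welcome to nginx!</title>",
    "Apache default page (Ubuntu)": "Apache2 Ubuntu Default Page",
    "example router admin (invented)": "ExampleRouter Admin Console",
}


def fingerprint(html):
    # Return every label whose marker appears in the HTML, ignoring case.
    lowered = html.lower()
    return sorted(
        label
        for label, marker in SIGNATURES.items()
        if marker.lower() in lowered
    )


sample = "<html><head><title>Welcome to nginx!</title></head></html>"
matches = fingerprint(sample)
```

Adding a long-tail device to a table like this is one more line, which is the point: a new signature costs almost nothing once the HTML is already indexed.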

One reason, among several, that the HTML search within Bit Discovery works so well is that we send “Host” headers on each request, which properly exercises the application logic. Many traditional scanners either send no Host header at all or send the IP address as the Host header, which does not correctly exercise applications behind load balancers and crypto clusters. What makes the HTML filter work so well is the marriage of already knowing where everything is and then indexing what we find when we connect to it.
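The Host header point can be sketched in a few lines of Python. This is only an illustration of the idea, not our crawler: `build_request` and `fetch_homepage` are hypothetical names, the example speaks plain HTTP on port 80 only, and it assumes you already know both the IP address and the hostname that belongs to it.

```python
import socket


def build_request(hostname, path="/"):
    # Build a raw HTTP/1.1 GET whose Host header carries the hostname,
    # not the IP address the socket connects to.
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {hostname}",
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch_homepage(ip, hostname, port=80, timeout=5.0):
    # Connect to the IP directly, but ask for the named virtual host.
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        sock.sendall(build_request(hostname))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Raw response bytes: status line, headers, and body.
    return b"".join(chunks)


request_bytes = build_request("www.example.com")
```

A server or load balancer that hosts many sites on one address uses that Host value to decide which application answers, so a request without it (or with the bare IP in its place) often gets a default page instead of the real homepage.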