Sun May 21

Unveiling Secrets with Content Discovery

Today, we are going to talk about content discovery and why it is essential in cyber security. Many organizations are vulnerable simply because they expose content that should be treated as sensitive, whether intentionally or unintentionally. But before going any further, we need to know what content discovery really is.

What is content discovery?

The definition of content in this context is very broad. It encompasses a wide range of data: multimedia files, web pages, documents, text files, spreadsheets, configuration files, system logs, and much more. However, this content is often hidden and not obvious at first glance, and frequently it is not meant for public access. Nevertheless, that doesn't mean it is undiscoverable.

A common practice for adversaries, before taking any further step, is to look for hidden content or search for even the slightest piece of available information they can leverage to their advantage. It could be a page intended for staff only, such as an admin panel, a configuration file, the server's kernel version or operating system, and so on. So how are they able to retrieve such content or information?

Discovering website content

To reach a wider audience, it is very common for a company or organization to have a website, so the website often becomes the primary target for adversaries. There are, of course, many other resources to take advantage of, but in this article we will focus only on the website.

When trying to discover a website's content, there are three main approaches. Let's break them down.

Manual discovery

Manual discovery means simply learning how the target web application is structured and how it works, using pieces of information that can normally be found on the website itself.

Robots.txt

If you write a blog or have your own website, you have likely heard of robots.txt. A robots.txt file tells search engines whether their crawlers are allowed to access a specific page on your site or not. However, it [is not supposed to be a mechanism to exclude a page](https://developers.google.com/search/docs/crawling-indexing/robots/intro) from a search engine like Google; Google recommends using a noindex meta tag instead. For this reason, lots of websites still get their supposedly hidden pages exposed to the public.

On the other hand, restricting certain areas or pages of a website is a common practice, especially for areas that are password-protected or members-only. In other words, the website owner doesn't want these areas to be publicly available or discoverable by unintended visitors.
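
As a minimal sketch of how a tester might review this file, the snippet below fetches robots.txt and prints every disallowed path. It assumes the third-party requests library is installed, and https://example.com is a placeholder target.

```python
# Minimal sketch: fetch robots.txt and list the paths the site asks crawlers to avoid.
# The target URL is a placeholder, not a real engagement.
import requests

target = "https://example.com"
resp = requests.get(f"{target}/robots.txt", timeout=10)

if resp.status_code == 200:
    for line in resp.text.splitlines():
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            print(f"Potentially interesting path: {target}{path}")
else:
    print(f"No robots.txt found (HTTP {resp.status_code})")
```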

Sitemap.xml

Sitemap.xml is a structured list of every page the website owner wants search engines to index. As a penetration tester, you might find something interesting in this file. For example, some pages may no longer be linked anywhere but still appear in the sitemap, which can give you better insight into the target website.
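
A quick way to review this file is to parse it and list every URL it declares. The sketch below assumes a standard sitemap at a placeholder domain; real sites may also publish nested sitemap indexes.

```python
# Minimal sketch: pull sitemap.xml and print every URL it declares.
import requests
import xml.etree.ElementTree as ET

resp = requests.get("https://example.com/sitemap.xml", timeout=10)
root = ET.fromstring(resp.content)

# Sitemap entries live in the sitemaps.org namespace; <loc> holds each URL.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)
```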

Favicon

Content management systems (CMS) like WordPress are very popular on the internet. If a website is built on such a CMS and the owner or developer decides not to change the default favicon, the favicon alone can reveal which framework or CMS the website uses. This is not limited to WordPress: many frameworks ship a favicon that can be identified by its MD5 hash, which you can compute and compare against the OWASP favicon database.
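
The check itself is straightforward: download the favicon, hash it, and look the digest up. A minimal sketch, with a placeholder target:

```python
# Minimal sketch: hash the favicon so the digest can be compared against
# the OWASP favicon database to guess the underlying framework or CMS.
import hashlib
import requests

resp = requests.get("https://example.com/favicon.ico", timeout=10)
favicon_md5 = hashlib.md5(resp.content).hexdigest()
print(f"Favicon MD5: {favicon_md5}")
```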

HTTP headers

When you interact with a website and make a request to the server, you might find something useful in the response headers. They may reveal the type and version of the server software being used, such as Apache or Nginx. Beyond that, response headers may also contain information about the technologies and frameworks (or scripting languages) employed by the web application or server.

An adversary or penetration tester can also check for the presence of security-related headers such as Content-Security-Policy (CSP) or Strict-Transport-Security (HSTS). The absence of these headers can be an indication of potential security weaknesses.
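
The sketch below shows both checks against a placeholder target: print the disclosure-prone headers if present, and flag a few common security headers if missing. The header list is illustrative, not exhaustive.

```python
# Minimal sketch: inspect response headers for software/framework hints and
# note which common security headers are absent.
import requests

resp = requests.get("https://example.com", timeout=10)

for header in ("Server", "X-Powered-By"):
    if header in resp.headers:
        print(f"{header}: {resp.headers[header]}")

for header in ("Content-Security-Policy", "Strict-Transport-Security", "X-Frame-Options"):
    if header not in resp.headers:
        print(f"Missing security header: {header}")
```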

Framework stack

In certain cases, the web page itself may contain valuable framework information in elements like HTML comments, copyright notices, and more. This information alone can serve as a starting point for gathering additional insights.

OSINT (Open-Source Intelligence)

Utilizing external resources to gather supplementary information about your target is another possibility. This process is commonly known as Open-Source Intelligence (OSINT). There are various approaches available for this kind of information gathering.

Search-engine based

Google dorking

Google dorking leverages advanced search engine functionality to refine and obtain more precise search results. In essence, it involves using specific operators like site:, intitle:, inurl:, and others to filter and narrow down the results. We also have an entire article about Google dorking that explains what it is and how you can use it [here](https://www.binaryte.com/blog/unlocking-the-secrets-of-the-internet-with-google-dorking).
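
For a quick feel of what such queries look like, here is a small sketch that builds search URLs from a few illustrative dorks. The operators are real; the target domain is a placeholder.

```python
# Illustrative dork queries for a placeholder domain, turned into search URLs.
from urllib.parse import quote_plus

dorks = [
    "site:example.com filetype:pdf",          # documents hosted on the target
    "site:example.com inurl:admin",           # possible admin panels
    'site:example.com intitle:"index of"',    # open directory listings
]

for dork in dorks:
    print(f"https://www.google.com/search?q={quote_plus(dork)}")
```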

Shodan and Censys

Shodan and Censys are both powerful tools in the field of OSINT (Open-Source Intelligence). Shodan is a search engine that scans and indexes devices connected to the internet, including servers, routers, webcams, and more. It provides detailed information about these devices, such as open ports, running services, and even known vulnerabilities. This data can be useful for security researchers, network administrators, and attackers alike.

Censys, on the other hand, is a search engine that focuses on scanning and analyzing the security of internet-connected devices and systems. It provides information about SSL certificates, websites, domains, and various other network-related data. Censys helps identify vulnerabilities, misconfigurations, and other potential security issues, and that insight can be used to improve an organization's overall security posture.
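
As a minimal sketch of programmatic use, the snippet below queries Shodan through its official Python library (pip install shodan). The API key and query are placeholders for illustration.

```python
# Minimal sketch using the official Shodan Python library.
# "SHODAN_API_KEY" and the query are placeholders.
import shodan

api = shodan.Shodan("SHODAN_API_KEY")

results = api.search("hostname:example.com")
print(f"Results found: {results['total']}")
for match in results["matches"][:5]:
    # Each match describes one exposed service on one host.
    print(match["ip_str"], match.get("port"), match.get("product"))
```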

Internet archive

The Wayback Machine is an extensive web archive capturing websites since the late 90s. By searching a domain name, you can explore the various instances when the service archived and preserved the webpage’s contents. This valuable tool aids in discovering old pages that might still exist within the present-day website.
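
The archive can also be queried programmatically. The sketch below uses the Wayback Machine's public CDX endpoint to list archived snapshots for a placeholder domain, which can surface pages that no longer exist on the live site.

```python
# Minimal sketch: list archived snapshots of a domain via the Wayback Machine CDX API.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": "example.com/*", "output": "json", "collapse": "urlkey", "limit": 20},
    timeout=30,
)
rows = resp.json()
# The first row is the column header; each following row describes one snapshot.
for _, timestamp, original, *_ in rows[1:]:
    print(timestamp, original)
```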

Cloud-based repository

GitHub is a web-based platform that builds upon the functionality of Git, a version control system used to track file changes in a project. It simplifies collaborative work by allowing team members to monitor each other's edits and modifications to files. After completing their changes, users "commit" them with a message and "push" them to a central repository. Others can then "pull" these changes to their local machines. GitHub, as a hosted version of Git, provides additional features and capabilities. Repositories can be public or private, with various access controls. By leveraging GitHub's search feature, you can locate repositories associated with specific companies or websites, potentially revealing source code, passwords, or other valuable content that you hadn't yet discovered.
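
The same search can be scripted against GitHub's public REST API. The sketch below looks for repositories mentioning a placeholder company name; unauthenticated requests are rate limited, and a token would be needed for code search.

```python
# Minimal sketch: search GitHub repositories for a placeholder organization name.
import requests

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "example-corp", "per_page": 5},
    headers={"Accept": "application/vnd.github+json"},
    timeout=10,
)
for repo in resp.json().get("items", []):
    print(repo["full_name"], repo["html_url"])
```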

Framework and technology identifier

When it comes to exploring the technology or framework stack behind a website, Wappalyzer is a reliable tool worth considering. It excels at this task and even offers a convenient browser extension for easier identification. Additionally, another useful website called WhatCMS can provide valuable assistance in this regard.

Reverse image lookup

If you have images you want to investigate, you will most likely turn to Google for assistance. Unfortunately, the results from Google are sometimes unsatisfactory, so it is worth trying multiple search engines. Yandex can be a valuable option in this regard, and TinEye, a website dedicated to reverse image search, is another excellent choice.

Automated Discovery

While OSINT can be performed with several techniques directly from the browser, content discovery can also be automated with the help of tools that come pre-installed in most Linux distributions specialized for penetration testing.

Brute forcing

In Kali or any other penetration-testing distribution, there are invaluable tools that automate the content discovery process and eliminate the need for manual effort. These tools let pentesters send a large volume of requests to the server, ranging from hundreds to thousands or even millions. Keep in mind that brute forcing the target is an option that comes at the cost of generating significant load on, and noise at, the target server.

In our recent article about [ffuf](https://www.binaryte.com/blog/fuzzing-the-right-way-maximizing-results-with-ffuf), we covered in depth how this kind of tool can help you discover hidden pages or areas by utilizing a wordlist. If you don't know what a wordlist is, it is simply a long list of entries such as passwords, usernames, directory names, or SQL injection payloads used to drive this kind of testing. As alternatives to ffuf, tools such as dirb or Gobuster work in a similar fashion.
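
To make the underlying idea concrete, here is a minimal sketch of what ffuf, dirb, and Gobuster automate (far faster and with many more options): request each candidate path from a wordlist and report anything that is not a 404. The wordlist file and target URL are placeholders.

```python
# Minimal sketch of wordlist-based directory brute forcing.
import requests

target = "https://example.com"

with open("wordlist.txt") as f:
    for word in f:
        word = word.strip()
        if not word:
            continue
        resp = requests.get(f"{target}/{word}", timeout=5)
        if resp.status_code != 404:
            print(f"{resp.status_code}  {target}/{word}")
```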

OSINT-based analysis

theHarvester is one of the tools that can help a pentester look for information on a specified target. It is not limited to discovering domains and subdomains; it can also extract additional data like emails, usernames, IPs, and URLs using multiple sources available on the internet. Alternatively, you can use a framework that looks and feels similar to [Metasploit](https://www.binaryte.com/blog/metasploit-101-a-basic-tutorial-for-penetration-testing) but is designed specifically for reconnaissance purposes. Additionally, you could use a pre-installed tool called dnsenum, which gathers a great deal of information about a particular domain.
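
One small piece of what these tools do is subdomain enumeration. The sketch below simply tries to resolve a handful of candidate subdomains; the candidate list and domain are placeholders, and real tools draw on much larger wordlists and external data sources.

```python
# Minimal sketch: try to resolve candidate subdomains of a placeholder domain.
import socket

domain = "example.com"
candidates = ["www", "mail", "dev", "staging", "vpn", "admin"]

for sub in candidates:
    host = f"{sub}.{domain}"
    try:
        ip = socket.gethostbyname(host)
        print(f"{host} -> {ip}")
    except socket.gaierror:
        pass  # does not resolve
```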

Framework stack identification

Insight into the framework behind a web application plays an important role in determining the most suitable approach to penetrate the system. To achieve this, tools such as WhatWeb can be used for accurate framework identification.
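
As a rough illustration of the kind of fingerprinting WhatWeb automates (it uses a much larger signature set), the sketch below looks for a generator meta tag and a couple of well-known CMS path signatures in the page body of a placeholder target.

```python
# Minimal sketch: crude framework fingerprinting from the page body.
import re
import requests

resp = requests.get("https://example.com", timeout=10)
body = resp.text

match = re.search(r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)', body, re.I)
if match:
    print(f"Generator meta tag: {match.group(1)}")

if "wp-content" in body or "wp-includes" in body:
    print("WordPress asset paths detected")
```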

Firewall

It is very common for a website to be protected by a firewall, but not every website is. Of course, the absence of a firewall makes a penetration attempt much easier. Since we want to gather the most thorough information possible about the target, knowing whether it sits behind a firewall is no exception. For this, you can use a tool called wafw00f.
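
The sketch below hints at the kind of check wafw00f performs: look for headers and cookies that well-known WAF/CDN products set. The signature list is illustrative only, and the target is a placeholder.

```python
# Minimal sketch: spot a few well-known WAF/CDN signatures in headers and cookies.
import requests

resp = requests.get("https://example.com", timeout=10)

header_signatures = {
    "cf-ray": "Cloudflare",
    "x-sucuri-id": "Sucuri",
    "x-akamai-transformed": "Akamai",
}

for header, product in header_signatures.items():
    if header in resp.headers:
        print(f"Possible WAF/CDN detected: {product}")

for cookie in resp.cookies:
    if cookie.name.startswith(("incap_ses", "visid_incap")):
        print("Possible WAF detected via cookies: Imperva Incapsula")
```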

Conclusion

That being said, this article only focuses on targeting the website. It is important to acknowledge that in the real world, adversaries may employ a wide range of techniques and tools to target the server itself. These methods can vary significantly from the ones discussed here. Therefore, it is crucial for security practitioners to stay updated with evolving attack vectors and maintain a holistic approach to server security. By continuously adapting and fortifying both the website and server defenses, organizations can better protect themselves against potential threats.