Without a doubt, Google is the most widely used search engine on the internet, renowned for its unmatched search capabilities. Despite the rise of AI technology, a majority of people still rely on it on a daily basis. But did you know that with a few techniques, you can use Google to search for anything more accurately based on your intent?
The ugly truth is that it’s not just ordinary users who are leveraging the full potential of this search engine with the techniques we are going to learn. Hackers and attackers are increasingly using it to uncover data that isn’t meant for public access, exploiting security misconfigurations made by developers. Are you aware that some sensitive data can be inadvertently exposed by using the search engine in this way?
What is a Google Dork?
Google Dorking, also referred to as "Google hacking," is a method of using advanced search operators in the Google search engine to uncover sensitive data on websites, typically data exposed through security flaws or misconfigurations introduced by developers or administrators.
Why does the Google Dork technique exist?
Around 2000, Johnny Long began collecting Google search queries that revealed vulnerable systems or exposed sensitive information. These queries were compiled into what we now know as the Google Hacking Database (GHDB).
Originally, the term "googledork" referred to the inept or foolish person whose information was revealed by Google due to an unintentional misconfiguration in a program they had installed. Over time, the meaning shifted to the search queries used to expose sensitive information through website vulnerabilities.
You may wonder why Google has not taken action against Google Dorking. Although the technique itself is legal, its implications remain questionable: it is often misused, and such misuse can lead to violations of the Computer Fraud and Abuse Act. It is therefore difficult to make a definitive statement on the matter.
Google Dorking can be seen as two sides of the same coin. While it is true that cybercriminals and malicious actors can abuse it, it can also be used by journalists, security researchers, or simply curious users for legitimate purposes.
For example, security researchers may use it to check whether their systems leak sensitive data to the public. Journalists may use it to deepen their knowledge of a particular topic. Curious users, meanwhile, can search for facts that are simply difficult to find with general web browsing.
How does Google Dork work?
Technically, Google Dorking means conducting web searches using Google's advanced search features. Although these features are available whenever you use the Google search engine, they remain relatively underutilized by most users. One reason may be the need to set specific parameters before conducting a search, which is, to be honest, quite cumbersome.
In fact, using Google’s interface isn’t the only way to conduct searches. One of the most common methods involves typing certain operators and terms directly into the search query, enabling you to achieve fine-grained control over the results you receive.
In the following sections, we’ll delve deeper into this technique.
Utilizing Google Dorking Technique
Here, we have collected the important queries that could be useful for you as a security researcher or ethical hacker, along with the actual purpose of each. Ensure that you don't put a space between the operator and the first word of the query (`inurl:admin` is valid, `inurl: admin` is not).
Terms
Terms | Purpose | Example |
---|---|---|
`site:` | Limits results to a specific domain | `site:example.com` |
`inurl:` | Filters results to URLs containing the word (if two words are used, shows pages with either or both) | `inurl:login` |
`allinurl:` | Filters results to URLs containing all of the words | `allinurl: auth login` |
`intitle:` | Filters results by the titles of web pages | `intitle:"index of" etc` |
`intext:` | Filters by specific words contained in the page text | `intext:admin` |
`filetype:` | Filters results based on the file type | `filetype:pdf` |
`related:` | Searches for related sites | `related:example.com` |
`cache:` | Searches for the cached version of a specific website | `cache:example.com` |
Operators
Operator | Purpose | Example |
---|---|---|
`" "` | Searches for an exact match in the results | `"ethical hacking"` |
`*` | Wildcard that matches "anything" before or after a specified query | `site:*.com inurl:wp-admin` |
`+` | Shows pages containing the specific word | `ethical hacking +dork` |
`-` | Excludes results containing a specific word | `ethical hacking -dork` |
`OR` | Combines two or more search queries | `bugbounty OR vdp` |
`@` | Searches a specific social media site | `hacking @twitter` |
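To illustrate how these pieces fit together, the terms and operators above can be combined into a single query string. The following is a minimal sketch in Python; the `build_dork` helper is a hypothetical function of our own, not part of any library:

```python
def build_dork(terms=None, exact=None, exclude=None):
    """Assemble a Google dork query string from the operators above.

    terms:   dict mapping an operator (e.g. "site") to its value;
             note there is no space after the colon.
    exact:   phrase to wrap in quotes for an exact match.
    exclude: words to prefix with "-" so they are filtered out.
    """
    parts = []
    for op, value in (terms or {}).items():
        parts.append(f"{op}:{value}")      # e.g. site:example.com
    if exact:
        parts.append(f'"{exact}"')         # exact-match operator
    for word in exclude or []:
        parts.append(f"-{word}")           # exclusion operator
    return " ".join(parts)

# Find PDFs on example.com mentioning the exact phrase "annual report":
print(build_dork({"site": "example.com", "filetype": "pdf"}, exact="annual report"))
# site:example.com filetype:pdf "annual report"
```

The resulting string can be pasted directly into the Google search box.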
If you need more, you can check the Google Hacking Database, which contains many more queries gathered by contributors worldwide.
How to protect your information from Google Dorking?
Protecting your information, at its core, means preventing that information from being indexed by Google. Therefore, the most critical actions are to take care of your `robots.txt` file and sitemap, and to deter Google's web crawler with `noindex`.
Robots.txt
If you have your own website, you probably already know what it is and how it works. Simply put, `robots.txt` tells crawlers whether a page may be crawled or not. It must be served from the website's root directory to do its job correctly.
In fact, it serves a broader purpose beyond allowing or restricting page crawling; it also regulates access to files and directories. Take this website's `robots.txt` file as an example.
# *
User-agent: *
Allow: /blog/*
# Sitemaps
Sitemap: https://www.binaryte.com/sitemap.xml
Sitemap: https://www.binaryte.com/server-sitemap-index.xml
With this `robots.txt` file, we allow any bot to access this website by using the wildcard (`*`), and we explicitly allow crawling under the "blog" section.
Say we don't want Googlebot to crawl secret pages located in `/secret-for-google/`, but still want to let other bots access them. To do this, you can use "Disallow", and the example above becomes the following.
# *
User-agent: *
Allow: /blog/*
User-agent: googlebot
Disallow: /secret-for-google/*
# Sitemaps
Sitemap: https://www.binaryte.com/sitemap.xml
Sitemap: https://www.binaryte.com/server-sitemap-index.xml
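You can sanity-check rules like these before deploying them. Below is a quick sketch using Python's standard `urllib.robotparser`; note that this parser matches paths by plain prefix, so the example rules are written without the `*` wildcard:

```python
import urllib.robotparser

# Simplified version of the rules above, without wildcards,
# since urllib.robotparser matches paths by plain prefix.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: googlebot
Disallow: /secret-for-google/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is blocked from the secret section; other bots are not.
print(rp.can_fetch("googlebot", "https://www.binaryte.com/secret-for-google/page"))  # False
print(rp.can_fetch("somebot", "https://www.binaryte.com/secret-for-google/page"))    # True
print(rp.can_fetch("googlebot", "https://www.binaryte.com/blog/post"))               # True
```

This makes it easy to confirm that a rule blocks exactly the crawler and path you intended, and nothing more.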
Noindex
The `robots.txt` file tells search engines whether they may access your website or not. However, it is not really a mechanism to hide your web pages from Google search results; its main purpose is to avoid request overload caused by crawlers. Therefore, it is still important to block indexing directly with the `noindex` rule.
`noindex` is a rule specifically used to prevent search engines from indexing content. There are two ways to implement it: as a `<meta>` tag or as an HTTP response header.
- `<meta>` tag
This one is quite easy to do. Simply place the following tag into the `<head>` section of your page.
<meta name="robots" content="noindex">
You can also specify a certain bot only (e.g. Google web crawler) from indexing the page.
<meta name="googlebot" content="noindex">
Since this tag applies to a single page, it prevents indexing only on the particular page where it is present.
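As a sketch of how a scanner might verify this tag is in place, the following uses Python's standard `html.parser`; the `NoindexChecker` class is a hypothetical helper of our own:

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Detects a robots/googlebot meta tag carrying 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            # Match <meta name="robots"|"googlebot" content="...noindex...">
            if a.get("name", "").lower() in ("robots", "googlebot") \
               and "noindex" in a.get("content", "").lower():
                self.noindex = True

page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = NoindexChecker()
checker.feed(page)
print(checker.noindex)  # True
```

Running this against your own pages is a quick way to confirm the tag actually made it into production HTML.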
- HTTP response header
You can also choose to return an `X-Robots-Tag` HTTP header with the value `none` or `noindex`. This header can be attached to non-HTML resources, such as PDF, video, or image files. The HTTP response should look like this.
HTTP/1.1 200 OK
...
X-Robots-Tag: noindex
...
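To illustrate checking for this header programmatically, here is a minimal sketch; the function name and the header dictionaries are our own, for demonstration only:

```python
def blocks_indexing(headers):
    """Return True if the X-Robots-Tag header forbids indexing.

    Both "noindex" and "none" (which implies noindex) block indexing.
    """
    value = headers.get("X-Robots-Tag", "").lower()
    return "noindex" in value or "none" in value

# Example response headers, as a plain dict:
print(blocks_indexing({"X-Robots-Tag": "noindex"}))    # True
print(blocks_indexing({"Content-Type": "text/html"}))  # False
```

In practice you would feed this the headers from a real response (e.g. `urllib.request.urlopen(url).headers`) to audit which of your resources are protected.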
Sitemap
A sitemap serves as a useful guide that helps crawlers understand the structure of your website and traverse its content. With a sitemap, crawling large amounts of content becomes significantly more efficient, as the crawler no longer has to discover every page on its own.
For this reason, it is important to remove unwanted pages from the XML-based sitemap. You can see an example by opening the sitemap link in the `robots.txt` example above.
**Summary**
Google Dorking is a technique that involves the use of advanced search operators in Google search to uncover information that is not readily accessible through conventional search queries. While it involves certain terms and operators, Google Dorking can be employed for both legal and illegal activities. Although the practice has generated controversy, it is widely recognized as a valuable tool for security professionals seeking to detect vulnerabilities and safeguard against potential cyber attacks in the future.