Without a doubt, Google is the most widely used search engine on the internet, renowned for its unmatched search capabilities. Despite the rise of AI technology, a majority of people still rely on it on a daily basis. But did you know that with a few techniques, you can use Google to search for anything more accurately based on your intent?
The ugly truth is that it’s not just ordinary users who are leveraging the full potential of this search engine with the techniques we are going to learn. Hackers and attackers are increasingly using it to uncover data that isn’t meant for public access, exploiting security misconfigurations made by developers. Are you aware that some sensitive data can be inadvertently exposed by using the search engine in this way?
What is a Google Dork?
Google Dorking, also referred to as "Google hacking," is a method of using advanced search operators in the Google search engine to uncover sensitive data on websites, typically data exposed through security flaws or misconfigurations introduced by developers or administrators.
Why does the Google Dork technique exist?
Around 2000, Johnny Long began collecting Google search queries that revealed vulnerable systems or exposed sensitive information. These queries were compiled into what we now know as the Google Hacking Database (GHDB).
Originally, the term "googledork" referred to the inept or foolish person whose information was revealed by Google due to an unintentional misconfiguration in a program they had installed. Over time, the meaning shifted to the search queries used to expose sensitive information through website vulnerabilities.
You may wonder why Google has not taken action against Google Dorking. Although the technique itself is legal, its implications remain questionable: it is often misused, and such misuse can lead to violations of the Computer Fraud and Abuse Act. It is therefore difficult to make a definitive statement on the matter.
Google Dorking can be seen as two sides of the same coin. While it is true that cybercriminals and malicious actors can abuse it, it can also be used by journalists, security researchers, or simply curious users for legitimate purposes.
For example, security researchers may use it to check whether their systems leak sensitive data to the public. Journalists may use it to deepen their knowledge of a particular topic. Curious users, meanwhile, can search for facts that are simply difficult to find with general web browsing.
How does Google Dork work?
Technically, Google Dorking means conducting web searches using Google's advanced search features. Although these features are available whenever you use the Google search engine, they remain relatively underutilized by most users. One reason may be the need to set specific parameters before conducting a search, which is, to be honest, quite cumbersome.
In fact, using Google’s interface isn’t the only way to conduct searches. One of the most common methods involves typing certain operators and terms directly into the search query, enabling you to achieve fine-grained control over the results you receive.
In the following sections, we’ll delve deeper into this technique.
Utilizing Google Dorking Technique
Here, we have collected the important queries that could be useful for you as a security researcher or ethical hacker, along with the actual purpose of each. Ensure that you don't put a space between the operator and the first word of the query (`inurl:admin` is valid, `inurl: admin` is not).
Terms
Terms | Purpose | Example |
---|---|---|
`site:` | Limits results to a specific domain | `site:example.com` |
`inurl:` | Filters results to URLs containing the word (if two words are used, shows pages with either or both) | `inurl:login` |
`allinurl:` | Filters results to URLs containing all of the words | `allinurl: auth login` |
`intitle:` | Filters results by the titles of web pages | `intitle:"index of" etc` |
`intext:` | Filters by specific words contained in the page text | `intext:admin` |
`filetype:` | Filters results based on the file type | `filetype:pdf` |
`related:` | Searches for related sites | `related:example.com` |
`cache:` | Searches for the cached version of a specific website | `cache:example.com` |
Operators
Operator | Purpose | Example |
---|---|---|
`" "` | Searches for an exact match in the results | `"ethical hacking"` |
`*` | Wildcard that matches "anything" before or after a specified query | `site:*.com inurl:wp-admin` |
`+` | Shows pages containing the specific word | `ethical hacking +dork` |
`-` | Excludes results containing a specific word | `ethical hacking -dork` |
`OR` | Combines two or more search queries | `bugbounty OR vdp` |
`@` | Searches a specific social media site | `hacking @twitter` |
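To illustrate how these pieces fit together, the terms and operators above can be combined into a single query string. The following is a minimal sketch in Python; the `build_dork` helper is a hypothetical function of our own, not part of any library:

```python
def build_dork(terms=None, exact=None, exclude=None):
    """Assemble a Google dork query string from the operators above.

    terms:   dict mapping an operator (e.g. "site") to its value;
             note there is no space after the colon.
    exact:   phrase to wrap in quotes for an exact match.
    exclude: words to prefix with "-" so they are filtered out.
    """
    parts = []
    for op, value in (terms or {}).items():
        parts.append(f"{op}:{value}")      # e.g. site:example.com
    if exact:
        parts.append(f'"{exact}"')         # exact-match operator
    for word in exclude or []:
        parts.append(f"-{word}")           # exclusion operator
    return " ".join(parts)

# Find PDFs on example.com mentioning the exact phrase "annual report":
print(build_dork({"site": "example.com", "filetype": "pdf"}, exact="annual report"))
# site:example.com filetype:pdf "annual report"
```

The resulting string can be pasted directly into the Google search box.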
If you need more, you can check the Google Hacking Database, which contains many more queries gathered by contributors worldwide.
How to protect your information from Google Dorking?
Protecting your information, at its core, means preventing that information from being indexed by Google. Therefore, the most critical actions are to take care of your `robots.txt` file and sitemap, and to deter Google's web crawler with `noindex`.
Robots.txt
If you have your own website, you probably already know what it is and how it works. Simply put, `robots.txt` tells crawlers whether a page may be crawled or not. It must be served from the website's root directory to do its job correctly.
In fact, it serves a broader purpose beyond allowing or restricting page crawling; it also regulates access to files and directories. Take this website's `robots.txt` file as an example.
# *
User-agent: *
Allow: /blog/*
# Sitemaps
Sitemap: https://www.binaryte.com/sitemap.xml
Sitemap: https://www.binaryte.com/server-sitemap-index.xml
With this `robots.txt` file, we allow any bot to access this website by using the wildcard (`*`), and we explicitly allow crawling under the "blog" section.
Say we don't want Googlebot to crawl secret pages located in `/secret-for-google/`, but still want to let other bots access them. To do this, you can use "Disallow", and the example above becomes the following.
# *
User-agent: *
Allow: /blog/*
User-agent: googlebot
Disallow: /secret-for-google/*
# Sitemaps
Sitemap: https://www.binaryte.com/sitemap.xml
Sitemap: https://www.binaryte.com/server-sitemap-index.xml
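You can sanity-check rules like these before deploying them. Below is a quick sketch using Python's standard `urllib.robotparser`; note that this parser matches paths by plain prefix, so the example rules are written without the `*` wildcard:

```python
import urllib.robotparser

# Simplified version of the rules above, without wildcards,
# since urllib.robotparser matches paths by plain prefix.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: googlebot
Disallow: /secret-for-google/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot is blocked from the secret section; other bots are not.
print(rp.can_fetch("googlebot", "https://www.binaryte.com/secret-for-google/page"))  # False
print(rp.can_fetch("somebot", "https://www.binaryte.com/secret-for-google/page"))    # True
print(rp.can_fetch("googlebot", "https://www.binaryte.com/blog/post"))               # True
```

This makes it easy to confirm that a rule blocks exactly the crawler and path you intended, and nothing more.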
Noindex
The `robots.txt` file tells search engines whether they may access your website or not. However, it is not really a mechanism to hide your web pages from Google search results; its main purpose is to avoid request overload caused by crawlers. Therefore, it is still important to block indexing directly with the `noindex` rule.
`noindex` is a rule specifically used to prevent search engines from indexing content. There are two ways to implement it: as a `<meta>` tag or as an HTTP response header.
- `<meta>` tag
This one is quite easy to do. Simply place the following tag into the `<head>` section of your page.
<meta name="robots" content="noindex">
You can also specify a certain bot only (e.g. Google web crawler) from indexing the page.
<meta name="googlebot" content="noindex">
Since this tag applies to a single page, it prevents indexing only on the particular page where it is present.
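As a sketch of how a scanner might verify this tag is in place, the following uses Python's standard `html.parser`; the `NoindexChecker` class is a hypothetical helper of our own:

```python
from html.parser import HTMLParser

class NoindexChecker(HTMLParser):
    """Detects a robots/googlebot meta tag carrying 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            # Match <meta name="robots"|"googlebot" content="...noindex...">
            if a.get("name", "").lower() in ("robots", "googlebot") \
               and "noindex" in a.get("content", "").lower():
                self.noindex = True

page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
checker = NoindexChecker()
checker.feed(page)
print(checker.noindex)  # True
```

Running this against your own pages is a quick way to confirm the tag actually made it into production HTML.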
- HTTP response header
You can also choose to return an `X-Robots-Tag` HTTP header with the value `none` or `noindex`. This header can be attached to non-HTML resources, such as PDF, video, or image files. The HTTP response should look like this.
HTTP/1.1 200 OK
...
X-Robots-Tag: noindex
...
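To illustrate checking for this header programmatically, here is a minimal sketch; the function name and the header dictionaries are our own, for demonstration only:

```python
def blocks_indexing(headers):
    """Return True if the X-Robots-Tag header forbids indexing.

    Both "noindex" and "none" (which implies noindex) block indexing.
    """
    value = headers.get("X-Robots-Tag", "").lower()
    return "noindex" in value or "none" in value

# Example response headers, as a plain dict:
print(blocks_indexing({"X-Robots-Tag": "noindex"}))    # True
print(blocks_indexing({"Content-Type": "text/html"}))  # False
```

In practice you would feed this the headers from a real response (e.g. `urllib.request.urlopen(url).headers`) to audit which of your resources are protected.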
Sitemap
A sitemap serves as a useful guide that helps crawlers understand the structure of your website and traverse its content. With a sitemap, crawling large amounts of content becomes significantly more efficient, as the crawler no longer has to discover every page on its own.
For this reason, it is important to remove unwanted pages from the XML-based sitemap. You can see an example by opening the sitemap link in the `robots.txt` example above.
**Summary**
Google Dorking is a technique that involves the use of advanced search operators in Google search to uncover information that is not readily accessible through conventional search queries. While it involves certain terms and operators, Google Dorking can be employed for both legal and illegal activities. Although the practice has generated controversy, it is widely recognized as a valuable tool for security professionals seeking to detect vulnerabilities and safeguard against potential cyber attacks in the future.