Before we begin, I'm not responsible for what people do with the information ahead. I am also not the first to post about this, and I certainly won't be the last. This article is simply a compilation of different techniques to adapt to various situations that could be encountered while penetration testing, as well as a look at how threat actors may go about using search engines as weapons.

Dorking is a method of using search filters to narrow a search within a specified range in order to gain access to sensitive information. That sensitive information can consist of customer data, internal documents, server credentials, etc. Having access to this kind of information lays out a very large attack surface against organizations. Defending against dorking can be extremely difficult, especially since doing so costs time and money that an organization may not have.

Github dorking

I, personally, find Github to be the most interesting platform to dork on. For the uninitiated, Github provides internet hosting and version control for software, which is awesome because people can collaborate and share the amazing creations they make. However, sometimes they share a bit too much. Even as developers try to implement secure practices in their code, they leave a lot of sensitive information in what they push to their repositories, such as credentials to web servers, access keys, and even passwords to accounts. It's worth noting that secure programming and repo management aren't taught very well in school. That's definitely not an excuse, but it is easy to see why it happens so often. It's especially easy to see when companies allow interns and new employees to commit cardinal sins like pushing broken code directly to the master branch.

140k is a bit high as there can definitely be false positives

In the image above, I searched with a filter specifically for the filename wp-config.php. I've left out the other half of the screenshot since it shows information about active websites. Wp-config contains a Wordpress website's base configuration details, such as database connection information. This is extremely important since most content for a Wordpress site is stored in a database. Therefore, if we gain access to it, we can modify the content of the website. This is bad, if you haven't caught on yet... It's how attackers deface many websites. As a note, these dorks aren't restricted to Wordpress sites. They span almost all software that is either open source or has parts of its source code publicly available.
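If you'd rather not page through results by hand, the same dork can be run through Github's code search API. Below is a minimal sketch, assuming a personal access token exported as GITHUB_TOKEN (code search requires authentication, and the API wants at least one plain search term alongside qualifiers, so DB_PASSWORD is added to the filename filter):

```python
# Minimal sketch of the wp-config.php dork against Github's code search API.
# Assumes a personal access token is exported as GITHUB_TOKEN.
import os
import requests

token = os.environ["GITHUB_TOKEN"]  # code search requires authentication
resp = requests.get(
    "https://api.github.com/search/code",
    headers={
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    },
    params={"q": "DB_PASSWORD filename:wp-config.php", "per_page": 10},
)
resp.raise_for_status()

results = resp.json()
# total_count will include false positives, like the 140k figure above
print(f"~{results['total_count']} matching files")
for item in results["items"]:
    # repository and path are enough to decide whether a hit deserves a closer look
    print(item["repository"]["full_name"], item["path"])
```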

This is one of the repos I found. The situation was clear. An elderly gentleman had just completed his new novel, which seemed to be the culmination of his life's work. He intended to have a website created to advertise and explain his book, so he outsourced the website's creation for, what I can only imagine to be, dirt cheap. Figures, you get what you pay for. How did I know this? Well, the vendor had their business name as their Github username. The business was from a third world country where I'm sure education on web development isn't exactly accessible. There was another customer's Wordpress site under "Dum Dum's" username as well, which is concerning.

The Meats

The image above is the wp-config file with the database name, user, and password blurred. These all belonged directly to our new author's website, as the name of the website was found elsewhere in the repository. Everything else in the above file is standard across every Wordpress site. If you're thinking of calling the cops about this, you need to stop being a karen and go look at Wordpress. Now some attackers will look for similar situations to generate targets. It's almost like dumpster diving through an office's paperwork. What are the implications at this scale? Minimal in comparison to when a developer leaves access credentials to AWS instances or S3 buckets that run core corporate infrastructure (save your disbelief, I've found these before). Leaks like this can easily be show stoppers, and they require little to no skill to find, making the barrier to entry very low. For a quick list of Github dorks, ironically, check out the readme on this Github repo. I don't care very much for the small python script that's also in the repository, so realize I'm solely referencing their list of manual dorks.
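To show just how low that barrier is, here's a rough sketch of pulling the database constants out of a leaked wp-config.php once it's been downloaded. The local filename is a placeholder and the regex only covers the standard define() lines, but that's genuinely all it takes:

```python
# Rough sketch: the database constants in wp-config.php are plain define()
# calls, so a regex is enough. 'wp-config.php' is a local copy of a file
# found through the dork above (placeholder path).
import re

WP_KEYS = ("DB_NAME", "DB_USER", "DB_PASSWORD", "DB_HOST")

with open("wp-config.php", encoding="utf-8", errors="ignore") as fh:
    source = fh.read()

creds = {}
for key in WP_KEYS:
    # matches lines like: define( 'DB_PASSWORD', 'hunter2' );
    match = re.search(
        rf"define\(\s*['\"]{key}['\"]\s*,\s*['\"](.*?)['\"]\s*\)", source
    )
    if match:
        creds[key] = match.group(1)

print(creds)  # everything needed to connect to the site's database
```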

Google is the Notorious B.I.G. of sites to dork. The granularity the search engine's filters offer is unmatched for gaining access to important documents or looking for extremely specific things. Since it is the world's largest search engine, there's hardly a website that isn't searchable. You can check here for a nice list of the main search filters used.

Above you can see that I've done a search for the keyword "budget" with the file extension .xls (excel spreadsheet). Of course, it returns quite a bit because every organization has a budget they abide by. Whether they post it publicly usually depends on the organization. A small municipality will probably post its budget publicly since it has a duty to its constituents to be transparent, but an insurance company might keep it under wraps due to fraudsters attempting to solicit more money. Obviously this can be problematic, as it now gives attackers and pentesters a way of gaining insight into an organization's movement of money and operations. Other convenient filters include: equipment filetype:xls, taxes filetype:xls, ehr filetype:xlsx, and hie filetype:xlsx. Filters vary with the industry and the target an attacker is going against. Being well versed in a target's lingo will allow an attacker to search for more pertinent documents. For example, EHR and HIE are specific to healthcare. I'm not well versed in the medical industry, but my man Brian Hochstuhl (@BrianHochstuhl) is an expert in hospital management. He gave me a rundown on how hospitals run, the information they find to be important/sensitive, and how they generate enough money to keep the doors open. Brian was key in determining whether the healthcare information found through my dorks was actually sensitive or not.

EHR stands for electronic health record. EHRs store patient data where, as you can imagine, HIPAA comes into the picture. An HIE is a Health Information Exchange, which is a collection of EHRs. If documents pertaining to either of these are found, sensitive information is most likely going to be found with them. Hence why penetration testers should definitely check what hospitals store on their public facing websites. Due to the lack of cybersecurity in most hospitals, it is extremely likely we'll see hospitals become the target of cyber attacks in the near future. Cyber terrorism is on the horizon with critical infrastructure in its sights, and healthcare is a major component of critical infrastructure. The world has already seen the effects of ransomware against healthcare when Wannacry hit. Wannacry forced hospitals, among hundreds of thousands of other businesses around the world, to either divert care or not accept patients as their machines were locked down by the ransomware. 80 hospitals in the UK alone were entirely shut down. Studies are still being done to determine how this specific attack affected the mortality rates of patients receiving care during the initial four-day outbreak.

When testing against a specific organization, one might want to use a search like so: site:"https://www.organization.com/" AND filetype:"xls". This will return any excel documents that are accessible on www.organization.com, creating a much more targeted reconnaissance vector. Combine the previous with AND (before:2017-01-01) and you've got yourself all xls documents from before 1 JAN 2017. A lot of companies tend to forget about the location of old documents, and those documents become a treasure trove for attackers. Google dorking requires a good bit of creativity to determine key terms of common documents created by an organization. It's almost like searching for needles in a hackstack, but the needles could actually be worth a lot of money to someone out there.
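For repeat engagements, it can be handy to generate these query strings programmatically and keep a list per target. A quick sketch follows; the domain and keywords are placeholders, and the output is meant to be pasted into Google by hand rather than automated:

```python
# Throwaway helper for building the targeted queries described above.
# Domain, keywords, and dates are placeholders.
def build_dork(domain, filetype, keyword=None, before=None):
    parts = [f'site:"{domain}"', f'filetype:"{filetype}"']
    if keyword:
        parts.insert(0, keyword)
    if before:
        parts.append(f"(before:{before})")
    return " AND ".join(parts)

targets = [
    build_dork("https://www.organization.com/", "xls", keyword="budget"),
    build_dork("https://www.organization.com/", "xls", before="2017-01-01"),
    build_dork("https://www.organization.com/", "xlsx", keyword="ehr"),
]
for query in targets:
    print(query)
```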

Shodan.io has been dubbed the "hacker's google." Serving as an internet intelligence platform, it collects information about every device it finds connected to the internet. Shodan continues to serve as a great way to do passive reconnaissance (recon that doesn't touch an attacker's target).

The above shows filtering by city, device, and port. It is fairly specific, but it touches on searching by geographic location, device type, and specific ports open on a device. All of these can be used for target generation by attackers. Let's say an attacker has just created an exploit for HP printers that targets the Internet Printing Protocol (port 631). They can use a Shodan query to log the resulting IP addresses, automate the attack against every device in that list, and fire it off without having to do any more work. Once the attack is launched, the attacker just has to wait for the successful resolution of their exploit and collect the loot. Shodan takes the heavy lift of port scanning off of pentesters' and attackers' shoulders, as scanning can be resource and time consuming. I don't think there's a need to go incredibly deep on Shodan since Shodan safaris could take up an entire article of their own. If you're still interested, just search #shodansafari on twitter and check out all the cool finds people post about. For more filters on Shodan, check out this repo.
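For anyone who'd rather script that target generation, Shodan publishes an official Python library (pip install shodan). A minimal sketch, assuming a valid API key (the key and query values below are placeholders) and the same style of city/port filtering described above:

```python
# Minimal sketch of city/port filtering through Shodan's official Python library.
import shodan

api = shodan.Shodan("YOUR_API_KEY")  # placeholder; most filters need a paid or membership key

# e.g. everything Shodan has seen in one city with the IPP port (631) open
results = api.search('city:"Atlanta" port:631')

print(f"{results['total']} results")
for match in results["matches"]:
    # an IP, a port, and a banner are all an attacker needs to build a target list
    print(match["ip_str"], match["port"], match.get("product", ""))
```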

In doing prep for this article, I stumbled upon a LOT of sensitive information in a short amount of time (the majority of that experience was left out for clear legal reasons). If I can find a password to a webserver within 10 minutes, imagine what a malicious actor is currently doing. Remediating this would be near impossible for search engines at this point, unless they restricted the functionality of their products. I'm sure they're not interested in doing that, so the best answer here is for companies and organizations to regularly inspect and prune their data. If they already need to complete compliance inspections (HIPAA, Privacy Act, GDPR), why aren't they taking the extra step to review the contents of their public facing services and workspaces? Some small organizations just may not know to do these inspections. Some large organizations may not have the bandwidth to go over everything stored on every single server they own. Hence why data loss and cyber insurance are now a thing. It's not the right answer, but it is an answer. Be aware of any information that touches the internet: though these three search engines primarily serve the current state of a website, a website may be archived and permanently leave artifacts out there. Check out the WayBack machine to see archived versions of websites at different points in time.
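On the "inspect and prune" note, even a crude scan of a public web root or repository checkout will catch the worst offenders before a search engine ever indexes them. A rough sketch, with patterns and path that are illustrative rather than exhaustive:

```python
# Rough sketch of a self-audit: walk a directory that is (or is about to be)
# public and flag files containing credential-looking strings.
import os
import re

PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "wordpress_db_password": re.compile(r"define\(\s*['\"]DB_PASSWORD['\"]"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue
            for label, pattern in PATTERNS.items():
                if pattern.search(text):
                    print(f"{path}: possible {label}")

scan("/var/www/html")  # placeholder path for a public web root
```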

As always, stay frosty.