A Return to Phishing

With Python Digital Forensics Cookbook out the door for some time now, I have rediscovered time to work on projects, including this one. With the book wrapped up, I intend to rededicate myself to bi-weekly posts and tutorials on various forensic topics. In this post, we will go over some new additions to the Go Phishing script introduced here a few months ago. The code can be found on the PST-Go-Phish GitHub repository. Join me on the development journey as we create a phishing discovery tool from scratch. Let’s get started!

Go Phish

I spent some time reorganizing things and cleaning up the code – dev work that I won’t bore you with. Instead, let’s focus on two of the new additions: progress bars and link thresholds! As a reminder, the script processes PST or OST files and, based on various attributes of a given email, flags the emails it deems worthy of further review. The new command-line arguments can be seen in the code block below:

ubuntu@ubuntu:~/Desktop$ python pst_go_phish.py -h
usage: pst_go_phish.py [-h] [-i IGNORE] [-t THRESHOLD] [-l LINKS]
                       PST_FILE OUTPUT_DIR

PST Go Phishing..

positional arguments:
  PST_FILE              File path to input PST file
  OUTPUT_DIR            Output Dir for CSV

optional arguments:
  -h, --help            show this help message and exit
  -i IGNORE, --ignore IGNORE
                        Comma-delimited acceptable emails to ignore e.g.
                        (bounce lists, etc.)
  -t THRESHOLD, --threshold THRESHOLD
                        Flag emails where sender has only sent N email to the
                        mailbox (default 1)
  -l LINKS, --links LINKS
                        Flag emails where the link has only sent/received N
                        times (default 1)
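
For readers curious how options like these are usually declared, here is a minimal argparse sketch that produces a similar help screen. The build_parser name and the int types for the two thresholds are assumptions made for illustration; the actual script may declare its arguments differently.

```python
import argparse


def build_parser():
    # Hypothetical helper: argument names mirror the help text above,
    # but the types and defaults shown here are assumptions.
    parser = argparse.ArgumentParser(
        prog="pst_go_phish.py", description="PST Go Phishing..")
    parser.add_argument("PST_FILE", help="File path to input PST file")
    parser.add_argument("OUTPUT_DIR", help="Output Dir for CSV")
    parser.add_argument("-i", "--ignore",
                        help="Comma-delimited acceptable emails to ignore")
    parser.add_argument("-t", "--threshold", type=int, default=1,
                        help="Flag senders seen N times or fewer")
    parser.add_argument("-l", "--links", type=int, default=1,
                        help="Flag link domains seen N times or fewer")
    return parser


# Parse a sample command line instead of sys.argv so the sketch is self-contained
args = build_parser().parse_args(["mailbox.pst", "out", "-l", "3"])
print(args.PST_FILE, args.threshold, args.links)
```

Running the script with -l 3 would, under these assumptions, flag any link domain that appears three times or fewer in the mailbox.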

Progress Bars – TQDM

What’s the big deal with progress bars? Is that really a question that needs answering? Every program that takes some time to process should have them, including this one. If you’ve read either of the books I’ve written with Chapin Bryce, you’ll know our favorite progress bar library is TQDM. It is incredibly simple and easy to customize. In this case, we initialize individual progress bars for each folder processed within the mailbox. This is a third-party library and can be installed the usual way for either Python 2.X or 3.X:

pip install tqdm

To create a progress bar with TQDM, we wrap the tqdm() function around an iterable in a for loop. This is similar to using the built-in enumerate() function to create an auto-incrementing counter in a for loop. However, we can also supply various keyword arguments to the tqdm() call to customize our progress bar. Let’s discuss the example below:

import tqdm

def processMessages(folder, ignore):
	global messages
	print("[+] Processing {} Folder with {} messages".format(folder.name, folder.number_of_sub_messages))
	# Nothing to process in an empty folder
	if folder.number_of_sub_messages == 0:
		return
	# One progress bar per folder; tqdm advances it on every iteration
	for message in tqdm.tqdm(folder.sub_messages, desc="Processing", unit="emails"):
		eml_from, replyto, returnpath = ("", "", "")
		messages += 1

In this case, as long as the provided folder has messages, we iterate through each message and wrap that loop in a progress bar. This way, for each message that gets processed, the progress bar increments on its own and requires no further input from us. Beyond passing the iterable, folder.sub_messages, to tqdm(), we also set the desc and unit keywords, which control the label and the unit name shown on the bar. That’s it – this is as hands-off a progress bar library as you’ll find.
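
To try the same pattern outside of the script, here is a small standalone sketch. The emails list is made-up stand-in data, and the try/except fallback is an assumption added so the snippet still runs where tqdm is not installed; it is not part of the script.

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fallback for this sketch only: iterate without drawing a bar
    def tqdm(iterable, **kwargs):
        return iterable

# Made-up stand-in for folder.sub_messages
emails = ["msg-{}".format(i) for i in range(250)]

processed = 0
# desc sets the label on the left; unit replaces the default "it" in the rate
for email in tqdm(emails, desc="Processing", unit="emails"):
    processed += 1

print("Processed {} emails".format(processed))
```

Because tqdm can read the length of a list, the bar shows a total and a percentage; for iterables without a known length, it shows only a running count and rate.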


As shown in the screenshot above, a progress bar is created for each folder and lists a few metrics, including the current and total number of emails being processed, the time elapsed, and the number of emails processed per second. This is a drop-the-mic moment; I don’t think further discussion of why this is a useful library is necessary. Let’s move on to the next topic of this post.

Link Thresholds

I think we all understand that suspicious embedded links within emails are the hallmark of any phishing campaign. Thus far, the script has not performed any kind of analysis on links within an email. That may have been a bit of an oversight, which we will start to correct in this post.

In this post, we will look at how to extract links from an email using regular expressions, along with a simple way to identify links worthy of a second look. Before we get into this, we need to install another third-party module, tldextract:

pip install tldextract

This library allows us to take a link we extract from an email and parse out just the domain component. Similar to what we did to identify unique senders, we will identify unique domains in a given mailbox. The idea here is that a bad actor is unlikely to send many phishing emails linking to the same domain to the same mailbox. While that is not always the case, it is a useful thing to check for. With the -l switch, the user can specify exactly how “unique” a given domain should be before it is flagged. By default, a domain is flagged if it appears only once within the entire mailbox. Enough talk, let’s look at the code:

import re
import tldextract

def linkExtractor(body, body_type):
	# Default to no URLs so that other body types do not raise a NameError
	urls = []
	if body_type == "html":
		urls = re.findall(r'href=[\'"]?([^\'" >]+)', body)
	elif body_type == "text":
		urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', body)
	# Reduce each URL to its registered domain; the set drops duplicates
	return set(tldextract.extract(x).registered_domain for x in urls)

When a given email is being processed, the script now extracts the body content from the message and sends it to linkExtractor() to identify links. Libpff represents email body content in one of the following formats: HTML, text, or RTF. Depending on the type, we use a different regular expression to extract any available links. Essentially, in HTML emails the links are embedded in the href attributes of anchor tags, whereas the other formats store links as raw strings without any HTML markup.

The findall() function from re, the built-in regular expression library, returns a list of matches. Once the links are identified, we pass a generator expression to set() to build a set of the unique domains present in the email. This means that if a particular domain is present more than once in a given email, it will only be counted once, because a set only holds unique elements.
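
The deduplication step can be tried without a PST file. The sketch below reuses the href pattern from linkExtractor() on a made-up HTML snippet. Note the swap: it uses the hostname from the standard library's urlparse (Python 3) as a stand-in for tldextract's registered domain, so subdomains are not collapsed the way they would be in the script. The unique_hosts name is hypothetical.

```python
import re
from urllib.parse import urlparse

# Same href pattern used for HTML bodies in linkExtractor()
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')


def unique_hosts(html_body):
    # Hypothetical helper: hostname stands in for the registered domain
    hosts = (urlparse(url).hostname for url in HREF_RE.findall(html_body))
    # Drop links without a hostname (e.g. mailto:) and let the set deduplicate
    return {host for host in hosts if host}


# Made-up sample body with two links to the same host
sample = (
    '<a href="http://example.com/login">Log in</a> '
    "<a href='http://example.com/reset'>Reset</a> "
    '<a href="https://test.org/">Other</a> '
    '<a href="mailto:someone@example.net">Mail</a>'
)

hosts = unique_hosts(sample)
print(sorted(hosts))  # ['example.com', 'test.org']
```

Even though example.com appears twice in the sample, it is reported once, which is the per-email behavior described above.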

From here on out, processing is handled much like the unique-sender identification from the previous post. The returned set of unique domains is added to a link dictionary, which keeps track of how often each domain appears. After all emails are processed, the linkThreshold() function is called to identify any domains whose frequency is at or below the threshold.

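def linkThreshold(threshold):
	global message_list, links_dict
	link_count = 0
	for link in links_dict:
		# Index 0 holds the domain's frequency across the mailbox
		if links_dict[link][0] <= threshold:
			link_count += 1
			# The remaining elements form the output row, tagged with the reason
			tmp_list = links_dict[link][1:]
			tmp_list.append("Link Threshold")
			# Assumption: message_list collects the rows written to the output CSV
			message_list.append(tmp_list)
	print("[+] Identified {} domains less than or equal to the threshold".format(link_count))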

Index 0 of a domain’s list in links_dict is the frequency of that domain within the processed emails. If it does not exceed the threshold, which is set to 1 by default, then the domain is flagged in the output spreadsheet. The end of the function prints the number of domains whose frequency was less than or equal to the threshold.
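
The counting and threshold logic can be sketched end to end with made-up data. This is a simplified illustration, not the script's implementation: it uses collections.Counter rather than the list-based links_dict, and the flag_rare_domains name and sample mailbox are hypothetical.

```python
from collections import Counter


def flag_rare_domains(domains_per_email, threshold=1):
    # Hypothetical helper: count each domain once per email it appears in
    counts = Counter()
    for domains in domains_per_email:
        counts.update(set(domains))
    # Keep domains seen `threshold` times or fewer across the mailbox
    return sorted(d for d, n in counts.items() if n <= threshold)


# Made-up mailbox: one set of link domains per email
mailbox = [
    {"example.com", "test.org"},
    {"example.com"},
    {"example.com", "example.net"},
]

flagged = flag_rare_domains(mailbox, threshold=1)
print(flagged)  # ['example.net', 'test.org']
```

Here example.com appears in three emails and is left alone, while the two domains seen only once are flagged. Raising the threshold to 3 would flag all three domains.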

That wraps it up for this post. As always, if you have questions or suggested features, please let me know below.
