Hasty Scripts: Capture Google Activity Log

In the last post, we discussed a number of valuable Google account artifacts that are not necessarily captured in a Google Takeout. One of these, the Google Activity Log, is a unified timeline containing cards for various Google services associated with the account. Chief among these artifacts were Chrome internet history, Android application usage, and audio recordings of the user interacting with Google Assistant or Google Home products. While this information can be queried and filtered online with access to the account, what can we do to collect it offline for review at a later date? Here’s one solution.

The Google Activity log is a dynamic page that loads additional records as the user scrolls down the page. Theoretically, you could scroll down indefinitely to load as many records as system resources allow and then capture that content with screenshots or by copying the text to a file. This process would take an inordinate amount of time and would make analyzing the data in any meaningful way difficult.

Why not Python, then? Let’s take a look at an example of how we might go about developing a script to capture this or other dynamically loading webpages. The script we will be discussing can be downloaded from the Go-Capture GitHub repository. Be aware that this script can break and will require updates with any changes to Google’s Activity timeline HTML objects or URL.

Setting up the Environment

This is a Python 3.X script and makes use of three third-party modules to drive webpage interaction, parse webpages, and create progress bars. The first of these, selenium, is a favorite and can be used with a driver to automate webpage navigation. BeautifulSoup4 is designed to parse HTML data, and we will use it to interpret the HTML page source after generating the Google Activity records. Lastly, tqdm, which we’ve discussed before, is used to create progress bars.

All three of these third-party modules can be installed individually with pip. The GitHub repository also contains a requirements.txt file which can be used to install all three modules at once with the following command:

pip install -r requirements.txt
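
For reference, the code excerpts later in this post rely on imports along these lines (the script in the GitHub repository is the authoritative source):

import argparse
import csv
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from tqdm import tqdm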

Once the modules are installed, you will need to download the Google Chrome webdriver and place it in your PATH. You can add directories to your PATH variable in Windows by navigating to Control Panel\System and Security\System, clicking Advanced system settings, and clicking the “Environment Variables…” button. Alternatively, you can copy the Chrome webdriver to a directory already in your PATH variable.
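
If you would rather not modify your PATH at all, selenium also accepts an explicit path to the executable when the driver is created. A quick sketch, using a hypothetical location for the downloaded driver:

driver = webdriver.Chrome(executable_path=r"C:\WebDrivers\chromedriver.exe")  # hypothetical path to the driver

Note that go_capture.py calls webdriver.Chrome() without arguments, so taking this route means editing that line in the script. With the environment set up, we are ready to look at some code.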

Scrolling a Webpage and Extracting Data

A useful way to get oriented with a script is to look at the arguments it accepts. This script takes up to five arguments: the username, the password, the number of times to scroll to the bottom of the page, the output file path for the resulting CSV, and, optionally, the number of seconds to wait between each scroll. Be aware that when supplying the password argument, you may need to wrap it in double quotes, as some characters caused issues during testing.

python go_capture.py -h
usage: go_capture.py [-h] -u USERNAME -p PASSWORD -c COUNT -o OUTPUT
                     [-s SECONDS]

Capture Google My Activity..

optional arguments:
  -h, --help            show this help message and exit
  -u USERNAME, --username USERNAME
                        Google account user: username@gmail.com
  -p PASSWORD, --password PASSWORD
                        Google account password
  -c COUNT, --count COUNT
                        Number of times to scroll to the end of the page
  -o OUTPUT, --output OUTPUT
                        Output CSV
  -s SECONDS, --seconds SECONDS
                        Number of seconds to wait before pressing END (default
                        1 second)
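
For context, the argument parsing that produces this help output likely looks something along the lines of the sketch below; the option names mirror the usage text, while the exact required flags and default values are assumptions.

if __name__ == '__main__':
	parser = argparse.ArgumentParser(description="Capture Google My Activity.")
	parser.add_argument("-u", "--username", required=True, help="Google account user: username@gmail.com")
	parser.add_argument("-p", "--password", required=True, help="Google account password")
	parser.add_argument("-c", "--count", required=True, type=int, help="Number of times to scroll to the end of the page")
	parser.add_argument("-o", "--output", required=True, help="Output CSV")
	parser.add_argument("-s", "--seconds", type=int, default=1, help="Number of seconds to wait before pressing END (default 1 second)")
	args = parser.parse_args()
	main(args.username, args.password, args.count, args.output, args.seconds)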

First things first, we must add some code to log in to a user’s account. The URL we will be navigating to with our Chrome driver is a Google login page that, upon successful authentication, redirects to the Google My Activity timeline. After creating the webdriver object, we use the get() method to navigate to the desired URL. On that page, we find the username and password fields and, with the send_keys() method, enter the supplied username and password, respectively.

def main(user, pwd, cnt, output, seconds):
	login_url = "https://accounts.google.com/ServiceLogin?continue=https://myactivity.google.com/myactivity&hl=en"
	driver = webdriver.Chrome()
	print("[+] Navigating to Google Login followed by the 'My Activity' page")
	driver.get(login_url)

	driver.find_element_by_id("identifierId").send_keys(user)
	driver.find_element_by_id("identifierNext").click()
	time.sleep(1.5)
	driver.find_element_by_name("password").send_keys(pwd)
	driver.find_element_by_id("passwordNext").click()

	input("[+] Manually handle any two-factor requirements. Press Enter when ready to continue...")
	scroll_activity(driver, cnt, output, seconds)

At this point, the script pauses while the user manually resolves any two-factor requirements associated with the account. You can interact with the webdriver browser like you would with any other browser. Once you have successfully authenticated, or if the account did not have two-factor turned on, press Enter in the console. At this point, we are ready to start scrolling to generate records.

def scroll_activity(driver, cnt, output, seconds):
	driver.get("https://myactivity.google.com/item")
	driver.find_element_by_tag_name('body').click()

	print("[*] Pressing the END key {} times to scroll through the page".format(cnt))
	for i in tqdm(range(cnt)):
		driver.find_element_by_tag_name('body').send_keys(Keys.END)
		time.sleep(seconds)

The scroll_activity() function navigates to the item view of the Google Activity timeline and clicks the body element. After printing an update to the console, we create a tqdm progress bar and loop to press the END key to scroll to the bottom of the webpage. After each press, the script sleeps for a specified amount of time before repeating the process. Given that the page loads dynamically, it can take a few seconds for new records to appear, hence the need to wait between each press of the END key.

	print("[+] Reading page source, this may take awhile...")
	html_source = driver.page_source
	soup = BeautifulSoup(html_source, 'html.parser')
	parse_data(soup, output)

After the script has finished pressing the END key, it reads in the page source, parses it with BeautifulSoup, and sends that object to the parse_data() function. As a side note, take care when setting the count argument to a very high number. During testing, the webdriver crashed when the page became too large. You may find that it cannot capture all of the user’s activity log in one go due to system limitations.

If that is the case, the script could be modified to accommodate incremental captures. For instance, you could remove the driver.get("https://myactivity.google.com/item") line and, once authenticated, browse to the Item view manually (this is required) and use the Google Activity date filter to set the page to a specific time frame before pressing Enter in the console to continue script execution, as sketched below.
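
A minimal sketch of that modification, assuming you apply the date filter by hand in the driver window before pressing Enter:

def scroll_activity(driver, cnt, output, seconds):
	# driver.get("https://myactivity.google.com/item")  # removed to allow incremental captures
	input("[+] Browse to the Item view, apply a date filter, then press Enter to continue...")
	driver.find_element_by_tag_name('body').click()
	# ...the END key loop and page source parsing remain unchanged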

Next, we must parse the large HTML structure we have generated with BeautifulSoup. The parse_data() function starts by setting up some variables and, importantly, selecting the hist-date-block element within the page source. This element contains each day’s worth of cards on the page and all of their records. We use a for loop to iterate through each day in the cards and store the text of the date value as a key in the parsed_data dictionary.

def parse_data(html_soup, output):
	num_days = 0
	num_events = 0
	parsed_data = {}
	cards = html_soup.select("#main-content > div > div > hist-date-block")
	for day in cards:
		num_days += 1
		date = day.select("div")[0].text.split("\n\n\n\n\n")[1].strip()
		parsed_data[date] = []

We use another for loop to iterate through each individual card (or record) in the given day. We do this using the BeautifulSoup select() function again to specify a path in the source to the card’s content. For each card, we parse a number of different values: the card’s raw text, the Google service referenced, any action, the card title, subtitle, and date. Different services have slightly different cards, but they can all be parsed more or less the same way. The raw text, which we include in the output, acts as a failsafe in case we fail to parse a relevant value from the card.

Parsing each card may look a little complicated, but it boils down to the same process each time. Essentially, we use the BeautifulSoup find() or find_all() methods to locate specific elements with specific attributes. Once these are found, we extract their text, remove any unnecessary newline characters, and perform any other necessary string formatting. You may be confused by the join() statement, but this is a nifty trick to ensure there is only one space between each word in the string.
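
For example, splitting on whitespace and rejoining with a single space collapses any run of spaces or newlines into a single separator:

messy = "Searched  for\n\n   python  selenium"
print(" ".join(messy.split()))  # prints: Searched for python selenium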

If you were to leverage the code for this script to parse a different, but similar, webpage, this function is likely where you would spend most of your development time. During development, I used the built-in Python debugger, pdb, to locate the relevant HTML components and determine their attributes with the attrs attribute.
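
To do the same, you could temporarily drop a breakpoint inside the card loop shown below and inspect each element at the (Pdb) prompt; the exact expressions here are illustrative:

import pdb; pdb.set_trace()  # temporary breakpoint placed inside the card loop
# (Pdb) card.find('div').attrs                                  # attributes of the first div, as a dict
# (Pdb) [d.attrs.get('class') for d in card.find_all('div')]    # survey the classes used throughout the card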

		for card in day.select("div > div > div > md-card > hist-display-item > md-card-content"):
			num_events += 1
			tmp_dict = {}

			temp = [x.replace("\n", "").strip() for x in card.text.split('\n\n\n\n')]
			raw_string = " - ".join([" ".join(x.split()) for x in temp if x.strip() != "" and x.strip() != "Delete" and x.strip() != "Details"])
			service = card.find('div', attrs={'class': ['fp-display-item-title']}).text.replace("\n", "").strip()
			action = " ".join(card.find("h4", attrs={"class": ["fp-display-block-title t08"]}).text.replace("\n", "").strip().split())
			title = " ".join(card.find("div", attrs={'class': ['layout-column']}).find("span").text.replace("\n", "").strip().split())
			sub_title_elements = card.find_all("div", attrs={"ng-repeat": "subTitle in ::item.getSubTitleList()"})
			sub_title = " ".join([x.text.replace("\n", "").strip() for x in sub_title_elements])
			dt_time = card.find("div", attrs={"ng-if": "::!detailsItem", "class": ["fp-display-block-details"]}).text.replace("\n", "").split(u"\u2022")[0].strip()

With the relevant values extracted, we append them to a temporary dictionary, which is ultimately added to the list we created for that date earlier in the function. Once all the cards have been processed, we print a few status messages to the console and write the output to a CSV.

			tmp_dict["Raw String"] = raw_string
			tmp_dict["Date"] = date

			if service != "":
				tmp_dict["Service"] = service
			if action.strip() != "":
				tmp_dict["Action"] = action
			if title.strip() != "":
				tmp_dict["Card Title"] = title
			if sub_title.strip() != "":
				tmp_dict["Card Subtitle"] = sub_title
			if dt_time.strip() != "":
				tmp_dict["Time"] = dt_time

			parsed_data[date].append(tmp_dict)

	print("[+] Processed {} days' worth of events from Google's MyActivity".format(num_days))
	print("[+] Processed {} events from Google's MyActivity".format(num_events))

	write_csv(parsed_data, output)

The write_csv() function is similar to those we’ve discussed in the past. It uses the csv.DictWriter() class to write the dictionaries containing card data to a CSV file. Once this process completes, the script exits successfully and the user now has an offline copy of these records stored in a spreadsheet. I think we can all agree this is far more suitable for analysis than a series of screenshots or a text file.
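
The repository holds the authoritative version, but a minimal write_csv() along these lines captures the idea; the field names simply mirror the dictionary keys set in parse_data():

def write_csv(parsed_data, output):
	headers = ["Date", "Time", "Service", "Action", "Card Title", "Card Subtitle", "Raw String"]
	with open(output, "w", newline="", encoding="utf-8") as csvfile:
		writer = csv.DictWriter(csvfile, fieldnames=headers)
		writer.writeheader()
		for date in parsed_data:
			for card in parsed_data[date]:
				writer.writerow(card)  # missing keys are written as empty cells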

Check out the code and start incorporating this artifact into your analysis! You’d be remiss if you failed to do so, as it can contain some really useful information. You may also use a similar tactic to capture other dynamically loading webpages, such as the Google Drive activity log discussed in the previous post. Please let me know if you have any feature requests or thoughts in the comments below.
