Hasty Scripts: Box Edit Log Parser

How often are you tasked with reviewing large data sets in less than ideal (or even terrible) formats? Everyone has likely had to review logs containing many tens of thousands of lines at one point or another. We may not have time or budget (or patience) to review every line in a text editor. What do you do?

It would be preferable to process each line, break it up into discrete elements, and store the data in a spreadsheet. We can all flex our Excel muscles at that point and review a seemingly daunting data set in short order. Great, but we need to convert it to that format first. Let’s develop an example that processes logs, extracts relevant data elements, and creates a spreadsheet for review. Even accounting for the time spent developing the script, this approach will massively decrease the overall time needed to understand and analyze the data.

This post is an extension of the previous posts on Box. We will create a Python parser for the Box Edit logs previously discussed. This is an example of how to create a parser for log files in thirty minutes or less. It may not work on other versions of Box Edit, but that’s not the point; it serves as an example of how to go about converting this particular data set. The unabridged code for this script can be found on the Box Edit GitHub repository.

Box Edit Log Example

For review, the Box Edit log(s) can be found in the following directory: C:\Users\[User]\AppData\Local\Box\Box Edit\Logs. This log keeps track of, among other things, when the application was launched, what files were edited, and whether any changes were made to them and uploaded successfully. The screenshot below is an example of the log showing three separate files being opened with Box Edit. Each of the numbered events is a line we want to capture in our script. They tell us when a file was opened, whether edits were made, and when the file was saved and uploaded to Box.

[Figure: Box Edit log excerpt showing three separate files being opened, edited, and uploaded]

Notice how each line in this log has a similar structure: four groupings of square brackets, followed by a dash and a message string. If we wanted to, we could create a script to put each of these data elements (the four bracketed groups and the message string) in their own columns so we can sort through the data. In some cases that might be the preferred means of analyzing the data. However, for this particular case, I think it would be more beneficial to extract just the relevant lines from the log to create a summary of the data set.

Let’s say we want to review all Box Edit logs and determine what files were edited and if and when they were uploaded. In this case, we want to extract all lines that contain either “Launching Editor for file” or “File uploaded successfully”. With just these lines, we will be able to summarize the Box Edit activity for our scenario. In essence, we are performing both filtering and processing steps at once with Python. Let’s take a look at how we will do that in the section below.

Bedit_parser.py – process_files Function

Let’s skip the boring parts and get right to processing the log file(s). This occurs in the process_files() function, which takes the list of logs we need to process and the input directory as arguments. The data dictionary, on line 3, will contain our processed data elements from the lines of interest.

We start out by creating a loop on line 4 below to iterate through each log found in the input directory. For each log, we are first going to open it (on line 5) and give it the alias log_file. When we refer to the variable log_file from now on we will be referring to the currently open log file object. On line 6, we create another loop to iterate through each line in the log file. Now, we need to insert the necessary logic to capture our two lines of interest.

def process_files(logs, in_dir):
    [snip]
    data = {}
    for log in logs:
        with open(os.path.join(in_dir, log)) as log_file:
            for line in log_file:
                if 'Launching Editor for file' in line:
                    file_id = line.split('file:')[1].split(' ')[1]
                    file_ts = datetime.strptime(line.split(']')[0].replace('[', '').split(',')[0], '%Y-%m-%d %H:%M:%S')
                    if file_id not in data.keys():
                        data[file_id] = []
                    data[file_id].append([file_ts.strftime('%m/%d/%Y %H:%M:%S'), os.path.join(in_dir, log)])
                if 'File uploaded successfully' in line:
                    file_id = line.split('-')[3].split(' ')[1].replace('[', '').replace(']', '')
                    if file_id not in data.keys():
                        print '[*] Check {}, file ID {} has an upload event but no edit event'.format(log, file_id)
                        continue
                    file_ts = datetime.strptime(line.split(']')[0].replace('[', '').split(',')[0], '%Y-%m-%d %H:%M:%S')
                    count = len(data[file_id]) - 1
                    data[file_id][count].insert(1, file_ts.strftime('%m/%d/%Y %H:%M:%S'))
    return data

We can do this by simply checking whether a substring from our lines of interest is present, as we do on lines 7 and 13. If there is a hit, then we need to extract the Box file ID and timestamp for each entry. We do so by using the standard string object methods split() and replace(). The split() method will split any string into a list of substrings based on a given delimiter. On line 8, we split the line in two using the string “file:”, which should only occur once. The “1” in square brackets selects the second of the two substrings in the resulting list (recall that Python lists begin at index 0). Immediately after that, we split again on spaces and, again, take the substring at index 1 of the resulting list. The example below illustrates what each method is doing.

[Figure: example of the chained split() calls isolating the file ID]
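In code form, a minimal sketch of the same chain of split() calls looks like the following. The sample string is a hypothetical stand-in, not a verbatim Box Edit log line; it is shaped only to exercise the same method calls:

# Hypothetical line for illustration only -- not a verbatim Box Edit log entry.
line = 'Launching Editor for file: 123456 some trailing text'

# split('file:') yields two substrings; index 1 is everything after 'file:'.
after_label = line.split('file:')[1]   # ' 123456 some trailing text'

# Splitting on spaces yields ['', '123456', 'some', ...]; index 1 is the file ID.
file_id = after_label.split(' ')[1]
print file_id                          # 123456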

On line 9, we employ similar tactics to isolate the date value, rather than the file ID, from the same line. We split on the “]” bracket and take the substring at index 0 of the resulting list. We use the replace() method to remove the “[” bracket and split on the comma to remove the millisecond value. The figure below demonstrates the purpose of each call. Notice all of this is wrapped in a call to the strptime() method from the datetime library. This allows us to store the date value as a datetime object rather than a string, letting us perform date math and formatting operations on it. You can read more about this library, strptime(), and the datetime directives in the Python documentation.

[Figure: example of the split() and replace() calls isolating the timestamp]
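Again as a rough sketch, with a made-up bracketed prefix standing in for the real log line (only the leading timestamp group matters here):

from datetime import datetime

# Hypothetical line for illustration only.
line = '[2016-03-24 09:15:02,341][1234][INFO] - example message'

# split(']')[0]      -> '[2016-03-24 09:15:02,341'
# .replace('[', '')  -> '2016-03-24 09:15:02,341'
# .split(',')[0]     -> '2016-03-24 09:15:02' (drops the milliseconds)
raw_ts = line.split(']')[0].replace('[', '').split(',')[0]

# strptime() converts the string into a datetime object we can format freely.
file_ts = datetime.strptime(raw_ts, '%Y-%m-%d %H:%M:%S')
print file_ts.strftime('%m/%d/%Y %H:%M:%S')  # 03/24/2016 09:15:02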

On line 10, we check whether the file ID we just extracted is already in our data dictionary. If it isn’t, we initialize the key with an empty list as its value. Finally, on line 12, we append a list containing the file edit time and the log pathname to the file ID’s list. As a log may contain more than one edit entry for the same file ID, this approach lets us keep appending new lists to that same file ID without affecting the previous entries.

Moving on to the next line we wish to extract: the “file upload” line is handled similarly to the previous one, with some minor differences. One difference is a check to see whether the file ID is missing from the data dictionary; if it is, we skip processing the current line by invoking the continue statement. Because a “file upload” line should always be preceded by a “file edit” line, this should never occur. Additionally, we do not explicitly track which “file edit” instance a given “file upload” corresponds to. Therefore, on line 19, we use the built-in len() function to find the index of the last list appended for that file ID, on the assumption that this “file upload” line must be associated with the most recently appended list.

On line 20, we insert the upload timestamp at index 1 of that last appended list. This places the upload time after the edit time but before the log pathname, which avoids headaches when we go to write the data.
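As a quick sketch with made-up values, here is what that bookkeeping looks like for a single file ID after one edit event and one upload event:

# Hypothetical values for illustration only.
data = {'123456': [['03/24/2016 09:15:02', 'C:\\Example\\bedit_1.log']]}

# Index of the most recently appended list for this file ID.
count = len(data['123456']) - 1        # 0

# Insert the upload timestamp at index 1: after the edit time,
# before the log pathname.
data['123456'][count].insert(1, '03/24/2016 09:17:45')
print data['123456'][count]
# ['03/24/2016 09:15:02', '03/24/2016 09:17:45', 'C:\\Example\\bedit_1.log']

Speaking of writing the data, let’s take a look at that function now.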

Bedit_parser.py – csv_writer Function

This function takes the dictionary of processed file IDs we created in the previous function and the output directory as its inputs. It is much simpler than the previous one. We will make use of the built-in csv library, specifically the csv.writer() function. On line 3, we create a file object in write-binary (“wb”) mode. In Python 2 on Windows, opening the CSV file in “w” mode would result in blank rows being inserted between each row we write.

def csv_writer(data_out, output_dir):
    [snip]
    with open(os.path.join(output_dir, 'Box_Edit_Logs.csv'), 'wb') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(['File ID', 'Edit Time (Local)', 'Upload Time (Local)', 'Log File', 'Notes'])
        for id in data_out.keys():
            for entry in data_out[id]:
                note = ''
                if len(entry) == 2:
                    # No Upload timestamp found
                    entry.insert(1, 'N/A')
                if len(entry) > 3:
                    # Multiple uploads detected
                    multi_upload = entry[2:-1]
                    entry = entry[:2] + entry[-1:]
                    note = 'Additional Uploads Detected at: '
                    note += ','.join(multi_upload)
                entry.insert(0, id)
                entry.insert(4, note)
                writer.writerow(entry)

After we pass the file object to csv.writer() on line 4, we write our column headers using the writerow() method. The writerow() method takes a list and writes each element to its own column. Using this method, we will iterate through our output data and write each list of extracted data to its own row.
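As a minimal, self-contained sketch of that csv library usage (the filename and values here are hypothetical):

import csv

# Python 2: open in 'wb' mode to avoid blank rows between entries on Windows.
with open('example.csv', 'wb') as csv_file:
    writer = csv.writer(csv_file)
    # writerow() takes a list and writes each element to its own column.
    writer.writerow(['File ID', 'Edit Time (Local)', 'Upload Time (Local)'])
    writer.writerow(['123456', '03/24/2016 09:15:02', '03/24/2016 09:17:45'])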

On lines 6 and 7, we create loops to access the innermost lists (recall our data structure is a dictionary containing lists of lists). If an upload time was not found, the interior list will only have two elements; in that case, we insert the string “N/A” at index 1 to reflect that in the resulting output. If the list has a length greater than three, then we have encountered multiple upload events for the same file. We handle this, on line 14, by first extracting the extra timestamps into their own separate list via list slicing. Next, we recreate the entry variable with only the edit time, last upload time, and log path items. Lastly, on lines 16 and 17, we add a note about the additional upload events, separating each additional upload timestamp with a comma.
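To make the slicing concrete, here is a sketch with hypothetical values; recall that each new upload timestamp is inserted at index 1, so the most recent upload sits right after the edit time:

# Hypothetical entry with multiple upload timestamps, for illustration only.
entry = ['03/24/2016 09:15:02',    # edit time
         '03/24/2016 09:30:33',    # most recent upload
         '03/24/2016 09:21:10',    # earlier upload
         '03/24/2016 09:17:45',    # earliest upload
         'C:\\Example\\bedit_1.log']

multi_upload = entry[2:-1]         # the earlier uploads
entry = entry[:2] + entry[-1:]     # [edit time, last upload, log path]
note = 'Additional Uploads Detected at: ' + ','.join(multi_upload)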

Regardless of the length of the list, on line 18, we prepend the file ID to the list. Similarly, on line 19, we insert the note as the final column of our entry. Finally, we write each of these lists to the output file. Now let’s take a look at what we’ve created.

[Figure: example CSV output produced by the script]

We can easily review which files were edited and whether they were uploaded successfully to Box, including multiple upload instances. At this point, we could filter on specific file IDs or review a relevant time period. It should be apparent that this will give us answers far more quickly than reviewing those same text files by hand. One thing we might want to add is a duration column calculating the delta between the edit start time and the upload time. This could be done in Python, or in Excel by subtracting the two dates.
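As a sketch of the Python approach (the timestamps here are hypothetical), subtracting two datetime objects yields a timedelta representing the duration:

from datetime import datetime

fmt = '%m/%d/%Y %H:%M:%S'
edit_ts = datetime.strptime('03/24/2016 09:15:02', fmt)
upload_ts = datetime.strptime('03/24/2016 09:17:45', fmt)

# Subtracting datetimes produces a timedelta.
duration = upload_ts - edit_ts
print duration   # 0:02:43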

There are other entries in the Box Edit logs, such as when a previously edited file is removed from the Box Edit Documents cache folder. As an added challenge, add functionality to parse this line (or other lines you find relevant for your investigation). The process is little different from what we have already developed. Please feel free to modify the code to suit your needs and chime in if you have any questions.
