Web Scraping Threat Intel with Python

Last week I posted a poll on Twitter asking my followers what they wanted to see on Jenius. Of the four options provided, scripts were the most popular choice. That left the question of what type of script would be most useful to my followers and readers. Because I have a substantial interest in threat intelligence and information security in general, I decided to write a web scraping script for threat intelligence.

Web scraping is the automated extraction of data from websites. It makes collecting large amounts of data from a site almost painless, mainly because a repeatable script replaces manual copying and pasting. In cybersecurity, there is always an opportunity to automate tedious yet simple tasks, such as gathering threat intelligence data to parse.

The Script

To make the script, I needed a website to scrape. I chose MITRE ATT&CK, a knowledge base of cyber adversary tactics, techniques, and procedures. For this script, I decided to focus on scraping the threat actor group names and descriptions listed on the website. In addition to pulling the data from MITRE ATT&CK, I also want to save it in a JSON file for other uses.

The Code

I’m using BeautifulSoup to parse the HTML data. If you are unfamiliar with BeautifulSoup, I suggest reading the guide here.

The first step is to install and import the libraries. You can install any missing libraries with the ‘pip’ command.
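The original import lines are not reproduced in this post, so the following is only a sketch of what they might look like. It assumes the standard library's urllib is used to open the URL (the requests library would work just as well) and that BeautifulSoup comes from the beautifulsoup4 package.

```python
# Install the one third-party dependency first if it is missing:
#   pip install beautifulsoup4

import json                         # standard library: writes the JSON file
from urllib.request import urlopen  # standard library: opens the URL

from bs4 import BeautifulSoup       # third-party: parses the HTML
```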

Next, I need to create a function (get_group) that opens the URL and grabs the HTML data. I also create a list (groupArr) that will store the data.

Then, I define the variable soup by calling the function I just created on the URL I want to scrape.
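A minimal sketch of those two steps could look like the block below. The names get_group and groupArr come from the text above; the function body, the timeout, and the GROUPS_URL constant are my assumptions, since the post does not show the original code or state the exact URL.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Assumed address of the groups listing; the post does not state the URL.
GROUPS_URL = "https://attack.mitre.org/groups/"

# List that will hold one entry per threat actor group.
groupArr = []


def get_group(url):
    """Open the URL, read the raw HTML, and return it parsed by BeautifulSoup."""
    with urlopen(url, timeout=30) as response:
        html = response.read()
    # "html.parser" ships with Python, so no extra parser needs installing.
    return BeautifulSoup(html, "html.parser")
```

With that in place, the soup variable would be created with soup = get_group(GROUPS_URL). The call is left out of the block because it performs a live network request.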

Now, we take a look at the URL. Ultimately, I want to grab all the data within the Group table on this page:

To do that, I need to inspect the HTML behind the page. I’m using Chrome, which allows me to right-click and select “inspect.” I can now highlight the specific element I want to inspect.

I want to capture the name and description data from each row of the table. Looking at each row, it appears the word “group” is present within both the name and description cells. I can use a for loop and BeautifulSoup’s .select() method to search for any row that contains the word “group.”
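To illustrate the idea without hitting the live site, here is a sketch that runs the loop over a small stand-in table. The sample HTML, the placeholder group names, and the selector are all assumptions: I am guessing that “group” shows up in each link's href, and the real page markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the groups table (hypothetical, not the live HTML).
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Description</th></tr>
  <tr>
    <td><a href="/groups/EXAMPLE1">Example Group A</a></td>
    <td><p>Placeholder description A.</p></td>
  </tr>
  <tr>
    <td><a href="/groups/EXAMPLE2">Example Group B</a></td>
    <td><p>Placeholder description B.</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

group_rows = []
for row in soup.select("tr"):
    # Keep the row only if it holds a link whose href contains "group";
    # this skips the header row, which has no link at all.
    if row.select_one('a[href*="group"]'):
        group_rows.append(row)

print(len(group_rows), "matching rows")
```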

Once the rows containing “group” are found, I want to take the data from each row and put it in an object (groupObject). However, I need to specify what data I want to grab. Looking at the HTML again, you can see that the name of each threat actor group is within “a” (link) tags and the description is within “p” (paragraph) tags. When creating my object, I grab the text contained within those tags. To make sure the data looks accurate, I also print out the object.
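The object-building step might be sketched like this, again against a hypothetical stand-in table rather than the live page. The names groupObject and groupArr come from the text; the dictionary keys "name" and "description" are my own choice, and the sketch assumes the first link in a row is the group name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in rows; the live table may have more columns.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Description</th></tr>
  <tr>
    <td><a href="/groups/EXAMPLE1">Example Group A</a></td>
    <td><p>Placeholder description A.</p></td>
  </tr>
  <tr>
    <td><a href="/groups/EXAMPLE2">Example Group B</a></td>
    <td><p>Placeholder description B.</p></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
groupArr = []

for row in soup.select("tr"):
    link = row.find("a")       # first link in the row: the group name
    paragraph = row.find("p")  # first paragraph in the row: the description
    if link is None or paragraph is None:
        continue               # header row or a row missing either piece
    groupObject = {
        "name": link.get_text(strip=True),
        "description": paragraph.get_text(strip=True),
    }
    print(groupObject)         # quick visual check of each entry
    groupArr.append(groupObject)
```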

Finally, once every groupObject has been appended to the array I created earlier (groupArr), I write the array to a JSON file called groupData.json.
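Writing the file takes only a few lines with the standard json module. The sketch below uses a single placeholder entry in place of the scraped data; the indent setting is my own preference for readability, not something the original script is known to use.

```python
import json

# Placeholder data standing in for the scraped groups.
groupArr = [
    {"name": "Example Group A", "description": "Placeholder description A."},
]

# Write the whole list to groupData.json as a JSON array of objects.
with open("groupData.json", "w", encoding="utf-8") as json_file:
    json.dump(groupArr, json_file, indent=2)
```

Loading the file back with json.load should return the same list, which makes the output easy to reuse in other scripts.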

That’s it! You can view the script here on GitHub.

References:

HackerNoon – Building a Web Scraper from Start to Finish

YouTube – SAF Business Analytics

Stack Overflow