In Part 1, we explored the UK Government’s Gender Pay Gap Service and broke down what a scraper might look like for extracting more than 12 thousand businesses’ results from the site. We finished Part 1 by diving into the Chrome Developer Tools to find the HTML tags our scraper would need to access the overview information on each company.
In Part 2, we’ll find the HTML tags our scraper will need to extract data from each individual pay gap report, and finish by building a scraper to extract all 12 thousand results using Apify. In case you missed it, jump back to Part 1 for a full recap.
Extracting Information from the Pay Gap Report
As we did with the overview page, we’ll need to dive back into the Chrome Developer tools to find the HTML tags for our scraper. This data is structured slightly differently, as we now no longer have a list of items to deal with but a single page.
On this page, all the information we’re looking for lives under a tag called “employer-report-metadata”, which gives our scraper a landmark to look for. In the page design, three tags separate the Snapshot Date, Employer Size and Person Responsible information, with the classes “metadata-text-label” and “metadata-text-value” denoting the Label and the Result.
Further down, we see that “metadata-text-value” includes two tags which separate the contact’s name from their title. This is helpful, as it allows us to split the name and title out into separate columns.
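Before wiring these tags into a scraper, it can help to check the mapping offline. The following is a minimal sketch in Python (standard library only), not the Apify scraper itself. The HTML fragment is a hypothetical reconstruction: the class names come from the page, but the exact tag types, the nesting, and all the sample values (dates, names, titles) are invented placeholders.

```python
from html.parser import HTMLParser

# Hypothetical fragment: only the class names are taken from the real page.
SAMPLE_HTML = """
<div class="employer-report-metadata">
  <div>
    <span class="metadata-text-label">Snapshot date</span>
    <span class="metadata-text-value">1 January 2000</span>
  </div>
  <div>
    <span class="metadata-text-label">Employer size</span>
    <span class="metadata-text-value">Example size band</span>
  </div>
  <div>
    <span class="metadata-text-label">Person responsible</span>
    <span class="metadata-text-value"><span>Jane Doe</span> <span>Example Title</span></span>
  </div>
</div>
"""

VOID_TAGS = {"br", "hr", "img", "input", "wbr"}  # tags that never get a closing tag


class MetadataParser(HTMLParser):
    """Collects (label, [value parts]) pairs from label/value elements."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # "label" or "value" while inside such an element
        self._depth = 0      # nesting depth inside the current field
        self._parts = []     # text chunks collected for the current field
        self._label = None   # most recent label seen

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1  # a nested tag inside a label or value
            return
        classes = (dict(attrs).get("class") or "").split()
        if "metadata-text-label" in classes:
            self._field, self._depth, self._parts = "label", 1, []
        elif "metadata-text-value" in classes:
            self._field, self._depth, self._parts = "value", 1, []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the label/value element itself just closed
            if self._field == "label":
                self._label = " ".join(self._parts)
            else:
                self.rows.append((self._label, self._parts))
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field is not None and text:
            self._parts.append(text)


def to_record(rows):
    """Flattens the pairs into one dict, splitting name and title into two columns."""
    record = {}
    for label, parts in rows:
        if label == "Person responsible" and len(parts) == 2:
            record["Person name"], record["Person title"] = parts
        else:
            record[label] = " ".join(parts)
    return record


parser = MetadataParser()
parser.feed(SAMPLE_HTML)
parser.close()
record = to_record(parser.rows)
print(record)
```

Because the two inner tags are collected as separate text parts, the name and title land in their own columns without any string splitting.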
This gives us most of what we need, but there is one more useful bit of information left on this page. Given that the company data we’re getting here is based on each company’s registered business name, we’ll want some way to link them to their trading name or website. On some of these listings, the business has provided a link to its website to explain the results in the report. Capturing this link is extremely helpful, as we can visit it later to extract direct contact details.
Just below the Snapshot Date, Employer Size and Person Responsible elements, there are three div tags that separate the links in this section. For those companies that wish to provide a link to their website, this is found in the third div down. Note that these divs have a style attribute, which we can use to separate them from other div tags on the page.
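As a rough illustration of that rule, here is a Python sketch (standard library only) that keeps the first link inside each div carrying a style attribute and returns the third one when it exists. The sample markup, the style value, and the URLs are all hypothetical placeholders. Whether a missing website shows up as an absent div or as a div without a link is an assumption the real page would need to confirm, so the sketch returns None in both cases.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real style value and link targets are not known here.
WITH_WEBSITE = """
<div><a href="/example/unrelated">Some other link</a></div>
<div style="margin-top: 1em"><a href="/example/report">Report</a></div>
<div style="margin-top: 1em"><a href="/example/compare">Compare</a></div>
<div style="margin-top: 1em"><a href="https://example.com/pay-gap">Website</a></div>
"""

WITHOUT_WEBSITE = """
<div style="margin-top: 1em"><a href="/example/report">Report</a></div>
<div style="margin-top: 1em"><a href="/example/compare">Compare</a></div>
"""


class StyledDivLinks(HTMLParser):
    """Records the first href found inside each div that has a style attribute."""

    def __init__(self):
        super().__init__()
        self.links = []       # one entry per styled div: an href, or None
        self._div_depth = 0   # greater than 0 while inside a styled div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1      # nested div inside a styled div
            elif "style" in attrs:
                self._div_depth = 1       # entering a new styled div
                self.links.append(None)
        elif tag == "a" and self._div_depth > 0 and self.links[-1] is None:
            self.links[-1] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1


def employer_website(html):
    """Returns the link in the third styled div, or None if there is none."""
    parser = StyledDivLinks()
    parser.feed(html)
    parser.close()
    return parser.links[2] if len(parser.links) >= 3 else None


print(employer_website(WITH_WEBSITE))
print(employer_website(WITHOUT_WEBSITE))
```

Filtering on the style attribute first is what keeps the unrelated link in the unstyled div out of the count.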
With our HTML tags all mapped out, the only thing left to do is build our scrapers.
Web Scraping with Apify
For those not familiar, Apify is a web scraping platform that allows you to build powerful web scrapers. Apify allows you to control a Headless Chrome Browser that will visit your target website and extract the information on the page. Using a Headless Chrome Browser provides a lot of power here as we can extract all the information that is displayed on a website, even dynamic data that you could only get by clicking a button, submitting a form, or waiting on a page. Apify provides a web scraper template to get started with that gives us all the functionality we need right out of the box.
If you recall from Part 1, we proposed building two scrapers. One scraper would click through the list of businesses, extracting the overview details and collecting the links to the individual pay gap reports. A second scraper would then visit each of these individual pay gap reports and pull out the information using the tags we mapped out above. Scraper 1, which we’ll call the “Overview Scraper”, would need the ability to click the “Next” button on each page to get a new page of results to add to the list. Scraper 2, or the “Report Scraper”, would then visit each of these report links and pull out the information before closing the page.
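The division of labour between the two scrapers can be sketched in a few lines of Python. This is only a model of the control flow, not Apify code: the "site" is an in-memory dictionary, and every URL, field name, and value in it is a made-up placeholder.

```python
from collections import deque

# Hypothetical stand-in for the results pages: each page lists report links
# and points at the next page (None on the last page).
FAKE_OVERVIEW_PAGES = {
    "overview/page-1": {"reports": ["report/a", "report/b"], "next": "overview/page-2"},
    "overview/page-2": {"reports": ["report/c"], "next": None},
}

# Hypothetical stand-in for the individual pay gap reports.
FAKE_REPORTS = {
    "report/a": {"employer_size": "placeholder A"},
    "report/b": {"employer_size": "placeholder B"},
    "report/c": {"employer_size": "placeholder C"},
}


def run_overview_scraper(start_url, pages):
    """Follows "Next" from page to page and collects every report link."""
    queue = deque([start_url])
    seen = set()
    report_links = []
    while queue:
        url = queue.popleft()
        if url in seen:          # never scrape the same page twice
            continue
        seen.add(url)
        page = pages[url]
        report_links.extend(page["reports"])
        if page["next"] is not None:
            queue.append(page["next"])   # the equivalent of clicking "Next"
    return report_links


def run_report_scraper(report_links, reports):
    """Visits each report link and returns one row of data per report."""
    return [{"url": url, **reports[url]} for url in report_links]


links = run_overview_scraper("overview/page-1", FAKE_OVERVIEW_PAGES)
rows = run_report_scraper(links, FAKE_REPORTS)
print(links)
print(rows)
```

Keeping the two stages separate means the list of report links can be saved and inspected before the slower report-by-report pass begins.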
Building the Overview Scraper in Apify
For our first scraper, we’ll want to start at page one of the results and have it click “Next” through each subsequent page. Apify gives us a Link Selector option where we can tell it which links to click to add pages to the scraping queue. This option takes a CSS selector, which is a bit of code that tells Apify exactly which links to collect. Remember, the scraper can’t see the screen; it can only see the HTML and CSS we found in the developer tools.