Web scrapers come in many different forms, from simple browser plugins to more robust software applications. Depending on the web scraper you’re using, you might or might not be able to scrape multiple pages of data in a single run.

Today, we will review how to use a free web scraper to scrape multiple pages of data, covering pages with two different kinds of navigation.

For this, we will use ParseHub, a free and powerful web scraper that can extract data from any website.

Web Scraping with ParseHub

If you have never used ParseHub before, do not fret. It is actually quite easy to use while still being incredibly powerful.

In basic terms, ParseHub works by loading the website you’d like to scrape and letting you click on the specific data you want to extract.

Taking it a step further, you can also instruct ParseHub to interact with or click on specific elements of a page in order to browse to other pages with more data. That means you can make ParseHub click through multiple pages.

Read more: How to use ParseHub to scrape data from any website into an Excel spreadsheet

Scraping Multiple Pages on a Website

A website’s pagination (or the lack thereof) can take many different forms. Let’s break down how to deal with each of these scenarios while scraping data.

Clicking on the “Next Page” Button

This is probably the most common scenario you will find when scraping multiple pages of data. Here’s how to deal with it:

  1. In ParseHub, click on the PLUS (+) sign next to your page selection and choose the Select command.
  2. Using the Select command, click on the “Next Page” link (usually at the bottom of the page you’re scraping). Rename your new selection to NextPage.
  3. Expand your NextPage selection by using the icon next to it and delete both Extract commands under it.
  4. Using the PLUS (+) sign next to your NextPage selection, choose the Click command.
  5. A pop-up will appear asking you if this is a “Next Page” link. Click on “Yes” and enter the number of times you’d like to repeat the process of clicking on this button. (If you want to scrape 5 pages of data in total, you’d enter 4 repeats.)
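
The repeat count in step 5 is the number of pages minus one, because the first page is already loaded before any click happens. Here is a minimal Python sketch of that arithmetic. The function names `repeats_for` and `pages_visited` are hypothetical and are not part of ParseHub.

```python
def repeats_for(total_pages: int) -> int:
    """Number of "Next Page" clicks needed to cover total_pages pages."""
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    # The first page is already loaded, so it needs no click.
    return total_pages - 1


def pages_visited(repeats: int) -> list[int]:
    """Simulate starting on page 1 and clicking "Next Page" `repeats` times."""
    page = 1
    visited = [page]
    for _ in range(repeats):
        page += 1  # one click moves to the following page
        visited.append(page)
    return visited


print(repeats_for(5))    # 4
print(pages_visited(4))  # [1, 2, 3, 4, 5]
```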

No “Next Page” Button

Sometimes, there might be no “Next Page” link for pagination. In these cases, there might just be links to specific page numbers, as in the image below.

Here’s how to navigate through these with ParseHub:

  1. In ParseHub, click on the PLUS (+) sign next to your page selection and click on the current page number (In this case, page 1). Rename your selection to CurrentPage.
  2. Click on the PLUS (+) sign next to the CurrentPage selection and add a Relative Select command.
  3. Using the Relative Select command, click on the current page number and then on the next page number. An arrow will appear to show the connection you’re creating. Rename this selection to NextPage.
  4. Now, use the PLUS (+) sign next to the NextPage selection to add a Click command.
  5. A pop-up will appear asking you if this is a “Next Page” link. Click on “Yes” and enter the number of times you’d like to repeat this process. (If you want to scrape 5 pages of data in total, you’d enter 4 repeats.)
  6. ParseHub will now load the next page of results. Scroll all the way down and check that the NextPage Relative Select you created is now selecting Page 3 rather than Page 2 again. If it is still selecting Page 2, click on Page 2 and then on Page 3 to train ParseHub accordingly.
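
The idea behind the relative selection in steps 2, 3 and 6 is that the next link is always found relative to the current page number, never as a fixed link. The Python sketch below illustrates only that idea. It is not how ParseHub works internally, and `next_page_label` and the sample labels are hypothetical.

```python
from typing import Optional


def next_page_label(page_labels: list[str], current: str) -> Optional[str]:
    """Return the label that follows `current`, or None on the last page."""
    index = page_labels.index(current)  # raises ValueError if not present
    if index + 1 < len(page_labels):
        return page_labels[index + 1]
    return None


labels = ["1", "2", "3", "4", "5"]
print(next_page_label(labels, "1"))  # 2
print(next_page_label(labels, "2"))  # 3 (not 2 again)
print(next_page_label(labels, "5"))  # None
```

A selection that always pointed at the fixed link “2” would loop on the second page, which is the situation step 6 asks you to check for.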

Other Methods of Scraping Multiple Pages

You might also be interested in scraping multiple pages by searching through a list of keywords or by loading a predetermined list of URLs.

These are tasks that ParseHub can easily tackle as well. Check out the Help Center for these guides.

Closing Thoughts

You now know how to scrape multiple pages’ worth of data from a website.

However, we know that websites come in many different shapes and forms. The methods highlighted in this article might not work for your specific project.

If that’s the case, reach out to us at hello(at)parsehub.com and we’ll be happy to assist you with your project.

Happy Scraping!

Web::Scraper - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions

The resulting structure would resemble this:

    {
      authors => [
        { fullname => $fullname, link => $uri },
        { fullname => $fullname, link => $uri },
      ]
    }

Web::Scraper is a web scraper toolkit, inspired by Ruby's equivalent Scrapi. It provides a DSL-ish interface for traversing HTML documents and returning a neatly arranged Perl data structure.

The scraper and process blocks provide a method to define what segments of a document to extract. It understands HTML and CSS Selectors as well as XPath expressions.

scraper

Creates a new Web::Scraper object by wrapping the DSL code that will be fired when the scrape method is called.

scrape

Retrieves the HTML from URI, HTTP::Response, HTML::Tree or text strings and creates a DOM object, then fires the callback scraper code to retrieve the data structure.

If you pass a URI or HTTP::Response object, Web::Scraper automatically guesses the encoding of the content by looking at the Content-Type headers and META tags. Otherwise, you need to decode the HTML to Unicode before passing it to the scrape method.

You can optionally pass the base URL when you pass the HTML content as a string instead of URI or HTTP::Response.

This way Web::Scraper can resolve the relative links found in the document.
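
To illustrate what resolving relative links against a base URL means, here is a small Python sketch using the standard library's `urljoin`. It is an analogy only, not Web::Scraper's own code, and the base URL and link values are made up for the example.

```python
from urllib.parse import urljoin

# Assumed base URL of the page the HTML string came from.
base_url = "http://example.com/authors/index.html"

# Relative links as they might appear in href attributes.
relative_links = ["smith.html", "/about", "../books/list.html"]

# Resolve each link against the base URL to get an absolute URL.
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
# http://example.com/authors/smith.html
# http://example.com/about
# http://example.com/books/list.html
```

Without a base URL, a link such as `smith.html` cannot be turned into a full address, which is why it has to be supplied when the HTML is passed as a plain string.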

process

process finds the elements in the HTML that match a CSS selector or XPath expression, then extracts their text or attributes into the result stash.

If the first argument begins with '//' or 'id(', it is treated as an XPath expression; otherwise it is treated as a CSS selector.
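
The dispatch rule described above can be sketched in a few lines of Python. This mirrors only the rule as stated in the text and is not taken from the Web::Scraper source. The function name `selector_kind` is hypothetical.

```python
def selector_kind(expression: str) -> str:
    """Classify an expression using the prefix rule described above."""
    # str.startswith accepts a tuple of prefixes to test.
    if expression.startswith(("//", "id(")):
        return "xpath"
    return "css"


print(selector_kind("//a[@href]"))      # xpath
print(selector_kind('id("main")/p'))    # xpath
print(selector_kind("ul.sites > li"))   # css
```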

process_first

process_first is the same as process but stops when the first matching result is found.

result

result lets you return, instead of the default value after processing, a single value specified by a key or a hash reference built from several keys.
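
The behavior described for result can be pictured with a Python dictionary standing in for the stash. This is a rough analogy under stated assumptions, not the module's implementation: `pick_result` is a hypothetical name, and returning the whole stash when no key is given is an assumption of this sketch.

```python
from typing import Any


def pick_result(stash: dict[str, Any], *keys: str) -> Any:
    """Return one value for a single key, or a sub-dict for several keys."""
    if not keys:
        return stash  # assumption: no keys means the default, the whole stash
    if len(keys) == 1:
        return stash[keys[0]]
    return {key: stash[key] for key in keys}


# Made-up stash contents for illustration.
stash = {"title": "Example", "link": "http://example.com/", "date": "2008-01-01"}

print(pick_result(stash, "title"))
print(pick_result(stash, "title", "link"))
```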

There are many examples in the eg/ dir packaged in this distribution. It is recommended to look through these.

Scrapers can be nested, which allows you to scrape data that has already been captured.

Filters are applied to the result after processing. They can be declared as anonymous subroutines or as class names.

Filters can be stacked.

You can find more about filters in the Web::Scraper::Filter documentation.
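
Stacking filters means that each filter receives the output of the previous one. The Python sketch below shows that chaining with plain functions. It is an illustration of the concept, not the Web::Scraper::Filter API, and `apply_filters` is a hypothetical helper.

```python
from typing import Callable, Iterable

Filter = Callable[[str], str]


def apply_filters(value: str, filters: Iterable[Filter]) -> str:
    """Apply each filter in order, feeding the result into the next one."""
    for current_filter in filters:
        value = current_filter(value)
    return value


# Two stacked filters: trim surrounding whitespace, then lowercase.
result = apply_filters("  Hello World  ", [str.strip, str.lower])
print(result)  # hello world
```

The order matters in general: a filter later in the stack sees only what the earlier filters produced.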

By default, HTML::TreeBuilder::XPath is used. This can be replaced by an XML::LibXML backend using the Web::Scraper::LibXML module.

Author: Tatsuhiko Miyagawa

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

To install Web::Scraper, copy and paste the appropriate command into your terminal.

For more information on module installation, please visit the detailed CPAN module installation guide.
