Anysite Scraper Frequently Asked Questions

Getting Started & Core Concepts

What is XPath, and why is it important for scraping?

XPath (XML Path Language) is a query language that uses path expressions to locate elements in an XML or HTML document. In scraping, XPath is used to locate any element on a web page through the page's HTML DOM structure.
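
As a rough illustration of what path expressions look like, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The page fragment, class names, and values are made up for the example and are not taken from the software.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a web page.
PAGE = """
<html>
  <body>
    <div class="listing">
      <h2>Acme Plumbing</h2>
      <span class="phone">555-0100</span>
    </div>
    <div class="listing">
      <h2>Bolt Electric</h2>
      <span class="phone">555-0199</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Path expression: every <h2> directly inside a <div class="listing">.
names = [h2.text for h2 in root.findall(".//div[@class='listing']/h2")]

# Path expression: every <span class="phone"> anywhere below the root.
phones = [span.text for span in root.findall(".//span[@class='phone']")]

print(names)
print(phones)
```

Each expression describes a route through the DOM tree, which is why the same expression can pick up the matching field from every record on a page.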

For more details visit:

What is a "project" in the software?

When you need to extract specific leads from a website, the software provides an environment in which you define each field (such as business name, address, person name, phone number, email address, or website URL) by right-clicking it on the web page. Once you have defined all the fields you need from the website's page(s), you save the configuration in a file called a project.

My target website loads data when I scroll. How do I handle this?

Some websites load data dynamically as you scroll, so you must apply auto-scroll on page load before extraction. Set the scroll step (in pixels) and the delay for each step so that the page has enough time to load new content. Your internet connection must also be fast enough to load that content within the delay.

Example: With a medium-speed internet connection, you might set the scroll step to 300 (meaning 300 pixels) and the delay to 500 milliseconds.
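
To get a feel for how these two settings interact, the arithmetic can be sketched as follows. The page height of 6000 pixels is an assumed value for illustration only, and the function name is hypothetical; the software itself may count steps differently.

```python
import math


def scroll_plan(page_height_px, step_px, delay_ms):
    """Return (number of scroll steps, total waiting time in seconds)."""
    # Round up so that the last partial step still reaches the bottom.
    steps = math.ceil(page_height_px / step_px)
    total_seconds = steps * delay_ms / 1000
    return steps, total_seconds


# Settings from the example above, applied to an assumed 6000-pixel page.
steps, seconds = scroll_plan(page_height_px=6000, step_px=300, delay_ms=500)
print(steps, seconds)
```

A smaller step or a longer delay gives the page more time to load, at the cost of a longer wait before extraction starts.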

Understanding Website Structures

What is a "short profile" or "multi-record" page?

Some websites display the business name, address, and contact information directly on the search page. If that is all you need, you can extract it there instead of opening the detail page for the complete profile. Such a page is called a short profile page, or a multi-record page. Short profile pages are normally search pages that contain multiple records per page. Extracting from them is faster, so use them whenever the short profile meets your needs.

Example screenshot showing a search page with 3 records:

[Search page contains 3 records]

What is a "detail profile" page?

Most websites display a short profile on the search page and the complete profile on a separate page that opens when you click the short profile link. These separate pages are called detail profile pages. In such cases, the software collects the profile links from the search page and opens each detail profile in a separate window to extract the data.

Configuring Your Project: Key Techniques

Selecting Data & Parents

How do I select the correct area to scrape?

First, use the mouse to select the area that contains your required data. In the image below, the selected area contains all the required fields, such as business name, address, telephone, rating, and reviews.

[Area selection example screenshot]

What is a "parent node," and how do I select it?

In a browser, HTML nodes are arranged in a tree structure, and the nodes are related to each other as parents, children, siblings, and ancestors, much like a family tree. A parent node is a node that contains one or more fields as child nodes within its own subtree.

You can select any field that contains child nodes as a parent node. First, use the mouse to select the area that contains your required data.
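
The parent and child relationship can be sketched in code as follows. This is only an illustration under assumed markup (the "record", "name", and "address" class names are hypothetical), using the limited XPath support in Python's standard xml.etree.ElementTree module rather than the software's own engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical search page: each <div class="record"> is a parent node.
PAGE = """
<div id="results">
  <div class="record">
    <h2 class="name">Acme Plumbing</h2>
    <p class="address">12 High Street</p>
  </div>
  <div class="record">
    <h2 class="name">Bolt Electric</h2>
  </div>
</div>
"""

root = ET.fromstring(PAGE)

records = []
for parent in root.findall(".//div[@class='record']"):
    # Child fields are looked up relative to their own parent node,
    # so values from different records cannot be mixed up.
    name = parent.find("./h2[@class='name']")
    address = parent.find("./p[@class='address']")
    records.append({
        "name": name.text if name is not None else None,
        "address": address.text if address is not None else None,
    })

print(records)
```

Looking up fields relative to a parent keeps each record together even when a field, such as the second address here, is missing.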

Handling Interactive Elements

How do I extract data that only appears after a click?

Sometimes a field's actual data is hidden until you click the field. For example, a page may show telephone numbers only after you click "Telephone". If you add a click item to the configuration while creating the project, the software automatically clicks the field to reveal the data and then extracts it.

Steps to configure click action:

Select the field item that must be clicked (e.g., "Phone number")
Right-click the item; a popup window opens
Set an appropriate wait time (in milliseconds) so the data can load after the click
Enter a name for the selected item and save it

How do I collect profile links from a search page?

To collect the profile URLs, right-click any field that contains a link to a profile. A popup window opens.

Important settings:

The field type should be "Get Detail page links"
If the profile links have a parent, select the relevant parent in the "select parent field" dropdown list
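
Conceptually, collecting detail page links under a parent resembles the sketch below. The URL, markup, and class names are invented for the example, and the code uses Python's standard library rather than the software's own mechanism.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical search page address, used to resolve relative links.
SEARCH_URL = "https://example.com/search?page=1"

PAGE = """
<ul>
  <li class="record"><a class="profile" href="/biz/acme">Acme Plumbing</a></li>
  <li class="record"><a class="profile" href="/biz/bolt">Bolt Electric</a></li>
  <li class="ad"><a href="/promo">Sponsored</a></li>
</ul>
"""

root = ET.fromstring(PAGE)

detail_links = []
# Restricting the search to the parent nodes skips unrelated links
# such as the sponsored entry.
for parent in root.findall(".//li[@class='record']"):
    link = parent.find("./a[@href]")
    if link is not None:
        detail_links.append(urljoin(SEARCH_URL, link.get("href")))

print(detail_links)
```

Choosing a parent narrows link collection to the records themselves, which is the reason to set one whenever the page also contains links you do not want.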

How do I handle pagination (next page) if there's no obvious "Next" button?

Some websites have no direct "next page" link and show pagination as numbered pages instead (1, 2, 3, 4, etc.). In that case, set the next page item by right-clicking any page number (such as 2, 3, or 4) that is not the currently selected page.

Configuration:

The selected field type should be "Set the next page item"
If the selected pagination item has a parent node, select the relevant parent that contains it
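
One way to picture how numbered pagination can stand in for a "Next" button is the sketch below. It is an assumption about the general technique, not a description of the software's internals; the function name and labels are hypothetical.

```python
def next_page_label(page_labels, current):
    """Return the label of the first page after `current`, or None on the last page."""
    # Keep only numeric labels; separators such as "..." are ignored.
    numbers = sorted(int(label) for label in page_labels if label.isdigit())
    for number in numbers:
        if number > current:
            return str(number)
    return None


print(next_page_label(["1", "2", "3", "4"], current=1))
print(next_page_label(["1", "2", "3", "4"], current=4))
```

When no higher page number remains, the function returns None, which corresponds to reaching the last page of results.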

Extracting Specific Data Types

How do I extract data that is stored as a property attribute?

Sometimes the data you need is not the visible text of a field but the value of one of its properties (HTML attributes). For example, a telephone number might be stored as the value of a "data-visible-number" property.

To extract property data:

The selected field type should be "Get property data"
Enter the name of the property, or select it from the dropdown list
For example, the property name might be "data-visible-number"
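
The difference between a field's visible text and a property value can be sketched like this. The fragment is made up for illustration; only the "data-visible-number" name comes from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: the visible text hides the real number,
# which is stored in the data-visible-number property.
FRAGMENT = '<a class="phone" data-visible-number="555-0100">Show number</a>'

element = ET.fromstring(FRAGMENT)

visible_text = element.text                   # what the visitor sees
phone = element.get("data-visible-number")    # the stored property value

print(visible_text)
print(phone)
```

Reading the property directly avoids having to click the element when the value is already present in the page source.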

How do I extract an image URL?

To extract an image URL from a site, right-click the image. A popup window opens.

Note: When you select an image, choose the field type "Extract the image source address".
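
An image's source address is the value of its src attribute, often written relative to the page. A minimal sketch, assuming a made-up page URL and markup, shows how such an address resolves to a full URL:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page address and fragment containing one image.
PAGE_URL = "https://example.com/biz/acme"
FRAGMENT = '<div class="logo"><img src="/images/acme-logo.png" alt="Acme logo" /></div>'

root = ET.fromstring(FRAGMENT)
img = root.find(".//img")

# Resolve the relative src against the page address to get a full URL.
image_url = urljoin(PAGE_URL, img.get("src"))

print(image_url)
```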

How do I extract an email address?

To select an email address shown on a web page, right-click the field that contains it. A popup window opens.

When the selected field type is "Extract email address", the software picks out the email-formatted data from the selected HTML field.
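
Picking email-formatted data out of a field's text is commonly done with a pattern match. The sketch below uses a deliberately simple regular expression as an assumption; it is not the software's actual rule and does not cover every valid address format.

```python
import re

# Simplified pattern: local part, "@", domain, and a top-level domain
# of at least two letters. Real-world address rules are more permissive.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return every email-formatted substring found in `text`."""
    return EMAIL_PATTERN.findall(text)


# Hypothetical text of a selected HTML field.
field_text = "Contact: Jane Doe, sales@example.com or call 555-0100. Support: help@example.org."
emails = extract_emails(field_text)

print(emails)
```

Because only the matching substrings are returned, surrounding text such as names and phone numbers in the same field is left out.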