Friday, March 24, 2023

How To Crawl A Large Site And Extract Data Using Screaming Frog's SEO Spider


We're helping a number of clients right now with Marketo migrations. As large companies adopt enterprise solutions like this, the platform becomes like a spider web that weaves itself into processes and platforms over years, to the point that companies aren't even aware of every touchpoint.

With an enterprise marketing automation platform like Marketo, forms are the entry point of data throughout sites and landing pages. Companies often have thousands of pages and hundreds of forms throughout their sites that need to be identified for updating.

A great tool for this is Screaming Frog's SEO Spider, perhaps the most popular platform in the SEO market for crawling, auditing, and extracting data from a site. The platform is feature-rich and offers hundreds of options for almost every task you require. The features extend far beyond optimization for search, though, with one incredibly helpful feature for extracting data from your site as it's being crawled.

Screaming Frog SEO Spider: Crawl And Extract

A key feature of Screaming Frog SEO Spider is that you can perform custom extractions based on Regex, XPath, or CSSPath rules. This comes in extremely handy, as we want to crawl the client's sites and capture the MunchkinID and FormId values from pages.

With the tool, open Configuration > Custom > Extraction to identify the elements you wish to extract.

screamingfrog custom extraction

The extraction screen allows for almost limitless data collection:

Screaming Frog SEO Spider Extraction Rules

Regex, XPath, and CSSPath Extraction

For the MunchkinID, the identifier is located within the form script that's within the page:

<script type='text/javascript' id='marketo-fat-js-extra'>
    /* <![CDATA[ */
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https://yoursite.com/wp-admin/admin-ajax.php",
        "popout": {
            "enabled": false
        }
    };
    /* ]]> */
</script>

We then apply a Regex rule to capture the id from within the script tag that's inserted in the page:

Regex: ["']id["']: *["'](.*?)["']
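To check a rule like this before running a full crawl, it can help to try it against a saved copy of the page source. The following is a minimal Python sketch, not part of Screaming Frog itself: the SAMPLE markup mirrors the snippet above, and the function name extract_munchkin_ids is hypothetical.

```python
import re

# Same pattern as the extraction rule above: a quoted "id" key, a colon,
# optional spaces, then a quoted value captured in group 1.
MUNCHKIN_RE = re.compile(r"""["']id["']: *["'](.*?)["']""")

# Sample page source modeled on the Marketo snippet shown above.
SAMPLE = """<script type='text/javascript' id='marketo-fat-js-extra'>
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "popout": {
            "enabled": false
        }
    };
</script>"""


def extract_munchkin_ids(page_source):
    """Return every value captured by the MunchkinID rule (hypothetical helper)."""
    return MUNCHKIN_RE.findall(page_source)


if __name__ == "__main__":
    print(extract_munchkin_ids(SAMPLE))
```

Note that the id attribute on the script tag itself is not captured, because the rule requires a quote character immediately before the word id.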

For the Form ID, the data is in an input tag within the Marketo form:

<input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234">

We apply an XPath rule to capture the id from within the form that's inserted in the page. The XPath query looks for a form that directly contains an input with a name of formid, then the extraction saves that input's value:

XPath: //form/input[@name="formid"]/@value
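The same lookup can be sketched outside the tool with Python's standard library. This is only an approximation under stated assumptions: xml.etree.ElementTree supports a limited XPath subset and cannot select attributes with /@value, so the value is read with .get() instead; the sample markup is made well-formed (self-closing input) so the XML parser accepts it; and the function name extract_form_ids is hypothetical. Real-world HTML would likely need a dedicated HTML parser.

```python
import xml.etree.ElementTree as ET

# Well-formed sample modeled on the Marketo form field shown above.
SAMPLE = """<html><body>
<form id="mktoForm_1234">
  <input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234" />
</form>
</body></html>"""


def extract_form_ids(markup):
    """Return the value of every formid input that is a direct child of a form."""
    root = ET.fromstring(markup)
    # Equivalent in spirit to //form/input[@name="formid"]/@value
    inputs = root.findall(".//form/input[@name='formid']")
    return [element.get("value") for element in inputs]


if __name__ == "__main__":
    print(extract_form_ids(SAMPLE))
```

Because the path uses form/input, an input nested inside another element within the form would not be matched.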

Extract Inline Style Tags

We're helping a client right now clean up a site where they used inline styles in the Elementor plugin to customize almost every element within a page. To identify where inline styles were used, we scraped the site with a number of RegEx rules for custom extraction:

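  • Span Tag Inline Style:
<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Anchor Tag Inline Style:
<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Div Tag Inline Style:
<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Heading Tag Inline Style:
<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"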
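As a quick sanity check, rules like these can be run against a small HTML fragment in Python. This is a sketch under assumptions: the patterns are written with \s for whitespace and style as the attribute name, the heading pattern's h[1-6] prefix is my reading of the intended rule, and the sample markup and the function name find_inline_styles are made up for illustration.

```python
import re

# One rule per tag type; each captures the contents of a double-quoted style attribute.
INLINE_STYLE_RULES = {
    "span": re.compile(r'<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"'),
    "anchor": re.compile(r'<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"'),
    "div": re.compile(r'<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"'),
    "heading": re.compile(r'<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"'),
}

# Hypothetical fragment with one inline style per tag type.
SAMPLE = """
<div class="box" style="margin: 0 auto;">
  <h2 style="color: #333;">Title</h2>
  <span style="font-weight: bold;">Bold</span>
  <a href="/contact" style="text-decoration: none;">Contact</a>
  <p>No inline style here</p>
</div>
"""


def find_inline_styles(html):
    """Return the captured style values for each rule, keyed by rule name."""
    return {name: rule.findall(html) for name, rule in INLINE_STYLE_RULES.items()}


if __name__ == "__main__":
    for name, styles in find_inline_styles(SAMPLE).items():
        print(name, styles)
```

These patterns only capture style attributes written with double quotes; single-quoted attributes would need a separate rule.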

Exclude Subdomains In Your Crawl

At Martech Zone, we serve the site in multiple languages at different subdomains. Crawling these translations isn't necessary since all of the assets and information are based on the core site. Because of this, we used the Exclude list configuration and added the following rule:

.*\.martech\.zone

You can also use this to skip crawling unnecessary paths like tags by adding:

martech.zone/tag/.*

The platform even has a nice way to test some URLs against the rules to ensure they work properly before you crawl your site.
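For a rough offline version of that check, exclude rules can be tried against a handful of URLs in Python. This is a sketch, not the tool's own logic: it assumes the rules are applied as partial regex matches (hence re.search), it writes the rules with their literal dots escaped, and the test URLs and the function name is_excluded are hypothetical.

```python
import re

# Exclude rules with literal dots escaped, so that "\.martech\.zone" only
# matches when a subdomain label precedes the domain.
EXCLUDE_RULES = [
    r".*\.martech\.zone",
    r"martech\.zone/tag/.*",
]

# Hypothetical URLs covering the core site, a language subdomain, and a tag page.
TEST_URLS = [
    "https://martech.zone/some-post/",
    "https://fr.martech.zone/some-post/",
    "https://martech.zone/tag/seo/",
]


def is_excluded(url, rules=EXCLUDE_RULES):
    """Return True if any exclude rule matches somewhere in the URL."""
    return any(re.search(rule, url) for rule in rules)


if __name__ == "__main__":
    for url in TEST_URLS:
        print("skip " if is_excluded(url) else "crawl", url)
```

Escaping matters here: with an unescaped dot, the subdomain rule would also match the core site, because the dot can match the slash before the domain name.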

Screaming Frog SEO Spider JavaScript Rendering

Another great option in Screaming Frog is that you aren't limited to the HTML in the page; you can render any JavaScript that's going to insert forms within your site. Within Configuration > Spider, go to the Rendering tab and enable it.

Screaming Frog SEO Spider Javascript Rendering

This does take a little longer to crawl the site, of course, but you'll get forms that are rendered client-side by JavaScript as well as forms that are inserted server-side.

While this is a very specific application, it's an incredibly useful one when you're working with large sites. You'll absolutely want to audit where your forms are embedded throughout the site.

Download Screaming Frog SEO Spider

Disclosure: Martech Zone is using its affiliate links in this article.


