Use node & javascript for web crawling

I recently had to crawl the web to get some data from a website. Even though I'm more of a Java addict and convinced that JavaScript is the BASIC of the 21st century, this language seemed to be the most efficient way to do this task.

For this purpose I discovered crawler: a JavaScript library for Node.js.

This post details that experiment.

Getting started with Crawler

The use of crawler is really simple: its purpose is to manage a queue of tasks you want to execute on the different URLs you want to crawl. The page analysis is done through jQuery.

To install this package you just have to run:

# npm install crawler

Then it is easy to start. The first thing is to instantiate the crawler:

var Crawler = require("crawler").Crawler;
var c = new Crawler({
    "maxConnections":5,
    "onDrain" : callForEnding
});

This declares a new Crawler with a maximum of 5 concurrent connections. The default is 10; set it as you want.

The onDrain parameter is a callback that will be fired once the queue is empty, meaning there are no more jobs to process. As everything is asynchronous and single-threaded, it is the only way to determine the end.

This callback can be something like:

callForEnding = function() {
    console.log("End of work !!!");
    setTimeout(function(){
            process.exit(0);
    },30000);
};

Here I add a 30 second delay before quitting Node.js, as it seems the callback is fired once the queue is empty, not when the work is done. It was a safety margin; I did not test it deeply.
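As an alternative (this is just a sketch of an idea, not what I used in this post), you could also count the outstanding jobs yourself and exit only once the last callback has really finished, instead of relying on a fixed delay:

// Sketch : wrap c.queue() so that a counter tracks the outstanding jobs
var pending = 0;

function queueUrl(uri, callback) {
    pending++;
    c.queue([{
        "uri" : uri,
        "callback" : function(error, result, $) {
            callback(error, result, $);
            pending--;                       // this job is done
            if (pending === 0) {             // nothing left in flight : we can quit
                console.log("End of work !!!");
                process.exit(0);
            }
        }
    }]);
}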

Now you can add some URLs to process to the queue like this:

c.queue([
    {
        "uri" : "http://myurl/to/start/with",
        "callback" : function(error,result,$) {
            // $ is a jQuery instance scoped to the server-side DOM of the page
            $("a").each(function (index, a) {
                if (a.href.indexOf("searchString") > 0) {
                   c.queue([
                            {"uri": a.href,
                             "callback": proceedSearch
                            }
                        ]);
                }
            });
        }
    }
]);

In this example a first URL is added to the queue. When it is processed, the callback function is fired. You can test for an error there; here it is not done. In the callback, the HTML page is analyzed using jQuery syntax to search for all <a> tags. For each of them a function is executed that looks for some specific content in the link. If found, the link is added to the queue with a specific callback, proceedSearch, that will be fired when that page is processed in turn.
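The proceedSearch callback itself is not shown above. Just to illustrate the shape of such a callback (a minimal sketch, not my actual code, and the selectors used are hypothetical), it could look like this:

proceedSearch = function(error, result, $) {
    // the crawler passes (error, result, $) like for any other callback
    if (error) {
        console.log("Error while crawling : " + error);
        return;
    }
    var title = $("title").text();               // title of the crawled page
    $("h2").each(function (index, h2) {          // hypothetical : list all <h2> headings
        console.log(title + " - " + $(h2).text());
    });
};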

To use jQuery syntax to analyze a page, here are the different tools and things to know:

$("...") - searches the whole page
$("tag.class") - searches tags only if they belong to a specific class
$("tag#id") - searches tags only if they have a specific id
$(xxx).find("tag" | "tag.class" | "tag#id") - is equivalent but can be used on a previously selected element

So you can use all of this to target precise elements; here is an example:

var leftmenu = $("div.leftmenu"); // get the <div class="leftmenu"> ... </div> part of the page
$(leftmenu).find("a").each(       // in leftmenu, search for <a>
   function(index,a) {            // this function is executed for each <a>
      // a.href - refer to href part of the <a>
      // $(a).text() - refer to the content of the <a...>CONTENT</a>
      // index is the incremental sequence number
   }
);

Instead of using function(index,tag), you can use the following:

$("*").each(function() {   // "*" refer to any tag
   // $(this) - refer to the selected subpart of the page
   if ( $(this).is("h2") ) { ... } // allows to test the name of the current tag
});

This crawling library, used with jQuery, makes it really easy to do what it is meant for. The main issue I had with it is memory consumption. Even in an asynchronous engine I do not really understand why, but as a matter of fact, this code consumes a really LARGE amount of memory every time you queue a request. I reached the memory limit many times and had to limit my search by doing different runs with different scopes. So do not try to queue more than about 200 URLs before starting to dequeue, or it crashes.
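As an illustration of this workaround (my addition; the exact numbers and structure are not from my original code), the first callback could collect the matching links and only queue a limited batch of them, keeping the rest for another run:

var MAX_QUEUED = 200;       // rough limit before memory became a problem for me

c.queue([
    {
        "uri" : "http://myurl/to/start/with",
        "callback" : function(error, result, $) {
            var candidates = [];
            $("a").each(function (index, a) {
                if (a.href.indexOf("searchString") > 0) {
                    candidates.push(a.href);             // collect first, queue later
                }
            });
            // only queue a limited slice, the rest is left for another run
            candidates.slice(0, MAX_QUEUED).forEach(function (href) {
                c.queue([ { "uri": href, "callback": proceedSearch } ]);
            });
        }
    }
]);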

Adding a MySQL storage

As the results were stored in a MySQL DB, you can also install the mysql package:

# npm install mysql

And prepare your code to access your DB:

var mysql = require("mysql");
var mySqlClient = mysql.createConnection({
   host : "db.host.foo.bar",
   user : "dbUser",
   password : "dbPass",
   database : "dbName"
});

mySqlClient.connect();

// this is specific to my code, I changed the query format to be able to use name
// of column dynamically generated
mySqlClient.config.queryFormat = function (query, values) {
    if (!values) return query;
    return query.replace(/\:(\w+)/g, function (txt, key) {
        if (values.hasOwnProperty(key)) {
            return this.escape(values[key]);
        }
        return txt;
    }.bind(this));
};

Now, storing any data in the DB can be done from the JS code using SQL statements:

mySqlClient.query('INSERT INTO country SET name = :name, description = :description',
        {
            name : country.name,
            description : country.description
        }, function (err,result){
            if (err) throw err;
            else {
                console.log("*** Create Country : "+country.name);
            }
        });

This code maps the “name” column to the “:name” placeholder, which corresponds to the “country.name” value.

An update is done the same way:

mySqlClient.query('UPDATE country SET '+fieldName+' = :Image WHERE name = :Name',
            {
                Image: image,
                Name: countries[countryGetIndex(name)].name
            }, function (err, result) {
                if (err) throw err;
                else {
                    console.log("*** Update Img : "+name);
                }
            });

This uses a dynamically generated column name: “fieldName” is mapped the same way we have seen previously.
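Note that the column name itself cannot go through the :placeholder mechanism since it is concatenated directly into the SQL string. As a side note (my addition, this is not in my original code), the mysql module provides escapeId() to escape such identifiers before concatenation:

// Sketch : escape the dynamically generated column name with mysql.escapeId()
// (fieldName, image, countries, ... come from the code above)
var safeField = mysql.escapeId(fieldName);
mySqlClient.query('UPDATE country SET ' + safeField + ' = :Image WHERE name = :Name',
            {
                Image: image,
                Name: countries[countryGetIndex(name)].name
            }, function (err, result) {
                if (err) throw err;
                else {
                    console.log("*** Update Img : "+name);
                }
            });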

2 thoughts on “Use node & javascript for web crawling”

  1. Did you ever try to run your tutorial?
    There are so many mistakes.

    require(“crawler”).Crawler is wrong, right is require(“crawler”);
    $(div.leftmenu) is not possible to use as a selector, right is $(“div.leftmenu”)
    callback functions are wrong also

    • To answer your question: as usual my posts are copy & paste of some of my work, so yes, I ran it.
      Typos or library evolution are still possible, so thank you for your comment to improve / update this post.
