
Nutch crawler and integration with Solr

Before moving ahead with this article, I assume you have Solr installed and running. If you would like to install Solr on Windows, macOS, or via Docker, please read Setup a Solr instance.
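
If you want to sanity-check that Solr is reachable before you start (this assumes the default port 8983 on localhost; adjust the host to your setup), a quick request to the core admin API should return a status response,
    > curl "http://localhost:8983/solr/admin/cores?action=STATUS"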

There are several ways to install Nutch, which you can read about in the Nutch tutorial; however, I have written this article for those who would like to install Nutch using Docker. I tried finding help on Google but could not find anything covering Nutch installation with Docker, and I spent a good amount of time fixing issues specific to it. I have therefore written this article to save other developers that time.

Install Nutch using Docker -

1. Pull the Nutch Docker image using the command below,
    > docker pull apache/nutch
2. Once the image is pulled, run the container,
    > docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
3. You should now be able to open a shell inside the container and see its bash prompt,
    > bash-5.1# 
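    Because the container was started detached (-d), you may need to explicitly open a shell inside it first; a docker exec like the one below (using the container name from step 2) should drop you at that prompt,
    > docker exec -it nutchcontainer /bin/bash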

Let's set up a few important settings now -

1. Go to the bin folder, 
    > bash-5.1# cd /nutch/bin
        you should find the nutch and crawl scripts in the bin folder.
2. Check that Nutch is installed by running the command below,
    > bash-5.1# ./nutch
        you should get Nutch usage details as output.

3. Create a new folder that will hold the URLs we want to crawl,
    > bash-5.1# mkdir urls
    > bash-5.1# touch urls/seed.txt
    > bash-5.1# vi urls/seed.txt

         add the URLs you would like to crawl to seed.txt, one per line.

4. The seed.txt file (in this case, I have added two URLs to crawl) should look something like the example below,
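
    For illustration, a minimal seed.txt with two hypothetical URLs (replace them with the sites you actually want to crawl) could be:

         https://example.com/
         https://example.org/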

5. Modify the nutch-site.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi nutch-site.xml
    Remove the existing property lines and add the following (note that http.agent.name is mandatory; without it the crawl fails with "Agent name not configured!"),
    <?xml version="1.0"?>
    <configuration>
         <property>
          <name>http.agent.name</name>
          <value>nutch-solr-integration</value>
         </property>
         <property>
          <name>generate.max.count</name>
          <value>10</value>
         </property>
         <property>
          <name>generate.max.per.host</name>
          <value>10</value>
         </property>
         <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer</value>
         </property>
    </configuration>

6. Run the command below to inject the URLs for crawling,
    > bash-5.1# cd ../bin 
    > bash-5.1# ./nutch inject crawlerdb urls
        It should inject the URLs and show you a success message. Along with it, a crawl database (the crawlerdb folder) will also be created.
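
    If you want to verify the injection, Nutch's readdb tool (pointed at the crawlerdb folder created above) can print crawl database statistics, including how many URLs were injected and are still unfetched,
    > bash-5.1# ./nutch readdb crawlerdb -stats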

7. Now run the command below to crawl, create segments, and build the inverted link database,

    > bash-5.1# ./crawl --num-threads 3 -s urls crawldb 2

    This will start crawling the URLs with 3 fetcher threads and iterate 2 times, using crawldb as the crawl directory.


8. After a successful run, you will notice that crawldb, segments, and linkdb folders have been created inside the crawl directory.
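
    To see which segments were generated, Nutch's readseg tool can list them; this is a quick check, assuming the segments ended up under the crawldb crawl directory used in step 7,
    > bash-5.1# ./nutch readseg -list -dir crawldb/segments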

Integrate it with Solr - 

1. Modify the index-writers.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi index-writers.xml
     Update the Solr URL as shown below, where nutch_collection is the collection created in Solr. 
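     For reference, here is a minimal sketch of the relevant fragment (this assumes the default SolrIndexWriter writer block that ships with Nutch; only the url parameter needs to change, the rest of the file can stay as it is),
     <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
       <parameters>
         <param name="type" value="http"/>
         <param name="url" value="http://solr.bajajsumit.com:8983/solr/nutch_collection"/>
         ...
       </parameters>
     </writer>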
2. Run the command below to index the crawled data into the Solr instance,

    > bash-5.1# ./crawl -i -D solr.server.url=http://solr.bajajsumit.com:8983/solr/nutch_collection crawldb 1


3. After successful completion, the crawled records should show up in the Solr instance under nutch_collection.
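
    As a quick check from the command line, a plain Solr select query against the same collection (adjust the base URL to your own Solr instance) should return some of the indexed documents,
    > curl "http://solr.bajajsumit.com:8983/solr/nutch_collection/select?q=*:*&rows=5"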


Please feel free to ask for help. Keep learning and keep building a strong community. 


Comments

  1. Thank you for this wonderful guide! I have a question though: what should I do when I receive the error "No User-Agent string set"? I can't get past this error:

    2022-04-29 19:19:42,438 ERROR o.a.n.p.h.a.HttpBase [main] No User-Agent string set (http.agent.name)!
    2022-04-29 19:19:42,560 WARN o.a.n.p.PluginRepository [main] Could not find org.apache.nutch.protocol.http.Http
    java.lang.RuntimeException: Agent name not configured!


