
Nutch crawler and integration with Solr

This article assumes you already have Solr installed and running. If you would like to install Solr on Windows, Mac or via Docker, please read Setup a Solr instance.

There are several ways to install Nutch, which you can read about in the Nutch tutorial. I have written this article for those who would like to install Nutch using Docker. I searched on Google but could not find any help for installing Nutch with Docker, and I spent a good amount of time fixing issues specific to it. I hope this article saves other developers that time.

Install Nutch using Docker -

1. Pull the Nutch Docker image using the command below,
    > docker pull apache/nutch
2. Once the image is pulled, run the container,
    > docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
3. The container runs in the background because of the -d flag. Open a shell in it,
    > docker exec -it nutchcontainer /bin/bash
        You should now be inside the container and see the bash prompt,
    > bash-5.1# 
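
If the container is stopped later (for example after a reboot), it does not need to be created again. The sketch below shows one way to get back into it from the host. The function name `nutch_shell` is made up for this example; the container name is the one used in step 2.

```shell
# Container name from step 2 of this article.
CONTAINER_NAME="nutchcontainer"

# Hypothetical helper: start the container if it is not running,
# then open an interactive bash shell inside it.
nutch_shell() {
  # "docker ps" lists running containers only; print just their names.
  if ! docker ps --format '{{.Names}}' | grep -qx "$CONTAINER_NAME"; then
    docker start "$CONTAINER_NAME" >/dev/null
  fi
  docker exec -it "$CONTAINER_NAME" /bin/bash
}
```

Calling `nutch_shell` on the host should bring you to the same `bash-5.1#` prompt as above.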

Let's set up a few important settings now -

1. Go to the bin folder,
    > bash-5.1# cd /nutch/bin
        You should find the nutch and crawl scripts in the bin folder.
2. Check that Nutch is installed by running the command below,
    > bash-5.1# ./nutch
        You should get the Nutch details as output.

3. Create a new folder for the URLs to crawl, with a seed file inside it,
    > bash-5.1# mkdir urls
    > bash-5.1# touch urls/seed.txt
    > bash-5.1# vi urls/seed.txt

         Add the URLs you would like to crawl to seed.txt.

4. The seed.txt file should list one URL per line (in this case, I have added two URLs to crawl).

5. Modify the nutch-site.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi nutch-site.xml
    Remove the existing lines and add the following lines to it,
    <?xml version="1.0"?>
    <configuration>
         <property>
          <name>http.agent.name</name>
          <value>nutch-solr-integration</value>
         </property>
         <property>
          <name>generate.max.count</name>
          <value>10</value>
         </property>
         <property>
          <name>generate.max.per.host</name>
          <value>10</value>
         </property>
         <property>
          <name>plugin.includes</name>
          <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|indexer-solr|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer
         </property>
    </configuration>

6. Run the command below to inject the URLs for crawling,
    > bash-5.1# cd ../bin 
    > bash-5.1# ./nutch inject crawlerdb urls
        It should inject the URLs and show a success message. Along with it, a crawl database (the crawlerdb folder) will also be created.

7. Now run the command below to crawl, create segments and invert links,

    > bash-5.1# ./crawl --num-threads 3 -s urls crawldb 2

    This starts crawling the seed URLs with 3 concurrent threads and repeats the crawl cycle 2 times.


8. After a successful run, you will notice that crawldb, segments and linkdb folders have been created.
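
The setup steps above can be sanity-checked with a few small shell helpers. This is only a sketch, not part of Nutch: the function names `write_seeds`, `has_agent_name` and `has_crawl_output` are invented for this example, and the two seed URLs are placeholders. It assumes the crawl script puts crawldb, segments and linkdb directly under the crawl directory passed to it.

```shell
# Hypothetical helpers for checking the setup; none of these are Nutch commands.

# Write a seed file with one URL per line (the URLs are placeholders).
write_seeds() {
  seed_dir="$1"
  mkdir -p "$seed_dir"
  cat > "$seed_dir/seed.txt" <<'EOF'
https://nutch.apache.org/
https://solr.apache.org/
EOF
}

# Succeed only if the given nutch-site.xml has a non-empty http.agent.name value.
# Assumes the <value> line directly follows the <name> line, as in step 5.
has_agent_name() {
  grep -A1 '<name>http.agent.name</name>' "$1" | grep -q '<value>[^<][^<]*</value>'
}

# Succeed only if the crawl directory contains crawldb, segments and linkdb.
has_crawl_output() {
  for sub in crawldb segments linkdb; do
    [ -d "$1/$sub" ] || { echo "missing: $1/$sub" >&2; return 1; }
  done
}
```

From /nutch/bin inside the container, `has_agent_name ../conf/nutch-site.xml` is a quick check to run before crawling, since Nutch refuses to fetch when http.agent.name is not set. After step 7, `has_crawl_output crawldb` checks the folders mentioned in step 8.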

Integrate it with Solr - 

1. Modify the index-writers.xml file,
    > bash-5.1# cd ../conf
    > bash-5.1# vi index-writers.xml
     Update the Solr URL in it so that it points to nutch_collection, the collection created in Solr.
2. Go back to the bin folder and run the command to push the crawled data to the Solr instance,

    > bash-5.1# cd ../bin
    > bash-5.1# ./crawl -i -D solr.server.url=http://solr.bajajsumit.com:8983/solr/nutch_collection crawldb 1


3. After successful completion, the records should show up in the Solr instance under nutch_collection.
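
One way to confirm the last step from the command line is to ask Solr how many documents the collection holds. The sketch below assumes Solr's standard /select handler is available on the collection; `solr_count_url` is a helper name invented for this example.

```shell
# Hypothetical helper: build a query URL that matches every document (q=*:*)
# but returns no rows, so the response carries only the total in "numFound".
solr_count_url() {
  # ${1%/} drops a trailing slash from the collection URL, if there is one.
  printf '%s/select?q=*:*&rows=0\n' "${1%/}"
}
```

For example, `curl -s "$(solr_count_url http://solr.bajajsumit.com:8983/solr/nutch_collection)"` should return a response whose numFound is greater than zero once indexing has worked.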


Please feel free to ask for any help. Keep learning and building the community strong. 


Comments

  2. Thank you for this wonderful guide! I have a question though: what to do when received the error "No User-Agent string set". I can't get past this error

    2022-04-29 19:19:42,438 ERROR o.a.n.p.h.a.HttpBase [main] No User-Agent string set (http.agent.name)!
    2022-04-29 19:19:42,560 WARN o.a.n.p.PluginRepository [main] Could not find org.apache.nutch.protocol.http.Http
    java.lang.RuntimeException: Agent name not configured!
