Setting up Nutch 2.x is quite tricky compared to Nutch 1.x. The main feature of 2.x is that it uses an Apache Gora backend. One of the implementations of Gora is HBase, and so I'll use HBase to configure Nutch 2.x.
Every version of Nutch 2.x is tied to a specific version of HBase, so it's very important to use the exact version mentioned here. The gora-hbase module that comes with Nutch 2.3.1 has rev=0.6.1, which is tied to HBase 0.98.8 Hadoop2; you can download it from this link.
So, let’s get down to business.
Install HBase
HBase is fairly easy to set up; you only need to follow a couple of steps. Before you proceed, check that you have Java 1.6/1.7 installed and that JAVA_HOME is set correctly.
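A quick pre-flight check from the shell can save some head-scratching later (this is just a convenience sketch, not a required step):

```shell
# Print the Java version if a JDK is on the PATH, and show JAVA_HOME.
if command -v java >/dev/null 2>&1; then
  java -version 2>&1 | head -n 1
else
  echo "java not found on PATH"
fi
echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
```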
First, download HBase 0.98.8 Hadoop2. I'd generally extract it to /opt/ and create a symbolic link at /opt/hbase for ease of use, but that's your call. For this article, I'll use /opt/hbase.
If you’re using Ubuntu/Debian, make sure you’ve added localhost to your /etc/hosts as below:
127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu
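You can double-check the entry with a quick grep (a sanity check only; HBase's embedded ZooKeeper expects localhost to resolve):

```shell
# Show any localhost entries in /etc/hosts.
grep -w 'localhost' /etc/hosts
```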
Now, open /opt/hbase/conf/hbase-env.sh and update JAVA_HOME:
export JAVA_HOME=/usr/java/default
Now, open /opt/hbase/conf/hbase-site.xml and put in the following configuration:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/hbase-db</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/path/to/trynutch/zookeeper</value>
  </property>
</configuration>
hbase.rootdir should point to a directory that does not yet exist; HBase will create it for you on first start. If the directory already exists, HBase will attempt a migration instead, which is not what you want.
That should be it.
Start HBase using the following command:
/opt/hbase/bin/start-hbase.sh
Once it's running, you can verify it by opening the HBase shell:
/opt/hbase/bin/hbase shell
Install Nutch 2.x
Download the Nutch 2.x source code; at the time of writing this article, Nutch 2.3.1 is the latest, and hence I'll use it. Like HBase, I extracted the archive to /opt/nutch.
You’ll now need to compile it from source using Ant. Once Ant is installed, just issue the following command under /opt/nutch:
ant runtime
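Before building, it's worth confirming that the HBase storage backend is actually enabled in your tree; in some Nutch 2.x source drops the gora-hbase dependency ships commented out. Treat the snippets below as a hedged check against my assumptions, and verify them in your own checkout:

```xml
<!-- ivy/ivy.xml: this dependency must be present and uncommented -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.6.1" conf="*->default" />
```

```
# conf/gora.properties: make HBase the default datastore
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```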
This is going to take a long time because all the required dependencies will be downloaded from the Maven repositories. Just be patient and wait.
Once the source is built, head to /opt/nutch/runtime/local/conf, modify hbase-site.xml, and put in the same configuration we used above for HBase. Then make the following changes in nutch-default.xml:
<property>
  <name>file.content.limit</name>
  <value>524288</value>
  <description>The length limit for downloaded content using the file
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>524288</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>500</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
Now open nutch-site.xml and put the following inside the configuration block:
<property>
  <name>http.agent.name</name>
  <value>crawler</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
Now we need to restrict Nutch so that it doesn't crawl the entire web. To do that, we'll edit regex-urlfilter.txt and modify the following at the end:
+. # Remove this line
+^https?://([a-z0-9]*\.)*example\.com # Add this line
This makes sure that Nutch crawls example.com and all of its subdomains, and nothing else.
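You can sanity-check a filter regex of this shape outside Nutch with grep -E. Nutch applies Java regexes, but for a simple pattern like this the extended POSIX behaviour is the same; the anchored, dot-escaped variant and the sample URLs below are my own illustration:

```shell
# URLs on example.com and its subdomains should match; anything else should not.
pattern='^https?://([a-z0-9]*\.)*example\.com'
printf '%s\n' \
  'https://example.com/page' \
  'https://foo.example.com/' \
  'https://other.org/' | grep -E "$pattern"
# prints the two example.com URLs only
```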
We’ll need to provide a list of URLs for Nutch to crawl. Create a directory urls under /opt/nutch/runtime/local and put your domain names in a file:
#Filename: /opt/nutch/runtime/local/urls/allowed.txt
https://example.com
https://foo.example.com
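The steps above can be scripted in one go; NUTCH_LOCAL below is just a convenience variable of my own, and the two URLs are the article's examples:

```shell
# Create the seed directory and the URL list in one go.
NUTCH_LOCAL=${NUTCH_LOCAL:-/opt/nutch/runtime/local}
mkdir -p "$NUTCH_LOCAL/urls"
printf '%s\n' 'https://example.com' 'https://foo.example.com' \
  > "$NUTCH_LOCAL/urls/allowed.txt"
```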
That’s it. Now we’ll use the /opt/nutch/runtime/local/bin/crawl command to crawl example.com with the parameters below:
bin/crawl urls first-crawl 3
- bin/crawl is the shell script that executes the Nutch life-cycle phases sequentially.
- urls is the directory containing the list of all domains Nutch will crawl.
- first-crawl is a unique crawl ID.
- 3 is the number of rounds the crawl will perform; in other words, the depth to which the site will be crawled.
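For reference, each round of bin/crawl strings the individual Nutch phases together roughly like this (pseudocode; the phase names are from the Nutch 2.x life-cycle, but exact flags and ordering vary by version):

```
inject(urls)            # seed the web table from the urls/ directory (once)
repeat 3 times:         # one iteration per round
    generate(topN)      # select a batch of URLs that are due for fetching
    fetch(batch)        # download the selected pages
    parse(batch)        # extract text and outlinks
    updatedb()          # merge new outlinks and scores back into the web table
```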
That’s about it. Please comment below if you have any questions.