Configure Nutch 2.x with HBase

Setting up Nutch 2.x is trickier than setting up Nutch 1.x. The main feature of 2.x is that it uses a Gora backend for storage. One of the Gora implementations is HBase, so I’ll use HBase to configure Nutch 2.x.

Every version of Nutch 2.x is tied to a specific version of HBase, so it’s very important to use the matching version of HBase.

Nutch 2.3.1 comes with gora-hbase rev=0.6.1, which is tied to HBase 0.98.8 Hadoop2; you can download it from this link.

So, let’s get down to business.

Install HBase

HBase is fairly easy to set up; you only need to follow a couple of steps.

Before you proceed, check that you have Java 1.6/1.7 installed and that JAVA_HOME is set up correctly.

First, download HBase 0.98.8 Hadoop2.

I generally extract it to /opt/ and create a symbolic link at /opt/hbase for ease of use, but that’s your call. For this article, I’ll use /opt/hbase.

If you’re using Ubuntu/Debian, make sure you’ve added localhost to your /etc/hosts as below:

127.0.0.1 localhost
127.0.0.1 ubuntu.ubuntu-domain ubuntu

Now, open /opt/hbase/conf/hbase-env.sh and update the JAVA_HOME.

export JAVA_HOME=/usr/java/default

Now, open /opt/hbase/conf/hbase-site.xml and put the following configuration:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///opt/hbase-db</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/path/to/trynutch/zookeeper</value>
  </property>
</configuration>

hbase.rootdir should point to a directory that does not exist yet in your file system; HBase will create it itself. If the directory already exists, HBase will try to migrate it.

That should be it.

Start HBase, then open the HBase shell to confirm it is running:

# /opt/hbase/bin/start-hbase.sh
# /opt/hbase/bin/hbase shell

Install Nutch 2.x

Download the Nutch 2.x source code. At the time of writing, Nutch 2.3.1 is the latest version, so I’ll use it.

Like HBase, I extracted the archive to /opt/nutch.

You’ll now need to compile it from source using Ant. Once Ant is installed, run the following command under /opt/nutch:

# ant runtime

This is going to take a long time because all the required dependencies will be downloaded from the Maven repository. Just be patient and wait.

Once the source is built, head to /opt/nutch/runtime/local/conf, open hbase-site.xml and put in the same configuration we used above for HBase.

Make the following changes in nutch-default.xml

<property>
  <name>file.content.limit</name>
  <value>524288</value>
  <description>The length limit for downloaded content using the file
   protocol, in bytes. If this value is nonnegative (>=0), content longer
   than it will be truncated; otherwise, no truncation at all. Do not
   confuse this setting with the http.content.limit setting.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>524288</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>500</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>

Now open nutch-site.xml and put the following in the configuration block:

<property>
	<name>http.agent.name</name>
	<value>crawler</value>
</property>
<property>
	<name>storage.data.store.class</name>
	<value>org.apache.gora.hbase.store.HBaseStore</value>
	<description>Default class for storing data</description>
</property>

Now we need to stop Nutch from crawling the entire web. To do that, edit regex-urlfilter.txt and change the end of the file as follows (the trailing # notes are annotations for this article, not part of the file):

+. # Remove this line
+^https?://([a-z0-9-]*\.)*example\.com # Add this line

This makes sure that Nutch crawls only example.com and its subdomains.
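Before starting a long crawl, it can help to try the filter pattern against a few sample URLs. The sketch below does that in JavaScript, whose regex syntax is the same as Java's for this simple pattern. It assumes the filter matches anywhere in the URL, which is why this version anchors the pattern with ^ and escapes the dots. The isAllowed helper and the sample URLs are made up for illustration.

```javascript
// Hypothetical helper: checks a URL against the allow pattern.
// Assumption: the filter uses unanchored (find-style) matching,
// so the pattern is anchored with ^ and the dots are escaped.
var allowPattern = new RegExp("^https?://([a-z0-9-]*\\.)*example\\.com");

function isAllowed(url) {
    return allowPattern.test(url);
}

// Made-up sample URLs for illustration
var samples = [
    "https://example.com/",
    "http://foo.example.com/page",
    "https://notexample.com/",
    "https://evil.org/?next=https://example.com"
];

// Print each URL with "+" (would be crawled) or "-" (would be skipped)
samples.forEach(function (url) {
    console.log((isAllowed(url) ? "+ " : "- ") + url);
});
```

Only the first two samples are accepted. Without the ^ anchor, the last URL would also pass, because example.com appears in its query string.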

We’ll need to provide a list of seed URLs for Nutch to crawl. Create a directory named urls under /opt/nutch/runtime/local and put your URLs in a file.

#Filename: /opt/nutch/runtime/local/urls/allowed.txt
https://example.com
https://foo.example.com

That’s it. Now we’ll use the /opt/nutch/runtime/local/bin/crawl command to crawl example.com with the parameters below:

# bin/crawl urls first-crawl 3
  • bin/crawl is the shell script that runs the steps of the Nutch life-cycle in sequence.
  • urls is the name of the directory that contains the list of URLs that Nutch will crawl.
  • first-crawl is a unique crawl key.
  • 3 is the number of rounds the crawl will perform, in other words the depth of the site you’re going to crawl.

That’s all there is to it. Please comment below if you have any questions.

Abdul Munim June 20, 2016 search-engine

Lync 2013 fix server version incompatibility

Today, I installed Office 2013 and found that Lync 2013 would not sign in to BT Office Communicator Server 2007. I started getting this message:

“Cannot sign in because the server version is incompatible with Microsoft Lync. Contact your support team with this information”

After scratching my head for many hours, I finally found how to disable the server check.

Make sure you run the Command Prompt as “Administrator” and execute the command below:

Reg Add "HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Office\15.0\Lync" /V "DisableServerCheck" /D 1 /T REG_DWORD /F
Abdul Munim December 10, 2012 office

HOW TO: Convert office documents to PDF using Open Office/LibreOffice in C#

Lately, we had a requirement to convert office documents such as DOC, DOCX, XLS, XLSX, PPT and PPTX to PDF. After googling for some time, I couldn’t find any Microsoft Office API or command to achieve this. I had some experience with Open Office on my Ubuntu netbook, and I knew that Open Office exposes its API through UNO, so I started searching and working on a demo to convert office documents to PDF.

Open Office provides a CLI binding for its UNO API, which makes the API usable from the .NET Framework. See the Open Office Developer Guide for details.

Things that need to be installed

  • Open Office (Libre Office now)
  • Open Office SDK

NOTE: The Open Office SDK is not strictly required if you can copy the required assemblies from the GAC to your application’s assembly folder. The required files are listed below:

  • cli_basetypes.dll
  • cli_cppuhelper.dll
  • cli_oootypes.dll
  • cli_ure.dll
  • cli_uretypes.dll

If you have installed the Open Office SDK, you will find these files under sdk\cli in your installed SDK folder.

Implementation

Add references to all the DLLs above to your project, then follow the methods noted below.

You will need these namespaces imported:

using System;
using System.Diagnostics;
using System.IO;
using System.Threading;
using uno;
using uno.util;
using unoidl.com.sun.star.beans;
using unoidl.com.sun.star.frame;
using unoidl.com.sun.star.lang;

This is the method you will actually call from your assembly or expose from your class library. It starts the Open Office executable, initializes the UNO components and finally saves the document as PDF.

public static void ConvertToPdf(string inputFile, string outputFile)
{
    if (ConvertExtensionToFilterType(Path.GetExtension(inputFile)) == null)
        throw new InvalidProgramException("Unknown file type for OpenOffice. File = " + inputFile);

    StartOpenOffice();

    //Get a ComponentContext
    var xLocalContext =
        Bootstrap.bootstrap();
    //Get MultiServiceFactory
    var xRemoteFactory =
        (XMultiServiceFactory)
        xLocalContext.getServiceManager();
    //Get a ComponentLoader
    var aLoader =
        (XComponentLoader) xRemoteFactory.createInstance("com.sun.star.frame.Desktop");
    //Load the sourcefile

    XComponent xComponent = null;
    try
    {
        xComponent = InitDocument(aLoader,
                                PathConverter(inputFile), "_blank");
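        //loadComponentFromURL is synchronous; null means the load failed
        if (xComponent == null)
            throw new InvalidProgramException("OpenOffice could not load the file. File = " + inputFile);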

        // save/export the document
        SaveDocument(xComponent, inputFile, PathConverter(outputFile));
    }
    finally
    {
        if (xComponent != null) xComponent.dispose();
    }
}

This method starts an instance of soffice.exe, which your application communicates with through the referenced CLI DLLs.

private static void StartOpenOffice()
{
    //GetProcessesByName expects the process name without the ".exe" extension
    var ps = Process.GetProcessesByName("soffice");
    //reuse an instance that is already running
    if (ps.Length > 0)
        return;
    var p = new Process
                {
                    StartInfo =
                        {
                            Arguments = "-headless -nofirststartwizard",
                            FileName = "soffice.exe",
                            CreateNoWindow = true
                        }
                };
    var result = p.Start();

    if (result == false)
        throw new InvalidProgramException("OpenOffice failed to start.");
}

This method initializes the document instance and loads the source file.

private static XComponent InitDocument(XComponentLoader aLoader, string file, string target)
{
    var openProps = new PropertyValue[1];
    openProps[0] = new PropertyValue {Name = "Hidden", Value = new Any(true)};

    var xComponent = aLoader.loadComponentFromURL(
        file, target, 0,
        openProps);

    return xComponent;
}

This method saves the processed document to a destination file.

private static void SaveDocument(XComponent xComponent, string sourceFile, string destinationFile)
{
    var propertyValues = new PropertyValue[2];
    // Setting the flag for overwriting
    propertyValues[1] = new PropertyValue {Name = "Overwrite", Value = new Any(true)};
    // Setting the filter name
    propertyValues[0] = new PropertyValue
                            {
                                Name = "FilterName",
                                Value = new Any(ConvertExtensionToFilterType(Path.GetExtension(sourceFile)))
                            };
    ((XStorable) xComponent).storeToURL(destinationFile, propertyValues);
}

This method converts a file path to a URL format that the OpenOffice API can read.

private static string PathConverter(string file)
{
    if (string.IsNullOrEmpty(file))
        throw new ArgumentException("Null or empty path passed to OpenOffice", "file");

    return String.Format("file:///{0}", file.Replace(@"\", "/"));
}

This method returns the filter type required for conversion, based on the file extension.

public static string ConvertExtensionToFilterType(string extension)
{
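    //compare case-insensitively so ".DOCX" is handled like ".docx"
    switch ((extension ?? string.Empty).ToLowerInvariant())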
    {
        case ".doc":
        case ".docx":
        case ".txt":
        case ".rtf":
        case ".html":
        case ".htm":
        case ".xml":
        case ".odt":
        case ".wps":
        case ".wpd":
            return "writer_pdf_Export";
        case ".xls":
        case ".xlsb":
        case ".xlsx":
        case ".ods":
            return "calc_pdf_Export";
        case ".ppt":
        case ".pptx":
        case ".odp":
            return "impress_pdf_Export";

        default:
            return null;
    }
}

Well, that’s it! Just invoke the ConvertToPdf method with the input and output file names.

Abdul Munim March 30, 2011 c# open-office

Inside Object Oriented JavaScript Explained (Part – 1)

In most programming languages, like C#, C++ and Java, we use the this keyword to denote the current object we are working on. In JavaScript, we often make mistakes with the this keyword when writing object-oriented code.

In many object-oriented JavaScript tutorials and classes we are taught to use JavaScript in this OOP way, for instance:

function A (val) {
    /**
    * private field
    */
    var privateProp = "this is a private property";

    /**
    * public field
    */
    this.valueProperty = "This is a value";

    /**
    * initialization and constructor part
    */
    this.valueProperty = val;
    privateProp = "updated value with: " + val;

    /**
    * public method
    */
    this.getPrivateProperty = function () {
        return privateProp;
    }
}
var a1Obj = new A("this is a new value");
alert(a1Obj.getPrivateProperty());

So, how does the code above work?

In JavaScript, when we write

var myFuncObj = new anyFunction();

what it does internally looks something like this:

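var myFuncObj = (function() {
    //new object that inherits from anyFunction.prototype
    var returnObj = Object.create(anyFunction.prototype);
    //call anyFunction with returnObj accessible as this
    var retValue = anyFunction.call(returnObj);
    //only a returned object (or function) replaces the new object
    if (retValue !== null &&
        (typeof retValue === "object" || typeof retValue === "function")) {
        returnObj = retValue;
    }
    return returnObj;
})();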

Explanation: This is roughly how object instantiation works in JavaScript. When we apply the new keyword to a function, the JavaScript engine behaves as if the statement were wrapped in code like the above. First, a new generic JavaScript object is created. Then the engine calls the function we are instantiating with this set to the new object, which makes the object accessible inside the function through the this keyword. If the function returns an object, our variable is set to that returned object; otherwise it is set to the generic object that was created.

In our function (or class, whatever you call it), we are actually setting up properties on the object supplied by the JavaScript engine; the function is working as the constructor of that object.
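The rule about return values is easy to check. Here is a small sketch with two made-up constructors, ReturnsPrimitive and ReturnsObject: the first returns a number, which new ignores, and the second returns an object, which replaces the one new created.

```javascript
// Hypothetical constructor that returns a primitive
function ReturnsPrimitive() {
    this.kind = "from this";
    return 42; // a primitive return value is ignored by new
}

// Hypothetical constructor that returns an object
function ReturnsObject() {
    this.kind = "from this";
    return { kind: "from return" }; // an object return value wins
}

var a = new ReturnsPrimitive();
var b = new ReturnsObject();

console.log(a.kind); // from this
console.log(b.kind); // from return
console.log(a instanceof ReturnsPrimitive); // true
console.log(b instanceof ReturnsObject); // false
```

The last line shows a side effect worth knowing: an object returned explicitly is not linked to the constructor's prototype.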

Alright, now you are wondering how those private fields work, and how the public methods have access to them. It’s JavaScript closures! I will try to write a short note about closures here, but I highly recommend that you read Jim Ley’s article on closures.

Closures

In simple words, a closure keeps a function’s variables alive even after the function has returned! For instance, if you have a nested function that has access to the outer function’s variables, the JavaScript garbage collector keeps those variables alive until the nested function itself is collected by the GC.

Let me give you an example with some code:

function showHello() {
    var value = 321;
    function showAlert() {
        value += 123;
        alert(value);
    };
    //the function above is not called here
    //only the reference to the function
    //is returned
    return showAlert;
}
//invokes showHello, which returns a
//reference to the nested showAlert function
var showAlertReference = showHello();
//this will alert the source code of the showAlert function. Try it out!
alert(showAlertReference);
showAlertReference();

Explanation: We have a function showHello with a private variable and a nested function. The nested function is a closure because it references the outer scope’s members. showHello returns a reference to the nested function showAlert, which we keep in the showAlertReference variable. When we then invoke showAlertReference, it is only a reference to showAlert, and showHello has already returned. This is where the closure comes into play: the JavaScript engine keeps alive the variables used by the nested function so that they remain accessible. These variables become eligible for garbage collection only when the showAlert function itself is collected!
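Here is a small variation on the same idea that returns values instead of alerting them. The makeAdder name is made up for this sketch. It shows two things: the captured variable survives between calls, and every call to the outer function creates a fresh, independent copy of it.

```javascript
// Hypothetical variant of showHello that returns the value
// instead of showing it in an alert
function makeAdder() {
    var value = 321;
    return function add() {
        value += 123; // updates the captured variable
        return value;
    };
}

var first = makeAdder();
var second = makeAdder();

console.log(first());  // 444
console.log(first());  // 567, the same captured variable was updated
console.log(second()); // 444, second has its own copy of value
```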

Hope this helped! But you should seriously consider reading the Jim Ley article I noted above!

Back to private members

Private members are nothing but closures here! Look at the code below!

function person() {
    var _name;

    this.getName = function() {
        //_name is available here! this is closure!
        return _name;
    }
    this.setName = function(value) {
        //the same instance of _name is also available here!
        _name = value;
    }
}

var p = new person();
alert(p.getName());
p.setName("Munim");
alert(p.getName());

We are actually simulating private and public members by using closures! Again, when new person() is called, the JavaScript engine creates a generic JavaScript object and makes it accessible through the this keyword inside the person function. _name is never set on the object passed as this, so it is just a local variable of the person function. The methods getName and setName, however, are set on the object passed as this. Thanks to the closure, invoking them can still access the _name variable, which would otherwise have died when the person function returned!
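A quick sketch of what this means in practice, reusing the person function from above with a made-up second name: each instance gets its own _name, and _name never shows up as a property on the object.

```javascript
function person() {
    var _name; // local variable, captured by the two methods below

    this.getName = function () {
        return _name;
    };
    this.setName = function (value) {
        _name = value;
    };
}

var p1 = new person();
var p2 = new person();
p1.setName("Munim");
p2.setName("Jim"); // made-up second name for illustration

console.log(p1.getName()); // Munim
console.log(p2.getName()); // Jim
console.log(p1._name);     // undefined, _name is not a property
console.log(Object.keys(p1)); // only getName and setName
```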

Conclusion

Object orientation in JavaScript is unlike that of most other languages; OO in JavaScript is mainly implemented in a functional programming sort of way! Understanding closures and thinking in terms of them is highly recommended for applying OO design principles in JavaScript.

Abdul Munim September 5, 2010 oop javascript

Setup XDebug on XAMPP and Eclipse

Before proceeding with the setup, it’s worth noting down a few things, which may save you a couple of hours.

  • Your PHP version. Create a PHP file with <?php phpinfo(); ?> to get the version
  • Your Windows version (32-bit or 64-bit). Check System Properties.

Now go to the XDebug download page, where you will see Windows binaries with some version information in their names.

XDebug has different builds for different versions of PHP and the OS, and the DLL file name is formatted like:

php_xdebug-{xdebug-version}-{php-version}-{build-env}-{thread-safe/non-thread-safe}.dll
  • xdebug-version: Generally, you should use the latest build provided by XDebug. Otherwise, use your required version.
  • php-version: The PHP version, like 5.2 or 5.3.
  • build-env: Generally you should use the VC6 compiled version of XDebug, but if you are using a VC9 compiled PHP then you should use the VC9 compiled version of XDebug.
  • thread-safe/non-thread-safe: Thread-safe builds carry no marker by default, whereas “nts” is added for non-thread-safe builds. Use this if you have Zend Optimizer disabled. Generally you should use thread-safe.
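To make the naming scheme concrete, here is a small sketch that assembles a file name from the parts described above. It assumes the separators are plain hyphens and that the “nts” marker is simply left out for thread-safe builds. The xdebugDllName helper and the version numbers are made up for illustration.

```javascript
// Hypothetical helper: builds the expected XDebug DLL file name
// from the parts described in the list above.
function xdebugDllName(xdebugVersion, phpVersion, buildEnv, threadSafe) {
    var parts = ["php_xdebug", xdebugVersion, phpVersion, buildEnv];
    if (!threadSafe) {
        parts.push("nts"); // marker only present on non-thread-safe builds
    }
    return parts.join("-") + ".dll";
}

// Example version numbers, chosen only for illustration
console.log(xdebugDllName("2.1.0", "5.3", "vc9", false)); // php_xdebug-2.1.0-5.3-vc9-nts.dll
console.log(xdebugDllName("2.1.0", "5.3", "vc6", true));  // php_xdebug-2.1.0-5.3-vc6.dll
```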

Download the XDebug DLL, put it in your {xampp-folder}\php\ext folder and rename the file to php_xdebug.dll. [NOTE: {xampp-folder} is just an alias for wherever you installed XAMPP, something like C:\XAMPP if you’re using Windows.]

Now open php.ini, located in your {xampp-folder}\php folder, and search for the [XDebug] section. If you find it, put the lines below in it; otherwise add [XDebug] at the end of the file, followed by the lines below:

xdebug.remote_enable=1
xdebug.remote_host="localhost"
xdebug.remote_port=9000
xdebug.remote_handler="dbgp"

Hold on, you are not done yet. One more line is needed to tell PHP about the XDebug DLL.

If you are using a PHP version lower than 5.3, then add

zend_extension_ts={xampp-folder}\php\ext\php_xdebug.dll

If you are using PHP v5.3 or above, then add

zend_extension={xampp-folder}\php\ext\php_xdebug.dll

If you are using a custom build of PHP with --enable-debug, then add

zend_extension_debug={xampp-folder}\php\ext\php_xdebug.dll

OK! You are almost done. Now search for the [Zend] section in php.ini; if you find zend_extension_ts there, comment the line out using ;. And that’s it!

Refresh your phpinfo page and search for XDebug. If you see the XDebug settings, you have successfully configured XDebug with PHP.

This assumes you are using Eclipse v3.3 or above, which has built-in XDebug support. All you have to do is go to “Debug Configuration”, set the start URL, click “Apply” and close the window. Now click the “Debug” button on the toolbar to start debugging.

Points to note:

  • There’s an XDebug bug that makes the Apache server crash while debugging with Eclipse. The crash occurs when invalid expressions (watch items) are present. Delete all expression items when terminating or launching your debug session.
  • Always start your debug session with a PHP file, otherwise you may end up hanging on the message “Waiting for XDebug session”.

HAPPY LIFE :)

Abdul Munim August 8, 2010 eclipse debug