Building a simple ATOM crawler with Atom Nuke, Netbeans 7.2 and Java

Atom Nuke is a new project created by John Hopper, who is also the creator of Atom Hopper and Repose. Atom Nuke, or “Nuke” as we call it, is a collection of utilities built on a simple and fast ATOM implementation that aims for a minimal dependency footprint. One of the many interesting things about Atom Nuke is that it can be used by both Python and Java developers, and plans are in the works to support JavaScript as well. Today I’m going to show off just one of its features: crawling an ATOM feed.


Note: This code sample was written against the 0.9.2 version of Nuke.

Nuke has a default built-in HTTPClient that makes it easy to build a client to poll one or more ATOM feeds. The first thing you will want to do is set up Atom Hopper and populate it with some sample feed data. You can find out how to set up Atom Hopper from the wiki or by following one of my blog articles. For this demo I’m going to set up Atom Hopper with two feeds. My atom-server.cfg.xml file looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<atom-hopper-config xmlns="http://atomhopper.org/atom/hopper-config/v1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://atomhopper.org/atom/hopper-config/v1.0 ./../../config/atom-hopper-config.xsd">
    <defaults>
        <author name="Atom Hopper" />
    </defaults>

    <host domain="localhost:8080" scheme="http" />

    <workspace title="Testing Namespace" resource="/sample/">
        <categories-descriptor reference="workspace-categories-descriptor" />

        <feed title="Testing Feed" resource="/feed1">
            <publisher reference="hibernate-feed-publisher" />
            <feed-source reference="hibernate-feed-source" />
        </feed>
    </workspace>
    
    <workspace title="Testing Namespace" resource="/sample/">
        <categories-descriptor reference="workspace-categories-descriptor" />

        <feed title="Testing Feed" resource="/feed2">
            <publisher reference="hibernate-feed-publisher" />
            <feed-source reference="hibernate-feed-source" />
        </feed>
    </workspace>    
</atom-hopper-config>

With Atom Hopper running, put a few ATOM entries into both feeds. We will come back to Atom Hopper in a moment.
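If you don't have a tool handy for adding entries, one option is to POST a small Atom entry document to the feed URL. Here is a minimal sketch using only the JDK. The AtomEntryPoster class and its method names are made up for this example, and it assumes Atom Hopper accepts a POST with a Content-Type of application/atom+xml at the feed URLs configured above and fills in the entry's id and updated fields itself.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for this article; it is not part of Atom Hopper or Nuke.
final class AtomEntryPoster {

    // Escapes the characters that are unsafe inside XML text and attribute values.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Builds a small Atom entry document with a title, category terms and text content.
    static String buildEntry(String title, String content, String... categoryTerms) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<entry xmlns=\"http://www.w3.org/2005/Atom\">\n");
        xml.append("  <title type=\"text\">").append(escapeXml(title)).append("</title>\n");
        for (String term : categoryTerms) {
            xml.append("  <category term=\"").append(escapeXml(term)).append("\" />\n");
        }
        xml.append("  <content type=\"text\">").append(escapeXml(content)).append("</content>\n");
        xml.append("</entry>\n");
        return xml.toString();
    }

    // Sends the entry to the feed URL and returns the HTTP status code.
    static int post(String feedUrl, String entryXml) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(feedUrl).toURL().openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        // Assumption: the server expects the Atom media type for new entries.
        connection.setRequestProperty("Content-Type", "application/atom+xml");
        try (OutputStream out = connection.getOutputStream()) {
            out.write(entryXml.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String entry = buildEntry("This is the title", "Hello World", "MyCategory 1", "MyCategory 2");
        // Assumes Atom Hopper is running locally with the configuration shown above.
        int status = post("http://localhost:8080/sample/feed1/", entry);
        System.out.println("HTTP status: " + status);
    }
}
```

Change the URL to http://localhost:8080/sample/feed2/ to put an entry into the second feed.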

Start up NetBeans 7.2 and create a new Maven Java Application called SimpleCrawler.

Add Nuke to the project's dependencies. You can do this by adding the public Rackspace repository to NetBeans and then adding the following dependency:

<dependencies>
  <dependency>
    <groupId>org.atomnuke</groupId>
    <artifactId>nuke</artifactId>
    <version>0.9.2-SNAPSHOT</version>
  </dependency>
</dependencies>

The first step is to create a SimpleListener class that implements the AtomListener interface. To keep things simple, I’m going to listen for new feed pages and then cycle through the entries each page contains. My SimpleListener.java file looks like this:

package com.giantflyingsaucer.simplecrawler.listener;

import org.atomnuke.atom.model.Category;
import org.atomnuke.atom.model.Entry;
import org.atomnuke.atom.model.Feed;
import org.atomnuke.listener.AtomListener;
import org.atomnuke.listener.AtomListenerException;
import org.atomnuke.listener.AtomListenerResult;
import org.atomnuke.listener.ListenerResult;
import org.atomnuke.task.context.TaskContext;
import org.atomnuke.task.lifecycle.DestructionException;
import org.atomnuke.task.lifecycle.InitializationException;

public class SimpleListener implements AtomListener {

    public ListenerResult entry(Entry entry) throws AtomListenerException {
        return AtomListenerResult.ok();
    }

    public ListenerResult feedPage(Feed page) throws AtomListenerException {
        System.out.println(page.entries().size() + " entries in this feed page");
        
        for(Entry entry : page.entries()) {
            System.out.println("-----> Incoming Entry <-----");
            System.out.println("Title: " + entry.title().toString());
            System.out.println("Content: " + entry.content().toString());
            for(Category category : entry.categories()) {
                if(category.term() != null) {
                    System.out.println("Category Term: " + category.term());
                }
                if(category.label() != null) {
                    System.out.println("Category Label: " + category.label());
                }
            }
            System.out.println("----------------------------");            
        }
        
        return AtomListenerResult.ok();
    }

    public void init(TaskContext tc) throws InitializationException {
    }

    public void destroy(TaskContext tc) throws DestructionException {
    }
}

We now have the listener. Next I’ll modify the App.java file to start up the NukeKernel and a Task that follows a local Atom Hopper feed. The task polls the Atom Hopper feed every 5 seconds for new updates; if none are found, it waits another 5 seconds. The polling interval is completely configurable (milliseconds, seconds, hours, etc.), and if not specified it defaults to 1 minute. After 20 seconds I’ll simply destroy the kernel and shut down.
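If the poll, wait, then shut down pattern is new to you, here is a rough sketch of the same idea using only the JDK's ScheduledExecutorService. This is not how Nuke is implemented as far as I know; the PollingSketch class and its follow, pollCount and destroy methods are names I made up to mirror the calls used in App.java.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in that only illustrates the polling pattern; it is not Nuke code.
final class PollingSketch {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final AtomicInteger pollCount = new AtomicInteger();

    // Runs the poll action right away, then again after each interval has elapsed.
    void follow(Runnable pollAction, long interval, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            pollCount.incrementAndGet();
            pollAction.run();
        }, 0, interval, unit);
    }

    // Number of polls started so far.
    int pollCount() {
        return pollCount.get();
    }

    // Stops scheduling new polls and waits briefly for a running poll to finish.
    void destroy() throws InterruptedException {
        scheduler.shutdownNow();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        PollingSketch sketch = new PollingSketch();
        // Poll every 5 seconds, as the task in App.java does.
        sketch.follow(() -> System.out.println("Polling feed..."), 5, TimeUnit.SECONDS);
        // Let it run for 20 seconds, then shut everything down.
        Thread.sleep(20000);
        sketch.destroy();
        System.out.println("Polls made: " + sketch.pollCount());
    }
}
```

One thing to keep in mind with this sketch: if the poll action throws an exception, scheduleWithFixedDelay stops running it, so a real poller would want to catch and log errors inside the action.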

package com.giantflyingsaucer.simplecrawler;

import com.giantflyingsaucer.simplecrawler.listener.SimpleListener;
import java.util.concurrent.TimeUnit;
import org.atomnuke.Nuke;
import org.atomnuke.NukeKernel;
import org.atomnuke.source.crawler.FeedCrawlerSourceFactory;
import org.atomnuke.task.Task;
import org.atomnuke.util.TimeValue;

public class App {

    public static void main(String[] args) throws Exception {
        final SimpleListener listener = new SimpleListener();

        final Nuke nukeKernel = new NukeKernel();

        final FeedCrawlerSourceFactory crawlerFactory = new FeedCrawlerSourceFactory();
        final Task task = nukeKernel.follow(crawlerFactory.newCrawlerSource("MyFeed", "http://localhost:8080/sample/feed1/"),
                new TimeValue(5, TimeUnit.SECONDS));

        task.addListener(listener);

        nukeKernel.start();
        
        Thread.sleep(20000);
        
        nukeKernel.destroy();
    }
}

Start up Atom Hopper and populate some entries into it if you haven’t already. Then run the SimpleCrawler project and examine the results. While SimpleCrawler is running, add a couple more ATOM entries to Atom Hopper.

Results:

2 entries in this feed page
-----> Incoming Entry <-----
Title: Another gold for athelete
Content: More gold medals were handed out today at the London 2012 Olympics.
Category Term: 2012 Olympics
----------------------------
-----> Incoming Entry <-----
Title: This is the title
Content: Hello World
Category Term: MyCategory 1
Category Term: MyCategory 2
Category Term: MyCategory 3
----------------------------
0 entries in this feed page
0 entries in this feed page
1 entries in this feed page
-----> Incoming Entry <-----
Title: Star Trek News
Content: The Borg are coming!  The Borg are coming!
Category Term: Delta Quadrant
----------------------------

You can see I populated two initial ATOM entries, then added a new one about 15 seconds later.
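Notice that each poll reports only the entries that are new since the previous poll, which is why the quiet polls print 0 entries. I haven't dug into how Nuke's crawler keeps track of its position, so treat the following as a sketch of the general idea only: a hypothetical NewEntryTracker class that remembers the entry IDs it has already handed out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; Nuke's crawler may track its position differently.
final class NewEntryTracker {

    private final Set<String> seenIds = new HashSet<>();

    // Returns the IDs from this poll that were not returned by any earlier poll.
    List<String> newEntries(List<String> polledIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : polledIds) {
            // add() returns false when the ID has already been seen.
            if (seenIds.add(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        NewEntryTracker tracker = new NewEntryTracker();
        List<String> feed = new ArrayList<>(Arrays.asList("entry-1", "entry-2"));

        // First poll: both entries are new.
        System.out.println(tracker.newEntries(feed).size() + " entries in this feed page");
        // Second poll: nothing has changed.
        System.out.println(tracker.newEntries(feed).size() + " entries in this feed page");
        // A third entry arrives before the next poll.
        feed.add("entry-3");
        System.out.println(tracker.newEntries(feed).size() + " entries in this feed page");
    }
}
```

Running this prints 2, 0 and 1 entries, the same shape as the crawler output above.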

What if I need to monitor two or more ATOM feeds? That’s easy with Nuke: simply add another Task. The modified App.java code looks like this:

package com.giantflyingsaucer.simplecrawler;

import com.giantflyingsaucer.simplecrawler.listener.SimpleListener;
import java.util.concurrent.TimeUnit;
import org.atomnuke.Nuke;
import org.atomnuke.NukeKernel;
import org.atomnuke.source.crawler.FeedCrawlerSourceFactory;
import org.atomnuke.task.Task;
import org.atomnuke.util.TimeValue;

public class App {

    public static void main(String[] args) throws Exception {
        final SimpleListener listener = new SimpleListener();

        final Nuke nukeKernel = new NukeKernel();

        final FeedCrawlerSourceFactory crawlerFactory = new FeedCrawlerSourceFactory();

        final Task task1 = nukeKernel.follow(crawlerFactory.newCrawlerSource("MyFeed1", "http://localhost:8080/sample/feed1/"),
                new TimeValue(3, TimeUnit.SECONDS));
        final Task task2 = nukeKernel.follow(crawlerFactory.newCrawlerSource("MyFeed2", "http://localhost:8080/sample/feed2/"),
                new TimeValue(8, TimeUnit.SECONDS));        

        task1.addListener(listener);
        task2.addListener(listener);

        nukeKernel.start();
        
        Thread.sleep(20000);
        
        nukeKernel.destroy();
    }
}

Results with two tasks monitoring two feeds (one polling every 3 seconds, the other polling every 8 seconds):

3 entries in this feed page
-----> Incoming Entry <-----
Title: Star Trek News
Content: The Borg are coming!  The Borg are coming!
Category Term: Delta Quadrant
----------------------------
-----> Incoming Entry <-----
Title: Another gold for athelete
Content: More gold medals were handed out today at the London 2012 Olympics.
Category Term: 2012 Olympics
----------------------------
-----> Incoming Entry <-----
Title: This is the title
Content: Hello World
Category Term: MyCategory 1
Category Term: MyCategory 2
Category Term: MyCategory 3
----------------------------
0 entries in this feed page
1 entries in this feed page
-----> Incoming Entry <-----
Title: This is feed #2
Content: I was posted on feed number two.
Category Term: Misc. Info
----------------------------
0 entries in this feed page
0 entries in this feed page
0 entries in this feed page
0 entries in this feed page

Of course, I could have created a second listener class customized to the second feed if I had wanted to.
