Data Recipes: October 2011

Since it's been such a long time since I last posted I thought I'd make this one a bit longer. It really is a condensing of a lot of things I've been working with and thinking about over the past few months.

Nearest Neighbors

The nearest neighbors problem (also known as the post-office problem) is this: Given a point X in some metric space M, assign to it the nearest neighboring point S. In other words, given a residence, assign to it the nearest post office. The K-nearest neighbors problem, which this post addresses, is just a slight generalization of that problem. Instead of just one neighbor we are looking for K neighbors.

Problem

So, we're going to use the geonames data. This is a set of nearly 8 million geo points with names, coordinates, and a bunch of other good stuff, from around the world. We would like to find, for a given point in the geonames set, the 5 nearest points (also in geonames) that are nearest to it. Should be pretty simple yeah?

Get data

The geonames data set 'allCountries.zip' can be downloaded like so:


$: wget http://download.geonames.org/export/dump/allCountries.zip

Prepare data

Since the geonames data set comes as a nice tab-separated-values (.tsv) file already it's just a matter of unzipping the package and placing it on your hdfs (you do have one of those don't you?). Do:


$: unzip allCountries.zip
$: hadoop fs -put allCountries.txt .

to unzip the package and place the tsv file into your home directory on the hadoop distributed file system.

Schema

Oh, and by the way, before we forget, the data from geonames has this pig schema:


  geonameid:         int,
  name:              chararray,
  asciiname:         chararray,
  alternatenames:    chararray,
  latitude:          double,
  longitude:         double,
  feature_class:     chararray,
  feature_code:      chararray,
  country_code:      chararray,
  cc2:               chararray,
  admin1_code:       chararray,
  admin2_code:       chararray,
  admin3_code:       chararray,
  admin4_code:       chararray,
  population:        long,
  elevation:         int,
  gtopo30:           int,
  timezone:          chararray,
  modification_date: chararray

The Algorithm

Now that we have the data we can start to play with it and think about how to solve the problem at hand. Looking at the data (use something like 'head', 'cat', 'cut', etc) we see that there are really only three fields of interest in the data: (geonameid, longitude, and latitude). All the other fields are just nice metadata which we can attach later.

Now, since we're going to be using Apache Pig to solve this problem we need to think a little bit about parallelism. One constraint is that at no time is any one point going to have access to the locations of all the other points. In other words, we will not be storing the full set of points in memory. Besides, it's 8 million points, that's kind of a lot for my poor little machine to handle.

So it's clear (right?) that we're going to have to partition the space in some way. Then, within a partition of the space, we'll need to apply a local version of the nearest neighbors algorithm. That's it really. Map and reduce. Wait, but there's one problem. What happens if we don't find all 5 neighbors for a point in a single partition? Hmmm. Well, the answer is iteration. We'll choose a small partition size to begin with and gradually increase the partition size until either the partition size is too large or all the neighbors have been found. Got it?

Recap:

(1) Partition the space

(2) Search for nearest neighbors in a single partition

(3) If all neighbors have been found, terminate; else increase partition size and repeat (1) and (2)

Implementation

For partitioning the space we're going to use Google quadkeys (http://msdn.microsoft.com/en-us/library/bb259689.aspx) since it's super easy to implement and it partitions the space nicely. This will be a java UDF for Pig that takes a (longitude, latitude, and zoom level) tuple and returns a string quadkey (the partition id).

Here's the actual java code for that. Let's call it "GetQuadkey":


package sounder.pig.geo;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/**
   See: http://msdn.microsoft.com/en-us/library/bb259689.aspx

   A Pig UDF to compute the quadkey string for a given
   (longitude, latitude, resolution) tuple.
   
 */
public class GetQuadkey extends EvalFunc< String> {
    private static final int TILE_SIZE = 256;

    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 3 || input.isNull(0) || input.isNull(1) || input.isNull(2))
            return null;

        Double longitude = (Double)input.get(0);
        Double latitude = (Double)input.get(1);
        Integer resolution = (Integer)input.get(2);

        String quadKey = quadKey(longitude, latitude, resolution);
        return quadKey;        
    }

    private static String quadKey(double longitude, double latitude, int resolution) {
        int[] pixels = pointToPixels(longitude, latitude, resolution);
        int[] tiles = pixelsToTiles(pixels[0], pixels[1]);
        return tilesToQuadKey(tiles[0], tiles[1], resolution);
    }

    /**
       Return the pixel X and Y coordinates for the given lat, lng, and resolution.
     */
    private static int[] pointToPixels(double longitude, double latitude, int resolution) {
        double x = (longitude + 180) / 360;
        double sinLatitude = Math.sin(latitude * Math.PI / 180);
        double y = 0.5 - Math.log((1 + sinLatitude) / (1 - sinLatitude)) / (4 * Math.PI);

        int mapSize = mapSize(resolution);
        int[] pixels = {(int) trim(x * mapSize + 0.5, 0, mapSize - 1), (int) trim(y * mapSize + 0.5, 0, mapSize - 1)};
        return pixels;
    }

    /**
       Convert from pixel coordinates to tile coordinates.
     */
    private static int[] pixelsToTiles(int pixelX, int pixelY) {
        int[] tiles = {pixelX / TILE_SIZE, pixelY / TILE_SIZE};
        return tiles;
    }
    
    /**
       Finally, given tile coordinates and a resolution, returns the appropriate quadkey
     */
    private static String tilesToQuadKey(int tileX, int tileY, int resolution) {
        StringBuilder quadKey = new StringBuilder();
        for (int i = resolution; i > 0; i--) {
            char digit = '0';
            int mask = 1 << (i - 1);
            if ((tileX & mask) != 0) {
                digit++;
            }
            if ((tileY & mask) != 0) {
                digit++;
                digit++;
            }
            quadKey.append(digit);
        }
        return quadKey.toString();
    }
    
    /**
       Ensure input value is within minval and maxval
     */
    private static double trim(double n, double minVal, double maxVal) {
        return Math.min(Math.max(n, minVal), maxVal);
    }

    /**
       Width of the map, in pixels, at the given resolution
     */
    public static int mapSize(int resolution) {
        return TILE_SIZE << resolution;
    }
}

Next we need to have another java udf that operates on all the points in a partition. Let's call it "NearestNeighbors". Here's a naive implementation of that:


package sounder.pig.geo.nearestneighbors;

import java.io.IOException;
import java.util.PriorityQueue;
import java.util.Iterator;
import java.util.Comparator;

import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;

public class NearestNeighbors extends EvalFunc< DataBag> {
    private static TupleFactory tupleFactory = TupleFactory.getInstance();
    private static BagFactory bagFactory = BagFactory.getInstance();
    
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.isNull(0) || input.isNull(1))
            return null;

        Long k = (Long)input.get(0);
        DataBag points = (DataBag)input.get(1);      // {(id,lng,lat,{(n1,n1_dist)...})}
        DataBag result = bagFactory.newDefaultBag(); 

        for (Tuple pointA : points) {
            DataBag neighborsBag = (DataBag)pointA.get(3);
            if (neighborsBag.size() < k) {
                PriorityQueue< Tuple> neighbors = toDistanceSortedQueue(k.intValue(), neighborsBag);
                Double x1 = Math.toRadians((Double)pointA.get(1));
                Double y1 = Math.toRadians((Double)pointA.get(2));
                
                for (Tuple pointB : points) {
                    if (pointA!=pointB) {
                        Double x2 = Math.toRadians((Double)pointB.get(1));
                        Double y2 = Math.toRadians((Double)pointB.get(2));
                        Double distance = haversineDistance(x1,y1,x2,y2);

                        // Add this point as a neighbor if pointA has no neighbors
                        if (neighbors.size()==0) {
                            Tuple newNeighbor = tupleFactory.newTuple(2);
                            newNeighbor.set(0, pointB.get(0));
                            newNeighbor.set(1, distance);
                            neighbors.add(newNeighbor);
                        }

                        Tuple furthestNeighbor = neighbors.peek();
                        Double neighborDist = (Double)furthestNeighbor.get(1);
                        if (distance < neighborDist) {
                            Tuple newNeighbor = tupleFactory.newTuple(2);
                            newNeighbor.set(0, pointB.get(0));
                            newNeighbor.set(1, distance);
                            
                            if (neighbors.size() < k) {                                
                                neighbors.add(newNeighbor);
                            } else {
                                neighbors.poll(); // remove farthest
                                neighbors.add(newNeighbor);
                            }
                        }
                    }
                }
                // Should now have a priorityqueue containing a sorted list of neighbors
                // create new result tuple and add to result bag
                Tuple newPointA = tupleFactory.newTuple(4);
                newPointA.set(0, pointA.get(0));
                newPointA.set(1, pointA.get(1));
                newPointA.set(2, pointA.get(2));
                newPointA.set(3, fromQueue(neighbors));
                result.add(newPointA);
            } else {
                result.add(pointA);
            }                    
        }
        return result;
    }

    // Ensure sorted by descending
    private PriorityQueue< Tuple> toDistanceSortedQueue(int k, DataBag bag) {
        PriorityQueue< Tuple> q = new PriorityQueue< Tuple>(k,
                                                          new Comparator< Tuple>() {
                                                              public int compare(Tuple t1, Tuple t2) {
                                                                  try {
                                                                      Double dist1 = (Double)t1.get(1);
                                                                      Double dist2 = (Double)t2.get(1);
                                                                      return dist2.compareTo(dist1);
                                                                  } catch (ExecException e) {
                                                                      throw new RuntimeException("Error comparing tuples", e);
                                                                  }
                                                              };
                                                          });
        for (Tuple tuple : bag) q.add(tuple);            
        return q;
    }

    private DataBag fromQueue(PriorityQueue< Tuple> q) {
        DataBag bag = bagFactory.newDefaultBag();
        for (Tuple tuple : q) bag.add(tuple);
        return bag;
    }
    
    private Double haversineDistance(Double x1, Double y1, Double x2, Double y2) {
        double a = Math.pow(Math.sin((x2-x1)/2), 2)
            + Math.cos(x1) * Math.cos(x2) * Math.pow(Math.sin((y2-y1)/2), 2);

        return (2 * Math.asin(Math.min(1, Math.sqrt(a))));
    }
}

The details of the NearestNeighbors UDF aren't super important and it's mostly pretty clear what's going on. Just know that it operates on a bag of points as input and returns a bag of points as output that has the same schema. This is really important since we're going to be iterating.

Then we're on to the Pig part, hurray! Since Pig doesn't have any built in support for iteration, I chose to use Jruby (because it's awesome) and pig's "PigServer" java class to do all the work. Here's what the jruby runner looks like (it's kind of a lot so don't get scared):


#!/usr/bin/env jruby

require 'java'

#
# You might consider changing this to point to where you have
# pig installed...
#
jar  = "/usr/lib/pig/pig-0.8.1-cdh3u1-core.jar"
conf = "/etc/hadoop/conf"

$CLASSPATH << conf
require jar

import org.apache.pig.ExecType
import org.apache.pig.PigServer
import org.apache.pig.FuncSpec

class NearestNeighbors

  attr_accessor :points, :k, :min_zl, :runmode

  #
  # Create a new nearest neighbors instance
  # for the given points, k neighbors to find,
  # a optional minimum zl (1-21) and optional
  # hadoop run mode (local or mapreduce)
  #
  def initialize points, k, min_zl=20, runmode='mapreduce'
    @points  = points
    @k       = k
    @min_zl  = min_zl
    @runmode = runmode
  end

  #
  # Run the nearest neighbors algorithm
  #
  def run
    start_pig_server
    register_jars_and_functions
    run_algorithm
    stop_pig_server
  end

  #
  # Actually runs all the pig queries for
  # the algorithm. Stops if all neighbors
  # have been found or if min_zl is reached
  #
  def run_algorithm
    start_nearest_neighbors(points, k, 22)
    if run_nearest_neighbors(k, 22)
      21.downto(min_zl) do |zl|
        iterate_nearest_neighbors(k, zl)
        break unless run_nearest_neighbors(k,zl)
      end
    end
  end

  #
  # Registers algorithm initialization queries
  #
  def start_nearest_neighbors(input, k, zl)
    @pig.register_query(PigQueries.load_points(input))
    @pig.register_query(PigQueries.generate_initial_quadkeys(zl))
  end

  #
  # Registers algorithm iteration queries
  #
  def iterate_nearest_neighbors k, zl
    @pig.register_query(PigQueries.load_prior_done(zl))
    @pig.register_query(PigQueries.load_prior_not_done(zl))
    @pig.register_query(PigQueries.union_priors(zl))
    @pig.register_query(PigQueries.trim_quadkey(zl))
  end

  #
  # Runs one iteration of the algorithm
  #
  def run_nearest_neighbors(k, zl)
    @pig.register_query(PigQueries.group_by_quadkey(zl))
    @pig.register_query(PigQueries.nearest_neighbors(k, zl))
    @pig.register_query(PigQueries.split_results(k, zl))

    if !@pig.exists_file("done#{zl}")
      @pig.store("done#{zl}", "done#{zl}")
      not_done = @pig.store("not_done#{zl}", "not_done#{zl}")
      not_done.get_results.has_next?
    else
      true
    end
  end

  #
  # Start a new pig server with the specified run mode
  #
  def start_pig_server
    @pig = PigServer.new(runmode)
  end

  #
  # Stop the running pig server
  #
  def stop_pig_server
    @pig.shutdown
  end

  #
  # Register the jar that contains the nearest neighbors
  # and quadkeys udfs and define functions for them.
  #
  def register_jars_and_functions
    @pig.register_jar('../../../../udf/target/sounder-1.0-SNAPSHOT.jar')
    @pig.register_function('GetQuadkey',       FuncSpec.new('sounder.pig.geo.GetQuadkey()'))
    @pig.register_function('NearestNeighbors', FuncSpec.new('sounder.pig.geo.nearestneighbors.NearestNeighbors()'))
  end

  #
  # A simple class to contain the pig queries
  #
  class PigQueries

    #
    # Load the geonames points. Obviously,
    # this should be modified to accept a
    # variable schema.
    #
    def self.load_points geonames
      "points = LOAD '#{geonames}' AS (
     geonameid:         int,
     name:              chararray,
     asciiname:         chararray,
     alternatenames:    chararray,
     latitude:          double,
     longitude:         double,
     feature_class:     chararray,
     feature_code:      chararray,
     country_code:      chararray,
     cc2:               chararray,
     admin1_code:       chararray,
     admin2_code:       chararray,
     admin3_code:       chararray,
     admin4_code:       chararray,
     population:        long,
     elevation:         int,
     gtopo30:           int,
     timezone:          chararray,
     modification_date: chararray
   );"
    end

    #
    # Query to generate quadkeys at the specified zoom level
    #
    def self.generate_initial_quadkeys(zl)
      "projected#{zl} = FOREACH points GENERATE GetQuadkey(longitude, latitude, #{zl}) AS quadkey, geonameid, longitude, latitude, {};"
    end

    #
    # Load previous iteration's done points
    #
    def self.load_prior_done(zl)
      "prior_done#{zl+1} = LOAD 'done#{zl+1}/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );"
    end

    #
    # Load previous iteration's not done points
    #
    def self.load_prior_not_done(zl)
      "prior_not_done#{zl+1} = LOAD 'not_done#{zl+1}/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );"
    end

    #
    # Union the previous iterations points that are done
    # with the points that are not done
    #
    def self.union_priors zl
      "prior_neighbors#{zl+1} = UNION prior_done#{zl+1}, prior_not_done#{zl+1};"
    end

    #
    # Chop off one character of precision from the existing
    # quadkey to go one zl down.
    #
    def self.trim_quadkey zl
      "projected#{zl} = FOREACH prior_neighbors#{zl+1} GENERATE
     SUBSTRING(quadkey, 0, #{zl}) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;"
    end

    #
    # Group the points by quadkey
    #
    def self.group_by_quadkey zl
      "grouped#{zl}  = FOREACH (GROUP projected#{zl} BY quadkey) GENERATE group AS quadkey, projected#{zl}.(geonameid, longitude, latitude, $4) AS points_bag;"
    end

    #
    # Run the nearest neighbors udf on all the points for
    # a given quadkey
    #
    def self.nearest_neighbors(k, zl)
      "nearest_neighbors#{zl} = FOREACH grouped#{zl} GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(#{k}l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );"
    end

    #
    # Split the results into done and not_done relations
    # The algorithm is done when 'not_done' contains
    # no more tuples.
    #
    def self.split_results(k, zl)
      "SPLIT nearest_neighbors#{zl} INTO done#{zl} IF COUNT(neighbors) >= #{k}l, not_done#{zl} IF COUNT(neighbors) < #{k}l;"
    end
  end

end

NearestNeighbors.new(ARGV[0], ARGV[1]).run

Call this file "nearest_neighbors.rb". The idea here is that we register some basic pig queries to do the initialization of the algorithm and the iterations. These queries are run over and over until either the "not_done" relation contains no more elements or the minimum zoom level has been reached. Note that a small zoom level means a big partition of space.

Run it!

I think we're finally ready to run it. Let K=5 and the min zoom level (zl) be 10. Then just run:


$: ./nearest_neighbors.rb allCountries.txt 5 10

To kick it off. The output will live in 'done10' (in your home directory on the hdfs) and all the ones that couldn't find their neighbors (poor guys) are left in 'not_done10'. Let's take a look:


$: hadoop fs -cat done10/part* | head 
  22321231 5874132 -164.73333  67.83333 {(5870713,0.0020072489956274512),(5864687,0.001833343068439346),(5879702,0.0017344751302650937),(5879702,0.0017344751302650937),(5866444,9.775849082653818E-4)}
 133223322 2631726 -17.98263   66.52688 {(2631999,8.959503690090641E-4),(2629528,8.491922360779314E-4),(2629528,8.491922360779314E-4),(2630727,3.840018669838177E-4),(2630727,3.840018669838177E-4)}
 133223322 2631999 -18.01884   66.56514 {(2631726,8.959503690090641E-4),(2630727,6.687797018366464E-4),(2629528,4.889879974917344E-5),(2630727,6.687797018366464E-4),(2629528,4.889879974917344E-5)}
 200103201 5874186 -165.39611  64.74333 {(5864454,0.001422335354026523),(5864454,0.001422335354026523),(5867287,0.0013175743301195593),(5867186,0.0010114669397588846),(5867287,0.0013175743301195593)}
 200103201 5878614 -165.3      64.76667 {(5874186,0.0017231123142588567),(5867287,0.0012670407374788086),(5864454,0.0012205595078534047),(5867287,0.0012670407374788086),(5864454,0.0012205595078534047)}
 200103203 5875461 -165.55111  64.53889 {(5865814,0.0028283599772347947),(5874676,0.0025819291222640857),(5876108,0.001901914079309611),(5869354,0.0016504815389672197),(5869180,0.0025319553109125676)}
 200103300 5861248 -164.27639  64.69278 {(5880635,9.627541402483858E-4),(5878642,8.535957131129946E-4),(5878642,8.535957131129946E-4),(5876626,6.598180173900259E-4),(5876626,6.598180173900259E-4)}
 200103300 5876626 -164.27111  64.65389 {(5880635,8.246219806226404E-4),(5861248,6.598180173900259E-4),(5878642,3.7418928038080964E-4),(5861248,6.598180173900259E-4),(5878642,3.7418928038080964E-4)}
 200233011 5870290 -170.3      57.21667 {(5867100,0.00113848360324883),(7117548,0.0011082333731440464),(7117548,0.0011082333731440464),(5865746,0.001071745830095263),(5865746,0.001071745830095263)}
 200233123 5873595 -169.48056  56.57778 {(7275749,0.0010608526185899635),(5878477,5.242632532229457E-4),(5875162,3.39969838478673E-4),(5878477,5.242632532229457E-4),(5875162,3.39969838478673E-4)}

Hurray!

Pig Code

Let's just take a look at the pig code that actually got executed.


points = LOAD 'allCountries.txt' AS (
     geonameid:         int,
     name:              chararray,
     asciiname:         chararray,
     alternatenames:    chararray,
     latitude:          double,
     longitude:         double,
     feature_class:     chararray,
     feature_code:      chararray,
     country_code:      chararray,
     cc2:               chararray,
     admin1_code:       chararray,
     admin2_code:       chararray,
     admin3_code:       chararray,
     admin4_code:       chararray,
     population:        long,
     elevation:         int,
     gtopo30:           int,
     timezone:          chararray,
     modification_date: chararray
   );
projected22 = FOREACH points GENERATE GetQuadkey(longitude, latitude, 22) AS quadkey, geonameid, longitude, latitude, {};
grouped22  = FOREACH (GROUP projected22 BY quadkey) GENERATE group AS quadkey, projected22.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors22 = FOREACH grouped22 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors22 INTO done22 IF COUNT(neighbors) >= 5l, not_done22 IF COUNT(neighbors) < 5l;
prior_done22 = LOAD 'done22/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done22 = LOAD 'not_done22/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors22 = UNION prior_done22, prior_not_done22;
projected21 = FOREACH prior_neighbors22 GENERATE
     SUBSTRING(quadkey, 0, 21) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped21  = FOREACH (GROUP projected21 BY quadkey) GENERATE group AS quadkey, projected21.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors21 = FOREACH grouped21 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors21 INTO done21 IF COUNT(neighbors) >= 5l, not_done21 IF COUNT(neighbors) < 5l;
prior_done21 = LOAD 'done21/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done21 = LOAD 'not_done21/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors21 = UNION prior_done21, prior_not_done21;
projected20 = FOREACH prior_neighbors21 GENERATE
     SUBSTRING(quadkey, 0, 20) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped20  = FOREACH (GROUP projected20 BY quadkey) GENERATE group AS quadkey, projected20.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors20 = FOREACH grouped20 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors20 INTO done20 IF COUNT(neighbors) >= 5l, not_done20 IF COUNT(neighbors) < 5l;
prior_done20 = LOAD 'done20/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done20 = LOAD 'not_done20/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors20 = UNION prior_done20, prior_not_done20;
projected19 = FOREACH prior_neighbors20 GENERATE
     SUBSTRING(quadkey, 0, 19) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped19  = FOREACH (GROUP projected19 BY quadkey) GENERATE group AS quadkey, projected19.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors19 = FOREACH grouped19 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors19 INTO done19 IF COUNT(neighbors) >= 5l, not_done19 IF COUNT(neighbors) < 5l;
prior_done19 = LOAD 'done19/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done19 = LOAD 'not_done19/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors19 = UNION prior_done19, prior_not_done19;
projected18 = FOREACH prior_neighbors19 GENERATE
     SUBSTRING(quadkey, 0, 18) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped18  = FOREACH (GROUP projected18 BY quadkey) GENERATE group AS quadkey, projected18.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors18 = FOREACH grouped18 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors18 INTO done18 IF COUNT(neighbors) >= 5l, not_done18 IF COUNT(neighbors) < 5l;
prior_done18 = LOAD 'done18/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done18 = LOAD 'not_done18/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors18 = UNION prior_done18, prior_not_done18;
projected17 = FOREACH prior_neighbors18 GENERATE
     SUBSTRING(quadkey, 0, 17) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped17  = FOREACH (GROUP projected17 BY quadkey) GENERATE group AS quadkey, projected17.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors17 = FOREACH grouped17 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors17 INTO done17 IF COUNT(neighbors) >= 5l, not_done17 IF COUNT(neighbors) < 5l;
prior_done17 = LOAD 'done17/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done17 = LOAD 'not_done17/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors17 = UNION prior_done17, prior_not_done17;
projected16 = FOREACH prior_neighbors17 GENERATE
     SUBSTRING(quadkey, 0, 16) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped16  = FOREACH (GROUP projected16 BY quadkey) GENERATE group AS quadkey, projected16.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors16 = FOREACH grouped16 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors16 INTO done16 IF COUNT(neighbors) >= 5l, not_done16 IF COUNT(neighbors) < 5l;
prior_done16 = LOAD 'done16/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done16 = LOAD 'not_done16/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors16 = UNION prior_done16, prior_not_done16;
projected15 = FOREACH prior_neighbors16 GENERATE
     SUBSTRING(quadkey, 0, 15) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped15  = FOREACH (GROUP projected15 BY quadkey) GENERATE group AS quadkey, projected15.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors15 = FOREACH grouped15 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors15 INTO done15 IF COUNT(neighbors) >= 5l, not_done15 IF COUNT(neighbors) < 5l;
prior_done15 = LOAD 'done15/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done15 = LOAD 'not_done15/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors15 = UNION prior_done15, prior_not_done15;
projected14 = FOREACH prior_neighbors15 GENERATE
     SUBSTRING(quadkey, 0, 14) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped14  = FOREACH (GROUP projected14 BY quadkey) GENERATE group AS quadkey, projected14.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors14 = FOREACH grouped14 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors14 INTO done14 IF COUNT(neighbors) >= 5l, not_done14 IF COUNT(neighbors) < 5l;
prior_done14 = LOAD 'done14/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done14 = LOAD 'not_done14/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors14 = UNION prior_done14, prior_not_done14;
projected13 = FOREACH prior_neighbors14 GENERATE
     SUBSTRING(quadkey, 0, 13) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped13  = FOREACH (GROUP projected13 BY quadkey) GENERATE group AS quadkey, projected13.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors13 = FOREACH grouped13 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors13 INTO done13 IF COUNT(neighbors) >= 5l, not_done13 IF COUNT(neighbors) < 5l;
prior_done13 = LOAD 'done13/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done13 = LOAD 'not_done13/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors13 = UNION prior_done13, prior_not_done13;
projected12 = FOREACH prior_neighbors13 GENERATE
     SUBSTRING(quadkey, 0, 12) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped12  = FOREACH (GROUP projected12 BY quadkey) GENERATE group AS quadkey, projected12.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors12 = FOREACH grouped12 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors12 INTO done12 IF COUNT(neighbors) >= 5l, not_done12 IF COUNT(neighbors) < 5l;
prior_done12 = LOAD 'done12/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done12 = LOAD 'not_done12/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors12 = UNION prior_done12, prior_not_done12;
projected11 = FOREACH prior_neighbors12 GENERATE
     SUBSTRING(quadkey, 0, 11) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped11  = FOREACH (GROUP projected11 BY quadkey) GENERATE group AS quadkey, projected11.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors11 = FOREACH grouped11 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors11 INTO done11 IF COUNT(neighbors) >= 5l, not_done11 IF COUNT(neighbors) < 5l;
prior_done11 = LOAD 'done11/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_not_done11 = LOAD 'not_done11/part*' AS (
     quadkey:   chararray,
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
prior_neighbors11 = UNION prior_done11, prior_not_done11;
projected10 = FOREACH prior_neighbors11 GENERATE
     SUBSTRING(quadkey, 0, 10) AS quadkey,
     geonameid AS geonameid,
     longitude AS longitude,
     latitude  AS latitude,
     neighbors AS neighbors;
grouped10  = FOREACH (GROUP projected10 BY quadkey) GENERATE group AS quadkey, projected10.(geonameid, longitude, latitude, $4) AS points_bag;
nearest_neighbors10 = FOREACH grouped10 GENERATE quadkey AS quadkey, FLATTEN(NearestNeighbors(5l, points_bag)) AS (
     geonameid: int,
     longitude: double,
     latitude:  double,
     neighbors: bag {t:tuple(neighbor_id:int, distance:double)}
   );
SPLIT nearest_neighbors10 INTO done10 IF COUNT(neighbors) >= 5l, not_done10 IF COUNT(neighbors) < 5l;

So you can see now why, with iteration, it's a good idea to generate as much code as possible.

And we're done :)

All the code for this is on github in the Sounder repo (along with tons of other Apache Pig examples).

Data Recipes

mathjax

Wednesday, October 26, 2011

Nearest Neighbors With Apache Pig and Jruby