GO’CIRCUIT
A project by Petar Maymounkov.

Trending blogs: Real-time multi-stage map-reduce

 
This is another advanced tutorial that is best read after the Hello, world! tutorial. It builds a simple real-time “trending blogs” application, which demonstrates an in-memory map-reduce-aggregate stream processing filter. The application consumes events from the Tumblr Firehose stream. To use this application with another service, such as Twitter or Wikipedia, you will need to replace the Tumblr Firehose driver with your own.

Quick start

The build and deploy sequence for this tutorial is the same as for all others:

Overview

The architecture of the trending blog app is illustrated below.

The Tumblr Firehose exposes a stream of events that indicate user actions taken on the Tumblr site. We are interested in two kinds of incoming events. CreatePost events announce newly created blog posts on Tumblr; each such event delivers the ID of the new post, as well as the ID of the blog it belongs to. Like events indicate that someone liked an existing post; these events carry the blog ID and post ID of the post being liked.

We aim to keep a live ranklist of at most 10 blogs whose posts have received the highest number of “likes” over the past minute. This top-ten ranklist is then exposed to external clients via a WebSocket protocol. The intention is that a browser UI listens to the continuously changing top-ten list and displays it to the user live.

Architecture

The Tumblr Firehose, built around Apache Kafka, is designed to serve multiple independent HTTP clients that share credentials. It dynamically partitions the stream of live events roughly evenly across all participating clients.

Our app architecture consists of three sets of workers, each with a different purpose. Each “mapper” connects to the Tumblr Firehose and consumes events greedily as they arrive. The mappers forward each event to one of several “reducer” workers, each of which is responsible for maintaining the like count of a subset of the blog ID space.
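The routing step can be sketched as follows. This is an illustration under stated assumptions rather than the code in trend/x: the Event type, shardFor and route are hypothetical names, and hashing the blog ID modulo the number of reducers is just one simple way to split the blog ID space.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// Event is a hypothetical, simplified view of a Firehose event.
type Event struct {
	Kind   string // "CreatePost" or "Like"
	BlogID int64
	PostID int64
}

// shardFor deterministically maps a blog ID to one of n reducers,
// so that all events for a given blog reach the same reducer.
func shardFor(blogID int64, n int) int {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], uint64(blogID))
	h := fnv.New64a()
	h.Write(buf[:]) // Write on a hash.Hash never returns an error
	return int(h.Sum64() % uint64(n))
}

// route groups events by destination reducer. A real mapper would
// send each event to a remote worker instead of appending to a slice.
func route(events []Event, n int) [][]Event {
	out := make([][]Event, n)
	for _, ev := range events {
		s := shardFor(ev.BlogID, n)
		out[s] = append(out[s], ev)
	}
	return out
}

func main() {
	events := []Event{
		{Kind: "CreatePost", BlogID: 7, PostID: 100},
		{Kind: "Like", BlogID: 7, PostID: 100},
		{Kind: "Like", BlogID: 9, PostID: 200},
	}
	for shard, evs := range route(events, 4) {
		fmt.Printf("reducer %d receives %d event(s)\n", shard, len(evs))
	}
}
```

Because the mapping depends only on the blog ID, any mapper can route any event without coordinating with the other mappers.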

Each reducer worker maintains a top-ten list for the blogs it is responsible for. At regular intervals an “aggregator” worker sweeps through all reducers, asking each for its local top-ten list. It then combines all shard ranklists into a single global top-ten list.
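The reduce and aggregate steps can be sketched as below. This is a minimal in-process illustration, not the tutorial's implementation: Rank, Reducer, topN and aggregate are hypothetical names, and the time window is left out to keep the merge logic in focus. The merge is correct because each blog is owned by exactly one reducer, so any blog in the global top ten must also appear in its own reducer's top ten.

```go
package main

import (
	"fmt"
	"sort"
)

// Rank pairs a blog with its like count.
type Rank struct {
	BlogID int64
	Likes  int
}

// topN returns the n highest-ranked entries, most likes first.
// Ties are broken by blog ID so the result is deterministic.
func topN(ranks []Rank, n int) []Rank {
	out := append([]Rank(nil), ranks...) // copy; do not reorder the input
	sort.Slice(out, func(i, j int) bool {
		if out[i].Likes != out[j].Likes {
			return out[i].Likes > out[j].Likes
		}
		return out[i].BlogID < out[j].BlogID
	})
	if len(out) > n {
		out = out[:n]
	}
	return out
}

// Reducer (hypothetical) keeps like counts for its share of the blogs.
type Reducer struct {
	likes map[int64]int
}

func NewReducer() *Reducer {
	return &Reducer{likes: make(map[int64]int)}
}

// Like records one like for the given blog.
func (r *Reducer) Like(blogID int64) {
	r.likes[blogID]++
}

// Top returns this reducer's local top-n ranklist.
func (r *Reducer) Top(n int) []Rank {
	all := make([]Rank, 0, len(r.likes))
	for id, c := range r.likes {
		all = append(all, Rank{BlogID: id, Likes: c})
	}
	return topN(all, n)
}

// aggregate merges the per-reducer ranklists into one global top-n list.
func aggregate(shards [][]Rank, n int) []Rank {
	var all []Rank
	for _, s := range shards {
		all = append(all, s...)
	}
	return topN(all, n)
}

func main() {
	r1, r2 := NewReducer(), NewReducer()
	r1.Like(1)
	r1.Like(1)
	r2.Like(3)
	global := aggregate([][]Rank{r1.Top(10), r2.Top(10)}, 10)
	fmt.Println(global)
}
```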

Illustration of the trending blogs application logic.

Source comments

The source of the tutorial is found in the circuit subdirectory tutorials/trend.

Mappers, reducers and aggregators are all implemented in package trend/x. The executable command that spawns the service, trend-spawn, is implemented in package trend/cmd/trend-spawn. The tutorial also includes “auto-configuration” logic in package trend/config, which demonstrates a simple technique for choosing a service configuration based on the circuit environment in which the executable is run.

The source for this tutorial is one of our favorite examples of how one can build a non-trivial map-reduce application from the ground up (that is, without any auxiliary infrastructure other than the circuit itself) in under 200 lines of code.