Onionoo design document
=======================

This short document describes Onionoo's design in a mostly informal and
language-independent way.  The goal is to be able to discuss design
decisions with non-Java programmers and to provide a blueprint for porting
Onionoo to other programming languages.  This document cannot describe all
the details, but it can provide a rough overview.

There are two main building blocks of Onionoo that are described here:

  1) an hourly cronjob processing newly published Tor descriptors and

  2) a web service component answering client requests.

The interface between the two building blocks is a directory in the local
file system that can be read and written by component 1 and can be read by
component 2.  In theory, the two components can be implemented in two
entirely different programming languages.  In a possible port from Java to
another programming language, the two components can easily be ported
subsequently.

The purpose of the hourly batch processor is to read updated Tor
descriptors from the metrics service and transform them to be read by the
web service component.  Answering a client request in component 2 of
Onionoo needs to be highly efficient which is why any data aggregation
needs to happen beforehand.  Parsing descriptors on-the-fly is not an
option.

The hourly batch processor is run in a cron job at :15 every hour that
usually takes up to five minutes and that contains the following substeps:

  1.1)  Rsync new Tor descriptors from metrics.

  1.2)  Read previously stored status data about relays and bridges that
        have been running in the last seven days to memory.  These data
        include for each relay or bridge: nickname, fingerprint, primary
        OR address and port, additional OR addresses and ports, exit
        addresses, network status publication time, directory port, relay
        flags, consensus weight, country code, host name as obtained by
        reverse domain name lookup, and timestamp of last reverse domain
        name lookup.

  1.3)  Import any new relay network status consensuses that have been
        published since the last run.

  1.4)  Set the running bit for all relays that are contained in the last
        known relay network status consensus.

  1.5)  Look up all relay IP addresses in a local GeoIP database and in a
        local AS number database.  Extract country codes and names, city
        names, geo coordinates, AS name and number, etc.

  1.6)  Import any new bridge network statuses that have been published
        since the last run.

  1.7)  Start reverse domain name lookups for all relay IP addresses.  Run
        in background, only refresh lookups for previously looked up IP
        address every 12 hours, run up to five lookups in parallel, and
        set timeouts for single requests and for the general lookup
        process.  In theory, this step could happen a few steps before,
        but not before step 1.3.

  1.8)  Import any new relay server descriptors that have been published
        since the last run.

  1.9)  Import any new exit lists that have been published since the last
        run.

  1.10) Import any new bridge server descriptors that have been published
        since the last run.

  1.11) Import any new bridge pool assignments that have been published
        since the last run.

  1.12) Make sure that reverse domain name lookups are finished or the
        timeout for running lookups has expired.  This step cannot happen
        at any time later than step 1.13 and shouldn't happen long before.

  1.13) Rewrite all details files that have changed.  Details files
        combine information from all previously imported descriptory
        types, database lookups, and performed reverse domain name
        lookups.  The web service component needs to be able to retrieve a
        details file for a given relay or bridge without grabbing
        information from different data sources.  It's best to write the
        details file part for a give relay or bridge to a single file in
        the target JSON format, saved under the relay's or bridge's
        fingerprint.  If a database is used, the raw string should be
        saved for faster processing.

  1.14) Import relays' and bridges' bandwidth histories from extra-info
        descriptors that have been published since the last run.  There
        must be internally stored bandwidth histories for each relay and
        bridge, regardless of whether they have been running in the last
        seven days.  The original bandwidth histories, which are available
        on 15-minute detail, can be aggregated to longer time periods the
        farther the interval lies in the past.  The interal bandwidth
        histories are different from the bandwidth files described in 1.15
        which are written to be given out to clients.

  1.15) Rewrite bandwidth files that have changed.  Bandwidth files
        aggregate bandwidth history information on varying levels of
        detail, depending on how far observations lie in the past.  It's
        inevitable to write JSON-formatted bandwidth files for all relays
        and bridges in the hourly cronjob.  Any attempts to process years
        of bandwidth data while answering a web request can only fail.
        The previously aggregated bandwidth files are stored under the
        relay's or bridge's fingerprint for quick lookup.

  1.16) Update the summary file listing all relays and bridges that have
        been running in the last seven days which was previously read in
        step 1.2.  This is the last step in the hourly process.  The web
        service component checks the modification time of this file to
        decide whether it needs to reload its view on the network.  If
        this step was not the last step, the web service component might
        list relays or bridges for which there are no details or bandwidth
        files available yet.  (With the approach taken here, it's
        conveivable that a bandwidth file of a relay or bridge that hasn't
        been running for a week has been deleted before step 1.16.  This
        case has been found acceptable, because it's highly unlikely.  If
        a database would have been used, steps 1.2 to 1.16 would have
        happened in a single database transaction.)

The web service component has the purpose of answering client requests.
It uses previously prepared data from the hourly cronjob to respond to
requests very quickly.

During initialization, or whenever the hourly cronjob has finished, the
web service component does the following substeps:

  2.1)  Read the summary file that was produced by the hourly cronjob in
        step 1.16.

  2.2)  Keep the list of relays and bridges in memory, including all
        information that is used for filtering or sorting results.

  2.3)  Prepare summary lines for all relays and bridges.  The summary
        resource is a JSON file with a single line per relay or bridge.
        This line contains only very few fields as compared to details
        files that a client might use for further filtering results.

When responding to a request, the web service component does the following
steps:

  2.4)  Parse request and its parameters.

  2.5)  Possibly filter relays and bridges.

  2.6)  Possibly re-order and limit results.

  2.7)  Write response or error code.

Again, (and this can hardly be overstated!) steps 2.4 to 2.7 need to
happen *extremely* fast.  Any steps that go beyond file system reads or
simple database lookups need to happen either in the hourly cronjob (1.1
to 1.16) or in the web service component initialization (2.1 to 2.3).

