
Research: Modernizing CollecTor’s ability to collect and archive Tor’s network data

The Tor network protects internet users against censorship, tracking, and surveillance. Tor is composed of thousands of relay nodes, or servers, run by volunteers who are distributed across the globe to enable users to connect privately to internet services. Continuous network evaluation is essential for reacting to censorship events, adapting Tor Browser to dynamic network conditions, and verifying modifications to the Tor network software.

Collecting data about the Tor network makes it feasible to develop accurate simulations or emulations of the network. This enables researchers to conduct experiments on private Tor testbeds rather than on the public live network, where experiments may undermine the anonymity or security features of the Tor network. Collecting data over time also makes it possible to visualize trends. For instance, the blocking of Tor in China can be detected by examining the data. Data collection can also be used to determine whether a given circumvention technique is working in a certain country.


The CollecTor service obtains data from various relays across the public Tor network and makes the data available to everyone. CollecTor provides Tor’s network data gathered since 2004 and has been available in the form of a Java application since 2010. As CollecTor’s codebase has expanded, technical debt has accumulated because new features were added without refactoring the original code. As the application’s complexity grows, it becomes increasingly difficult to add new sources of data to CollecTor.

To solve these problems, a recently published research paper details the requirements for a prototype of a service that can serve as a modernized replacement for CollecTor. It presents the core requirements for this new application for collecting Tor network data, as well as the requirements for two of its modules: relaydescs and onionperf. The paper also evaluates the libraries and frameworks that could be used to develop the application while minimizing code maintenance costs. This article gives an overview of the application.

Core requirements:

1- Collect:

Relaydescs (Tor relay descriptors):

Relay descriptors are published by relays and directory authorities so that users can select relay nodes for their paths across the Tor network. Figure (1) shows the document references included within the documents obtained by the relaydescs module.


Figure (1): Document references included within the documents obtained by the relaydescs module

Bridgedescs (Bridge descriptors):

Bridge descriptors are published by the bridges and the bridge authority. Bridges are used by censored users to connect to the Tor network. Bridge descriptors cannot be made available in the same way as relay descriptors, as this would defeat the goal of making bridges difficult for censors to enumerate. Consequently, bridge descriptors are sanitized by removing all potentially identifying data, and only the sanitized versions are published.

Bridgepools (Bridge pool assignments):

BridgeDB, the bridge distribution service, publishes bridge pool assignments detailing which bridges have been assigned to which distribution pools. BridgeDB receives the status of the bridge network from the bridge authority, assigns the bridges to persistent distribution rings, and hands them out to bridge users. BridgeDB periodically dumps to a local file the list of running bridges, along with information about the rings, subrings, and file buckets to which they have been assigned. Sanitized versions of these lists, which include SHA-1 hashes of the bridge fingerprints rather than the original fingerprints, are made available for statistical analysis.
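The fingerprint sanitization step mentioned above could be sketched as follows. This is a minimal illustration, not CollecTor's actual code: the function name `sanitize_fingerprint` is hypothetical, and the sketch assumes the hash is computed over the 20 decoded bytes of the hex fingerprint rather than over its text form.

```python
import hashlib


def sanitize_fingerprint(fingerprint_hex: str) -> str:
    """Return the upper-case hex SHA-1 digest of a bridge fingerprint.

    Assumption: the fingerprint is given as 40 hex characters and is
    hashed in its decoded 20-byte binary form.
    """
    raw = bytes.fromhex(fingerprint_hex)
    if len(raw) != 20:
        raise ValueError("expected a 40-character hex fingerprint")
    return hashlib.sha1(raw).hexdigest().upper()


if __name__ == "__main__":
    # Made-up fingerprint, used purely for illustration.
    print(sanitize_fingerprint("0123456789ABCDEF0123456789ABCDEF01234567"))
```

Because the hash is deterministic, the same bridge maps to the same sanitized identifier in every published list, so statistics can still be computed per bridge without revealing the original fingerprint.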

Webstats (Web server logs):

Tor’s web servers, like most web servers, keep request logs for informational and maintenance purposes. Unlike most other web servers, however, Tor’s web servers use a privacy-aware log format that does not record users’ sensitive data. Moreover, Tor’s server logs are neither analyzed nor archived until a number of post-processing steps have been performed to further reduce any privacy-sensitive content.

Exitlists (Exit lists):

TorDNSEL, the exit list service, publishes exit lists containing the IP addresses of relay nodes that it discovered by exiting through them.

Onionperf (Torperf’s and Onionperf’s performance data):

Torperf and OnionPerf, the performance measurement services, publish performance data obtained by making simple HTTP requests across the Tor network. Torperf/OnionPerf uses a SOCKS client to download files across the Tor network and records how long each substep takes.
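To make the idea of substep timings concrete, the following sketch derives per-substep durations from timestamps recorded during one download. The step names in `STEPS` and the `substep_durations` helper are assumptions chosen for illustration; they are not taken from the Torperf or OnionPerf output format.

```python
from typing import Dict, List, Tuple

# Hypothetical ordered substeps of a single measurement.
STEPS: List[str] = [
    "start",
    "socks_connected",
    "request_sent",
    "first_byte",
    "complete",
]


def substep_durations(timestamps: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (label, seconds) for each pair of consecutive substeps."""
    durations = []
    for previous, current in zip(STEPS, STEPS[1:]):
        elapsed = timestamps[current] - timestamps[previous]
        if elapsed < 0:
            raise ValueError(f"{current} is earlier than {previous}")
        durations.append((f"{previous} -> {current}", elapsed))
    return durations


if __name__ == "__main__":
    # Made-up timestamps in seconds since the measurement started.
    sample = {
        "start": 0.0,
        "socks_connected": 0.5,
        "request_sent": 0.75,
        "first_byte": 2.0,
        "complete": 4.0,
    }
    for label, seconds in substep_durations(sample):
        print(label, seconds)
```

Keeping raw timestamps and deriving durations afterwards means additional breakdowns can be computed later without repeating the measurement.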

2- Archive:

Even though it is essential for clients and servers across the Tor network to strictly validate documents along with their signatures, the CollecTor service should not drop documents that fail validation. A descriptor may fail validation because it uses a new format that is not yet understood, or because it is malformed due to a bug, in which case archiving the document will aid in debugging the issue.
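The archive-even-on-failure behaviour described above might look like the following sketch. The names `ArchivedDocument` and `archive_document` are hypothetical, the validator is any callable that raises `ValueError` on a bad document, and a plain list stands in for real storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ArchivedDocument:
    raw: bytes                    # document bytes exactly as received
    valid: bool                   # outcome of validation
    error: Optional[str] = None   # reason for failure, if any


def archive_document(
    raw: bytes,
    validate: Callable[[bytes], None],
    store: List[ArchivedDocument],
) -> ArchivedDocument:
    """Validate a document, then archive it whether or not it passed."""
    try:
        validate(raw)
        record = ArchivedDocument(raw=raw, valid=True)
    except ValueError as exc:
        # Keep the document so the failure can be investigated later.
        record = ArchivedDocument(raw=raw, valid=False, error=str(exc))
    store.append(record)
    return record
```

Recording the validation outcome next to the untouched bytes lets consumers filter out invalid documents while still leaving them available for debugging.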

The archive has to be able to validate its own integrity, confirming that descriptors have not been altered or truncated. It should also be possible to determine how many descriptors are missing, either from the timestamps at which a descriptor or status should have been made available or from a descriptor being referenced by another descriptor, and to signal when the number of missing descriptors exceeds a predefined threshold.
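The two checks described above, integrity and missing references, can be sketched as follows. This is an illustration under stated assumptions: the archive is modelled as a dictionary keyed by the SHA-256 digest of each document's bytes, which is a simplification and not the digest rule of any particular Tor document type, and all function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, Set


def digest(data: bytes) -> str:
    """Hex SHA-256 digest used here as the archive key (an assumption)."""
    return hashlib.sha256(data).hexdigest()


def find_corrupted(archive: Dict[str, bytes]) -> Set[str]:
    """Return keys whose stored bytes no longer match their digest."""
    return {key for key, data in archive.items() if digest(data) != key}


def find_missing(archive: Dict[str, bytes], referenced: Iterable[str]) -> Set[str]:
    """Return referenced digests that are absent from the archive."""
    return {ref for ref in referenced if ref not in archive}


def too_many_missing(missing: Set[str], threshold: int) -> bool:
    """Signal when the number of missing documents exceeds the threshold."""
    return len(missing) > threshold
```

A truncated or altered document shows up in `find_corrupted` because its recomputed digest differs from its key, while `find_missing` covers the case where one document refers to another that was never archived.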

Archiving cryptographic signatures can present great challenges, as the signatures rely on algorithms that may eventually be broken, due either to design flaws or to increases in available computing power. Some systems provide archive timestamps, which can confirm that a data object existed at a specific time. If the algorithm had not yet been broken at that time, the original signature can still be trusted.

3- Serve:

CollecTor not only collects and archives documents but also serves them to other applications. These include services run by Tor Metrics, e.g. Onionoo, and external applications operated by researchers.

For services that consume descriptors of a certain type as they become known, CollecTor has to make all recently fetched descriptors available. This is currently done by providing descriptors in grouped form, with one file per download round. In the future, however, the modernized CollecTor service will provide an index of the recently downloaded descriptors so that applications can obtain only the descriptors they need.
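The benefit of an index over grouped files can be illustrated with a small sketch. The `IndexEntry` fields and the `entries_to_fetch` helper are hypothetical and do not reflect the paper's actual index format; the point is only that a client can select descriptors by type and skip those it already holds.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class IndexEntry:
    digest: str       # identifier of the descriptor
    doc_type: str     # e.g. "relaydescs" or "onionperf"
    fetched_at: int   # Unix timestamp of the download round


def entries_to_fetch(
    index: List[IndexEntry],
    wanted_type: str,
    already_have: Set[str],
) -> List[IndexEntry]:
    """Select index entries of one type that the client does not yet hold."""
    return [
        entry
        for entry in index
        if entry.doc_type == wanted_type and entry.digest not in already_have
    ]
```

With grouped files, a client would have to download a whole round to obtain a single descriptor. With an index, it downloads only the entries returned by a selection like this one.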

The modernized CollecTor can implement parts of the Tor directory protocol version 3 in order to allow other CollecTor instances to be used as data sources and to reduce the load that CollecTor generates across the network. If this protocol is extended to provide index functionality, the present system of providing grouped files for recent documents can be replaced. This would also benefit those debugging network issues, since individual descriptors could easily be downloaded for examination.

Implementation of the prototype of the modernized CollecTor:

To implement the application’s prototype, the paper specifies a plugin API. The prototype needs refactoring to fit this API and to enable the implementation of the application’s requirements set out in the paper. According to the paper, experiments with the plugin have shown promising results compared with the current CollecTor service.

At present, the prototype takes the form of a command line tool rather than a service with its own in-process scheduler. A scheduler has to be integrated into the prototype before it can be deployed as a service.
