PerlMonks Activity Levels - A Study

Let's get right to the DATA.

Each row of data represents a span of 100 SOPW posts. Each row has information on the first and last posts in the span. Data for each post include:

  1. The number (index) of the post in the grand array of all SOPW posts;
  2. The node ID;
  3. The time at which the node was posted (specifically, epoch time).
Since a row represents a span, and a span is represented by its first and last posts, each row contains two sets of the above data.

The table has six columns:
  A      The index of the first post in the span
  Z      The index of the last post in the span
  idA    The node ID of the first post in the span
  dateA  The post time of the first post in the span
  idZ    The node ID of the last post in the span
  dateZ  The post time of the last post in the span
The data is tab-delimited, so it can be copied and pasted directly into Excel, for example.
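
For readers who would rather process the rows in Perl than in a spreadsheet, here is a minimal sketch that splits one row into the six columns described above and derives an average posting rate for the span. The sub name parse_span and the sample row are made up for illustration; the sample values are not real data. The rate is approximate, since it divides the number of posts in the span by the time between the first and last post.

```perl
use strict;
use warnings;

# Column order as described above: A, Z, idA, dateA, idZ, dateZ.
# Returns a hash ref for one tab-delimited row, with the span's
# duration in days and its average posting rate added.
sub parse_span {
    my ($line) = @_;
    chomp $line;
    my %span;
    @span{qw(A Z idA dateA idZ dateZ)} = split /\t/, $line;

    my $posts   = $span{Z} - $span{A} + 1;        # posts in the span
    my $seconds = $span{dateZ} - $span{dateA};    # epoch seconds elapsed
    $span{days}          = $seconds / 86_400;
    $span{posts_per_day} = $seconds > 0 ? $posts / $span{days} : undef;
    return \%span;
}

# Hypothetical row, invented for illustration only (not real data).
my $row = parse_span("0\t99\t1000\t1000000000\t1500\t1000864000\n");
printf "%d posts over %.1f days = %.1f posts/day\n",
    $row->{Z} - $row->{A} + 1, $row->{days}, $row->{posts_per_day};
```

Because the post times are already epoch seconds, no date parsing is needed at this stage; a plain subtraction gives the length of each span.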

To determine when the data was collected, look at the timestamp of the last node in the last row.

How the data was collected

This is a lot of data, so the primary concern was how to collect it without significantly impacting the PerlMonks servers.

Lacking direct access to the database, the most efficient means I could think of was to hit the Node lister in /bare/ mode, selecting the perlquestion type, iterating through all of the result pages. (Note that this requires pmdev privileges.) I recorded the node ID of the first and last node on each page.

The number of results per page is hardcoded at 100. With about 57,000 SOPW nodes, that's 570 result pages. I put a sleep(10) before each fetch.
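
The paging loop described above might look roughly like the sketch below. It is not the script I actually ran: the sub name collect_span_ids and its named arguments are hypothetical, and the actual page fetch is left to a caller-supplied code ref, because the Node lister's URL and parameters are not spelled out here.

```perl
use strict;
use warnings;

# Walk result pages one at a time, pausing before each fetch.
# $arg{fetch_page} is a caller-supplied code ref (hypothetical): given a
# page number, it returns the list of node IDs on that page, in page order.
sub collect_span_ids {
    my (%arg) = @_;
    my $delay = defined $arg{delay} ? $arg{delay} : 10;    # seconds
    my @spans;
    for my $page ( 0 .. $arg{pages} - 1 ) {
        sleep $delay if $delay;
        my @ids = $arg{fetch_page}->($page);
        last unless @ids;                      # ran past the last page
        push @spans, [ $ids[0], $ids[-1] ];    # first and last ID on the page
    }
    return @spans;
}

# Rough cost of the full crawl, using the figures quoted above.
my $pages = 57_000 / 100;    # 570 result pages
printf "%d pages x 10 s = %d minutes of sleeping\n", $pages, $pages * 10 / 60;
```

At 10 seconds per page, the sleeping alone comes to about an hour and a half for the listing pass.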

For each pair of node IDs recorded, I fetched their node creation times using the node query xml generator. This function takes a list of node IDs, so I could have requested info on many more than two nodes per call, which would have reduced the load on the web server. The load on the database would have been about the same, however, and by doing it the way I did, I spread that load out over more time. Again, I slept 10 seconds between calls.
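
For comparison, the batching alternative mentioned above could be sketched as follows. The helper batch_ids is hypothetical, and the batch size of 50 is an arbitrary choice for illustration; I don't know what limit, if any, the generator places on the number of IDs per request.

```perl
use strict;
use warnings;

# Split a flat list of node IDs into batches of at most $size IDs each.
sub batch_ids {
    my ( $size, @ids ) = @_;
    die "batch size must be positive" unless $size > 0;
    my @batches;
    push @batches, [ splice @ids, 0, $size ] while @ids;
    return @batches;
}

# 570 pages x 2 IDs per page = 1140 IDs. The range 1 .. 1140 merely
# stands in for real node IDs.
my @batches = batch_ids( 50, 1 .. 1140 );
printf "%d calls instead of %d\n", scalar @batches, 570;
```

Fewer calls means fewer hits on the web server, but each call asks the database for more rows at once, which is the trade-off described above.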

I converted the returned timestamp to epoch time using Time::Local::timegm().
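
A minimal sketch of that conversion is below. Time::Local's timegm() takes broken-down UTC time fields and returns epoch seconds. The sub name timestamp_to_epoch is hypothetical, and the "YYYY-MM-DD HH:MM:SS" input format, as well as treating the timestamp as UTC, are assumptions made for the example; adjust the pattern to whatever the generator actually returns.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a "YYYY-MM-DD HH:MM:SS" timestamp (format assumed here) to
# epoch seconds, treating the timestamp as UTC (also an assumption).
sub timestamp_to_epoch {
    my ($stamp) = @_;
    my ( $y, $mon, $d, $h, $min, $s ) =
        $stamp =~ /^(\d{4})-(\d\d)-(\d\d)[ T](\d\d):(\d\d):(\d\d)$/
        or die "Unrecognized timestamp: $stamp";

    # timegm() wants a 0-based month; the 4-digit year is passed as-is.
    return timegm( $s, $min, $h, $d, $mon - 1, $y );
}

print timestamp_to_epoch('2001-09-09 01:46:40'), "\n";
```

If the timestamps turn out to be in the server's local time zone instead, timelocal() from the same module is the counterpart to use, run with the matching TZ setting.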