Let's get right to the DATA.
Each row of data represents a span of 100 SOPW posts and records information about the first and last posts in that span.
The table has six columns:
| Column | Description |
|---|---|
| A | The index of the first post in the span |
| Z | The index of the last post in the span |
| idA | The node ID of the first post in the span |
| dateA | The post time of the first post in the span |
| idZ | The node ID of the last post in the span |
| dateZ | The post time of the last post in the span |
To determine when the data was collected, look at the timestamp of the last node in the last row.
Obviously, this represents a lot of data, so the primary concern is how to collect it without significantly impacting the PerlMonks servers.
Lacking direct access to the database, the most efficient means I could think of was to hit the Node lister in /bare/ mode, selecting the perlquestion type, iterating through all of the result pages. (Note that this requires pmdev privileges.) I recorded the node ID of the first and last node on each page.
The number of results per page is hardcoded at 100. With about 57,000 SOPW nodes, that's 570 result pages. I put a sleep(10) before each fetch.
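The page walk described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: `fetch_page` is a hypothetical callback standing in for the real Node lister request (its URL and parameters are not shown here), and the 1-based post index is an assumption.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

my $PAGE_SIZE = 100;    # results per page, hardcoded by the Node lister

# Number of result pages needed to cover $total nodes.
sub page_count {
    my ($total) = @_;
    return ceil($total / $PAGE_SIZE);
}

# Walk every result page, sleeping $delay seconds before each fetch.
# $fetch_page is a HYPOTHETICAL callback: given a zero-based page number,
# it returns the node IDs listed on that page, in order.
sub collect_spans {
    my ($total, $delay, $fetch_page) = @_;
    my @spans;
    for my $page (0 .. page_count($total) - 1) {
        sleep $delay if $delay;
        my @ids = $fetch_page->($page);
        last unless @ids;
        push @spans, {
            A   => $page * $PAGE_SIZE + 1,              # assumed 1-based index
            Z   => $page * $PAGE_SIZE + scalar(@ids),
            idA => $ids[0],
            idZ => $ids[-1],
        };
    }
    return @spans;
}
```

With 57,000 nodes, `page_count` gives 570 pages, so a 10-second delay adds up to 5,700 seconds (95 minutes) of sleeping alone.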
For each pair of node IDs recorded, I fetched their node creation times using the node query xml generator. This function takes a list of node IDs, so I could have requested info on a lot more than two nodes on each call. In this way I could have reduced the load on the web server. The load on the database would have been about the same, and in fact by doing it the way I did, I spread out the load over more time. Again, I slept 10 seconds between each call.
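For comparison, the batching alternative mentioned above might look something like this. It is only a sketch: the batch size of 50 is an arbitrary assumption (I don't know what limit, if any, the generator imposes), and the actual request is left out.

```perl
use strict;
use warnings;

# Split a list of node IDs into batches of at most $size IDs each.
sub batches {
    my ($size, @ids) = @_;
    die "batch size must be positive\n" if $size < 1;
    my @out;
    push @out, [ splice @ids, 0, $size ] while @ids;
    return @out;
}

# 570 pages x 2 node IDs each = 1140 IDs to look up.
my @all_ids = (1 .. 1140);            # placeholder IDs, not real node IDs
my @calls   = batches(50, @all_ids);  # 50 per call is an assumed batch size
printf "%d calls instead of %d\n", scalar(@calls), 570;
```

Each batch would then go to the generator in a single call, with the same sleep between calls.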
I converted the returned timestamp to epoch time using Time::Local::timegm().
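The conversion can be sketched like so, using `timegm` from Time::Local (the function that maps broken-down UTC fields to epoch seconds). The `YYYY-MM-DD HH:MM:SS` format and the UTC interpretation are assumptions for illustration; adjust the pattern to whatever the generator actually returns.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert a timestamp string to epoch seconds, treating it as UTC.
# ASSUMED input format: "YYYY-MM-DD HH:MM:SS".
sub to_epoch {
    my ($stamp) = @_;
    my ($y, $mon, $d, $h, $min, $s) =
        $stamp =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)$/
        or die "unrecognized timestamp: $stamp\n";
    # timegm expects a zero-based month; a four-digit year is used as-is.
    return timegm($s, $min, $h, $d, $mon - 1, $y);
}

print to_epoch("2000-01-01 00:00:00"), "\n";    # prints 946684800
```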