User:Jrincayc/Wikipedia Growth Paper

I have printed out this version and will hand it in. Mutilate at will, tell me what I did wrong, what to do next, etc.

Abstract

I use a model of Wikipedia to attempt to explain its growth. Unfortunately, while the model does have explanatory power, I am unable to explain many of the coefficients.

Introduction to Wikipedia

Wikipedia, in a nutshell, is an online, multilingual encyclopedia that can be edited by anyone with an internet connection. It was begun on January 15, 2001 as an experiment to determine whether a less formal encyclopedia (compared to the more formal Nupedia) could be developed in an 'open source' manner (see Britannica or Nupedia? The Future of Free Encyclopedias, http://www.kuro5hin.org/story/2001/7/25/103136/121 ). Its most distinctive feature is that almost every page on the site has an "edit this page" link (the exceptions being pages, like the front page, that are especially prone to vandalism). Clicking this link takes you to a page where you can edit the article and make any changes you want. New articles are created by following a link to an article that has no text yet.

Of course, this also means that vandalism is very easy. Hence, detecting and undoing vandalism must be correspondingly easy. Two major features help with this. The first is that a complete record of every edit and every version of every article is kept and made available. As a result, it is easy to go to an article, choose one of the older versions, and make it the current version, thereby removing the subsequent vandalism (this is called a revert). The second is that each person who is logged in gets a watchlist that shows when articles they are interested in change. They can then see the exact words that have been changed in the article. This allows edits to Wikipedia to be carefully examined, and reverted if they are vandalism, without incurring a large time cost.

Wikipedia currently has over 350,000 articles, and there are 10 languages with more than 10,000 articles each. It gains hundreds of new articles a day, and around 10,000 edits are made every day. These are impressive figures for an encyclopedia that depends entirely on volunteer effort. The fact that the entire database of edits is downloadable makes Wikipedia very interesting to examine further.

Coase's Penguin

The only mention of Wikipedia in a journal that I have found is the paper Coase's Penguin, or, Linux and the Nature of the Firm (Yochai Benkler, Yale Law Journal, Volume 112, Number 3, December 2002). This paper examines several instances of the creation of freely available informational and cultural works that anyone can contribute to, called peer production. It concludes that a major factor in helping these works get created is that transaction costs are cut substantially compared to firm production or market production. In the context of Wikipedia, the relevant cost that has been cut is that of determining who is best suited to work on a given encyclopedia article. Each person who uses Wikipedia has a very good idea of their individual cost and of the benefit of improving a particular article. If their individual cost is less than their individual benefit, the individual can make the improvement. Wikipedia has access to far more individuals than a firm does, so it is much more likely that a low-cost, high-benefit individual can be found. Nor can a firm costlessly determine the best individual within the firm. Trying to replicate Wikipedia with a market would involve either contracting less optimal individuals or contracting thousands of people for small amounts of work; the search and contracting costs involved with the latter would be huge. So the peer production that occurs in Wikipedia may very well be the most efficient way to produce an encyclopedia, since the transaction costs of producing it with a firm or through a market are substantially greater.


Effect of edits and authors, Costs and Benefits

An edit that improves an article has two effects:

  1. It increases the overall quality of the encyclopedia.
  2. It increases the quality of the article.

The first effect is expected to increase visitors, and hence edits. The second effect is expected to decrease the number of visitors who are capable of improving the article. An edit made by a different author is expected to have an even greater effect, since it brings new ideas and perspectives.

So, as an article gets closer to the perfect article, the benefit of an additional edit decreases. The cost stays similar, or may even go up, as the number of people capable of improving the article decreases and the amount of rewriting work increases.

The effect on the encyclopedia should be that more quality articles bring in more people to read, and potentially edit, articles. This effect runs in the opposite direction of the effect on the individual article.

Data gathered

The first step taken with the entire Wikipedia download was to run it through a fast preprocessing program to remove the information I was not interested in. The only information kept for each edit was the article title, author name, article checksum, number of links in the article, edit date/time, and flags for the type of edit and article (such as name-space, redirect, and minor edit). The main thing removed was the article text, which greatly reduced the amount of data that needed to be dealt with.

The next steps continued to remove information that was not needed. First, all non-articles were removed. The definition of an article is the standard Wikipedia one: an article has at least one internal link, is not a redirect, and is in the main name-space (that is, it is not an image, a talk page, or similar). Next, reverts were removed. Any article whose history had the pattern A, B, C, where A and C had the same checksum and length and B and C had different authors, was considered reverted, and the changes B and C were not counted in any subsequent statistics. The numbers of reverts were tracked by month and encyclopedia.
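The revert rule above is simple enough to sketch directly. This is a hypothetical reconstruction, not the program actually used: I assume each edit is reduced to an (author, checksum, length) tuple, per the preprocessing described earlier.

```python
# Sketch of the revert-detection rule described above (hypothetical data
# layout): scan an article's edit history for a pattern A, B, C where A and C
# have the same checksum and length, and B and C have different authors.
# B and C are then excluded from subsequent statistics.

def find_reverts(history):
    """history: list of (author, checksum, length) tuples in edit order.
    Returns the set of indices (of B and C) to exclude as reverts."""
    excluded = set()
    for i in range(len(history) - 2):
        a, b, c = history[i], history[i + 1], history[i + 2]
        same_version = a[1] == c[1] and a[2] == c[2]   # A and C identical
        different_authors = b[0] != c[0]               # B did not self-revert
        if same_version and different_authors:
            excluded.update({i + 1, i + 2})
    return excluded

# Example: Bob vandalizes, Carol reverts to Alice's version.
history = [("Alice", "cafe", 100), ("Bob", "beef", 42), ("Carol", "cafe", 100)]
# find_reverts(history) -> {1, 2}
```

Note the rule requires B and C to have different authors, so an author immediately undoing their own edit is not counted as a revert.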

The last processing step was to get the data into a form suitable for OLS. For each month, the total number of articles in various categories was calculated (an example category would be articles that had 2 to 4 authors and 6 to 10 edits, written as AE:2to4_6to10). The number of bot edits (any edit by a user listed on Wikipedia:Bots) was also calculated, so that it could be accounted for in the regression and disregarded.
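The bucketing into AE:* categories can be sketched as follows. The bin edges come from the tables below; this is my own illustrative reconstruction, and for simplicity it uses one shared edit binning rather than the combined bins (such as AE:0to1_6toplus) that the actual categories use.

```python
# Minimal sketch of the author/edit bucketing behind the AE:* categories.
# Bin edges are taken from the tables in the paper; the combined categories
# (e.g. AE:0to1_6toplus) are not modeled here.

AUTHOR_BINS = [(0, 1, "0to1"), (2, 4, "2to4"), (5, 10, "5to10"),
               (11, float("inf"), "11toplus")]

EDIT_BINS = [(0, 1, "0to1"), (2, 3, "2to3"), (4, 5, "4to5"),
             (6, 10, "6to10"), (11, 20, "11to20"), (21, float("inf"), "21toplus")]

def bucket(value, bins):
    """Return the label of the bin containing value (bins are inclusive)."""
    for lo, hi, name in bins:
        if lo <= value <= hi:
            return name
    raise ValueError(value)

def ae_category(n_authors, n_edits):
    return "AE:%s_%s" % (bucket(n_authors, AUTHOR_BINS),
                         bucket(n_edits, EDIT_BINS))

# ae_category(3, 7) -> 'AE:2to4_6to10'
```

Counting, per month and encyclopedia, how many articles fall into each category then yields one row of the regression data.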

This produced data with the following summary statistics:

Summary Statistics for 456 data points over 38 encyclopedias
Variable Mean Median Standard Deviation Minimum Maximum Sum
total_delta 842.27 126.5 2450.28 0 38843 384073
edits_delta 5255.31 367 15561.04 0 116898 2396423
total 7276.39 405 23599.69 1 186355 3318032
edits 35521.20 1314 138666.29 1 1297188 16197666
reverts 37.32 0 183.98 0 1760 17020
bot_count 119.10 0 1688.20 0 34882 54310
bot_create 84.61 0 1503.90 0 31854 38584
bot_total 203.71 0 2345.65 0 34887 92894
AE:0to1_0to1 1933.11 167.5 5460.36 0 51221 881500
AE:0to1_2to3 477.05 53 1337.89 0 10171 217536
AE:0to1_4to5 72.27 5 228.82 0 1918 32954
AE:0to1_6toplus 36.55 3 118.61 0 1038 16667
AE:2to4_2to3 2195.05 66 8154.28 0 59328 1000944
AE:2to4_4to5 843.95 21.5 2639.70 0 23103 384843
AE:2to4_6to10 392.58 12 1306.26 0 12028 179018
AE:2to4_11toplus 70.83 2 251.12 0 2356 32298
AE:5to10_4to5 88.81 0 294.84 0 2005 40499
AE:5to10_6to10 544.21 1 1934.14 0 15512 248159
AE:5to10_11to20 296.54 1 1188.86 0 10697 135221
AE:5to10_21toplus 44.92 1 188.50 0 1808 20484
AE:11toplus_11to20 105.10 0 484.99 0 4279 47925
AE:11toplus_21toplus 175.40 1 949.83 0 9942 79984

Here are the averages in a table, ordered by the number of authors going down and the number of edits going across. Note that the categories were chosen to ensure that each had a reasonable number of articles, and some were combined for that reason (for example, AE:0to1_6toplus is a combined category).

Average Number of Articles for Categories
Authors\Edits  0to1       2to3       4to5      6to10     11to20     21plus
0to1           1933.1140  477.0526   72.2675   36.5504 (6toplus, combined)
2to4           -          2195.0526  843.9539  392.5833  70.8289 (11toplus, combined)
5to10          -          -          88.8136   544.2083  296.5373   44.9211
11plus         -          -          -         -         105.0987   175.4035



Equation and Coefficients

Δtotal = β₀ + β₁·bot_create + β₂·AE:0to1_0to1 + β₃·AE:0to1_2to3 + β₄·AE:0to1_4to5 + β₅·AE:0to1_6toplus + β₆·AE:2to4_2to3 + β₇·AE:2to4_4to5 + β₈·AE:2to4_6to10 + β₉·AE:2to4_11toplus + β₁₀·AE:5to10_4to5 + β₁₁·AE:5to10_6to10 + β₁₂·AE:5to10_11to20 + β₁₃·AE:5to10_21toplus + β₁₄·AE:11toplus_11to20 + β₁₅·AE:11toplus_21toplus

On the left is Δtotal, the change in the total number of articles in a month for a given encyclopedia. The intercept is expected to be positive: the data cover only encyclopedias that have actually been started, and to get started they need to go from zero pages to some pages even while all the variables are zero. The next variable, bot_create, is the number of articles created that month by computer programs, referred to as bots. Its coefficient should be close to one, since one bot-created article in a month results in one more article being created. (It might possibly spur on human authors, but that is unlikely within a month's time.)

The rest of the variables depend on the structure of the articles. Their coefficients are expected to increase as the number of authors and edits increases: based on the model of Wikipedia growth, articles with more authors and/or more edits are expected to be of higher quality, and so should draw in more readers, some of whom then go on to create new articles. On the other hand, as an article ages, more of its links to other articles will point to already existing articles, so I would expect some decrease in the coefficients as the number of authors and edits increases.

Δedits = β₀ + β₁·bot_total + β₂·AE:0to1_0to1 + β₃·AE:0to1_2to3 + β₄·AE:0to1_4to5 + β₅·AE:0to1_6toplus + β₆·AE:2to4_2to3 + β₇·AE:2to4_4to5 + β₈·AE:2to4_6to10 + β₉·AE:2to4_11toplus + β₁₀·AE:5to10_4to5 + β₁₁·AE:5to10_6to10 + β₁₂·AE:5to10_11to20 + β₁₃·AE:5to10_21toplus + β₁₄·AE:11toplus_11to20 + β₁₅·AE:11toplus_21toplus

The second equation tries to predict Δedits, the number of new edits made in a month. The only variable that differs is that bot_total, the number of edits made by bots, is used instead of bot_create. Its coefficient should be one, since one edit by a bot should produce approximately one edit in that month (plus or minus any discouragement or encouragement of human editors). The coefficients on the article categories should be somewhat similar to those in the Δtotal equation, since some of the same effects are occurring. Of course, they should be greater in magnitude, since you only have to create an article once, but you have to edit it multiple times for it to become a high-quality article.
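Both equations have the same shape: a dependent variable regressed on an intercept, a bot column, and the 14 AE:* counts. The paper does not say which OLS package was used; the following sketch uses NumPy's least squares on a toy two-column design matrix just to show the mechanics, including how the reported standard errors could be computed.

```python
# OLS sketch: y = X @ beta + error, with standard errors from the usual
# formula Var(beta) = sigma^2 (X'X)^-1. In the paper, X would hold a column
# of ones, bot_create (or bot_total), and the 14 AE:* category counts.

import numpy as np

def ols(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)              # residual variance
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Toy example with an intercept and one regressor, true model y = 2 + 3x:
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones(100), x])
beta, se = ols(X, y)
# beta should come out close to [2.0, 3.0]
```

The same call with the full 16-column design matrix and Δtotal (or Δedits) as y would reproduce the regressions reported below, up to the package used.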

In general, I would expect the coefficients to be positive except when one of two things is happening; both depend on the fact that new authors are joining and old authors are leaving. If the current mix of articles decreases the number of new authors entering, then it has a negative effect on the number of new edits. So if the current mix of articles is of poor quality, more potential authors might get discouraged by the poor quality of Wikipedia and never join. On the other hand, poor quality might instead prompt them to start editing; the way to tell the two apart is that low-quality articles would then cause more edits to be made but fewer new articles to be created. The other possible cause of negative coefficients is high-quality articles: these would tend to discourage new authors, since they will find no improvements to make.

The Regression

Both equations were regressed on the data. Below is the Δtotals result:

R² = 0.8664

ΔTotals Regression Results
Variable Coefficient Standard Error
Intercept 177.9947 52.6005
bot_create 1.0538 0.0337
AE:0to1_0to1 0.0248 0.0283
AE:0to1_2to3 2.2715 0.5569
AE:0to1_4to5 -14.2698 4.4896
AE:0to1_6toplus 4.7155 6.7709
AE:2to4_2to3 0.0592 0.0392
AE:2to4_4to5 0.0302 0.3884
AE:2to4_6to10 0.5713 1.1788
AE:2to4_11toplus 1.3863 5.0708
AE:5to10_4to5 5.0926 2.8107
AE:5to10_6to10 -0.6224 0.9268
AE:5to10_11to20 2.4462 1.5104
AE:5to10_21toplus -21.7014 7.8333
AE:11toplus_11to20 -4.5929 1.5457
AE:11toplus_21toplus 2.5271 0.8357

Well, it has a reasonably high R², the intercept is positive, and the value for bot_create is close to one. Beyond that, I have to say the values of the coefficients surprise me, and I have no good story to explain them. The only two that are significant at the 95% confidence level and positive are one author with 2 to 3 edits, and 11 or more authors with 21 or more edits. It is possible that the former reflects some kind of new article with lots of empty links, and the latter the high-quality-encyclopedia attraction effect, but it is also possible that the data is picking up something else entirely. Other coefficients that are significant and negative, such as AE:5to10_21toplus and AE:11toplus_11to20, do not follow any pattern that I can see.

Below are the structural coefficients arranged in a table:

95% confidence intervals for ΔTotals coefficients (authors down, edits across)
0to1 authors:   0to1 [-0.0308, 0.0803]    2to3 [1.1769, 3.3661]      4to5 [-23.0935, -5.4460]   6toplus [-8.5918, 18.0228]
2to4 authors:   2to3 [-0.0178, 0.1361]    4to5 [-0.7331, 0.7935]     6to10 [-1.7455, 2.8881]    11toplus [-8.5797, 11.3523]
5to10 authors:  4to5 [-0.4315, 10.6167]   6to10 [-2.4439, 1.1991]    11to20 [-0.5223, 5.4147]   21toplus [-37.0967, -6.3062]
11plus authors: 11to20 [-7.6308, -1.5551] 21toplus [0.8846, 4.1697]
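The interval cells above appear to be the usual coefficient ± 1.96 standard errors (the normal 95% critical value). Reproducing the AE:0to1_0to1 cell from the regression table, which matches to within rounding:

```python
# 95% confidence interval from a coefficient and its standard error,
# using the AE:0to1_0to1 values from the ΔTotals regression table.
coef, se = 0.0248, 0.0283
lo, hi = coef - 1.96 * se, coef + 1.96 * se
# (round(lo, 4), round(hi, 4)) -> (-0.0307, 0.0803)
```

The tiny discrepancy in the lower bound (-0.0307 vs. the table's -0.0308) suggests a t-distribution critical value or more digits of the standard error were used.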

The Δedits regression yielded similarly puzzling results, presented below:

R² = 0.9597

ΔEdits Regression Results
Variable Coefficient Standard Error
Intercept 530.5434 183.8839
bot_total 0.9010 0.0858
AE:0to1_0to1 -0.0487 0.0954
AE:0to1_2to3 5.7923 1.9287
AE:0to1_4to5 -37.9669 15.5650
AE:0to1_6toplus 22.6584 23.5915
AE:2to4_2to3 0.1670 0.1442
AE:2to4_4to5 1.2930 1.3584
AE:2to4_6to10 0.1671 4.1181
AE:2to4_11toplus 13.5165 17.7708
AE:5to10_4to5 54.0941 9.3732
AE:5to10_6to10 -15.7714 3.1147
AE:5to10_11to20 39.1942 5.1087
AE:5to10_21toplus -126.4542 27.9810
AE:11toplus_11to20 -15.1967 5.4038
AE:11toplus_21toplus 4.2223 2.9406

Well, it has an even higher R², a positive intercept, and the right value for the coefficient on bot_total. On the other hand, I cannot think of a good explanation for the coefficients on AE:5to10_4to5 (+), AE:5to10_6to10 (-), AE:5to10_11to20 (+), AE:5to10_21toplus (-), and AE:11toplus_11to20 (-). Also, the value on AE:5to10_21toplus seems much lower than I would expect. I am quite suspicious that some of the coefficients are picking up an omitted-variable bias, since they seem inexplicable. My best guess for a candidate is some kind of large-encyclopedia effect that is affecting the higher edit and author counts.
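One way to probe the suspected large-encyclopedia effect, sketched below as an assumption rather than an analysis actually run here, is to add overall encyclopedia size as an extra regressor and see whether the odd AE:* coefficients move. The helper name and the toy arrays are mine.

```python
# Hypothetical check for an omitted size variable: append log(1 + total
# articles) as a column to the design matrix before re-running OLS.

import numpy as np

def with_size_control(X, totals):
    """Append log(1 + total articles) as a column to the design matrix X."""
    return np.column_stack([X, np.log1p(totals)])

X = np.ones((3, 2))                        # toy design matrix: 3 rows, 2 cols
totals = np.array([1.0, 100.0, 10000.0])   # toy per-row encyclopedia sizes
X2 = with_size_control(X, totals)
# X2.shape -> (3, 3)
```

If the AE:* coefficients change substantially once size is controlled for, that would support the omitted-variable story.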

95% confidence intervals for ΔEdits coefficients (authors down, edits across)
0to1 authors:   0to1 [-0.2363, 0.1388]     2to3 [2.0017, 9.5829]       4to5 [-68.5579, -7.3759]    6toplus [-23.7075, 69.0244]
2to4 authors:   2to3 [-0.1165, 0.4505]     4to5 [-1.3767, 3.9628]      6to10 [-7.9265, 8.2607]     11toplus [-21.4096, 48.4426]
5to10 authors:  4to5 [35.6722, 72.5160]    6to10 [-21.8929, -9.6498]   11to20 [29.1537, 49.2346]   21toplus [-181.4473, -71.4612]
11plus authors: 11to20 [-25.8172, -4.5761] 21toplus [-1.5569, 10.0016]


Conclusions

Something odd is happening with the data. The model explains quite a bit of the variation, but I would not have expected the signs on the coefficients that I see. The aggregate data I am using does not give sufficient insight to explain these values; I suspect I will have to work more closely with the article-level data, examining individual articles carefully, to account for some of the effects seen.