Here's a plot of my email since January 1, 2003 (992936 emails received in 357 weeks), categorized as spam or ham (not spam):


The spike on Sep 3, 2003 is a virus outbreak that was caught by SpamAssassin. I'll try to remove it at some point in the future.
We started filtering viruses on May 2, 2004, which cut the number of spams slightly. About 25 viruses/day were arriving at the time, but not all of them were being caught by SpamAssassin, so the exact reduction could vary a bit.

The "ham" only includes mail that makes it to my inbox. Mailing list traffic is not counted.

The plot shows linear and quadratic fits to predict how many spams I'll be receiving each day three months from now.
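As a rough illustration of what such a fit involves, here is a minimal Python sketch of an ordinary least-squares straight-line fit followed by a 90-day extrapolation. The fit logs further down look like output from gnuplot's `fit` command, which uses nonlinear least squares; this sketch swaps in the closed-form solution for a straight line instead. The function name `fit_line`, the 90-day horizon, and the data values are assumptions made up for illustration and are not the real spam counts.

```python
def fit_line(ts, ys):
    """Ordinary least-squares fit of y = const + linear * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    linear = sxy / sxx
    const = y_mean - linear * t_mean
    return const, linear


# Made-up illustration data (NOT the real spam counts): days relative to
# today (t = 0 is today, negative values are in the past) and spams/day.
days = [-28, -21, -14, -7, 0]
spams = [214, 211, 208, 205, 202]

const, linear = fit_line(days, spams)

# Extrapolate 90 days ahead (assumed stand-in for "three months").
prediction = const + linear * 90
print(f"const={const:.3f} linear={linear:.5f} prediction={prediction:.1f}")
```

Because t=0 is today, the fitted `const` is directly the expected number of spams today, which is why the constant term is the easiest parameter to read off.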

For the first year of data collection, an exponential fit was a disturbingly good model. Since then, things have mostly settled down to a linear trend, though, as you can see, there is still some upward curvature in the past year.

The details, which follow, assume the following fit functions:
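The formulas themselves are not reproduced here. Judging only from the parameter names in the fit output below, they are presumably of the following form. This is a reconstruction and an assumption, not the original formulas, and the role of the extra parameter A in the last fit cannot be recovered, so it is left out:

```latex
\begin{align*}
f_{\text{linear}}(t) &= \text{const} + \text{linear}\cdot t \\
f_{\text{exp}}(t)    &= C\, e^{r t} \\
f_{\text{quad}}(t)   &= \text{qconst} + \text{qlinear}\cdot t + \text{quadratic}\cdot t^{2}
\end{align*}
```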

t=0 is today.
const shows how many spams I can expect to receive today if the linear fit is correct.
qconst shows how many spams I can expect to receive today if the quadratic fit is correct.
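To make the "three months from now" prediction concrete, here is a small Python sketch that plugs the parameter values from the fit output below into a straight line and a parabola. It rests on assumptions that are not stated on this page: that the fits have the forms `const + linear*t` and `qconst + qlinear*t + quadratic*t^2`, that t is measured in days, and that three months is about 90 days. The function names are hypothetical.

```python
# Parameter values copied from the fit output on this page.
CONST, LINEAR = 235.675, -0.176216
QCONST, QLINEAR, QUADRATIC = 332.481, -0.0601737, -8.12464e-05


def predict_linear(t):
    """Assumed linear model: const + linear * t (t in days, 0 = today)."""
    return CONST + LINEAR * t


def predict_quadratic(t):
    """Assumed quadratic model: qconst + qlinear * t + quadratic * t^2."""
    return QCONST + QLINEAR * t + QUADRATIC * t * t


DAYS_AHEAD = 90  # assumption: "three months" taken as 90 days

linear_forecast = predict_linear(DAYS_AHEAD)
quadratic_forecast = predict_quadratic(DAYS_AHEAD)
print(f"linear fit:    {linear_forecast:.1f} spams/day")
print(f"quadratic fit: {quadratic_forecast:.1f} spams/day")
```

At t=0 both functions return just their constant term, which matches the description of const and qconst as the expected number of spams today.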
degrees of freedom (ndf) : 51
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 3.00758
variance of residuals (reduced chisquare) = WSSR/ndf : 9.04552

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

const           = 235.675          +/- 12.76        (5.415%)
linear          = -0.176216        +/- 0.06218      (35.29%)


--
degrees of freedom (ndf) : 50
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 1.40507
variance of residuals (reduced chisquare) = WSSR/ndf : 1.97422

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

C               = 5.81231e+08      +/- 6.212e+08    (106.9%)
r               = 0.00750161       +/- 0.000481     (6.412%)


--
degrees of freedom (ndf) : 354
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 4.84165
variance of residuals (reduced chisquare) = WSSR/ndf : 23.4415

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

qconst          = 332.481          +/- 11.92        (3.586%)
qlinear         = -0.0601737       +/- 0.01797      (29.86%)
quadratic       = -8.12464e-05     +/- 5.916e-06    (7.282%)

--
degrees of freedom (ndf) : 351
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 5.17492
variance of residuals (reduced chisquare) = WSSR/ndf : 26.7798

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

qconst          = 332.481          +/- 12.83        (3.858%)
qlinear         = -0.0601746       +/- 0.01964      (32.64%)
quadratic       = -8.1098e-05      +/- 6.542e-06    (8.067%)
A               = 26.7794          +/- 3.388        (12.65%)
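The summary lines in each block of fit output are related in a simple way: the reduced chi-square is WSSR/ndf, the rms of residuals is its square root, and the percentage after each parameter is the standard error relative to the parameter's magnitude. A minimal Python sketch of those relations follows. The function names are hypothetical, the fit is assumed to be unweighted (every point has weight 1), and ndf is assumed to be the number of data points minus the number of fitted parameters.

```python
import math


def fit_summary(residuals, n_params):
    """Return (ndf, reduced chi-square, rms of residuals) for an unweighted fit."""
    ndf = len(residuals) - n_params          # degrees of freedom
    wssr = sum(r * r for r in residuals)     # sum of squared residuals
    reduced_chisq = wssr / ndf
    stdfit = math.sqrt(reduced_chisq)
    return ndf, reduced_chisq, stdfit


def relative_error_percent(value, error):
    """Standard error expressed as a percentage of the parameter's magnitude."""
    return 100.0 * error / abs(value)


# Consistency check against the first block of output above:
# sqrt(9.04552) should reproduce the reported stdfit of 3.00758.
stdfit_check = math.sqrt(9.04552)
const_percent = relative_error_percent(235.675, 12.76)
print(f"stdfit={stdfit_check:.5f} const error={const_percent:.2f}%")
```

A percentage computed from the printed values can differ in the last digit from the one in the log, since the log rounds the standard error for display.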


Page last updated Fri Nov 6 23:58:29 CST 2009. Comments should be directed to menscher@uiuc.edu.