Here's a plot of my email since January 1, 2003 (993875 emails received in 359 weeks), categorized as spam or ham (not spam):


The spike on Sep 3, 2003 is a virus outbreak that was caught by spamassassin. I'll try to remove it at some point in the future.
We started filtering viruses on May 2, 2004, which cut the number of spams slightly. About 25 viruses/day were arriving at the time, but not all of them had been caught by spamassassin, so the actual reduction could vary a bit.

The "ham" only includes mail that makes it to my inbox. Mailing list traffic is not counted.

The plot shows linear and quadratic fits to predict how many spams I'll be receiving each day three months from now.
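
As a rough illustration of how such a prediction works, here is a minimal sketch of an ordinary least-squares straight-line fit, extrapolated about three months ahead. The data is made up, the helper name fit_linear is hypothetical, and the convention that t is measured in days with the past negative is an assumption for this example, not something taken from the actual plotting script. A quadratic fit works the same way with an extra t squared term.

```python
def fit_linear(ts, ys):
    """Ordinary least-squares fit of y = const + linear * t."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    linear = sxy / sxx
    const = mean_y - linear * mean_t
    return const, linear

# Made-up data: 53 weekly samples, t in days relative to today (assumed
# convention: t = 0 is today, the past is negative).
ts = [-7 * k for k in range(52, -1, -1)]
ys = [200.0 + 0.5 * t for t in ts]  # synthetic spams/day lying exactly on a line

const, linear = fit_linear(ts, ys)

# Extrapolate roughly three months (91 days) past today.
forecast = const + linear * 91
print(f"const={const:.3f} linear={linear:.3f} forecast={forecast:.1f}")
```

With t = 0 pinned to today, the fitted constant is directly the expected number of spams for today, which makes the parameter easy to read off.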

For the first year of data collection, an exponential fit was a disturbingly good model. Since then, things have mostly settled down to a linear trend, though, as you can see, there is still some upward curvature in the past year.
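
To make "exponential fit" concrete, here is a small sketch that fits y = C * exp(r * t) by doing a straight-line fit on ln(y) and then converts r into a doubling time. This is a swapped-in shortcut, not necessarily how the fits on this page were produced: a direct nonlinear least-squares fit weights the points differently. The functional form, the helper name fit_exponential, and the data are all assumptions made for illustration.

```python
import math

def fit_exponential(ts, ys):
    """Fit y = C * exp(r * t) via least squares on ln(y); needs every y > 0."""
    logs = [math.log(y) for y in ys]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_l = sum(logs) / n
    r = (sum((t - mean_t) * (l - mean_l) for t, l in zip(ts, logs))
         / sum((t - mean_t) ** 2 for t in ts))
    C = math.exp(mean_l - r * mean_t)
    return C, r

# Made-up data: 52 weekly samples (t in days) that double every 90 days.
ts = [7 * k for k in range(52)]
ys = [10.0 * 2 ** (t / 90) for t in ts]

C, r = fit_exponential(ts, ys)
doubling_time = math.log(2) / r  # same units as t
print(f"C={C:.3f} r={r:.5f} doubling time={doubling_time:.1f} days")
```

A doubling time is usually easier to interpret than the raw rate r, which is why the sketch reports it.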

The details, which follow, assume the following fit functions:

t=0 is today.
const shows how many spams I can expect to receive today if the linear fit is correct.
qconst shows how many spams I can expect to receive today if the quadratic fit is correct.
degrees of freedom (ndf) : 51
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 5.249
variance of residuals (reduced chisquare) = WSSR/ndf : 27.552

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

const           = 125.958          +/- 16.33        (12.97%)
linear          = -0.60279         +/- 0.08829      (14.65%)


--
degrees of freedom (ndf) : 50
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 1.34938
variance of residuals (reduced chisquare) = WSSR/ndf : 1.82082

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

C               = 5.77603e+08      +/- 5.867e+08    (101.6%)
r               = 0.0074442        +/- 0.0004542    (6.101%)


--
degrees of freedom (ndf) : 356
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 5.7748
variance of residuals (reduced chisquare) = WSSR/ndf : 33.3484

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

qconst          = 251.418          +/- 12.43        (4.943%)
qlinear         = -0.166589        +/- 0.01947      (11.69%)
quadratic       = -0.000110751     +/- 6.55e-06     (5.914%)

--
degrees of freedom (ndf) : 353
rms of residuals      (stdfit) = sqrt(WSSR/ndf)      : 5.87142
variance of residuals (reduced chisquare) = WSSR/ndf : 34.4736

Final set of parameters            Asymptotic Standard Error
=======================            ==========================

qconst          = 251.418          +/- 12.98        (5.162%)
qlinear         = -0.166592        +/- 0.02065      (12.4%)
quadratic       = -0.000111036     +/- 7.001e-06    (6.305%)
A               = 26.7794          +/- 3.77         (14.08%)
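
The summary numbers in the fit output above can be cross-checked by hand. The sketch below recomputes the rms of residuals as the square root of the reduced chi-square, as the output itself states, and recomputes a few of the percentages as 100 * error / |value|. That last relation is my reading of the output format rather than something the page spells out, and because the printed values are rounded, small mismatches are expected.

```python
import math

# (reduced chi-square, reported rms of residuals), copied from the four fits
fits = [
    (27.552, 5.249),
    (1.82082, 1.34938),
    (33.3484, 5.7748),
    (34.4736, 5.87142),
]
# Largest disagreement between sqrt(WSSR/ndf) and the reported rms
rms_gap = max(abs(math.sqrt(chi2) - rms) for chi2, rms in fits)

# (value, standard error, reported percent), copied from the parameter tables
params = [
    (125.958, 16.33, 12.97),        # const
    (-0.60279, 0.08829, 14.65),     # linear
    (0.0074442, 0.0004542, 6.101),  # r
    (251.418, 12.43, 4.943),        # qconst
]
# Assumed relation: percent = 100 * standard error / |value|
pct_gap = max(abs(100 * err / abs(val) - pct) for val, err, pct in params)

print(f"largest rms mismatch: {rms_gap:.2g}")
print(f"largest percent mismatch: {pct_gap:.2g}")
```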


Page last updated Sun Nov 22 23:58:28 CST 2009. Comments should be directed to menscher@uiuc.edu.