Stats would be helpful

I was working on bug #13381 and realized that I need more of a stats background. Or at least, being able to apply the stats classes I took 6-7 years ago would be helpful.

We have this problem where we need to categorize projects into three groups: new/never-started, active, and abandoned/inactive. Telling active from inactive is fairly easy - telling new from active is more difficult, however.

Thanks to cvs2mysql I have some information on how frequently a project has committed files to their module. The problem is how to use this information to figure out that a project is still "new". As far as I can tell right now, a project is still new if:

  1. They've never committed anything
  2. They've only committed a few times
  3. They've only committed things for a short period of time over the entire commit history (say, 7 days)
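
To make those three rules concrete, here is a rough sketch in Python. It assumes each project's commit times are already available as a list of datetime objects. The names is_new, FEW_COMMITS, and SHORT_SPAN are made up for illustration, and the cutoff of 5 commits is a pure guess at what "a few" means. Only the 7-day window comes from the list above.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: "a few" commits is a guess; 7 days is the example window.
FEW_COMMITS = 5
SHORT_SPAN = timedelta(days=7)

def is_new(commit_times):
    """Return True if a project still looks "new" under the three rules."""
    if not commit_times:                    # rule 1: never committed anything
        return True
    if len(commit_times) <= FEW_COMMITS:    # rule 2: only committed a few times
        return True
    # rule 3: the whole commit history fits inside a short window
    span = max(commit_times) - min(commit_times)
    return span <= SHORT_SPAN
```

A project with ten commits spread over a month would come out as not new, while ten commits crammed into two days would still count as new. Whether that second case is really "new" is exactly the sort of thing I'm unsure about.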

I'm sure I'm missing some data here, but I'm not sure what. The pieces of data I know I have access to are:

  • Time of each transaction
  • Number of commits
  • Number of files per commit

From these I can derive others, such as the duration of the project (first commit to last), the time between commits, and other statisticy things like medians, modes, and standard deviations.
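
As a sketch of what deriving those numbers might look like, the snippet below uses Python's standard statistics module. It assumes the commit times for one project are a list of datetime objects. The function name commit_stats and the choice to measure everything in days are my own assumptions, not anything cvs2mysql provides.

```python
from datetime import datetime
from statistics import median, pstdev

SECONDS_PER_DAY = 86400

def commit_stats(commit_times):
    """Summarize one project's commit history; returns None with fewer than 2 commits."""
    times = sorted(commit_times)
    if len(times) < 2:
        return None  # no gaps to measure
    # Time between consecutive commits, in days.
    gaps = [(b - a).total_seconds() / SECONDS_PER_DAY
            for a, b in zip(times, times[1:])]
    return {
        "commits": len(times),
        "duration_days": (times[-1] - times[0]).total_seconds() / SECONDS_PER_DAY,
        "median_gap_days": median(gaps),
        "stdev_gap_days": pstdev(gaps),  # population standard deviation of the gaps
    }
```

This only covers the timing side. The number of files per commit could be summarized the same way if it turns out to matter.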

What I really need to do is fit this data to a curve, but I don't know how. Or at least, I don't know what algorithm to choose. Any ideas on how to start tackling this, or good ideas on what I should do?
