Not ready with results yet, but I wanted to drop a line on this thread to say: I've been experimenting a bunch in the last couple of days and I have a technique now which may prove fruitful.
I mentioned to @Dave Keenan yesterday that my partner does Product Marketing for a living, a deeply data-driven occupation. I regularly turn to her for answers on stats-related problems. So I asked her this afternoon if she knew a better way to do regression analyses than manually copying and pasting data series into Wolfram online, because I was going to need to start testing a ton of variations of metric combinations. She said: just use Google Sheets! Whatever I say about Sheets is probably also true for Excel.
Indeed they have various formulas, e.g. SLOPE for linear fit and GROWTH for exponential fit. But those turned out to only be good for predicting future values in a series. I needed something to give me the actual formula for the best fit curve.
Google Sheets did in the end have the answer, but not in a formula. The solution was found inside their Charts feature. If you give it a data series and Customize the chart, one of the options it provides is a Trendline. Enabling the Trendline gives you a bunch of options: linear, exponential, logarithmic, power, etc. You can generally eyeball which is the best shape for your data, but an objective measure is found in the R² value, or coefficient of determination. It can go as high as 1, or 100%.
And if you change the Label of the Trendline in the dropdown to "Equation" then you can get its equation. But what I ultimately needed was the goodness-of-fit; I was only after the equation as a means to calculating goodness-of-fit myself. So the fact that Sheets calculated R² for me was even better than I was expecting!
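For anyone who wants to sanity-check that number outside of Sheets, here is a minimal sketch of the textbook definition of R² (1 minus the residual sum of squares over the total sum of squares). I'm assuming this is the quantity the Trendline reports; for the non-linear trendline types Sheets may well compute it on transformed data, so treat this as illustrative only. The function name `r_squared` is made up for this example.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (textbook definition)."""
    mean = sum(observed) / len(observed)
    # Residual sum of squares: how far the fitted values are from the data.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: how far the data are from their own mean.
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# A perfect fit scores 1; predicting the mean everywhere scores 0.
perfect = r_squared([1, 2, 3], [1, 2, 3])
mean_only = r_squared([1, 2, 3], [2, 2, 2])
```

A fit worse than just predicting the mean gives a negative value under this definition, which is why 1 is the ceiling but 0 is not a floor.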
So anyway, my next steps will be to come up with a ton of different combinations of metrics (SoPF>3, Benedetti height, Tenney height, n+d ("length"?), abs(n-d), abs(SoPF>3(n) - SoPF>3(d)), etc.) and then just compare all of their R² values and see which one has the best fit with respect to the frequency statistics from Scala.
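To make those candidate metrics concrete, here is a rough sketch of how a few of them could be computed for a ratio n/d. The definitions are my assumptions: SoPF>3 is taken as the sum of prime factors greater than 3 counted with repetition, Benedetti height as n·d, and Tenney height as log2(n·d). All function names are invented for illustration.

```python
import math


def prime_factors(n):
    """Prime factors of a positive integer, with repetition (trial division)."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def sopf_gt3(n):
    """Assumed meaning of SoPF>3: sum of prime factors above 3, with repetition."""
    return sum(p for p in prime_factors(n) if p > 3)


def metrics(n, d):
    """A few of the candidate metrics for the ratio n/d."""
    return {
        "sopf_gt3": sopf_gt3(n) + sopf_gt3(d),
        "benedetti_height": n * d,
        "tenney_height": math.log2(n * d),
        "n_plus_d": n + d,
        "abs_n_minus_d": abs(n - d),
        "abs_sopf_diff": abs(sopf_gt3(n) - sopf_gt3(d)),
    }


# Example ratio: 35/32, whose only primes above 3 are 5 and 7.
example = metrics(35, 32)
```

Each metric would then become one column in the sheet, to be charted against the Scala frequency statistics and compared by R².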
By the way, the R² for the frequency statistics themselves is an impressive 0.991 when fit to the equation 8041x^-1.37, where x is the index of the comma in the list of commas sorted by descending frequency. Dunno if there's any significance to that coefficient, but there ya go.
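In case it helps anyone reproduce a power trendline outside of Sheets, here is a minimal sketch that fits y = a·x^b by ordinary least squares on log(x) and log(y). That is one common way to fit a power law; I'm assuming, not claiming, that Sheets does something equivalent. The data below is synthetic, generated from the equation above purely to show the fit recovers its parameters. It is not the Scala data, and `fit_power_law` is a made-up name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b via linear least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic, noise-free data (NOT the real statistics): x is a 1-based rank.
xs = list(range(1, 51))
ys = [8041 * x ** -1.37 for x in xs]
a, b = fit_power_law(xs, ys)
```

Note that x has to start at 1 rather than 0 here, since the log of 0 is undefined.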
So I guess the moral of the story is: trust your partner for assists, and the solution is often right under your nose.