Dave Keenan wrote: ↑Sat Jun 27, 2020 2:39 pm
I find that votes = 8254×rank^{-1.12} is a better match to the data than N×rank^{-1.37} when limiting to the first 80 ratios. But I also find that votes = 8280×rank^{-1.13} is the best fit when I include all 820 data points. I also note that the fewer data points I include, starting from the most popular, the closer the exponent gets to a true Zipf's law exponent of -1.

When limiting to the first 80 items, I get -1.38. Almost exactly the same as -1.37. When I extend it out to all 820 of them, I get -1.39, also very close.

How are you finding your match for this rank exponent?

I was about to say that I was using the regression analysis features built into Google Sheets, which I trusted well enough, except that when I went to spruce up this chart to share here on the forum, some really strange things happened that broke my trust in Sheets. I noticed right before I exported that what should have been merely stylistic changes to the chart had changed its contents: it now had an exponent of -1.49 instead of -1.38!

And as I undid my actions, I swear I saw the exponent pass through at least one value other than -1.38 and -1.49! I guess I just hadn't thought to monitor the exponent while applying these styles, and it had been flip-flopping all over the place as I went along.

So I then sought to figure out how I had managed to get these other exponent values. And while I tried every different combination of the styling actions I'd taken, I failed to reproduce these other values besides -1.49. And unfortunately, having forked off that original undo history branch, I was unable to get it back, so I now couldn't prove that I had ever seen anything other than -1.49.

But it appears there might be an explanation for this second value. Whether adding the x-axis to the chart, or including a header row to label the line, what seems to be happening is that it's including the first element in the list, 1/1's votes of 7624, where it wasn't including that before! So that's what causes my exponent to change from -1.38 to -1.49.

When I have it fit the line to all 820 points, this seeming inclusion or exclusion of the first data point does not affect the fit. It's locked at -1.39. Which suggests that maybe that's the one closer to the truth. Perhaps this "inclusion of the first data point" notion is actually more like Sheets' "inclusion of an extraneous imaginary data point before the start of the series".

So I'm not convinced I should trust Sheets anymore. Surely whatever maths Sheets is running in the background, we can run ourselves. I just don't know enough stats to know what that maths is for a power series trendline. If anyone else knows, I can certainly code it up. I'll also ask Lauren sometime today when she's not busy.
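One common way to fit a power trendline by hand is to take logs of both sides, so that votes = c×rank^{r} becomes ln(votes) = ln(c) + r×ln(rank), and then do an ordinary least-squares straight-line fit. I believe spreadsheet power trendlines are computed this way, but that is an assumption about Sheets, not something confirmed in this thread. Here is a minimal sketch; `fit_power_law` is a made-up name, and the data is synthetic (an exact power law), not the Scala votes.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**r by least squares on ln(y) versus ln(x).

    Returns (c, r). All xs and ys must be positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    # Slope of the straight line in log-log space is the exponent r.
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    r = sxy / sxx
    # Intercept in log-log space is ln(c).
    c = math.exp(mean_y - r * mean_x)
    return c, r

# Synthetic stand-in data: an exact power law over ranks 1..80.
ranks = list(range(1, 81))
votes = [8254 * rank ** -1.12 for rank in ranks]

c, r = fit_power_law(ranks, votes)

# Same fit with the first data point dropped, to compare against
# the "is rank 1 included or not?" question.
c_skip, r_skip = fit_power_law(ranks[1:], votes[1:])
```

On exact power-law data the two fits agree. On the real votes, comparing the fit over all points with the fit over `ranks[1:]` would show directly how much the first data point pulls the exponent. Note that this minimises squared error in log space, which is not the same as minimising squared error on the raw vote counts, so a different fitting method can legitimately give a different exponent.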

By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.

Alternatively, and I think Dave has been implying this: we should just use -1 for the exponent, in accordance with Zipf's law. This would be tantamount to saying: deviation from -1 in the data we're working with represents noise – the extent to which these Scala votes are polluted with redundant scales, scales that aren't true scales, or otherwise aren't representative of future usage. I am open to just using -1 as the exponent. I think it's elegant and memorable and justifiable.

Without including prime limit or sopf (s = 0, u = 0), I find the best k to be around 0.8, no matter whether I use rank^{-1}, rank^{-1.12} or rank^{-1.37}.

I'd bet including s and u affects k.

I've also noticed that weighting prime p according to p^{a} penalises primes greater than 13 way too much, for all reasonable values of a (≈ 1). So instead of p^{a} I tried π(p), i.e. I weighted each prime by its index. π(5)=3, π(7)=4, π(11)=5, etc. So instead of sop^{a}fr(), it's soπ(p)fr().

But I find that π(p) penalises higher primes too little.

The function Dave has found is called the prime-counting function. Nice find, Dave! Certainly worth considering.

Suppose we raise π(p) to some power? In that case the power would probably be > 1. I could see this being the right thing: that composers actually use higher primes in a pattern based not on their size but on their index in the sequence of primes, with a weight on those indices that is not linear but some power greater than 1.
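To make the idea concrete, here is a rough sketch of a sum of π(p)^{a} over prime factors with repetition. The names `prime_factors`, `prime_index` and `so_pi_fr` are made up for illustration, and it takes a single integer; how it should be applied to a ratio is not settled by this sketch.

```python
def prime_factors(n):
    """Prime factors of n with repetition, e.g. 12 -> [2, 2, 3]."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

def prime_index(p):
    """pi(p): the number of primes <= p, so pi(2)=1, pi(3)=2, pi(5)=3."""
    # q is prime exactly when its factor list is just [q].
    return sum(1 for q in range(2, p + 1) if len(prime_factors(q)) == 1)

def so_pi_fr(n, a=1.0):
    """Sum of pi(p)**a over the prime factors p of n, with repetition."""
    return sum(prime_index(p) ** a for p in prime_factors(n))
```

For example, 35 = 5×7, so with a = 1 this gives π(5) + π(7) = 3 + 4 = 7, and with a = 2 it gives 9 + 16 = 25. The trial-division approach is slow for large primes, but the primes we care about here are small.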

I wouldn't want our final formula to use both sop^{a}fr and soπ^{a}fr, but I will see if π can lower our sum-of-squares any further.

I'm afraid I can't quite follow your explanation of your Excel workarounds. It's not your explanatory skills that are lacking. Just my math skills (and/or familiarity with Excel's limitations). But no rush on the checking. I am going to try out soπ^{a}fr and I suppose I should also try out soπ^{a}f too. They may well move the needle.