developing a notational comma popularity metric

User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Here's what I found:

k=0.6
a=0.56
s=0.2 (that's the weight on prime limit, 's' as it was earlier)
u=0.1 (this is the weight on SoUPF>3, or sopf if we go with general math lingo)

This gets sum-of-squares across the first 80 entries down to 0.001101743627332945. That's with rank^-1.37.

For comparison, SoPF>3 gives sum-of-squares of 0.003006204375301944.

These values still result in ranking 125/1 worse than 49/1. That's the first place it goes wrong. But apparently, overall, it is best. Let me know if anyone can independently confirm these values.
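
In case it helps with checking, here's a rough Python sketch of how I'm computing the metric and the sum-of-squares (the function names are just mine, and exactly how the sopf and prime-limit terms enter is my reading of the parameter descriptions above, so treat it as a sketch rather than a definitive formulation):

```python
from sympy import factorint, primefactors

def sopafr(n, a):
    # sum over prime factors p > 3 of p**a times that prime's exponent (with repetition)
    return sum(p ** a * e for p, e in factorint(n).items() if p > 3)

def sopf(n):
    # sum of distinct prime factors > 3, unweighted
    return sum(p for p in primefactors(n) if p > 3)

def prime_limit(n, d):
    # largest prime factor > 3 of the whole ratio, or 0 for 3-limit ratios
    ps = [p for p in primefactors(n * d) if p > 3]
    return max(ps) if ps else 0

def metric(n, d, k=0.6, a=0.56, s=0.2, u=0.1):
    # assumption: the side with the greater sopafr is unweighted and the other
    # side gets weight k; assumption: sopf and prime limit apply to the whole ratio
    hi, lo = (n, d) if sopafr(n, a) >= sopafr(d, a) else (d, n)
    return (sopafr(hi, a) + k * sopafr(lo, a)
            + s * prime_limit(n, d)
            + u * sopf(n * d))

def sum_of_squares(ratios, scala_rank, z=-1.37, **params):
    # rank the ratios by the metric (ties broken arbitrarily), then compare
    # zipfed estimated ranks against zipfed actual Scala ranks
    by_metric = sorted(ratios, key=lambda r: metric(*r, **params))
    est_rank = {r: i + 1 for i, r in enumerate(by_metric)}
    return sum((est_rank[r] ** z - scala_rank[r] ** z) ** 2 for r in ratios)
```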
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

I find that votes = 8254×rank^-1.12 is a better match to the data than N×rank^-1.37 when limiting to the first 80 ratios. But I also find that votes = 8280×rank^-1.13 is the best fit when I include all 820 data points. I also note that the fewer data points I include, starting from the most popular, the closer the exponent gets to a true Zipf's law exponent of -1.
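
For anyone who'd rather do this kind of fit outside a spreadsheet, here's a minimal sketch using scipy's curve_fit; the variable names are placeholders, and this isn't the Excel setup I'm actually using.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(rank, amplitude, exponent):
    return amplitude * rank ** exponent

def fit_votes_vs_rank(votes):
    # votes: vote counts sorted in descending order, so ranks are 1, 2, 3, ...
    ranks = np.arange(1, len(votes) + 1)
    (amplitude, exponent), _ = curve_fit(power_law, ranks, votes, p0=(votes[0], -1.0))
    return amplitude, exponent
```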

Without including prime limit or sopf (s = 0, u = 0), I find the best k to be around 0.8, no matter whether I use rank^-1, rank^-1.12 or rank^-1.37.

I've also noticed that weighting prime p according to p^a penalises primes greater than 13 way too much, for all reasonable values of a (≈ 1). So instead of p^a I tried π(p), i.e. I weighted each prime by its index. π(5)=3, π(7)=4, π(11)=5, etc. So instead of sopafr(), it's soπ(p)fr().

But I find that π(p) penalises higher primes too little.
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

I should say that your k's and a's could well be optimal. I am not (yet) doing a sort on the metric (by which I really mean, on the results of applying the metric to the ratios) to obtain the estimated ranks. I am instead generating the estimated ranks by exponentiating the metric, i.e. taking a number slightly greater than 1, e.g. 1.126, and raising it to the power of the metric: est_rank = 1.126^metric, where metric = sopafr(n)+k*sopafr(d). I then use the Excel Solver to adjust k and a to minimise the sum of all the (est_rank^-1.12 - scala_rank^-1.12)^2. The -1.12 replaces the earlier -1.37. I call this the Zipf exponent, z.

However, the 1.126 above, the base of the exponential, call it b, is not fixed. I include it with k and a as a variable to be adjusted by the Solver, to minimise the sum of the squared differences between the zipfed ranks.

I do this because I can't include a sort in a function to be optimised by the Excel Solver. But I should be able to check your results eventually. Hiking tomorrow, so it probably won't be until the day after.
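
In case that's hard to picture from the prose, here's roughly the same calculation sketched in Python, with scipy's Nelder-Mead minimiser standing in for the Excel Solver. It's an approximation of the spreadsheet setup rather than an exact transcription.

```python
import numpy as np
from scipy.optimize import minimize
from sympy import factorint

Z = -1.12  # the Zipf exponent, z

def sopafr(n, a):
    # sum over prime factors p > 3 of p**a times that prime's exponent
    return sum(p ** a * e for p, e in factorint(n).items() if p > 3)

def solver_objective(params, ratios, scala_ranks):
    # params = (b, k, a): instead of sorting to get estimated ranks, estimate
    # them directly as est_rank = b**metric, then compare zipfed ranks
    b, k, a = params
    total = 0.0
    for (n, d), scala_rank in zip(ratios, scala_ranks):
        m = sopafr(n, a) + k * sopafr(d, a)
        est_rank = b ** m
        total += (est_rank ** Z - scala_rank ** Z) ** 2
    return total

# e.g., with ratios and scala_ranks holding the 2,3-reduced ratios and their Scala ranks:
# result = minimize(solver_objective, x0=np.array([1.126, 0.8, 1.0]),
#                   args=(ratios, scala_ranks), method="Nelder-Mead")
```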
Attachments
ImprovedSoPF.xlsx
(713.72 KiB) Downloaded 201 times
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Dave Keenan wrote: Sat Jun 27, 2020 2:39 pm I find that votes = 8254×rank^-1.12 is a better match to the data than N×rank^-1.37 when limiting to the first 80 ratios. But I also find that votes = 8280×rank^-1.13 is the best fit when I include all 820 data points. I also note that the fewer data points I include, starting from the most popular, the closer the exponent gets to a true Zipf's law exponent of -1.
When limiting to the first 80 items, I get -1.38. Almost exactly the same as -1.37. When I extend it out to all 820 of them, I get -1.39, also very close.

How are you finding your match for this rank exponent?

I was about to say that I was using the regression analysis features built into Google Sheets, which I trusted well enough, except that when I went to spruce up this chart to share here on the forum, some really strange things happened that broke my trust in Sheets. I noticed right before I exported that what should have been merely stylistic changes to the chart had changed its contents: it now had an exponent of -1.49 instead of -1.38!

And as I undid my actions, I swear I saw the exponent pass through at least one value other than -1.38 and -1.49! I guess I just hadn't thought to monitor the exponent while applying these styles, and it had been flip-flopping all over the place as I went along.

So I then sought to figure out how I had managed to get these other exponent values. But although I tried every combination of the styling actions I'd taken, the only value I could reproduce besides -1.38 was -1.49. And unfortunately, having forked off that original undo history branch, I was unable to get it back, so I now couldn't prove that I had ever seen anything other than -1.49.

But there might be an explanation for this second value. Whether I add the x-axis to the chart or include a header row to label the line, what seems to be happening is that Sheets starts including the first element in the list, 1/1's 7624 votes, where it wasn't including it before! So that's what causes my exponent to change from -1.38 to -1.49.

When I have it fit the line to all 820 points, this seeming inclusion or exclusion of the first data point does not affect the fit. It's locked at -1.39. Which suggests that maybe that's the one closer to the truth. Perhaps this "inclusion of the first data point" notion is actually more like Sheets' "inclusion of an extraneous imaginary data point before the start of the series".

So I'm not convinced I should trust Sheets anymore. Surely whatever maths Sheets is running in the background, we can run ourselves. I just don't know enough stats to know how to find it for a power series trendline. If anyone else knows, I can certainly code it up. I'll also ask Lauren sometime today when she's not busy.
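
One guess at what Sheets is doing, in case someone who knows the stats can confirm: a power trendline is commonly fitted by ordinary least squares on the logarithms of both axes, which would be easy enough to run ourselves. A minimal numpy sketch, assuming that's the method:

```python
import numpy as np

def power_trendline(ranks, votes):
    # fit votes ~ A * rank**B by ordinary least squares on log(rank), log(votes);
    # this may or may not be exactly what Sheets does for a "power series" trendline
    slope, intercept = np.polyfit(np.log(ranks), np.log(votes), 1)
    return np.exp(intercept), slope  # A, B
```

If that is how it works, a fit on the logs treats all the points much more evenly than a fit done directly in linear space, which might also be part of why different tools are giving us different exponents.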

By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.

Alternatively, and I think Dave has been implying this: we should just use -1 for the exponent, in accordance with Zipf's law. This would be tantamount to saying: deviation from -1 in the data we're working with represents noise – the extent to which these Scala votes are polluted with redundant scales, scales that aren't true scales, or otherwise aren't representative of future usage. I am open to just using -1 as the rank exponent. I think it's elegant and memorable and justifiable.
Without including prime limit or sopf (s = 0, u = 0), I find the best k to be around 0.8, no matter whether I use rank^-1, rank^-1.12 or rank^-1.37.
I'd bet including s and u affects k.
I've also noticed that weighting prime p according to p^a penalises primes greater than 13 way too much, for all reasonable values of a (≈ 1). So instead of p^a I tried π(p), i.e. I weighted each prime by its index. π(5)=3, π(7)=4, π(11)=5, etc. So instead of sopafr(), it's soπ(p)fr().

But I find that π(p) penalises higher primes too little.
The function Dave has found is called the prime-counting function. Nice find, Dave! Certainly worth considering.

Suppose we raise π(p) to some power? In its case the power would probably be > 1. I could see this being the right thing: that composers actually use higher primes in a pattern based not on their size but on their index in the sequence of primes, but then with not a linear weight on these indices but an exponential one.

I wouldn't want our final formula to use both sopafr and soπafr, but I will see if π can lower our sum-of-squares any further.
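
Concretely, the variant I mean to try looks something like this (a sketch; sympy's primepi gives π(p), and soπafr is spelled sopiafr here just to keep it ASCII):

```python
from sympy import factorint, primepi

def sopiafr(n, a):
    # like sopafr, but each prime p > 3 is weighted by its prime index raised to
    # the power a, rather than by p**a: π(5)=3, π(7)=4, π(11)=5, ...
    return sum(int(primepi(p)) ** a * e for p, e in factorint(n).items() if p > 3)
```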

I'm afraid I can't quite follow your explanation of your Excel workarounds. It's not your explanatory skills that are lacking. Just my math skills (and/or familiarity with Excel's limitations). But no rush on the checking. I am going to try out soπafr and I suppose I should also try out soπaf too. They may well move the needle.
User avatar
volleo6144
Posts: 81
Joined: Mon May 18, 2020 7:03 am
Location: Earth
Contact:

Re: developing a notational comma popularity metric

Post by volleo6144 »

cmloegcmluin wrote: Sun Jun 28, 2020 2:07 am Suppose we raise π(p) to some power? In its case the power would probably be > 1. I could see this being the right thing: that composers actually use higher primes in a pattern based not on their size but on their index in the sequence of primes, but then with not a linear weight on these indices but an exponential one.
I ran the numbers on a few of these:
Prime pi^1 pi^2 pi^3 pi^4 pi^1.5 pi^2.5
----- ---- ---- ---- ---- ------ ------
    5    3    9   27   81 5.1962 15.588
    7    4   16   64  256      8     32
   11    5   25  125  625  11.18 55.902
   13    6   36  216 1296 14.697 88.182
   17    7   49  343 2401  18.52 129.64
   19    8   64  512 4096 22.627 181.02
   23    9   81  729 6561     27    243
   29   10  100 1000 10 K 31.623 316.23 - first reversal: 31 is more popular than 29
   31   11  121 1331 15 K 36.483 401.31
   37   12  144 1728 21 K 41.569 498.83
π^1.5 or something close looks a little promising: an 11 is worth about two 5's, and a 19 is worth about two 11's.

Did you really mean "exponential"...? (It might help to know that π(p) ~ p / ln p for larger primes, which basically means that, around e^10 ~ 22,000, about one in every 10 numbers is prime on average. As it turns out, there are exactly 100 primes between 22,001 and 22,999.)
I'm afraid I can't quite follow your explanation of your Excel workarounds. It's not your explanatory skills that are lacking. Just my math skills (and/or familiarity with Excel's limitations). But no rush on the checking. I am going to try out soπafr and I suppose I should also try out soπaf too. They may well move the needle.
I imagine that the way Excel's optimizer works is by nudging your starting number around to minimize a cost function or something (or it's something close to Newton's method, in which case you can't really use absolute values either), and I'm also pretty sure Excel doesn't have a function to sort things without using an actual sort command (like, a function that you can just put in a formula).
I'm in college (a CS major), but apparently there's still a decent amount of time to check this out. I wonder if the main page will ever have 59edo changed to green...
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

volleo6144 wrote: Sun Jun 28, 2020 2:30 am Did you really mean "exponential"...?
I meant exponent > 1, as in your chart above.
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Using soπafr, the lowest sum of squares I can get is 0.0011856213167235174, with k = 0.64, a = 1.17, s = 0.25, and u = 0.14 (and r = -1.37). That is not as good as the previous best of 0.001101743627332945.

I haven't tried it with soπaf. I suppose I should try each of the four possible combinations. Though at this point I expect that π will not win the day.
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Here’s something that’s been bugging me a bit: considering the handful of situations we know of with commas we are going to run this metric against, most of those commas are “off the charts” in the sense that they have 0 votes. Is this metric really meaningful in the off-the-charts world? Do we feel like we’re truly tapping into the underlying forces that determine the precedence of notational commas, even extending into that uncharted territory? I would say hesitantly “yes” — otherwise I wouldn’t be here — but I wonder if anyone else has also doubted this.
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

cmloegcmluin wrote: Sun Jun 28, 2020 2:07 am By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.
I wrote above: "I call this the Zipf exponent, z."
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Dave Keenan wrote: Sun Jun 28, 2020 9:05 am
cmloegcmluin wrote: Sun Jun 28, 2020 2:07 am By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.
I wrote above: "I call this the Zipf exponent, z."
If we use the Zipf exponent for r, then I agree it could be z. We haven't agreed on that yet, though. In that same post I said:
cmloegcmluin wrote: Sun Jun 28, 2020 2:07 am Alternatively, and I think Dave has been implying this: we should just use -1 for the exponent, in accordance with Zipf's law. This would be tantamount to saying: deviation from -1 in the data we're working with represents noise – the extent to which these Scala votes are polluted with redundant scales, scales that aren't true scales, or otherwise aren't representative of future usage. I am open to just using -1 as the rank exponent. I think it's elegant and memorable and justifiable.
So are you agreeing with that? I understand you're hiking today so perhaps you didn't have time to compose a detailed response.

------

I experimented today with a soπaf(r) function and could not find any use for it that performed better than simply sopaf(r). And as I said before, I don't want to experiment with metrics that use both π and p.

With z (or r) = -1, I found that k = 1/2, a = 2/3, s = 1/4, u = 1/4 perform the best. I decided it was best for sopf to also be split by numerator and denominator, with the lesser of the two weighted by k, just as is the case for sopfr. The first ranking this fails on is 125/1. Its sum of squares is about 0.0058363. I know that's not as good as the number I threw out earlier, but that's because changing from r = -1.37 to r = -1 causes those sum-of-squares values to all grow quite a bit. These k, a, s, and u values are not the exact values I calculated (those were 0.48, 0.66, 0.23, and 0.24, respectively), but given the accuracy of the data we're working with, simplifying them to these values feels like the right level of confidence to assert. I also found that they did not change dramatically when changing r between -1 and -1.37. So the metric would be:

sopfr(num) + ½sopfr(den) +
¼(sopf(num) + ½sopf(den)) +
¼primelimit(num/den)

where the terms using num are always greater than the terms using den.
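
As a Python sketch of the same metric (the num/den assignment is handled by comparing the two sides' sopfr terms; whether the a = 2/3 prime weighting mentioned above belongs inside these sopfr terms isn't spelled out in the formula, so it's left as a parameter here):

```python
from sympy import factorint, primefactors

def sopfr_gt3(n, a=1):
    # sum of prime factors > 3, with repetition, each weighted as p**a
    return sum(p ** a * e for p, e in factorint(n).items() if p > 3)

def sopf_gt3(n):
    # sum of distinct prime factors > 3
    return sum(p for p in primefactors(n) if p > 3)

def prime_limit(n, d):
    # largest prime factor > 3 of the whole ratio, or 0 for 3-limit ratios
    ps = [p for p in primefactors(n * d) if p > 3]
    return max(ps) if ps else 0

def popularity_metric(n, d, k=0.5, s=0.25, u=0.25, a=2/3):
    # a = 2/3 comes from the prose above; set a = 1 to follow the displayed
    # formula literally. num is whichever side gives the greater sopfr term,
    # so the num terms are always >= the den terms.
    num, den = (n, d) if sopfr_gt3(n, a) >= sopfr_gt3(d, a) else (d, n)
    return (sopfr_gt3(num, a) + k * sopfr_gt3(den, a)
            + u * (sopf_gt3(num) + k * sopf_gt3(den))
            + s * prime_limit(n, d))
```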