developing a notational comma popularity metric
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
Here's what I found:
k=0.6
a=0.56
s=0.2 (that's the weight on prime limit, 's' as it was earlier)
u=0.1 (this is the weight on SoUPF>3, or sopf if we go with general math lingo)
This gets sum-of-squares across the first 80 entries down to 0.001101743627332945. That's with rank^(-1.37).
For comparison, SoPF>3 gives a sum-of-squares of 0.003006204375301944.
These values still result in ranking 125/1 worse than 49/1. That's the first place it goes wrong. But overall it appears to be the best. Let me know if anyone can independently confirm these values.
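For anyone wanting to independently check these numbers, here is a minimal Python sketch of one way the four parameters might combine. The sopafr(n) + k×sopafr(d) core is spelled out in a later post; exactly how the s (prime limit) and u (sopf) terms enter is my assumption from the discussion, so treat this as a sketch rather than the definitive formula.

```python
def prime_factors_gt3(n):
    """Prime factorization of n as {prime: exponent}, ignoring 2s and 3s (the '>3' part)."""
    factors = {}
    for p in (2, 3):
        while n % p == 0:
            n //= p
    p = 5
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 2
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def sopafr(n, a):
    """Sum of prime factors > 3 with repetition, each prime raised to the power a."""
    return sum(count * p**a for p, count in prime_factors_gt3(n).items())

def sopf(n):
    """Sum of unique prime factors > 3 (no repetition)."""
    return sum(prime_factors_gt3(n).keys())

def prime_limit(n, d):
    """Largest prime factor (> 3) across numerator and denominator."""
    ps = list(prime_factors_gt3(n)) + list(prime_factors_gt3(d))
    return max(ps) if ps else 0

def metric(n, d, k=0.6, a=0.56, s=0.2, u=0.1):
    # assumed combination: k weights the denominator term (per a later post);
    # applying u to sopf(n) + sopf(d) is my guess at how the sopf term enters
    return (sopafr(n, a) + k * sopafr(d, a)
            + s * prime_limit(n, d)
            + u * (sopf(n) + sopf(d)))
```

With k=1, a=1, s=0, u=0 this reduces to plain SoPF>3 of the ratio, which is the baseline being compared against.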
- Dave Keenan
- Site Admin
- Posts: 2180
- Joined: Tue Sep 01, 2015 2:59 pm
- Location: Brisbane, Queensland, Australia
- Contact:
Re: developing a notational comma popularity metric
I find that votes = 8254×rank^(-1.12) is a better match to the data than N×rank^(-1.37) when limiting to the first 80 ratios. But I also find that votes = 8280×rank^(-1.13) is the best fit when I include all 820 data points. I also note that the fewer data points I include, starting from the most popular, the closer the exponent gets to a true Zipf's law exponent of -1.
Without including prime limit or sopf (s = 0, u = 0), I find the best k to be around 0.8, no matter whether I use rank^(-1), rank^(-1.12) or rank^(-1.37).
I've also noticed that weighting prime p according to p^a penalises primes greater than 13 way too much, for all reasonable values of a (≈ 1). So instead of p^a I tried π(p), i.e. I weighted each prime by its index. π(5)=3, π(7)=4, π(11)=5, etc. So instead of sopafr(), it's soπ(p)fr().
But I find that π(p) penalises higher primes too little.
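A quick Python sketch of the soπ(p)fr idea, weighting each prime > 3 by its index in the prime sequence instead of by p^a (the prime list in the second function is truncated at 37 purely for illustration):

```python
def prime_index(p):
    """pi(p): the index of prime p in the prime sequence, so pi(2)=1, pi(3)=2, pi(5)=3, ..."""
    count, n = 0, 1
    while n < p:
        n += 1
        if all(n % q for q in range(2, int(n**0.5) + 1)):  # trial division primality check
            count += 1
    return count

def so_pi_fr(n):
    """Like sopfr>3, but each prime factor > 3 is weighted by its index pi(p) instead of by p."""
    total = 0
    for p in (5, 7, 11, 13, 17, 19, 23, 29, 31, 37):  # primes > 3 up to 37, for illustration
        while n % p == 0:
            total += prime_index(p)
            n //= p
    return total
```

For example, 125 = 5^3 scores 3×π(5) = 9 under this weighting, versus 15 under plain sopfr.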
- Dave Keenan
- Site Admin
- Posts: 2180
- Joined: Tue Sep 01, 2015 2:59 pm
- Location: Brisbane, Queensland, Australia
- Contact:
Re: developing a notational comma popularity metric
I should say that your k's and a's could well be optimal. I am not (yet) doing a sort on the metric (by which I really mean, on the results of applying the metric to the ratios) to obtain the estimated ranks. I am instead generating the estimated ranks by exponentiating the metric, i.e. taking a number slightly greater than 1, e.g. 1.126, and raising it to the power of the metric, i.e. est_rank = 1.126^metric, where metric = sopafr(n) + k×sopafr(d). I then use the Excel Solver to adjust k and a to minimise the sum of all the (est_rank^(-1.12) − scala_rank^(-1.12))^2. The -1.12 replaces the earlier -1.37. I call this the Zipf exponent, z.
However the 1.126 above, the base of the exponential, call it b, is not fixed. I include it with k and a as a variable to be adjusted by the Solver, to minimise the sum of the squared differences between the zipfed ranks.
I do this because I can't include a sort in a function to be optimised by the Excel Solver. But I should be able to check your results eventually. Hiking tomorrow, so it probably won't be until the day after.
- Attachments
-
- ImprovedSoPF.xlsx
- (713.72 KiB) Downloaded 209 times
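If I've read the workaround correctly, the sort-free trick is: map each ratio's metric value to an estimated rank via est_rank = b^metric, then minimise the squared differences between the zipfed ranks. A toy sketch of that idea in Python. The data here is entirely made up, and scipy's Nelder-Mead stands in for the Excel Solver; only b varies here, to keep the sketch minimal, whereas the spreadsheet also adjusts k and a.

```python
import numpy as np
from scipy.optimize import minimize

# made-up stand-ins: a metric value for each ratio, and its actual Scala rank
metric_values = np.array([4.0, 7.1, 9.5, 11.2, 14.8])
scala_ranks = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = -1.12  # the Zipf exponent

def loss(params):
    b = params[0]  # base of the exponential, the '1.126' above
    if b <= 1.0:
        return 1e18  # keep the search in the sensible region (b must exceed 1)
    est_rank = b ** metric_values  # sort-free estimated ranks
    return np.sum((est_rank ** z - scala_ranks ** z) ** 2)

result = minimize(loss, x0=[1.1], method="Nelder-Mead")
```

Because est_rank is a smooth function of b (no sort step), a derivative-free optimiser like this, or the Solver's GRG method, can adjust the parameters directly.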
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
Dave Keenan wrote: ↑Sat Jun 27, 2020 2:39 pm
I find that votes = 8254×rank^(-1.12) is a better match to the data than N×rank^(-1.37) when limiting to the first 80 ratios. But I also find that votes = 8280×rank^(-1.13) is the best fit when I include all 820 data points. I also note that the fewer data points I include, starting from the most popular, the closer the exponent gets to a true Zipf's law exponent of -1.

When limiting to the first 80 items, I get -1.38. Almost exactly the same as -1.37. When I extend it out to all 820 of them, I get -1.39, also very close.
How are you finding your match for this rank exponent?
I was about to say that I was using the regression analysis features built into Google Sheets, which I trusted well enough, except that when I went to spruce up this chart to share here on the forum, some really strange things happened that broke my trust in Sheets. I noticed right before I exported that what should have been merely stylistic changes to the chart had changed its contents: it now had an exponent of -1.49 instead of -1.38!
And as I undid my actions, I swear that I saw the exponent go through at least one other value other than -1.38! I guess I just hadn't thought to monitor the exponent while applying these styles, and it had been flip-flopping all over the place as I went along.
So I then sought to figure out how I had managed to get these other exponent values. And while I tried every different combination of the styling actions I'd taken, I failed to reproduce these other values besides -1.49. And unfortunately, having forked off that original undo history branch, I was unable to get it back, so I now couldn't prove that I had ever seen anything other than -1.49.
But it appears there might be an explanation for this second value. Whether I add the x-axis to the chart or include a header row to label the line, what seems to be happening is that Sheets then includes the first element in the list, 1/1's 7624 votes, where it wasn't including that before. So that's what causes my exponent to change from -1.38 to -1.49.
When I have it fit the line to all 820 points, this seeming inclusion or exclusion of the first data point does not affect the fit. It's locked at -1.39. Which suggests that maybe that's the one closer to the truth. Perhaps this "inclusion of the first data point" notion is actually more like Sheets' "inclusion of an extraneous imaginary data point before the start of the series".
So I'm not convinced I should trust Sheets, anymore. Surely whatever maths Sheets is running in the background, we can run ourselves. I just don't know enough stats to know how to find it, for a power series trendline. If anyone else knows, I can certainly code it up. I'll also ask Lauren sometime today when she's not busy.
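The maths behind a power trendline is just ordinary least squares after taking logs: fit ln(votes) = ln(N) + z·ln(rank), and z comes out as the slope. A sketch, assuming Sheets does the standard thing under the hood:

```python
import math

def power_trendline(ranks, votes):
    """Least-squares fit of votes ≈ N * rank**z, done in log-log space
    (the standard way a 'power' trendline is computed)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in votes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope of the log-log regression line is the exponent z
    z = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # intercept recovers the scale factor N
    N = math.exp(mean_y - z * mean_x)
    return N, z
```

Since every point enters this regression directly, silently dropping or re-including the rank-1 point changes the fitted slope, which would explain the -1.38 vs -1.49 flip-flopping.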
By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.
Alternatively, and I think Dave has been implying this: we should just use -1 for the exponent, in accordance with Zipf's law. This would be tantamount to saying: deviation from -1 in the data we're working with represents noise – the extent to which these Scala votes are polluted with redundant scales, scales that aren't true scales, or otherwise aren't representative of future usage. I am open to just using -1 as the rank exponent. I think it's elegant and memorable and justifiable.
Dave Keenan wrote: ↑Sat Jun 27, 2020 2:39 pm
Without including prime limit or sopf (s = 0, u = 0), I find the best k to be around 0.8, no matter whether I use rank^(-1), rank^(-1.12) or rank^(-1.37).

I'd bet including s and u affects k.
Dave Keenan wrote: ↑Sat Jun 27, 2020 2:39 pm
I've also noticed that weighting prime p according to p^a penalises primes greater than 13 way too much, for all reasonable values of a (≈ 1). So instead of p^a I tried π(p), i.e. I weighted each prime by its index. π(5)=3, π(7)=4, π(11)=5, etc. So instead of sopafr(), it's soπ(p)fr().
But I find that π(p) penalises higher primes too little.

The function Dave has found is called the prime-counting function. Nice find, Dave! Certainly worth considering.
Suppose we raise π(p) to some power? In its case the power would probably be > 1. I could see this being the right thing. That composers actually use higher primes in a pattern which is based not on their size but their index in the sequence of primes, but then not a linear weight on these indices but an exponential one.
I wouldn't want our final formula to use both sopafr and soπafr, but I will see if π can lower our sum-of-squares any further.
I'm afraid I can't quite follow your explanation of your Excel workarounds. It's not your explanatory skills that are lacking. Just my math skills (and/or familiarity with Excel's limitations). But no rush on the checking. I am going to try out soπafr and I suppose I should also try out soπaf too. They may well move the needle.
- volleo6144
- Posts: 81
- Joined: Mon May 18, 2020 7:03 am
- Location: Earth
- Contact:
Re: developing a notational comma popularity metric
cmloegcmluin wrote: ↑Sun Jun 28, 2020 2:07 am
Suppose we raise π(p) to some power? In its case the power would probably be > 1. I could see this being the right thing. That composers actually use higher primes in a pattern which is based not on their size but their index in the sequence of primes, but then not a linear weight on these indices but an exponential one.

I ran the numbers on a few of these:

Prime   pi^1   pi^2   pi^3   pi^4   pi^1.5   pi^2.5
  5       3      9     27     81    5.1962   15.588
  7       4     16     64    256    8        32
 11       5     25    125    625    11.18    55.902
 13       6     36    216   1296    14.697   88.182
 17       7     49    343   2401    18.52    129.64
 19       8     64    512   4096    22.627   181.02
 23       9     81    729   6561    27       243
 29      10    100   1000   10 K    31.623   316.23   - first reversal: 31 is more popular than 29
 31      11    121   1331   15 K    36.483   401.31
 37      12    144   1728   21 K    41.569   498.83

π^1.5 or something close looks a little promising: an 11 is worth about two 5's, and a 19 is worth about two 11's.
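Those columns are easy to regenerate; here is a quick Python check of the π(p)^e values, using a hard-coded index table (π(5)=3 through π(37)=12, matching the rows above):

```python
primes = [5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
pi = {p: i + 3 for i, p in enumerate(primes)}  # pi(5)=3, pi(7)=4, ..., pi(37)=12

# each row: prime index raised to the candidate exponents 1, 1.5, 2, 2.5
table = {p: [round(pi[p] ** e, 4) for e in (1, 1.5, 2, 2.5)] for p in primes}
```

For instance, under π^1.5 an 11 scores 5^1.5 ≈ 11.18, roughly twice the 5's score of 3^1.5 ≈ 5.196, matching the "an 11 is worth about two 5's" observation.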
Did you really mean "exponential"...? (It might help to know that π(p) ~ p / ln p for larger primes, which basically means that, around e^10 ≈ 22,000, about one in every 10 numbers is prime on average. As it turns out, there are exactly 100 primes between 22,001 and 22,999.)
cmloegcmluin wrote: ↑Sun Jun 28, 2020 2:07 am
I'm afraid I can't quite follow your explanation of your Excel workarounds. It's not your explanatory skills that are lacking. Just my math skills (and/or familiarity with Excel's limitations). But no rush on the checking. I am going to try out soπafr and I suppose I should also try out soπaf too. They may well move the needle.

I imagine that the way Excel's optimizer works is by nudging your starting number around to minimize a cost function or something (or it's something close to Newton's method, in which case you can't really use absolute values either), and I'm also pretty sure Excel doesn't have a function to sort things without using an actual sort command (like, a function that you can just put in a formula).
I'm in college (a CS major), but apparently there's still a decent amount of time to check this out. I wonder if the main page will ever have 59edo changed to green...
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
I meant exponent > 1, as in your chart above.
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
Using soπafr, the lowest sum of squares I can get is 0.0011856213167235174, with k = 0.64, a = 1.17, s = 0.25, and u = 0.14 (and r = -1.37). That is not as good as the previous best of 0.001101743627332945.
I haven't tried it with soπaf. I suppose I should try each of the four possible combinations. Though at this point I expect that π will not win the day.
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
Here’s something that’s been bugging me a bit: considering the handful of situations we know of with commas we are going to run this metric against, most of those commas are “off the charts” in the sense that they have 0 votes. Is this metric really meaningful in the off-the-charts world? Do we feel like we’re truly tapping into the underlying forces that determine the precedence of notational commas, even extending into that uncharted territory? I would say hesitantly “yes” — otherwise I wouldn’t be here — but I wonder if anyone else has also doubted this.
- Dave Keenan
- Site Admin
- Posts: 2180
- Joined: Tue Sep 01, 2015 2:59 pm
- Location: Brisbane, Queensland, Australia
- Contact:
Re: developing a notational comma popularity metric
cmloegcmluin wrote: ↑Sun Jun 28, 2020 2:07 am
By the way, we need to name this exponent we're raising the ranks to before taking their squared distance. How about r? We only used that once briefly earlier for Tenney height, which Dave found a simplification for and so we don't need it anymore.

I wrote above: "I call this the Zipf exponent, z."
- cmloegcmluin
- Site Admin
- Posts: 1704
- Joined: Tue Feb 11, 2020 3:10 pm
- Location: San Francisco, California, USA
- Real Name: Douglas Blumeyer (he/him/his)
- Contact:
Re: developing a notational comma popularity metric
Dave Keenan wrote: ↑Sun Jun 28, 2020 9:05 am
I wrote above: "I call this the Zipf exponent, z."

If we use the Zipf exponent for r, then I agree it could be z. We haven't agreed on that yet, though. In that same post I said:

cmloegcmluin wrote: ↑Sun Jun 28, 2020 2:07 am
Alternatively, and I think Dave has been implying this: we should just use -1 for the exponent, in accordance with Zipf's law. This would be tantamount to saying: deviation from -1 in the data we're working with represents noise – the extent to which these Scala votes are polluted with redundant scales, scales that aren't true scales, or otherwise aren't representative of future usage. I am open to just using -1 as the rank exponent. I think it's elegant and memorable and justifiable.

So are you agreeing with that? I understand you're hiking today so perhaps you didn't have time to compose a detailed response.
------
I experimented today with a soπaf(r) function and could not find any use for it that performed better than simply sopaf(r). And as I said before, I don't want to experiment with metrics that use both π and p.
With z (or r) = -1, I found k = 1/2, a = 2/3, s = 1/4, u = 1/4 performs the best. I decided it was best for sopf to also be split by numerator and denominator with the lesser of the two weighted by k, just as is the case for sopfr. The first ranking this fails on is 125/1. Its sum of squares is about 0.0058363. I know that's not as good as the number I threw out earlier, but that's because changing from r = -1.37 to r = -1 causes those sum of squares values to all grow quite a bit. These k, a, s, and u values are not the exact values I calculated, but I think for the accuracy of the data we're working with (actually I had 0.48, 0.66, 0.23, and 0.24, respectively) simplifying them to those values feels like the right level of assertion of confidence. I also found that they did not change dramatically when changing r between -1 and -1.37. So the metric would be:
sop⅔fr(num) + ½sop⅔fr(den) +
¼(sop⅔f(num) + ½sop⅔f(den)) +
¼primelimit(num/den)
where the terms using num are always greater than the terms using den.
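Putting that formula into code, as a sketch of my own reading of it: I resolve the "num terms are always greater" condition by assigning whichever side has the greater sop⅔fr value to the num role, which is an interpretation rather than something the post states outright.

```python
def factors_gt3(n):
    """Prime factorization of n as {prime: exponent}, ignoring 2s and 3s."""
    out = {}
    for p in (2, 3):
        while n % p == 0:
            n //= p
    p = 5
    while p * p <= n:
        while n % p == 0:
            out[p] = out.get(p, 0) + 1
            n //= p
        p += 2
    if n > 1:
        out[n] = out.get(n, 0) + 1
    return out

def sop23fr(n):
    """sop^(2/3)fr: sum of prime factors > 3 with repetition, each raised to 2/3."""
    return sum(c * p ** (2 / 3) for p, c in factors_gt3(n).items())

def sop23f(n):
    """sop^(2/3)f: as above, but without repetition."""
    return sum(p ** (2 / 3) for p in factors_gt3(n))

def final_metric(n, d):
    ps = list(factors_gt3(n)) + list(factors_gt3(d))
    prime_limit = max(ps) if ps else 0
    # 'num' is whichever side has the greater sop23fr value (my reading of the
    # condition that the terms using num are always greater than those using den)
    hi, lo = (n, d) if sop23fr(n) >= sop23fr(d) else (d, n)
    return (sop23fr(hi) + 0.5 * sop23fr(lo)
            + 0.25 * (sop23f(hi) + 0.5 * sop23f(lo))
            + 0.25 * prime_limit)
```

Note that under this reading the metric is symmetric in n and d, since the larger side always takes the full weight and the smaller side the ½ weight.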