developing a notational comma popularity metric

User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

cmloegcmluin wrote: Fri Jun 26, 2020 2:16 am So, just to be clear: we sort the occurrence counts from Scala into a popularity rank, and then sort whatever values our candidate metric generates into a popularity rank approximation, and then we take those two lists of integers — the former of which will be exactly monotonic, and the latter of which will be as monotonic as we can manage — and run them through this summation-of-SED formula to get a single result, Spearman's rank correlation coefficient, or ρ ("rho")*. So the "two ranks" for each ratio you mention are the actual rank and our approximate rank. And then we rinse and repeat for each candidate metric until we find the one which gives the best (largest) ρ.
Yes. Except you don't really need to say "summation of SED" since "SED" already is the summation.
*Actually, the formula looks to be a tad more complex than that. You extracted the important bit. But we actually need to take the above value, multiply by 6, divide by n(n² - 1) where n is the number of observations (about 800 of them, we have), and then subtract all that from 1.
Sure. I should have written "then maximising Spearman's coefficient simplifies to minimising the sum of the squares of the differences between the two ranks for each 2,3-reduced ratio".
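For concreteness, here's a minimal Python sketch of that calculation. It's a toy under a couple of assumptions: the two rank lists are already paired up ratio-for-ratio, and there are no tied ranks (which the simple form of the formula requires):

```python
def spearman_rho(actual_ranks, approx_ranks):
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference
    # between the actual rank and the approximated rank for each ratio
    n = len(actual_ranks)
    sum_sq_diffs = sum((a - b) ** 2 for a, b in zip(actual_ranks, approx_ranks))
    return 1 - 6 * sum_sq_diffs / (n * (n ** 2 - 1))

# a perfect approximation gives rho = 1; one swapped pair pulls it down
print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(spearman_rho([1, 2, 3, 4], [2, 1, 3, 4]))  # 0.8
```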
I don't suppose there's an established method of weighting each term's SED with a falloff of importance, to represent how we care very much about the ranking being accurate at the top of the list but less and less so as we go deeper into the list, due to having fewer data points the deeper we go?
Good idea. But I don't like the weighting of that paper you linked. It isn't a weighted Spearman's, it's a weighted Pearson's. And the weighted rank is 0.9^rank.

I think we should stick to Spearman's, and the weighted rank should be rank^-1.37.
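Something along these lines is what I have in mind. A sketch only, and the specific way the weight enters here (multiplying each squared rank difference by the actual rank raised to the -1.37 power) is my assumption of how we'd apply it, not anything settled:

```python
def weighted_sum_sq_rank_diffs(actual_ranks, approx_ranks, power=-1.37):
    # weight each squared rank difference by actual_rank ** -1.37, so errors
    # near the top of the popularity list count far more than errors far down it
    return sum(
        (a ** power) * (a - b) ** 2
        for a, b in zip(actual_ranks, approx_ranks)
    )
```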
... SOPF>3 (or should we start calling it sopfr(rgh5)?) ...
I'd rather not.
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

cmloegcmluin wrote: Fri Jun 26, 2020 6:26 am I do not currently have access to a powerful hunk of math software such as MATLAB or Mathematica. WolframAlpha's online regression analysis tools seem to be somewhat limited; specifically, they only ever work (for me, anyway) with data in three dimensions or fewer. If we could find a way to do a regression analysis on the prime exponent vectors (monzos, or kets) of these notational ratios along with their popularities, we could find a big ol' polynomial w/ different coefficients for each prime. Dunno if quadratic would work or if we'd have to go cubic or quartic, but that might do the trick.
I use Excel's Solver when I want to do that kind of thing. But independent weights for each prime still won't give you different ranks for 7/5 and 35/1.

I'm curious how much of a decrease in sum-of-squares-of-differences in rank^-1.37 you can get by going from sopfr(n.d) to
sopfr(n.d) + k.abs[sopfr(n) - sopfr(d)]
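If it helps for trying this without Excel, here's a rough Python sketch of that candidate. sopfr here is just the sum of prime factors with repetition, n/d is assumed to already be the 2,3-reduced (5-rough) ratio, and the function names are placeholders:

```python
def prime_factors(m):
    # prime factorisation with repetition, by trial division (fine at these sizes)
    factors, p = [], 2
    while p * p <= m:
        while m % p == 0:
            factors.append(p)
            m //= p
        p += 1
    if m > 1:
        factors.append(m)
    return factors

def sopfr(m):
    # sum of prime factors with repetition; sopfr(1) = 0
    return sum(prime_factors(m))

def candidate(n, d, k):
    # sopfr(n.d) + k.abs[sopfr(n) - sopfr(d)], with n/d already 2,3-reduced
    return sopfr(n * d) + k * abs(sopfr(n) - sopfr(d))
```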
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

Whatever metric we come up with must treat a/b the same as b/a. So we are free to arrange our ratios so that sopfr(n) ≥ sopfr(d). In that case
sopfr(n.d) + k.abs[sopfr(n) - sopfr(d)]
simplifies to
sopfr(n.d) + k[sopfr(n) - sopfr(d)]
= sopfr(n) + sopfr(d) + k.sopfr(n) - k.sopfr(d)
= (1+k)sopfr(n) + (1-k)sopfr(d)
And since multiplying everything by a constant does not affect the ranking, this is equivalent to
sopfr(n) + (1-k)/(1+k).sopfr(d)
which is just a different k, and so is equivalent to
sopfr(n) + k.sopfr(d)
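A quick numerical check of that equivalence, as a sketch (the particular k and the sopfr pairs are arbitrary, and sopfr(n) ≥ sopfr(d) is assumed throughout):

```python
k = 0.4
k_prime = (1 - k) / (1 + k)

# arbitrary (sopfr(n), sopfr(d)) pairs with sopfr(n) >= sopfr(d)
pairs = [(12, 0), (7, 5), (11, 10), (21, 0)]

for sn, sd in pairs:
    original = (sn + sd) + k * (sn - sd)      # sopfr(n.d) + k[sopfr(n) - sopfr(d)]
    rescaled = (1 + k) * (sn + k_prime * sd)  # (1+k) times sopfr(n) + k'.sopfr(d)
    assert abs(original - rescaled) < 1e-9    # equal up to the constant factor (1+k)
```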
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Dave Keenan wrote: Fri Jun 26, 2020 8:22 am Except you don't really need to say "summation of SED" since "SED" already is the summation.
Well that's confusing! The "S" stands for squared, but it does seem like adding "S" to "ED" does way more than square it, conventionally. Thanks for pointing that out.
I think we should stick to Spearman's, and the weighted rank should be rank^-1.37.
Love it.
Dave Keenan wrote: Fri Jun 26, 2020 8:40 am But independent weights for each prime still won't give you different ranks for 7/5 and 35/1.
Wouldn't it? In the case of 7/5, the 5-term of the monzo is negative, while in 35/1 it's positive. Couldn't that affect the outcome? That's one reason why I suggested this approach.
I'm curious how much of a decrease in sum-of-squares-of-differences in rank^-1.37 you can get by going from sopfr(n.d) to
sopfr(n.d) + k.abs[sopfr(n) - sopfr(d)]
I'll report back soon. Does the "k" in k.abs mean anything?

I can also try abs(n - d) where n/d is the 5-rough ratio, unless you have some reason to henceforth prefer abs(sopfr(n) - sopfr(d)).
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

The k is a multiplying factor which is adjusted to minimise the sum of squared errors in the weighted ranks.
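For instance, a crude grid search would do. This is a sketch only; candidate, the objective, and ranks_from_scores are placeholders along the lines of the earlier snippets, not anything taken from the Scala data:

```python
def ranks_from_scores(scores):
    # convert metric scores to ranks: the smallest score gets rank 1
    # (ties are broken arbitrarily here, which is crude but serviceable)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def fit_k(ratios, actual_ranks, candidate, objective, k_values):
    # try each k, rank the ratios by the candidate metric, and keep the k
    # whose ranking has the smallest (weighted) sum of squared rank errors
    best_k, best_err = None, float("inf")
    for k in k_values:
        approx_ranks = ranks_from_scores([candidate(n, d, k) for n, d in ratios])
        err = objective(actual_ranks, approx_ranks)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```

k_values could be as coarse as [i / 100 for i in range(301)], refined around whatever comes out best.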
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

Ah ha! I see now. The usage of . as a multiplication symbol is not intuitive to me.
sopfr(n) + k.sopfr(d)
:heavy_check_mark: :100:
User avatar
volleo6144
Posts: 81
Joined: Mon May 18, 2020 7:03 am
Location: Earth
Contact:

Re: developing a notational comma popularity metric

Post by volleo6144 »

cmloegcmluin wrote: Ah ha! I see now. The usage of . as a multiplication symbol is not intuitive to me.
That thing has an SSL certificate that's expired by about a week...

Also, coming from a pure math background, it might be helpful to note that various metrics often only use a number's square instead of its absolute value because the absolute value function's properties at 0 (it isn't differentiable there) aren't the best.
cmloegcmluin wrote: I can also try abs(n - d) where n/d is the 5-rough ratio, unless you have some reason to henceforth prefer abs(sopfr(n) - sopfr(d)).
Well, this heavily penalizes ratios like 1:343 (:~|): - :|): = 1024:1029 = :,::~|:, an inexact SoF at the Promethean level) and 1:1225 (:(|\: - :/|): = 32768:33075 = :,::|~:)—more than ratios such as 1:341 (:/|\: - :`::(/|: = 1023:1024 ~ :,::)|:, off by 6479:6480) and 1:299 (:)~||: + :`::(|\: = 8073:8192 ~ :.::|||):, off by 76544:76545)—which ... doesn't feel right.
I'm in college (a CS major), but apparently there's still a decent amount of time to check this out. I wonder if the main page will ever have 59edo changed to green...
User avatar
cmloegcmluin
Site Admin
Posts: 1700
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer (he/him/his)
Contact:

Re: developing a notational comma popularity metric

Post by cmloegcmluin »

volleo6144 wrote: Fri Jun 26, 2020 11:57 am Well, this heavily penalizes ratios like 1:343 (:~|): - :|): = 1024:1029 = :,::~|:, an inexact SoF at the Promethean level) and 1:1225 (:(|\: - :/|): = 32768:33075 = :,::|~:)—more than ratios such as 1:341 (:/|\: - :`::(/|: = 1023:1024 ~ :,::)|:, off by 6479:6480) and 1:299 (:)~||: + :`::(|\: = 8073:8192 ~ :.::|||):, off by 76544:76545)—which ... doesn't feel right.
I'm not sure I understand exactly what you mean:

1:343 abs(n - d) = 343 - 1 = 342
1:341 abs(n - d) = 341 - 1 = 340

1:343 abs(sopfr(n) - sopfr(d)) = 7 + 7 + 7 = 21
1:341 abs(sopfr(n) - sopfr(d)) = 11 + 31 = 42

It looks like abs(sopfr(n) - sopfr(d)) maybe does a better job, but I don't see a "heavy [penalty]" for abs(n - d).
cmloegcmluin wrote: Fri Jun 26, 2020 9:06 am
Dave Keenan wrote: Fri Jun 26, 2020 8:22 am I think we should stick to Spearman's, and the weighted rank should be rank^-1.37.
Love it.
Raising the ranks to the -1.37 power feels right, since it weights the ranks in the same proportion as the counts of data points themselves (approximately, per my earlier findings of the best-fit line for them). And it also means we don't need to set a cut-off, as the later entries in the popularity list hardly affect it at all. But one issue is that it results in ρ coming out extremely close to 1 in every case, so it's hard to tell whether our metric is truly an improvement. SoPF>3 already has ρ = 0.9999999998222343! That said, k = 1.5 maximizes ρ = 0.9999999998823996 (it's some number near 1.5; I don't know the exact range within which ρ = 0.9999999998823996, but for a decent slice of k around 1.5, the ranks aren't sorting any better or worse).

So perhaps our candidate function is: sopfr(n) + (3/2)sopfr(d)?

The only thing that's disappointing is that because k > 1, it actually means that more balanced ratios get worse scores:

35:1 → 12
7:5 → 14.5

so again, we find that these pairs of low primes are the exceptions, not the pattern.
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

cmloegcmluin wrote: Fri Jun 26, 2020 9:06 am
Dave Keenan wrote: Fri Jun 26, 2020 8:22 am Except you don't really need to say "summation of SED" since "SED" already is the summation.
Well that's confusing! The "S" stands for squared, but it does seem like adding "S" to "ED" does way more than square it, conventionally. Thanks for pointing that out.
The Euclidean distance is where you take all the differences, square them, sum the squares, then take the square root. It's a generalisation to n dimensions of Pythagoras' theorem for finding the hypotenuse. Finding the square of the Euclidean distance simply has the effect of undoing that last step where you took the square root. So I totally agree, it is a confusing term. Better to just not do that step in the first place, and so call it the "sum of squared errors" or "sum of squared differences", often abbreviated to just "sum of squares".
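In symbols, just restating the above for two rank lists x and y of length n:

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \qquad \mathrm{SED}(x, y) = d(x, y)^2 = \sum_{i=1}^{n} (x_i - y_i)^2 \]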
Dave Keenan wrote: Fri Jun 26, 2020 8:40 am But independent weights for each prime still won't give you different ranks for 7/5 and 35/1.
Wouldn't it? In the case of 7/5, the 5-term of the monzo is negative, while in 35/1 it's positive. Couldn't that affect the outcome?
Wouldn't it be the treating of positive exponents differently from negative exponents that made the difference?
I can also try abs(n - d) where n/d is the 5-rough ratio, unless you have some reason to henceforth prefer abs(sopfr(n) - sopfr(d)).
It's that thing I mentioned earlier. Sopfr() is a kind of logarithm. It feels wrong to add numbers to their logarithms. They feel like incommensurate things, like adding pascals (sound pressure) to decibels (log of sound pressure).
User avatar
Dave Keenan
Site Admin
Posts: 2180
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

Re: developing a notational comma popularity metric

Post by Dave Keenan »

cmloegcmluin wrote: Fri Jun 26, 2020 4:02 pm Raising the ranks to the -1.37 power feels right, since it weights the ranks in the same proportion as the counts of data points themselves (approximately, per my earlier findings of the best-fit line for them). And it also means we don't need to set a cut-off, as the later entries in the popularity list hardly affect it at all.
Agreed.
But one issue is that it results in ρ coming out extremely close to 1 in every case, so it's hard to tell whether our metric is truly an improvement. SoPF>3 already has ρ = 0.9999999998222343! That said, k = 1.5 maximizes ρ = 0.9999999998823996 (it's some number near 1.5; I don't know the exact range within which ρ = 0.9999999998823996, but for a decent slice of k around 1.5, the ranks aren't sorting any better or worse).
I wouldn't bother calculating ρ. I'd just look at the sum of squared errors. But I'm curious how you're getting from the sum of squared errors in rank^-1.37 to ρ. I wouldn't have a clue how to normalise that.
So perhaps our candidate function is: sopfr(n) + (3/2)sopfr(d)?

The only thing that's disappointing is that because k > 1, it actually means that more balanced ratios get worse scores:

35:1 → 12
7:5 → 14.5

so again, we find that these pairs of low primes are the exceptions, not the pattern.
I don't get it. I thought rank^-1.37 would give much more weight to the combinations of lower primes, like 7/5 and 35/1. I felt sure that would pull k to be less than 1.

I hope you swapped numerators and denominators where required to ensure sopfr(n) ≥ sopfr(d). For example, 25/11 would need to become 11/25 because sopfr(11) = 11 and sopfr(25) = 5+5 = 10. That's what lets us avoid taking absolute values, and lets us use the simplification sopfr(n) + k*sopfr(d).
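To spell that step out, a small sketch (reusing the hypothetical sopfr from the earlier snippet):

```python
def orient(n, d):
    # swap so that sopfr(n) >= sopfr(d); e.g. 25/11 becomes 11/25
    return (n, d) if sopfr(n) >= sopfr(d) else (d, n)

def metric(n, d, k):
    n, d = orient(n, d)
    return sopfr(n) + k * sopfr(d)

print(orient(25, 11))  # (11, 25), since sopfr(25) = 10 < sopfr(11) = 11
```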

You could also try substituting your sop0.9fr() for sopfr(). It's unclear to me whether that can make any difference to the ranking.