
### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 2:13 am
Dave Keenan wrote:
Wed Jun 24, 2020 6:37 pm
I love SoUP or even SoUPF, versus SoPF, and so I am sad to report that these functions already have standard names
SoPF = sopfr (sum of prime factors with repetition)
SoUP = sopf (sum of prime factors )
https://mathworld.wolfram.com/SumofPrimeFactors.html
I'm glad you like it. And nice find! Now if only we could find an established name for a function that roughens a number x to n-rough, we could have a more proper way of writing SoPF>3. I looked around for a bit but didn't find anything. Perhaps it might simply be rough_n(x), and similarly we'd have smooth_n(x). Or perhaps rgh_n(x) and smth_n(x) would be preferable. Then we'd have sopfr(rgh_n(x)) instead of SoPF>3.
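For concreteness, here's a quick Python sketch of those tentative functions (the names rgh and sopfr are just the ones floated above, not established library functions):

```python
def prime_factors(x):
    """Prime factors of x with repetition, e.g. 50 -> [2, 5, 5]."""
    factors, d = [], 2
    while d * d <= x:
        while x % d == 0:
            factors.append(d)
            x //= d
        d += 1
    if x > 1:
        factors.append(x)
    return factors

def rgh(n, x):
    """The n-rough part of x: strip every prime factor smaller than n."""
    for p in prime_factors(x):
        if p < n:
            x //= p
    return x

def sopfr(x):
    """Sum of prime factors with repetition; sopfr(1) = 0."""
    return sum(prime_factors(x))

# SoPF>3 of 1029 = 3 * 7^3: stripping the 3 leaves 343, whose sopfr is 7+7+7.
print(sopfr(rgh(5, 1029)))  # -> 21
```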
I want to get into testing some ideas for this, in a spreadsheet, as is my wont. Given that we don't care about matching the frequencies, but only matching the frequency ranking (of the first 40 or so 5-rough ratios), what are you actually fitting your candidate functions to?
I've been thinking about that the last couple days, especially in light of my re-emphasizing of part of your original ask, to "[filter] out the historical noise". Maybe the way to approach the problem is: the frequencies are a useful tool to help us aim, but they're not the target. The rank is the target. It seems like you were already on that page...

I'm not sure if we need to define an exact success condition at this point. But I suppose if we managed to find a metric which matched the rankings for the first 40 or so commas, we'd've done fairly well for ourselves.
How might one directly measure how well one ranking matches another, i.e. the ranking produced by the candidate function and the ranking from the Scala archive statistics?
If our target is matching just the rankings, shouldn't we try to hit them exactly? Maybe I'm missing something.

I was thinking our candidate function would map a comma to some value, just as the Scala stats map a comma to a frequency value, and if sorting by each of them produced the same order, then we did it.

It sounds like maybe you were thinking our candidate function would map a comma to a value that was meant to look just like the value of a rank, e.g. it might try to map 11:1 to something really close to 6 because 11:1 is the 6th most popular comma. That could work too, but it seems like an extra step for our candidate function to do, and also an extra question for us to answer (this question of yours immediately above), and also maybe makes the problem unnecessarily more difficult. I just don't think, given how sparse and noisy the data is, we should shoot for anything higher fidelity than the sorting coming out the same.
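In other words, the success condition is just "descending-frequency order equals ascending-metric order" (assuming a metric where smaller means more popular, like SoPF>3). A tiny sketch of that check, with made-up numbers:

```python
def same_sort(freqs, metric_values):
    """True iff ranking items by descending frequency gives the same order as
    ranking them by ascending metric value (assumes no ties)."""
    by_freq = sorted(range(len(freqs)), key=lambda i: -freqs[i])
    by_metric = sorted(range(len(metric_values)), key=lambda i: metric_values[i])
    return by_freq == by_metric

# Illustrative numbers only: three commas with frequencies and metric scores.
print(same_sort([100, 40, 9], [5, 7, 10]))  # -> True (orders agree)
print(same_sort([100, 40, 9], [7, 5, 10]))  # -> False (top two swapped)
```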

### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 4:20 am
cmloegcmluin wrote:
Thu Jun 25, 2020 2:13 am
It sounds like maybe you were thinking our candidate function would map a comma to a value that was meant to look just like the value of a rank, e.g. it might try to map 11:1 to something really close to 6 because 11:1 is the 6th most popular comma. That could work too, but it seems like an extra step for our candidate function to do, and also an extra question for us to answer (this question of yours immediately above), and also maybe makes the problem unnecessarily more difficult. I just don't think, given how sparse and noisy the data is, we should shoot for anything higher fidelity than the sorting coming out the same.
My thoughts pretty much exactly: while you can fit an (n-1)th-degree polynomial to any n (x,y) data points (unless two of them share the same x-value), they usually don't do much good at predicting any future terms. Fitting a sixth-degree polynomial (x = rank, y = SoPF>3) to the first seven data points here (the Spartan set) gives (x-1)(x^5 + 16x^4 - 529x^3 + 4076x^2 - 12660x + 16560)/720. This polynomial perfectly reproduces the seven points (1,0), (2,5), (3,7), (4,10), (5,12), (6,11), and (7,12), but if you go any further in either direction, everything goes wrong: (8,35), (9,124), (10,357), (0,-23).
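The blow-up is easy to verify exactly with rational arithmetic:

```python
from fractions import Fraction

def p(x):
    """The interpolating sixth-degree polynomial quoted above, evaluated exactly."""
    x = Fraction(x)
    return (x - 1) * (x**5 + 16*x**4 - 529*x**3 + 4076*x**2 - 12660*x + 16560) / 720

print([p(x) for x in range(1, 8)])  # -> [0, 5, 7, 10, 12, 11, 12] (the fitted points)
print(p(8), p(9), p(10), p(0))     # -> 35 124 357 -23 (the extrapolation going wrong)
```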

Making the sort come out the same is also going to be particularly hard, because 1:29 is less popular than 1:31. And there's even something in between: 1:343 (the >3 content of 1024:1029, the difference between three M2 and a P5), with a significantly lower SoPF>3.

### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 6:46 am
volleo6144 wrote:
Thu Jun 25, 2020 4:20 am
Making the sort come out the same is also going to be particularly hard, because 1:29 is less popular than 1:31. And there's even something in between: 1:343 (the >3 content of 1024:1029, the difference between three M2 and a P5), with a significantly lower SoPF>3.
I feel like there's gotta be some pithier and clearer term for "they sort the same"... like, mutually monotonic?

Anyway, that's a good observation re: 29:1 and 31:1's popularity order being switched (vis-à-vis their prime limit). That could mean either:
1. that should be our cut-off point. It's only 32 entries into the list – less than the 40 count suggestion which Dave threw out earlier, but maybe 32 would suffice. Or,
2. the weight on primes may need to be more complex than something like what we've looked at already, p^0.9, i.e. it may need to take harmonic concepts into account. If someone has an argument for why the 31st harmonic is more useful to compose with than the 29th, and can manifest said argument in the form of a mathematical formula, we could experiment with that.
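For experimenting, the p^0.9-style weighting is a one-line tweak to sopfr. A sketch (the power parameter is the knob to play with; 0.9 is just the value mentioned above):

```python
def weighted_sopfr(x, power=0.9):
    """Sum of prime factors with repetition, each prime p weighted as p**power.

    power=1 recovers plain sopfr; power=0.9 is one candidate weighting."""
    total, d = 0.0, 2
    while d * d <= x:
        while x % d == 0:
            total += d ** power
            x //= d
        d += 1
    if x > 1:
        total += x ** power
    return total

# 35 = 5 * 7, so this is 5**0.9 + 7**0.9 (a bit over 10), versus sopfr(35) = 12.
print(weighted_sopfr(35))
print(weighted_sopfr(35, power=1))  # -> 12.0
```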

### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 1:55 pm
I've just had the thought that perhaps we should strike from this topic further discussion of non-2,3-reduced (aka non-5-rough) (aka "full"?) ratios. I believe what we should focus on here are the 5-rough (aka "notational") ratios.

In each of the three scenarios where I personally need to apply this notational comma popularity metric, I do happen to know the approximate size in cents of the ratios I'm dealing with, and thus I do know their full ratio. However, I am thinking now that any sort of badness metric for those ratios should be independent of this metric (for those here also on the Magrathean diacritics thread, what I mean specifically is that evaluation of the full ratio would be another submetric we should fold into a consolidated badness metric along with whatever we come up with here re: the notational comma, plus whatever else we've made plans for over there – abs3exp, tina error, apotome slope, 5-schisma slope, etc.).

When Dave shared the Scala stats out (here) he shared two spreadsheets: one with comma popularities, which is what we've been working most with, but another one with ratio popularities. I thought briefly that we should experiment with fitting a metric to the ratio popularity rankings, but then I thought better of it. The comma popularities sheet was almost certainly engineered from the ratio popularities one. @Dave Keenan can probably confirm this. My expectation is that for each ratio, they figured out which comma would be required to notate it relative to a Pythagorean chain of fifths. My point is that when we're looking for a best comma – whether between the 1/121k and 1/1225k, or for the 75th mina, or for the tinas – we're not looking for ratios which are themselves popular, but ones which enable the notation of popular commas.

I may not be making myself clear, since my command of these concepts is tenuous, but hopefully that makes sense, then, why I think we should leave aside talk of the full ratios.

Or maybe I'm just a bit desperate to start closing in on solutions, rather than expanding the problem's possibility space.

### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 3:02 pm
I totally agree. In this thread we should stick to the very specific sub-problem of ranking a comma according to the combined popularities of all the pitch ratios it can notate. Hence the removal of the factors of 2 and 3.
cmloegcmluin wrote:The comma popularities sheet was almost certainly engineered from the ratio popularities one. @Dave Keenan can probably confirm this.
Yes. For example, in popularityOfCommas.xlsx, the count for 5/1, namely 5371, is the sum of the counts of ... 40/27 10/9 5/3 5/4 15/8 45/32 ... and ... 27/20 9/5 6/5 8/5 16/15 64/45 ... and their octave extensions, because these can all be notated with a 5-comma symbol (plus nominals and sharps or flats). But in this thread we're not distinguishing between different sizes of 5-comma, e.g. the 5-schisma or the 5-Comma. Any one will do.
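That aggregation step can be sketched as: 2,3-reduce each pitch ratio, fold the two directions together, and sum the counts. (The counts below are made up for illustration; the real ones are in popularityOfCommas.xlsx.)

```python
from collections import Counter

def reduce_23(n):
    """Strip all factors of 2 and 3 from n."""
    while n % 2 == 0:
        n //= 2
    while n % 3 == 0:
        n //= 3
    return n

def comma_key(num, den):
    """2,3-reduce a pitch ratio and ignore its direction, so e.g. 40/27 and
    27/20 both map to the 5-comma's key (5, 1)."""
    a, b = reduce_23(num), reduce_23(den)
    return (a, b) if a >= b else (b, a)

# Made-up counts, for illustration only.
ratio_counts = {(40, 27): 3, (10, 9): 10, (27, 20): 2, (5, 4): 50, (7, 4): 8}
comma_counts = Counter()
for (num, den), count in ratio_counts.items():
    comma_counts[comma_key(num, den)] += count
print(comma_counts)  # (5, 1) collects 3 + 10 + 2 + 50 = 65; (7, 1) collects 8
```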

### Re: developing a notational comma popularity metric

Posted: Thu Jun 25, 2020 10:16 pm
I'm happy to accept a popularity metric that ranks prime 31 as less popular than prime 29, even though this is the reverse of the Scala archive stats, provided it gets enough other things right.

This looks like what I want for comparing rankings:
https://en.wikipedia.org/wiki/Spearman% ... oefficient
If I use ordinal ranking, so every rank is a unique integer (no ties), then this simplifies to minimising the sum of the squares of the differences between the two ranks for each 2,3-reduced ratio.

### Re: developing a notational comma popularity metric

Posted: Fri Jun 26, 2020 2:16 am
Dave Keenan wrote:
Thu Jun 25, 2020 10:16 pm
This looks like what I want for comparing rankings:
https://en.wikipedia.org/wiki/Spearman% ... oefficient
If I use ordinal ranking, so every rank is a unique integer (no ties), then this simplifies to minimising the sum of the squares of the differences between the two ranks for each 2,3-reduced ratio.
Whoa, good find! My partner Lauren, who I mentioned earlier in this thread as being pretty good with stats, had heard about this correlation, but couldn't recall what it was.

So, just to be clear: we sort the occurrence counts from Scala into a popularity rank, and then sort whatever values our candidate metric generates into a popularity rank approximation, and then we take those two lists of integers — the former of which will be exactly monotonic, and the latter of which will be as monotonic as we can manage — and run them through this summation-of-SED formula to get a single result, Spearman's rank correlation coefficient, or ρ ("rho")*. So the "two ranks" for each ratio you mention are the actual rank and our approximate rank. And then we rinse and repeat for each candidate metric until we find the one which gives the best ρ (the smallest sum of squared differences, i.e. the ρ closest to 1).

If so, I agree we should use this technique.

*Actually, the formula looks to be a tad more complex than that. You extracted the important bit. But we actually need to take the above value, multiply by 6, divide by n(n^2 - 1) where n is the number of observations (about 800 of them, we have), and then subtract all that from 1.
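The no-ties shortcut formula is short enough to sketch directly. One wrinkle: occurrence counts rank high-is-popular while a sopfr-style metric ranks low-is-popular, so negate one list (or rank it descending) before comparing.

```python
def ordinal_ranks(values):
    """Rank values 1..n by ascending value (ordinal ranking: assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho via the no-ties shortcut: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ordinal_ranks(xs), ordinal_ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0 (perfect agreement)
print(spearman_rho([1, 2, 3, 4], [40, 30, 20, 10]))  # -> -1.0 (perfect reversal)
```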

I don't suppose there's an established method of weighting each term's SED with a falloff of importance, to represent how we care very much about the ranking being accurate at the top of the list but less and less so as we go deeper into the list, due to having fewer data points the deeper we go? It feels like a common enough situation that an established method could exist. Otherwise we could develop an appropriate method. That or set a hard cut-off, though that feels a bit clumsy.
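For what it's worth, here's one ad-hoc way such a falloff could be encoded. The exponential decay and the falloff parameter are entirely made up here, not an established statistic; it just scales each squared rank difference by how deep in the list the item sits.

```python
def weighted_rank_distance(actual_ranks, candidate_ranks, falloff=0.95):
    """Ad-hoc weighted analogue of the sum of squared rank differences:
    each item's squared rank difference is scaled by falloff**(actual rank),
    so disagreements near the top of the list cost more."""
    return sum(
        (a - c) ** 2 * falloff ** a
        for a, c in zip(actual_ranks, candidate_ranks)
    )

# Swapping the top two items costs more than swapping the bottom two.
print(weighted_rank_distance([1, 2, 3], [2, 1, 3]))
print(weighted_rank_distance([1, 2, 3], [1, 3, 2]))
```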

Out of curiosity, I ran our metric to beat, SoPF>3 (or should we start calling it sopfr(rgh_5)?) against this. It gives ρ = 0.69655172413 when applied up to where 29:1 appears in the list. I don't have the result for the whole list yet, because I haven't modified my code to handle SoPF>3 for numbers with huge primes in them.

### Re: developing a notational comma popularity metric

Posted: Fri Jun 26, 2020 4:03 am
cmloegcmluin wrote:
Thu Jun 25, 2020 6:46 am
I feel like there's gotta be some pithier and clearer term for "they sort the same"... like, mutually monotonic?

Anyway, that's a good observation re: 29:1 and 31:1's popularity order being switched (vis-à-vis their prime limit). That could mean either:
2. the weight on primes may need to be more complex than something like what we've looked at already, p^0.9, i.e. it may need to take harmonic concepts into account. If someone has an argument for why the 31st harmonic is more useful to compose with than the 29th, and can manifest said argument in the form of a mathematical formula, we could experiment with that.
After looking at the next few primes' statistics, I noticed another few primes that appear before smaller primes, with potential explanations:
• 31 (because 31:32)
• 47 (because 47:48)
• 41 appears after 43 (probably because of 81:82 being within 2 tinas of 80:81?)
• 97 appears before 67-89 (because 96:97)
I have no idea how this could be systematically extended to an actual function of each prime...
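One crude way to make the pattern systematic: check, for each prime p, whether p-1 or p+1 is 3-smooth (a power of 2 times a power of 3), i.e. whether p sits one step away from a Pythagorean pitch. A sketch with a naive primality test (extending the check to octave multiples 2^k * p would be needed to also catch cases like 81:82 for 41):

```python
def is_3_smooth(n):
    """True if n has no prime factor greater than 3."""
    while n % 2 == 0:
        n //= 2
    while n % 3 == 0:
        n //= 3
    return n == 1

def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

boosted = [p for p in range(5, 100) if is_prime(p)
           and (is_3_smooth(p - 1) or is_3_smooth(p + 1))]
print(boosted)  # includes 31 (32), 47 (48), 97 (96); excludes 29, 41, 43
```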

### Re: developing a notational comma popularity metric

Posted: Fri Jun 26, 2020 6:26 am
I do not currently have access to a powerful hunk of math software such as MATLAB or Mathematica. WolframAlpha's online regression analysis tools seem to be somewhat limited; specifically, they only ever work (for me, anyway) with data in three dimensions or fewer. If we could find a way to do a regression analysis on the prime exponent vectors (monzos, or kets) of these notational ratios along with their popularities, we could find a big ol' polynomial with different coefficients for each prime. Dunno if quadratic would work or if we'd have to go cubic or quartic, but that might do the trick.
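FWIW, an ordinary-least-squares fit on monzo exponent vectors doesn't need MATLAB; the normal equations can be solved in a few lines. A sketch in Python, where the feature rows (absolute exponents of primes 5, 7, 11) and the targets are made-up toy data, just to show the mechanics:

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def least_squares(X, y):
    """Ordinary least squares via the normal equations X^T X beta = X^T y."""
    n, k = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    return solve(XtX, Xty)

# Toy demo: rows are |monzo| entries for primes (5, 7, 11); targets are made up.
X = [[1, 0, 0], [0, 1, 0], [2, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
y = [5.0, 7.0, 10.0, 12.0, 11.0, 16.0]
print(least_squares(X, y))  # -> approximately [5.0, 7.0, 11.0]
```

This is the linear (one-coefficient-per-prime) case; the quadratic or cubic version would just add squared or cubed exponent columns to X.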

### Re: developing a notational comma popularity metric

Posted: Fri Jun 26, 2020 7:24 am
cmloegcmluin wrote:
Fri Jun 26, 2020 2:16 am
I don't suppose there's an established method of weighting each term's SED with a falloff of importance, to represent how we care very much about the ranking being accurate at the top of the list but less and less so as we go deeper into the list, due to having fewer data points the deeper we go? It feels like a common enough situation that an established method could exist. Otherwise we could develop an appropriate method. That or set a hard cut-off, though that feels a bit clumsy.
I skimmed this paper, and can't say I 100% get it yet, but it seems like it might be just what I was looking for: https://pdfs.semanticscholar.org/9883/b ... 01af5a.pdf