## developing a notational comma popularity metric

cmloegcmluin
Site Admin
Posts: 993
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer
Contact:

### Re: developing a notational comma popularity metric

Dave Keenan wrote: Wed Oct 28, 2020 7:45 pm So far, the best such usefulness-ranking function I have found (based on maximising the number of existing commas that it ranks as the most useful in their zone), is:

usefulness_rank = lb(N2D3P9) + 1/12 × AAS^1.37 + 2^(ATE-10.5)
Just so we've got this straight, that's the lpe metric (short for lb + pow + exp) with s = 1/12, b = 1.37, t = 2^(-10.5) = 0.00069053396, right?
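In code, I'd sketch that function like this (the function and parameter names here are mine, not anything from our codebase):

```python
import math

def lpe_usefulness(n2d3p9, aas, ate, s=1/12, b=1.37, u=10.5):
    """lpe usefulness rank: lb of popularity + power of AAS + exp of ATE.

    Lower means more useful. s, b, and u are the parameters under
    discussion; t = 2**-u = 2**-10.5.
    """
    return math.log2(n2d3p9) + s * aas ** b + 2 ** (ate - u)
```

With AAS = 0 and ATE = 0 this reduces to lb(N2D3P9) plus the tiny constant t = 2^(-10.5).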

I confirm that this matches 91 of the commas. I do not necessarily confirm that it's the best possible metric (in the lpe family, or otherwise).
There are certainly 2 commas (for some rarely-used symbols) where we did a bad job. That is, we assigned commas that were far less useful than the most useful comma in the capture zone. These were identified earlier, based on their N2D3P9 values alone.
By the way, here's the latest (Aug. 16th) list of JI notation commas with high (>137) N2D3P9. We had only committed to changing the 2 worst of them, which were dramatically worse than the others.
I believe for
we should replace 19/4375s [-8 10 -4 -1 0 0 0 1⟩ with 1/575s [6 2 -2 0 0 0 0 0 -1⟩
and for
we should replace 14641k [-17 2 0 0 4⟩ with 143/5k [-8 2 -1 0 1 1⟩
You said in a recent email that these two replacement commas are each the most useful in their zones by all 8 metrics we're trying. That's certainly a strong argument. I'm fine with them. I note that neither $$\{\frac{143}{5}\}_{\scriptsize{2,3}}$$ nor $$\{575\}_{\scriptsize{2,3}}$$ is yet notated by a Sagittal comma, so that's nice. The N2D3P9 of the 1/575s, at 183.68, is on the higher side, and it would make that list of commas with high N2D3P9 that I just shared. But maybe we should go back over that list and see if all those commas are the most useful in their zones, therefore justifying their subpar popularity.

We're not jumping the gun by reassigning these before we complete the final step (popularity, usefulness, badness) and create the full badness metric, are we? Or was badness only relevant for assigning tinas — in other words we don't care about mina error?
Our recent replacement of 47M with 85/11M for was validated.
I think you're on the right track, but slightly confused. The primary comma for has been 85/11M since before I showed up this year. You can find it in row 136 of George's spreadsheet calculator shared here (note: this version I'm linking here is out-of-date; I'm sharing it as proof of how long this 85/11M has been this way. The up-to-date version of the spreadsheet calculator can be downloaded here). The change we made that involved a 47-limit comma was this: we unsplit the 75th mina, eliminating the 1/47S from the notation altogether, because we found that its sum-of-elements was identical to the sum-of-elements for the comma it split the mina with, and therefore said sum-of-elements could not be one side or the other of the bound. My original suggestion was to knock the Extreme notation down three primes from 47- to 37-limit by changing that one comma somehow, but in investigating the possibilities, that's what we discovered and it ultimately became our primary justification for the change. Details here.

You asked me in a recent email what the other recent changes we made were. There's actually only one that we've made, and that was the 140th mina.

Although now that I've refined my understanding of the domain, I don't actually think in terms of any minas greater than 116, which is the mina just before the half-apotome. Every mina greater than that may have a novel flag and accent combo (which Dave and I have been calling "flacco" for short, haha) but its bounds and commas are perfectly symmetrical about the half-apotome mirror, or as Dave put it, they are "dependent" on the choices below the half-apotome. So nowadays what I would say we did was change the 93rd mina (140 - 116.5 = 23.5; 116.5 - 23.5 = 93).

Although I suppose, since we also patched a bug with the system where in the High precision level there was some unnotated no man's land between the largest single-shaft symbol and where things picked up coming back from the next apotome anchor, we were confronting issues which cannot be reduced to the half-apotome-equivalent world but which are in fact necessary groundwork for it.

And specifically what we did was change the symbol from to  and shifted its lower bound to the S|M size category bound, to be symmetrical with the L|SS boundary, AKA the largest single-shaft symbol size. We did not change its comma. Actually, we found that the new symbol suited that comma even better, making its sum-of-elements closer to the primary comma than it was before (looking at it more carefully now, I see that it's not just closer, it's now exact; + = [ 5 -4 -1 0 0 1 ⟩ + [ -5 3 -1 1 1 -1 ⟩ = [ 0 -1 -2 1 1 ⟩ which is the 77/25C exactly).
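That last monzo sum is easy to check mechanically. A little sketch (the helper names and the prime list are my own, just for illustration):

```python
from fractions import Fraction

PRIMES = [2, 3, 5, 7, 11, 13]

def add_monzos(a, b):
    # pad to a common length, then add term by term
    n = max(len(a), len(b))
    return [x + y for x, y in zip(a + [0] * (n - len(a)),
                                  b + [0] * (n - len(b)))]

def monzo_to_ratio(monzo):
    # interpret each entry as the exponent of the corresponding prime
    r = Fraction(1)
    for p, e in zip(PRIMES, monzo):
        r *= Fraction(p) ** e
    return r

total = add_monzos([5, -4, -1, 0, 0, 1], [-5, 3, -1, 1, 1, -1])
# total is [0, -1, -2, 1, 1, 0], i.e. the ratio 77/75,
# whose 2,3-free class is 77/25
```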
But it might need to be reconsidered in the light of the information below, because we already had a symbol for an 11:85 comma. for 85/11C.
Probably this call for reconsideration is moot because the specific item under reconsideration was not recently changed. But in case it is valuable, I point to a post I made some time ago which shows that ~41% of Sagittal commas are related by the Pythagorean comma (and therefore are in the same 2,3-equivalence class). And that's just the Pythagorean comma, one of several 3-limit commas that could separate one Sagittal comma from another. Based on this chart (there are some tweaks I'd like to make to it, so don't take it as word of the gods or anything, but for this purpose it should suffice. Well, one big tweak is that it contains symbols whose primary commas are greater than the half-apotome and are thus essentially duplicates, per the previous paragraph (oh wait, this chart is better)) one can eyeball that there are a number of 2,3-free classes which have 3 Sagittal commas representing them. So clearly it is the case that pushing for unique 2,3-free class representation was not the highest priority when designing the JI notation. I'm not at all suggesting that was a mis-prioritization; I'm just making the observation.
I see at least two reasons here, not to use the most "useful" (according to the above function, or others like it).
1. The most "useful" is outside the symbol's capture zone at a higher precision level.
2. The most "useful" is a comma for a 2,3-equivalence class that already has a symbol (based on a more useful comma for the same equivalence class).
I'm sure there are more reasons. I tried to write something intelligent-sounding about sweet spots in roughly-equally-spaced new useful commas and splitting inas, but I couldn't convince myself it was correct. *shrug*

Reason #1 concerns me a tad, because it runs counter to the way the story is generally told (or at least how I've come to understand it): that each higher precision level was created in sequence, and not allowed to modify the previously established levels (other than to nudge their bounds from the outdated EDA to its latest greatest EDA). An instance of applying the design principle that adding new features shouldn't make it harder to do the simpler stuff. That said, it is hard to explain why 's primary comma is the 19s, other than that you were contemplating introducing the concept of accents, and reserving the 5s as an excellent choice for the first of them.
So optimising the above usefulness function(s) based on sum-of-squared-errors in usefulness, instead of a simple count of commas matched, will not be useful. Assignments like those above will skew the result in ways that are not meaningful.
Whoa. A+ insight right there. You're absolutely right.

I see that reverting to a simple count solves the problem because we can then simply "shave off" the 32 or so of the commas which were chosen for reasons other than purely their usefulness.

But I lament that by reverting to a simple count we'd lose some valuable differentiating detail on the results. Couldn't we continue using sum of squares, which is the best measuring technique we've thus far developed, but exclude these 32 commas from the data set we run against?
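To make concrete what I mean by the two modes, here's a toy sketch. The data shapes are just illustrative: each zone pairs the actually-assigned comma with its candidate list, and the usefulness function ranks lower as better.

```python
def boolean_score(zones, usefulness):
    # boolean-mode: count zones where the actual comma is the
    # top-ranked (minimum usefulness rank) candidate
    return sum(1 for actual, candidates in zones
               if actual == min(candidates, key=usefulness))

def sos_score(zones, usefulness):
    # sos-mode: sum of squared differences between the actual comma's
    # usefulness and the best candidate's usefulness in each zone
    return sum((usefulness(actual) - min(usefulness(c) for c in candidates)) ** 2
               for actual, candidates in zones)
```

For example, with `usefulness = lambda x: x` and `zones = [(1, [1, 2]), (3, [2, 3])]`, the boolean score is 1 (only the first zone matches) while the SoS is 1 (0² + 1²), so sos-mode still differentiates among near-misses.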
I want to remind us of what we're doing here. The most pressing need at present is a metric for choosing commas (and/or metacommas) for tina accents. I don't think either of the above numbered reasons is likely to occur in the case of candidate commas for tina accents. So I say the above usefulness metric is Good Enough™ and we should just run with it.
Short answer: I'm okay with that. Though...
1. I'll admit I was hoping the best metric would at least have a consistent expando function for AAS and ATE, if not the same breed of function for the N2D3P9 (to be clear, I hoped for either *ee or *pp, but hopefully even lee or rpp).
2. And it does seem like, for how much we invested in developing N2D3P9, we could proceed a bit more methodically at this step.
3. And I'd like to give a day for others who have been active on this topic to weigh in.
4. And I would like to at least hear your thoughts on my suggestion to eschew particular commas from the data set so we can stay in sos-mode rather than revert to boolean-mode.
5. Or perhaps you'd like me to use my sum-of-squares ability to perfect parameter values in that vicinity. It'd be nice if we could find some psychologically motivated round-ish numbers for everything, like we did for N2D3P9.
By the way, I have to say, while at first I was not the biggest fan of the mouthful which is the name "N2D3P9", and I do still find myself vacillating on how I stress its syllables, I absolutely adore that I don't have to wait to spin up a command line script in the code to calculate it, but can just plug it straight into a calculator or sometimes even calculate it in my head, because the name is a huge boon to effortlessly remembering exactly what you need to do to compute it. Pats on the back, all around!

Dave Keenan
Site Admin
Posts: 1300
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

### Re: developing a notational comma popularity metric

cmloegcmluin wrote: Thu Oct 29, 2020 4:43 am Just so we've got this straight, that's the lpe metric (short for lb + pow + exp) with s = 1/12, b = 1.37, t = 2^(-10.5) = 0.00069053396, right?
Right.
I confirm that this matches 91 of the commas. I do not necessarily confirm that it's the best possible metric (in the lpe family, or otherwise).
And if we replace those two commas that have N2D3P9 > 1000, then it matches 93 of the 123.
The N2D3P9 of the 1/575s, at 183.68, is on the higher side and would make that list of commas with high N2D3P9 that I just shared. But maybe we should go back over that list and see if all those commas are the most useful in their zones, therefore justifying their subpar popularity.
At some stage, yes. I see this as being part of a review of existing comma assignments, that would take us too far away from the immediate problem at this time.
We're not jumping the gun by reassigning these before we complete the final step (popularity, usefulness, badness) and create the full badness metric, are we? Or was badness only relevant for assigning tinas — in other words we don't care about mina error?
That's a good point. Distance from the centre of the capture zone (or nearness to its edges) ought to have some bearing on the matter. There is no need to reassign these now.
The change we made that involved a 47-limit comma was this: we unsplit the 75th mina, eliminating the 1/47S from the notation altogether, because we found that its sum-of-elements was identical to the sum-of-elements for the comma it split the mina with, and therefore said sum-of-elements could not be one side or the other of the bound.
Thanks for setting me straight on that.
So nowadays what I would say we did was change the 93rd mina (140 - 116.5 = 23.5; 116.5 - 23.5 = 93). ... And specifically what we did was change the symbol from to  and shifted its lower bound to the S|M size category bound, to be symmetrical with the L|SS boundary, AKA the largest single-shaft symbol size. We did not change its comma.
Thanks for reminding me what this other recent change was about. So neither of these recent changes relate to popularity, usefulness or badness, and so can be ignored here.
Probably this call for reconsideration [of 85/11M] is moot because the specific item under reconsideration was not recently changed.
Indeed it is moot. Thanks for setting me straight on that too.
Dave Keenan wrote: I see at least two reasons here, not to use the most "useful" (according to the above function, or others like it).
1. The most "useful" is outside the symbol's capture zone at a higher precision level.
2. The most "useful" is a comma for a 2,3-equivalence class that already has a symbol (based on a more useful comma for the same equivalence class).
I'm sure there are more reasons. I tried to write something intelligent-sounding about sweet spots in roughly-equally-spaced new useful commas and splitting inas, but I couldn't convince myself it was correct. *shrug*

Reason #1 concerns me a tad, because it runs counter to the way the story is generally told (or at least how I've come to understand it): that each higher precision level was created in sequence, and not allowed to modify the previously established levels (other than to nudge their bounds from the outdated EDA to its latest greatest EDA). An instance of applying the design principle that adding new features shouldn't make it harder to do the simpler stuff.
It isn't that it's not allowed to modify a lower level when designing a higher one. It's that such modification must not make the lower level more complicated in the sense of making it more difficult to use. i.e. it must remain easy to do the most commonly needed things.
That said, it is hard to explain why 's primary comma is the 19s, other than that you were contemplating introducing the concept of accents, and reserving the 5s as an excellent choice for the first of them.
I don't remember the sequence of events (which could probably be excavated from the email archive). But you need to remember that we didn't first invent a bunch of flags and accents and then find commas to assign to them. We first listed the ratios we wanted to notate, in order of popularity, and tried to design symbols for them. Only later did the process work in both directions. But it's conceivable that, in an alternate sagittal timeline, could have initially been a symbol for 5s, and it was later realised that using acute or grave accents for 5s was simpler, thereby freeing up to be redefined as 19s.
So optimising the above usefulness function(s) based on sum-of-squared-errors in usefulness, instead of a simple count of commas matched, will not be useful. Assignments like those above will skew the result in ways that are not meaningful.
Whoa. A+ insight right there. You're absolutely right.

I see that reverting to a simple count solves the problem because we can then simply "shave off" the 32 or so of the commas which were chosen for reasons other than purely their usefulness.

But I lament that by reverting to a simple count we'd lose some valuable differentiating detail on the results. Couldn't we continue using sum of squares, which is the best measuring technique we've thus far developed, but exclude these 32 commas from the data set we run against?
By all means try it. I believe it will give exactly the same result, because the function I give above will have exactly zero error on the remaining 91, and you can't have a lower SoS than zero. Eliminating those 32 is circular and self-fulfilling.

Why choose those 32 to eliminate as outliers? Some of them may well have been chosen on the basis of popularity, slope and 3-exponent alone, just using a different (possibly subjective, possibly George's own) usefulness function based on those 3 properties, but with the popularity metric being SOPFR instead of N2D3P9.

Perhaps only eliminate those whose usefulness error (according to the above lpe function) is very large. Say greater than 1.
I want to remind us of what we're doing here. The most pressing need at present is a metric for choosing commas (and/or metacommas) for tina accents. I don't think either of the above numbered reasons is likely to occur in the case of candidate commas for tina accents. So I say the above usefulness metric is Good Enough™ and we should just run with it.
Short answer: I'm okay with that. Though...

• I'll admit I was hoping the best metric would at least have a consistent expando function for AAS and ATE, if not the same breed of function for the N2D3P9 (to be clear, I hoped for either *ee or *pp, but hopefully even lee or rpp).
I see no reason to expect that. The lpe can be read as: What really matters is apotome-slope (to ensure the chosen comma will also be useful for notating temperaments), but we need a guard that prevents the more extreme values of 3-exponent (which would require too many sharps or flats on average).
• And it does seem like, for how much we invested in developing N2D3P9, we could proceed a bit more methodically at this step.
Sure. OK.
• And I'd like to give a day for others who have been active on this topic to weigh in.
Fair enough.
• And I would like to at least hear your thoughts on my suggestion to eschew particular commas from the data set so we can stay in sos-mode rather than revert to boolean-mode.
Done, above.
• Or perhaps you'd like me to use my sum-of-squares ability to perfect parameter values in that vicinity. It'd be nice if we could find some psychologically motivated round-ish numbers for everything, like we did for N2D3P9.
My spreadsheet is doing SoS-usefulness-error now too. That's when I realised the above problem. But I'd certainly like you to check that my result is the best over all 8 combinations. I already gave it rounded numbers. The reappearance of 1.37 (the Zipf law exponent from the Scala archive stats) in this context is freaky.

cmloegcmluin

### Re: developing a notational comma popularity metric

Dave Keenan wrote: Thu Oct 29, 2020 8:35 am Right[, that's the lpe metric].
And if we replace those two commas that have N2D3P9 > 1000, then it matches 93 of the 123.

maybe we should go back over that list and see if all those commas are the most useful in their zones, therefore justifying their subpar popularity.
At some stage, yes. I see this as being part of a review of existing comma assignments, that would take us too far away from the immediate problem at this time.
Agreed.
We're not jumping the gun by reassigning these before we complete the final step (popularity, usefulness, badness) and create the full badness metric, are we?
That's a good point. Distance from the centre of the capture zone (or nearness to its edges) ought to have some bearing on the matter. There is no need to reassign these now.
That's a better way of articulating the goal than -ina error. Thanks. Alright, I took that note along with the notes I already have about upcoming considerations for JI notation tweaks.
So neither of these recent changes relate to popularity, usefulness or badness, and so can be ignored here.
Right. Thanks for keeping us on topic and distilling that information overload down for me.
It isn't that it's not allowed to modify a lower level when designing a higher one. It's that such modification must not make the lower level more complicated in the sense of making it more difficult to use. i.e. it must remain easy to do the most commonly needed things.
...in an alternate sagittal timeline, could have initially been a symbol for 5s, and it was later realised that using acute or grave accents for 5s was simpler, thereby freeing up to be redefined as 19s.
You've cleared up my misconceptions about the rationale here. Thanks.
Couldn't we continue using sum of squares, which is the best measuring technique we've thus far developed, but exclude these 32 commas from the data set we run against?
By all means try it. I believe it will give exactly the same result, because the function I give above will have exactly zero error on the remaining 91, and you can't have a lower SoS than zero. Eliminating those 32 is circular and self-fulfilling.
Oh yeah, well, good point, if it is indeed the case that all 32 of 'em are to be discarded. I wasn't making that assumption.

Do you have a handcrafted list of the ones you wouldn't want to include, or would we define it somehow, say, by whichever ones are not a match when s and t are both 0 (i.e. by pure N2D3P9)? Or... (reading ahead now)
Why choose those 32 to eliminate as outliers? Some of them may well have been chosen on the basis of popularity, slope and 3-exponent alone, just using a different (possibly subjective, possibly George's own) usefulness function based on those 3 properties, but with the popularity metric being SOPFR instead of N2D3P9.

Perhaps only eliminate those whose usefulness error (according to the above lpe function) is very large. Say greater than 1.
...I could try that.
I see no reason to expect that [the best metric would at least have a consistent expando function for AAS and ATE, if not the same breed of function for the N2D3P9]. The lpe can be read as: What really matters is apotome-slope (to ensure the chosen comma will also be useful for notating temperaments), but we need a guard that prevents the more extreme values of 3-exponent (which would require too many sharps or flats on average).
Fair enough. That makes sense to me. Thanks for unpacking.
My spreadsheet is doing SoS-usefulness-error now too. That's when I realised the above problem.
Ah, okay, great.
But I'd certainly like you to check that my result is the best over all 8 combinations.
When you say "over all 8 combinations" I think you mean the 8 different metrics (lee through rpp), in which case I think you meant to check your suggestions for replacing the 19/4375s and 14641k with the 1/575s and 143/5k, respectively. Did I get that right?

I would think you'd also want me to confirm that your particular lpe above (let's call it lpez, the z for Zipf?) is the best metric in toto.

I had noticed that 1.37 but forgot to mention it. Freaky indeed. It's stalking us!

Dave Keenan

### Re: developing a notational comma popularity metric

cmloegcmluin wrote: Thu Oct 29, 2020 11:12 am Do you have a handcrafted list of the ones you wouldn't want to include ...
No.
Dave Keenan wrote: Why choose those 32 to eliminate as outliers? Some of them may well have been chosen on the basis of popularity, slope and 3-exponent alone, just using a different (possibly subjective, possibly George's own) usefulness function based on those 3 properties, but with the popularity metric being SOPFR instead of N2D3P9.

Perhaps only eliminate those whose usefulness error (according to the above lpe function) is very large. Say greater than 1.
...I could try that.
This could just be part of the squared error calculation for each existing comma (return zero if the error is greater than some threshold).

But I think we still need to stop using capture zones at the level the symbol is introduced, and switch to using capture zones at the extreme precision level only.
But I'd certainly like you to check that my result is the best over all 8 combinations.
When you say "over all 8 combinations" I think you mean the 8 different metrics (lee through rpp),
Yes.
in which case I think you meant to check your suggestions for replacing the 19/4375s and 14641k with the 1/575s and 143/5k, respectively. Did I get that right?
No. Forget about them (for now). Sorry I wasn't clearer.
I would think you'd also want me to confirm that your particular lpe above (let's call it lpez, the z for Zipf?) is the best metric in toto.
Yes. That's it. Thanks.
I had noticed that 1.37 but forgot to mention it. Freaky indeed. It's stalking us!
Actually. I think I told a furphy. It seems you can get 91 matches with a wide range of values of b, with different values of s and t. That's a good reason to want a continuous error measure like a SoS to constrain them more.

Or perhaps they can be more tightly constrained just by more exact matches, due to eliminating all candidate commas that fall outside a symbol's Extreme Precision Level capture zone.

How can there be any benefit in a metric that minimises the sum of squared errors in usefulness, but results in matching only say 50 of the existing 123 assignments instead of 91 or more?

cmloegcmluin

### Re: developing a notational comma popularity metric

Dave Keenan wrote: Thu Oct 29, 2020 2:30 pm
Perhaps only eliminate those whose usefulness error (according to the above lpe function) is very large. Say greater than 1.
...I could try that.
This could just be part of the squared error calculation for each existing comma (return zero if the error is greater than some threshold).
Good thinking.
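Something like this, say (the names and the default threshold are placeholders):

```python
def thresholded_sq_error(actual_usefulness, best_usefulness, threshold=1.0):
    # squared usefulness error for one existing comma assignment, but
    # zeroed out when the error is so large that the assignment was
    # presumably made for reasons other than usefulness
    error = actual_usefulness - best_usefulness
    if abs(error) > threshold:
        return 0.0
    return error ** 2
```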
But I think we still need to stop using capture zones at the level the symbol is introduced, and switch to using capture zones at the extreme precision level only.
You say "still". Did you suggest this recently and I accidentally ignored it?

It was my suggestion to try weighting the error by the size of the secondary comma zone (as opposed to not weighting at all), but I don't recall us ever considering comparing each comma only against the commas in its Extreme precision level capture zone.

It doesn't seem unreasonable to me, but I'm also unable to articulate why running against that data set would lead us to find a better usefulness metric. Could you explain your thinking on that?
Or perhaps they can be more tightly constrained just by more exact matches, due to eliminating all candidate commas that fall outside a symbol's Extreme Precision Level capture zone.
It seems like you're suggesting it in order to reduce how big the errors can get per comma, because there will be far fewer competitors, each of which might potentially ding the score badly? That makes sense mechanically, but again I can't tie it to a psychological motive in terms of what the usefulness metric means.

If we do go with this approach, I can really easily grab the new data set for you, in the same format as the previous one. Good news is: it'll run even faster, since it will never be searching any medina-sized swaths; only mina-sized ones. Essentially it'll run 4x as fast since it'll only cover the half-apotome once instead of basically once for each precision level (only ~2/3 of the Extreme level, I guess, because about a 1/3rd of the symbols there already exist in lower levels). Not that it took forbiddingly long to run in the first place (less than an hour).
I had noticed that 1.37 but forgot to mention it. Freaky indeed. It's stalking us!
Actually. I think I told a furphy.
Always nice to learn some new Australian colloquialisms.
It seems you can get 91 matches with a wide range of values of b, with different values of s and t. That's a good reason to want a continuous error measure like a SoS to constrain them more.

How can there be any benefit in a metric that minimises the sum of squared errors in usefulness, but results in matching only say 50 of the existing 123 assignments instead of 91 or more?
Wait, are you seeing that happen? Or just pointing out that it could happen? And if so, are you arguing against using sos-mode, then, and going back to boolean-mode?
I would think you'd also want me to confirm that your particular lpe above (let's call it lpez, the z for Zipf?) is the best metric in toto.
Yes. That's it. Thanks.
Okay. I understand that much then. But I'm now unsure whether you want to use sos-mode. And also, do you want me to run it against secondary comma zones still, or go for Extreme capture zones?

No rush to clarify. It'll certainly take me at least a day to get my code capable of adapting the sort of recursive searching that Excel's evolutionary solver and my popularity metric LFC do to this problem.

Dave Keenan

### Re: developing a notational comma popularity metric

cmloegcmluin wrote: Fri Oct 30, 2020 2:48 am
Dave Keenan wrote: Thu Oct 29, 2020 2:30 pm But I think we still need to stop using capture zones at the level the symbol is introduced, and switch to using capture zones at the extreme precision level only.
You say "still". Did you suggest this recently and I accidentally ignored it?
Not recently. But here:
viewtopic.php?p=2430&hilit=extreme#p2430 and
viewtopic.php?p=2555&hilit=extreme#p2555 (scroll to end of post)

And I thought it was the obvious solution to the problem I raised here (bolding added):
Dave Keenan wrote: We rightly assigned:
to 19s [-9 3 0 0 0 0 0 1⟩ instead of the more "useful" 5s [-15 8 1⟩.
to 1/17k [-7 7 0 0 0 0 -1⟩ instead of the more "useful" 25/7k [-5 2 2 -1⟩
to 1/19C [-10 9 0 0 0 0 0 -1⟩ instead of the more "useful" 1/25C [11 -4 -2 0 0 0 0 0⟩
to 11/19M [4 -2 0 0 1 0 0 -1⟩ instead of the more "useful" 1/7M [-13 10 0 -1⟩

and there are several others like these.

I see at least two reasons here, not to use the most "useful" (according to the above function, or others like it).
1. The most "useful" is outside the symbol's capture zone at a higher precision level.
...
So optimising the above usefulness function(s) based on sum-of-squared-errors in usefulness, instead of a simple count of commas matched, will not be useful. Assignments like those above will skew the result in ways that are not meaningful.

cmloegcmluin wrote:
Or perhaps they can be more tightly constrained just by more exact matches, due to eliminating all candidate commas that fall outside a symbol's Extreme Precision Level capture zone.
It seems like you're suggesting it in order to reduce how big the errors can get per comma, because there will be far fewer competitors, each of which might potentially ding the score badly? That makes sense mechanically, but again I can't tie it to a psychological motive in terms of what the usefulness metric means.
No. I'm saying that we should go back to counting exact matches. But that can only work if we restrict the comma candidates for each symbol to their Extreme level zones, otherwise you get the problem I described in my quote above.
If we do go with this approach, I can really easily grab the new data set for you, in the same format as the previous one.
Please do.
It seems you can get 91 matches with a wide range of values of b, with different values of s and t. That's a good reason to want a continuous error measure like a SoS to constrain them more.

How can there be any benefit in a metric that minimises the sum of squared errors in usefulness, but results in matching only say 50 of the existing 123 assignments instead of 91 or more?
Wait, are you seeing that happen? Or just pointing out that it could happen? And if so, are you arguing against using sos-mode, then, and going back to boolean-mode?
Yes. I'm seeing that happen. Yes. I'm arguing for going back to boolean mode.
Okay. I understand that much then. But I'm now unsure whether you want to use sos-mode. And also, do you want me to run it against secondary comma zones still, or go for Extreme capture zones.
Boolean mode, extreme capture zones.

I'm glad you liked "furphy".

cmloegcmluin

### Re: developing a notational comma popularity metric

Don't be mad... I spent my Friday knocking out some loose ends. I just get distracted when stuff gets frayed and I can't focus on the main thread. But great news!
Dave Keenan wrote: Mon Oct 26, 2020 10:26 am Awesome. That's such a manageable number that I suggest you forget about my complicated denominator-generating procedure that this result was supposed to be fodder for, and just try every numerator as a potential denominator to generate ratios, calculate their N2D3P9, throw away those greater than 5298.19065, then sort them on N2D3P9. There are only 1014 × 1013 / 2 = 513 591 ratios to try.

For this purpose, it would be more useful to have the copfr of each numerator rather than its n2 or n2p. This is readily obtained as copfr = round(lb(numerator/n2)). And it would be more useful to have the numerators sorted by numerator rather than by n2 or n2p. I suggest preprocessing the existing file to generate a file with numerator, gpf and copfr, in numerator order, before feeding it to a ratio generator/tester/sorter.
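In code, that recovery is a one-liner; a quick sketch (the function name is mine, not from the repo), relying on n2 being the product of the numerator's prime factors each divided by 2, so that numerator/n2 = 2^copfr up to floating-point noise:

```python
import math

def copfr_from_n2(numerator, n2):
    """Recover the count of prime factors with repetition (COPFR) from a
    numerator and its n2 value. Since n2 is the product of the numerator's
    prime factors each over 2, numerator/n2 = 2^copfr, so
    copfr = lb(numerator/n2); rounding guards against float error."""
    return round(math.log2(numerator / n2))
```

E.g. for 35 = 5 × 7, n2 = (5/2)(7/2) = 8.75, and round(lb(35/8.75)) = round(lb(4)) = 2.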
Dave Keenan wrote: Mon Oct 26, 2020 11:39 am
cmloegcmluin wrote: Mon Oct 26, 2020 11:24 am Thanks for this. I will plug in the benefits of these sorted lists soon, following your suggestions.
You say "lists" plural, but you should only use the list that's sorted by numerator (and has copfr). That way it's just a pair of nested loops: the outer one stepping along the list and using each element as a numerator, the inner one stepping along the same list and using each element as a denominator, but only up until it reaches the same index as the outer loop, since d < n.
It took me a bit to figure out what to do exactly, but I got this working this afternoon.
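Roughly, the shape of what I implemented is the following sketch. It's simplified, and the names and tuple layout here are illustrative rather than the actual repo code; it assumes tuples of (numerator, gpf, copfr) sorted by numerator, with the (1, 1, 0) row present so the n/1 classes get generated, and it uses the N2D3P9 definition this thread is built on (numerator primes each over 2, denominator primes each over 3, greatest prime over 9).

```python
MAX_N2D3P9 = 5298.19065

def n2d3p9(n, d, gpf_n, copfr_n, gpf_d, copfr_d):
    """N2D3P9 of the 2,3-free class n/d (n > d, both 2,3-free)."""
    n2 = n / 2 ** copfr_n          # numerator primes each divided by 2
    d3 = d / 3 ** copfr_d          # denominator primes each divided by 3
    p9 = max(gpf_n, gpf_d) / 9     # greatest prime of the class, over 9
    return n2 * d3 * p9

def popular_classes(entries, max_n2d3p9=MAX_N2D3P9):
    """entries: list of (numerator, gpf, copfr) sorted by numerator,
    beginning with (1, 1, 0). Returns (n2d3p9, n, d) triples sorted by
    N2D3P9, keeping only those within the maximum."""
    results = []
    for i, (n, gpf_n, copfr_n) in enumerate(entries):   # outer: numerators
        for (d, gpf_d, copfr_d) in entries[:i]:         # inner: d < n only
            value = n2d3p9(n, d, gpf_n, copfr_n, gpf_d, copfr_d)
            if value <= max_n2d3p9:
                results.append((value, n, d))
    results.sort()
    return results
```

As a sanity check, n2d3p9(575, 1, 23, 3, 1, 0) gives the 183.68 quoted earlier for $$\{575\}_{\scriptsize{2,3}}$$.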

The part that took me the longest was just figuring out that this list of yours gives me the ability not merely to dramatically speed up the computation of prime exponent extrema for numerators and denominators of 2,3-free classes, but to bypass prime exponent extrema altogether.

I think I was a bit fixated on the prime exponent extrema stratagem because it gets used in two different places in the code, and in one of those places the surrounding code is unfortunately coupled to that implementation. The first of these two places (where the prime exponent extrema were used to filter JI pitches by max N2D3P9) is the obvious one: the script which prepares the table of 2,3-free classes by N2D3P9 up to a given max N2D3P9. The second, less obvious place is the find-commas script.

It's the latter of these two places where the code is tightly coupled, and it's the latter which directed my efforts toward this front today. It had just really bothered me how long the find-commas script was taking to run! This is the script which gets run 123 times in order to gather the commas per zone as the data set for fitting a usefulness metric and its parameters. So, because of the coupling, I won't immediately be able to reap the benefits of this improvement there without some refactoring. Which is acceptable... we have already gathered the data set we need for now.

That disappointment aside, I'm happy to report that I've installed your list in the first of the two places, and so we can now compute the most popular 2,3-free classes by N2D3P9 up to 136 in less than 0.2 seconds, where before it took over 20 seconds (100× faster). It can calculate up to N2D3P9 < 307 in 0.3 seconds, where before it took overnight. And, of course, we can now compute them all the way up to 5298.2, in less than 7 seconds, where before we couldn't compute them at all! It's so much data, though, that a table formatted for the forum would need to be split across three pages' worth of posts, so I'll just attach it here as a spreadsheet. There were 4981 of them, which is in the ballpark of 5298.

By the way, I had to insert one line into the array at the beginning, for 1 (gpf 1, copfr 0); otherwise we wouldn't consider $$\{1\}_{\scriptsize{2,3}}$$, $$\{5\}_{\scriptsize{2,3}}$$, $$\{7\}_{\scriptsize{2,3}}$$, etc.
Attachments
popular23freeClassesUpToN2D3P9of5298.xlsx
(233.98 KiB) Downloaded 24 times

Dave Keenan
Site Admin
Posts: 1300
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

### Re: developing a notational comma popularity metric

Excellent work! Thank you. With that list, we've definitely got possible tina commas covered.

But shouldn't the column with the symbols in it be titled "notating symbol classes", or at least have the word "symbol" in its title somewhere?

And thanks for the new set of candidate commas (by email) for the extreme capture zones.

cmloegcmluin
Site Admin
Posts: 993
Joined: Tue Feb 11, 2020 3:10 pm
Location: San Francisco, California, USA
Real Name: Douglas Blumeyer
Contact:

### Re: developing a notational comma popularity metric

Dave Keenan wrote: Sat Oct 31, 2020 12:06 pm Excellent work! Thank you. With that list, we've definitely got possible tina commas covered.
\m/ \m/
But shouldn't the column with the symbols in it be titled "notating symbol classes", or at least have the word "symbol" in its title somewhere?
Yeah... nice catch. I think some things got a little scrambled when flaccos came to be. It should definitely be "notating symbol class" now.
And thanks for the new set of candidate commas (by email) for the extreme capture zones.

I have got the extreme capture zones plugged in on my end, and have gone back to boolean mode. Still not recursive/evolutionary/whatevs yet, though.

Dave Keenan
Site Admin
Posts: 1300
Joined: Tue Sep 01, 2015 2:59 pm
Location: Brisbane, Queensland, Australia
Contact:

### Re: developing a notational comma popularity metric

I have the extreme capture zones now, and the Excel evolutionary solver still finds the highest number of matches (101) with a version of the LPE metric. But the other 7 metrics are not much worse: 100 for RPP and LPP, 99 for LEE and LEP, 98 for RPE, 97 for REE.

Some sets of LPE parameters that give 101 matches are
b = 1.217882307
s = 0.061869715 = 1/16.163
t = 0.000932289 = 2^-10.067

b = 1.2
s = 1/8
t = 2^-11

b = 1.5
s = 1/32
t = 2^-10

b = 2
s = 1/85
t = 2^-10

The b = 1.5 case above can also be written as

lb(N2D3P9) + (AAS/10)^(3/2) + 2^(ATE-10)
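That rewriting works because 10^(3/2) = 31.62 ≈ 32, so (AAS/10)^(3/2) = AAS^1.5 / 10^1.5 differs from (1/32) × AAS^1.5 only by the constant factor 10^1.5 / 32 ≈ 0.988, i.e. about 1.2% regardless of AAS. A throwaway check:

```python
# The two forms differ only by a constant factor: 10^1.5 / 32 ~= 0.9882,
# i.e. (1/32) * AAS^1.5 undershoots (AAS/10)^1.5 by ~1.2% for any AAS.
ratio = 10 ** 1.5 / 32
for aas in (0.5, 1, 2, 5, 11.25):
    exact = (aas / 10) ** 1.5
    approx = aas ** 1.5 / 32
    assert abs(approx / exact - ratio) < 1e-12   # constant ratio
    assert abs(approx - exact) / exact < 0.012   # within ~1.2%
```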

I think that choosing between these and the simpler LP metric (which ignores ATE), with b = 1.41, s = 0.096, will come down to looking at exactly which commas each includes that the others reject, and vice versa.