
Re: developing a notational comma popularity metric

Posted: Sun Aug 02, 2020 8:24 am
by Dave Keenan
I just realised we're arguing over a metric for comparing metrics. And these metrics are for comparing ratios for notational commas that are so small that almost no-one will ever use them for notation.

We really have to wind this up soon.

Re: developing a notational comma popularity metric

Posted: Sun Aug 02, 2020 10:17 am
by cmloegcmluin
FYI, I fat-fingered something somehow and logged myself out the first time I wrote this, and not even ChromeCacheView saved me. So I apologize if I'm a bit short. I'm mad at myself for not being more careful and methodical when posting here (I guess I got swept into a frenzy of writing for an hour or whatever and lost track of not having saved anything to my paste buffer or soft-saved with "preview").
Dave Keenan wrote: Sun Aug 02, 2020 7:32 am 1. You are wrong when you claim that under the hood lb(p) is log_2(p), i.e. log(p,2). Maybe you implemented it that way, but under most "hoods" lb(p) is a primitive.
I didn't know an operation called "lb" existed until you shared it in this post, where you said:
Dave Keenan wrote: Tue Jul 07, 2020 3:23 pm Don't even call it log2(). Call it lb() (for log-binary, by analogy with ln() for log-natural). This is ISO standard notation.
I don't think we have redefined lb since then. And I am having trouble understanding this to mean anything other than "under the hood lb(p) is log_2(p), i.e. log(p,2)". Can you explain the distinction? What do you mean by "primitive" in this case?
Under the hood, log_a(p), i.e. log(p,a), is usually implemented as lb(p)/lb(a).
I am familiar with this logarithmic identity. In fact, that is how it is implemented in my own code.
That's almost certainly why your code ran so much faster when you changed from log_a(p) to k×lb(p).
I have so little understanding of what you're trying to get across that I can't figure out what you're saying.

I would not describe myself as having "changed from log_a(p) to k×lb(p)". I would describe myself as having "locked a down to 2 when used as a base and only when used by the solver".

I can at least blindly respond with confidence in my own terms that the reason my code ran so much faster was that it was no longer checking hundreds or thousands of different possibilities whenever it was asked to use a as a base, but only 1.
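To spell out the identity we're both referring to (nothing new here, just the change-of-base rule):

$$\log_a(p) = \frac{\operatorname{lb}(p)}{\operatorname{lb}(a)} = k \times \operatorname{lb}(p), \quad \text{where } k = \frac{1}{\operatorname{lb}(a)}$$

So with a locked down to 2, lb(a) = 1 and there is nothing left for the solver to search over in the base.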
2. You are right when you point out that we are being inconsistent in counting log as a chunk, but not counting the unwritten exp in the case of r^y.
At least I got one thing right!
I now believe we should count r^y as two chunks, just as I believe we should count log_a(p) as two chunks,
Whoops! Well, I guess we're going to have to continue the debate. I still think they should be counted as 1 chunk. To say that + - × ÷ are 1 chunk while log and exp are 2 chunks feels wrongfully arbitrary. I don't see a strong argument for that.

Moreover, it would be quite painful to make this work in my code without a rewrite from the ground up as I alluded to earlier, to make it approach the problem with chunks as its primary elements, not submetrics and parameters.
and k×lb(p) as two chunks
Yes, I do agree that's two chunks, because k is 1 chunk and lb is 1 chunk.
You seem to have missed where I said I want to count every model parameter as a chunk (as well as counting every function more complex than + - × ÷ as a chunk).
Ah okay, I do see that you described + - × ÷ as functions. I shouldn't have stated it as if you hadn't.
But I don't want to count literal constants as chunks, as you seem to think I do.
The short paragraph where I say something like that was intended as a sort of reductio ad raspberries w/r/t treating these things as 2 chunks. To me, any chunk should be able to stand alone. If you say something like log_a is 2 chunks, you seem to be saying that log is a chunk and the 2 is a chunk, which I think you are, but if either of those chunks is forced to stand alone (as I think they should be able to), they can't. "log" is implicitly log_10 or ln usually, but that's not what I mean, any more than an `a` floating in an equation usually implies it to be a coefficient. Implications aside, a logarithm without a base doesn't stand alone, and a number (or "literal constant") without an operation doesn't stand alone.
You can count + - × ÷ as chunks too if you want, but they seem like they should count for less than higher-order functions like logs, exponentials, powers, and roots.
Ok, here it is. I take this as at least the beginning of your argument for + - × ÷ being 1 chunk while exp and log are 2 chunks: because they are "higher order". I don't disagree that they are more complex, in the sense that they're taught later in school, they're harder to understand, and they build on each other or in some cases are even hierarchically higher hyperoperations of each other. But by that reasoning, why wouldn't × ÷ be more chunks than + -? I don't want to start down the road of weighting chunks by arbitrary definitions of mathematical complexity. Soon enough we'll have some things weighing 1.5 chunks. That is not something I want to try to bite off here. Sorry for mixing metaphors.

The conception of chunks I've been working with has been more in terms of how complex it is to state/describe/explain these metrics, in more of a natural language sense, like how many clauses would be required to speak aloud a metric. How hard is it to understand what it is? It isn't any easier or harder to understand the fact that we're using a as a logarithmic base than it is if we used it as a coefficient or an exponent.

If you disagree with the above statement, we might be at a bit of an impasse.

For another example, this is why I suggested earlier that if all submetrics were of the same type, even if there were three or four copies of it totaled up, it would still only count as one chunk. Because we could explain it to people starting with the clause, "It's three different sopfr's totaled." I'm not counting functions or arguments or mathematical challenge levels here. I'm counting complexity in terms of articulating the idea of the metric.
So I count log_2(p) as one chunk, just as I would count log_4/3(p) as one chunk were it not for the fact that your 4/3 is not a literal constant, but merely one possible value of the parameter `a`, which might no longer be 4/3 if we were to train the model on a different set of data, or merely weight the existing data differently.
You've used this word "train" a couple times lately but I have no idea what you're talking about. I'm not training anything on my end. I am training a neural net on a separate project I recently started with a friend who is a professional musician but actually the project doesn't even have anything to do with music. That aside...

I'm so completely confused by what you're trying to say here...

If you want to get philosophical, we do use these parameters such as "k" in sometimes different ways. Sometimes we use it to represent an "unresolved" variable, like one in what my code calls a "scope", where it ranges maybe somewhere from 0 to 2 and we're going to try a bunch of different values for it and see which one is best. And other times we use it to represent a "resolved" variable, like when we spell out these metrics with the fancy new double-dollar sign bbCodes; the k has definitely been found to be something ideal-ish, but we just write the formula with k so that in one place we can see the overall shape and structure of the metric, and in another place on the next line we can see what the actual final values for k and the other parameters are. Right?

If you imagine we soon find and settle on a final metric and share it out with the wider world, it might include the parameter k, but it will have been resolved to some value, like 0.84 or something. In no case can I imagine or understand what it would mean to share out to the world a metric which still has an unresolved variable in it, like an `a` which could be 4/3 or it could be anything else. How would anyone use that? And why would that be any different than a metric in which `a` could be 2 or it could be anything else?

Sorry if that seems a bit desperate but I really have no idea what your point is.
In determining the complexity of a metric, we must count each parameter as a chunk, independent of what function or operator is applied to it, otherwise my extreme metric that has a separate parameter for each prime, but uses no more operations than wyk or wyb, would be hands-down winner.
I agree and hope I haven't somehow given the impression that I think otherwise.
That's why I want to count log_2(p) as one chunk, but log_a(p) as two chunks, even if `a` happens to come out as 2. And thanks to your recent observation, I also want to count r^2 as one chunk, but count r^y as two chunks, even if `y` happens to come out as 2.
Okay, let me just get this unambiguously clear. Changing from:

$$\text{f}(n,d) = \sum_{p=5}^{p_{max}} \big((\log_a{p})\,n_p d_p\big)$$
$$a = 2 \text{ gives } SoS=..., SoS(1)=...$$

to:

$$\text{f}(n,d) = \sum_{p=5}^{p_{max}} \big((\log_2{p})\,n_p d_p\big)$$
$$\text{ gives } SoS=..., SoS(1)=...$$

is a reduction of 1 chunk? Just because the 2 is inlined? I can't agree with that. It's the same thing. In the first way of writing it, the use of a in the formula with the value of a on the second line is just an act of convenience, to aid our ability to identify patterns between the metrics. Maybe when we share the metric out, no one will ever know we used letters like a, y, w, b, c, etc. because they'll only ever be exposed to their final resolved values.
Perhaps you thought I was counting k×lb(p) as two chunks because lb(p) = log_2(p) and I was counting the "2" as a chunk as well as counting the "log" as a chunk. That is not the case. I count k×lb(p) as two chunks because the "k" is one chunk and the "lb" is one chunk.
No, I agree with your breakdown of these two chunks, per the above. I would never have counted k as free. I did understand that you thought lb was 1 chunk while `log_2` was 2 chunks, but I didn't understand why you would think that.
I just realised we're arguing over a metric for comparing metrics. And these metrics are for comparing ratios for notational commas that are so small that almost no-one will ever use them for notation.

We really have to wind this up soon.
I am ultimately in this to help humanity make music, and I agree this particular path we've gone down the past couple months has not been the most direct or efficient path to that.

Nonetheless — as exasperated as I may get here sometimes — it is a genuine pleasure to learn and exercise my mind on these problems, and I'm proud of what we're accomplishing here.

Re: developing a notational comma popularity metric

Posted: Sun Aug 02, 2020 10:49 am
by Dave Keenan
You're still strawmanning me. You're not reading what I'm actually writing. You're apparently reading what you expect to see. I never said log_2, or any other function, should count for 2 chunks.

Of course we will give constants, not variable names, in the final result. And just read "training" as "optimising" or "running the solver on the data".

Re your example of two metrics where the only difference is changing an `a` to a 2:

If the 2 was dependent on the data, i.e. if it was the result of optimising `a`, then they have the same chunk count.

If the 2 was put into the metric before optimising, then the latter metric has one less chunk.

Re: developing a notational comma popularity metric

Posted: Sun Aug 02, 2020 11:43 am
by cmloegcmluin
Dave Keenan wrote: Sun Aug 02, 2020 10:49 am You're still strawmanning me. You're not reading what I'm actually writing. You're apparently reading what you expect to see.
I know how it feels to be strawmanned, and I know it sucks. I should have addressed that in my last post, and I'm sorry I didn't. I am definitely not trying to strawman you. I may be presenting reductio ad raspberries in a constructive effort to force us to confront things which are clear as day to one person but obscure to the other. But I'm not trying to trick you or anyone else into a false image of your argument so that mine can win. I am after consensus on some truth on this matter, not after getting my way at any cost.

I also know how it feels when I can clearly see someone seeing only what they want to see and not what is there. And I know that it feels super frustrating too. So I'm sorry I'm causing that feeling for you.

I can see that you are putting a commendable effort into putting yourself in my shoes and discerning what my mistaken interpretations of your statements could be caused by, and I notice that and I really appreciate it. I recognize that I may not be being as good of a debate partner to you as you are to me in that respect, so I'm sorry for that too.

For what it's worth, I did at least try. I truly did invest a ton of mental effort in trying to put myself in your shoes and see it from your perspective and figure out what you mean. But I still can't figure out what you mean! Perhaps I should just say less and listen more, or at least just ask a ton of questions until I get it.
I never said log_2, or any other function, should count for 2 chunks.
Just a few hours ago you said: "I now believe we should count r^y as two chunks, just as I believe we should count log_a(p) as two chunks".

To me, that seems like the same thing as saying some functions should count for 2 chunks.

Perhaps you're saying the function is only 1 chunk because the function is exponentiation, which is 1 chunk, and the value of y is another chunk, but the chunk for the value of y is not part of the function itself. That seems to check out with things you've said. The reason that conception doesn't really work for me is that, as I said before, a function without its argument can't stand alone, and an argument such as the parameter y without a function to use it can't stand alone. But again, I should say less and ask more.

Please say more about in what sense I am making you feel like I am misconstruing your position to be such that a function should count for 2 chunks. I truly don't mean to misrepresent you. I just want to understand.
Re your example of two metrics where the only difference is changing an `a` to a 2:

If the 2 was dependent on the data, i.e. if it was the result of optimising `a`, then they have the same chunk count.

If the 2 was put into the metric before optimising, then the latter metric has one less chunk.
Interesting. Okay. This statement is definitely giving me a new perspective on how you're approaching the problem. Enough to ask more questions, anyway.

1.

First of all - I just want to verify - did you notice that the difference wasn't really "changing an `a` to a 2", but something even more subtle than that? To be specific, did you notice that the first example included an "a = 2"? In light of that, I would put the difference another way: in the second example, it's a `log_2`, while in the first example, it's also a `log_2`, but where the 2 is indicated by way of an intermediate symbol `a` whose value of 2 is defined on the next line. But quibbling over this is probably unproductive, because I can see that what you really care about w/r/t chunk counting is not this difference in formatting (which I had hoped it wasn't, and I'm glad that pitching this absurd example spurred you to say more, which will hopefully get us to the next step in understanding each other).

2.

What exactly do you mean by "dependent on the data, i.e. if it was the result of optimising"? Would another way of saying that be "one of the numbers our spreadsheet or code spits out"? As opposed to one of the numbers we choose through discussion on the forum?

3.

Why do you think it matters when the 2 was put into the metric — before or after optimizing? I would say this is irrelevant to the chunk count. I say that because the chunk count is for the people we explain this metric to, and they do not need to know anything about how we found these parameters.

I mean on some level our process should be part of our defense of our choice of metric. We don't want to just throw it out in the world accompanied by its lower SoS relative to SoPF>3 and just say "trust us, this is the best alternative". We'd like to say a few words about each parameter/submetric and why we think it's a good representation of the problem and solution.

And now I'm starting to feel how maybe you're thinking of chunk count more as a negative score for us as creators of the metric, and that we should punish ourselves more for values we have to use a computer to find versus values we decided on using mathematical/philosophical/psychological/musical reasoning here on the forum. Because when we defend the metric, we should feel a need to apologize on some level for numbers that we needed to generate. Is that how you're thinking about it?

I mean, I thought the code and spreadsheets helped us see that 2 was a good base for us to use. If we can trace the 2 to the code/spreadsheet, do we not need to punish ourselves for that with an extra chunk? I somehow doubt that this is exactly what you're thinking, but again, perhaps this guess will help me triangulate what you do in fact mean.

Do you disagree with my conceptualization of a "chunk" as a measurement of the complexity required to state/describe/explain a metric to someone? Would you prefer it be defined in some other way? Would you prefer it represent a negative score for us as creators of the metric, or that it represent mathematical complexity/challenge of the functions it uses, or any other thing? Maybe please just give your ideal definition of a chunk so we can work with that.

Re: developing a notational comma popularity metric

Posted: Sun Aug 02, 2020 9:03 pm
by Dave Keenan
I only have time for a brief reply. More tomorrow. I just want to reassure you that I never imagined the strawmanning was intentional. Perhaps I shouldn't have called it strawmanning at all. Sorry. And I want to reassure you that you have now figured out what I'm saying. Thank you. And I agree that your way of looking at it is valid too. They have different uses.

Re: developing a notational comma popularity metric

Posted: Mon Aug 03, 2020 2:28 am
by cmloegcmluin
Phew, well that's a relief.

I agree that your way of looking at chunks is valid too, and that they have different uses.

I've realized that a change I made to my solver code recently makes this disagreement over the definition of chunk a lot less stressful for me. This change was: per chunk count, the solver now spits out not just a single best metric, but the best metric per combination of parameters, as long as it beat SoPF>3 (of course it sorts them so I can see the best ones right away, important once you get to chunk count 3 and there begin to be hundreds of such parameter combinations which can beat SoPF>3). That way we won't lose out on alternatives which maybe have an ever-so-slightly worse SoS but which make more psychoacoustic sense to us. And in particular to this issue, it means that even if in the end we have a translation layer between my (and my code's) definition of chunks and your definition of chunks, we'll still be able to compete the correct metrics against each other, because we'll have all the information we need to do so. That is, I had been worried that if my code couldn't attribute the correct chunk count to metrics then it might lose some valuable results by letting them get defeated and lost in a more competitive chunk count bracket; but now I'm not worried because even if they competed in the wrong bracket, I can find the results and manually cut and paste them over.
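Something like this, in rough TypeScript-ish sketch form (not the actual solver code; the SoS threshold value here is just a placeholder):

```typescript
// Illustrative sketch only. Per chunk count: keep the best metric for each
// combination of parameters, but only if it beats SoPF>3, and sort so the
// best ones appear first.

interface MetricResult {
    parameterCombination: string    // e.g. "k,a": which parameters were in play
    sos: number                     // sum-of-squares error against the data
}

const SOPFR_GREATER_THAN_3_SOS = 0.01   // placeholder threshold, not the real value

const bestPerParameterCombination = (results: MetricResult[]): MetricResult[] => {
    const bestByCombination = new Map<string, MetricResult>()
    for (const result of results) {
        if (result.sos >= SOPFR_GREATER_THAN_3_SOS) continue    // must beat SoPF>3
        const incumbent = bestByCombination.get(result.parameterCombination)
        if (!incumbent || result.sos < incumbent.sos) {
            bestByCombination.set(result.parameterCombination, result)
        }
    }
    // lowest SoS (best) first
    return [...bestByCombination.values()].sort((a, b) => a.sos - b.sos)
}
```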

I'm no expert on strawmanning. Maybe I'm metastrawmanning it. Perhaps many strawmanners unintentionally misrepresent positions. It seems to be alternately referred to as a form of argument (seems intentional) and a logical fallacy (seems unintentional). Probably there's a grey area between the two.

Alright. I know you want to say more, so I'll cut myself off for now, and get back to work on the code. It's getting very close now to being ready for its final runs.

Re: developing a notational comma popularity metric

Posted: Mon Aug 03, 2020 12:25 pm
by Dave Keenan
You have explained that your chunk count is aimed at measuring complexity from the user's point of view, and as such I agree it makes no difference whether the 2 in r2 was put into the model before, or as a result of, training (= fitting = optimisation = running the solver = minimising the error relative to the data).

My chunk count is instead aimed at validating the model. In particular, giving us a feel for how likely it is that the low error might be due merely to the complexity of the model, rather than its ability to predict human psychoacoustics. I contend that it is too early to be concentrating on user convenience if that means we are not adequately measuring model validity.
cmloegcmluin wrote: Sun Aug 02, 2020 11:43 am Perhaps you're saying the function is only 1 chunk because the function is exponentiation which is 1 chunk and the value of y is another chunk but the chunk which is for the value of y is not part of the function itself. That seems to check out with things you've said.
You've got it. Hallelujah. ;) Well almost. It's not "the value of y" that I cost at one chunk of complexity, it's the existence of y as a parameter of the model — something whose value is obtained by minimising some error-measure over the data. Model parameters are distinct from the constants that are put into the model in advance and will not change in response to the data. Model parameters are also distinct from the variables that change with each datum.

The model parameters, like the weights of a neural net, are variables before and during the optimisation or training, and are constants after that, while other values are either constant before, during and after, or variable before, during and after. We might say that the model has "true-constants", "true variables" and parameters (which start out as variables but end up as constants).
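For example, in k×lb(p): the 2 implicit in lb is a true-constant, p is a true variable (it differs for each prime of each datum), and k is a parameter. In log_a(p), the a is likewise a parameter.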

The reason I count a parameter as a chunk of complexity, independent of what function is applied to it, is summarised by the phrase "Von Neumann's elephant" — the fact that a low error from the trained model may simply represent the fact that you have so many parameters that it is easy for the model to adapt to the data, including the noise in the data. This is also referred to as overfitting or overtraining.
The reason that conception doesn't really work for me is that, similar to as I said before, a function without its argument can't stand alone, and an argument such as the parameter y without a function to use it can't stand alone. But again, I should say less and ask more.
Even if I accept that they can't stand alone, I don't understand why it matters. One can still ask, of any function-application: Are any of its arguments model-parameters? And if so, count an additional chunk of complexity for each argument that is a model parameter. But a model parameter may be an argument of more than one function-application in the model, and we don't want to count every occurrence, so instead we can ask: Are any of its arguments model-parameters that we haven't yet counted? Similarly we don't want to count every application of the same function.

But in fact I don't understand what you mean when you say they can't stand alone. I can certainly count all the model parameters separately from counting all the functions used (not counting every application), and then sum those two counts.
Please say more about in what sense I am making you feel like I am misconstruing your position to be such that a function should count for 2 chunks. I truly don't mean to misrepresent you. I just want to understand.
For example, `log_2(p)` where the 2 is a true constant and the p is a true variable (no parameters involved), only counts as one chunk.
1. First of all - I just want to verify - did you notice that the difference wasn't really "changing an `a` to a 2", but even more subtle than that? To be specific, did you notice that first example included an "a=2"? In light of that, I would put the difference in another way. I would say in the second example, it's a `log_2`, while in the first example, it's a `log_2` but where the 2 is indicated by way of an intermediate symbol `a` whose value of 2 is defined on the next line. But quibbling over this is probably unproductive because I can see that what you really care about w/r/t chunk counting is not this difference in formatting (which I had hoped it wasn't, and I'm glad that pitching this absurd example spurred you to say more which will hopefully get us to the next step in understanding each other).
If I noticed it when I first read it, I had forgotten it by the time I managed to squeeze in a response before my sister arrived to go hiking. But as you rightly understood, it makes no difference from my point of view.

BTW, your toy example metrics would always be zero, since n and d share no prime factors, so at least one of n_p and d_p is always zero. I think you meant (n_p + d_p) when you wrote n_p d_p. :)
2. What exactly do you mean by "dependent on the data, i.e. if it was the result of optimising"? Would another way of saying that be "one of the numbers our spreadsheet or code spits out"? As opposed to one of the numbers we choose through discussion on the forum?
Yes.
3. Why do you think it matters when the 2 was put into the metric — before or after optimizing? I would say this is irrelevant to the chunk count. I say that because the chunk count is for the people we explain this metric to, and they do not need to know anything about how we found these parameters.
I agree it's not relevant to the users whether the 2 was a constant or a parameter. But it's relevant to whether or not we might be overfitting. However it is relevant to the user if we can't round the parameter to some integer, or simple ratio or surd, or named constant like π, ϕ, e, without increasing the error. 2.017 is more complex than 2 from the user's point of view. Your chunk count doesn't consider that at all. My chunk count effectively assumes the worst case for that.
I mean on some level our process should be part of our defense of our choice of metric. We don't want to just throw it out in the world accompanied by its lower SoS relative to SoPF>3 and just say "trust us, this is the best alternative". We'd like to say a few words about each parameter/submetric and why we think it's a good representation of the problem and solution.
I agree. And as such I think the emphasis should be on model validity, not user convenience. If it happens that we can round some parameter to an easily remembered value without significantly increasing the error, or rewrite the formula in a way that is easier to remember, that's just a bonus, or icing on the cake, but should not influence us too much in choosing the metric.
And now I'm starting to feel how maybe you're thinking of chunk count more as a negative score for us as creators of the metric, and that we should punish ourselves more for values we have to use a computer to find versus values we decided on using mathematical/philosophical/psychological/musical reasoning here on the forum. Because when we defend the metric, we should feel a need to apologize on some level for numbers that we needed to generate. Is that how you're thinking about it?
That's very close to how I'm thinking about it. I think I've explained enough above, to indicate how it's not quite that.
I mean, I thought the code and spreadsheets helped us see that 2 was a good base for us to use.
Not at all. We could equally have chosen to standardise on base e instead of base 2 — `\text{ln}` instead of `\text{lb}`. It was a completely arbitrary choice in regard to the validity or accuracy of the model. I suggested 2 because it is already commonly used in music theory.
If we can trace 2 to the code/spreadsheet, do we not need to punish ourselves for that with an extra chunk?
Yes, if by "trace it to the code or spreadsheet" you mean we let the code or spreadsheet adjust it to minimise the error over the data, and all the parameters are independent, i.e. none are redundant in the sense that the effect of changing one can be completely undone by changing others.
I somehow doubt that this exactly what you're thinking, but again, perhaps this guess will help me triangulate what you do in fact mean.
Whether or not we should standardise the log base (and use a coefficient of the log as the parameter instead of using the base as the parameter) is, for me, completely orthogonal to the question of how we should count chunks.

I understand it is not orthogonal for you, because (correct me if I'm wrong) you want to count constants the same as parameters, and for some reason I don't understand, you still see a constant 2 when you see lb. If so, would you see a constant e that you need to count, if we used ln instead of lb? Or maybe I've got it wrong. Maybe it's that you don't count log bases at all for some reason, no matter whether they are constants or parameters, but you do count coefficients? Please set me straight here.

In many cases we already have a coefficient for the log, and in that case, standardising the base is simply removing a redundant parameter. In the cases where standardising the base requires introducing a new coefficient as a parameter, I don't find that any more complex. There's still one parameter and one log function.
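To spell out the redundancy in symbols:

$$c \times \log_a(p) \;=\; \frac{c}{\operatorname{lb}(a)} \times \operatorname{lb}(p) \;=\; k \times \operatorname{lb}(p), \qquad k = \frac{c}{\operatorname{lb}(a)}$$

Any change to the base a can be completely undone by a change to the coefficient, which is exactly the sense of "redundant" I meant above.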

I understand that, because `\text{lb}` is not a common function on calculators or spreadsheets or programming languages, we will have to describe it (somewhere) as log_2, and because you count constants in the same way you count parameters, you think that a constant base is just as complex as a parametric base, and so you count the case with a coefficient as one more chunk. Given that `\text{ln}` is ubiquitous, if we instead standardised on base e, would you then agree that k×ln(p) has the same complexity, even from the user's point of view, as log_a(p)? If not, why not?
Do you disagree with my conceptualization of a "chunk" as a measurement of the complexity required to state/describe/explain a metric to someone?
There is a time and place for such a complexity measure, but I don't think it is here and now.
Would you prefer it be defined in some other way?
Yes.
Would you prefer it represent a negative score for us as creators of the metric, or that it represent mathematical complexity/challenge of the functions it uses, or any other thing? Maybe please just give your ideal definition of a chunk so we can work with that.
Hopefully I've explained it well enough above. But since you didn't seem to understand that I was describing my definition of a chunk of complexity the last two times I did it, here it is again: A chunk of complexity is a model parameter or a function other than + - × ÷.
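If it helps to see the rule mechanically, the counting goes like this (an illustrative sketch only, not anyone's actual code):

```typescript
// One chunk per distinct model parameter, plus one chunk per distinct
// function other than + - × ÷. Illustrative sketch only.

const FREE_FUNCTIONS = new Set(["+", "-", "×", "÷"])

const countChunks = (parameters: string[], functions: string[]): number => {
    const distinctParameters = new Set(parameters)
    const costedFunctions = new Set(functions.filter((f) => !FREE_FUNCTIONS.has(f)))
    return distinctParameters.size + costedFunctions.size
}

countChunks(["k"], ["×", "lb"])   // k×lb(p):   2 chunks (k, lb)
countChunks(["a"], ["log"])       // log_a(p):  2 chunks (a, log)
countChunks([], ["log"])          // log_2(p):  1 chunk  (log)
```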

Re: developing a notational comma popularity metric

Posted: Mon Aug 03, 2020 1:56 pm
by Dave Keenan
I think "under the hood" must mean something different to you. To me, in this context, it means "as implemented in library code or hardware".

Because the ubiquitous IEEE floating point format stores numbers in a radix 2 format, it is fastest to compute the base-2 logarithm. So the lowest-level log operation is the base-2 log, lb, and variable-base log is computed by doing two lb's and a divide. Natural logs, ln, are computed by doing an lb and multiplying by a pre-computed constant which is 1/lb(e).
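In code terms, the relationships look roughly like this (a sketch of the idea only, not any particular library's actual internals):

```typescript
// Sketch only: how variable-base and natural logs relate to the lb primitive.

const lb = Math.log2                        // the low-level primitive: base-2 log

// variable-base log: two lb's and a divide
const logBase = (x: number, a: number): number => lb(x) / lb(a)

// natural log: one lb and a multiply by the pre-computed constant 1/lb(e)
const ONE_OVER_LB_E = 1 / lb(Math.E)        // ≈ 0.6931
const ln = (x: number): number => lb(x) * ONE_OVER_LB_E
```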

But this too, is completely orthogonal to chunk counting, from my point of view.

Re: developing a notational comma popularity metric

Posted: Mon Aug 03, 2020 4:52 pm
by cmloegcmluin
This has been a big help to me in seeing your point of view. Thanks for all of this detail and variety of angles of explanation. "2.017 is more complex than 2" in particular was helpful.

Overall, I will say: you're right; you've convinced me of your point of view.

I think it's a fair characterization that your chunk definition regarded "model validation" while mine regarded "user convenience." I also agree that dealing with model validation comes before dealing with user convenience. I would submit a minor modification to that dichotomy, though: I don't think user convenience is of ultimate importance here, and I never really did; for me, it was the best way I knew of validating the model. I've been following your guiding words from near the outset of the project, "[our metric] need not be easily mentally calculated," in the sense that I recognize that most users, I hope, will never have to plug the formula into a calculator (maybe an online Sagittal calculator will spit out the datum for them), by which I mean to say that I don't think "user convenience" is of much importance.

Having now gained the additional insight you have on the problem space from this exchange, I would characterize my gravitation toward what you've referred to as "user convenience" not as an alternative or conflicting purpose, but more like an undesirably indirect attempt to address the same purpose. I might suggest we replace the characterization of my definition of chunk count as one after "user convenience" with one after "model simplicity", while I think the characterization of your definition of chunk count as one after "model validity" is quite suitable as it is. And when we compare "model simplicity" against "model validity", I think it's quite clear that the one we really want is "model validity", and that "model simplicity" has just been my subpar attempt to capture that.

In characterizing what I was after as something with "user convenience" as the ultimate end, you'd be right to say I was putting the cart before the horse, taking "validity" and polluting it with the wrong stuff. However, since I was actually only bringing the user into the argument by way of illustrating simplicity, in a clumsy attempt at representing validity, I would like to convince you that we were closer to being on the same page than you may have thought. I think my "model simplicity" is pretty much the same as your "model validity", except that your "model validity" comes with some additional subtle and profound thoughts and insights which I had not yet attained.

Your terminology for "true constants", "true variables", and "parameters" is clear to me and helpful. Thanks for making that.

So it's probably clear from the above that I get the general gist of your perspective now, and am very open to coming around to it. That said, I think it will become clear pretty quickly that I do still have trouble getting it. It hasn't all 100% clicked for me all at once.

I do admit that it still feels unintuitive for me to count "true constants" as 0 chunks while counting parameters (which resolve to constants after running a solver) as 1 chunk. That idea is what I was referring to when I wrote the sentence "I somehow doubt that this is exactly what you're thinking, but again, perhaps this guess will help me triangulate what you do in fact mean." It appears, however, that that is in fact exactly what you're thinking. Unfortunately, your inline response to that sentence I wrote doesn't particularly help me understand your point of view. That's okay... this is all super abstract, weird, and complicated stuff. Probably more my fault than yours anyway. In any case, I'll need to press on this issue a bit more.

You say we need to punish ourselves for using parameters where we could use a constant; what's to stop us from cheating to evade the punishment? It feels like we could just lie at the end and make up some explanation for the constant the solver spits out and to justify it as if we had locked it down as a constant before running the solver. I don't mean that I desire to be evil here. I just mean that I don't comprehend the essence of the existence of any math/psych/phil/musical police here enforcing this sort of distinction, or perhaps rather than police I should characterize it as the laws themselves; I don't get what intellectual laws inform this sort of constraint. And as long as no one has been able to explain this law to me, I can hardly be expected to follow it if I would in the natural course of my behavior break it. Perhaps there's some extra insight you have on metric invention which you could share to assuage this concern of mine.
Dave Keenan wrote: Mon Aug 03, 2020 12:25 pm But in fact I don't understand what you mean when you say they can't stand alone. I can certainly count all the model parameters separately from counting all the functions used (not counting every application), and then sum those two counts.
Sorry I never got this point across, but I wouldn't worry about it. It's no longer relevant to the discussion.
(correct me if I'm wrong) you want to count constants the same as parameters, and for some reason I don't understand, you still see a constant 2 when you see lb. If so, would you see a constant e that you need to count, if we used ln instead of lb? Or maybe I've got it wrong. Maybe it's that you don't count log bases at all for some reason, no matter whether they are constants or parameters, but you do count coefficients? Please set me straight here.
As I said above, I am now susceptible to the position that constants do not count the same as parameters. But yes, until this post, I was maintaining the position that constants should count as 1 chunk, just as parameters do (because by my understanding of model validity/simplicity, it didn't matter whether a given constant had ever been a parameter).

So yes, I would see a constant e that I needed to count, had we used ln instead of lb. And I would have seen a constant 4/3 that I needed to count had we used `log_(4/3)` instead of lb. I have no idea what to do with the sentence "Maybe it's that you don't count log bases at all for some reason, no matter whether they are constants or parameters, but you do count coefficients". That's not what I think at all and I can't figure out how that fits in (I don't mean to criticize you for saying it... I recognize that it's probably quite possible to discern how that could be a logical response to something that has been said... but I'm exhausted and can't figure it out, and now I'm probably spending more time and effort articulating my failure to understand the context of your comment than it would take to uncover said context, but in any case, since it's not at all what I think, perhaps communicating that to you will suffice).

Maybe I'll just try to say my spiel about atomic chunks and subatomic chunk particles again, since you haven't specifically responded to that. Actually, this is about "standing alone" still, too, so maybe I will address that concept again after all. Here's how I would approach the introduction of a chunk.

"Ah, I want to put something else into my metric!

Let's put in some number... most math things need numbers of some kind!

Ah, but how will I use this number? I can't simply drop a number into my metric, or it won't mean anything! Numbers can't stand alone! I know that when I drop numbers in next to things, math conventions say that means to multiply them, but that's not really what I'm talking about... that'd be an implicit function. Whether or not orthographically I need to set down any extra ink to use this number, I need to apply it by way of some function!

Perhaps my function will be "as a coefficient / multiplying"! Or perhaps it will be "as a logarithmic base"! Or perhaps it will be "as a power exponent"!

In any case I need to use a function to use it somehow!"

So, if I felt compelled to increase the accuracy of my metric by adding an extra chunk of complexity, I might start by saying I need to add something in somewhere. That something might be a 2, an e, a 4/3, a pi, whatever. And then I wouldn't be done because those don't do anything mathemusically by themselves. So I wouldn't have successfully completed the act of adding a chunk of complexity (hopefully toward the end of reducing my SoS) until I'd given that something a function. And I don't think it makes any sense to say that one function costs any more than other function.

And to me, lb is syntactic sugar for log_2, as ln is syntactic sugar for log_e. It's like what you said about icing. It doesn't change anything about how valid or simple a metric is. It's just a niceness that we get to write it more cleanly in the end. That's why I "see" a constant 2 when I see lb. Am I missing something?
In many cases we already have a coefficient for the log, and in that case, standardising the base is simply removing a redundant parameter. In the cases where standardising the base requires introducing a new coefficient as a parameter, I don't find that any more complex. There's still one parameter and one log function.
I agree completely with the first statement. The second statement would make sense to me if I accepted that coefficients don't count as chunks. I have not been convinced of that yet.
I understand that, because `\text{lb}` is not a common function on calculators or spreadsheets or programming languages, we will have to describe it (somewhere) as log2, and because you count constants in the same way you count parameters, you think that a constant base is just as complex as a parametric base, and so you count the case with a coefficient as one more chunk.
I count the case with a coefficient as one more chunk because the coefficient is the chunk.

lb is the same chunk count to me as log_2 or log_e or ln or log_4/3. They are all one chunk. A chunk is an atomic unit of complexity. The two subatomic particles of a chunk are some value, such as 2, e, or 4/3, and some function applying them, such as as a logarithmic base, or power exponent, or coefficient. Writing the end result as lb or ln is just sugar/icing.
Given that `\text{ln}` is ubiquitous, if we instead standardised on base e, would you then agree that k×ln(p) has the same complexity, even from the user's point of view, as loga(p)? If not, why not?
I would not agree with that. Standardizing to base e changes nothing for me. k times ln(p) has 2 chunks of complexity - one for the log_e, and one for the k. log_a(p) has 1 chunk of complexity - one for the log_a, whatever that a turns out to be.

At least, that's what I have been arguing. If you can address my concerns earlier in this post (re: laws, cheating, etc.) and convince me that there's something special about 2 or e whereby it doesn't count as a chunk where something like 2.017 would count as a chunk, then it'd be different. But in that case I'd first have to change the whole thing to say "you've convinced me, Dave: functions alone are 1 chunk, and their argument is another chunk", which would make k times ln(p) 3 chunks of complexity: one for the k, one for the log, and one for the e. Then I'd immediately follow that with the effect of the second thing you would need to convince me of, which is "you've convinced me, Dave: constants don't count for chunks, only parameters which resolve to constants do, and since e is one of those special numbers which we can get away with claiming as a constant, it doesn't count". So the k times ln(p) would be back down to 2 chunks, but now one chunk would be for the k, and one for the log (where before it had been one for the k, and one for the log_e).
Hopefully I've explained it well enough above. But since you didn't seem to understand that I was describing my definition of a chunk of complexity the last two times I did it, here it is again: A chunk of complexity is a model parameter or a function other than + - × ÷.
Would you be able to state in plain language for me why the functions + - × ÷ are special and do not warrant a chunk, while any other function does? I would appreciate it if you could respond to the specific questions I posed earlier with respect to why × wouldn't be more chunks than +, since it is strictly more complex.

Re: developing a notational comma popularity metric

Posted: Mon Aug 03, 2020 9:39 pm
by Dave Keenan
I'm sorry, but I really don't want to spend any more time on this, or soon we'll be trying to come up with a metric for comparing the function-complexity metrics that we're using to compare chunk-counting metrics for comparing ratio popularity metrics. :lol:

That has to ground out somewhere. The justification for any of these chunk-counting schemes is all pretty vague and arbitrary. I'm happy to admit that they are only intended to be quick and dirty. How about we just have two chunk-count columns. "DK chunk count" and "DB chunk count". :)

Perhaps we should put more weight (in determining the validity of the metrics) on how they perform on ratios that they haven't seen before, or how they perform with z=1, which is almost the same thing, rather than on their chunk counts.

I think we at least agree on what counts as a parameter. So it would be good to have a "Parameter count" column.

I'd much rather you spent your Sagittal time exhausting the possibilities of your code for suggesting new candidate ratio-popularity metrics, instead of exhausting my powers of explanation. :)

But I feel I owe you at least the following attempted explanations:

1. To me, for the purpose of crudely measuring the complexity of a model, the functions log_2(x), √x, x^2, 2^x (where the "2"s are true constants) are boxes with one input and one output, as are x! (factorial), sin(x), cos(x) and tan(x), along with their inverses and their hyperbolic cousins (not that I see any application of them here). For this (model complexity) purpose, it makes no difference to me that the former can be drawn as a box with two inputs (with a "2" feeding into one input) while the latter cannot.

2. It would be perfectly reasonable to treat + and - as 1 chunk, × and ÷ as 2 chunks and power, exponential, root and log as 3 chunks, or some similar scheme. I just don't think such fine resolution is warranted. You might think of me as having taken such a scheme, divided those chunk numbers by 5 and rounded them to the nearest integer.
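That is:

$$\tfrac{1}{5} = 0.2 \to 0, \qquad \tfrac{2}{5} = 0.4 \to 0, \qquad \tfrac{3}{5} = 0.6 \to 1$$

which lands back at the scheme above: + - × ÷ cost nothing, and the higher-order functions cost one chunk each.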