it need not be easily mentally calculated.
So I think the level of complexity Dave is looking for is something like this: not necessarily mental-math-able, but the formula should be easy to remember. And whether or not that's exactly what Dave was thinking, I now think that's a pretty good standard.
And I don't think that previous formula is memorable. So I set out this morning to see if a simplification of it could fit the data almost as well. I would trade a little bit of accuracy for simplicity. And so I tried simplifying the sopf part of the equation:
sopᵃfr(num) + k×sopᵃfr(den) + u×sopf(num/den) + s×primelimit(num/den)
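For concreteness, here's a sketch of that formula in Python. The helper names (prime_factors, sopfr_a, sopf, primelimit, complexity) are mine, not anything official, and it assumes num/den is already in lowest terms:

```python
def prime_factors(n):
    """Prime factorization of n as (prime, multiplicity) pairs, by trial division."""
    out = []
    p = 2
    while p * p <= n:
        m = 0
        while n % p == 0:
            n //= p
            m += 1
        if m:
            out.append((p, m))
        p += 1
    if n > 1:
        out.append((n, 1))
    return out

def sopfr_a(n, a):
    """Sum of prime factors with repetition, each prime raised to the power a."""
    return sum(m * p ** a for p, m in prime_factors(n))

def sopf(n):
    """Sum of distinct prime factors."""
    return sum(p for p, _ in prime_factors(n))

def primelimit(n):
    """Greatest prime factor."""
    return max(p for p, _ in prime_factors(n))

def complexity(num, den, k, a, s, u):
    # With num/den in lowest terms, the primes of the ratio are the primes of num*den.
    nd = num * den
    return (sopfr_a(num, a) + k * sopfr_a(den, a)
            + u * sopf(nd) + s * primelimit(nd))
```

The lattice search is then just a matter of calling complexity() for every (k, a, s, u) combination and keeping whichever gives the smallest sum of squares against the vote data.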
And, with mixed feelings, my code spat out an even lower sum of squares.
I must admit that part of my process is still manual. I simply don't have the computing power to brute-force the optimal values for these parameters k, a, s, and u. What I'm doing is choosing a reasonable range of values for each one, and a resolution of points between them, and running the fit for every point in that lattice. Then I wait for the code to give me the best result, and tighten in around the best value, gradually shrinking the search space and increasing the resolution of points within it that I check. So the explanation for finding a better sum of squares, despite having strictly constrained the formula, can only be that I goofed previously: I didn't proceed methodically enough, and somehow zeroed in on a local minimum rather than the overall minimum.
If anyone knows a better process, let me know.
I guess I could automate it: make a list of all the minima, i.e. any point that gives a better sum of squares than all of its neighbor points (those off by one resolution step in either direction for any of the parameters), then recursively drill into the volume bounded by each such point's neighbors, and keep branching until I hit some accuracy threshold, then settle on the absolute minimum.
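That recursive drill-down could look something like this. This is a generic sketch with made-up names; f stands for the sum-of-squares as a function of the parameter tuple:

```python
import itertools

def refine(f, bounds, resolution=5, tol=1e-3):
    """Recursive lattice search: evaluate f at every grid point, find each local
    minimum (a point no worse than any axis-neighbor), then drill into the box
    spanned by that minimum's neighbors, until every box is narrower than tol.
    Returns (best value, best point)."""
    dims = len(bounds)
    axes = [[lo + i * (hi - lo) / (resolution - 1) for i in range(resolution)]
            for lo, hi in bounds]

    def point(idx):
        return tuple(axes[d][i] for d, i in enumerate(idx))

    values = {idx: f(point(idx))
              for idx in itertools.product(range(resolution), repeat=dims)}

    def neighbors(idx):
        for d in range(dims):
            for j in (idx[d] - 1, idx[d] + 1):
                if 0 <= j < resolution:
                    yield idx[:d] + (j,) + idx[d + 1:]

    # Stop once the box is small enough; return the best lattice point in it.
    if max(hi - lo for lo, hi in bounds) < tol:
        idx = min(values, key=values.get)
        return values[idx], point(idx)

    minima = [idx for idx, v in values.items()
              if all(v <= values[n] for n in neighbors(idx))]
    return min(refine(f, [(axes[d][max(i - 1, 0)],
                           axes[d][min(i + 1, resolution - 1)])
                          for d, i in enumerate(idx)], resolution, tol)
               for idx in minima)
```

Because it recurses into every local minimum rather than just the single best point, it shouldn't get trapped the way a greedy tighten-and-repeat can (so long as the initial resolution is fine enough that the basin of the true minimum shows up as its own lattice minimum).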
I should probably share something else that I glossed over in the previous post, but just for the record: I did in fact experiment in the other direction, with an even more complicated formula, where for sopf, I used a different scalar than k (called it j) on the smaller of the numerator and denominator, and I also used a different exponent on the primes there (called it b, instead of a). Unsurprisingly, the best values for k and j came out close to each other, and the best values for a and b came out close to each other, so I decided this level of complexity was not worth including.
So for my simplified formula, the best coefficients are coming out something like this now:
sop⅝fr(num) + ⅜sop⅝fr(den) + ⅛sopf(num/den) + ⅙primelimit(num/den)
To be exact, k=0.368, a=0.624, s=0.171, u=0.127, z=-1, cut-off-point=80 gives a sum of squares of 0.0053577.
Without using sopf at all, the best fit I can find is 0.0069051 (k=0.397, a=0.872, s=0.168, u=0, z=-1, cut-off-point=80). So I think we should keep sopf in there.
And I thought we should keep prime limit in on principle alone, to differentiate e.g. 25/7 from 17/1. But maybe raising the primes to powers could somehow cover that need? No. The best fit I could find without prime limit was 0.0071123 (k=0.753, a=0.517, u=0.285, s=0, z=-1, cut-off-point=80).
And if I try reintroducing the complexity of splitting the numerator and denominator for sopf, as well as raising the primes to a power for sopf, the best values seem to come out very near 1. So that's further evidence to just leave that complexity out. (Another reason: I didn't like how sometimes the numerator and denominator would get flipped for sopf but not for sopfr, or vice versa... it just made me a little uncomfortable... probably fine, but it felt really strange.)
Perhaps someone knows a better way to express the formula to capture the effect that it's not really the denominator part of the ratio that gets scaled by k, but whichever of the numerator and denominator comes out smaller after applying sop⅝fr. Maybe we have to put it like this:
𝜈(rough5(c1/c2)) = max(sop⅝fr(c1), sop⅝fr(c2)) + ⅜min(sop⅝fr(c1), sop⅝fr(c2)) + ⅛sopf(c1/c2) + ⅙gpf(c1/c2)
It looks like a lowercase "v", but it's the Greek lowercase "nu", which I chose because it could stand for "notational unpopularity". Maybe typing "v" would generally be okay when it's clear from context what we're talking about.
You'll also notice I replaced primelimit() with a better name. In the mathematical literature it seems to be either P() for "largest prime divisor" (where p() is the "smallest prime divisor") or gpf() for "greatest prime factor":
http://www.nathanmcnew.com/PopularPrimes.pdf
https://projecteuclid.org/download/pdf_ ... 1250127234
http://www.math.cmu.edu/~mlavrov/arml/1 ... utions.pdf
http://nntdm.net/papers/nntdm-19/NNTDM-19-1-55-58.pdf
https://link.springer.com/chapter/10.10 ... -3638-9_27
https://mathworld.wolfram.com/GreatestPrimeFactor.html
I suggest we use gpf() as it looks more similar to sopf and is less ambiguous.
So then the only thing we'd really have to explain is what the heck sop⅝fr means; perhaps the best way to describe it is as the "sum of prime factors with repetition, but where each prime is raised to the power of 5/8". For example, sop⅝fr(12) = 2^⅝ + 2^⅝ + 3^⅝ ≈ 5.07.
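Putting the pieces together, here's my reading of 𝜈 as code, using the rounded fractional coefficients (the function names are mine; with the exact fitted values rather than the fractions, the numbers would shift slightly):

```python
def factors(n):
    """Prime factors of n, with repetition, by trial division."""
    out, p = [], 2
    while p * p <= n:
        while n % p == 0:
            out.append(p)
            n //= p
        p += 1
    if n > 1:
        out.append(n)
    return out

def sopfr58(n):
    """Sum of prime factors with repetition, each raised to the 5/8 power."""
    return sum(p ** 0.625 for p in factors(n))

def nu(c1, c2):
    """𝜈 of the ratio c1/c2, assumed to be in lowest terms."""
    a, b = sopfr58(c1), sopfr58(c2)
    primes = set(factors(c1 * c2))  # c1, c2 coprime, so these are all primes of the ratio
    return (max(a, b) + 0.375 * min(a, b)   # larger part + ⅜ × smaller part
            + 0.125 * sum(primes)           # ⅛ × sopf
            + max(primes) / 6)              # ⅙ × gpf
```

Note that the max/min form makes nu(c1, c2) symmetric by construction, so the flipping of numerator and denominator takes care of itself.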
And of this 𝜈(c) function we can say that it maximizes Spearman's ρ against rankᶻ, where z is Zipf's exponent, -1, over the top commas in Scala's popularity statistics where the entry had at least 0.05% of the total votes (that's another way of saying at least 19 of them, which in turn is the same as saying in the top 80).
Finally, how does 𝜈(c) perform on the simple pairs of low primes?
7/5 → 5.59 (1318 votes)
35/1 → 11.71 (875 votes)
11/5 → 6.99 (339 votes)
55/1 → 7.32 (119 votes)
11/7 → 7.55 (324 votes)
77/1 → 7.72 (111 votes)
So it's not exactly proportional, but then, the goal hasn't been for them to be proportional since Dave focused us in on matching ranks rather than values.
And how does it look? So, interestingly, it is not a better fit by R², which is what we were going for before. But Spearman's ρ is more what we want.
I do have another question though: should we experiment with taking the logarithms of the primes rather than raising them to some power < 1?
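One observation on that question: since pᵃ = e^(a·ln p), both transforms are monotone in p, so they rank individual primes identically; they can only disagree once the terms get summed across factors, where the power grows faster for large primes than the log does. A quick side-by-side, just to eyeball the difference:

```python
import math

# Compare p^(5/8) with ln(p) over the primes that matter here.
# The power keeps pulling ahead of the log as p grows.
primes = [2, 3, 5, 7, 11, 13]
for p in primes:
    print(f"{p:2d}  {p ** 0.625:6.3f}  {math.log(p):6.3f}")
```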