cmloegcmluin wrote: ↑Fri Jun 26, 2020 2:16 am
So, just to be clear: we sort the occurrence counts from Scala into a popularity rank, and then sort whatever values our candidate metric generates into a popularity rank approximation, and then we take those two lists of integers (the former of which will be exactly monotonic, and the latter of which will be as monotonic as we can manage) and run them through this summation-of-SED formula to get a single result, Spearman's rank correlation coefficient, or ρ ("rho")*. So the "two ranks" for each ratio you mention are the actual rank and our approximate rank. And then we rinse and repeat for each candidate metric until we find the one which gives the best (smallest) ρ.

Yes. Except you don't really need to say "summation of SED" since "SED" already is the summation.

cmloegcmluin wrote:
*Actually, the formula looks to be a tad more complex than that. You extracted the important bit. But we actually need to take the above value, multiply by 6, divide by n(n^2 - 1) where n is the number of observations (about 800 of them, we have), and then subtract all that from 1.

Sure. I should have written "then maximising Spearman's coefficient simplifies to minimising the sum of the squares of the differences between the two ranks for each 2,3-reduced ratio."
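
To make the formula in the footnote concrete, here is a minimal Python sketch. The function name spearman_rho and the toy rank lists are made up for illustration, and it assumes both lists are tie-free ranks 1..n, which is the case the simple formula covers.

```python
def spearman_rho(actual_ranks, approx_ranks):
    """Spearman's rho for two equal-length lists of ranks, assuming no ties."""
    n = len(actual_ranks)
    if n != len(approx_ranks) or n < 2:
        raise ValueError("need two rank lists of the same length, n >= 2")
    # Sum of the squares of the differences between the two ranks of each item.
    sum_sq_diff = sum((a - b) ** 2 for a, b in zip(actual_ranks, approx_ranks))
    # Multiply by 6, divide by n(n^2 - 1), and subtract the result from 1.
    return 1 - 6 * sum_sq_diff / (n * (n ** 2 - 1))


# Hypothetical example: the candidate metric swaps the top two ratios.
actual = [1, 2, 3, 4]
approx = [2, 1, 3, 4]
rho = spearman_rho(actual, approx)
```

Since n is fixed across candidate metrics, the largest ρ always goes with the smallest sum of squared differences, so only that sum needs comparing.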

cmloegcmluin wrote:
I don't suppose there's an established method of weighting each term's SED with a falloff of importance, to represent how we care very much about the ranking being accurate at the top of the list but less and less so as we go deeper into the list, due to having fewer data points the deeper we go?

Good idea. But I don't like the weighting of that paper you linked. It isn't a weighted Spearman's, it's a weighted Pearson's. And the weighted rank is 0.9^rank.
I think we should stick to Spearman's, and the weighted rank should be rank^-1.37.
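
One possible reading of this suggestion, sketched in Python: take the weighted rank to be the rank raised to the power -1.37 (an assumption about the notation), transform both the actual and the approximate rank that way, and sum the squared differences as before. The name weighted_rank_sse and the toy lists are hypothetical, and ranks are assumed to start at 1.

```python
def weighted_rank_sse(actual_ranks, approx_ranks, exponent=-1.37):
    """Sum of squared differences between power-transformed ranks.

    Assumes 1-based ranks. With a negative exponent, errors near the top
    of the list contribute far more than errors deep in the list.
    """
    if len(actual_ranks) != len(approx_ranks):
        raise ValueError("rank lists must have the same length")
    return sum(
        (a ** exponent - b ** exponent) ** 2
        for a, b in zip(actual_ranks, approx_ranks)
    )


# Hypothetical comparison: the same kind of swap at the top vs. the bottom.
top_swap = weighted_rank_sse([1, 2, 3, 4], [2, 1, 3, 4])
bottom_swap = weighted_rank_sse([1, 2, 3, 4], [1, 2, 4, 3])
```

Under this reading, swapping the top two ratios is penalised much more heavily than swapping the bottom two, which is the falloff of importance asked for above.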

cmloegcmluin wrote:
... SOPF>3 (or should we start calling it sopfr(rgh5)?) ...

I'd rather not.