
Calculating standard times using Z scores

Calculating standard times is an arduous task, due to the excessive rail movements on British racecourses, plus various other practices - for example, the official race distances in Ireland!
Using Excel, I am going to work out the standard time for the 5f (1000m) at Deauville, using the fastest 25 race times and identifying any outliers with z-scores.

A white paper published in 1987 suggested using a multiplier of 2.2 to identify outliers, but just looking at the graph that isn't going to work, so I am going to use 1.5, as used in SPSS.
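
If you fancy trying this yourself, a minimal sketch of the z-score test in Excel might look like this - the cell references are mine for illustration, assuming the 25 times sit in A2:A26:

  =AVERAGE($A$2:$A$26)              (mean of the times, in D1 say)
  =STDEV.S($A$2:$A$26)              (standard deviation, in D2 say)
  =(A2-$D$1)/$D$2                   (z-score for the first time, in B2, filled down)
  =IF(ABS(B2)>1.5,"outlier","")     (flags anything beyond the 1.5 cutoff, in C2, filled down)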

Outlier.png

Using the 1.5 multiplier, the times column now highlights 4 outliers.
Outlier2.png

I deleted the 4 outliers; looking at the graph, the 56.20s time also needs to be deleted.
Outlier3.png

Looking at the graph, removing the 56.20s time has improved the trendline.
Outlier4.png

Looking at the prediction equation, the standard time should be 56.59s.
(5f*0.0498)+56.342 = 56.59s
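
If you'd rather Excel did the prediction in one go, something along these lines should return the same answer - my cell references, assuming the sample numbers are in A2:A21 and the cleaned times in B2:B21:

  =FORECAST.LINEAR(5, B2:B21, A2:A21)   (predicted y at x=5; plain FORECAST in older versions of Excel)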

Food for thought...

Mike.
 

Attachments

  • ZScoreOutliers.xlsx
    16.9 KB
Oh, for those watching -
Z-scores (and for those interested an alternative method known as a t-test) are built into Excel, so I figure most of us on here could have a crack at similar using Excel.
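
As a rough pointer for anyone having that crack, the built-ins are along these lines - the ranges are illustrative only:

  =STANDARDIZE(A2, AVERAGE($A$2:$A$26), STDEV.S($A$2:$A$26))   (z-score of a single time)
  =T.TEST(A2:A26, B2:B26, 2, 2)                                (two-tailed t-test comparing two samples of times)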

My own efforts in this line are moving along slowly - glacially at present - as I am trying to build up my understanding as much as using the stats approach to calculate useful data, plus of course fixing bugs and adding the odd latest brainwave (or brain fade, as often seems to be the result). I'm looking at the AW tracks, particularly at the split times I currently use for speed figures on AW, hoping to get my AW speed figures onto a sounder basis. Apart from any other consideration, the AW tracks provide a small group to play with the data for, and represent a much smaller task than doing it for the whole raft of turf tracks.

Dave
 
TheBluesBrother

Hello Mike

This is an excellent article and attachment, but I hope I am not being too picky when I say this!

I think that you have input the incorrect "x" value into the equation. I don't think it should be the 5f value but the midpoint of the x values you have chosen - in the final example I would say 15.5, as the midpoint of the x values, is the number to use?

Otherwise brilliant idea and workings.

Kevin
 
I think that you have input the incorrect "x" value into the equation. I don't think it should be the 5f value but the midpoint of the x values you have chosen - in the final example I would say 15.5, as the midpoint of the x values, is the number to use?

If you used 15.5 as the plug-in, aren't you just calculating the mean/average? I personally use the 15th percentile figure when calculating standard times.
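
In Excel terms that's just the following - my ranges, assuming the cleaned times sit in A2:A26:

  =PERCENTILE.INC(A2:A26, 0.15)   (the 15th percentile of the times)
  =MEDIAN(A2:A26)                 (the 50th percentile, for comparison)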

The linear pars file below is what I use when I want to apply the y = mx + b equation to calculate a universal time for the flat or jumps.

m = slope (time per furlong)
x = the plug-in for the distance you want to calculate, in furlongs
b = the constant (intercept)

To calculate a 5f flat distance

(5f x 13.504) - 10.239 (constant) = 57.28s
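
Or as a cell formula, if the distance in furlongs sits in, say, D2 (my layout, not the one in the file):

  =(D2*13.504)-10.239   (returns 57.28 when D2 holds 5)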

The examples I put up were only ideas...

Mike.
 

Attachments

  • LinearPars.xls
    32 KB
I think you can actually read Mike's data several ways and get a bit of a choice in results - for example:

The list of times unedited - 15th percentile would sit just below the 4th value of 56.09.
List minus outliers as chosen by Mike - the 15th percentile (of 17 values) would be between the 2nd and 3rd values of the revised list, ie between 56.88 and 56.90.

OR you could use the trendline between the range 0 and 25, and from my calculations the 15th percentile would be at the point where x=3.75, and using the y=mx+c formula for a straight line graph that would give you a value of 56.53 (rounded).
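
For the trendline reading the sum is just the straight line equation again - a quick check using Mike's slope and constant:

  =(0.0498*3.75)+56.342   (gives 56.53 rounded, the 15th percentile point on the trendline)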

Removing outliers is obviously a good idea; after that it's really a case of pick your method - as I argue regarding speed figures, it doesn't matter if the way you do it produces figures consistently 10% higher than mine, what matters is that the method is consistent so figures for different meetings can be compared with some degree of accuracy.

The standard time itself is, ultimately, a mythical beast - as the BHA will tell you, the Racing Post write the form book, so what we are seeing when we look at a standard figure is actually the Racing Post's idea of how long it will take an imaginary horse rated 100 to cover the ground on good going carrying 9-0. I believe they calculate it using a simple mean, ie grab a bunch of race times, 'correct' them (presumably using the wfa - spit - scale, you know, the scale that lets 3yo fillies trounce better 4yos in mid summer) and average out what you are left with.

It works because the method itself is consistent, and provided enough data is used, and provided outliers are removed, it's as good a method as any.

What you are going to get, if you develop your own ideas, is an alternative to the RP set of values - now to my mind this is no bad idea, provided you revisit the calculations every year or two, or when you are aware that a change has occurred at a track, to make sure your times haven't been invalidated.

There are of course significant problems doing this - firstly, tracks vary their actual race distances on a daily basis. The AW tracks are pretty clockwork-like at least, with few rail moves reported, but the turf tracks frequently move the rails in/out on the bends and change the race distance by the equivalent of a second or two. NH tracks not infrequently change the actual race distance by a furlong or more, which adds or subtracts 15 or more seconds to the time it normally takes to run the supposed race distance.

Another problem is getting accurate race times - the official figures from the RP need checking. I won't bang on about their sins again here, but doing your own standard times is doomed to - if not failure, then at least inaccuracy - unless you spend a fair amount of time checking times and distances of races.

If you opt for using median values or means to calculate standard times, I think you'd need to decide on something like using the 100 fastest times and removing outliers, so you are averaging or taking a median from a large amount of data.
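
Excel will even do a quick and dirty version of the trimming for you - a sketch, assuming the 100 fastest times sit in A2:A101 (my ranges):

  =TRIMMEAN(A2:A101, 0.1)   (mean after discarding the top and bottom 5% of the times)
  =MEDIAN(A2:A101)          (median of the full list)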

The final thing I'll pose here is this - if you recalculate standard times after, say, a year, you will almost certainly find some of them change, perhaps by a fair amount. Do you then go back and re-rate everything, or just apply the new values to future races? Is a new standard time that is faster than the old one a sign that your original value was a bit out, or does it show that the quality of horse running at the track has improved, or that the track configuration has changed marginally to lop a fraction off every race?

Dave
 
TheBluesBrother How many standard deviations are you using? Because it seems to me you go from 2 to 3. And where you are using 2, the time of 56.09 should have remained.

Also how far back do you take the win times from?
And could you show by example how you get the y figure?

Thanks
 
How many standard deviations are you using?

None.

Also how far back do you take the win times from?

I work with the fastest times available, which is very difficult when dealing with Irish data - you have to take the official distances with a big pinch of salt.

As I have already mentioned, I like to use the 15th percentile when calculating standard times; Dave Bellingham of Raceform uses the median figure.

I was using the 5f at Deauville as an example to calculate a standard time. When I originally worked on the French standard times I arrived at a standard time figure of 56.90s for 5f, and in the final example above the 15th percentile was 56.85s (56.9s). Originally Dave Edwards had speed figures for the French Galop racecourses, and his standard time for the 5f was also 56.9s. Do we all agree that the standard time for the 5f (1000m) distance at Deauville should be 56.9s?

And could you show by example how you get the y figure?

I have in the first post on this topic.

Mike.
 
TheBluesBrother So you are just using + or - 1.5 as the outlier determinant.
It's just that what I have read on the subject says you multiply the standard deviation by 2.2 (though I read somewhere this should be 1.423 (?)).
That figure becomes one deviation, and you use 3 of those.

y = 0.0498 - if you have shown how you get this figure then I cannot see it. I see your equation, but this side of maths was never my strong point.
 
y = 0.0498 - if you have shown how you get this figure then I cannot see it.

Excel calculates the equation.

So you are just using + or - 1.5 as the outlier determinant.

Yes, I am using a multiplier figure of 1.5, as used in SPSS. A white paper in 1987 suggested that you should use a figure of 2.2 as your multiplier, but as we are dealing with tenths of a second I use 1.5.

I would use standard deviations if I was calculating an upper or lower limit - in my days as a quality auditor we would use six sigma, or plus or minus 3 standard deviations, to set a control limit on a tolerance.
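
For what it's worth, those control limits are simple enough to set up in a sheet - a sketch, again assuming the times sit in A2:A26 (my ranges):

  =AVERAGE($A$2:$A$26)+3*STDEV.S($A$2:$A$26)   (upper control limit)
  =AVERAGE($A$2:$A$26)-3*STDEV.S($A$2:$A$26)   (lower control limit)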

Mike.
 
y = mx + c is the equation of a straight line, where m is the gradient of the line, which is the result of the calculation 'change in y divided by change in x'. The c is for 'constant' and is the y coordinate of your line at the x=0 point. In Mike's example, and others you'll see online, the equation might be written y = mx + b, but that's just swapping the letters - the values are the same ones.

The 0.0498 is the gradient of the line; the idea is that by multiplying this by your chosen x coordinate and then adding the constant you will derive the correct y coordinate for that value of x.

I'm only eyeballing the graph so my numbers will be close but not spot on, but you'll get awfully close to the same values as follows:

Gradient m is change in y over change in x - take the end points of the graph: when x = 0, y equals 56.342 (as explained above, this is the 'constant' value, the y coordinate when x = 0). For the other end, when x = 25, we eyeball the graph and see the y coordinate is about 57.58 or so (we're measuring the line's y coordinate, not that blob above the line). So the trendline change in y is 57.58 minus 56.34 approx, or 1.24 in total, while x has changed by 25 - 0 = 25. Change in y (1.24) divided by change in x (25) gives us the gradient m = 0.0496, which is pretty close to the more exact measurement that Excel did.

As Mike says, Excel does it for you - draw a trendline across some data and ask it to display the formula for the line. It doesn't hurt to know how it does it though.
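
If you'd rather have the two numbers in cells than read them off a chart label, these should match the trendline equation - illustrative ranges, assuming the sample numbers are in A2:A26 and the times in B2:B26:

  =SLOPE(B2:B26, A2:A26)       (the gradient m - Mike's 0.0498)
  =INTERCEPT(B2:B26, A2:A26)   (the constant c - Mike's 56.342)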

Mike calculated the value for x=5 (which is the 5th of 25 points along the line, so I'd tend to call that the 20th percentile); if I did the same calculation using the gradient I calculated, I'd get y = (0.0496 x 5) + 56.342 = 56.59, which is the same value Excel produced.


Z-scores are calculated as the data item value minus the mean of the data points, all divided by the standard deviation - ie a measure, in SDs, of how far a data point is from the mean. Mike's choice of 1.5 is just that - a choice - though clearly if he'd used a z-score cutoff of 2.2 he'd conclude that he had no outliers at all.

This all also proves that school text books titled 'Maths is Fun' are probably a bit wide of the mark.

Dave
 
Mike calculated the value for x=5 (which is the 5th of 25 points along the line, so I'd tend to call that the 20th percentile)

To solve the problem of the deleted outliers leaving you with a higher percentile figure, just replace the outliers with extra data points.

At the moment I am trying to find a strategy for playing black in the Queen's gambit at chess, currently I am 0.09 of a pawn behind, I'm getting there.

Mike.
 
To illustrate not only the relative intellectual qualities of Mike and myself, but probably also to reveal all sorts of social status blots: whilst Mike is solving chess problems, I spend my non-racing moments trying to kill aliens in XCOM 2 (War of the Chosen).

'To solve the problem of the deleted outliers leaving you with a higher percentile figure, just replace the outliers with extra data points.'
- that's what I did there - I gave the line an x dimension of 25 (5/25 = 20%) by allowing the extrapolation back to the y axis. This doesn't matter a jot of course, as I said above - as long as you are consistent in your method all is well.

Dave
 
tractorboy

Think of the equation as a set of rules that tells you how to draw a straight trend line on a graph - the trend line helps you visualise the relationship between two variables, the values you store in x and y. Once you've drawn the graph, or calculated the equation of the graph, it allows you to work out what the y value should be for any given value of x (or vice versa, although the x values are usually known).

So, for any chosen value of x, the y value will be....

- is basically what the equation and graph tell you. In the graph Mike has drawn, x is just 'sample number' - ie when x is 10, that just means it's the tenth value in his list. It's a 5f graph/result not because he uses x=5 - it's a 5f graph/result because all the values he used came from 5f races. (When you think about it this makes sense: if x = furlongs, then feeding a value of 10 in for x would give a result of 56.84, which is clearly not the standard time for 10f.)

As the graph goes from 0 to 25, which is actually 26 values of course, then the n-th percentile would be at a position where

n = (x coordinate * 100)/26 - nb * means multiply, it saves confusing it with x

This rearranges to a slightly more useful

x = (n * 26)/100 so if you want the 15th percentile (n=15) the equation becomes x = (15 * 26)/100 = 390/100 = 3.9 - so you'd read the y value off the graph for when x=3.9

20th percentile x = (20 * 26)/100 = 520/100 = x coordinate 5.2

A straight line graph explains how, as one variable, x, changes, it affects a second variable, y... x is known as the independent variable and y is the dependent variable. The 'shape' of the graph - ie straight, curved, bell-shaped, which direction it goes across the page - can give us a strong guide to the relationship of the two data sets. Of course you might get completely chaotic data, and when you can't join the dots to make a recognisable shape that can be a pointer that y doesn't depend on what you do to x. (For example, graph people's shoe size against their IQ and you'll not find much of a relationship visible on the graph, but if you map shoe size against height it would probably look like a lot more of a link was present.)

Dave
Double Chemistry next for all you slackers smoking behind the bike shed
 
Double cookery eh? (My ex-colleagues always loved it when I referred to the chemists as 'the cookery dept' and biology as 'gardening'.)

Actually, as I'm sure has been spotted, I tend to overdo the explanatory stuff - on purpose - as many of my pupils of the past would struggle to understand things that the physics graduates who'd written the books thought were obvious. I try to ensure there is no wriggle room to misunderstand things, and regularly attend masked book burnings for computer 'how to' guides.

Dave
 