Scraper/Method to obtain Top Speed and RPR from Racing Post Website into Excel

giuseppe_esq · Jul 21, 2018

davejb said:
Kalmar
giuseppe_esq

I've found that the Scoop 6 lines on the daily racecards on RP can cause an error when downloading cards, and programmed a quick 2 line fix for it to skip these lines, so if you've had the program stop with an error (today's cards should cause this problem for example) this will fix it.

Rather than zip up the whole program again, I'm just making the revised exe file available - ie the program code itself, without all the DLLs etc that you will already have installed. All you need to do is swap the exe file linked below for the giusep.exe that you already have from the program installation: just delete the old one and copy the new version in its place.

I'd appreciate knowing this works okay for you.

Dropbox - giusep.exe

Dave

Hi Dave,

Thanks for this.

I ran your script today and it seemed to work OK, without any errors.

I will download your revised script later.

Cheers
Giuseppe

davejb · Jul 21, 2018

Okay,
the Racing Post pages sometimes have a section labelled 'Worldwide Stakes' or something of that sort, and for some reason you get an 'access denied' sort of error trying to download those pages, which my original program bypasses. I found last night that the 'Scoop 6' cards on the site were causing the same problem, so added 'scoop 6' to the 'worldwide stakes' as something to ignore.

If you have no problems from the previous script then it's your choice, but I'd add the new version if possible just in case the problem surfaces later. There's always a possibility that additions/changes to the website pages will 'break' a scraper of course, and if those additions only appear occasionally it can take a while to fully bulletproof things.

Dave

davejb · Jul 23, 2018

Sorry to hijack your thread again Giuseppe_esq,

I've a couple of people asking me to write code to access other websites that they'd like to collect data from, and I'd like to explain my position on this.

If anyone is struggling to do something that I am already programming for, then I am more than happy to share my work to help other members out - but I have no desire to become an unpaid coder spending my time producing programs other people would like but cannot produce themselves. I am busy enough with the daily processing of data to keep my own race ratings etc current, and the programming I do to support this.

Sharing my work is fine, there is a lot of information on UK/Irish racing on my 'Early Days' daily thread that people are free to use as they wish, if (as happened with Giuseppe's request) I'm told somebody is having problems doing something I have already done then I will gladly try to help them.

Dave

davejb · Jul 23, 2018

That makes you a kiddie coder mate -
I'm 63!

Like any other language Python is logical in how it does things - putting different subroutines into their own 'def' structures can take a bit of getting used to (the global/local variable stuff is old hat, but Python doesn't always seem to get the scope right.... maybe Python 3 would handle it better). For scraping, many websites are fairly straightforward, they're just big text files that you can edit and grab from with ease - the bit that held me back for quite a while was getting a site to let me in. One of the admin types on here pointed me towards that

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)

... code and after that it was pretty much plain sailing. I've no idea how to simulate clicking on a form opener on a site, which is annoying as it means I have to cut and paste from the screen into a file, then let my program edit it - this is how I do the sectionals from ATR for my AWcard.

I really don't mind swapping info and if somebody is trying to get to grips with it I can probably save them the drag I had to endure when starting out, it's just that I want to knock the idea that I will write programs on request for all comers as if I was stuck for something to do all day

There are one or two folk on here who have helped me over the past year, they of course DO get to at least ask if I can code something up for them, as it really is a case of getting back what you put in on this site.

Good luck, if you get stuck you can always PM me about it.
Dave

ArkRoyal · Jul 23, 2018

davejb said:
Good luck, if you get stuck you can always PM me about it.

Or start a thread in the Gambling Geeks section. :handgestures-thumbup:

davejb · Jul 23, 2018

Somebody would have to tell me about it, I'm dreadful at spotting new stuff that I ought to be watching - I prefer to just stalk Mick.

Dave

ArkRoyal · Jul 23, 2018

https://www.theukbettingforum.co.uk/XenForo/categories/gambling-geeks.113/

valiant thor · Jul 24, 2018

Hi

davejb
Have you tried using requests instead of urllib2, easier to do a lot more things

Code:

import requests
from bs4 import BeautifulSoup as bs

url='http://xxxxxxxxxxxxxxxxxxx'
header={'user-agent':'Mozilla/5.0'}

r=requests.get(url,headers=header)

davejb · Jul 24, 2018

Hi

valiant thor
Yes, I've used requests - some of the parsing is a lot easier. To be truthful, it's mainly just down to the fact that I managed to successfully get into the RP using urllib2. I also find that line parsing using the standard language facilities sometimes helps me understand things a wee bit better than when I use re to extract fields. Whilst I certainly wouldn't put anyone off using tools like re, I don't find it any harder not to use them.

Dave

valiant thor · Jul 24, 2018

davejb
Not really done any real scraping for a long time now, and to say I'm a little ring rusty is an understatement :oops:

Trying to get back into it to build myself an automated database, but
a) the old brain matter isn't what it was
b) websites are a lot more complex nowadays (well, it seems that way)
I find requests cuts out a lot of lines of coding compared to urllib2, but each to their own, whichever is easiest. That's the beauty of Python: more than one way to skin a cat, and as long as you get the result back that you want it's job done :handgestures-thumbup:

I wouldn't use re (regex) for parsing, it's like making an easy job harder IMO

VT

davejb · Aug 6, 2018

I've had a word with the chap asking for help, and I thought I'd put a bit of info about the RP site (and a little about Timeform) to give folk trying to grab results/cards some idea of the structure/process to accomplish that.

Apologies for any typos, my computer can't spel fur toffy.

1) First off, you'll just get a load of error messages about failing to connect unless you tell the RP site that you are a browser. I mentioned this above, but basically you need to send a header to the web address you are accessing so it'll respond with the data you want.
eg:
Make sure you've included a library/module to play with URLs. I use urllib2, so you need a line reading
import urllib2

at the top of your code. After that the header business is covered by:

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
urllib2.install_opener(opener)
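For anyone on Python 3, where urllib2 was folded into urllib.request, the same header trick might look something like the sketch below. The names build_request and fetch are placeholders of mine, not part of the program described in this thread, and fetch is only defined here, not called.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0"  # same browser-style header value as in the post above


def build_request(url):
    """Return a Request that announces itself with a browser-style User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url):
    """Download a page as text (needs network access, so it is not called here)."""
    with urllib.request.urlopen(build_request(url)) as page:
        return page.read().decode("utf-8", errors="replace")


req = build_request("https://www.racingpost.com/racecards/2018-08-06")
print(req.get_header("User-agent"))
```

Passing the header on each Request does the same job as install_opener, just without changing the global opener.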

2) RP cards - when you navigate in your web browser to the cards page, you go to a single webpage that contains a series of addresses, each address being the actual card for one race. So if you want to download all the cards for a day you have to go to that day's web page, then run through the page stripping out the web page addresses for each race.

addy = "https://www.racingpost.com/racecards/"+str(target)   # see (A) at end of this post
print "Opening "+addy
page = urllib2.urlopen(addy)
pagestring = page.read()
rowlist = pagestring.split("<tr>")

- This will open that first page. The variable 'target' is a string containing the date of the day you are looking for, in yyyy-mm-dd format, so 6 August 2018 would be covered by saying

target = "2018-08-06"

...and that urlopen line will load the page at "https://www.racingpost.com/racecards/2018-08-06" into the variable 'page' (see (B) at end of this post).

The next couple of lines start getting you ready to locate the web addresses of the actual cards to download, storing the information in rowlist.
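As a rough Python 3 sketch of this step, the day's address can be built from a date object and the downloaded text split into rows. The function names are hypothetical and the little HTML string is made up purely to show the split, it is not real Racing Post markup.

```python
import datetime

BASE = "https://www.racingpost.com/racecards/"


def racecards_url(day):
    """Build the address of the daily cards page for a datetime.date."""
    return BASE + day.strftime("%Y-%m-%d")


def split_rows(pagestring):
    """Chop the page text into chunks, one per table row, as rowlist does above."""
    return pagestring.split("<tr>")


url = racecards_url(datetime.date(2018, 8, 6))
rows = split_rows("<table><tr><td>one</td></tr><tr><td>two</td></tr></table>")
print(url)
print(len(rows))
```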

3) The way I do this myself is to scan through rowlist using a simple loop, basically creating a text file with each line of HTML from the original page, but there's a bit of an edit going on, using a flag called 'got_world' that is set up just before the loop starts.

Some lines from the cards file really don't like to play: anything in the bottom 'extras' on a daily card, by which I mean those sections that list the Scoop 6 races or the Worldwide Stakes races. If the races are from UK/Eire, these will simply be repeats of lines from the main section of the page anyway, so I use got_world as a flag (proper coders would use true/false, as it's really a boolean decision, but I can't really get too excited, this works...). When I find a line reading 'Scoop 6' or 'Worldwide Stakes', got_world flips to 1 and my program ignores the rest of the page.

got_world = 0

for row in rowlist:
    rowcontentlist = row.split("\n")
    for lineofhtml in rowcontentlist:
        outline = lineofhtml+"\n"
        if lineofhtml.find("data-accordion-row=worldwide-stakes-races>") <> -1:
            got_world = 1
        if lineofhtml.find("SCOOP 6") <> -1:
            got_world = 1
        if got_world == 0:
            # this line stores the lines I want to copy in the rpdump.txt file,
            # which is used for temporary storage
            rpdump.write(outline)
            linecount +=1
rpdump.close()
linecount -=1

Linecount tells me how many lines of text I've just copied, so I know how many to read through to extract data. This is just one way to do stuff, there are more elegant methods to do most of what I do, but I've found over the years that keeping things simple and as basic as possible makes it a lot easier to figure out when revisiting or trying to extend the code.
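The same filtering idea can be sketched in Python 3 by stopping as soon as one of the two marker strings turns up, which does the job of the got_world flag. This is a sketch, not the original program: the function name is mine, and the sample text is invented just to exercise it.

```python
STOP_MARKERS = ("data-accordion-row=worldwide-stakes-races>", "SCOOP 6")


def lines_before_extras(pagestring):
    """Keep every line of the page up to the first Scoop 6 / Worldwide Stakes marker."""
    kept = []
    for row in pagestring.split("<tr>"):
        for line in row.split("\n"):
            if any(marker in line for marker in STOP_MARKERS):
                return kept  # everything after this point is ignored
            kept.append(line + "\n")
    return kept


sample = "first\nsecond<tr>third\nSCOOP 6 races\nfourth"
kept = lines_before_extras(sample)
print(len(kept))
```

len(kept) plays the part of linecount, so there is no separate counter to keep in step.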

I then use

rpdump = open(path+"rpdump.txt","r")
lines = rpdump.readlines()

.... to read my temporary file of text lines into an array of lines - now by array I don't mean a python array structure, I'm talking in more general coding terms, 'lines' is essentially a single dimensional array of text lines that I can then pull specific lines from, as we'll see below when I feed each line in turn into the variable 'outline' ....
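In Python 3 the read-back step might be written with a 'with' block so the file is closed automatically. The sketch below uses a throwaway temporary file standing in for rpdump.txt, and read_lines is a name I've made up for illustration.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file into a list with one entry per line (newlines kept)."""
    with open(path, "r") as handle:
        return handle.readlines()


# Demonstration with a throwaway file standing in for rpdump.txt.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "rpdump.txt")
    with open(demo_path, "w") as handle:
        handle.write("first line\nsecond line\n")
    lines = read_lines(demo_path)

print(len(lines))
print(repr(lines[1]))
```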

4) I'm not sticking all the code in here by a long way; anyone putting any effort in will manage to add what's needed, or I'll answer PMs for anyone who is stuck. The next job is to strip out the web address for each race we have copied from that first 'front' page...

So I kick off with an
lnr = 0 which is just being used as a line counter as I go through the file, and ...

while lnr < linecount:
    isokay = 0
    tracknumber = 0
    outline = lines[lnr]
    a = outline.find("<a class=\"RC-meetingItem__link js-navigate-url")
    if a <> -1:

- so what this is doing is going through my rpdump.txt info a line at a time, looking for what I think of as key phrases - a key phrase being a section of html code that flags up certain data items. When looking for the html address info for each race you'll see it is preceded by the text

RC-meetingItem__link js-navigate-url

By using 'outline.find' the line of text is read, and if that 'key phrase' is in it a number will be returned telling me how many characters along the line I can find it; failing to find it returns -1. So I can test whether the result equals -1 to find out if my key phrase is in a line or not. In this case 'if a <> -1' means that if a doesn't equal -1 then the key phrase is in the line.
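A tiny Python 3 illustration of the find test, using the key phrase from above. The two sample lines are made up, and has_phrase shows the 'in' operator as an alternative way of asking the same yes/no question.

```python
KEY_PHRASE = "RC-meetingItem__link js-navigate-url"


def phrase_position(line, phrase=KEY_PHRASE):
    """Return how far along the line the phrase starts, or -1 if it is absent."""
    return line.find(phrase)


def has_phrase(line, phrase=KEY_PHRASE):
    """Same decision as 'find(...) != -1', written with the 'in' operator."""
    return phrase in line


hit = '<a class="RC-meetingItem__link js-navigate-url"'
miss = "<td>nothing to see here</td>"
print(phrase_position(hit), phrase_position(miss))
```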

5) The next few lines work as follows (this bit follows on from the -1 test above).

I read the line AFTER the one where I found the 'key phrase' into inline. The key phrases, once identified, tell you where a data item is, but in many cases the data is on the following line, or two lines after the key phrase, and so on. In this case we get the key phrase on one line and can be confident the next line holds the meeting info we want -

        inline = lines[lnr+1]
        instr = inline
        pos = instr.find("/racecards")   # another 'key phrase' that checks we are looking at a card
        instr = instr[pos:-1]            # the next few lines strip all the text from 'racecards' to the end of the section we need
        instr = instr.strip("\"")
        pos = instr.find("/racecards")
        substr = instr[pos+11:]
        pos = substr.find("/")
        substr = substr[:pos]
        tracknumber = int(substr)        # from this we now have the RP track number

The RP track number is a unique number allocated to each track, MOST UK tracks have a number between 1 and 109, but there are tracks outside this range, such as the AW tracks, Irish are mostly in 174 to 203, again with exceptions like Dundalk (1138) - by skipping tracks that have a non UK ident you will avoid having to wade through Australian cards and the like.
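A hedged Python 3 sketch of pulling the track number out of a card address and checking it against those ranges. The function names are mine, and the EXTRA_TRACKS set is illustrative only: it holds Dundalk from the paragraph above plus Cork (596), which appears in the sample list of addresses later in this thread, and is certainly not a complete list of the exceptions.

```python
UK_RANGE = range(1, 110)        # "between 1 and 109"
IRISH_RANGE = range(174, 204)   # "mostly in 174 to 203"
EXTRA_TRACKS = {596, 1138}      # Cork and Dundalk only; an incomplete, assumed list


def track_number(path):
    """Extract the RP track number from a path like /racecards/49/ripon/..."""
    parts = path.strip().strip('"').split("/")
    # parts looks like ['', 'racecards', '49', 'ripon', '2018-08-06', '707117']
    return int(parts[2])


def is_uk_or_irish(number):
    """Rough UK/Eire test based on the ranges quoted in the post."""
    return number in UK_RANGE or number in IRISH_RANGE or number in EXTRA_TRACKS


number = track_number("/racecards/49/ripon/2018-08-06/707117")
print(number, is_uk_or_irish(number))
```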

Meetings with RP track numbers in the UK/Eire range I write the line out to a file called rpcards<date>.txt which therefore builds up to be a list of web address info - here's a sample:

Part two will follow - I exceeded the page limit!

(A) The web address used was addy = "https://www.racingpost.com/racecards/"+str(target) - but I can't stop this site software 'helping' by turning it into a stupid bloody title, grrrr - turning off bbcode and 'pasting as text' seems to have no effect...

(B) Similarly, the page loaded is "https://www.racingpost.com/racecards/"+str(target).

davejb · Aug 6, 2018

Part 2:

/racecards/49/ripon/2018-08-06/707117
/racecards/49/ripon/2018-08-06/707120
/racecards/49/ripon/2018-08-06/707115
/racecards/49/ripon/2018-08-06/707114
/racecards/49/ripon/2018-08-06/707116
/racecards/49/ripon/2018-08-06/707118
/racecards/49/ripon/2018-08-06/707119
/racecards/596/cork/2018-08-06/708870

6) If you look at that sample you'll probably figure that these are in fact the individual card pages you are after, but they are all missing the first (identical) part of the page address. All you need do to complete each address is bolt the text

https://www.racingpost.com

onto the start of each line.

6, continued) More looping - once the rpcards<date>.txt file has been completed, you then run through it a line at a time, bolting that bit on at the start and loading each web page in turn. As you load each page you scan through each line of it looking for more of those 'key phrases' that signal a data item you want is near, and grab it. This is all down to working out the 'key phrases' to look for, deciding how many lines to skip from the phrase to find the actual data line, and string slicing to extract what you are after.

You have a big advantage in all this - being computer generated (with some human input, but not a lot) the card (and results) pages are formatted the same all the way through. If the OR is 2 lines after a line reading 'here's the OR then' on one page it will be 2 lines after on every page, every day, until they change the web page format in a site revamp.
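The looping step could be sketched in Python 3 along these lines. The function names and the pause between downloads are my own assumptions rather than anything from the original program; the downloader is passed in as a parameter so the loop can be tried out with a stand-in before pointing it at the real site.

```python
import time

SITE = "https://www.racingpost.com"


def card_urls(path_lines):
    """Turn saved lines like '/racecards/49/ripon/...' into full addresses."""
    urls = []
    for line in path_lines:
        path = line.strip()
        if path:
            urls.append(SITE + path)
    return urls


def fetch_all(urls, fetch, delay_seconds=2.0):
    """Download each page in turn, pausing between requests to go easy on the site."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay_seconds)
    return pages


sample = [
    "/racecards/49/ripon/2018-08-06/707117\n",
    "/racecards/596/cork/2018-08-06/708870\n",
]
urls = card_urls(sample)
# A stand-in downloader so nothing is actually requested here.
pages = fetch_all(urls, fetch=lambda url: "<html>" + url + "</html>", delay_seconds=0)
print(urls[0])
```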

What I do is list all the key phrases I'm after, one after the other, then run a bunch of conditional statements (IFs) to sort out each type of data.... eg

lines = rpdump2.readlines()
lnr = 0
while lnr < linecount:
    outline = lines[lnr]
    a = outline.find("RC-cardPage-runnerName")
    b = outline.find("data-order-rpr")
    c = outline.find("RC-runnerFormRow__rpr")
    d = outline.find("RC-runnerFormLink__results")
    e = outline.find("data-test-selector=\"RC-runnerFormRow__outcome")
    f = outline.find("<a href=\"/profile/horse/")
    g = outline.find("title>")

-- several lines have been skipped to get to this point, you do need to do some typing, I'm only trying to show you how to do it, not provide a complete script remember....

Right, above are 7 of my key phrases, each of which precedes a specific data item. If a phrase is found on a line then the variable (a, b, c etc) associated with it will turn from -1 to some non-negative value. So most of the time a line that gets read in will leave a to g all at -1, but when a key phrase is found, the ONE of a to g that is associated with that data item will change. This is easy to check then....
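One way to sketch the same idea in Python 3 is a small table of key phrases that is checked against every line. The phrases are the ones listed above, but the field labels on the left and the sample lines are made up for illustration.

```python
KEY_PHRASES = {
    "runner_name": "RC-cardPage-runnerName",
    "master_rpr": "data-order-rpr",
    "form_rpr": "RC-runnerFormRow__rpr",
    "form_result_link": "RC-runnerFormLink__results",
    "form_comment": 'data-test-selector="RC-runnerFormRow__outcome',
}


def find_hits(lines):
    """Return (line number, field label) for every key phrase spotted."""
    hits = []
    for number, line in enumerate(lines):
        for field, phrase in KEY_PHRASES.items():
            if phrase in line:
                hits.append((number, field))
    return hits


sample_lines = [
    '<a class="RC-cardPage-runnerName"',
    "Some Horse</a>",
    '<div data-order-rpr="85">',
]
hits = find_hits(sample_lines)
print(hits)
```

Each hit tells you which extraction routine to run and which line to start counting from.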

7) Here's the first couple in their entirety then:-

    # name of horse

    if a <> -1:
        inline = lines[lnr+1]
        instr = inline
        pos = instr.find("</a>")
        instr = instr[:pos]
        instr = instr.strip(" ")
        instr = instr.replace("'","")
        horse_name = instr+","

    # master rpr

    if b <> -1:
        inline = outline
        instr = inline
        pos = instr.find("=")
        instr = instr[pos+2:pos+5]
        if instr == "-\">":
            instr = "-"
        instr = instr.strip("\"")
        horse_rpr = str(instr)+","

If the key phrase 'RC-cardPage-runnerName' is found, then a will go to a non-negative value, ie a will not equal -1, and that first block of code runs. As we see, the horse name is actually on the line after the key phrase, and a bit of simple string slicing extracts it - I then add a comma so when I write it to a CSV it'll go into a cell on its own.

The master RPR value for a horse is found if b triggers, ie the key phrase "data-order-rpr" was found. This time we don't need to skip any lines: we look for the = sign in the text, then do a few extra things to cope with a horse without a master rpr (master rpr = "-"), strip off a stray quote character, and then store it with a comma ready to put into a csv.....
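A Python 3 sketch of the two extractions, written as small functions. Reading the value between the quotes after the attribute name is a slightly different technique from the fixed three-character slice above, swapped in here because it copes with values of any length. The function names and the sample lines are assumptions for illustration, not real page content.

```python
def name_before_closing_anchor(line):
    """Take the text before '</a>', trimmed, with apostrophes removed."""
    end = line.find("</a>")
    if end == -1:
        return None
    return line[:end].strip().replace("'", "")


def attribute_value(line, attribute):
    """Return the quoted value following attribute=, or None if it is not there."""
    marker = attribute + '="'
    start = line.find(marker)
    if start == -1:
        return None
    start += len(marker)
    end = line.find('"', start)
    if end == -1:
        return None
    return line[start:end]


print(name_before_closing_anchor("    Tam O'Shanter</a>\n"))
print(attribute_value('<div data-order-rpr="85">', "data-order-rpr"))
print(attribute_value('<div data-order-rpr="-">', "data-order-rpr"))
```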

The same is done for c to g. To save you having to check, c to g find:
c - RPR of a single run in the form lines
d - data of the race
e - the race comment
f - this one was supposed to extract horse profile info but draws nothing useful; I think this is one of the data items (My Rating, MR, is another) that loads after the page itself has already loaded, and therefore has no value when you scrape it
g - more profile stuff, equally empty

Obviously, finding the OR etc requires key phrases to be figured out and similar routines written. I get the OR info from my HorseraceBase data, so I don't scrape it from RP, but

data-test-selector="RC-runnerFormRow__or"

looks like it appears 8 lines before the OR of a horse in its form section. You can use a number of phrases for 'today's OR' - this is the code for Unite the Clans in the last race on today's list:


71 

- The OR is 71, so you could search for 'data-order-or=' and strip out from the = to the < to extract the value, or use
data-test-selector="RC-cardPage-runnerOr"

and copy the OR from either the next line (as above) or grab 2 lines later and knock the off..

8) Having extracted each bit in turn just write them to a text file masquerading as a csv, and shove a newline "\n" after the final data item on each line to go onto the next horse and row in the spreadsheet. In my version I collect the rpr last, so when I've got an rpr for my runner it triggers a line write to my final file.
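If you'd rather not glue the commas and newlines on by hand, Python's csv module will do it and will also quote any value that happens to contain a comma. A minimal Python 3 sketch, with made-up runner names and column headings:

```python
import csv
import io


def write_rows(rows, handle):
    """Write a header line and then one line per runner to an open text file."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["horse", "rpr"])
    for row in rows:
        writer.writerow(row)


# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_rows([("Example Horse", "85"), ("Another, Horse", "-")], buffer)
print(buffer.getvalue())
```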

Timeform results are not dissimilar. To grab results information from Timeform you download the 'front page' from the results, then scan through it finding the address for each race's results, then download each results page to grab data. You can get a sort of 'quick' result from the front page, but if you want race times you need to grab each page, one per race, similar to the RP cards.

RP Results are a single page effort, the front page of the results section has most of what you are likely to need, so it's a one page download and then loop through it stripping data - if you want the in running comments etc you'll need to download each race result 'full result' page and you're back on the one page per race style of working.

If you struggle to find a suitable key phrase, make a note of a runner or two and the value you are looking to find. For example, by looking at the final racecard of the day I saw it was won by Unite the Clans with an OR of 71 - you can find the same stuff I did by looking for Unite the Clans (simple text search in notepad/wordpad). Once you've found the name you can be fairly confident the rest of that horse's data will be somewhere below - so call up find again, type 71 into it, and you are very likely to end up with the data line showing Unite the Clans' current OR value.
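The same detective work can be sketched as a small Python 3 helper that finds the runner's name, then the first later line containing the known value, so you can see how many lines apart they sit. The function name and the sample lines are invented for illustration.

```python
def locate_value(lines, runner_name, value):
    """Return (line of the name, line of the first later match for value)."""
    name_at = None
    for number, line in enumerate(lines):
        if name_at is None:
            if runner_name.lower() in line.lower():
                name_at = number
        elif value in line:
            return name_at, number
    return name_at, None


sample_page = ["<html>", "  Unite The Clans</a>", "<span>", "71", "</span>"]
name_line, value_line = locate_value(sample_page, "Unite the Clans", "71")
print(name_line, value_line)
```

The lines around value_line are where to look for a key phrase worth using.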

Okay, this is quite a lot, but only a fraction of the code for grabbing RP card info, the rest should be quite straightforward to generate really - it's only string slicing and file operations for the most part, I was really trying to explain the structure of the data and show you a path through it that allowed it to be collected.

Questions of a general nature are fine here, but if anyone needs more of the code or wants to have something explained in depth it's probably best to PM me, I'm more likely to notice the request for a start!

I would prefer it if any code I put up is not passed on elsewhere - I wouldn't want everybody and his dog trying odd snippets out just to see what happens and end up with some sort of accidental DOS attack on the RP site, it's in all our interests to ensure the RP can run their site without problems caused by poorly understood scripts.

Final point - please remember I don't code on demand. I code for my own use and am happy to share information, but that doesn't translate into coding other people's ideas up because they can't. (My code's too sledgehammer-like to attract too much business anyway.)
Dave

Apologies - I can't stop the smilies appearing (tried bbcode off, no workee). If there's a smiley in the code it'll be where a colon and a p follow each other, or similar, ie

a_string = substr[pos1 : pos2] - if you don't space that out then you get a sticky out tongue smiley from the ':p' of ':pos'.

giuseppe_esq · Aug 19, 2018

Hi

davejb

Thanks to your help with a scraper, I have used some of the data to create a data mine to try and get some selections. You are a true genius when it comes to programming.

The link for today's selections is below

https://www.theukbettingforum.co.uk...lections-using-data-mining.93262/#post-369732

Regards
Giuseppe

davejb · Aug 19, 2018

My code is pretty simple stuff really, I suspect anybody who took pride in their work would probably do it a lot more elegantly in rather less lines of code. Half the battle is having a good idea of what you are trying to do, and having some sort of clue about the logic of how things hang together = add in 50% bloody mindedness and you're done!

I'm glad you found the help did the job, enjoy yourself seeing what you can do.

Dave

giuseppe_esq · Aug 19, 2018

Hi Dave,

I have worked with databases and in Business Intelligence for about 10 years.

I don't always develop things in the most efficient way. My key focus is developing what the customer wants: something that works, is robust and does the job.

Your script task is all three of these things. It scrapes data quickly, it works and it gives exactly what I needed. I cannot thank you enough and express how grateful I am for the time you put into creating the programme.

If I ever win a small amount I will send you something to buy yourself a drink or three.

Though looking at my results today, you might have to wait a while.

I have nothing but a great deal of respect for you.

Giuseppe

davejb · Aug 19, 2018

Cheers Giuseppe,
as for your efforts today - I shouldn't worry too much about them, my own ratings got one winner (the first of the day at odds of 5/6) and nothing else won. You get rogue days like this every so often, along with losing runs and short bursts of good days. The trick is to figure out which of the races you rated will produce the results you have calculated.

Dave

Laugro1968 · Nov 13, 2019

davejb

Hello Dave,

I see that the link you posted for the scraper isn't active anymore.
Would you mind posting the link again, or sending it to me privately?

Thank you and best regards,
Laurent

davejb · Nov 13, 2019

Hi Laurent,
sorry but I don't actually maintain programs that I don't use myself, and I've had a PC change and a total software reinstall since producing most of the various scrapers people asked me for. I have two files currently stored (still) on Dropbox that provide scraping code, they are linked below. giusep.zip is of course the one I provided to Giuseppe_esq, what it did I, quite frankly, don't remember! The other grabs RP results data - feel free to use either if they work okay for you. Download and unzip them into a convenient folder, double click the exe files inside them once decompressed to run - they'll pop up a cmd window to run in and close it on exit. Output files go to the folder the programs are run in.

Please note that I do not maintain these, if they run, great, happy to oblige, but if they no longer work then I don't start recoding etc to make them work - I just have far too much on my own plate!

Dropbox - www.dropbox.com

Dropbox - www.dropbox.com

Dave

Laugro1968 · Nov 13, 2019

Thank you Dave!
The giuseppe.exe is supposed to return the OR, TS and RPR data from the Racing Post.
Unfortunately the .csv and .xls files it returns are empty. I guess the Racing Post must have changed their website structure.
I totally understand that you are no longer supporting the file. I will find a solution.

Thank you for sharing!
Laurent

davejb · Nov 13, 2019

Your luck is in, Laugro1968, because the problem is simply the same one I solved last night when I found the cards would no longer download - RP have made a tiny change to their page code and I had to change the programs I use to suit.... so after a quick look (I found a copy of the original code) I realised this was a problem I had already fixed.

Delete the existing files from the first download, then follow the install process with this replacement version and you should be okay. I'm attaching a copy of the download I got a few minutes ago using this program, for reference.

Dropbox - www.dropbox.com

Dave
