Part 2:
/racecards/49/ripon/2018-08-06/707117
/racecards/49/ripon/2018-08-06/707120
/racecards/49/ripon/2018-08-06/707115
/racecards/49/ripon/2018-08-06/707114
/racecards/49/ripon/2018-08-06/707116
/racecards/49/ripon/2018-08-06/707118
/racecards/49/ripon/2018-08-06/707119
/racecards/596/cork/2018-08-06/708870
6) If you look at that sample you'll probably figure that these are in fact the individual card pages you are after, but they are all missing the first (identical) part of the page address. All you need to do to complete each address is bolt the text
Horse Racing Cards, Results & Betting | Racing Post
onto the start of each line.
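A rough sketch of that bolting-on step in Python. BASE_URL is an assumed value for the missing first part of the address, and complete_addresses is just an illustrative name; swap in whatever the real first part is.

```python
# BASE_URL is an assumption standing in for the missing first part of each address.
BASE_URL = "https://www.racingpost.com"

def complete_addresses(partial_lines, base=BASE_URL):
    """Bolt the base address onto the start of each non-empty line."""
    full = []
    for line in partial_lines:
        path = line.strip()  # drop the trailing newline read from the file
        if path:
            full.append(base + path)
    return full

sample = [
    "/racecards/49/ripon/2018-08-06/707117\n",
    "/racecards/596/cork/2018-08-06/708870\n",
]
for address in complete_addresses(sample):
    print(address)
```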
6) More looping - once the rpcards<date>.txt file has been completed, you then run through it a line at a time, bolting that bit on at the start, loading each web page in turn. As you load each page you scan through each line of it looking for more of those 'key phrases' that signal a data item you want is near, and grab it. This is all down to working out the 'key phrases' to look for, deciding how many lines to skip from the phrase to find the actual data line, and string slicing to extract what you are after.
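The outer loop could be sketched along these lines. This is not the script described in this post, only one way to arrange it: process_cards, fetch and handle_page are hypothetical names, and the download function is passed in so the loop makes no assumption about how pages get loaded. The pause between pages is my own addition to keep the load on the site down.

```python
import time

def process_cards(addresses, fetch, handle_page, delay_seconds=2.0):
    """Load each address in turn and hand the page's lines to handle_page.

    fetch is any function taking an address and returning the page text.
    """
    count = 0
    for address in addresses:
        text = fetch(address)
        handle_page(address, text.splitlines())
        count += 1
        time.sleep(delay_seconds)  # pause so the site is not hammered
    return count

# Stand-in fetch for trying the loop out without touching any web site.
def fake_fetch(address):
    return "line one\nline two"

seen = []
process_cards(["/a", "/b"], fake_fetch,
              lambda addr, lines: seen.append((addr, len(lines))),
              delay_seconds=0)
print(seen)
```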
You have a big advantage in all this - being computer generated (with some human input, but not a lot) the card (and results) pages are formatted the same all the way through. If the OR is 2 lines after a line reading 'here's the OR then' on one page, it will be 2 lines after on every page, every day, until they change the web page format in a site revamp.
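That 'key phrase plus a fixed number of lines' idea boils down to a small helper. A minimal sketch, where value_after is a made-up name and the sample page is invented purely to show the skip count at work:

```python
def value_after(lines, key_phrase, skip):
    """Return the stripped line `skip` lines after the first line containing
    key_phrase, or None if the phrase or the target line is missing."""
    for i, line in enumerate(lines):
        if key_phrase in line:
            target = i + skip
            if target < len(lines):
                return lines[target].strip()
            return None
    return None

# Invented page fragment: the value sits 2 lines after the key phrase.
page = ["<div>", "here's the OR then", "<span>", "71", "</span>"]
print(value_after(page, "here's the OR then", 2))
```

If the site format changes, only the phrase and the skip count need updating.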
What I do is list all the key phrases I'm after, one after the other, then run a bunch of conditional statements (IFs) to sort out each type of data.... eg
lines = rpdump2.readlines()
lnr = 0
while lnr < linecount:
    outline = lines[lnr]
    a = outline.find("RC-cardPage-runnerName")
    b = outline.find("data-order-rpr")
    c = outline.find("RC-runnerFormRow__rpr")
    d = outline.find("RC-runnerFormLink__results")
    e = outline.find("data-test-selector=\"RC-runnerFormRow__outcome")
    f = outline.find("<a href=\"/profile/horse/")
    g = outline.find("title>")
-- several lines have been skipped to get to this point; you do need to do some typing. I'm only trying to show you how to do it, not provide a complete script, remember....
Right, above are 7 of my key phrases, each of which precedes a specific data item. find() returns -1 when the phrase isn't on the line, so on most lines a to g will all be -1. When a key phrase is found, the ONE of a to g (a, b, c etc) associated with that data item will change from -1 to the phrase's position in the line (zero or higher). This is easy to check then....
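To see the -1 behaviour in isolation, here is a tiny sketch. The sample lines are invented for illustration and are not real page content.

```python
# Hypothetical page lines, made up for illustration.
sample_lines = [
    '<a class="RC-cardPage-runnerName" href="#">',
    "title>Some page title",
    "<div>nothing of interest</div>",
]

for text in sample_lines:
    a = text.find("RC-cardPage-runnerName")
    g = text.find("title>")
    # Compare against -1, not 0: a phrase at the very start of a line gives 0.
    if a != -1:
        print("runner name phrase at position", a)
    if g != -1:
        print("title phrase at position", g)

# Collect the (a, g) pair for every line to show the pattern at a glance.
results = [(t.find("RC-cardPage-runnerName"), t.find("title>")) for t in sample_lines]
print(results)
```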
7) Here's the first couple in their entirety then:-
# name of horse
if a != -1:
    inline = lines[lnr+1]
    instr = inline
    pos = instr.find("</a>")
    instr = instr[:pos]
    instr = instr.strip(" ")
    instr = instr.replace("'","")
    horse_name = instr+","
# master rpr
if b != -1:
    inline = outline
    instr = inline
    pos = instr.find("=")
    instr = instr[pos+2:pos+5]
    if instr == "-\">":
        instr = "-"
    instr = instr.strip("\"")
    horse_rpr = str(instr)+","
If the key phrase 'RC-cardPage-runnerName' is found, then a will not equal -1 and that first block of code runs. As we see, the horse name is actually on the line after the key phrase, and a bit of simple string slicing extracts it. I then add a comma so when I write it to a CSV it'll go into a cell on its own.
The master RPR value for a horse is found if b triggers, ie the key phrase "data-order-rpr" was found. This time we don't need to skip any lines: we look for the = sign in the text, take the three characters after the opening quote, and then do a few extra things to cope with a horse without a master rpr (master rpr = "-"). We strip off any leftover quote character (the backslash in the code is only there to escape the quote), and then store it with a comma ready to put into a csv.....
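The same two extractions can be wrapped up as functions so they are easy to try on their own. This is a sketch under assumptions: the function names are hypothetical, the sample lines are invented, and I search for the = from the key phrase onwards and strip both quote and > characters, which is a small variation on the slicing described above rather than the exact same code.

```python
def extract_name(lines, lnr):
    """Horse name sits on the line after the key phrase, before the closing </a>."""
    text = lines[lnr + 1]
    pos = text.find("</a>")
    if pos != -1:
        text = text[:pos]
    return text.strip().replace("'", "")

def extract_master_rpr(line):
    """Take up to three characters after the opening quote and tidy them up."""
    start = line.find("data-order-rpr")
    pos = line.find("=", start)
    value = line[pos + 2:pos + 5]
    # Stripping quote and > copes with "-", one, two and three digit values.
    return value.strip('">')

# Invented sample lines for illustration only.
name_lines = ['<a class="RC-cardPage-runnerName" href="#">', "  Unite The Clans </a>"]
print(extract_name(name_lines, 0))
print(extract_master_rpr('data-order-rpr="85">'))
print(extract_master_rpr('data-order-rpr="-">'))
```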
The same is done for c - g. To save you having to check, c - g find:
c - RPR of a single run in the form lines
d - data of the race
e - the race comment
f - this one was supposed to extract horse profile info but draws nothing useful. I think this is one of the data items (My Rating MR is another) that loads after the page itself has already loaded, and therefore has no value when you scrape it.
g - more profile stuff, equally empty.
Obviously to find the OR etc requires key phrases to be figured out and similar routines written - I get the OR info from my HorseraceBase data, so I don't scrape it from RP, but
data-test-selector="RC-runnerFormRow__or"
looks like it appears 8 lines before the OR of a horse in its form section. You can use a number of phrases for 'today's OR' - this is the code for Unite the Clans in the last race on today's list:
<span class="RC-runnerOr"
data-test-selector="RC-cardPage-runnerOr"
data-order-or="71">
71 </span>
- The OR is 71, so you could search for 'data-order-or=' and strip out from the = to the > to extract the value, or use
data-test-selector="RC-cardPage-runnerOr"
and copy the OR from either the next line (as above) or grab 2 lines later and knock the </span> off..
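Both routes can be tried against the four lines quoted above. A minimal sketch, with or_from_attribute and or_from_selector as hypothetical names:

```python
# The four lines quoted above for Unite the Clans.
snippet = [
    '<span class="RC-runnerOr"',
    'data-test-selector="RC-cardPage-runnerOr"',
    'data-order-or="71">',
    '71 </span>',
]

def or_from_attribute(lines):
    """Route 1: find data-order-or= and take what sits between = and >."""
    key = "data-order-or="
    for line in lines:
        pos = line.find(key)
        if pos != -1:
            start = pos + len(key)
            end = line.find(">", start)
            if end == -1:
                end = len(line)
            return line[start:end].strip('"')
    return None

def or_from_selector(lines):
    """Route 2: find the selector, go 2 lines on, knock the </span> off."""
    key = 'data-test-selector="RC-cardPage-runnerOr"'
    for i, line in enumerate(lines):
        if key in line and i + 2 < len(lines):
            return lines[i + 2].replace("</span>", "").strip()
    return None

print(or_from_attribute(snippet), or_from_selector(snippet))
```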
8) Having extracted each bit in turn just write them to a text file masquerading as a csv, and shove a newline "\n" after the final data item on each line to go onto the next horse and row in the spreadsheet. In my version I collect the rpr last, so when I've got an rpr for my runner it triggers a line write to my final file.
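As an alternative to gluing commas and "\n" on by hand, Python's standard csv module will do the joining and the newline for you. This is a swapped-in technique, not the method described above, and write_rows plus the sample row are made up for illustration.

```python
import csv
import io

def write_rows(out_file, rows):
    """Write one spreadsheet row per runner; csv adds the commas and newline."""
    writer = csv.writer(out_file, lineterminator="\n")
    for row in rows:
        writer.writerow(row)

# io.StringIO stands in for a real file opened with open(name, "w", newline="").
buffer = io.StringIO()
write_rows(buffer, [["Unite The Clans", "71", "-"]])  # illustrative values only
print(buffer.getvalue())
```

A side benefit is that the csv module quotes any field that happens to contain a comma, so it can't spill into the next cell.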
Timeform results are not dissimilar. To grab results information from Timeform you download the 'front page' from the results, then scan through it finding the address for each race's results, then download each results page to grab data. You can get a sort of 'quick' result from the front page, but if you want racetimes you need to grab each page, one per race, similar to the RP cards.
RP Results are a single page effort, the front page of the results section has most of what you are likely to need, so it's a one page download and then loop through it stripping data - if you want the in running comments etc you'll need to download each race result 'full result' page and you're back on the one page per race style of working.
If you struggle to find a suitable key phrase, make a note of a runner or two and the value you are looking to find. For example, by looking at the final racecard of the day I saw it was won by Unite the Clans with an OR of 71 - you can find the same stuff I did by looking for Unite the Clans (simple text search in notepad/wordpad). Once you've found the name you can be fairly confident the rest of that horse's data will be somewhere below - so call up find again, type 71 into it, and you are very likely to end up with the data line showing Unite the Clans' current OR value.
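The same hunt can be done in code instead of notepad. A small sketch, where locate_value is a hypothetical helper and the sample page is invented around the OR lines quoted earlier:

```python
def locate_value(lines, runner_name, value, context=2):
    """Find runner_name, then the first later line containing value.

    Returns (line index, surrounding lines) or None if either is missing.
    """
    name_at = None
    for i, line in enumerate(lines):
        if name_at is None:
            if runner_name in line:
                name_at = i
        elif value in line:
            low = max(0, i - context)
            return i, lines[low:i + context + 1]
    return None

# Invented page fragment for illustration.
page = [
    "Unite the Clans",
    '<span class="RC-runnerOr"',
    'data-test-selector="RC-cardPage-runnerOr"',
    'data-order-or="71">',
    "71 </span>",
]
hit = locate_value(page, "Unite the Clans", "71")
print(hit)
```

The surrounding lines are what you read through to pick a key phrase and count how many lines to skip.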
Okay, this is quite a lot, but only a fraction of the code for grabbing RP card info, the rest should be quite straightforward to generate really - it's only string slicing and file operations for the most part, I was really trying to explain the structure of the data and show you a path through it that allowed it to be collected.
Questions of a general nature are fine here, but if anyone needs more of the code or wants to have something explained in depth it's probably best to PM me, I'm more likely to notice the request for a start!
I would prefer it if any code I put up is not passed on elsewhere - I wouldn't want everybody and his dog trying odd snippets out just to see what happens and end up with some sort of accidental DOS attack on the RP site, it's in all our interests to ensure the RP can run their site without problems caused by poorly understood scripts.
Final point - please remember I don't code on demand. I code for my own use and am happy to share information, but that doesn't translate into coding other people's ideas up because they can't. (My code's too sledgehammer like to attract too much business anyway).
Dave
Apologies - I can't stop the smilies appearing (tried bbcode off, no workee). If there's a smiley in the code it'll be where a colon and a p follow each other, or similar, ie
a_string = substr[pos1 : pos2] - if you don't space that out then you get a sticky out tongue smiley from the ':p' of ':pos'.....