• Hi Guest Just in case you were not aware I wanted to highlight that you can now get a 20% discount on Inform Racing.
    Simply enter the coupon code ukbettingform when subscribing here.
    We have a lot of members who are existing users of Inform Racing so help is always available if needed.
    Best Wishes
    AR
  • Hi Guest Just in case you were not aware I wanted to highlight that you can now get a free 7 day trial of Horseracebase here.
    We have a lot of members who are existing users of Horseracebase so help is always available if needed, as well as dedicated section of the fourm here.
    Best Wishes
    AR

MySQL

ArkRoyal

Administrator
Hi rpjd99 rpjd99
Hope you are keeping well Ray.

As posted elsewhere I have had to move away from MSSQL and am going to use MySQL instead.

I have just been re-reading your earelier post detailing your method of updating your DB tables and have a quick question for you.

If you had a 'new' course pop up in your scraped data would your 'scrape' also include the course id?

Best Wishes

Kevin
 
As posted elsewhere I have had to move away from MSSQL and am going to use MySQL instead.
Hi Kevin,

Yeah, I saw that and I must say that I'm surprised MSSQL wouldn't behave in Win 10. I take it that the 'Repair corrupt installation' option didn't work for you.

If you had a 'new' course pop up in your scraped data would your 'scrape' also include the course id?

No, it wouldn't. Courses are the only element to which I assign my own IDs in the DB. I do this to ensure a distinction between turf, polytrack, fibresand, hurdle, steeplechase and cross-country courses. So flat turf courses have an ID of 1xxx, hurdle courses 2xxx, steeplechase courses 3xxx, cross-country courses 4xxx, flat fibresand courses 5xxx and flat polytrack courses 6xxx. As an example, Cagnes-sur-Mer has five codes: 1033 for flat turf; 2033 for hurdles; 3033 for steeplechases; 5033 for flat fibresand from Jan 2000 to Dec 2008 and; 6033 for polytrack from Jan 2009. The source site doesn't make these distinctions.

When scraping results or race cards, I assign the basic code to the course, e.g. Cagnes = 33, and then add the relevant 1000 depending on the race discipline or flat race surface. Does that make sense?

Should a totally new course appear that I wasn't aware of, it wouldn't be in the list of acceptable courses I have in the scrapers (the majority of courses here are trotting courses only and these don't interest me). But as I always check the number of flat and jumps races before scraping, I would notice a completely new course appearing and could add this course to the list of acceptable courses, thus allowing it to be successfully scraped.

Ray
 
Hi Ray

I am looking to resolve a slightly different problem.

I want to be able to create tables of Courses, Trainers, Jockeys and Horses from a single download of results data that includes this info and obviously much more as well.
The download does not include any ID's for the Courses etc which means I will have to add my own ID's to the individual tables. Any thoughts/pointers as to the best way forward?

Kevin
 
Yeah, I saw that and I must say that I'm surprised MSSQL wouldn't behave in Win 10. I take it that the 'Repair corrupt installation' option didn't work for you.

I think it is my environment that is causing the problem, but it is surprising that basically re-installing the OS, but leaving my files and data in place, did not fix the issue.
 
My approach to it would be to do all the initial work in Excel. For each data item to be indexed, I'd create three columns: one for the name, one for the ID to be assigned and a third column that contains the upper (or lower) case version of the name with all spaces, apostrophes and hyphens removed [UCase(Application.Substitute(...))]. I'd then weed out all duplicates and assign an ID (there might be some trainer-jockeys that would need to have the same ID in the T and J tables). When downloading the second and subsequent days' data, I'd reduce each name in the new data to the basic version as in column three and do a check against that name in your master table. If the name already exists I'd add the ID number to the downloaded data; if not, I'd add the name and reduced name to the master list for later assignation.

After about a month's worth of data, I'd build these tables into the MySQL DB, including a column for the reduced version of the name, but I'd still leave the master tables in Excel for daily checking and highlighting of new CHTJ data items, which will need to have an ID added - manually or by macro. New courses I would add manually to the DB (no new ones here since I started, only new surfaces). I would also add a flag to all new data items that will be imported into the DB, so I could do some error checking on the names using WHERE ucTrainer LIKE '%SOME%PARTVER%SION'.

Checking on name strings will produce an overhead, but the FIND command is rapid. Just to give you an idea of what you'll need to search, since 1st January 2010 I have 156 courses, 2109 trainers, 2503 jockeys and 49,574 horses. The only bottleneck here would be the horses.

I can't see any other way of doing it.

Ray
 
Thanks Ray,

I appreciate that it is probably easier for us who are more au-fait with excel to do as you suggested and in all likelihood that is how I will proceed but it must be possible to do it direct in SQL.
Maybe by using an auto-increment field to provide the ID's?
 
I've attached a segment of a book that deals with web scraping in SQL. I've never done it myself, but it can be done apparently. Auto-increment will certainly do what you want, but you'll need to have some error checking to deal with typos in the names and consistent IDs for trainer-jockeys. I think you'll find QlikView to be very useful in identifying anomalies in the names/IDs as your DB builds up. You'll only need to import the names, IDs and U/LCase names from your DB into QV to do the checking, so you'll have an easy visual reference of any anomalies present.

Ray
 

Attachments

  • Web Scraping in SQL.pdf
    300.1 KB · Views: 14
Hi Ray,

Thanks for the book segment.
Perhaps I should have been more clear, I am not bothered about scraping the data in SQL more that I would rather do the pre-processing in SQL rather than excel.
I have asked myself why, and the answer is I can do it in excel but do not know how to do it SQL! Might be biting off more than I can chew as I want to get this up and running sooner rather than later and not having much of a clue as to where to start will probably mean that I can't get things finished in time to meet my deadlines. Looks like I may give myself a week or so playing around and if I get nowhere will fallback to using excel to get the data into a format more easily ingested by the SQL database.
 
Ah, now I get you. Let's assume all the data is in one table in the DB and that you have four columns labelled 'Course', 'Horse', 'Trainer' and 'Jockey'. The first thing I'd do is create two additional columns for each of the CHTJ columns as follows:
USE DBname;
ALTER TABLE TableName ADD CrsID INT
ALTER TABLE TableName ADD lcCrs VARCHAR(32)
ALTER TABLE TableName ADD HrsID INT
ALTER TABLE TableName ADD lcHrs VARCHAR(25)
ALTER TABLE TableName ADD TrnID INT
ALTER TABLE TableName ADD lcTrn VARCHAR(40)
ALTER TABLE TableName ADD JckID INT
ALTER TABLE TableName ADD lcJck VARCHAR(40)

I'd then write a routine to reduce the names to their basic form, i.e. no spaces, apostrophes or hyphens, and write them to their respective columns. So, something like:
USE DBname;
UPDATE TableName
SET
lcCrs = LOWER(REPLACE(REPLACE(REPLACE(Course,' ',''),'''',''),'-',''))
, lcHrs = LOWER(REPLACE(REPLACE(REPLACE(Horse,' ',''),'''',''),'-',''))
, lcTrn = LOWER(REPLACE(REPLACE(REPLACE(Trainer,' ',''),'''',''),'-',''))
, lcJck = LOWER(REPLACE(REPLACE(REPLACE(Jockey,' ',''),'''',''),'-','')) (note double '' for apostrophes and commas before the last three rows)

Next, for each of the CHTJ columns, I'd write something like:
USE DBname;
SELECT DISTINCT(lcTrn), Trainer, COUNT(lcTrn) AS 'N0.'
FROM TableName
GROUP BY lcTrn, Trainer
ORDER BY [No.] DESC, lcTrn (note the square brackets due to the point after 'No'

All names with variations in the spelling will have a value > 1, so you can beat them into shape. Once cleaned up, you can then transfer the CHTJ data into their own tables, add their IDs and then add the IDs back into the main table. You'll probably also need Race IDs and store the basic race data in its own table. You'll then need to create Primary keys and their associated Foreign keys, but we can look at that another time.

You'll need to experiment, Kevin, because I don't have a similar situation, so I'm trying to visualise it all in my head. I'm also unsure about syntax differences between MySQL and T-SQL.

Ray
 
Made some progress getting to grips with MySQL this evening.
It comes with MySQL Workbench which provides a really nice way of interacting with the SQL Server.

RaceForm Interactive is actually a Visual FoxPro database.
I have linked Access to the FoxPro tables and tonight exported those tables from Access via ODBC into MySQL.

Next thing will be to see if I can re-use the SQL queries I have written in Access on the same data now in MySQL.
 
Now you have me completely flummoxed! If you're still using RI as a data source, won't you have all the IDs you need?

On second thoughts, maybe you were just testing me. Pass or fail?

Ray
 
In that case, you can still use the IDs you have in RI for the CHTJ data and then add them to the new data source when you start using that. You'll only have to add IDs to the new data as it comes in and that can be added using auto-number from a specified starting value, i.e. the last RI value + 1. Yes/no?

Ray
 
Yes. But I will have to check every horse, trainer etc to see if they exist first and allocate a new ID if they don't. One problem though is that RI IDs are not necessarily consecutive so might be worth importing all RI stuff without IDs and create my own. At least that way I will have plenty of data to check that I am doing it right.
 
One problem though is that RI IDs are not necessarily consecutive so might be worth importing all RI stuff without IDs and create my own.
You can create your own, but IDs don't have to be numerically consecutive, just unique. My approach would be to import into MySQL all RI data common to the new data source. To the CHTJ tables I would add a column with the reduced name - the only basis for comparison. When you start using the new data source, assuming it will be a single table, you can add the reduced name columns to this table and use the INSERT... ON DUPLICATE KEY UPDATE command to distribute the new data amongst the existing (RI) tables, using the reduced names as a check. Where the reduced name doesn't exist, you can add the CHTJ data to its relevant table and attach a new ID to that data item using a variable that will be incremented in the routine. The base for this variable will be the MAX value + 1 of the existing IDs in the table.

I think that should work.

Ray
 
To get familiar with MySQL I have been experimenting with getting existing SQL scripts that I used in Access running in the new environment. Initially I was hugely underwhelmed by the time it took to run my main query, it was far longer than Access! Then I realised that I hadn't set any indexes on the tables I imported, now the query runs in about 5 seconds compared to 15 minutes:hi:
 
180 times faster? I didn't realise that indexes were that efficient. It's a pity the new data source doesn't have them. Still, I imagine the new data will just be one days racing, so there won't be a huge amount of data to be checked on strings of characters. At least the subsequent analysis will run quickly.

Good luck with the migration!

Ray
 
Back
Top