Thread: WCA website crawling

  1. #1
    Member
    Join Date
    Sep 2006
    Location
    Amsterdam
    WCA Profile
    2003BRUC01
    Posts
    341

    WCA website crawling

    Hi guys,

    As you may have noticed the WCA website has become slower recently.
    This is because our provider moved the WCA website to another (busier?) server. The reason mentioned was that some people are periodically crawling the WCA results.

    So if you are using some tool to update your database of WCA results, then please stop doing this. If you need data, then send me an e-mail and I will gladly send you the data in a standardised format.
    If there are generic requirements for data delivery, then let me know too. The discussion on the WCA Forum about this subject did not come to a conclusion.

    Thanks,

    Ron

  2. #2
    statue
    Join Date
    Jul 2008
    Location
    Camp Hill, PA, USA
    WCA Profile
    2008KORI02
    YouTube
    StachuK1992
    Posts
    3,470

    Hey Ron,

    Approximately how large is the database?
    I have a rather large server here, soon with FIOS connection, and there may be a chance of me being able to host at least a portion of the database, if that would be helpful.

    StachuK

  3. #3
    Premium Member MichaelErskine's Avatar
    Join Date
    Jul 2008
    Location
    Sherwood, Nottingham, UK
    WCA Profile
    2008ERSK01
    YouTube
    msemtd
    Posts
    1,186

    I'm guilty of running automated searches of the WCA site, but always with consideration and not very regularly.

    It would be extremely useful if the data normally searchable from the web interface were made available, perhaps on a mirror server somewhere, and kept up to date (e.g. once a week) in a reasonable relational format (perhaps a SQLite DB or MySQL dbdump).
    Failing to improve!
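
    A minimal sketch of what a relational dump along the lines suggested above might look like, using Python's standard sqlite3 module. The table layout, the column names and the sample rows are all invented for illustration; they are not the WCA's actual schema or real results.

```python
import sqlite3

# Invented sample rows (person id, event, competition, best time in centiseconds).
# These are placeholders, not real WCA data.
SAMPLE_RESULTS = [
    ("2009TEST01", "333", "ExampleOpen2009", 1500),
    ("2009TEST02", "333", "ExampleOpen2009", 2500),
]

def build_dump(path, rows):
    """Write result rows into a SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    # Hypothetical table layout, assumed for this sketch only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "person_id TEXT, event_id TEXT, competition_id TEXT, best INTEGER)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn

# ":memory:" keeps the example self-contained; a real dump would use a file path.
conn = build_dump(":memory:", SAMPLE_RESULTS)
fastest = conn.execute(
    "SELECT person_id, best FROM results WHERE event_id = ? "
    "ORDER BY best LIMIT 1",
    ("333",),
).fetchone()
```

    With a file like this, anyone could run their own queries locally instead of hitting the website.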

  4. #4
    Member
    Join Date
    Apr 2008
    Location
    Ohio, USA
    WCA Profile
    2006MERT01
    Posts
    777

    Quote Originally Posted by Stachuk1992
    I have a rather large server here, soon with FIOS connection, and there may be a chance of me being able to host at least a portion of the database, if that would be helpful.
    Quote Originally Posted by msemtd
    It would be extremely useful if the data normally searchable from the web interface were made available, perhaps on a mirror server somewhere, and kept up to date (e.g. once a week) in a reasonable relational format (perhaps a SQLite DB or MySQL dbdump).

    These two sound like they could go together well. I can see privacy being an issue, though. Perhaps don't include birthdates or other sensitive info on the mirror?

    If this comes through, while it's being set up, the database could certainly use some restructuring... I talked with Dave briefly about this, and the "everything in one big table" format can be improved quite significantly to reduce memory usage and server load.
    Last edited by JBCM627; 07-01-2009 at 07:40 PM.
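
    To illustrate the kind of restructuring meant above, here is a rough sketch of splitting a flat "one big table" into a persons lookup plus a slimmer results list. The field layout and the sample rows are assumptions made up for this example; the thread does not describe the real table.

```python
def normalize(flat_rows):
    """Split flat (person_id, name, country, event, best) rows into a
    persons lookup and a results list that only references the person id."""
    persons = {}
    results = []
    for person_id, name, country, event, best in flat_rows:
        # Name and country are stored once per person instead of once per result.
        persons[person_id] = (name, country)
        results.append((person_id, event, best))
    return persons, results

# Invented rows; the repeated name/country columns are the redundancy to remove.
flat = [
    ("2009TEST01", "Test Person A", "NL", "333", 1500),
    ("2009TEST01", "Test Person A", "NL", "444", 7000),
    ("2009TEST02", "Test Person B", "US", "333", 2500),
]
persons, results = normalize(flat)
```

    Each result row then carries only an id, so repeated names and countries no longer take up space in every row.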

  5. #5
    Premium Member MichaelErskine's Avatar
    Join Date
    Jul 2008
    Location
    Sherwood, Nottingham, UK
    WCA Profile
    2008ERSK01
    YouTube
    msemtd
    Posts
    1,186

    Quote Originally Posted by JBCM627
    I can see privacy being an issue though. Perhaps don't include birthdates or other sensitive info on the mirror?

    Certainly - any organisation is obliged to do this for Data Protection. The data that is freely available via the web interface is all that need be exported - that is what is being web-crawled.
    Failing to improve!
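
    One way to make sure only the publicly visible data gets exported is to whitelist fields rather than blacklist them. A small sketch, where the field names and the record are hypothetical:

```python
# Assumed list of fields that are already visible on the public web pages.
PUBLIC_FIELDS = ("id", "name", "country")

def public_view(record):
    """Return a copy of the record containing only whitelisted fields."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}

# Invented record; "birthdate" stands in for any sensitive field.
record = {
    "id": "2009TEST01",
    "name": "Test Person A",
    "country": "NL",
    "birthdate": "1990-01-01",
}
exported = public_view(record)
```

    A whitelist means any sensitive column added later stays out of the export unless someone deliberately adds it.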

  6. #6
    Member
    Join Date
    Sep 2006
    Location
    Amsterdam
    WCA Profile
    2003BRUC01
    Posts
    341

    Hi all,

    The WCA database is now downloadable at http://www.worldcubeassociation.org/...umpyyyymm.xlsx
    We will post at the end of every month, so the first file is wcadump200907.xlsx (size 6.5MB).
    If we forget to post, then send an e-mail to rbruchem@worldcubeassociation.org.
    Now that we have the database online please stop crawling the WCA website to update your own databases.

    Looking forward to your feedback.

    Thanks,

    Ron
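
    For anyone scripting the monthly download, the file name for a given month can be derived from the pattern above. This sketch builds only the file name; the directory part of the URL is truncated in the post, so it is deliberately left out.

```python
from datetime import date

def dump_filename(day):
    """Build the monthly dump file name (wcadumpyyyymm.xlsx) for a date."""
    return f"wcadump{day:%Y%m}.xlsx"

# Any day in July 2009 maps to the first published file.
name = dump_filename(date(2009, 7, 31))
```

    For July 2009 this gives wcadump200907.xlsx, matching the first file mentioned above.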

  7. #7
    Member Dave Campbell's Avatar
    Join Date
    Jun 2007
    Location
    Toronto, Ontario, Canada
    WCA Profile
    2005CAMP01
    YouTube
    canadiancubing
    Posts
    202

    I cannot open the file as it was created by a newer version of Excel than the one i am running (and i am running 2002). I'd personally rather see an sql dump file than an Excel file.

    Secondly, i see you have a naming convention in place, which is fine if you are trying to have versioning available. But i would suggest still making one "current" version that always has the same name. That way i can download the same file any time i wish and get the latest version. I could then use the specific file names with the YYYYMM for any old version.

    In the end, i don't crawl the WCA site, nor do i wish to have my own version. I just want to be able to query the current version for selected data. Which of course brings me back to my webservice request which is documented on the WCA board.
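
    The "current" file idea could be as simple as copying each dated dump to a fixed name when it is published. A sketch under that assumption; the name wcadump_current.xlsx is hypothetical and not something the WCA has announced.

```python
import shutil
import tempfile
from pathlib import Path

def publish(dump_path, current_name="wcadump_current.xlsx"):
    """Copy a dated dump to a fixed 'current' name in the same directory."""
    dump_path = Path(dump_path)
    current = dump_path.with_name(current_name)
    shutil.copyfile(dump_path, current)
    return current

# Demonstrate in a temporary directory with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    dated = Path(tmp) / "wcadump200907.xlsx"
    dated.write_bytes(b"placeholder contents")
    current = publish(dated)
    same = current.read_bytes() == dated.read_bytes()
    current_basename = current.name
```

    The dated files stay available for anyone who needs an older version, while scripts can always fetch the fixed name.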

  8. #8
    Super Moderator
    Join Date
    Feb 2008
    Location
    Westminster, CO
    WCA Profile
    2008BRUN01
    Posts
    1,133

    Quote Originally Posted by Dave Campbell
    I cannot open the file as it was created by a newer version of Excel than the one i am running (and i am running 2002). I'd personally rather see an sql dump file than an Excel file.

    I just noticed that the results tab has exceeded the 65535 row limit of OpenOffice Calc and gnumeric, so I can't open it and retrieve all the result data. Bummer.
    The person posting below me is a genius.

  9. #9
    Premium Member MichaelErskine's Avatar
    Join Date
    Jul 2008
    Location
    Sherwood, Nottingham, UK
    WCA Profile
    2008ERSK01
    YouTube
    msemtd
    Posts
    1,186

    Quote Originally Posted by brunson
    I just noticed that the results tab has exceeded the 65535 row limit of OpenOffice Calc and gnumeric, so I can't open it and retrieve all the result data. Bummer.

    I shall create a sqldump version and an ODS with two results tabs when I get into the office tomorrow. There's no good reason to get locked into non-free software.
    Failing to improve!
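
    Splitting the results across tabs comes down to chunking the rows so that each sheet stays under the spreadsheet's row cap. A sketch of that step only (no spreadsheet library involved); the default cap of 65,536 rows per sheet and the single header row are assumptions, so both are parameters.

```python
def split_rows(rows, max_rows=65536, header_rows=1):
    """Split rows into chunks that fit on one sheet, leaving room for headers."""
    per_sheet = max_rows - header_rows
    return [rows[i:i + per_sheet] for i in range(0, len(rows), per_sheet)]

# 100,000 dummy rows stand in for the results table.
rows = list(range(100000))
sheets = split_rows(rows)
sizes = [len(sheet) for sheet in sheets]
```

    With these defaults, 100,000 rows end up on two sheets of 65,535 and 34,465 rows.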

  10. #10
    Premium Member MichaelErskine's Avatar
    Join Date
    Jul 2008
    Location
    Sherwood, Nottingham, UK
    WCA Profile
    2008ERSK01
    YouTube
    msemtd
    Posts
    1,186

    Quote Originally Posted by Ron
    The WCA database is now downloadable at http://www.worldcubeassociation.org/...umpyyyymm.xlsx

    Thanks Ron.

    Processing...
    Failing to improve!
