PennController for IBEX

February 23, 2022 at 1:31 pm #7795

Keymaster

Hello Kelly,

Your experiment fetches many resources from GitHub’s servers: your code points to the GitLink_* columns of the tables pre.csv, post.csv and training.csv, which represent a total of 265+216+216 = 697 unique URLs (training.csv contains redundant references). When I opened your demonstration link and opened my browser’s console, the Network tab listed a total of 406 successful requests to raw.githubusercontent.com. Besides those, there are 812 (=406*2) other requests to GitHub’s servers, which come upstream from the 406 aforementioned ones. In total, then, there was a total of 1218 requests to GitHub, and it took about 4min to complete them all

At the end of the day, I was still missing 291 requests. I am not sure what happened to those, as my Network tab does not show any unsuccessful request, but most likely they were dropped because of the many requests that were actually sent. Usually, the more requests you send within a short time to a server, the slower they will be processed, and some of them might even be blocked. PennController tries to prevent flooding the distant server with requests by only allowing for 4 concurrent requests (new requests are sent once older requests have successfully resolved). That being said, in your case, it could be that GitHub blocks the requests of an experiment session after a few hundreds, and/or that PennController fails to free new slots

My suggestion is that you consolidate your resources into zip files (you could have separate zip files for different sets of trials if you’d like) and host them on a dedicated webserver or an S3 bucket, for example. Then you’ll have just as many requests as you have zip files, making it much more likely that they all succeed, and the zip files will contain all your resources, so once they’ve been downloaded you don’t run the risk of missing one resource here and there

If that’s not an option you can pursue, I would at least replace the GitHub URLs from your tables with direct URLs pointing to the raw.githubusercontent.com domain, so you don’t generate three times as many requests as you really need. You could also try to balance the requests to GitHub’s servers and to the farm’s server more evenly, in the hope that no server would block the requests

Let me know if you need assistance

Jeremy