Scrape Everything From GitHub #164
Conversation
@wdahlenburg By this do you mean crawl through all of github.com? How are you getting around rate limiting and performance issues? I agree this is cool, but without some guardrails or a way to save and resume a search, it is a rough process. I have looked at this type of functionality before and used it at length in GHE: threaded, I hit the rate limit in under a minute; unthreaded, it took ~8 hours to go through 60k repos, not counting any additional network latency.
@mattyjones Yes, this will technically crawl through all of github.com, but rate limiting isn't built into it; once you hit the rate limit it will fail. I found differences between github.com and an enterprise instance (github.company.com), where the latter did not have any rate limit. Rate limiting would be a little tricky to implement due to the concurrency, but it is worthwhile overall. It realistically should be added as a separate feature, with this committed first. I would need to add support to check whether a rate limit exists. This code should work fine as-is if you are running it against an enterprise instance without rate limiting. It definitely takes a lot of time and CPU. There are also some potential options worth exploring for the file size and partial results.
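A minimal sketch of that existence check, assuming direct use of `net/http` against GitHub's documented `/rate_limit` endpoint (the API returns 404 from that endpoint when rate limiting is disabled, as can happen on Enterprise instances). `hasRateLimit`, the token handling, and the example hosts are illustrative, not code from this PR:

```go
package main

import (
	"fmt"
	"net/http"
)

// hasRateLimit probes the /rate_limit endpoint. On github.com it always
// exists; on a GitHub Enterprise instance with rate limiting disabled the
// API documents a 404 response, which we treat as "no rate limit".
// baseURL is e.g. "https://api.github.com" or
// "https://github.company.com/api/v3" (hypothetical enterprise host).
func hasRateLimit(baseURL, token string) (bool, error) {
	req, err := http.NewRequest("GET", baseURL+"/rate_limit", nil)
	if err != nil {
		return false, err
	}
	if token != "" {
		req.Header.Set("Authorization", "token "+token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	// 404 means rate limiting is disabled on this instance.
	return resp.StatusCode != http.StatusNotFound, nil
}

func main() {
	limited, err := hasRateLimit("https://api.github.com", "")
	if err != nil {
		panic(err)
	}
	fmt.Println("rate limiting enabled:", limited)
}
```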
@wdahlenburg I can confirm that Enterprise at least has the option to rate limit. The key here could be to check the response and, if a rate limit is hit, sleep for a little while and then try again. This will take a while, but that is something I was toying with. I also have code to dump results into SQLite for later querying; I may implement that here. I certainly agree with rate limiting being a second feature as well. I will play with this for a little while and then merge it if all goes well.
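One way to implement that sleep-and-retry, sketched against GitHub's documented rate-limit headers (`X-RateLimit-Remaining` and `X-RateLimit-Reset`, the latter carrying the epoch second at which the quota refills). `getWithBackoff` is a hypothetical helper, not code from this PR:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// getWithBackoff issues a GET and, when the response indicates the rate
// limit is exhausted (403/429 with X-RateLimit-Remaining: 0), sleeps until
// the epoch time in X-RateLimit-Reset before retrying.
func getWithBackoff(url string) (*http.Response, error) {
	for {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		limited := (resp.StatusCode == http.StatusForbidden ||
			resp.StatusCode == http.StatusTooManyRequests) &&
			resp.Header.Get("X-RateLimit-Remaining") == "0"
		if !limited {
			return resp, nil
		}
		resp.Body.Close()

		reset, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64)
		if err != nil {
			reset = time.Now().Unix() + 60 // header missing: fall back to a flat minute
		}
		wait := time.Until(time.Unix(reset, 0)) + time.Second
		fmt.Printf("rate limited; sleeping %s\n", wait)
		time.Sleep(wait)
	}
}

func main() {
	resp, err := getWithBackoff("https://api.github.com/repositories?since=0")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Sleeping until the reset time rather than retrying on a fixed interval wastes no quota on requests that are guaranteed to fail.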
@wdahlenburg Thanks for the work on this. I have now merged it in and will be playing with it over the next few days to ensure all is light and bright. I also need to write tests for all this code before I start to muck with it.
Pull Request for #163
Made use of the ListAll method to implement the ability to scrape all repositories. A binary search determines the upper limit of repository IDs, and the resulting range is split evenly between threads so that large numbers of repositories can be pulled concurrently.
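The PR's code isn't shown here, but the upper-limit search can be sketched against GitHub's public `GET /repositories?since=N` endpoint, which per the API docs only returns repositories with an ID greater than `N`, so an empty page means `N` is at or past the newest repository. `anyReposAfter` and `maxRepoID` are illustrative names, and the starting `hi` must be an overestimate of the newest ID (the `1 << 31` below is an assumption):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// anyReposAfter reports whether any public repository has an ID greater
// than `since`, using GET /repositories?since=N (an empty JSON array means
// `since` is at or past the newest repository).
func anyReposAfter(since int64) (bool, error) {
	url := fmt.Sprintf("https://api.github.com/repositories?since=%d", since)
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var page []json.RawMessage
	if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
		return false, err
	}
	return len(page) > 0, nil
}

// maxRepoID binary-searches for the highest existing repository ID below
// hi. Each probe costs one API call, so the search needs ~log2(hi) requests.
func maxRepoID(hi int64) (int64, error) {
	lo := int64(0)
	for lo < hi {
		mid := lo + (hi-lo)/2
		if more, err := anyReposAfter(mid); err != nil {
			return 0, err
		} else if more {
			lo = mid + 1 // newest repo ID is above mid
		} else {
			hi = mid // nothing after mid; newest ID is at most mid
		}
	}
	return lo, nil
}

func main() {
	// Assumption: the newest repository ID fits below 2^31.
	id, err := maxRepoID(1 << 31)
	if err != nil {
		panic(err)
	}
	fmt.Println("newest repository ID:", id)
}
```

The resulting range [1, maxID] can then be chunked evenly across worker threads, each walking its slice of IDs independently.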
Currently only the master branch is pulled from repositories. This helps prevent path explosion, but could be improved in the future.
Closing issues
closes #163