This tutorial explains how to set up rotating proxies in Scrapy.
An IP address is a numerical value assigned to a device that connects to the internet. Similar to how each house has an address, so too does each internet device. The key point to remember here is that every website we access can see our IP address.
While scraping, websites can see the IP address of the device the Scrapy bot is running on. If you’re scraping hundreds of pages per minute, the website is bound to notice that a single IP is accessing hundreds of pages in a short period of time.
Naturally, you can expect to see your IP banned on sites that take their security seriously. Luckily, Scrapy and its many libraries allow us to get around this problem by continuously rotating our IP address as we scrape. This makes it look as if there are many different people accessing the site, instead of one person.
Scrapy Rotating Proxies
The Scrapy Python community has created many libraries for rotating proxies. We’ll be using scrapy-rotating-proxies since we believe it’s reliable and widely used by the community. It also has some interesting features, such as automatically quarantining dead proxies and separating the good proxies from the bad.
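To make the idea of quarantining dead proxies more concrete, here is a minimal sketch in plain Python. It is not the library's actual implementation. The `ProxyPool` class, its method names and the fixed 300-second quarantine are all hypothetical and exist only to illustrate the concept of keeping good and dead proxies apart.

```python
import random
import time


class ProxyPool:
    """Toy proxy pool (hypothetical): rotates proxies and quarantines dead ones."""

    def __init__(self, proxies, quarantine_seconds=300.0, clock=time.monotonic):
        self._clock = clock
        self._quarantine_seconds = quarantine_seconds
        # Maps each proxy to the time at which it may be used again.
        # Negative infinity means "usable right now".
        self._available_at = {proxy: float("-inf") for proxy in proxies}

    def good_proxies(self):
        """Return the proxies that are not currently quarantined."""
        now = self._clock()
        return [p for p, t in self._available_at.items() if t <= now]

    def pick(self):
        """Pick a random usable proxy, or None if all are quarantined."""
        candidates = self.good_proxies()
        return random.choice(candidates) if candidates else None

    def mark_dead(self, proxy):
        """Quarantine a proxy for a fixed period after a failure."""
        self._available_at[proxy] = self._clock() + self._quarantine_seconds

    def mark_good(self, proxy):
        """Make a proxy usable again immediately."""
        self._available_at[proxy] = float("-inf")
```

A scraper built on this sketch would call `pick()` before each request and `mark_dead()` whenever a request through that proxy fails, so broken proxies sit out for a while instead of being retried immediately.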
To install the library, just run the command below in the command prompt.
pip install scrapy-rotating-proxies
Next up, we add the settings required to get the rotating proxies started.
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
Alternatively, you can create a text file listing your proxies, one per line, and simply add the file path to the settings as shown below. Remember! ROTATING_PROXY_LIST_PATH takes precedence over ROTATING_PROXY_LIST if both options are set.
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
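If you maintain the proxy file yourself, a small helper can keep it tidy. The following is a sketch only, assuming the file holds one proxy per line. The function names `load_proxies` and `save_proxies` are our own and not part of scrapy-rotating-proxies.

```python
from pathlib import Path


def load_proxies(path):
    """Read one proxy per line, skipping blank lines and duplicates."""
    proxies = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        proxy = line.strip()
        if not proxy or proxy in seen:
            continue
        seen.add(proxy)
        proxies.append(proxy)
    return proxies


def save_proxies(path, proxies):
    """Write the proxies back to disk, one per line."""
    Path(path).write_text("\n".join(proxies) + "\n", encoding="utf-8")
```

Running `save_proxies(path, load_proxies(path))` rewrites the file without blank lines or repeated entries before you point ROTATING_PROXY_LIST_PATH at it.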
A large number of proxies in your list means there are plenty of IPs to distribute your requests across. In this case, more is better.
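A quick back-of-the-envelope calculation shows why. The sketch below assumes requests are spread evenly across the proxies, which is a simplification, and the figure of 600 requests per minute is just an illustrative number.

```python
def requests_per_proxy(total_requests_per_minute, proxy_count):
    """Average load each proxy carries if requests are spread evenly."""
    if proxy_count <= 0:
        raise ValueError("proxy_count must be a positive number")
    return total_requests_per_minute / proxy_count


# Hypothetical scraper sending 600 requests per minute.
for count in (1, 10, 100):
    print(count, "proxies:", requests_per_proxy(600, count), "requests/min each")
```

With a single IP the site sees all 600 requests per minute from one address. With 100 proxies, each address averages only 6 per minute, which looks much more like ordinary traffic.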
The final step is to add the following to your DOWNLOADER_MIDDLEWARES in the settings.py file.
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
If you’ve set everything up correctly as we’ve just shown, requests will start being sent using the proxies in the list.
We have included a list of proxies here that you can download and try out for yourself. Keep in mind, since these are free proxies, a lot of them probably won’t be working at all, or will be very slow.
Free Proxies List
ROTATING_PROXY_PAGE_RETRY_TIMES is another pretty useful setting. If you have any experience with proxies, you’ll know that they often fail and a request can take several tries to succeed. This setting has a default value of 5 retries per request. If you wish to change it, add it to the settings.py file with a new value.
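For example, raising the limit might look like the fragment below. The value of 10 is an arbitrary choice for illustration and not a recommendation.

```python
# settings.py
# Allow up to 10 retries per request instead of the default 5.
# The value 10 is only an example; tune it to how flaky your proxies are.
ROTATING_PROXY_PAGE_RETRY_TIMES = 10
```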
Conclusion
If you were to acquire enough proxies and IPs, you could launch a script that sends 1,000 requests to any number of sites, each appearing to come from a different IP address.
You may be wondering where you can get these proxy lists for Scrapy. There are plenty of free proxy sites out there which you can check out and add to the list.
There are also several paid and dedicated services which take care of rotating your IP address. If your Scrapy bot is a serious, large-scale project, you may want to consider using these services.
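Since free proxies are often dead, it can help to test them before adding them to your list. Below is a minimal sketch using only Python's standard library. The function names, the `host:port` proxy format and the test URL `https://httpbin.org/ip` are assumptions made for this example, so swap in whatever suits your setup.

```python
import http.client
import urllib.request


def check_proxy(proxy, test_url="https://httpbin.org/ip", timeout=5.0):
    """Return True if a request sent through the proxy succeeds in time."""
    handler = urllib.request.ProxyHandler({
        "http": f"http://{proxy}",
        "https": f"http://{proxy}",
    })
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and HTTP errors all count as a dead proxy.
        return False


def filter_working(proxies, check=check_proxy):
    """Keep only the proxies for which the check passes, preserving order."""
    return [proxy for proxy in proxies if check(proxy)]
```

Calling `filter_working(['proxy1.com:8000', 'proxy2.com:8031'])` would return only the entries that answered within the timeout. The check runs one proxy at a time, so a long list of slow proxies will take a while to get through.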
User Agents
While proxies will mask your IP address, they won’t change the browser data that gets sent to the server. By default, Scrapy identifies itself as a Scrapy bot to the target website. If you want to learn how to change this and improve your Spider, check out our Scrapy User Agents Tutorial.
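As a quick taste, Scrapy reads the user agent from the USER_AGENT setting in settings.py. The string below is just a sample browser-like value chosen for illustration, not one this tutorial specifically recommends.

```python
# settings.py
# Sample browser-like User-Agent string (illustrative only);
# replace it with one that suits your project.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```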
This marks the end of the Scrapy Proxies tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.