This is a comparison article between Scrapy and Selenium, two popular Python tools often discussed in the context of web scraping.
What is Web Scraping?
Web scraping is the act of extracting, or “scraping”, data from a web page. The general process is as follows. First, the targeted web page is “fetched”, or downloaded. Next, the downloaded data is parsed into a suitable format. Finally, we navigate through the parsed data, selecting the parts we want.
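The three steps above can be sketched with Python’s standard library alone. To keep the sketch self-contained, a hard-coded page stands in for the “fetch” step (a real scraper would download it, e.g. with `urllib.request.urlopen`), and the element names here are made up for illustration.

```python
from html.parser import HTMLParser

# Step 1 (fetch): in a real scraper this HTML would be downloaded over
# HTTP; a hard-coded page keeps the example self-contained.
PAGE = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Laptop</li>
    <li class="item">Phone</li>
  </ul>
</body></html>
"""

class ItemExtractor(HTMLParser):
    """Collect the text of every <li class="item"> element."""
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())

parser = ItemExtractor()
parser.feed(PAGE)      # Step 2 (parse): walk the HTML
print(parser.items)    # Step 3 (select): -> ['Laptop', 'Phone']
```

Libraries like Scrapy and BeautifulSoup exist precisely so you don’t have to write this kind of boilerplate by hand.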
The above process is meant to be completely automated through “Web Crawlers”, which go onto the web and download the data for you.
What is Selenium?
You see, Selenium isn’t actually a web scraping library. Can it perform web scraping-like tasks? Well yes, it can. Selenium is actually a web automation library, used to automate tasks that one might otherwise do manually on the web.
Similar to how a library like PyAutoGUI (a desktop automation library) is used to control the mouse, keyboard, clipboard, etc. on the desktop, Selenium controls everything within a web browser. It basically emulates a human: opening the browser, moving the mouse, clicking on buttons, and so on.
Due to these Web automation abilities, we can perform web scraping-like tasks using Selenium. For instance, clicking on a download link, or downloading an image.
Selenium also has some handy CSS and HTML detection abilities, being able to locate elements on the page by their name, tag, class, or any other defining attribute.
You can also pair Selenium with other libraries to extend its functionality. From my observation, most web-related libraries are quite compatible with each other, allowing them to work together to achieve the end result. You could even pair up Scrapy and Selenium.
What is Scrapy?
Scrapy is a complete web scraping framework in Python, designed to extract, process, and store data. The Scrapy framework provides a lot of built-in functionality and code, allowing you to execute complicated tasks without having to write large amounts of code yourself.
For all its amazing and irreplaceable features, Scrapy can be a little hard to set up and learn. As a full framework, its learning curve is steeper than that of simpler libraries.
There is another, simpler library called BeautifulSoup, which has an easier learning curve. It’s like Scrapy, but without many of the special features and extra functionality. It is simple to use though, making it a good choice for basic scraping jobs. If you want to see more on Scrapy and BeautifulSoup, read this comparison article.
There is more we have to say about Scrapy, but to avoid repetition, we’ll leave it for later.
Scrapy vs Selenium – Analysis
JavaScript: Since Selenium drives a real browser, it can scrape content that is rendered by JavaScript, something Scrapy’s standard HTTP-based fetching cannot handle on its own. However, there is a plugin for Scrapy called Scrapy-Splash, which is designed to be able to scrape JS content as well.
Automation: Selenium obviously has the edge here, as it is literally a web automation library. One thing in particular I want to point out though: Scrapy can get quite complicated over things like automatically logging in to a site. On several sites you need to log in (to access the data) before you can begin scraping, so this is a problem.
Selenium, from my experience, is easier to handle when it comes to automating logins. You can see this tutorial for Selenium logins, and this tutorial for Scrapy logins (for comparison’s sake). The Selenium way was much simpler, as Selenium excels at finding elements (login fields) and manipulating them (inserting data and submitting it).
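A Selenium login along these lines can be sketched as below. This assumes Selenium 4 is installed and a browser driver is available; the URL and the form field names (`username`, `password`, the submit button selector) are hypothetical and would need to match the actual site.

```python
# Hedged sketch of an automated login with Selenium 4. The import is
# guarded so the sketch stays importable even without Selenium installed.
try:
    from selenium import webdriver
    from selenium.webdriver.common.by import By
except ImportError:
    webdriver = None  # Selenium not installed; sketch only

def login(driver, url, username, password):
    """Open the login page, fill the form, and submit it.

    The field names and button selector here are hypothetical examples.
    """
    driver.get(url)
    driver.find_element(By.NAME, "username").send_keys(username)
    driver.find_element(By.NAME, "password").send_keys(password)
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Usage (requires an installed browser and driver):
# driver = webdriver.Chrome()
# login(driver, "https://example.com/login", "user", "secret")
```

Once logged in, the same `driver` object can continue navigating and extracting data from pages behind the login wall.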
Web Scraping: When it comes down to pure web scraping, and the accompanying features involved, Scrapy wins hands down. This isn’t something we can sum up in a few lines, so we’ve dedicated the whole next section to this topic.
Scrapy’s Special Features
Link following: This is something that can technically be replicated to some degree by Selenium (it can detect and click on links), but Scrapy is far superior and makes link following much easier. You even get built-in options like “avoid duplicate links”, “set link depth”, etc.
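A link-following spider can be sketched as below, assuming Scrapy is installed. The start URL is a placeholder; `DEPTH_LIMIT` is the “set link depth” option mentioned above, and duplicate requests are filtered out by Scrapy automatically.

```python
# Sketch of a Scrapy CrawlSpider that follows every link it finds.
# The import is guarded so the sketch stays importable without Scrapy.
try:
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkSpider(CrawlSpider):
        name = "links"
        start_urls = ["https://example.com"]   # placeholder start page
        custom_settings = {"DEPTH_LIMIT": 2}   # "set link depth"
        # Follow every extracted link and pass each page to parse_item();
        # Scrapy's built-in dupefilter avoids revisiting duplicate links.
        rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

        def parse_item(self, response):
            yield {"url": response.url}
except ImportError:
    CrawlSpider = None  # Scrapy not installed; sketch only
```

Running it with `scrapy crawl links` would then visit the start page and every reachable link up to the configured depth.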
Rotating Proxies: Another very handy feature on Scrapy’s side is rotating proxies. You can use these to avoid the risk of being banned from a site due to a large number of requests. Basically, you rotate between a list of proxies for every request sent.
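The rotation idea itself is just round-robin selection, which can be sketched with the standard library; in Scrapy this logic would typically live in a downloader middleware (often provided by a third-party plugin). The proxy addresses below are placeholders.

```python
from itertools import cycle

# Hypothetical proxy list; the addresses are placeholders.
PROXIES = ["http://proxy1:8000", "http://proxy2:8000", "http://proxy3:8000"]
proxy_pool = cycle(PROXIES)  # endless round-robin iterator

def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(proxy_pool)

# Each outgoing request would be sent through a different proxy:
first, second = next_proxy(), next_proxy()
print(first, second)  # -> http://proxy1:8000 http://proxy2:8000
```

Because each request appears to come from a different address, no single proxy accumulates enough traffic to trip a site’s rate limits.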
Auto Throttle: The main reason why spiders get blocked is that they put extra load on the servers, especially when they send a lot of requests at once. The AutoThrottle setting causes Scrapy to automatically adjust its speed according to the load and traffic on the website it is targeting. This both keeps you safe from detection (less noticeable) and makes things easier for the website’s servers (due to the distributed load).
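Turning this on really is a matter of a few settings. The option names below are Scrapy’s actual AutoThrottle settings; the values are illustrative choices, not recommendations.

```python
# settings.py sketch: enable Scrapy's AutoThrottle extension.
# Setting names are real; the values here are illustrative.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay between requests (s)
AUTOTHROTTLE_MAX_DELAY = 10.0           # ceiling for the adaptive delay (s)
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # avg requests in flight per server
```

Scrapy then measures response latencies and tunes the delay between these bounds on its own.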
Concurrent Requests: Scrapy has the ability to send concurrent requests instead of sending them one by one. You can think of it as requests being sent in parallel. This is one of the many reasons why Scrapy is much faster than other scraping libraries.
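Concurrency is likewise controlled through settings. The setting names below are real Scrapy options; the values shown match Scrapy’s documented defaults.

```python
# settings.py sketch: concurrency knobs (values are Scrapy's defaults).
CONCURRENT_REQUESTS = 16             # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per individual target domain
```

Raising these numbers speeds up a crawl but increases the load on the target site, which is exactly what AutoThrottle (above) is meant to balance.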
There are more of course, but I think you get the gist of it. You could probably replicate the above features to some extent in other libraries, but why do that when all you have to do is write a few lines in Scrapy, or just turn on a single option?
Scrapy vs Selenium – Conclusion
You can think of them as two approaches to the same problem. Similar goals, but completely different ways of achieving that goal. That said though, when it comes to hard-core scraping, Scrapy is the way to go. Just because Selenium “can” do it, doesn’t mean it should be used.
The basic end result is that Selenium can do quite a number of things that Scrapy can’t (mostly non-web-scraping related), whereas Scrapy can do a lot of web scraping related things that Selenium cannot.
Since we are talking about web scraping here, Scrapy is the obvious winner. That doesn’t mean you should ignore Selenium though. It’s a pretty great framework that has its own uses, and it can be paired with Scrapy too. Expand your horizons and learn both if you can.
This marks the end of the Scrapy vs Selenium comparison. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.