Basic Scrapy tutorial

Have you tried to scrape a website using Python? If so, you have probably used BeautifulSoup. In this tutorial I will cover Scrapy for website scraping instead. Scrapy is an open source and collaborative framework for extracting the data you need from websites.

Let's start the tutorial

  • Step 1 : Make a virtual environment in Python

      python -m venv virtualenvname
    
  • Step 2 : Activate the virtual environment (the command below is for Windows; on Linux or macOS use source virtualenvname/bin/activate)

      virtualenvname\Scripts\activate
    
  • Step 3 : Install Scrapy

      pip install Scrapy

  • Step 4 : Make a scrapy project

      scrapy startproject tutorial
    
  • Step 5 : Go to the project folder (cd tutorial) and make a spider with scrapy genspider spidername websiteurl

      scrapy genspider pythonspider "https://en.wikipedia.org/wiki/Python_(programming_language)"
    

    You will see a file named pythonspider.py inside the spiders folder.

If you open that file, you will see some generated code:

import scrapy


class PythonspiderSpider(scrapy.Spider):
    name = "pythonspider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
        pass
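
In the generated code, allowed_domains holds only the host name of the URL given to genspider, while start_urls holds the full URL. Scrapy uses allowed_domains to keep the spider from following links to other sites. The sketch below reproduces that split with Python's standard library; it only illustrates how the two attributes relate and is not Scrapy's own implementation.

```python
from urllib.parse import urlparse

# The same URL that was passed to scrapy genspider.
start_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# netloc is the host part of the URL, without the scheme or the path.
domain = urlparse(start_url).netloc

allowed_domains = [domain]
start_urls = [start_url]

print(allowed_domains)  # ['en.wikipedia.org']
print(start_urls)
```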

We have to write our scraping code inside the parse function. Let's begin.

Go to https://en.wikipedia.org/wiki/Python_(programming_language) and notice the headline Python (programming language). We are going to extract this string from the page.

Inside the parse function:

import scrapy


class PythonspiderSpider(scrapy.Spider):
    name = "pythonspider"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Python_(programming_language)"]

    def parse(self, response):
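        # ::text selects the text inside the matched span; get() returns
        # the first match, or None if nothing matches.
        headline = response.css('span.mw-page-title-main::text').get()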
        print('------------------Output-------------------')
        print(headline)
        print('------------------Output-------------------')

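Now run the spider from the project folder with the command scrapy crawl pythonspider

You will see the headline printed between the two Output marker lines, among Scrapy's log messages.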