How to web scrape on a schedule using GitHub Actions?


Hello everyone! It's been a long time since I last posted on this blog. I was busy graduating from college and moving places, and that didn't leave enough time for fun new projects. I'm still in the middle of it all, but I recently did a project that gave me an idea for a tutorial. I was asked to create a web scraper that would scrape a certain website, filter the data, and then send an email with that data. I'll show you what I was tasked with, how I developed it, and how I deployed it for free using GitHub Actions.

Note: You can find the corresponding GitHub repo here.

Requirements

I was asked if I could write an application that would monitor the website shown below, extract the tender information, and email only the latest tenders to a given email address. I explored the website and quickly agreed.

PPRA homepage

Plan of Action

I had come across some fun projects by the likes of Simon Willison and wanted to try out GitHub Actions for myself. What better way to test it than deploying a working scraper that automatically runs on a schedule and emails the scraped data?

The first step is to download the web page using Python and pull the table out into a more usable data structure. I opted for regex initially and then settled on Pandas for the initial extraction.

The next step is to email the parsed data. For that, I'm going to rely on a GitHub Action created by Dawid Dziurla.

Installing the required libraries

We will be making use of the following two libraries:

  • requests to download the web page
  • Pandas to parse the HTML table into a data frame

Go ahead and add both of these libraries to a requirements.txt file in the root of your project directory/repository and install them if you haven't already. Later on, we will be using this file to install these dependencies in the GitHub Actions execution environment.
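
For reference, a minimal requirements.txt could look like the sketch below. The exact contents are my assumption; I've only added lxml because pandas.read_html needs an HTML parser backend installed, so skip it if your environment already provides one.

requests
pandas
lxml  # my addition: pandas.read_html requires an HTML parser backend such as lxml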

Writing the scraper

This is by far the most time-consuming part of the whole process. However, it's fairly straightforward once you know what you are doing. The first step is to figure out the URL of the tenders page. The website uses AJAX, so navigating to the tenders page doesn't change the URL in the address bar. To find the actual URL, we can open up Developer Tools and look at the requests made while navigating to the tenders page.

Network Manager

This first URL is the direct URL of the tenders page, but it doesn't have a page number in it. To see how the page number is passed as well, we can navigate to the 2nd page of the tenders and check out the new request (notice the PageNo=2 at the end of the URL). We can now follow the pattern of this request to query for however many pages we want.

Network manager 2

At this point, we can write some Python code to download the page using requests.

import requests as r
html = r.get('https://ppra.org.pk/dad_tenders.asp?PageNo=1')
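
Since the only thing that changes between pages is the PageNo query parameter, the same call extends to any page. Here is a small sketch of that idea (not from the original post), with an optional status check thrown in before parsing anything:

base_url = "https://ppra.org.pk/dad_tenders.asp?PageNo="

# Download an arbitrary page by appending its number to the base URL
page = r.get(base_url + "2")
page.raise_for_status()  # optional: fail early if the request didn't succeed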

Now that we know how to query for different pages, we need to figure out how to parse the table. I initially tried using regex, as for such simple use cases it has always worked out pretty well for me. I know about this SO answer, but we are parsing a known, limited subset of HTML, so it's perfectly fine to use regex for that.

However, regex posed a few issues. It wasn't working exactly the way I wanted it to. I had forgotten about the re.DOTALL flag while writing my initial pattern, and my regex wasn't matching past the \n character. Instead of fixing this issue, I decided to give Pandas a try. I had heard that you could easily parse an HTML table using Pandas, but I had never tried it, so I thought I'd give that a try instead.
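
As a small illustration of what went wrong (my own example, not from the post): without re.DOTALL, the . in a pattern does not match newlines, so a pattern like the one below silently finds nothing in a multi-line cell:

import re

cell = "<td>first line\nsecond line</td>"
re.findall('<td>(.+?)</td>', cell)             # [] -- '.' stops at the newline
re.findall('<td>(.+?)</td>', cell, re.DOTALL)  # ['first line\nsecond line']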

My main motivation for taking on new projects is that I get to explore and learn new stuff and then write about it on this blog, so I actively try to opt for new methods even when I know the old method works perfectly fine. You'll never learn about new and better ways if you don't explore, right?

Using Pandas is extremely simple. We can use the following snippet to parse the tables in the document:

import pandas as pd
dfs = pd.read_html(html.text)

However, by only supplying the HTML to the read_html method, Pandas extracts all of the tables from the page. In my testing, this resulted in 4 tables.

>>> dfs
[                                                   0
0  Please Click on image to Download/View. Only 1...,                        0                                                  1
0  Tender Closing Date :  Month January  Feburay  March  April  May  Jun...,                   0
0  PPRA Ref No: TSE,             0                                                  1         2                3                      4
0   Tender No                                    Tender  Details  Download  Advertised Date           Closing Date
1   TS455000E  Pakistan Aeronautical Complex, Kamra Procureme...       NaN        30/7/2021   4/8/2021 10:30:00 AM
2   TS454835E  Pakistan State Oil, Karachi Miscellaneous Work...       NaN        30/7/2021    5/8/2021 2:15:00 PM
3   TS453722E  State Life Insurance Corporation of Pakistan, ...       NaN        30/7/2021              24/8/2021
4   TS453262E  Sui Southern Gas Company Limited, Karachi SSGC...       NaN        30/7/2021   23/8/2021 3:00:00 PM
5   TS455691E  National Database and Registration Authority N...       NaN        30/7/2021  16/8/2021 11:00:00 AM
6   TS453260E  Sui Southern Gas Company Limited, Karachi SSGC...       NaN        30/7/2021  23/8/2021 12:30:00 PM
7   TS456503E  National Heritage & Intrigation Division, Isla...       NaN        30/7/2021  24/8/2021 11:00:00 AM
 
...

To only extract the table we need, we can tell Pandas to extract the table with the width attribute set to 656. There is only one table in the whole HTML with that specific attribute value so it only results in 1 extracted table.

dfs = pd.read_html(html.text, attrs={'width': '656'})
>>> dfs
[            0                                                  1         2                3                      4
0   Tender No                                    Tender  Details  Download  Advertised Date           Closing Date
1   TS456131E  Sui Southern Gas Company Limited, Karachi Supp...       NaN        30/7/2021   25/8/2021 3:30:00 PM
2   TS456477E  National Accountability Bureau, Sukkur Service...       NaN        30/7/2021  16/8/2021 11:00:00 AM
3   TS456476E  Sukkur Electric Power Company (SEPCO), Sukkur ...       NaN        30/7/2021  12/8/2021 11:00:00 AM
4   TS456475E  Evacuee Trust Property Board, Multan Services ...       NaN        30/7/2021  17/8/2021 10:00:00 AM
5   TS456474E  Military Engineering Services (Navy), Karachi ...       NaN        30/7/2021  13/8/2021 11:30:00 AM
6   TS456473E  National University of Science and Technology,...       NaN        30/7/2021  17/8/2021 11:00:00 AM
7   TS456490E  Shaikh Zayed Hospital, Lahore Miscellaneous Wo...       NaN        30/7/2021  19/8/2021 11:00:00 AM
8   TS456478E  Ministry of Religious Affairs & Interfaith Har...       NaN        30/7/2021  16/8/2021 11:00:00 AM
9   TS456489E  Cantonment Board, Rawalpindi Services Required...       NaN        30/7/2021  17/8/2021 12:00:00 PM
10  TS456480E  National Bank of Pakistan, Lahore Miscellaneou...       NaN        30/7/2021  16/8/2021 11:00:00 AM
11  TS456481E  National Bank of Pakistan, Lahore Miscellaneou...       NaN        30/7/2021  16/8/2021 11:00:00 AM

...

There are still a few issues with this extraction.

  • Pandas isn't able to automatically extract the header for our table
  • Advertised Date column data isn’t parsed as a date
  • Download column is all NaNs

To fix the first issue, we can pass in the header parameter to read_html and Pandas will make the respective row the header of the table. The second issue can be fixed by passing in the parse_dates parameter and Pandas will parse the data in the respective column as dates. There are multiple ways to resolve the third issue. I ended up using regex to extract the download links into a list and then assigning that list to the Download column in our data frame.

The read_html method call looks something like this after fixing the first two issues:

dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0, parse_dates=['Advertised Date'])

The regex for extracting the Download links and assigning them to the Download column looks like this:

download_links = re.findall('<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>',html.text)
download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]
tender_table = dfs[0]
tender_table['Download'] = download_links

The extracted links are relative, so we use a list comprehension to prepend the actual URL to each extracted link. I'm converting the link into an anchor tag and wrapping an image in it for two reasons. Firstly, it will look good in our final email, and secondly, the table will look similar to what our client is used to seeing on the website, so there will be less visual fatigue while looking at the email.

The client asked me to generate emails with the tenders for the latest date only. The tenders for the latest date often span multiple pages, so we can put all of the code we have so far into a separate function and then pass page numbers to that function to extract the tables from a specific page.

The code with the function looks like this:

import requests as r
import pandas as pd
import re

url_template = "https://ppra.org.pk/dad_tenders.asp?PageNo="

def download_parse_table(url):
    html = r.get(url)
    dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0, parse_dates=['Advertised Date'])
    download_links = re.findall('<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>',html.text)
    download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]
    tender_table = dfs[0]
    tender_table['Download'] = download_links
    return tender_table

To put all of the extracted tables into one data frame, we need to put them all into a list and use the pd.concat method. The code for that looks like this:

combined_df = []
for index in range(1,8):
    df = download_parse_table(url_template+str(index))
    combined_df.append(df)

combined_df = pd.concat(combined_df)
>>> combined_df
    Tender No                                    Tender  Details                                           Download Advertised Date           Closing Date
0   TS455000E  Pakistan Aeronautical Complex, Kamra Procureme...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-30   4/8/2021 10:30:00 AM
1   TS454835E  Pakistan State Oil, Karachi Miscellaneous Work...  <a href="https://ppra.org.pk/doc/30-7/42pso307...      2021-07-30    5/8/2021 2:15:00 PM
2   TS453722E  State Life Insurance Corporation of Pakistan, ...  <a href="https://ppra.org.pk/doc/30-7/42life30...      2021-07-30              24/8/2021
3   TS453262E  Sui Southern Gas Company Limited, Karachi SSGC...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-30   23/8/2021 3:00:00 PM
4   TS455691E  National Database and Registration Authority N...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-30  16/8/2021 11:00:00 AM
..        ...                                                ...                                                ...             ...                    ...
25  TS456443E  Civil Aviation Authority, Karachi TENDER NOTIC...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-29  13/8/2021 11:00:00 AM
26  TS454178E  Zarai Taraqiati Bank Ltd (ZTBL), Islamabad Inf...  <a href="https://ppra.org.pk/doc/28-7/ztb287-1...      2021-07-28  10/8/2021 11:00:00 AM
27  TS454566E  Sui Northern Gas Pipelines Limited, Lahore Rel...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-28    1/9/2021 3:30:00 PM
28  TS455579E  Pakistan Ordnance Factories, Wah Cantt Repair ...  <a href="https://ppra.org.pk/doc/28-7/pof287-4...      2021-07-28   4/8/2021 10:20:00 AM
29  TS455365E  Pakistan National Shipping Corporation (PNSC),...  <a href="https://ppra.org.pk/download.asp?tend...      2021-07-28  24/8/2021 11:00:00 AM

[210 rows x 5 columns]

It looks mostly fine, but there is one issue. The index is preserved after the concat. I want to reset it so that it goes from 0-209 instead of 0-29 multiple times. This is also easy to accomplish. We just need to modify the concat method call like this:

combined_df = pd.concat(combined_df).reset_index(drop=True)
>>> combined_df
     Tender No                                    Tender  Details                                           Download Advertised Date           Closing Date
0    TS455000E  Pakistan Aeronautical Complex, Kamra Procureme...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-30   4/8/2021 10:30:00 AM
1    TS454835E  Pakistan State Oil, Karachi Miscellaneous Work...  <a href='https://ppra.org.pk/doc/30-7/42pso307...      2021-07-30    5/8/2021 2:15:00 PM
2    TS453722E  State Life Insurance Corporation of Pakistan, ...  <a href='https://ppra.org.pk/doc/30-7/42life30...      2021-07-30              24/8/2021
3    TS453262E  Sui Southern Gas Company Limited, Karachi SSGC...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-30   23/8/2021 3:00:00 PM
4    TS455691E  National Database and Registration Authority N...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-30  16/8/2021 11:00:00 AM
..         ...                                                ...                                                ...             ...                    ...
205  TS456443E  Civil Aviation Authority, Karachi TENDER NOTIC...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-29  13/8/2021 11:00:00 AM
206  TS454178E  Zarai Taraqiati Bank Ltd (ZTBL), Islamabad Inf...  <a href='https://ppra.org.pk/doc/28-7/ztb287-1...      2021-07-28  10/8/2021 11:00:00 AM
207  TS454566E  Sui Northern Gas Pipelines Limited, Lahore Rel...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-28    1/9/2021 3:30:00 PM
208  TS455579E  Pakistan Ordnance Factories, Wah Cantt Repair ...  <a href='https://ppra.org.pk/doc/28-7/pof287-4...      2021-07-28   4/8/2021 10:20:00 AM
209  TS455365E  Pakistan National Shipping Corporation (PNSC),...  <a href='https://ppra.org.pk/download.asp?tend...      2021-07-28  24/8/2021 11:00:00 AM

[210 rows x 5 columns]

This looks much better!

Next, we need to filter this data for the latest date. We can do that using just two lines:

latest_date = combined_df.iloc[0]['Advertised Date']
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]

We first extract the latest date, which is the Advertised Date of the first item in the table, and then filter the rest of the table using that value.
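
This works because the site lists tenders newest-first, so the first row carries the latest Advertised Date. If you'd rather not rely on that ordering, a slightly more defensive variant (my suggestion, not what the original code does) is to take the maximum of the parsed dates instead:

# Assumes 'Advertised Date' was parsed into datetimes via parse_dates
latest_date = combined_df['Advertised Date'].max()
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]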

Now we can convert this into an HTML table. Pandas doesn't create a full HTML document, so we need to create one ourselves and then embed the Pandas output in it:

html_string = """
    <html>
    <head><title>Latest PPRA Tenders</title></head>
    <body>
        <style>
        table {
            border-collapse: collapse;
            border: 1px solid silver;
        }
        table tr:nth-child(even) {
            background: #E0E0E0;
        }
        </style>
        %s
    </body>
    </html>
"""

table_html = filtered_df.to_html(index=False, render_links=True, justify="center", 
    escape=False, border=4)
with open('ppra.html', 'w') as f:
    f.write(html_string %(table_html))

I used the %s string interpolation method because I have some <style> tags with {} braces in them, and that confuses Python if I use f-strings.
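
To illustrate the brace problem (my example, not from the post): inside an f-string, every literal { and } in the CSS would have to be doubled, while old-style % interpolation leaves them alone:

css_rule = "table { border-collapse: collapse; }"

# Old-style interpolation: braces in the template are left untouched
print("<style>%s</style>" % css_rule)

# The f-string equivalent needs every literal brace doubled to escape it
print(f"<style>table {{ border-collapse: collapse; }}</style>")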

After running what we have so far, the output (ppra.html) looks like this:

Output Table

There is one issue, though: the Tender Details column looks very cluttered. Instead of putting the details data in a single font-weight and on the same line, we need to break it up the way the original website does. The easiest way to do that is to extract the details using regex and then replace the details in the data frame with the extracted ones.

The regex for the details extraction looks something like this:

details = re.findall('<td bgcolor="(?:.+?)" width="305">(.+?)</td>', html.text, re.DOTALL)

The detail cells have a specific width, so we extract the data based on that. We use re.DOTALL because the details span multiple lines and we want . to match the carriage return (\r) and newline (\n) characters as well. The extracted details contain \r\n sequences, and we can get rid of them using this list comprehension:

details = [detail.replace('\r\n','') for detail in details]

Let's assign this details list to our data frame:

tender_table["Tender  Details"] = particulars

Final scraper code

The final scraper code looks like this:

import requests as r
import pandas as pd
import re

url_template = "https://ppra.org.pk/dad_tenders.asp?PageNo="
html_string = """
    <html>
    <head><title>Latest PPRA Tenders</title></head>
    <body>
        <style>
        table {
            border-collapse: collapse;
            border: 1px solid silver;
        }
        table tr:nth-child(even) {
            background: #E0E0E0;
        }
        </style>
        %s
    </body>
    </html>
"""

def download_parse_table(url):
    html = r.get(url)
    details = re.findall('<td bgcolor="(?:.+?)" width="305">(.+?)</td>', html.text, re.DOTALL)
    details = [detail.replace('\r\n','') for detail in details]
    dfs = pd.read_html(html.text, attrs={'width': '656'}, header=0, 
        parse_dates=['Advertised Date'])
    download_links = re.findall(
        '<a target="_blank" href="(.+?)"><img border="0" src="images/(?:.+?)"></a>',
        html.text)
    download_links = ["<a href='https://ppra.org.pk/"+link+"' style='display: block;text-align: center;'> <img src='https://ppra.org.pk/images/download_icon.gif'/></a>" for link in download_links]
    tender_table = dfs[0]
    tender_table['Download'] = download_links
    tender_table["Tender  Details"] = particulars
    return tender_table

combined_df = []
for index in range(1,8):
    df = download_parse_table(url_template+str(index))
    combined_df.append(df)

combined_df = pd.concat(combined_df).reset_index(drop=True)
latest_date = combined_df.iloc[0]['Advertised Date']
filtered_df = combined_df[combined_df['Advertised Date'] == latest_date]

table_html = filtered_df.to_html(index=False, render_links=True, justify="center", 
    escape=False, border=4)
with open('ppra.html', 'w') as f:
    f.write(html_string %(table_html))

Getting started with GitHub Actions

I'm going to keep this intro to GitHub Actions very short. GitHub Actions has a concept of workflows. Actions will execute workflows. These workflows live inside the .github/workflows folder in the root of the repo and list the steps we want Actions to execute. I went ahead and created a .github/workflows folder in my project root and then created a scrape.yml file inside the workflows folder. GitHub Actions will make more sense if I show you the complete YAML file and then explain it.

The contents of the scrape.yml file are:

name: Scrape

on:
  schedule:
    - cron: "0 4 * * *"
  workflow_dispatch:

env:
  ACTIONS_ALLOW_UNSECURE_COMMANDS: true

jobs:
  scrape-latest:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repo
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.7'
      - name: Install requirements
        run: pip install -r requirements.txt
      - name: Run Scraper
        run: python scraper.py
      - name: Set env vars
        run: |
          echo "DATE=$(python -c 'import datetime as dt; print(dt.datetime.now().date())')" >> $GITHUB_ENV
      - name: Send mail
        uses: dawidd6/action-send-mail@v3
        with:
          server_address: smtp.gmail.com
          server_port: 465
          username: ${{secrets.MAIL_USERNAME}}
          password: ${{secrets.MAIL_PASSWORD}}
          subject: Latest PPRA tenders for ${{env.DATE}}
          to: [email protected]
          from: Automated Email
          ignore_cert: true
          secure: true
          html_body: file://ppra.html

We start by naming the Action. In our case, we named it Scrape. Next, we tell GitHub when to execute this action. The first way is via a cron schedule, and the second is via the online workflow dispatcher. The cron value is similar to the crontab you might have used on Linux. You can use Crontab Guru to explore crontabs. The one I'm using will cause the workflow to run every day at 4 o'clock UTC. The workflow_dispatch trigger is used just for testing. This way we don't have to wait until 4 o'clock just to test it and can trigger the workflow manually using the web interface.

Next, we create an environment variable that our execution environment will have access to. ACTIONS_ALLOW_UNSECURE_COMMANDS is required for Python on GitHub due to a bug. I'm not sure whether it has been fixed or not. Afterward, we set up Python, install the requirements, and run the scraper. Then we set the DATE variable to the current server date. This will be used in the subject of our email.

For the email-sending part, I'm using the awesome send-mail action, which makes the whole process super simple. I provide it with my Gmail username and password, point it to the generated HTML file, and it automatically sends the email.

We also need to make sure we configure the secrets on the repository settings page so that the send-mail action has access to MAIL_USERNAME and MAIL_PASSWORD.

GitHub Repository settings

We can test the execution by pushing everything to GitHub and then going to the Actions tab of the repo. From there we can select our workflow and manually trigger it.

Github action workflow

GitHub Actions Tips

The Action has been running successfully for a few days now. I've fallen in love with the simplicity of Actions. The only major issue is that you shouldn't use the free version for anything time-sensitive. The action execution is almost always delayed by a few minutes when using the scheduler. For most simple cases it's fine, though. I'm also not sure if the situation is any different under the paid plan. However, if you dispatch the workflow using the web interface, it mostly gets scheduled for execution right away.

I will use Actions again in the future when I have similar projects. The one thing to keep in mind is that Actions only have a limited number of free minutes. If you go beyond this limit, you will have to pay, and the price varies based on the operating system. The free account is limited to 2000 minutes per month.

Conclusion

I hope you all enjoyed this tutorial. You can find the corresponding GitHub repo here. I had a ton of fun working on it. If you have any questions, comments, or suggestions, please feel free to comment below or send me an email (hi @ yasoob.me) or a tweet. I do plan on writing about a bunch of other stuff, but time is at a premium these days. However, I'm hoping the situation will change soon. I'll see you in the next article ❤️


