@owlfox
Created December 12, 2019 02:49
104 scrapy spider / crawler
# -*- coding: utf-8 -*-
import scrapy
import json


class CrawlerSpider(scrapy.Spider):
    name = 'crawler'
    allowed_domains = ['104.com.tw']
    start_urls = [
        'https://www.104.com.tw/company/ajax/joblist/5t7gcns?roleJobCat=0_0&area=0&page=1&pageSize=100&order=8&asc=0&',
        'https://www.104.com.tw/company/ajax/joblist/5t7gcns?roleJobCat=0_0&area=0&page=2&pageSize=100&order=8&asc=0&'
    ]

    def parse(self, response):
        # The ajax endpoint returns JSON; decode it and yield each job entry.
        result = json.loads(response.text)
        for item in result['data']['list']['normalJobs']:
            yield item
owlfox commented Dec 12, 2019

scrapy runspider crawler.py -o file.csv -t csv

  • make the company id an argument, for example: 5t7gcns
  • crawl pages dynamically instead of hard-coding each page URL
  • job requirements are not yet crawled
