Zé The Scraper

Install

Install Berkeley DB

Limitations

Articles (article) are listed in order of collection date (dateCreated). However, an article can be flagged as updated and collected again, which causes the collection date and the publication date (datePublished) to diverge.
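One way to work around this is sketched below. It assumes each article is a dict carrying ISO 8601 `dateCreated` and `datePublished` strings; only those two field names come from this README, while the sample records, the helper names, and the one-day threshold are hypothetical. The idea is to order by publication date instead and to flag items whose collection date lags far behind publication.

```python
from datetime import datetime, timedelta

# Hypothetical sample records; only the field names come from this README.
articles = [
    {"url": "a", "datePublished": "2018-01-10T08:00:00", "dateCreated": "2018-01-10T09:00:00"},
    {"url": "b", "datePublished": "2018-01-05T08:00:00", "dateCreated": "2018-01-12T09:00:00"},
    {"url": "c", "datePublished": "2018-01-11T08:00:00", "dateCreated": "2018-01-11T08:30:00"},
]


def parse(ts):
    """Parse an ISO 8601 timestamp without timezone information."""
    return datetime.fromisoformat(ts)


def by_publication(items):
    """Return the items ordered by publication date, oldest first."""
    return sorted(items, key=lambda a: parse(a["datePublished"]))


def recollected(items, threshold=timedelta(days=1)):
    """Return items collected more than `threshold` after publication."""
    return [a for a in items
            if parse(a["dateCreated"]) - parse(a["datePublished"]) > threshold]


print([a["url"] for a in by_publication(articles)])  # ['b', 'a', 'c']
print([a["url"] for a in recollected(articles)])     # ['b']
```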

Usage

Crawling using a single spider and a single URL

scrapy crawl <spider_name> -a url='http(s)://someurl.com?query1=a&query2=b'

Crawling using a single spider with URLs extracted from Google

scrapy crawl <spider_name> -a search='{
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"",
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
  "engine": "google",
  "dateRestrict": "d1",
  "results_per_page": 50,
  "pages": 2
}'
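The sketch below illustrates how a `search` payload of this shape could be decoded and how its `regex` could be used to filter results. It uses a payload with the same keys as the command above; `parse_search`, `keep_matching`, and the sample titles are hypothetical and are not taken from the project's code.

```python
import json
import re

# A payload with the same keys as the command above.
raw = r'''{
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"",
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
  "engine": "google",
  "dateRestrict": "d1",
  "results_per_page": 50,
  "pages": 2
}'''


def parse_search(raw_json):
    """Decode the search argument and compile its regex (hypothetical helper)."""
    params = json.loads(raw_json)
    params["regex"] = re.compile(params["regex"])
    return params


def keep_matching(titles, pattern):
    """Keep only the titles the pattern matches anywhere."""
    return [t for t in titles if pattern.search(t)]


search = parse_search(raw)

# Hypothetical result titles, standing in for what a search engine returns.
titles = [
    "Inscrições do ENEM abrem hoje",
    "Previsão do tempo",
    "Exame Nacional do Ensino Medio muda",
]
print(keep_matching(titles, search["regex"]))
```

Because the pattern starts with `(?i)`, it matches "ENEM" regardless of case, and `Mé?e?dio` accepts both the accented and unaccented spelling.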

Crawling using all spiders with URLs extracted from Google

scrapy crawl all -a search='{
  "query": "Enem OR \"Exame Nacional * Ensino Médio\"",
  "regex": "(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
  "engine": "google",
  "dateRestrict": "d1",
  "results_per_page": 50,
  "pages": 2
}'

scrapy crawl all \
-a search=google \
-a query="Enem OR \"Exame Nacional * Ensino Médio\"" \
-a regex="(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio" \
-a dateRestrict=d1
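Quoting these arguments by hand is error-prone, so when the crawl is launched from a script it may help to let Python build the command line. The following is a minimal sketch using only the standard library; `build_crawl_command` is a hypothetical helper, not part of this project, and it only constructs the command string without running it.

```python
import json
import shlex


def build_crawl_command(spider, **spider_args):
    """Return a shell-safe `scrapy crawl` command line (hypothetical helper).

    Dict values are serialized as JSON; everything else is passed as text.
    """
    argv = ["scrapy", "crawl", spider]
    for key, value in spider_args.items():
        if isinstance(value, dict):
            value = json.dumps(value, ensure_ascii=False)
        argv += ["-a", f"{key}={value}"]
    # shlex.join quotes every argument that needs it (Python 3.8+).
    return shlex.join(argv)


cmd = build_crawl_command(
    "all",
    search="google",
    query='Enem OR "Exame Nacional * Ensino Médio"',
    regex="(?i)Enem|Exame.{0,}Nacional.{0,}Ensino.{0,}Mé?e?dio",
    dateRestrict="d1",
)
print(cmd)
```

Splitting the result with `shlex.split` gives back the original arguments, which is a quick way to check that nothing was mangled by quoting.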

References

http://xpo6.com/list-of-english-stop-words/
Scrapy - Docs | Jobs: pausing and resuming crawls
[scrapy.extensions.memusage](https://github.com/scrapy/scrapy/blob/master/scrapy/extensions/memusage.py): a good extension to build on. Override the _send_report method to send reports to services other than email.

TODO:

Implement DeltaFetch middleware
Decompose elements with class .n--noticia__newsletter in the estadao spider
Use https://github.com/codelucas/newspaper
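For the DeltaFetch item, a minimal settings sketch follows. It assumes the `scrapy-deltafetch` package is the implementation meant here; the middleware path, the order value, and the setting name are the ones that package documents, and none of this is confirmed by the project's current settings.

```python
# settings.py (sketch): enable DeltaFetch so requests for pages that
# already produced items in a previous run are skipped.
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
```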

Ideas

Relational DB Schema

https://cloud.google.com/bigtable/docs/schema-design

Use this:

lambda

from enum import Enum

from itemloaders.processors import TakeFirst  # scrapy.loader.processors on older Scrapy


class AVRO_FIELD_TYPE(Enum):
    # Avro field type for each supported Python type name
    str = 'STRING'
    list = 'RECORD'
    int = 'INTEGER'
    bool = 'BOOLEAN'


# Build the 'avro' schema entry: field type, mode and nested fields
f_avro = lambda ft, md='NULLABLE', fd=None: { 'avro': {
    # 'field_type': ft.upper() if ft else AVRO_FIELD_TYPE[type(ft)],
    'field_type': ft.upper(),
    'mode': md,
    'fields': fd if fd is not None else [] } }


# The accessors below belong inside a dict-based field class
@property
def identifier(self):
    # Default to TakeFirst() when no output processor is set
    self['output_processor'] = self.get('output_processor') or TakeFirst()
    if 'schemas' not in self:
        self['schemas'] = f_avro('STRING', 'NULLABLE', [])

    return self

@identifier.setter
def identifier(self, value):
    self['output_processor'] = self.get('output_processor') or TakeFirst()
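To make the idea above concrete, here is a self-contained sketch of how the enum and `f_avro` could fit together on a dict-based field. `SchemaField`, `AvroFieldType`, and `take_first` are hypothetical stand-ins (the last one imitates Scrapy's `TakeFirst` processor so the example runs without Scrapy installed). It is one possible reading of the snippet, not the project's actual code.

```python
from enum import Enum


class AvroFieldType(Enum):
    """Avro-style type names keyed by the upper-cased Python type name."""
    STR = "STRING"
    LIST = "RECORD"
    INT = "INTEGER"
    BOOL = "BOOLEAN"


def f_avro(field_type, mode="NULLABLE", fields=None):
    """Build the 'avro' schema entry; `fields` holds nested record fields."""
    return {"avro": {"field_type": field_type.upper(),
                     "mode": mode,
                     "fields": list(fields or [])}}


def take_first(values):
    """Stand-in for TakeFirst: return the first non-null, non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


class SchemaField(dict):
    """Hypothetical dict-based field that fills in defaults on creation."""

    def __init__(self, py_type=str, **kwargs):
        super().__init__(**kwargs)
        # Explicit keyword arguments win; defaults only fill the gaps.
        self.setdefault("output_processor", take_first)
        avro_type = AvroFieldType[py_type.__name__.upper()].value
        self.setdefault("schemas", f_avro(avro_type))


identifier = SchemaField()
pages = SchemaField(int, schemas=f_avro("integer", "REQUIRED"))
print(identifier["schemas"])
print(pages["schemas"])
```

Setting the defaults once in the constructor avoids mutating the field every time a property is read, which is what the accessor-based draft does.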
