EuroPython 2015

Frontera: open source large-scale web crawling framework

In this poster session I’ll introduce Frontera, Scrapinghub’s new open source framework. Frontera lets you build both real-time distributed web crawlers and website-focused ones.

It offers:

  • customizable URL metadata storage (RDBMS or key-value based),
  • crawling strategy management,
  • transport layer abstraction,
  • fetcher abstraction.

Along with a description of the framework, I’ll demonstrate how to build a distributed crawler using Scrapy, Kafka and HBase, and hopefully present some statistics on the Spanish internet collected with the newly built crawler. Happy EuroPythoning!


Comments

  1. Is this a poster session or a talk?
    — Alexandre Savio,
  2. This is a poster session.
    — Alexander Sibiryakov,
