How to structure multiple parsers? #227

dogweather (Collaborator) · 2022-09-23T03:55:33Z

I've read the docs, but I'm still a little unsure about it. Say I have two page types: a home page and item pages. My non-framework way to handle this is:

```elixir
defmodule Spider do
  use Crawly.Spider

  @home_page "https://www.oregonlegislature.gov/bills_laws/Pages/ORS.aspx"

  # ...

  # Home page: exact URL match, so this clause must come first.
  @impl Crawly.Spider
  def parse_item(%{status_code: 200, request_url: @home_page} = response) do
    Parser.parse_home_page(response)
  end

  # Every other successful response is treated as a chapter page.
  def parse_item(%{status_code: 200} = response) do
    Parser.parse_chapter_page(response)
  end
end
```
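
To make the clause selection easy to try in `iex`, here is a minimal sketch of the same dispatch without Crawly. It assumes plain maps in place of real responses, and the module name `DispatchSketch` and the returned atoms are hypothetical stand-ins for the actual parsers.

```elixir
defmodule DispatchSketch do
  # Hypothetical stand-in for the spider above: plain maps replace
  # Crawly responses, and atoms replace the parser results.
  @home_page "https://www.oregonlegislature.gov/bills_laws/Pages/ORS.aspx"

  # Most specific clause first: status 200 and the exact home page URL.
  def parse_item(%{status_code: 200, request_url: @home_page}), do: :home_page

  # Any other 200 response falls through to the chapter parser.
  def parse_item(%{status_code: 200}), do: :chapter_page
end
```

Clauses are tried top to bottom, so swapping the two would send the home page to the chapter parser. A response with any other status code matches neither clause and raises `FunctionClauseError`.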

So, I'm using plain Elixir pattern matching on response properties to choose a parser. What would this code look like if it were implemented using Response Parsers? Could someone expand on the expected return type? (Maybe Crawly would benefit from a ResponseParser behaviour?) The docs say:

> Must return a Map on the first tuple position...

So it must return a tuple and the first item must be a ParsedItem?

How should Response Parsers choose to not process a Response? I'm guessing by just returning an empty ParsedItem? Will the framework call each Response Parser in turn?

dogweather (Collaborator, Author) · 2023-12-13T21:50:17Z

Just following up with the solution I'm currently using:

```elixir
  # Logger.info/1 is a macro, so the enclosing module needs this.
  require Logger

  @ors_home_page "https://www.oregonlegislature.gov/bills_laws/Pages/ORS.aspx"
  @chapter_root "https://www.oregonlegislature.gov/bills_laws/ors/ors"
  @anno_root "https://www.oregonlegislature.gov/bills_laws/ors/ano"

  @impl Crawly.Spider
  def base_url(), do: "https://www.oregonlegislature.gov/"

  @impl Crawly.Spider
  def init() do
    [start_urls: [@ors_home_page]]
  end

  # Home page: exact URL match.
  @impl Crawly.Spider
  def parse_item(%{request_url: @ors_home_page} = response) do
    Logger.info("Parsing #{response.request_url}...")
    Parser.parse_home_page(response)
  end

  # Chapter files: any URL that starts with @chapter_root.
  def parse_item(%{request_url: @chapter_root <> _} = response) do
    Logger.info("Parsing #{response.request_url}...")
    ChapterFile.parse(response)
  end

  # Annotation files: any URL that starts with @anno_root.
  def parse_item(%{request_url: @anno_root <> _} = response) do
    Logger.info("Parsing #{response.request_url}...")
    AnnotationFile.parse(response)
  end
```
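
The prefix matching can be tried in isolation with the sketch below. It assumes plain maps in place of Crawly responses. The module name `PrefixDispatchSketch`, the function name `route`, the returned atoms, and the final catch-all clause are all hypothetical additions for illustration and are not part of the spider above.

```elixir
defmodule PrefixDispatchSketch do
  # Hypothetical module: plain maps stand in for Crawly responses and
  # atoms stand in for the results of the real parsers.
  @chapter_root "https://www.oregonlegislature.gov/bills_laws/ors/ors"
  @anno_root "https://www.oregonlegislature.gov/bills_laws/ors/ano"

  # `prefix <> rest` in a function head matches any binary that starts
  # with the prefix; the module attribute expands to a literal string.
  def route(%{request_url: @chapter_root <> _rest}), do: :chapter_file
  def route(%{request_url: @anno_root <> _rest}), do: :annotation_file

  # Catch-all (an assumption, not in the original): without it, a URL
  # outside both roots raises FunctionClauseError.
  def route(%{request_url: _other}), do: :unhandled
end
```

Whether a catch-all is wanted depends on the crawl: letting an unexpected URL raise makes routing mistakes visible early, while a fallback clause keeps the spider running.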
