Fail to parse agenda questions #2

Open
comsaint opened this issue May 21, 2015 · 4 comments

@comsaint
Owner

Multiple issues with agenda questions:

  • Questions in the agenda for 2014.06.18 are scraped successfully and show up in the parsed source documents, but some of them are missing from the view (English version only).
  • When a question has more than one responder, the current view shows only one of them.
  • On at least one occasion the content of a question is missing from the view entirely, even though the content exists in the parsed source document (in both languages) and the question entry itself exists in the view. See Q5 in the agenda for 2014.10.15.
  • It may be a good idea to add a responders field to the RawCouncilQuestion model.
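
A rough sketch of what a responders field could look like, so that every responder is kept rather than only one. QuestionRecord is a hypothetical stand-in for RawCouncilQuestion, and all field names here are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuestionRecord:
    """Hypothetical stand-in for RawCouncilQuestion; field names are assumptions."""
    number: int
    asker: str
    # A list keeps every responder instead of a single value.
    responders: List[str] = field(default_factory=list)


def responders_display(question):
    """Join all responders for display in a view."""
    return ", ".join(question.responders)


q = QuestionRecord(number=5, asker="Member A",
                   responders=["Official X", "Official Y"])
print(responders_display(q))
```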
@comsaint
Owner Author

A side note: the RawCouncilQuestion instances saved in the database come from scraping webpages such as http://www.legco.gov.hk/yr13-14/english/counmtg/question/ques1314.htm, where each question links to its respective agenda (and reply), while the issues mentioned above come from parsing the agendas directly. These hyperlinks all seem to contain an anchor for a question, which may be a good starting point for breaking down the issue.
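
As a minimal sketch of the anchor idea, the fragment can be split off a question hyperlink with the standard library. The example URL and the anchor name q_5 are made up for illustration; the real anchor format on the agenda pages has not been checked here.

```python
from urllib.parse import urldefrag


def split_question_anchor(href):
    """Return (agenda_url, anchor); anchor is None when the link has no fragment."""
    url, fragment = urldefrag(href)
    return url, (fragment or None)


# Hypothetical link; the real agenda URLs and anchor names may differ.
print(split_question_anchor("http://example.org/agenda/cm20140618.htm#q_5"))
```

The anchor could then be used to look up the matching element in the parsed agenda instead of relying on the position of the question in the page.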

@comsaint
Owner Author

There is a reply link alongside each question on, for example, Legco 13-14. It returns a well-structured HTML page that contains both the question and the reply.
Since we will need to scrape the replies eventually, we could move the creation of question instances there, i.e. scrape both questions and replies from that same page. The drawbacks are:

  • We have to parse the agendas anyway (they contain more than questions), so this would duplicate work.
  • There may be inconsistencies (due to parsing errors) between the questions from the agenda, the reply page above, and the Hansard (in future).

Older questions (from the 2005-2006 session and earlier) do not have such a reply link, so we need to parse the Hansard for them instead. Since we need to parse the Hansard anyway, this adds no extra work.
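
The choice of source by session could be sketched as below. The function name, the return labels, and the convention of identifying a session by its starting year (so 2005-2006 is 2005) are all assumptions made for illustration.

```python
# Assumption: the 2005-2006 session (start year 2005) is the last one
# without a reply link, as noted above.
LAST_SESSION_WITHOUT_REPLY_LINK = 2005


def reply_source(session_start_year):
    """Return which source to parse for the question and reply text."""
    if session_start_year <= LAST_SESSION_WITHOUT_REPLY_LINK:
        return "hansard"
    return "reply_page"


print(reply_source(2005), reply_source(2013))
```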

@comsaint
Owner Author

Found a note in raw.models.parsed.QuestionManager.populate() that suggests the same idea as above.
