
Legislative_Data

Scraping, parsing, and entity/relation extraction tools for the texts of enacted legislation. Currently focused on the US (via congress.gov), the UK (legislation.gov.uk), Australia (legislation.gov.au), and Canada (parl.gc.ca). Under construction, so check back for updates!

Setup and dependencies

This repository is currently tested only under Python 2. Besides base Python, the Legislative_Data library also relies on Selenium (for webscraping), NLTK (for entity extraction), igraph/NetworkX (for network calculations and visualization), and wikipedia. Parsing and entity extraction functions are currently implemented for US legislation only, and further rely on constitute_tools.

Basic usage

Most library functions are wrapped through the collector.DataManager class. Initialize the class with a working directory as follows:

>>> from collector import DataManager
>>> manager = DataManager('/path/to/working_directory')

Library functions can be accessed through various wrapper functions:

>>> manager.update_data() # run the scrapers
>>> manager.append_parsed() # parse scraped data
>>> manager.append_auxiliary() # add auxiliary metadata from outside sources
>>> manager.extract_entites(write=True) # extract entities from parsed data, and optionally write to disk

As mentioned earlier, various subcomponents of these functions are under construction. Depending on build state, you may need to comment out some country-level parsers for the wrappers to run without error. Please message me if you have any questions!

Parsing and entity extraction

Overview

Currently, relations in Legislative_Data are defined using a co-mention approach: edges are drawn between extracted entities that co-occur within a given unit of analysis. In the United States, for example, the natural unit of analysis for legislative texts is the section (as articulated in the Office of the Law Revision Counsel's guidelines). As a result, splitting legislation into appropriate units of analysis (and cleaning extraneous text) is a critical step for the Legislative_Data library.
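
For intuition, the co-mention idea reduces to counting entity pairs within each unit of analysis. The sketch below is illustrative only: extract_entities_from stands in for any entity extractor (the library's own NLTK-based extraction lives in _country_entities_annual._EntityBase), and is not part of the library's API.

from itertools import combinations

def co_mention_edges(chunks, extract_entities_from):
    # chunks: list of strings, one unit of analysis (e.g. a section) each.
    # extract_entities_from: any callable returning a set of entity names.
    edges = {}
    for chunk in chunks:
        entities = sorted(extract_entities_from(chunk))
        # draw an edge between every pair of entities co-occurring in the chunk
        for pair in combinations(entities, 2):
            edges[pair] = edges.get(pair, 0) + 1
    return edges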

The parsing functions in Legislative_Data rely on the parser implemented in constitute_tools to segment legislation into units of analysis; the parser is called and applied through the _country_parsers_annual._CountryBase class (inherited by the country-specific classes in _country_parsers_annual). This parser cleans extraneous text, chunks documents into units of analysis (e.g. sections in the US case), and outputs a flat (csv-like) representation. The parsed text is then used by the entity extraction functions contained in _country_entities_annual._EntityBase.
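
To picture the flat representation, think of one row per unit of analysis. The layout below is an assumption for illustration, not the library's actual output schema, and the row contents are made up.

# Hypothetical csv-like layout: one row per section (US case).
# (document_id, section_id, section_text) are illustrative column names.
parsed_rows = [
    ('example_act', '1', 'SHORT TITLE. This Act may be cited as the Example Act.'),
    ('example_act', '2', 'DEFINITIONS. In this Act, the term...'),
]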

Customization

If the included parser is not appropriate for your application, you can input your own pre-segmented texts to the entity extraction tool as follows:

>>> from _country_entities_annual import _EntityBase as entity_manager
>>> chunks = ['We the People...', 'All legislative Powers...', ...] # format parsed text as list of strings
>>> manager = entity_manager(None) # null argument in place of parsed text
>>> manager.chunks = chunks
>>> edges = manager.do_entity_extraction()

This process assigns a dictionary to edges, which can then be saved to disk or manipulated. Here, chunks represents a single document; to process multiple documents, wrap this piece of code in a loop, as sketched below.
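
For example, a loop over several pre-segmented documents might look like the following, where documents and its contents are hypothetical placeholders:

>>> documents = {'doc_1': chunks_1, 'doc_2': chunks_2} # each value: a list of section strings
>>> all_edges = {}
>>> for doc_id, doc_chunks in documents.items():
...     doc_manager = entity_manager(None)
...     doc_manager.chunks = doc_chunks
...     all_edges[doc_id] = doc_manager.do_entity_extraction()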
