Cliffordchance openedgar #17

Open
wants to merge 16 commits into
base: master
5 changes: 5 additions & 0 deletions .dockerignore
@@ -0,0 +1,5 @@
# ignore VCS, cache, IDE and data folders
.git
.cache
.idea
data
219 changes: 110 additions & 109 deletions .gitignore
@@ -1,109 +1,110 @@
*#
*~
.idea/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
ve/
venv/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

benchmarks
*#
*~
.idea/
data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
ve/
venv/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# dotenv
.env

# virtualenv
.venv
venv/
ENV/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

benchmarks
38 changes: 38 additions & 0 deletions Dockerfile
@@ -0,0 +1,38 @@
# CC Specific Dockerfile implementing steps at https://github.com/LexPredict/openedgar/blob/master/INSTALL.md
# Allows the use of OpenEDGAR in AKS
FROM ubuntu:18.04
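# MAINTAINER is deprecated in current Docker releases; a LABEL carries the same information
LABEL maintainer="Michael Seddon ([email protected])"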

# Environment variables
ENV DEBIAN_FRONTEND=noninteractive

# Package installation
# Update and install in a single layer so that a cached package index is never reused
# rabbitmq-server is to be removed when rabbit is in its own container
RUN apt-get update && apt-get upgrade -y && apt-get install -y \
    software-properties-common build-essential python3-dev python3-pip virtualenv git-all \
    rabbitmq-server \
    openjdk-8-jdk

# Create the OpenEDGAR directory (the sources are copied in below, not cloned)
WORKDIR /opt
RUN mkdir /opt/openedgar

# Set up Python venv
WORKDIR /opt/openedgar/
RUN virtualenv -p /usr/bin/python3 env
COPY lexpredict_openedgar/requirements/full.txt lexpredict_openedgar/requirements/full.txt
RUN ./env/bin/pip install -r lexpredict_openedgar/requirements/full.txt
RUN ./env/bin/pip install azure-mgmt-resource azure-mgmt-datalake-store azure-datalake-store azure-storage-blob
COPY tika/tika-server-1.21.jar /opt/openedgar/tika/tika-server-1.21.jar
COPY lexpredict_openedgar/ /opt/openedgar/lexpredict_openedgar/

COPY docker/default.env /opt/openedgar/
RUN cp lexpredict_openedgar/sample.env lexpredict_openedgar/.env
#COPY docker/erlang-solutions_1.0_all.deb lexpredict_openedgar/erlang-solutions_1.0_all.deb
COPY docker/oe-entrypoint.sh /usr/local/bin/
COPY docker/run_edgar.py /opt/openedgar/lexpredict_openedgar/run_edgar.py
COPY docker/dot_env.sh /opt/openedgar
RUN mkdir /data

ENTRYPOINT ["oe-entrypoint.sh"]
38 changes: 38 additions & 0 deletions docker/README.md
@@ -0,0 +1,38 @@
# Dockerisation
The image takes its default parameters from [default.env](default.env).

All variables can be overridden at runtime by passing environment variables.
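
The override behaviour can be illustrated with a minimal sketch. It assumes the defaults are resolved by a POSIX shell using the `${VAR:="default"}` form seen in `default.env`; `resolve_defaults` is a hypothetical helper for illustration and is not part of the repository.

```shell
#!/bin/sh
# Minimal sketch (assumption): resolve two variables the way a
# default.env-style file would when it is evaluated by a POSIX shell.
# resolve_defaults is a hypothetical helper, not part of the repository.
resolve_defaults() {
    # ${VAR:=default} assigns the default only when VAR is unset or empty.
    CLIENT_TYPE=${CLIENT_TYPE:="Local"}
    EDGAR_YEAR=${EDGAR_YEAR:="2015"}
    printf 'CLIENT_TYPE=%s EDGAR_YEAR=%s\n' "$CLIENT_TYPE" "$EDGAR_YEAR"
}

# Nothing set in the environment: the defaults apply.
unset CLIENT_TYPE EDGAR_YEAR
resolve_defaults

# A value supplied at runtime (for example through `docker run --env-file`)
# takes precedence over the default.
CLIENT_TYPE=Blob
resolve_defaults
```

The first call prints the defaults, and the second call keeps `EDGAR_YEAR` at its default while reporting the overridden `CLIENT_TYPE`.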

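## Download Tika
Run the script `download_tika.sh` in the `tika/` folder. It downloads Tika version 1.20 into that folder.
Note that the Dockerfile copies `tika/tika-server-1.21.jar`, so the downloaded version and the Dockerfile must agree.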

## Docker
Run the following from the repository root to create the image:

docker build -t dslcr.azurecr.io/openedgar:1.1 .

## Run container
Mount a local folder into the container so that the downloaded documents are accessible from the host.
Example:

docker run --env-file vars.txt -v /Users/mirko/Projects/research-openedgar/data:/data dslcr.azurecr.io/openedgar:1.1

Contents of vars.txt

EDGAR_YEAR=2015
EDGAR_QUARTER=1
EDGAR_MONTH=1
CLIENT_TYPE=Local
S3_DOCUMENT_PATH=/data
DOWNLOAD_PATH=/data

After the download has terminated, you have to stop the container:

$ docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9e0ae247b61f dslcr.azurecr.io/openedgar:1.1 "oe-entrypoint.sh" 2 minutes ago Up 2 minutes priceless_bardeen

$ docker kill 9e
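
The manual lookup above can also be scripted. The following is a sketch only: `stop_by_image` is a hypothetical helper, and it assumes that every running container started from the given image should be killed.

```shell
#!/bin/sh
# Minimal sketch (assumption): kill every running container that was
# started from a given image. stop_by_image is a hypothetical helper.
stop_by_image() {
    image=$1
    # -q prints only container IDs; the ancestor filter matches the image.
    ids=$(docker ps -q --filter "ancestor=$image")
    if [ -z "$ids" ]; then
        echo "no running container for $image"
        return 0
    fi
    # Word splitting is intended here: one container ID per argument.
    docker kill $ids
}

# Usage:
#   stop_by_image dslcr.azurecr.io/openedgar:1.1
```

Filtering by image avoids copying the container ID by hand, at the cost of also stopping any other container started from the same image.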
70 changes: 70 additions & 0 deletions docker/default.env
@@ -0,0 +1,70 @@
# PostgreSQL
DATABASE_URL=${DATABASE_URL:="postgres://postgres:[email protected]:5432/openedgar"}
CELERY_BROKER_URL=${CELERY_BROKER_URL:="amqp://openedgar:openedgar@localhost:5672/openedgar"}
CELERY_RESULT_BACKEND=${CELERY_RESULT_BACKEND:="rpc"}
CELERY_RESULT_PERSISTENT=${CELERY_RESULT_PERSISTENT:="False"}
DJANGO_SECRET_KEY=${DJANGO_SECRET_KEY:="openedgar"}

# Domain name, used by caddy
#DOMAIN_NAME=domain.com

# General settings
# DJANGO_READ_DOT_ENV_FILE=True
# CLIENT_TYPE: S3, ADL, Local, Blob
CLIENT_TYPE=${CLIENT_TYPE:="Local"}


DJANGO_ADMIN_URL=${DJANGO_ADMIN_URL:=""}
DJANGO_SETTINGS_MODULE=${DJANGO_SETTINGS_MODULE:="config.settings.production"}
DJANGO_SECRET_KEY=${DJANGO_SECRET_KEY:="openedgar"}
DJANGO_ALLOWED_HOSTS=${DJANGO_ALLOWED_HOSTS:="localhost"}

# AWS Settings
DJANGO_AWS_ACCESS_KEY_ID=${DJANGO_AWS_ACCESS_KEY_ID:=""}
DJANGO_AWS_SECRET_ACCESS_KEY=${DJANGO_AWS_SECRET_ACCESS_KEY:=""}
DJANGO_AWS_STORAGE_BUCKET_NAME=${DJANGO_AWS_STORAGE_BUCKET_NAME:=""}

# AZURE DLAKE Settings
ADL_ACCOUNT=${ADL_ACCOUNT:=""}
ADL_TENANT=${ADL_TENANT:=""}
# Client ID
ADL_CID=${ADL_CID:=""}
# Client secret/password
ADL_SECRET=${ADL_SECRET:=""}

# Azure Blob Storage
BLOB_CONNECTION_STRING=${BLOB_CONNECTION_STRING:=""}
BLOB_CONTAINER=${BLOB_CONTAINER:="openedgar"}


# Read rate limit
CELERY_TASK_DEFAULT_RATE_LIMIT=${CELERY_TASK_DEFAULT_RATE_LIMIT:="10/s"}

# Used with email
DJANGO_MAILGUN_API_KEY=${DJANGO_MAILGUN_API_KEY:=""}
DJANGO_SERVER_EMAIL=${DJANGO_SERVER_EMAIL:=""}
MAILGUN_SENDER_DOMAIN=${MAILGUN_SENDER_DOMAIN:=""}
EMAIL_BACKEND=${EMAIL_BACKEND:="django.core.mail.backends.console.EmailBackend"}

# Security! Better to use DNS for this task, but you can use redirect
DJANGO_SECURE_SSL_REDIRECT=${DJANGO_SECURE_SSL_REDIRECT:="False"}

# django-allauth
DJANGO_ACCOUNT_ALLOW_REGISTRATION=${DJANGO_ACCOUNT_ALLOW_REGISTRATION:="True"}

# AWS setup
S3_ACCESS_KEY=${S3_ACCESS_KEY:="ABCDEFGHIJKLMNOPQRST"}
S3_SECRET_KEY=${S3_SECRET_KEY:="abcdefghijklmnopqrstuvwxyz12345678901234"}
S3_BUCKET=${S3_BUCKET:=""}

S3_PREFIX=${S3_PREFIX:="DATA"}
S3_COMPRESSION_LEVEL=${S3_COMPRESSION_LEVEL:="9"}

# Download path
DOWNLOAD_PATH=${DOWNLOAD_PATH:="/data"}
S3_DOCUMENT_PATH=${S3_DOCUMENT_PATH:="/data"}

# EDGAR PARAMETERS
EDGAR_YEAR=${EDGAR_YEAR:="2015"}
FORM_TYPES=${FORM_TYPES:="3, 10, 8-K, 10-Q, 10-K"}
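
The defaults in this file depend on shell parameter expansion, which has two closely related forms. The sketch below assumes the file is evaluated by a POSIX shell; `show_forms` is a hypothetical helper used only to show how the two forms treat a variable that is set but empty.

```shell
#!/bin/sh
# Minimal sketch (assumption): compare the two default-assignment forms.
# show_forms is a hypothetical helper, not part of the repository.
show_forms() {
    # Both variables are set, but empty.
    A=""
    B=""
    # := replaces a value that is unset OR empty.
    A=${A:="fallback"}
    # = replaces only a value that is unset, so the empty string survives.
    B=${B="fallback"}
    printf 'A=%s B=%s\n' "$A" "$B"
}

show_forms   # prints: A=fallback B=
```

With `:=`, a variable passed in with an empty value still receives the default, which is usually the safer choice for settings such as `DOWNLOAD_PATH`.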
