Welcome to the GitHub organization of the Data Science Team within the Office of the Chief Data Officer (OCDO) at Health Canada. We operate under the Digital Transformation Branch (DTB).
The OCDO Data Science Team is dedicated to accelerating Health Canada's data-enabled digital transformation. We serve as an Enabler, Facilitator, and Champion for data projects across the department, providing expertise in data science, analytics, and data-enabled transformation projects.
Our team engages in a variety of projects to support data science initiatives within Health Canada. Below are some of our key repositories:
The file-processing suite is a collection of Python libraries designed to generalize and streamline the processing of diverse file types.
The core library, which provides a generalized `File` class that uses a strategy pattern to select the appropriate processor based on file extension. It supports over 20 unique file processors and extracts metadata and text from files.
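The extension-based strategy selection described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual API: the `TextProcessor`, `GenericProcessor`, and `PROCESSORS` names, the metadata keys, and this stand-in `File` class are all assumptions made for the example.

```python
from pathlib import Path


class TextProcessor:
    """Hypothetical processor for plain-text files."""

    def process(self, path: Path) -> dict:
        return {"text": path.read_text(encoding="utf-8")}


class GenericProcessor:
    """Hypothetical fallback used when no extension-specific processor exists."""

    def process(self, path: Path) -> dict:
        return {"text": None}


# Hypothetical registry mapping file extensions to processor classes.
PROCESSORS = {".txt": TextProcessor, ".md": TextProcessor}


class File:
    """Illustrative stand-in, not the file-processing library's File class."""

    def __init__(self, path: str):
        self.path = Path(path)
        # Strategy selection: pick a processor based on the file extension.
        processor_cls = PROCESSORS.get(self.path.suffix.lower(), GenericProcessor)
        self.processor = processor_cls()
        # Basic metadata is collected for every file, whatever its type.
        self.metadata = {
            "file_name": self.path.name,
            "extension": self.path.suffix.lower(),
            "size": self.path.stat().st_size,
        }
        # The selected processor contributes type-specific fields such as text.
        self.metadata.update(self.processor.process(self.path))
```

Under this design, supporting a new file type means writing one processor class and registering its extension; the `File` class itself does not change.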
Extends `file-processing` by enabling Optical Character Recognition (OCR) capabilities. It wraps around relevant file types to extract text from images and scanned documents. It currently uses Tesseract OCR.
Adds transcription capabilities to `file-processing` by processing audio and video files. It uses OpenAI Whisper to transcribe speech to text.
Designed for analyzing collections of files. It iterates through directories, creating `File` objects, collecting metadata, and outputting the results to a CSV file.
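As a rough illustration of the directory-to-CSV flow described above, the sketch below uses only the Python standard library. It does not use the suite's own classes: the function names `collect_metadata` and `write_csv` and the three metadata columns are assumptions chosen for the example.

```python
import csv
from pathlib import Path


def collect_metadata(root: str) -> list[dict]:
    """Walk a directory tree and gather basic metadata for every file."""
    rows = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            rows.append(
                {
                    "path": str(path),
                    "extension": path.suffix.lower(),
                    "size_bytes": path.stat().st_size,
                }
            )
    return rows


def write_csv(rows: list[dict], out_path: str) -> None:
    """Write one CSV row per file, preceded by a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["path", "extension", "size_bytes"])
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_csv(collect_metadata("some_directory"), "report.csv")` would produce one row per file found under the (hypothetical) directory.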
Integrates machine learning models into the file-processing suite. Extends file processors with functionality such as loading models, performing inference, and creating embeddings.
Facilitates the creation and management of indices (e.g., FAISS) for efficient data retrieval in retrieval-augmented generation (RAG) with LLMs. Includes functionality for loading and searching indices.
A collection of over 100 sample files used for testing across the file-processing suite, ensuring robustness and reliability. The sample files are generic and publicly available, containing no sensitive or proprietary information, and their content is unrelated to our organization's operations.
Conceptual diagram illustrating the relationships between the libraries in the file-processing suite.
An example workflow using our file-processing suite:
- Data Extraction: Use `file-processing-analytics` to collect file paths and extract text from a company's directory.
- Embedding Creation: Employ `file-processing-models` to create embeddings of the extracted text.
- Indexing: Utilize `file-processing-indices` to build a FAISS index from the embeddings.
- Query Handling: When a user submits a query, search the FAISS index using `file-processing-indices` to retrieve relevant context.
- Response Generation: Feed the user's prompt and retrieved context into the model using `file-processing-models` to generate an answer.
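The shape of this workflow can be sketched with the standard library alone. This is a toy under stated assumptions, not the suite's API: a bag-of-words count vector stands in for a real embedding model, a plain Python list with brute-force cosine similarity stands in for a FAISS index, and every function name here is hypothetical. The final model call is left out; the sketch stops at assembling the prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Embedding Creation and Indexing: embed each text and store it in a list."""
    return [(doc, embed(doc)) for doc in documents]


def search(index: list[tuple[str, Counter]], query: str, k: int = 1) -> list[str]:
    """Query Handling: return the k stored texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Response Generation input: combine retrieved context with the user's prompt."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
```

In a real deployment, `embed` would be replaced by a model-backed embedding and the list by a vector index, but the order of operations stays the same: embed, index, search, then prompt.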
Docker templates for data science applications, including:
- A web application
- Jupyter Notebook examples
- A security-focused "isolated" container
- A GPU-enabled notebook example
A simple template demonstrating essential components of a Python library, such as project structure and test cases.
Code to automatically generate UML diagrams from a codebase. Aims to provide UML diagrams for all our repositories to enhance understanding and documentation.
Our versioning protocol ensures consistency and traceability across all repositories:
- Major Releases: Indicate feature additions (e.g., `1.0`, `2.0`). Labeled using Git tags.
- Minor Releases: Include bug fixes and minor improvements (e.g., `1.1`, `1.2`).
- Branching Strategy:
  - Releases live only in the `main` branch.
  - Development occurs in `dev` or feature branches before merging into `main`.
- Changelog: All updates are summarized in a `CHANGELOG.md` file within each repository.
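As a small illustration of the numbering convention, the sketch below classifies a version string as a major or minor release. It assumes, based only on the examples given here, that versions take the form `MAJOR.MINOR` and that a major release always has a minor component of `0`; the function name `classify_release` is hypothetical.

```python
import re


def classify_release(tag: str) -> str:
    """Classify a MAJOR.MINOR version string under the assumed convention."""
    match = re.fullmatch(r"(\d+)\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR version: {tag!r}")
    minor = int(match.group(2))
    # Assumption: a minor component of 0 marks a major (feature) release.
    return "major" if minor == 0 else "minor"
```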
We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repositories, contribute, or get in touch to learn more about our work.