hc-sc-ocdo-bdpd

Office of the Chief Data Officer (OCDO) - Data Science Team

Welcome to the GitHub organization of the Data Science Team within the Office of the Chief Data Officer (OCDO) at Health Canada. We operate under the Digital Transformation Branch (DTB).


About Us

The OCDO Data Science Team is dedicated to accelerating Health Canada's data-enabled digital transformation. We serve as an Enabler, Facilitator, and Champion for data projects across the department, providing expertise in data science, analytics, and data-enabled transformation projects.

Our Projects and Repositories

Our team engages in a variety of projects to support data science initiatives within Health Canada. Below are some of our key repositories:

Key Projects

File-Processing Suite of Python Libraries

The file-processing suite is a collection of Python libraries designed to generalize and streamline the processing of diverse file types.

file-processing: The core library. It provides a generalized File class that uses a strategy pattern to select the appropriate processor based on each file's extension. It supports over 20 file processors and extracts metadata and text from files.
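
As a rough illustration of the strategy pattern described above, the following minimal sketch shows one way a File class could choose a processor from a file's extension. The names used here (`GenericProcessor`, `TextProcessor`, `PROCESSORS`) are hypothetical and are not taken from the file-processing library; see the repository for the actual API.

```python
from pathlib import Path


class GenericProcessor:
    """Fallback strategy: file-system metadata only, no text extraction."""

    def process(self, path: Path) -> dict:
        return {
            "extension": path.suffix.lower(),
            "size": path.stat().st_size,
            "text": None,
        }


class TextProcessor(GenericProcessor):
    """Strategy for plain-text files: adds the decoded file content."""

    def process(self, path: Path) -> dict:
        result = super().process(path)
        result["text"] = path.read_text(encoding="utf-8")
        return result


# Hypothetical registry mapping extensions to processor classes.
PROCESSORS = {".txt": TextProcessor, ".md": TextProcessor}


class File:
    """Chooses a processor by extension and exposes its results."""

    def __init__(self, path):
        self.path = Path(path)
        # Unknown extensions fall back to the generic processor.
        processor_cls = PROCESSORS.get(self.path.suffix.lower(), GenericProcessor)
        self.processor = processor_cls()
        self.metadata = self.processor.process(self.path)
```

In this sketch, supporting a new file type only requires writing a processor class and adding it to the registry; the File class itself does not change.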

file-processing-ocr: Extends file-processing with Optical Character Recognition (OCR). It wraps the relevant file types to extract text from images and scanned documents, and currently uses Tesseract OCR.

file-processing-transcription: Adds transcription to file-processing for audio and video files, using OpenAI Whisper to convert speech to text.

file-processing-analytics: Analyzes collections of files. It iterates through directories, creates a File object for each file, collects metadata, and writes the results to a CSV file.
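
The directory-to-CSV flow described above can be sketched with the standard library alone. This is an illustrative approximation, not the file-processing-analytics implementation: the function name `analyze_directory` and the chosen columns are assumptions, and it reads file-system metadata directly instead of creating File objects.

```python
import csv
from pathlib import Path


def analyze_directory(root, output_csv):
    """Walk `root`, collect basic metadata per file, and write it to a CSV file."""
    rows = []
    # rglob("*") visits every entry under root, including nested directories.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            rows.append({
                "path": str(path),
                "extension": path.suffix.lower(),
                "size_bytes": path.stat().st_size,
            })
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["path", "extension", "size_bytes"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A call such as `analyze_directory("docs", "report.csv")` would produce one CSV row per file found under `docs`.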

file-processing-models: Integrates machine learning models into the file-processing suite. It extends the file processors with functionality such as loading models, running inference, and creating embeddings.

file-processing-indices: Creates and manages indices (e.g., FAISS) for efficient retrieval in retrieval-augmented generation (RAG) applications, including loading and searching existing indices.

file-processing-test-data: A collection of over 100 sample files used for testing across the file-processing suite. The files are generic and publicly available, contain no sensitive or proprietary information, and are unrelated to our organization's operations.

Diagram of the File-Processing Suite

[Figure: conceptual diagram illustrating the relationships between the libraries in the file-processing suite.]


Use Case Example: Retrieval-Augmented Generation (RAG) LLM

An example workflow using our file-processing suite:

  1. Data Extraction: Use file-processing-analytics to collect file paths and extract text from a company's directory.
  2. Embedding Creation: Employ file-processing-models to create embeddings of the extracted text.
  3. Indexing: Utilize file-processing-indices to build a FAISS index from the embeddings.
  4. Query Handling: When a user submits a query, search the FAISS index using file-processing-indices to retrieve relevant context.
  5. Response Generation: Feed the user's prompt and retrieved context into the model using file-processing-models to generate an answer.
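
The workflow above can be sketched end to end with toy stand-ins. In this minimal example, a bag-of-words counter replaces a real embedding model, a brute-force cosine search replaces the FAISS index, and the generation step stops at assembling the prompt. None of the function names come from the file-processing suite; they are hypothetical and only illustrate how the steps connect.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words term-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Steps 2-3: embed each document and keep the vector next to its text."""
    return [(doc, embed(doc)) for doc in documents]


def search(index, query, k=1):
    """Step 4: brute-force nearest-neighbour search (stand-in for a FAISS lookup)."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query, context):
    """Step 5 (input side): combine the retrieved context with the user's question."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query


# Step 1 is represented here by text that has already been extracted.
documents = [
    "Vacation requests are submitted through the HR portal.",
    "The data team publishes reports every quarter.",
]
index = build_index(documents)
question = "How do I submit a vacation request?"
context = search(index, question)
prompt = build_prompt(question, context)
```

The resulting `prompt` would then be passed to a language model. A real deployment would swap `embed` for a learned embedding model and `search` for a vector index, while the order of the steps stays the same.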

Other Repositories

Docker templates for data science applications, including:

  • A web application
  • Jupyter Notebook examples
  • A security-focused "isolated" container
  • A GPU-enabled notebook example

A simple template demonstrating essential components of a Python library, such as project structure and test cases.

Code to automatically generate UML diagrams from a codebase. Aims to provide UML diagrams for all our repositories to enhance understanding and documentation.


Repository Versioning Protocol

Our versioning protocol ensures consistency and traceability across all repositories:

  • Major Releases: Indicate feature additions (e.g., 1.0, 2.0). Labeled using Git tags.
  • Minor Releases: Include bug fixes and minor improvements (e.g., 1.1, 1.2).
  • Branching Strategy:
    • Releases live only in the main branch.
    • Development occurs in dev or feature branches before merging into main.
  • Changelog: All updates are summarized in a CHANGELOG.md file within each repository.
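
As a small illustration of the numbering scheme, the helper below computes the next version string for a major or minor release. It assumes versions always take the two-part MAJOR.MINOR form shown in the examples above; the function name `next_version` is hypothetical and not part of any of our repositories.

```python
def next_version(current, release_type):
    """Return the next MAJOR.MINOR version string for a 'major' or 'minor' release."""
    major, minor = (int(part) for part in current.split("."))
    if release_type == "major":
        # Feature additions: increment the major number and reset the minor number.
        return f"{major + 1}.0"
    if release_type == "minor":
        # Bug fixes and minor improvements: increment only the minor number.
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown release type: {release_type!r}")
```

For example, starting from version 1.2, a minor release would become 1.3 and a major release would become 2.0.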

We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repositories, contribute, or get in touch to learn more about our work.
