[RFC]: Implement a broader range of statistical distributions #77

Closed
6 tasks done
Rejoan-Sardar opened this issue Apr 1, 2024 · 4 comments
Labels
2024 2024 GSoC proposal. rfc Project proposal.

Comments


Rejoan-Sardar commented Apr 1, 2024

Full name

Rejoan Sardar

University status

Yes

University name

Lovely Professional University, Punjab

University program

BTech. Computer Science and Engineering

Expected graduation

2026

Short biography

I am Rejoan Sardar, a second-year undergraduate Computer Science and Engineering student at Lovely Professional University. My programming journey began with Python in high school, and I have continuously expanded my skill set since then. Python gave me a strong foundation that made it easier to explore other languages such as C/C++, Java, and JavaScript. I like coding for fun and have worked on various small projects, which can be found on my GitHub profile.

Timezone

IST (UTC + 5:30)

Contact details

email:[email protected] github:@Rejoan-Sardar

Platform

Windows

Editor

VS Code stands out as my favorite due to its seamless integration of powerful features, intuitive interface, and extensive customization options. Its robust set of tools, including IntelliSense for code completion and debugging capabilities, greatly enhance my productivity and streamline my development workflow.

Programming experience

I have been developing projects and participating in various competitions since my first year:
a. NodeRepl- Online Code Editor Implementation in Node.js : Developers can write, edit, and execute Node.js code directly in-browser, collaborating with others in real-time via socket.io. Kubernetes ensures reliable container orchestration and scalability. Smooth handling of HTTP requests enhances user experience. Bounties incentivize community contributions and reward developers.
b. Atom-REPL: Empowering JavaScript Development with Live REPL: Atom-REPL is a promising addition to the JavaScript coding ecosystem, offering a Live Read-Eval-Print-Loop (REPL) directly within the familiar environment of the Atom text editor. Inspired by the functionality of platforms like CoderPad, it aims to streamline the coding experience for JavaScript developers, particularly for quick exercises and experimentation. Still actively in development, Atom-REPL holds the potential for further enhancements and features, making it worth keeping an eye on for future updates and improvements.
c. XKCDDisplay-JupyterLab Integration for Random XKCD Comics: Users can fetch and display random XKCD comics, navigate through them with metadata, and enjoy offline viewing. Customization options allow adjusting display settings, bookmarking favorites, and sharing via social media. Additionally, users can search for comics based on keywords, utilize keyboard shortcuts, and access accessibility features. Robust error handling ensures smooth operation, while integration with the JupyterLab ecosystem enhances overall user experience.

JavaScript experience

While contributing to stdlib-js, I integrated JavaScript and C implementations for essential mathematical functions such as boxcox1p, xlog1py, asinh, atanh, asind, acscd, odd, and even, among others. This involved enhancing the standard library for both JavaScript and Node.js environments. My contributions aimed to improve the efficiency and functionality of these mathematical operations within the library, benefiting users across different platforms. Despite its flexibility, JavaScript's dynamic typing can pose challenges without tools such as TypeScript to enforce type safety. While JSDoc aids in documentation, it cannot enforce strict typing. I use TypeScript to maintain type safety and improve overall code reliability. I have worked on various small projects in JavaScript and Node.js, which can be found on my GitHub profile.

Node.js experience

Same as JavaScript.

C/Fortran experience

My proficiency in C programming and problem-solving skills have been refined through my extensive engagement in competitive programming. I excel in crafting efficient algorithms and implementing them effectively in C to solve a variety of challenges. With a solid foundation in data structures and algorithms, coupled with my ability to optimize code for performance, I consistently deliver robust solutions in competitive programming contests. My track record showcases my capability to tackle complex problems, demonstrating my dedication to mastering C programming and problem-solving at a competitive level.

Interest in stdlib

I have been contributing to open source for quite a long time, notably within the stdlib organization. Here, I made significant contributions by refining C implementations of pivotal mathematical functions, aiming to enhance computational precision and efficiency. Through rigorous code review and optimization efforts, I demonstrated my commitment to collaborative development and to advancing the state of the art in computational mathematics. Engaging with the open source community provided invaluable insights and opportunities for learning and growth, reinforcing the importance of shared knowledge and collective advancement in the field of programming. My contributions involve creating PRs, helping other team members, reporting bugs found in the project, and resolving them.

Version control

Yes

Contributions to stdlib

At present I have 22 PRs merged and 5 issues closed. Here are some of my open source contributions:

  • All PRs which are merged
  • All PRs which are open
  • All issues which are currently open
  • Resolved issues

Goals

This project aims to incorporate every distribution available in SciPy stats into a comprehensive JavaScript library. It involves developing APIs to calculate PDF, CDF, quantiles, and other essential distribution properties. Additionally, the library will provide functionality to generate random variates from any of the implemented distributions, ensuring a robust statistical toolkit within the JavaScript ecosystem.
The default method used by SciPy to sample from any distribution requires integrating the PDF and then numerically inverting the CDF. The implementation in SciPy is too slow to be relied on for practical purposes and custom methods for sampling random variates need to be implemented for the distributions in SciPy.
According to my thorough analysis, I've found that the distributions already implemented in the standard library (stdlib) exhibit a remarkable performance advantage over their default counterparts in SciPy. Through rigorous testing and benchmarking, it has been observed that the stdlib distributions are approximately 10,000 times faster in execution speed compared to the corresponding methods in SciPy. This significant performance gain could greatly benefit applications requiring high computational efficiency, making stdlib distributions an attractive choice for various scientific and engineering tasks.
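
As a rough illustration of the contrast described above, the following JavaScript sketch compares a generic sampler, which numerically inverts a CDF by bisection, with a distribution-specific closed-form quantile. The Gompertz distribution with shape parameter `c` is used purely as an example, and every function name here (`gompertzCDF`, `gompertzQuantile`, `invertCDF`, `sampleGompertz`) is a hypothetical helper, not a stdlib or SciPy API.

```javascript
'use strict';

// NOTE: all names below are hypothetical and for illustration only.

// CDF of a Gompertz distribution with shape c > 0: F(x) = 1 - exp(-c*(e^x - 1)).
function gompertzCDF( x, c ) {
	if ( x <= 0.0 ) {
		return 0.0;
	}
	return -Math.expm1( -c * Math.expm1( x ) );
}

// Closed-form quantile obtained by solving F(x) = p for x.
function gompertzQuantile( p, c ) {
	return Math.log1p( -Math.log1p( -p ) / c );
}

// Generic fallback: invert any continuous CDF on [lo, hi] by bisection.
// Every sample costs many CDF evaluations, which is why this path is slow.
function invertCDF( cdf, p, lo, hi ) {
	var mid;
	var i;
	for ( i = 0; i < 100; i++ ) {
		mid = 0.5 * ( lo + hi );
		if ( cdf( mid ) < p ) {
			lo = mid;
		} else {
			hi = mid;
		}
	}
	return 0.5 * ( lo + hi );
}

// Inverse transform sampling with the closed-form quantile.
// `rand` is any function returning a uniform number in [0,1).
function sampleGompertz( c, rand ) {
	return gompertzQuantile( rand(), c );
}
```

For example, `sampleGompertz( 2.0, Math.random )` draws one variate with a handful of floating-point operations, whereas `invertCDF` needs on the order of a hundred CDF evaluations per draw.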

Why this project?

This project holds immense significance due to its focus on enhancing the efficiency of random variate sampling methods from probability distributions. My analysis reveals that the default method employed by SciPy for sampling involves computationally intensive processes, such as integrating the Probability Density Function (PDF) and numerically inverting the Cumulative Distribution Function (CDF). However, this approach proves to be inefficient for practical applications due to its slow execution speed. To address this limitation, custom methods for sampling random variates from distributions in SciPy need to be developed.

Through extensive testing and benchmarking, I have discovered that distributions already implemented in the standard library (stdlib) offer a remarkable performance advantage over their counterparts in SciPy. Specifically, my findings indicate that stdlib distributions exhibit execution speeds approximately 10,000 times faster than those in SciPy. This substantial improvement in computational efficiency makes stdlib distributions an exceptionally appealing option for various scientific and engineering tasks where rapid computation is crucial.

Therefore, by optimizing random variate sampling methods through this project, we can significantly enhance the computational performance of scientific computations, thereby providing tangible benefits to a wide range of applications.

Implementation plan of action: In my study of SciPy implementations, I've identified some functionality gaps in stdlib that must be filled in order to implement all distributions found in SciPy. Some distributions pose challenges because they depend on BLAS and LAPACK functionality, such as matrix operations like transpose, dot product, cross product, and determinant calculation, among others. To address this, I've analyzed each distribution's dependencies, and I will also create a dependency graph. Distributions relying on functionality already present in stdlib are prioritized in the simple and intermediate sections of my plan, ensuring completion before the midterm evaluation. Some of those dependent on BLAS and LAPACK functions are included in the advanced distribution section. As stdlib's math/base/special namespace and the BLAS and LAPACK projects cover some of this functionality, I'll leverage them initially. If additional functionality is needed, I'll implement it after discussing with my mentor and adjust my timeline accordingly.
2.3.1 Foundational Implementation: Simple Distributions and Random Variate Generation APIs: In the initial phase of our project, we prioritize the implementation of simple distributions, which serve as foundational components of statistical analysis. These distributions exhibit straightforward mathematical properties and are commonly used in various fields. By focusing on simple distributions first, we establish a solid groundwork for our library, ensuring accessibility and usability for a wide range of users. Through meticulous implementation and testing, we aim to deliver reliable and efficient functionality that prepares the way for more complex distributions in subsequent phases of development. These include:

  • Bradford (Bradford continuous random variable)
  • Argus (Argus distribution)
  • Planck (Planck distribution)
  • Rademacher (Rademacher distribution)
  • Gibrat (Gibrat continuous random variable)
  • Boltzmann (Boltzmann distribution)
  • Pearson Type III (Pearson3) (Pearson Type III continuous random variable)
  • Exponential (Dogum) (Exponential continuous random variable)
  • Half Normal (Half-normal continuous random variable)
  • Half Logistic (Generalized half-logistic continuous random variable)
  • Half Cauchy (Half-Cauchy continuous random variable)
  • Log Uniform (Log Uniform or reciprocal continuous random variable)
  • Log Logistic (Log-logistic continuous random variable)
  • Log Normal (Lognormal) (Lognormal continuous random variable)
  • Inverse Weibull (Inverted Weibull continuous random variable)
  • Gompertz (Gompertz continuous random variable)
  • Fold Cauchy (Folded Cauchy continuous random variable)
  • Power Normal (Powernorm) (Power normal continuous random variable)
  • Power Log Normal (Powerlognorm) (Power log-normal continuous random variable)
  • Anglit (Anglit continuous random variable)
  • Log Gamma (Loggamma) (Log gamma continuous random variable)
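
To make the notion of a "simple" distribution concrete, here is a minimal JavaScript sketch for one entry in the list above, the half-Cauchy distribution, whose PDF, CDF, and quantile all have closed forms. It assumes a single scale parameter `sigma > 0` and a location fixed at zero. The function names are hypothetical and do not reflect the eventual stdlib package layout.

```javascript
'use strict';

// NOTE: hypothetical helper names, for illustration only.

// PDF: f(x) = 2 / (pi * sigma * (1 + (x/sigma)^2)) for x >= 0.
function halfCauchyPDF( x, sigma ) {
	var z;
	if ( x < 0.0 ) {
		return 0.0;
	}
	z = x / sigma;
	return 2.0 / ( Math.PI * sigma * ( 1.0 + ( z * z ) ) );
}

// CDF: F(x) = (2/pi) * atan(x/sigma) for x >= 0.
function halfCauchyCDF( x, sigma ) {
	if ( x <= 0.0 ) {
		return 0.0;
	}
	return ( 2.0 / Math.PI ) * Math.atan( x / sigma );
}

// Quantile: Q(p) = sigma * tan(pi*p/2); returns NaN for p outside [0,1].
function halfCauchyQuantile( p, sigma ) {
	if ( p !== p || p < 0.0 || p > 1.0 ) {
		return NaN;
	}
	if ( p === 1.0 ) {
		return Infinity;
	}
	return sigma * Math.tan( 0.5 * Math.PI * p );
}

// Random variate via inverse transform; `rand` returns a uniform in [0,1).
function halfCauchyRandom( sigma, rand ) {
	return halfCauchyQuantile( rand(), sigma );
}
```

Because the quantile is available in closed form, random variate generation reduces to a single call on a uniform draw, for example `halfCauchyRandom( 1.0, Math.random )`.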

2.3.2 Implementation of Intermediate Distributions and Random Variate Generation APIs: During the implementation of intermediate distributions, we focus on incorporating distributions with a moderate level of complexity and specialized applications. These distributions serve as essential tools in statistical analysis and modeling, providing insights into various real-world phenomena. Each distribution requires careful consideration of its mathematical properties and algorithms to ensure accurate and efficient implementation. Through this phase, we aim to expand the capabilities of our library to handle a broader range of statistical tasks and support more advanced data analysis techniques. These include:

  • Maxwell (Maxwell distribution)
  • Zipf (Zipf distribution)
  • CrystalBall (CrystalBall distribution)
  • Von Mises (Von Mises distribution)
  • Skellam (Skellam distribution)
  • Wald (Wald distribution)
  • Double Laplace (Double Laplace distribution)
  • Rice (Rice distribution)
  • Studentized Range (Studentized Range distribution)
  • Fatigue Life (Fatigue-life distribution)
  • Double Weibull (Double Weibull distribution)
  • Burr(III) (Burr Type III distribution)
  • Non-Central Chi-square (Non-Central Chi-square distribution)
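
For distributions in this group, a closed-form quantile is often unavailable, so random variates are typically produced by a transformation instead. The JavaScript sketch below illustrates this for the Maxwell distribution, assuming a scale parameter `a > 0`: a Maxwell variate is `a` times the Euclidean norm of three independent standard normal draws, which are generated here with the Box-Muller transform. All names are hypothetical helpers rather than stdlib APIs.

```javascript
'use strict';

// NOTE: hypothetical helper names, for illustration only.

// PDF: f(x) = sqrt(2/pi) * x^2 * exp(-x^2 / (2*a^2)) / a^3 for x >= 0.
function maxwellPDF( x, a ) {
	var z;
	if ( x < 0.0 ) {
		return 0.0;
	}
	z = x / a;
	return Math.sqrt( 2.0 / Math.PI ) * z * z * Math.exp( -0.5 * z * z ) / a;
}

// Mean: 2 * a * sqrt(2/pi).
function maxwellMean( a ) {
	return 2.0 * a * Math.sqrt( 2.0 / Math.PI );
}

// One standard normal draw via the Box-Muller transform.
// `rand` returns a uniform in [0,1); 1 - rand() lies in (0,1], so the log is finite.
function randn( rand ) {
	var u1 = 1.0 - rand();
	var u2 = rand();
	return Math.sqrt( -2.0 * Math.log( u1 ) ) * Math.cos( 2.0 * Math.PI * u2 );
}

// Maxwell variate: scale times the norm of three independent standard normals.
function maxwellRandom( a, rand ) {
	var x = randn( rand );
	var y = randn( rand );
	var z = randn( rand );
	return a * Math.sqrt( ( x * x ) + ( y * y ) + ( z * z ) );
}
```

Passing the uniform generator as an argument keeps the sampler testable with a seeded PRNG. A production implementation would likely reuse an existing normal generator rather than discarding the second Box-Muller output as this sketch does.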

2.3.3 Implementation of Advanced Distributions: Once the simple and intermediate distributions are implemented, we proceed to incorporate advanced distributions. This involves completing the implementation of all continuous and discrete distributions, along with selected multivariate distributions, to offer users a robust and versatile library for statistical analyses and modeling tasks.

  • I am planning to undertake this part post-midterm. Stdlib is a project in continuous development, and several enhancements have been made, such as the addition of C implementations in math/base/special and the incorporation of BLAS bindings and implementations for linear algebra projects. These updates are crucial as they will largely fulfill the requirements of my project's dependencies. Consequently, I will be able to implement all distributions comprehensively.
  • Implementation Strategy:
  • Continuous Distributions:
    • Complete the implementation of all remaining continuous distributions, ensuring accurate computation of PDF, CDF, mean, median, mode, quantile and random variate generation functions for each distribution.
    • Optimize algorithms and ensure efficient handling of edge cases to improve computational performance and accuracy.
  • Discrete Distributions:
    • Finalize the implementation of remaining discrete distributions, ensuring correct generation of random variates and computation of probability mass functions (PMFs), cumulative distribution functions (CDFs), and other essential properties.
    • Conduct thorough testing to validate the correctness and reliability of implemented algorithms.
  • Multivariate Distributions:
    • Select a subset of multivariate distributions for implementation, focusing on those with significant utility and applicability across various domains.
    • Implement algorithms for generating random variates, computing joint probability distributions, and other relevant functionalities for selected multivariate distributions.
    • Ensure compatibility with existing statistical tools and frameworks for seamless integration into data analysis workflows.
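
The discrete case follows the same pattern, with a PMF in place of a PDF. Below is a minimal JavaScript sketch for the Planck (discrete exponential) distribution. It assumes the parameterization `P(K = k) = (1 - exp(-lambda)) * exp(-lambda * k)` for `k = 0, 1, 2, ...` with `lambda > 0`. The sampler relies on the fact that the floor of an exponential variate with rate `lambda` follows this distribution. Function names are hypothetical, not stdlib APIs.

```javascript
'use strict';

// NOTE: hypothetical helper names, for illustration only.

// PMF: P(K = k) = (1 - exp(-lambda)) * exp(-lambda*k) for integer k >= 0.
function planckPMF( k, lambda ) {
	if ( k < 0 || !Number.isInteger( k ) ) {
		return 0.0;
	}
	return -Math.expm1( -lambda ) * Math.exp( -lambda * k );
}

// CDF: P(K <= k) = 1 - exp(-lambda*(floor(k) + 1)) for k >= 0.
function planckCDF( k, lambda ) {
	if ( k < 0 ) {
		return 0.0;
	}
	return -Math.expm1( -lambda * ( Math.floor( k ) + 1 ) );
}

// Random variate: floor of an exponential variate with rate lambda.
// `rand` returns a uniform in [0,1), so log1p(-rand()) is always finite.
function planckRandom( lambda, rand ) {
	return Math.floor( -Math.log1p( -rand() ) / lambda );
}
```

A basic correctness test for such an implementation is to check that the PMF sums to one over the support and that the CDF agrees with the running sum of the PMF.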

Qualifications

  • Experience with Implementation: Demonstrated ability to implement statistical distributions such as Dgamma and Log Logistic, showcasing proficiency in translating mathematical concepts into functional code (I will make my PR). Familiarity with the implementation process, including parameter estimation and the PDF, CDF, mean, median, mode, quantile and random variate generation functions for each distribution, ensuring accurate representation of distributions in the library.

  • Knowledge of SciPy: When contributing to stdlib, I enhanced numerous distribution namespaces. To improve these contributions, I studied how similar functionality is implemented in SciPy, gaining valuable insights and knowledge from the process.

  • Proficiency in C Implementation: For this project, we're building upon an already-established groundwork, as evidenced by the successful implementation of 20 functions. This achievement signifies a significant milestone, showcasing our ability to tackle complex tasks and deliver tangible results. Each implemented function represents a step forward in enhancing the project's capabilities, demonstrating our commitment to excellence and innovation.

  • Strong Mathematical Foundation: Possesses a solid understanding of statistical concepts and probability theory, ensuring accurate implementation and validation of distributions within the library. Applied mathematical principles to develop algorithms for probability density function (PDF), cumulative distribution function (CDF), and random variate generation, maintaining consistency with mathematical definitions and properties.

Prior art

  • For the set of distributions which have already been added to stdlib
  • And for APIs for generating random variates from those distributions.
  • Backbone of all distributions is SciPy stats.
  • JavaScript Statistical Library jstat.

Commitment

I'm committed to dedicating 4 to 5 hours daily to the project, totaling 30-35 hours per week, with the flexibility to increase my hours as needed. With no conflicting obligations, I can adjust my availability to align with my mentor's time zone. My summer break spans from May 31 to July 31, allowing ample time for project completion. Expect regular progress updates and prompt communication for mentor assistance. Additionally, I'll maintain bi-weekly blog updates for reference.

Schedule

Assuming a 12 week schedule,

  • Up Till May 1: Proposal accepted or rejected

    • Build deep familiarity with SciPy's statistical functionality and explore other resources like jStat, which implements a wide range of distributions in JavaScript.
    • I will improve the namespace examples of some distributions.

    Deliverables : A solid understanding of SciPy's statistical capabilities and of jStat's distribution implementations, to inform the design of the statistical library.

  • Community Bonding Period:

    • In this phase, my focus will be on implementing distribution functions with a streamlined approach, leveraging existing utilities and mathematical functions.
    • Begin with the Gompertz, Fold Cauchy, Half Cauchy, Half Normal, and Half Logistic distributions.
    • Gain more familiarity with SciPy.

    Deliverables : Obtain comprehensive knowledge for smooth project implementation. Confirm clarity and confidence in project goals and specifications.
  • Week 1 - Week 3: Coding officially begins

    • During this period, I will implement the simple distributions discussed above, such as Bradford, Argus, Planck, Rademacher, Gibrat, Boltzmann, Pearson Type III, and Exponential.
    • Develop PDF, CDF, mean, median, mode, quantile and random variate generation functions for each distribution.
    • Conduct thorough testing to validate implementations' accuracy and consistency.
    • Collaborate with mentors for guidance and feedback on implementations.

    Deliverables : Completed all simple distributions with PDF, CDF, and random variate generation functions, validated through thorough testing and supported by mentor guidance and feedback.

  • Week 4 - Week 6:

    • I will implement the intermediate distributions discussed above, such as Maxwell, Zipf, CrystalBall, and Von Mises.
    • Develop PDF, CDF, mean, median, mode, quantile and random variate generation functions for each distribution.
    • Conduct thorough testing to validate implementations' accuracy and consistency.
    • Collaborate with mentors for guidance and feedback on implementations.

    Deliverables : Prepared for midterm evaluation, having completed implementation of all distributions discussed in the simple and intermediate distributions sections with documentation.

  • Week 7 - Week 8:

    • In this phase, I will complete the remaining continuous distributions present in scipy.stats, such as fatiguelife, foldcauchy, foldnorm, genlogistic, gennorm, and others, so that all continuous distributions present in scipy.stats are fully implemented.

    Deliverables : Full implementation of all continuous distributions in scipy.stats.
  • Week 9 - Week 10:

    • In this phase, I will address incomplete discrete distributions and certain multivariate distributions for which dependencies have already been implemented. I will personally oversee the implementation of these multivariate distributions, ensuring seamless integration with existing dependencies.

    Deliverables : Completed discrete and multivariate distributions.
  • Week 11 - Week 12: Buffer Period

    • I will thoroughly address any backlog items that may have accumulated throughout the project timeline. This will involve carefully reviewing the project's progress with the guidance of mentors.

    • I will prioritize the creation of comprehensive documentation that encapsulates the entire project journey, including its objectives, methodologies, findings, and outcomes.

    • Finally, the project submission will be completed in the last week.

  • Post-GSOC :

    • After the initial 12 weeks, any remaining distributions discussed in the advanced distribution section will be addressed. Additionally, I will actively collaborate with the stdlib community to tackle any remaining issues or feature requests, ensuring the project's continued development and effectiveness.

Notes:

  • The community bonding period is a 3 week period built into GSoC to help you get to know the project community and participate in project discussion. This is an opportunity for you to set up your local development environment, learn how the project's source control works, refine your project plan, read any necessary documentation, and otherwise prepare to execute on your project proposal.
  • Usually, even week 1 deliverables include some code.
  • By week 6, you need enough done for your mentor to evaluate your progress and pass you. Usually, you want to be a bit more than halfway done.
  • By week 11, you may want to "code freeze" and focus on completing any tests and/or documentation.
  • During the final week, you'll be submitting your project.

Related issues

#2

Checklist

  • I have read and understood the Code of Conduct.
  • I have read and understood the application materials found in this repository.
  • I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
  • I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
  • The issue name begins with [RFC]: and succinctly describes your proposal.
  • I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.
@Rejoan-Sardar Rejoan-Sardar added 2024 2024 GSoC proposal. rfc Project proposal. labels Apr 1, 2024
Member

kgryte commented Apr 1, 2024

Thank you for sharing a draft proposal @Rejoan-Sardar.

One question I have is whether in your study of SciPy implementations you've found any prerequisite functionality which is missing in stdlib but would need to be present in order to execute on your proposed tasks. If so, what is your plan for addressing and how will you accommodate potential delays (including review cycles) in your proposed timeline?

Author

Rejoan-Sardar commented Apr 2, 2024

@kgryte, in my study of SciPy implementations, I've identified some functionality gaps in stdlib that must be filled in order to implement all distributions found in SciPy. Some distributions pose challenges because they depend on BLAS and LAPACK functionality, such as matrix operations like transpose, dot product, cross product, and determinant calculation, among others. To address this, I've analyzed each distribution's dependencies, and I will also create a dependency graph. Distributions relying on functionality already present in stdlib are prioritized in the simple and intermediate sections of my plan, ensuring completion before the midterm evaluation. Some of those dependent on BLAS and LAPACK functions are included in the advanced distribution section. As stdlib's math/base/special namespace and the BLAS and LAPACK projects cover some of this functionality, I'll leverage them initially. If additional functionality is needed, I'll implement it after discussing with my mentor and adjust my timeline accordingly.

Author

Rejoan-Sardar commented Apr 2, 2024

@kgryte any suggestions for improving this proposal before submitting it to the GSoC website? Additionally, I have a doubt: since I've already submitted a proposal on the GSoC website, should I mention that this proposal will be my second preference?

Member

kgryte commented Apr 2, 2024

Yes, you should specify which proposal is your highest preference.

@kgryte kgryte closed this as completed Apr 30, 2024