⚗️ Extract text from base64 yjs document #270

Open: AntoLC wants to merge 1 commit into main from feature/back-read-yjs-doc

Conversation

AntoLC
Collaborator

@AntoLC AntoLC commented Sep 19, 2024

Purpose

To index our documents, we need a way to extract their text.
This PR adds a function that extracts the text from a base64-encoded Yjs document.

In detail:

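    import base64

    import y_py as Y  # assumption: the Ypy bindings, matching the Y.YDoc / Y.apply_update calls below
    from bs4 import BeautifulSoup

    # I wrote "Hello world" in the BlockNote editor.
    # This is the base64 string of the Yjs document saved in MinIO.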
    base64_string = (
        "ARCymr/3DgAHAQ5kb2N1bWVudC1zdG9yZQMKYmxvY2tHcm91cAcAspq/9w4AAw5ibG9j"
        "a0NvbnRhaW5lcgcAspq/9w4BAwlwYXJhZ3JhcGgHALKav/cOAgYEALKav/cOAwFIKACy"
        "mr/3DgINdGV4dEFsaWdubWVudAF3BGxlZnQoALKav/cOAQJpZAF3DmluaXRpYWxCbG9j"
        "a0lkKACymr/3DgEJdGV4dENvbG9yAXcHZGVmYXVsdCgAspq/9w4BD2JhY2tncm91bmRD"
        "b2xvcgF3B2RlZmF1bHSHspq/9w4BAw5ibG9ja0NvbnRhaW5lcgcAspq/9w4JAwlwYXJh"
        "Z3JhcGgoALKav/cOCg10ZXh0QWxpZ25tZW50AXcEbGVmdCgAspq/9w4JAmlkAXckMTFj"
        "YTgzYmEtZGM3OS00N2Q3LTllNzYtNmM4OTQwNzc1ZjE3KACymr/3DgkJdGV4dENvbG9y"
        "AXcHZGVmYXVsdCgAspq/9w4JD2JhY2tncm91bmRDb2xvcgF3B2RlZmF1bHSEspq/9w4E"
        "C2VsbG8gd29ybGQgAA=="
    )
    decoded_bytes = base64.b64decode(base64_string)
    uint8_array = bytearray(decoded_bytes)

    d1 = Y.YDoc()
    Y.apply_update(d1, uint8_array)
    blocknote = str(d1.get_xml_element("document-store"))

    # blocknote var will look like this:
    # <UNDEFINED>
    # <blockGroup>
    #     <blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">
    #         <paragraph "textAlignment"="left">Hello world </paragraph>
    #     </blockContainer>
    #     <blockContainer "id"="11ca83ba-dc79-47d7-9e76-6c8940775f17" "backgroundColor"="default" "textColor"="default">
    #         <paragraph "textAlignment"="left"></paragraph>
    #     </blockContainer>
    # </blockGroup>
    # </UNDEFINED>

    # BeautifulSoup is used to extract the text from the previous structure
    soup = BeautifulSoup(blocknote, "html.parser")
    soup_text = soup.get_text(separator=" ").strip()

    assert soup_text == "Hello world"
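
For comparison, the text-extraction step can also be sketched with the standard library alone, in case adding a parsing dependency just for this is a concern. This is not the approach taken in this PR: it swaps BeautifulSoup for `html.parser.HTMLParser`, the helper name `extract_text` is hypothetical, and the sample string is an assumed compact form of the serialized structure shown in the comments above.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of a serialized XML/HTML-like string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only chunks that contain something other than whitespace.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(serialized: str) -> str:
    """Hypothetical helper: join all text nodes with single spaces."""
    collector = _TextCollector()
    collector.feed(serialized)
    collector.close()
    return " ".join(collector.chunks)


# Assumed sample: a compact version of the structure shown above.
sample = (
    '<UNDEFINED><blockGroup>'
    '<blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">'
    '<paragraph "textAlignment"="left">Hello world </paragraph>'
    '</blockContainer>'
    '</blockGroup></UNDEFINED>'
)
print(extract_text(sample))
```

Unlike `get_text(separator=" ").strip()`, this sketch strips every text node before joining, so whitespace-only nodes between blocks do not produce repeated spaces in the result.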

@AntoLC AntoLC self-assigned this Sep 19, 2024
@AntoLC AntoLC marked this pull request as draft September 19, 2024 10:41
@AntoLC AntoLC changed the title from "⚗️ example about how to extract text from base64 yjs document" to "⚗️ Extract text from base64 yjs document" Sep 19, 2024
@AntoLC AntoLC added the dependencies and python labels and removed the wip label Sep 19, 2024
@AntoLC AntoLC linked an issue Sep 19, 2024 that may be closed by this pull request
@AntoLC AntoLC force-pushed the feature/back-read-yjs-doc branch from 3122c6a to 3552d66 Compare September 19, 2024 11:11
@AntoLC AntoLC requested a review from sampaccoud September 19, 2024 11:19
@AntoLC AntoLC marked this pull request as ready for review September 19, 2024 11:20
Add a function to extract text from a base64 Yjs document.
It can be useful if we need to index the content
of the documents.
@AntoLC AntoLC force-pushed the feature/back-read-yjs-doc branch from 3552d66 to 1ee8e5f Compare September 20, 2024 08:44
Labels
backend, dependencies, python
Development

Successfully merging this pull request may close these issues.

✨Get doc content from backend
1 participant