⚗️ Extract text from base64 yjs document #270

AntoLC · 2024-09-19T10:22:09Z

Purpose

To index our documents, we need a way to extract their text.
This PR adds a function that extracts the text from a base64-encoded Yjs document.

In detail:

    import base64

    import y_py as Y  # Yjs bindings for Python (Ypy)
    from bs4 import BeautifulSoup

    # I wrote "Hello world" in the BlockNote editor.
    # This is the base64 string of the Yjs document saved in MinIO.
    base64_string = (
        "ARCymr/3DgAHAQ5kb2N1bWVudC1zdG9yZQMKYmxvY2tHcm91cAcAspq/9w4AAw5ibG9j"
        "a0NvbnRhaW5lcgcAspq/9w4BAwlwYXJhZ3JhcGgHALKav/cOAgYEALKav/cOAwFIKACy"
        "mr/3DgINdGV4dEFsaWdubWVudAF3BGxlZnQoALKav/cOAQJpZAF3DmluaXRpYWxCbG9j"
        "a0lkKACymr/3DgEJdGV4dENvbG9yAXcHZGVmYXVsdCgAspq/9w4BD2JhY2tncm91bmRD"
        "b2xvcgF3B2RlZmF1bHSHspq/9w4BAw5ibG9ja0NvbnRhaW5lcgcAspq/9w4JAwlwYXJh"
        "Z3JhcGgoALKav/cOCg10ZXh0QWxpZ25tZW50AXcEbGVmdCgAspq/9w4JAmlkAXckMTFj"
        "YTgzYmEtZGM3OS00N2Q3LTllNzYtNmM4OTQwNzc1ZjE3KACymr/3DgkJdGV4dENvbG9y"
        "AXcHZGVmYXVsdCgAspq/9w4JD2JhY2tncm91bmRDb2xvcgF3B2RlZmF1bHSEspq/9w4E"
        "C2VsbG8gd29ybGQgAA=="
    )
    decoded_bytes = base64.b64decode(base64_string)
    uint8_array = bytearray(decoded_bytes)

    d1 = Y.YDoc()
    Y.apply_update(d1, uint8_array)
    blocknote = str(d1.get_xml_element("document-store"))

    # blocknote var will look like this:
    # <UNDEFINED>
    # <blockGroup>
    #     <blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">
    #         <paragraph "textAlignment"="left">Hello world </paragraph>
    #     </blockContainer>
    #     <blockContainer "id"="11ca83ba-dc79-47d7-9e76-6c8940775f17" "backgroundColor"="default" "textColor"="default">
    #         <paragraph "textAlignment"="left"></paragraph>
    #     </blockContainer>
    # </blockGroup>
    # </UNDEFINED>

    # BeautifulSoup extracts the text nodes from the structure above
    soup = BeautifulSoup(blocknote, "html.parser")
    soup_value = soup.get_text(separator=" ").strip()

    assert soup_value == "Hello world"
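
If pulling in BeautifulSoup only for this step is a concern, the text-extraction part could probably also be done with the standard library. The sketch below is not what this PR implements: it swaps BeautifulSoup for Python's built-in `html.parser.HTMLParser`, and the names `_TextCollector`, `extract_text` and `sample` are hypothetical. It takes the serialized XML string as input (the `blocknote` variable above), so the Yjs decoding step is unchanged.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect character data and ignore every tag and attribute."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called once per run of text found between two tags.
        self.chunks.append(data)


def extract_text(serialized_xml: str) -> str:
    """Return the text nodes of `serialized_xml` joined by single spaces."""
    collector = _TextCollector()
    collector.feed(serialized_xml)
    collector.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace so indentation between tags does not leak.
    return " ".join(" ".join(collector.chunks).split())


# Assumed input: a shortened version of the structure shown in the comment above.
sample = (
    '<UNDEFINED><blockGroup>'
    '<blockContainer "backgroundColor"="default" "id"="initialBlockId" "textColor"="default">'
    '<paragraph "textAlignment"="left">Hello world </paragraph>'
    '</blockContainer>'
    '</blockGroup></UNDEFINED>'
)
print(extract_text(sample))
```

One difference to keep in mind: this sketch collapses all runs of whitespace, whereas `get_text(separator=" ").strip()` only trims the ends, so the two can differ on documents whose text contains repeated spaces or newlines.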

Function to extract text from a base64 Yjs document. It can be useful if we need to index the content of the documents.

AntoLC added wip backend labels Sep 19, 2024

AntoLC self-assigned this Sep 19, 2024

AntoLC marked this pull request as draft September 19, 2024 10:41

AntoLC changed the title ~~⚗️ example about how to extract text from base64 yjs document~~ ⚗️ Extract text from base64 yjs document Sep 19, 2024

AntoLC added the dependencies and python labels and removed the wip label Sep 19, 2024

AntoLC linked an issue Sep 19, 2024 that may be closed by this pull request

✨Get doc content from backend #264

Open

AntoLC force-pushed the feature/back-read-yjs-doc branch from 3122c6a to 3552d66 Compare September 19, 2024 11:11

AntoLC requested a review from sampaccoud September 19, 2024 11:19

AntoLC marked this pull request as ready for review September 19, 2024 11:20

⚗️(backend) function to extract text from base64 yjs document

1ee8e5f

Function to extract text from a base64 Yjs document. It can be useful if we need to index the content of the documents.

AntoLC force-pushed the feature/back-read-yjs-doc branch from 3552d66 to 1ee8e5f Compare September 20, 2024 08:44
