How to accelerate inference speed with LightPipeline? #13921
**🤕 Quick background**

I have been working with the following T5 summarization pipeline in Spark NLP on Colab, set up with:

```bash
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
```
```python
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_small") \
    .setTask("summarize:") \
    .setMaxOutputLength(128) \
    .setInputCols(["documents"]) \
    .setOutputCol("summaries") \
    .setTemperature(0.1) \
    .setDoSample(True)

pipeline = Pipeline().setStages([document_assembler, t5])
```

And then summarization by:
```python
points = [["sentence-1"], ["sentence-2"], ..., ["sentence-10"]]
data = spark.createDataFrame(points).toDF("text")
result = pipeline.fit(data).transform(data)
response = result.select("summaries").collect()

# Then a simple for loop to print those...
for row in response:
    print(row["summaries"][0].result)
```

**😢 Problem**

It generally takes far too long per request: every call goes through `transform()` on a DataFrame and then `collect()` to bring the results back.

**💡 Found**

I came across LightPipeline as a possible way to speed up inference. How do I use it with this pipeline?
---
Hi,
You can use a LightPipeline, which runs the fitted pipeline on plain strings instead of a DataFrame:

```python
from sparknlp.base import LightPipeline

data = spark.createDataFrame(points).toDF("text")
model = pipeline.fit(data)

light_model = LightPipeline(pipelineModel=model, parse_embeddings=False)
result = light_model.annotate("Here is a text that must be summarized ....")
# result is a dict; access the values by their keys, e.g. result["summaries"]
```
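A minimal follow-up sketch (the input texts below are placeholders): `annotate` also accepts a list of strings, in which case it returns one dict per input, which is handy for small batches.

```python
# annotate() also accepts a list of strings and returns a list of dicts,
# one per input text (these texts are placeholders).
texts = [
    "First long text that must be summarized ...",
    "Second long text that must be summarized ...",
]
results = light_model.annotate(texts)
for res in results:
    print(res["summaries"])
```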
---

Hi,

To expand on why the original approach is slow:

- `.transform()` requires a DataFrame (`data`), so every request pays Spark's DataFrame and job-scheduling overhead.
- In LightPipelines, to reduce that DataFrame latency, you can pass a string or a list of strings directly.
- `.collect()` is very bad here: it brings all the data into the Driver's memory.
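If you need more than the plain result strings, LightPipeline also has a `fullAnnotate` method; a minimal sketch, reusing the `light_model` from above:

```python
# fullAnnotate returns Annotation objects (with result, begin/end, and
# metadata) instead of the plain strings that annotate() returns.
full = light_model.fullAnnotate("Here is a text that must be summarized ....")
for annotation in full[0]["summaries"]:
    print(annotation.result, annotation.metadata)
```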