---
type: slides
---
Notes: Welcome back! This chapter is dedicated to processing pipelines: a series of functions applied to a Doc to add attributes like part-of-speech tags, dependency labels or named entities.
In this lesson, you'll learn about the pipeline components provided by spaCy, and what happens behind the scenes when you call nlp on a string of text.
```python
doc = nlp("This is a sentence.")
```
Notes: You've already written this plenty of times by now: pass a string of text to the nlp object, and receive a Doc object.
But what does the nlp object actually do?
First, the tokenizer is applied to turn the string of text into a Doc object. Next, a series of pipeline components is applied to the Doc in order. In this case, the tagger, then the parser, then the entity recognizer. Finally, the processed Doc is returned, so you can work with it.
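Here's a minimal sketch of that process, assuming the small English model `en_core_web_sm` is installed. It's not how you'd normally process text, but it shows what calling the nlp object roughly does behind the scenes:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# First, the tokenizer turns the string of text into a Doc object
doc = nlp.make_doc("This is a sentence.")

# Then each pipeline component is applied to the Doc in order
for name, component in nlp.pipeline:
    doc = component(doc)
```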
| Name | Description | Creates |
| --- | --- | --- |
| **tagger** | Part-of-speech tagger | `Token.tag` |
| **parser** | Dependency parser | `Token.dep`, `Token.head`, `Doc.sents`, `Doc.noun_chunks` |
| **ner** | Named entity recognizer | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` |
| **textcat** | Text classifier | `Doc.cats` |
Notes: spaCy ships with the following built-in pipeline components.
The part-of-speech tagger sets the token dot tag attribute.
The dependency parser adds the token dot dep and token dot head attributes and is also responsible for detecting sentences and base noun phrases, also known as noun chunks.
The named entity recognizer adds the detected entities to the doc dot ents property. It also sets entity type attributes on the tokens that indicate if a token is part of an entity or not.
Finally, the text classifier sets category labels that apply to the whole text, and adds them to the doc dot cats property.
Because text categories are always very specific, the text classifier is not included in any of the pre-trained models by default. But you can use it to train your own system.
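To make this concrete, here's a small example, again assuming `en_core_web_sm` is installed, that prints some of the attributes each component creates:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup")

# Attributes created by the tagger and the parser
for token in doc:
    print(token.text, token.tag_, token.dep_, token.head.text)

# Spans created by the named entity recognizer
for ent in doc.ents:
    print(ent.text, ent.label_)
```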
- Pipeline defined in model's `meta.json` in order
- Built-in components need binary data to make predictions
Notes: All models you can load into spaCy include several files and a meta JSON.
The meta defines things like the language and pipeline. This tells spaCy which components to instantiate.
The built-in components that make predictions also need binary data. The data is included in the model package and loaded into the component when you load the model.
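After loading a model, you can inspect its meta via the nlp object's `meta` attribute. A short sketch, assuming `en_core_web_sm` (the exact values depend on the model you load):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# The model's meta.json is available as nlp.meta after loading
print(nlp.meta["lang"])      # 'en'
print(nlp.meta["pipeline"])  # ['tagger', 'parser', 'ner']
```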
- `nlp.pipe_names`: list of pipeline component names

```python
print(nlp.pipe_names)
# ['tagger', 'parser', 'ner']
```
- `nlp.pipeline`: list of `(name, component)` tuples

```python
print(nlp.pipeline)
# [('tagger', <spacy.pipeline.Tagger>),
#  ('parser', <spacy.pipeline.DependencyParser>),
#  ('ner', <spacy.pipeline.EntityRecognizer>)]
```
Notes: To see the names of the pipeline components present in the current nlp object, you can use the nlp dot pipe names attribute.
For a list of component name and component function tuples, you can use the nlp dot pipeline attribute.
The component functions are the functions applied to the Doc to process it and set attributes, for example, part-of-speech tags or named entities.
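If you only need a single component function, you can also look it up by name using `nlp.get_pipe`. A short sketch, assuming `en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Look up one pipeline component function by its name
ner = nlp.get_pipe("ner")

# The component is callable: it processes a Doc and sets attributes
doc = ner(nlp.make_doc("Apple is a company"))
print(doc.ents)
```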
Notes: Let's check out some spaCy pipelines and take a look under the hood!