truncateHTML

A PHP function that truncates (shortens) an HTML5 string to a maximum number of characters.

Example: truncate after 6 characters including the ellipsis:
<p><b>A</b> red ball.</p> => <p><b>A</b> red…</p>

Compatible with PHP 5.6 and 7+
Uses the mbstring PHP extension for UTF-8.
More than 240 unit tests (see or run: unittest.php)

The function is in truncateHTML.php; you can copy and paste it into your project.

Features:

- Quickly truncates most common HTML5 sources without using a full HTML parser (which is ~100x slower).
- Configurable ellipsis: …, ..., <a href="">More</a>, etc.
  - Can include the length of the ellipsis in the truncated result.
- Supports self-closing tags like: <img>, <img/>, <newtag />
- Collapsing spaces: sequences of multiple spaces are counted only once (including <br>, &nbsp; and a few others).
- Doesn't count characters in invisible elements like: <head>, <script>, <noscript>, <style>, and HTML comments.
- Supports HTML entities (&nbsp;, &hellip;, &quot;, etc.)
- Whole word: can truncate at the end of the last word instead of cutting in the middle of a word.
  - Cut long words: can truncate in the middle of a word if it is very long (useful for truncating a URL).
- Truncates before the error in case of malformed HTML (like a mismatched closing tag).
- UTF-8 support (multibyte characters).
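
To make the counting rules above more concrete, here is a rough sketch of how "countable" characters could be tallied: tags, comments and invisible elements are skipped, entities count as one character, and whitespace runs collapse to one. The helper `countVisibleChars` is hypothetical and is not part of truncateHTML.php; the real function may count differently in edge cases.

```php
<?php
// Hypothetical helper for illustration only; truncateHTML.php does not expose it
// and its actual counting rules may differ.
function countVisibleChars($html)
{
    // Drop comments and elements whose content is never displayed.
    $text = preg_replace('~<!--.*?-->~su', '', $html);
    $text = preg_replace('~<(head|script|noscript|style)\b[^>]*>.*?</\1\s*>~isu', '', $text);
    // Treat <br> as a space, then remove all remaining tags.
    $text = preg_replace('~<br\s*/?>~iu', ' ', $text);
    $text = preg_replace('~<[^>]*>~u', '', $text);
    // Decode entities (&nbsp; becomes U+00A0), then collapse whitespace runs.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = preg_replace('~[\s\x{00A0}]+~u', ' ', $text);
    return mb_strlen($text, 'UTF-8');
}

echo countVisibleChars('<p><b>A</b> red ball.</p>'), "\n"; // 11 ("A red ball.")
```

This sketch removes whole elements with regular expressions, so it shares the limitations of any parser-free approach, for example a `</script>` string inside JavaScript code.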

Examples:

// Example from the introduction:
truncateHTML(6, "<p><b>A</b> red ball.</p>");
// =>           "<p><b>A</b> red…</p>"

// Whole word:
truncateHTML(5, "<blockquote>A lumberjack</blockquote>");
// =>           "<blockquote>A…</blockquote>"

// Without whole word, without includeEllipsisLength:
truncateHTML(5, "<blockquote>A lumberjack</blockquote>", ['wholeWord' => false, 'includeEllipsisLength' => false]);
// =>           "<blockquote>A lum…</blockquote>"

// Whole word: example of cutting only long words:
truncateHTML( 5, "<a href='https://php.net/docs.php'>https://php.net/docs.php</a>");
// =>            "…"   Notice how wholeWord truncates before opening a tag that would be left empty.
truncateHTML(20, "<a href='https://php.net/docs.php'>https://php.net/docs.php</a>");
// =>            "<a href='https://php.net/docs.php'>https://php.net/doc…</a>"

// Comments, scripts and styles are not counted:
truncateHTML(3, "<script>$();</script><!-- Start div --><div>Hi</div><!-- End div --> More text.");
// =>           "<script>$();</script><!-- Start div --><div>Hi…</div>"

// Collapsing multiple spaces:
truncateHTML(6, "A <br>  &nbsp; \n\t   long space!");
// =>           "A <br>  &nbsp; \n\t   long…"

// Tag mismatch: truncates before the error:
truncateHTML(99, "Click</a>here</a>");
// =>            "Click…"

API:

string truncateHTML(int $maxLength, string $html, array $options = [])

$maxLength: the returned HTML will contain at most $maxLength countable characters. If negative, the function instead removes that many countable characters (the absolute value of $maxLength) from the end of $html.
$html: the input HTML string that will be truncated.

$options: (optional) an array of options:

| Option (with default value) | Description |
| --- | --- |
| `'ellipsis'=>'…'` (or: `'ellipsis'=>'...'`) | The ellipsis that will be included. Can be an empty string and can contain HTML tags. (`'…'` is the horizontal ellipsis character, i.e. `'...'` as a single Unicode character.) If UTF-8 mode is disabled, the default value is `'...'` instead of `'…'`. |
| `'includeEllipsisLength'=>true` | Whether to include the length of the ellipsis in the length of the truncated result. |
| `'wholeWord'=>true` | When truncating, don't cut in the middle of a word. Instead, cut at the end of the last word. |
| `'cutWord'=>18` | When `wholeWord` is enabled, allows long words to be cut after `cutWord` characters. Set to `0` or `false` to disable. |
| `'utf8'=>true` | Use UTF-8 mode (recommended in all cases). If `utf8` is `false`, only ASCII-compatible single-byte encodings (such as Latin-1) are supported. For other encodings, use mb_convert_encoding to convert to UTF-8 and back. If UTF-8 mode is disabled, the default ellipsis is `'...'` instead of `'…'`. |
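
For input in another encoding, the conversion to UTF-8 and back mentioned for the `utf8` option could look like the sketch below. The helper `truncateInEncoding` is hypothetical and not part of truncateHTML.php. It takes the truncation function as a callable so the snippet runs on its own; the stand-in truncator is an assumption for demonstration and is not HTML-aware.

```php
<?php
// Hypothetical helper, not part of truncateHTML.php.
// $truncate is any callable with the signature ($maxLength, $html, $options);
// in a real project you would pass 'truncateHTML' after including truncateHTML.php.
function truncateInEncoding(callable $truncate, $maxLength, $html, $encoding, array $options = [])
{
    // Convert to UTF-8, truncate, then convert back to the original encoding.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);
    $truncated = call_user_func($truncate, $maxLength, $utf8, $options);
    return mb_convert_encoding($truncated, $encoding, 'UTF-8');
}

// Stand-in truncator so this snippet is self-contained: it is NOT HTML-aware.
$standIn = function ($maxLength, $html, array $options) {
    return mb_substr($html, 0, $maxLength, 'UTF-8');
};

// "héllo wörld" written with byte escapes, converted to UTF-16LE as sample input.
$utf16 = mb_convert_encoding("h\xC3\xA9llo w\xC3\xB6rld", 'UTF-16LE', 'UTF-8');
$result = truncateInEncoding($standIn, 5, $utf16, 'UTF-16LE');
echo mb_convert_encoding($result, 'UTF-8', 'UTF-16LE'), "\n"; // héllo
```

With this approach, any custom `'ellipsis'` passed in `$options` would need to be a UTF-8 string, since it is inserted before the result is converted back.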

Limitations:

XHTML: probably works in most cases, but is untested.

Not supported:

- Malformed HTML, badly nested tags, missing closing tags: the function doesn't try to guess the correct fix (for this you would need a full HTML parser).
  Note: on meeting an unexpected closing tag, it always truncates before that closing tag (see the examples).
- Uncommon HTML code like:
  - HTML tags inside an HTML tag attribute: <img title="Hello<br>World">
- The string </script> inside <script>code…</script>. For this you would need a full HTML parser or a JavaScript parser. (Other tags are fine, but a closing </script> tag must not appear in a JavaScript string or comment.)
- The string </style> inside <style>code…</style>. For this you would need a full HTML parser or a CSS parser. (Other tags are fine, but a closing </style> tag must not appear in a CSS comment.)
- XML
- CDATA (deprecated in HTML5)

If you find more, please open an issue.

History (changelog)

v1.0.1 (9 Feb. 2018):
- Fix multibyte characters in regex
- Add parameter type checks
v1.0 (5 Feb. 2018):
- Initial version
- Inspired by:
  - StackOverflow: Truncate text containing HTML, ignoring tags
  - truncate() from CakePHP


jlgrall/truncateHTML
