darkoatanasovski/htmltags


HTML Strip tags


This is a Go package that strips HTML tags from a string. You can also pass a slice of allowableTags that are kept in the output. It is useful when you work with web crawlers, or when you simply want to strip all or specific tags from a string.

func Strip(content string, allowableTags []string, stripInlineAttributes bool) (Nodes, error)

nodes, err := htmltags.Strip(content, allowableTags, stripInlineAttributes)
nodes.Elements   // HTML node structure of type *html.Node
nodes.ToString() // returns the stripped HTML string

Installation

$ go get github.com/darkoatanasovski/htmltags

Parameters

content                 - string   // the HTML to strip
allowableTags           - []string // tags to keep, e.g. []string{"p", "span"}
stripInlineAttributes   - bool     // true removes inline attributes from the kept tags, false keeps them

Return values

Strip returns a node structure. Call nodes.ToString() to get the stripped string. If an error occurs, the first error is returned.

Usage

To keep the inline attributes of the allowed tags, set the third parameter to false:

stripped, err := htmltags.Strip("<h1>Header text with <span style=\"color:red\">color</span></h1>", []string{"span"}, false)

To strip all tags from the string and get plain text, pass an empty slice as the second parameter:

stripped, err := htmltags.Strip("<h1>Header text with <span style=\"color:red\">color</span></h1>", []string{}, false)
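To make the roles of the second and third parameters concrete, here is a minimal standard-library sketch of allow-list stripping. It is not how htmltags is implemented (the package depends on golang.org/x/net/html). The function name stripSketch and the regular expression are assumptions made for illustration, and the regex deliberately ignores comments, scripts, and other edge cases that a real parser handles.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe captures the optional slash, the tag name, and the raw attribute text.
// Simplification: it does not handle comments or ">" inside attribute values.
var tagRe = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>`)

// stripSketch is a hypothetical helper that mimics the documented behaviour:
// tags outside allowableTags are dropped (their text is kept), and kept tags
// lose their attributes when stripInlineAttributes is true.
func stripSketch(content string, allowableTags []string, stripInlineAttributes bool) string {
	allowed := make(map[string]bool, len(allowableTags))
	for _, t := range allowableTags {
		allowed[strings.ToLower(t)] = true
	}
	return tagRe.ReplaceAllStringFunc(content, func(tag string) string {
		m := tagRe.FindStringSubmatch(tag)
		slash, name, attrs := m[1], strings.ToLower(m[2]), m[3]
		if !allowed[name] {
			return "" // drop the tag itself, surrounding text stays
		}
		if stripInlineAttributes || slash == "/" {
			return "<" + slash + name + ">"
		}
		return "<" + name + attrs + ">"
	})
}

func main() {
	in := "<h1>Header text with <span style=\"color:red\">color</span></h1>"
	fmt.Println(stripSketch(in, []string{"span"}, false)) // Header text with <span style="color:red">color</span>
	fmt.Println(stripSketch(in, []string{"span"}, true))  // Header text with <span>color</span>
	fmt.Println(stripSketch(in, []string{}, false))       // Header text with color
}
```

For real input, prefer htmltags.Strip, which works on a parsed node tree rather than on pattern matches.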

A working example

package main

import (
    "fmt"
    "log"

    "github.com/darkoatanasovski/htmltags"
)

func main() {
    original := "<div>This is <strong style=\"font-size:50px\">complex</strong> text with <span>children <i>nodes</i></span></div>"
    allowableTags := []string{"strong", "i"} // tags to keep; every other tag is stripped
    removeInlineAttributes := true           // true drops attributes such as style from the kept tags
    stripped, err := htmltags.Strip(original, allowableTags, removeInlineAttributes)
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(stripped)            // output: Node structure
    fmt.Println(stripped.ToString()) // output string: This is <strong>complex</strong> text with children <i>nodes</i>
}

Development

If you have cloned this repo you will probably need the dependency:

$ go get golang.org/x/net/html

Notes

Broken or partial HTML is repaired. For example, if the input string is <p>Content <i>italic, the repaired string is <p>Content <i>italic</i></p>.
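One simple way to picture this kind of repair is a stack of open tags: push on every opening tag, pop on a matching closing tag, and append closing tags for whatever is still open at the end. The sketch below uses only the standard library and is an illustration under that assumption, not the package's actual mechanism. closeOpenTags and the small void-element list are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// openCloseRe captures the optional slash and the tag name of each tag.
var openCloseRe = regexp.MustCompile(`<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>`)

// voidTags lists a few elements that never take a closing tag (not exhaustive).
var voidTags = map[string]bool{"br": true, "hr": true, "img": true, "input": true}

// closeOpenTags is a hypothetical helper that appends closing tags for
// elements left open at the end of content.
func closeOpenTags(content string) string {
	var stack []string
	for _, m := range openCloseRe.FindAllStringSubmatch(content, -1) {
		name := strings.ToLower(m[2])
		switch {
		case voidTags[name]:
			// void elements are never pushed
		case m[1] == "":
			stack = append(stack, name) // opening tag
		case len(stack) > 0 && stack[len(stack)-1] == name:
			stack = stack[:len(stack)-1] // matching closing tag
		}
	}
	var b strings.Builder
	b.WriteString(content)
	// Close the innermost element first.
	for i := len(stack) - 1; i >= 0; i-- {
		b.WriteString("</" + stack[i] + ">")
	}
	return b.String()
}

func main() {
	fmt.Println(closeOpenTags("<p>Content <i>italic")) // <p>Content <i>italic</i></p>
}
```

A full HTML parser applies many more recovery rules (misnested tags, implied elements), so treat this only as a mental model for the behaviour described above.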