smart developer’s blog

This is a C# resource library! Free how to’s and best practices…

SQL strip HTML tags

leave a comment »

I used many versions of the function below in order to strip HTML from a string in SQL. But no matter how much I tried, I always had some cases in witch it failed. I never found a perfect one on the Internet either. So, find below my latest version (it almost never fails :)):


ALTER function [dbo].[stripHTML]
	(
	@Content nvarchar(4000)
	)
returns nvarchar(4000)
as 
begin

	declare @it int
	set @it = 0

	declare @ind1 int
	set @ind1 = -1
	declare @ind2 int
	set @ind2 = -1

	declare @plaintext nvarchar(4000)
	declare @htmltext nvarchar(4000)
	set @plaintext = ''
	set @htmltext = ''

	declare @strippedContent nvarchar(4000)
	set @strippedContent = ''

	while @ind1 <> 0 or @ind2 <> 0
		begin
			set @ind1 = PATINDEX('%<%', @Content)
			set @ind2 = PATINDEX('%>%', @Content)

			if not @ind2 > @ind1
				set @ind2 = @ind1

			if @ind1 > 0
				set @plaintext = left(@Content, @ind1 - 1)
			else
				set @plaintext = ''
			
			if len(@Content) = @ind2 - @ind1 + 1
				set @htmltext = @Content
			else
				set @htmltext = substring(@Content, @ind1, @ind2 - @ind1 + 1)

			if @ind1 = 0 and @ind2 = 0 and len(@Content) > 0
				begin
					set @htmltext = ''
					set @plaintext = @Content	
				end	

			set @strippedContent = @strippedContent + @plaintext
			--set @Content =  right(@Content, len(@Content) - @ind2)
			SET @Content = SUBSTRING(@Content, @ind2 + 1, LEN(@Content)) -- whatever, until the end

			set @it = @it + 1
			if (@it > 2000)
				break
			else
				continue
		end

	set @strippedContent = replace(@strippedContent, '&nbsp;', ' ')
	set @strippedContent = replace(@strippedContent, '

	', '
	')
	set @strippedContent = replace(@strippedContent, '  ', ' ')
	set @strippedContent = replace(@strippedContent, '  ', ' ')
	set @strippedContent = replace(@strippedContent, '  ', ' ')

	return @strippedContent
end

Note: there is something to notice here. I have found out that

set @htmltext = substring(@Content, @ind1, @ind2 - @ind1 + 1)

is not the same as:

set @htmltext = right(@Content, len(@Content) - @ind2)

I am not sure how to explain this, but I am sure it has something to do with the size of unicode characters.

Advertisements

Written by smartdev

April 3, 2009 at 3:32 pm

Posted in SQL

Tagged with

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: