Displaying the first n words of a blog post: a look at XSLT extensions and Blog4Umbraco
The CogBlog, our staple bastion of informative comment (at least
some of the time!), is powered by the Blog4Umbraco
package. We've made a couple of additional modifications here
and there to get it running just how we like it.
These modifications were a chance to look more deeply into XSLT
extensions and showcase just how easy it is to implement them, all
thanks to the power of Umbraco.
Blog4Umbraco
First things first: as a package, Blog4Umbraco (b4u) isn't
updated very frequently. There are also several versions
out there, which vary in stability depending on
which version of Umbraco you're running. However, we've been very
pleased with the results we've obtained after a little bit of
tweaking.
For anything before Umbraco 4.5.2, the latest release version of
the b4u package (available from CodePlex here)
should be fine (version 2.0.26 at the time of writing). Umbraco
4.5.2 implements a new version of the XML schema, so a
different version of b4u is needed; Blog4Umbraco 2.0.26 for 4.5.x is available
here. At the time of writing some XSLT issues are still being
reported, so you may wish to download the latest version of the b4u
source (available
here) and have a look at the try-codegeko-features branch, where
Benjamin Howarth over at Codegeko has ironed out a number of these
issues.
The problem: displaying n words of a blog post, with intact HTML
So, assuming you have a working install of b4u, you may have
noticed that every blog post is displayed in full on the blog
homepage. This looks fine with the bundled blog skin; however,
we wanted our blog to look like a content page within our existing
site structure. Once we had applied all our templates and styles,
we ended up with an extremely long blog homepage, where each post
was too long and thin to be legible, and finding posts was a bit of
a nightmare. Other blog packages, such as WordPress, automatically
display only a summary of 50 or so words on the blog homepage. This
makes the page much more navigable, so we began looking for a way
to make b4u do the same.
After some extensive googling, it appeared that most of the
solutions involved using the umbraco.library:StripHtml() method to
remove any HTML from the post and then using the XSLT substring
function to print out the first x characters. This didn't really
satisfy us, as we wanted the HTML to be kept and to be well
formed. Simply taking the first x characters of the raw HTML
would not work either: if the cut fell in the middle of an open
HTML tag, the tag would never be closed and you would
end up with page errors.
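For reference, the strip-and-substring approach usually looks something like the sketch below. It assumes $post holds the current blog post node (as in b4u's BlogListPosts.xslt) and uses an arbitrary cut-off of 200 characters. The output is plain text, which is exactly the limitation described above.

```xml
<!-- Plain-text teaser: strip all markup, then keep the first 200 characters.
     $post and the length of 200 are assumptions made for illustration. -->
<xsl:value-of select="substring(umbraco.library:StripHtml($post/bodyText), 1, 200)" />
<xsl:text>...</xsl:text>
```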
The solution: HTML Agility Pack and an XSLT extension
To make sure that we could print out our n words and
still maintain well formed HTML, we made use of HTML Agility Pack.
In the creator's own words, it is "a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very
tolerant with "real world" malformed HTML. The object model is very
similar to what proposes System.Xml, but for HTML documents (or
streams)." As a result, it was perfect for our needs. There is also
a version included with the Umbraco core, so you may not even need
to upgrade your DLL. However, we included the latest version of the
library to be on the safe side.
The next step was to write an XSLT extension; this would provide
a getSummary method to be used within b4u's BlogListPosts.xslt. As
with the
Cogworks DayNight Webservice, Umbraco makes it easy to
integrate your code libraries into the core. I won't explain how
the extension itself uses HTML Agility Pack here, as I have a bit
of a thing for excessively commented code, which does all the
explaining for me! You can download the C# class below:
Download the C# code here!
Wiring the extension up
To use the extension, you first need to build the code file
above and drop the resulting DLL into the bin directory of your
website. (Make sure you also have the HTML Agility Pack DLL in your
bin - remember, you may need to upgrade this DLL to the latest
version as mentioned above.) You then need to add a reference to it
in your xsltExtensions.config file, like this:
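As a rough sketch, an entry in xsltExtensions.config takes the shape below. The assembly name, type name and alias are hypothetical placeholders, so swap in the names from your own build. The exact form of the assembly attribute has also differed between Umbraco versions (some expect a /bin/ path prefix), so it is worth copying the style of any entries already in your file.

```xml
<?xml version="1.0" encoding="utf-8"?>
<XsltExtensions>
  <!-- Hypothetical names: replace assembly, type and alias with your own.
       The alias is what the XSLT file refers to as urn:{alias}. -->
  <ext assembly="Cogworks.BlogExtensions"
       type="Cogworks.BlogExtensions.BlogSummary"
       alias="BlogSummary" />
</XsltExtensions>
```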

In the desired XSLT file, in this case b4u's BlogListPosts.xslt,
you need to declare the extension's namespace. You then need to
add the extension's prefix to the exclude-result-prefixes list,
which keeps the namespace declaration out of the rendered HTML.
The stylesheet tag should thus look something like this:
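Something along these lines should do, assuming the extension was registered under the hypothetical alias BlogSummary (Umbraco exposes each registered extension under the namespace urn: followed by its alias). Any namespaces already declared in BlogListPosts.xslt should stay where they are.

```xml
<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxml="urn:schemas-microsoft-com:xslt"
    xmlns:umbraco.library="urn:umbraco.library"
    xmlns:BlogSummary="urn:BlogSummary"
    exclude-result-prefixes="msxml umbraco.library BlogSummary">

  <!-- BlogSummary is a placeholder alias; existing templates go here unchanged -->

</xsl:stylesheet>
```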

The extension can then be called like a regular XSLT function, like
this:
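A call might look like the sketch below. BlogSummary is again the hypothetical alias, and the use of umbraco.library:NiceUrl to build the link, along with disable-output-escaping so the returned markup is emitted as HTML, are assumptions about how the extension is wired rather than a copy of our stylesheet.

```xml
<!-- BlogSummary is a placeholder alias; getSummary is the extension method -->
<xsl:value-of
    select="BlogSummary:getSummary($post/bodyText, 100, umbraco.library:NiceUrl($post/@id))"
    disable-output-escaping="yes" />
```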

Here, $post/bodyText is the HTML blog post being worked on, 100
is the number of words you want back, and the last argument is the
link to the document, which is output as the read more link.
We also use this to generate blog post previews for our blog RSS
feed.
So there you have it! An n word blog post preview, with valid
HTML. All thanks to the power of XSLT extensions and Umbraco!
Issues / Extending
You might run into an issue with the HTML non-breaking space
entity (&nbsp;). If you add the following entity to the DOCTYPE
declaration in the XSLT file you shouldn't have any problems:
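The declaration in question is typically the one below, which maps nbsp to the Unicode non-breaking space so the XSLT parser recognises it. This is a sketch of the usual form, so check it against the header already in your XSLT file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nbsp "&#x00A0;">
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- rest of the stylesheet unchanged -->
</xsl:stylesheet>
```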

Additionally, the handling of the read more link is hard-coded
in the extension so that it can appear within a blog post's p tags
(i.e. on the same line as the last part of the n word preview). You
may wish to adjust this behaviour to suit.