The Battle for the Body Field

In the early ’90s, every page was a handcrafted labor of love. Sadly, anyone who managed a large site eventually hit the wall: writing piles of custom HTML that tangled valuable content with boilerplate markup, gnarly design tweaks, and other difficult-to-maintain cruft.

Soon, large sites abandoned handcrafted pages entirely. The meat of a page got stored in a database, then passed through HTML templates to “wrap” it in design elements like footers, sidebars, and banner ads. Today, even individual elements like the name of a book, a photo of its cover, and an author’s bio are often teased out of design-heavy HTML and stored as individual chunks. Content editors fill out input forms rather than wrestling with a blank HTML canvas, and CMS templates reshuffle the elements as needed.

Trouble in Chunkytown

This fields-and-templates approach works great for content that follows predictable patterns, like product information sheets, photo galleries, and podcasts. It’s at the heart of NPR’s successful “Create Once, Publish Everywhere” system, and it’s hard to find a CMS or web publishing tool that doesn’t offer some way to model different types of content.

But Team Chunk has a deadly weakness. When narrative text is mixed with embedded media, complex call-outs, or other rich supporting material, structured templates have trouble keeping up.

MSNBC.com is a perfect example. As part of its 2013 redesign, the cable news channel put more emphasis on in-depth, web-first news coverage. The design included several reusable modules that could be placed on template-driven pages: videos with accompanying playlists, photo galleries, polling widgets, and teasers for related articles. That standardization delivered all the benefits of CMS content modeling: it made the design more consistent, simplified the process of reusing rich multimedia elements across different stories, and kept the responsive CSS rules manageable.

MSNBC news story, where rich media elements must appear at specific spots in stories and include captions, titles, related links, etc.

Unfortunately, reporters and editors insisted that this standardization would cripple their work. They needed to mix in multiple videos, a gallery and a poll, or several related article teasers at specific points in each article. Carving these elements out into separate CMS fields or standalone pieces of content would make them easier to store and remix, but relying on rule-based CMS templates to display them would break their connection to the specific sentences, paragraphs, and sections they were meant to enhance.

This is how complex markup makes its way into an article’s body field. Soon, WYSIWYG tools are added to help editors with limited HTML skills. Before anyone realizes what’s happened, use of presentation-oriented markup explodes. Mobile layouts break, and the already difficult task of cross-channel content reuse becomes even harder.

A blog post with embedded tweets, a comparison review that illustrates each product with a photo gallery, and a story that pulls in supporting material from a previous article all face the same problem: the fields-and-templates approach doesn’t work for these small pockets of structure.

Why “clean markup” won’t help

If you grew up during the WYSIWYG Wars—when tools like Adobe PageMill and Microsoft Word’s “Save to Web” feature splattered hideous markup across the internet—you might think cleaner HTML markup is the answer. Kill those unnecessary style attributes, ensure that <p> tags are used instead of <br />, use <ul> tags properly, name your CSS classes carefully, and things will fall into place!

Clean, semantic markup is important, but it won’t solve complex structural problems, like MSNBC’s need to embed widgets into narrative text. We have workhorse elements like ul, div, and span; precision tools like cite, table, and figure; and new HTML5 container elements like section, aside, and nav. But unless our content is really as simple as an unattributed block quote or a floated image, we still need layers of nested elements and CSS classes to capture what we really mean.

Imagine embedding a simple photo gallery in an article. Its markup might be clean and semantically correct, but the fact that the gallery displays with a headline, three photos, a link to a dedicated page, and a caption? Those are design decisions that may change in the future, and we need to separate them from the markup mapping our content to HTML.

<aside class="gallery">
  <h1><a href="gallery1.html">Gallery Title!</a></h1>
  <figure>
    <a href="photo1.html"><img src="photo1.jpg" /></a>
    <a href="photo2.html"><img src="photo2.jpg" /></a>
    <a href="photo3.html"><img src="photo3.jpg" /></a>
    <figcaption>Custom caption</figcaption>
  </figure>
</aside>

The problem isn’t restricted to the publishing industry, either. My team recently encountered similar challenges building a health insurance portal for a company’s HR department. Most content on the 50,000-page site included complex step-by-step instructions, special steps for specific types of employees, or call-outs appropriate for workers in one country but not another. Even with a WYSIWYG editor, the HTML structures needed were far too complex for the site’s business users to create.

At its heart, the problem is a vocabulary mismatch. While standard HTML is rich enough for a designer to represent complex content, it isn’t precise enough to describe and store the content in a presentation-independent fashion. This is why WYSIWYG tools can make the problem worse: rather than shielding content creators from the complexity of markup, they make it easier to describe content using the wrong vocabulary.

Now, as we attempt to combine multi-device design requirements with complex, media-rich narratives, we’ve hit the wall. The chunky, fields-and-templates approach we’ve developed can’t save us from the mismatch between our content and HTML’s descriptive tools.

Meanwhile, in XML-world

While fields and templates have come to dominate web publishing tools, the XML world has spent nearly 15 years developing a parallel approach. Rather than chunking content into fields and re-assembling it later, the XML community embraces fluid, markup-based documents. To capture meaningful structure and avoid HTML’s browser-specific presentation pitfalls, they define purpose-specific collections of markup tags for different projects and applications. It’s a versatile approach that has crossed paths with the web publishing world: the XHTML standard is just HTML, defined as an XML schema.

The Darwin Information Typing Architecture standard—better known as DITA—is a mature example of this approach. Developed by IBM and announced in 2001, DITA was shaped by the technical documentation community. As far back as 2005, Adobe used it to store and manage Creative Suite software manuals—more than 100,000 pages thick with illustrations, cross-references, and complex metadata, all in 14 languages. Both the print and online editions were generated from the same pool of DITA files.

DITA’s heart is a family of standard XML schemas that define a rich vocabulary of content elements. HTML-compatible tags like <ol> and <p> are used for simple formatting, but the standard also defines hundreds of additional tags and properties to describe complex concepts. In addition, it includes provisions for “specializations”—add-on vocabularies for a given industry or project.

<task id="signup">
  <title>Signing up for health insurance</title>
  <taskbody>
    <steps>
      <step>List your dependents</step>
      <step>Gather past medical information</step>
      <step>Fill out forms 21a, 39b, and 92c</step>
      <step audience="retail">
          Hand in your paperwork to a supervisor
      </step>
      <step audience="corporate">
          Deliver your paperwork to the HR office
      </step>
    </steps>
  </taskbody>
</task>
<p conref="../boilerplate.xml#disclaimer">
  This text will be replaced by the boilerplate legal disclaimer.
</p>

Once these semantically precise documents have been created, a transformation step is necessary to turn the structured content into final output. A web publishing tool might read a directory of DITA XML files, replace placeholder elements with the text they reference, expand custom tags into styled HTML markup, strip out text that’s only intended for printed manuals, and so on.
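
To make that concrete, here’s a minimal browser-side sketch, not a real DITA toolchain: it parses the <task> example above (assumed to live in a hypothetical <script type="text/xml" id="task-source"> element), keeps only the steps relevant to a “corporate” audience, and emits plain HTML.

var source = document.getElementById('task-source').textContent;
var doc = new DOMParser().parseFromString(source, 'application/xml');

// Keep steps with no audience restriction, or those aimed at this audience.
var steps = Array.prototype.filter.call(doc.querySelectorAll('step'), function (step) {
  var audience = step.getAttribute('audience');
  return !audience || audience === 'corporate';
});

// Expand the semantic markup into plain, presentation-ready HTML.
var html = '<h2>' + doc.querySelector('title').textContent + '</h2><ol>' +
  steps.map(function (step) {
    return '<li>' + step.textContent.trim() + '</li>';
  }).join('') +
  '</ol>';

Resolving the conref placeholder above would be another pass of the same kind: look up the referenced file and element, then splice its contents into place.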

The approach isn’t without its downsides. Managing large collections of related articles and documents requires that the people editing them understand the nuances of those relationships and how they’ll affect the final product. While the simplest DITA schema is similar to HTML, other variations add hundreds of special-purpose tags and properties.

In the broader web publishing world, it can take more customization to achieve the same benefits. Although the semantically rich content is cleaner and easier to repurpose, building usable editorial tools and publishing processes on top of DITA can be just as daunting as building a complex, multichannel website.

The best of both worlds

The good news is we don’t have to convert all our projects to XML to learn from the XML community’s accumulated wisdom. While the toolchains built around these approaches are a tough fit for today’s web development tools and workflows, we can apply their principles in our own projects.

Store meaning, not appearance, in the body

When complex markup structures appear in narrative text, boil them down to the basics. Replace complex house styles with custom tags that describe their precise meaning, like <warning type="hardware">Don't turn off the server!</warning>. When a new tag isn’t appropriate, use custom attributes. The DITA audience attribute is a good example. It can apply to many different kinds of elements, but jamming it into the often-abused CSS class attribute would muddy its meaning.

More complicated elements inside a body field, like multi-image galleries or metadata-heavy media embeds, should be broken out into separate content fields. If they’re meant to be reused across multiple pieces of content, make them freestanding content items in a CMS. Instead of relying on rule-based templates to position them, however, use placeholders like <gallery id="1" /> and <teaser article="82" rel="rebuttal" /> right inside the narrative fields.

This approach turns an article or a post into a kind of manifest, with narrative fields like “Body” and “Summary” playing traffic cop for the collection of properly separated supporting elements. Later, on output, they can be stitched back together.
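
As a rough illustration (the field and property names here are invented, not taken from any particular CMS), the stored data might look something like this, with the gallery living as its own entity and the body acting as the manifest:

// Hypothetical stored content: narrative text plus meaning-bearing placeholders.
var galleries = {
  1: {
    title: 'Gallery Title!',
    url: 'gallery1.html',
    photos: ['photo1.jpg', 'photo2.jpg', 'photo3.jpg'],
    caption: 'Custom caption'
  }
};

var article = {
  title: 'Signing up for health insurance',
  summary: 'What new employees need to know before open enrollment ends.',
  body: '<p>Before you start, gather your paperwork.</p>' +
        '<gallery id="1" />' +
        '<p>Once your forms are in, HR will confirm your coverage.</p>'
};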

Tailor editorial tools for the same meaningful elements

Editors and creators who work with complex content need tools that manipulate that content’s native vocabulary, not the final visual design or the browser-specific nuances of HTML. Wikipedia recently rolled out an assistive editing tool to help new users navigate the complexity of the site’s content. It offers a limited set of formatting tools, but gives editors one-click access to Wikipedia-specific markup standards like inline journal citations, boilerplate text, and calls for editorial review.

Screenshot of Wikipedia’s custom rich-text editor, with assistive tools for Wiki-specific markup

Those kinds of decisions aren’t universal: they’re tailored to the peculiarities of a specific project’s content. Disabling all but the most basic HTML tags and adding one-click buttons for a site’s custom elements can turn a “stock” WYSIWYG editor into a structure-friendly tool. It’s also the best way to avoid the click-buttons-till-it-looks-right markup mess.
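
As a sketch of what that customization might look like, here’s a hypothetical CKEditor plugin that adds a one-click button for the <warning> element used earlier. The plugin and button names are invented, and a real integration would also need to teach the editor’s parser and filters about the new element.

CKEDITOR.plugins.add('sitewarning', {
  init: function (editor) {
    editor.addCommand('insertWarning', {
      // Let the editor's content filter know this element is allowed.
      allowedContent: 'warning[type]',
      exec: function (editor) {
        // Insert the meaning-bearing tag, not presentational markup.
        editor.insertHtml('<warning type="hardware">Don\'t turn off the server!</warning>');
      }
    });
    editor.ui.addButton('SiteWarning', {
      label: 'Insert hardware warning',
      command: 'insertWarning',
      toolbar: 'insert'
    });
  }
});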

Transform the content to match the designs

When the time comes to publish the content, we can transform these custom tags and placeholders to the final destination format. If the design changes, tweaks need only be made in the code or templates that transform the markup—not in every piece of content where the structures appear.

In addition, different transformations can be applied to those custom elements depending on the context. The <gallery> element mentioned earlier might be replaced by multiple captioned and credited images for most web browsers. On bandwidth-constrained mobile devices, a single thumbnail image and links to the full gallery could be inserted instead. Contextually appropriate decisions can be made for email summaries, partner content APIs, or RSS feeds; each is just an alternative transformation step.
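
A rough sketch of that kind of channel-aware expansion, reusing the hypothetical article and gallery data from earlier (the channel names and output markup are illustrative, not a prescription):

function renderBody(body, galleries, channel) {
  return body.replace(/<gallery id="(\d+)"\s*\/>/g, function (match, id) {
    var gallery = galleries[id];
    if (!gallery) { return ''; }
    if (channel === 'web') {
      // Full treatment: headline, photos, and caption, as in the earlier markup.
      return '<aside class="gallery"><h1><a href="' + gallery.url + '">' + gallery.title + '</a></h1>' +
        '<figure>' +
        gallery.photos.map(function (src) { return '<img src="' + src + '" alt="" />'; }).join('') +
        '<figcaption>' + gallery.caption + '</figcaption></figure></aside>';
    }
    if (channel === 'mobile') {
      // Bandwidth-friendly: a single thumbnail and a link to the full gallery.
      return '<p><a href="' + gallery.url + '"><img src="' + gallery.photos[0] + '" alt="" /> View the full gallery</a></p>';
    }
    // RSS feeds, email summaries, partner APIs: just a plain link.
    return '<a href="' + gallery.url + '">' + gallery.title + '</a>';
  });
}

// e.g. renderBody(article.body, galleries, 'mobile');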

That processing doesn’t even need to happen on the server side. Client-side tools like jQuery and AngularJS can be used to apply complex behaviors based on custom attributes, style and interact with custom elements, replace placeholders with standard markup, or lazy-load media that’s tailored to a device’s needs.
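
For example, a small jQuery sketch along those lines, assuming the server emits a neutral placeholder such as <div data-gallery-id="1"></div>; the /galleries/ URLs are invented:

jQuery(function ($) {
  $('div[data-gallery-id]').each(function () {
    var placeholder = $(this);
    var id = placeholder.data('galleryId');
    if (window.matchMedia('(min-width: 37.5em)').matches) {
      // Wide screens: lazy-load the full gallery markup from the server.
      placeholder.load('/galleries/' + id + '/embed');
    } else {
      // Small screens: keep the payload light and link out instead.
      placeholder.html('<a href="/galleries/' + id + '">View the photo gallery</a>');
    }
  });
});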

The best news: it’s possible today

This triad of techniques—using custom elements and properties to represent content’s meaning, transforming it into HTML on output, and ensuring editing tools share the same vocabulary—has already started to gain momentum in the web publishing world.

WordPress’s “shortcodes” are a simple application of the technique, and third-party plugins can present editors with a customized set of placeholder tags tailored to their needs. Although WordPress’s use of bracketed placeholders rather than custom elements and attributes makes client-side processing of the tags more difficult, the underlying approach is the same: shortcodes can break out complex or reusable elements into separate fields and entities, then position them inside the body field.

eZ Publish, a popular PHP-based CMS, allows content to be stored as XML rather than HTML. Developers can set up custom tags whose properties and content are mapped to templates for output. Although it’s not automatic, these custom tags can be integrated into eZ Publish’s native editing tools, so content creators don’t have to use raw markup to enter them.

<custom name="about_author" photo="author.jpg" user_account="77">
  The author wrote this article over a long holiday break, and regrets any eggnog-induced errors.
</custom>

Drupal 8, currently under development, will ship with the CKEditor WYSIWYG tool. It will come pre-configured for a minimal set of HTML tags, but will use HTML5 data attributes to store additional properties like captions, layout hints, and more on simple elements. When the content is rendered, Drupal’s text filters will transform it into the final representation: CSS classes, <figcaption> tags, and so on. Users can manage that complex information using CKEditor’s visual tools instead of raw markup, but storing precision content while outputting semantic HTML will be the default.
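
Stripped of Drupal’s actual filter API, the general idea is a simple transformation from stored, attribute-based semantics to presentational markup. A hypothetical version:

// Not Drupal's real text filter code, just the shape of the idea: stored
// markup keeps the caption in a data attribute, and an output filter wraps
// the image in <figure>/<figcaption> at render time.
function renderCaptions(html) {
  return html.replace(
    /<img([^>]*?)\s*data-caption="([^"]*)"([^>]*)>/g,
    '<figure><img$1$3><figcaption>$2</figcaption></figure>'
  );
}

// renderCaptions('<img src="cover.jpg" data-caption="The hardcover edition" />')
// → '<figure><img src="cover.jpg" /><figcaption>The hardcover edition</figcaption></figure>'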

A bright future

This approach to structured content won’t always rely on complex web publishing tools. Several related HTML5 standards, grouped under the Web Components umbrella, will eventually make it possible to perform these transformations in the browser itself. The ability to define custom elements will bring us closer to XML’s vocabulary flexibility, browser-supported HTML templates will be able to replace those elements with more complex representational markup on the fly, and the Shadow DOM will give designers a way to “sandbox” complex JavaScript and CSS interactions inside those custom elements.
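
As a minimal sketch using the Custom Elements API as it eventually stabilized (the draft APIs available when this was written used different method names), a custom gallery element might upgrade itself like this:

class ArticleGallery extends HTMLElement {
  connectedCallback() {
    var id = this.getAttribute('gallery-id');
    // When the element lands in the document, swap the lightweight custom
    // tag for whatever representational markup the current design calls for.
    this.innerHTML = '<a href="/galleries/' + id + '">View the photo gallery</a>';
  }
}
customElements.define('article-gallery', ArticleGallery);

// Stored content then only needs: <article-gallery gallery-id="1"></article-gallery>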

Browser support for these behaviors is understandably patchy, but tools like Polymer are designed to fill the gaps. In the meantime, we can still depend on existing HTML elements, enhanced with data attributes, to stand in for custom ones. Although we still have to do the work of transforming them, they bridge the gap between a precise, tailored content vocabulary and clean, browser-friendly markup.

What are the next steps?

Using this narrative-friendly approach to structured content isn’t a cakewalk. Site builders, content strategists, and designers must understand what’s happening inside the body field, not just the database-powered chunks that surround it. Which patterns in our content should rely on simple styling, and which merit their own custom tags? Which can we assume will stay consistent, and which should account for future changes? Our planning process must start answering those questions.

In addition, content editing tools must be tailored to reflect those decisions. Too many users are accustomed to presentation-oriented “Dreamweaver in a body field” WYSIWYG tools, and throwing them back into the land of raw markup is a recipe for disaster. Although the current crop of web WYSIWYG tools can all be customized, actually tweaking them to match the vocabulary of a site’s content rarely happens when deadlines loom.

But the payoff can be dramatic. Richer, more flexible designs can coexist with the demands of multichannel publishing; future design changes can sidestep the laborious process of scrubbing old content blobs; and simpler, streamlined tools can help editors and authors produce better content faster. By combining the best of XML and structured web content, we can make the body field safe for future generations.

About the Author

Jeff Eaton

Jeff Eaton is a digital strategist at Lullabot, where he designs and implements large-scale web platforms for media, education, and enterprise businesses. He co-authored the first edition of O’Reilly Media’s Using Drupal; hosts the Insert Content Here content strategy podcast; and is a frequent writer and speaker at web and open source conferences. In a previous life, he worked as a freelance writer and a copy editor, jobs that he recalls fondly while building editorial tools for today’s content teams.

31 Reader Comments

  1. Apologies, but this is a very Drupal-oriented article. You seem to have taken a long time to say ‘let’s add custom tags and attributes into our WYSIWYG editors, except that part won’t be WYSIWYG and guess what it will be in Drupal 8’.

    As a web developer I’ve spent years using a wide variety of CMS, and Drupal’s approach is highly limited by the fact that it’s based in PHP, which is just a scripting language (and Perl-based, so everything is seen as a text-processing problem, which is what Perl is good at). If all you have is a hammer, everything looks like a nail.

    Talking about XML transformations and CKEditor is giving me flashbacks to the late 90s and early 2000s when those were the primary CMS options.

    You also seem to be contradicting yourself a bit by saying that we don’t want to make users dive into the HTML, but then we do want them to have to learn custom tags and attributes (which don’t get visually represented during editing).

    Modern systems based on object-oriented languages (like C# or Javascript) give developers the chance to build reusable modules that can separate data from design, not require authors to learn any markup, and be visually represented during content construction. I’ve built widgets in Sitefinity, for example, that don’t allow any design decisions by the author, but give them complete flexibility to place specialized content wherever they need in their article, and see how it will look at design time.

    Custom tags and attributes are indeed the future, but modern technologies will allow the author to still be abstracted away from the semantic markup and simply focus on content.

    Thank you for the article, but this is for a specific technology community, and leaves out the many other approaches available even now.

  2. I think ultimately what we’re looking at is the need for more metadata tagging the content, allowing a means of managing that metadata, and then defining a means of leveraging that metadata toward a variety of purposes (seo, presentation, interaction, filtering).

    That really is the thing lacking in HTML. Tags are fine to define basic and common structures (i.e. this is a list of stuff, this is an article, this is a quote); where it fails is in describing the specific content further. So yeah, it’s an article, but what kind of article, about what, from whom, when, what’s the context, who’s the audience, etc.? At the same time we have to worry about “clean code” and not becoming the bloat that XML can become if not properly curated.

    It’s funny in that what we basically need is XML and XSL, but not XML or XSL. We need the ability to create chunks of content and specify all those extra data points and then transform it in some separate step into something the browser will render and the user can interact with, but without the bloat of XML and without the speed issues of XSL. So we’re left writing our own parsers and coming up with our own meta-languages and domain-specific languages to fix this problem across a variety of platforms.

    I think another challenge is that the publishers don’t want restrictions. I’ve run into this with some marketing teams and designers. If our developers say “we need to live within this box due to budget, infrastructure, time, etc.” the designers and marketers and business partners balk… “why are you putting restrictions on us! We’re designers! You’re not designers!” and then we go over time or budget trying to figure out how to make their designs implement like they expect (which rarely works out 100%). I think the same holds true here. Instead of designing a system that works consistently and fluidly for 80% of consumers, we’re worrying about fitting every single possible use case that could ever occur and in the process making it difficult for everyone. So how do we put a stake in the ground and say we’re going to do something here, it won’t make everyone happy, but it will work for most and not have everyone reinventing their own version of the same thing?

  3. Tony, thanks for your comments!

    I’ve definitely spent the past several years in the Drupal world, but I don’t think the narrative-structure challenge (or the approaches I’m discussing) is in any way specific to one CMS or a particular programming language. You’re correct that many different CMS platforms have arrived at similar conclusions… and yet, on project after project we find clients who’ve been left with a body field, a WYSIWYG editor, and a pile of HTML-insertion buttons.

    The challenge isn’t simply technical, since the tools to build intelligent, modular content have been around for decades. The biggest hurdle is getting CMS integrators, web developers, and designers on the same page when it comes to the meaning of the underlying content they’re storing, manipulating, and presenting. XML and DITA are important not because they’re perfect technical solutions, but because their communities have learned to articulate many challenging questions about content structure.

    I agree wholeheartedly that presenting raw markup to users isn’t the answer, and I hope that the space constraints of the article didn’t leave anyone with the wrong impression. Precise, meaningful markup (the kind that vanilla HTML can’t currently provide) is a critical foundation for the easy-to-use widgets and editing interfaces you describe. While it can simplify the markup jumble, it’s not the end of the process.

  4. The quandary is that, although mistrusted by many, XML and XSLT actually enhance publishing opportunities as well as enable stepping away from solutions that only exist in silos. Shared content architectures lower the cost of development and maintenance of tools, enhance content portability across the Web, enable common design patterns to attract wider communities of users (training content, how-to articles, and encyclopedic content as a few examples), and give product vendors wider markets for widely-shared solutions, potentially lowering costs thanks to competition that could not exist before. HTML5’s structural and semantic elements certainly help out with creating ever more adaptive content delivery solutions, but nothing in the Web architecture helps with the problem of umpteen different authors creating umpteen different interpretations of what the repeatable structure of a particular content type ought to be.

    Herein is where some form of schema-coached content authoring tool really does make sense. I’ve been beaten up enough about technology biases to avoid letting you see whether or not I use XML under the covers. I prefer to suggest that HTML5 can be augmented by XML for far greater roles, just as Ripley was augmented by her Power Loader suit for doing outmatched battle (Alien meme there). It is a simple matter of finding scaffolding for your content goals in the standards that can do the most to lower your costs and enhance your reach.

  5. @Tony, just to point out, the latest iterations of PHP, and specific PHP frameworks, including Symfony, which is the basis for much of Drupal 8, are object-oriented (http://stackoverflow.com/questions/4699519/is-php-object-oriented), so I don’t think that’s at issue. (Also, Perl-based… really?!) I think the issue of how to break down content for its reuse (here, we’re concerned about the content’s code, not that of the CMS/html framework) while allowing contextual presentation of aggregated content in a specific display format at a specific url is an important one that all web developers have to work through in order to accommodate the proliferating number of uses of our content (e.g., in apps, output in APIs, displayed on different types of devices, consumed by screen-readers, and the like). It’s an important question for long-term preservation of content, as well, something that’s coming to a head given the age of the web and the pervasiveness of web-published content.

  6. Pixel and Tonic have added a great feature into Craft CMS called Matrix.

    It lets you create ‘blocks’ of content types that make it easy for a person to enter in the control panel and also give developers the ability to use proper markup.

  7. @Jeff Eaton Thanks for the reply – perhaps I misunderstood the intent of your article, but the struggles you suggested seemed specific to the ‘WYSIWYG editor and buttons’ problem that you mention, which I haven’t found to be a problem in all CMS. That seems to me more a technical problem of implementation.

    Most of your article seemed to me focused on the experience of the author, not the challenges of the developer, thus my feeling that it seemed contradictory. Thanks for clarifying.

  8. @sclapp Yes Perl-based, in that PHP started as a C-translation of Perl scripts, and inherited Perl-like syntax. I find that languages, no matter how they progress, keep a root way of thinking. C++ and C# still ‘think’ like C in many ways though they’ve moved far from their origins.

    Also PHP 5 has objects and interfaces, but that doesn’t necessarily make it ‘object-oriented’ yet. That’s an open debate. You can get into questions of polymorphism, and implementation (heaps, stacks, etc.)

    As far as data goes, I think it’s important as developers not to get caught up in abstractions. Custom tags, attributes, etc. are also just data – metadata that someone has to write code to translate. Moving from one markup vocabulary to another is to accommodate humans (whether screen readers or people reading code); whatever the device is, the computer system doesn’t care.

    You’re abstracting away from what’s happening within the browser anyway. We’re just building deeper layers of abstraction. So the question is how far do you abstract the user, and how far do you abstract the developer, and overall how clean and understandable can you make your abstraction.

  9. The need for more specific elements is there for sure. Microformats never really got full support, RDFa strikes fear that another marquee or blink tag will be introduced under a money-backed namespace, and data attributes, while useful, unfortunately have little to do with semantics at this time. I agree that rel attributes need to play a more prominent role in defining and describing content relationships.

    “There is a very real problem that needs to be solved here. We need mechanisms in HTML that clearly and unambiguously enable developers to add richer, more meaningful semantics—not pseudo semantics—to their markup. This is perhaps the single most pressing goal for the HTML 5 project.” – John Allsop

    And yet, here we are. I think we need to introduce a small new set of elements that represent content categories—general enough to be reusable, but specific enough to be semantic.

    Just like Microformats, I think the 80/20 rule is a good model to follow. Why shouldn’t the following tags exist?

    calendar
    event
    location (lat, long)
    vCard (person or profile)
    product
    review
    resume/cv
    feed (RSS, timelines)
    img type="logo" (solving the h1 vs img debate)
    disclosure type="summary | spoiler | sensitive etc" (replacing the detail tag)
    

    You could go even more granular for tags:

    post type="status | image | audio | video | comment"
    audio type="music | voiceover | language | show (podcast/interview)"
    

    Your warning example above could be abstracted to notification with a type attribute that accepts values of error, warning, or message.

    Think about the countless parsers, aggregators, and API calls just this set would save, and how this would affect object-oriented CSS.

    I think we’d all benefit from paving the cowpaths and re-approaching HTML as if it were being created for tomorrow’s web.

    The other, larger elephant in the room is that content strategy, content management, and semantic, reusable markup exist in a delicate balance. There are IAs who are concerned with content management, content strategists concerned with copywriting and context planning, and front-end devs who want to keep markup lean and reusable, not to mention page weight low.

    It’s becoming increasingly more difficult to draw clear distinctions in ownership, because content management and authoring are dictating the HTML markup, which more often than not, heavily influences CSS/JavaScript hooks, even source control.

    The interdependence can quickly become political and the user ends up suffering. I’m also heavily invested in helping find or create an amicable solution.

  10. Fully agree with the author here. For those familiar with Entity Relationship (ER) modeling, we could say there is a need for two things in a CMS:

    1. Custom entity types (meaning entity types consisting of several attribute types). In typical CMSs such as Drupal, these would be the “custom content types”.
    2. Custom attribute content. We should be able to define any attribute type as a simple type, such as text, string, options, …, or as a complex type (such as the body field described here). This body field can then (technically) consist of e.g. a combination of standard HTML tags and custom elements (such as an image gallery), which can be themed independently of their structure.

    We are currently implementing such a system in a new Angular-based CMS. Angular is great in this regard, because it has directives, which are a great way of adding these custom elements. For more information check this: http://docs.angularjs.org/guide/directive, or wait for our upcoming CMS 🙂

  11. That’s a really interesting article, Jeff. I totally agree with your central point that complex content in the body of page content is really underrepresented in CMSs at present.

    Page content types work fine (see content channels, content types, etc), and complex areas in fixed locations seem well catered for too (for example Matrix in EE and Craft). However, I agree that users often want the flexibility to add complex content anywhere in the page. But that content is still modular and needs to be separated from what it looks like. From what I’ve seen most CMSs don’t really cater for that.

    You seem to be indicating users will have to write markup to express complex content in the body; in my experience this is usually a hard sell.

    I think a successful solution would have to marry a decent visual editor along with an expressive markup language. My instinct is HTML5 could be OK for this, with things like data attributes as you suggest.

    There are also some interesting projects like Made by Many’s Sir Trevor http://madebymany.github.io/sir-trevor-js/ which allows the user to add blocks of modular content in any order.

    It feels like content strategy is smashing up against CMS tools and we’re starting to see some real progress in how people think about content and how we publish it.

    It will be interesting to see how this all evolves…

  12. Jeff,
    Great points about the limitations of HTML5, which fails to provide true semantic markup, despite claims otherwise.

    I think the bigger issue to solve is determining the purpose of markup within the body. You mention the role of markup in helping to structure a document. There is far more that can be done there. That is really what DITA has been about, and it has been a walled garden, serving the needs of individual content-creating organizations.

    The other kind of meaning-based markup is about relating internal content to other content outside of the document. This is where the various flavors of data markup come in: RDFa, JSON-LD, etc. The web standards community has been more active in this area, largely because committee members are interested in linking data, but it hasn’t been as much of a priority for ordinary authors, who largely don’t understand how it works. As a result, this kind of markup is still rather cryptic to many. And it only covers a limited range of named entities that, while important to businesses, don’t reflect the full range of interests of the general public.

    For either kind of markup to be more widely adopted, it needs to be concise, precise, and easy to understand. I would love to see standards bodies care more about these issues. HTML is now being used for everything, including books, so it needs more robust markup capabilities.

  13. I feel that the curve of our industry is — and should be — bending towards less coding complexity in the body field, rather than more.

    Most of the problems that Mr. Eaton describes have a straightforward and non-coding-based solution already: oEmbed. I can’t think of a more semantic, easy-to-use, and responsive way of handling the need to put some piece of content into a specific part of your content than a simple HTTP link at that point in the text, then allowing your CMS to auto-discover from the content source the code required to display that rich content, and then work out its own rules on integrating that content into your design.

    You drop a link to a Slideshare presentation into your body field. When that body field is pulled into some context where a rich media object is appropriate (like a full-size article view), an oEmbed-enabled CMS will render the presentation; in other contexts (an RSS feed, an older feature phone, a ‘help’ tooltip) the CMS will just include the link.

    WordPress, Drupal, Plone and I’m sure many other CMSs have supported oEmbed for years (in WordPress, I much prefer to use oEmbed to using their shortcode system that Mr. Eaton described). Most of us are familiar with oEmbed as a way of rendering Twitter cards and Youtube videos, or one of the many other social media services, but in fact anyone can setup an oEmbed provider for their own content, both for internal use and to share with the world.

    Finally, rich media aside, Mr. Eaton brings up his core point — adding semantic-ish custom markup to text (‘warning’ or ‘task’ etc tags within an XMLish framework). I guess there are probably some specific uses for this, but at a core level I’m uncomfortable with trying to use a technical structure to create meaning. The content is the meaning.

  14. Interesting article and interesting discussions in the comments section. One can take ideas from here and there to get closer towards a solution for this daunting task.

    In my search for CMSs that more or less implement what’s discussed here, I found the two CMSs below, which aren’t very well known. I’m sharing them in hope they’ll help someone looking for a similar solution.

    http://www.impresspages.org/
    http://www.pimcore.org/

    Cheers.

  15. Thanks for the interesting article. It’s indeed an old problem, yet one at the very heart of the difficulty of web content management. I appreciate the effort to compare the approach of the “Web folks” (rich text editor in an object- or page-oriented CMS) with the DITA way. These two worlds are too disconnected and it is good sometimes to try to connect the dots!

    I’m working on eZ Publish (side note: it’s spelled “eZ Publish”; I know this doesn’t always make it easy…) and we’ve been dealing with this problem for many years (from day 1, actually) with the approach of forbidding HTML markup in the rich text editor and relying on an internal XML pivot markup (ezxml). We definitely think this is the right approach. I just wanted to add that we are in the middle of reworking our solution here (moving to a new version of our internal format based on DocBook, and using transformations to different HTML5 views by default). Ping me if ever you are interested.

    One of the main points in the process is the editorial experience within the rich text editor, often misnamed “WYSIWYG” (we use a customization of TinyMCE). How do you “see” the chunks when the content is edited embracing the “Create Once, Publish Everywhere” approach of NPR (and many others; in fact, I’m not sure they invented it…)? By design, WYSIWYG doesn’t make any sense in that scenario, when you have multiple channels, multiple designs, and multiple screen sizes consuming the content… and we are back to the big dilemma of separating content from presentation BUT giving meaning and context to the editors who need it to deliver good content! A fascinating challenge that is more than ever giving many of us a lot of fun.

  16. It says, “Wikipedia recently rolled out an assistive editing tool to help new users navigate the complexity of the site’s content.”

    However, the screenshot provided is not the new tool (it’s an old one called WikiEditor). The new tool is called VisualEditor, and can be accessed directly at https://en.wikipedia.org/w/index.php?title=WYSIWYG&veaction=edit. It’s interesting, since it’s a visual view, but still lets you use templates (which can be added and edited, with parameters), to avoid the issue of e.g. every article coming up with their own way of displaying info on the right.

    You can also see a screenshot of VisualEditor in action.

    You can try VisualEditor more broadly by signing up (if you don’t yet have an account). Then, go to Beta Features, select VisualEditor, and save. There will then be an “Edit” tab at the top, with Beta next to it, which leads to VisualEditor.

    I work for the Wikimedia Foundation as a software engineer, though not on the VisualEditor team.

  17. In my experience, many content authors have little idea about writing with meaning, or writing for the web specifically at all. They are very visually oriented, as if they’re writing in Word. In fact, only people comfortable in abstract thinking grasp the concept of semantics vs visuals at all.

    It is entirely possible to build draggable, reusable components that authors can embed in a content body, with direct visual feedback, and dialogs that allow them to tune that instance of the component. They can even be perfectly responsive. And the markup is clean. I’ve seen it work.

    The problem though is that such an approach is hugely expensive, and requires a lot of custom development, as well as ongoing maintenance.

    Therefore I think, we need to improve the authoring tools, but authors also need to become more skilled in writing content for the web. Yes it’s a tough sell, but if it’s your job to write content for the web, you might as well learn it. We need to close the gaps from both sides.

  18. Jeff:

    Excellent article. You are spot on. And, this is why you need to be at the next Intelligent Content Conference. We just ended this year’s show a few days ago, but I have a free ticket with your name on it for next year’s event. You’re exactly the kind of guy who should be there amongst the hundreds of us who get it.

    I used to spend my time trying to recruit new believers, but, given the awesome amount of work and great paying gigs out there, now I spend most of my time finding qualified people to work on amazing projects with real impact.

    Congrats on summarizing so well what we’ve been preaching for a decade or more. It’ll take some time, but the rest will come kicking and screaming to our party, like it or not.

    Scott Abel
    The Content Wrangler

  19. @Matthew Flaschen — Thanks for the update on Wikipedia’s visual editor. It’s a great example of how a project can iterate on its tools when the underlying “vocabulary” is understood, and I’m going to be giving it a closer look…

  20. @fchristant — You’re absolutely correct that visual tools aren’t incompatible with the approach being discussed, and they can have a significant impact on the user experience and content quality.

    The problem of training writers to capture meaning rather than appearance is definitely a key challenge. That’s actually one of the reasons I’ve been focusing on lessons from the XML community, and the problematic vocabulary mismatch between HTML and the work that most content creators do every day. Although we can never build foolproof systems, the “language” we offer them in the form of markup, assistive editor buttons, and widgets can shape the work in a positive direction. When the training and the functional vocabulary of the tools *complement* each other, it’s helped reduce many of the reuse challenges we see on large content projects.

  21. Thanks for this, I’m inspired. The use of Web Components (currently through Polymer.js), along with a CSS architecture such as SUIT.css, makes this approach a plausible and attractive one.

  22. Well, I never had this kind of problem, because I use the power of MODX Revolution with a customized TinyMCE plugin and other PHP plugins.

    In my opinion, we should create rules for the writers and try to cover all the situations they’ll need to create their articles, by developing specific templates/chunks. And of course, if we can give them some training on HTML, even better.

    As Tony said “… for example, that don’t allow any design decisions by the author, but give them complete flexibility to place specialized content wherever they need in their article, and see how it will look at design time.”

    This is the point: we just create the flexibility for an author to publish their content… and that’s it.

    And I think at this time we’ve got all the tools to solve this problem… we just need to be creative!!! 🙂

    Sorry about my English.

  23. Very interesting read, though I’m not a fan of Drupal; I think there are greater technologies out there.

  24. This is precisely the problem I have been struggling with in recent years.

    I’m comfortable coding a website for a client. The difficult part is handing it over to them with a CMS that strikes the right balance of usability and flexibility, whilst generating clean code and responsive layouts.

    I now think some form of custom tags is the answer to this problem, even if a WYSIWYG editor is built on top of them. The question is now one of finding the technology to implement this…

    I’d like to keep most pre-processing server-side until Web Components are more widely, natively supported. And I favour Ruby/Rails for development.

    Something like the Radius templating engine is a contender. (I think it’s a shame it hasn’t been more widely adopted.) But it would be even better to use pure XML or HTML for forward-compatibility, perhaps in combination with Markdown for simplicity.

    Does anyone have any suggestions?

  25. Wow Jeff, thank you very much for this article.

    I got the impression that as soon as you work outside the popular web publishing world of PHP-based CMSs, in an environment more dominated by engineering, you find very good solutions indeed that have made use of XML (and maybe even XSLT) for quite a while.

    Please let me point out a small comment by sc5 that CMSs will die in the transition to responsive HTML5 services, which is interesting but won’t be the case as I understand it now.

    During my research on this topic I found Symphony CMS, which is PHP-based but makes use of XML and XSLT exclusively.

    Would be interesting to see how and if they make use of it in the native editor. Any experiences?

  26. This feels like a complex way to say “use semantic BBCodes where possible, plus limited styling BBCodes, then convert to markup on output”.

    Haven’t forums and CMSes been doing this for ages, with comment editors inserting such square-bracketed tags in lieu of direct HTML?

  27. Nice article. With vast enterprise content management experience, I feel it is difficult to have a simple tool that designs the content in a way that can be published to any platform without modifications. Trying these things out in various platforms like WordPress, Drupal and even .NET-based Sitefinity, I found that all of this depends upon how a business customizes the workflow using system integration with the Content Management System.

    It is important for CMS development companies to identify a plugin that is most suitable to the expected requirements and customize it further for better results. Make sure to run different websites on the same platform so that the tool you optimized helps in publishing content seamlessly to other websites as well. Using a consistent back-end technology and language helps a great deal.
