WordPress eXtended RSS (WXR) export/import
XML document format decoded and explained
One of the great things about WordPress, besides its popularity, is its portability. It is extremely easy for a WordPress owner to move their entire site, comments and all, between different hosting providers without the use of complex database languages such as SQL.
This article is based on a document export taken from the free hosting service WordPress.com. The service uses some plugins and creates some tags that may not be included in the WordPress core downloaded from WordPress.org.
Every WordPress site provides the option to import and export data between WordPress servers. This is not restricted to the site entries themselves but can also include the post categories, tags, comments, drafts and even spam! It does all this with the WordPress Extended RSS document format, WXR.
The WXR format is based on the Really Simple Syndication or RSS specification, which is a very popular dialect of XML. It has been designed as a syndication format for websites that wish to share and serialise some of their data. http://www.rssboard.org/
A web syndication specification might seem an odd choice for a site exporting tool, but the popularity of RSS on today’s Internet, its simplicity and its expandable format through the use of 3rd party extensions make it a great choice. Being an XML dialect also means you can open up any text editor and have complete access to all data in a mark-up format that is human readable, in a layout not too different from an HTML file.
To create a WXR export file you need to log in to your WordPress Dashboard, scroll down to Tools and select Export. Select All content under the Choose what to export option and then press Download Export File.
Once it has finished downloading you should have an XML document with the name of [site_title].wordpress-[yyyy]-[mm]-[dd].xml. You can open this with any text editor, even Windows Notepad. But it is preferable that you use a text editor with XML syntax colouring as it makes the document much easier to read. At the time of writing in 2011, Notepad++ http://notepad-plus-plus.org/ was a good choice for Windows users while TextMate http://macromates.com/ was probably the best choice for OS X.
As the title suggests, in this post I will attempt to decode the content of the WordPress Extended RSS document. This means I will list, in the order they appear, the RSS elements contained within a standard export and briefly describe their purpose.
This will not be a tutorial on XML or RSS and I will assume you have some understanding of both. However, if this is not the case things should not be too hard to follow, especially for people familiar with HTML documents.
<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your site. -->
At the top of the WXR file there is a large commented section explaining the purpose of the document and, in case you have forgotten, instructions on how to import the file to a WordPress site.
Beyond the comments is the required <rss> element containing 5 namespace extensions as well as the RSS numeric version. The extensions include the RDF site summary content module (xmlns:content), the well-formed web comment API (xmlns:wfw), the Dublin Core metadata element set (xmlns:dc) and 2 WordPress extensions (xmlns:excerpt and xmlns:wp). If this isn’t making too much sense then don’t worry as it is not really important unless you are developing an RSS parser.
The namespaces listed are unique with each serving specific functions that the base RSS specification does not cover. Each XML namespace starts with xmlns: and is followed by an abbreviated title of the namespace which is usually an acronym. The URL that follows each title is a requirement and should point to a webpage that provides further information on the namespace. These days though the URLs usually point to non-existent pages.
xmlns:dc="http://purl.org/dc/elements/1.1/" is an example of the Dublin Core element set namespace.
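To see which namespaces an export declares without reading the header by eye, something like the following Python sketch should work. The sample header is a trimmed-down stand-in I made up for illustration, and list_namespaces is just a helper name of my own.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down WXR header used only for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/">
  <channel><title>Example</title></channel>
</rss>"""


def list_namespaces(xml_text):
    """Return a {prefix: uri} dict of the namespaces declared in the document."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    found = {}
    # The "start-ns" event yields a (prefix, uri) pair for every xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        found[prefix] = uri
    return found


namespaces = list_namespaces(SAMPLE)
for prefix, uri in namespaces.items():
    print(f"xmlns:{prefix} = {uri}")
```

The prefixes are only shorthand; a parser matches elements on the full URI, so the same dictionary can be passed to the ElementTree find functions when reading extended elements.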
<![CDATA[ ]]> Some tags in an RSS or XML document contain unparsed character data enclosures. These let the XML parser know not to process the text contained within. It is a safety measure against any illegal characters that would normally generate errors. http://www.w3schools.com/xml/xml_cdata.asp
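A quick way to convince yourself of what the enclosure does is to parse a fragment. This is a small sketch using Python's standard library; the title text is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the CDATA section holds characters (< > &)
# that would otherwise have to be escaped.
FRAGMENT = "<title><![CDATA[Tips & tricks for <b>bold</b> text]]></title>"

element = ET.fromstring(FRAGMENT)

# The parser hands back the raw text. The CDATA wrapper itself is gone and
# the <b> markup is treated as plain characters, not as a child element.
print(element.text)
print("child elements:", len(element))
```

So a program reading a WXR file does not need to strip the CDATA markers itself; the XML parser takes care of that.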
Beneath the opening <rss> element is the <channel> container element. This holds all the child elements and data related to the WordPress site. You can find the closing </rss> element at the bottom of the RSS document. At the top of the <channel> we have the elements that are associated with the WordPress metadata.
<title> Contains the title of the site.
<link> Is the URL of the site as determined by WordPress.
<description> Is a tagline that can be modified in the Dashboard under Settings, General, Tagline.
<pubDate> Is the time and date that the WXR document was created. It is in the RFC-822 format http://asg.web.cmu.edu/rfc/rfc822.html as required by the RSS standard. The format should be self-explanatory except for the last numeric value which represents the local differential from GMT using a +/-hhmm format. Plus 2 hours from GMT would be represented as +0200. The WordPress time zone can be changed in the Dashboard under Settings, General, Timezone.
<language> Is the primary language the site is written in as determined by Settings, General Settings, Language in the WordPress Dashboard. A list of valid codes used to represent the language can be found at http://www.rssboard.org/rss-language-codes.
<wp:wxr_version> This is our first example of an extended RSS element. We can recognise that it does not belong to the RSS specification as the element name contains a colon. The part to the left of the colon is the namespace prefix of the extension while the part to the right is the element name.
wp:wxr_version is the version number of the WordPress eXtended RSS format. At the last update to this article in December 2013 the version number was at 1.2.
<wp:base_site_url> Is the root URL of the WordPress hosting provider.
<wp:base_blog_url> Is the root URL of the WordPress site.
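The site level elements covered so far can be read with a few lines of Python. This is only a sketch: the sample values are invented, and I am assuming the wp prefix maps to the http://wordpress.org/export/1.2/ namespace used by version 1.2 exports.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Assumed namespace URI for WXR version 1.2.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical channel header, invented for this example.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example Site</title>
    <link>https://example.wordpress.com</link>
    <description>Just another WordPress.com site</description>
    <pubDate>Mon, 02 Dec 2013 10:30:00 +0200</pubDate>
    <language>en</language>
    <wp:wxr_version>1.2</wp:wxr_version>
    <wp:base_site_url>http://wordpress.com/</wp:base_site_url>
    <wp:base_blog_url>https://example.wordpress.com</wp:base_blog_url>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

site = {
    # Plain RSS elements need no namespace mapping.
    "title": channel.findtext("title"),
    "link": channel.findtext("link"),
    "language": channel.findtext("language"),
    # Extended elements are looked up through their prefix.
    "wxr_version": channel.findtext("wp:wxr_version", namespaces=NS),
    "base_blog_url": channel.findtext("wp:base_blog_url", namespaces=NS),
}

# The RFC-822 date, including the +hhmm offset, becomes an aware datetime.
published = parsedate_to_datetime(channel.findtext("pubDate"))

print(site)
print(published.isoformat())
```

Checking wxr_version first is a sensible habit, since element names have changed between WXR versions.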
<wp:author> Contains details on the authors of the site. Each author gets their own <wp:author> container.
<wp:author_login> Is the author’s WordPress login user name.
<wp:author_email> Is the author’s e-mail address associated with their WordPress account.
<wp:author_display_name> Is the author’s public display name used instead of the login user name for comments and posts.
<wp:author_first_name> Is the author’s first name.
<wp:author_last_name> Is the author’s last name.
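As a rough illustration, the author containers can be collected into a list of dictionaries. The two authors below are made up, and read_authors is a hypothetical helper rather than anything supplied by WordPress.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for WXR version 1.2.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical authors, invented for this example.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:author>
      <wp:author_login>admin</wp:author_login>
      <wp:author_email>admin@example.com</wp:author_email>
      <wp:author_display_name><![CDATA[Site Admin]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[Alex]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[Smith]]></wp:author_last_name>
    </wp:author>
    <wp:author>
      <wp:author_login>guest</wp:author_login>
      <wp:author_email>guest@example.com</wp:author_email>
      <wp:author_display_name><![CDATA[Guest Writer]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[]]></wp:author_last_name>
    </wp:author>
  </channel>
</rss>"""

FIELDS = ("login", "email", "display_name", "first_name", "last_name")


def read_authors(xml_text):
    """Return one dict per author container found in the channel."""
    root = ET.fromstring(xml_text)
    authors = []
    for node in root.iterfind("channel/wp:author", NS):
        # Each field lives in a child element named wp:author_<field>.
        authors.append({
            field: node.findtext(f"wp:author_{field}", default="", namespaces=NS)
            for field in FIELDS
        })
    return authors


authors = read_authors(SAMPLE)
for author in authors:
    print(author["login"], "->", author["display_name"])
```

Keying later lookups on the login name is convenient because that is the value posts use to reference their author.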
<wp:category> Each container holds information on a category used by the site for the classification of posts. You can view and edit the list within the WordPress Dashboard under Posts, Categories. Each category is given its own <wp:category> element and contains the following 4 child elements.
<wp:term_id> Is a unique numeric identifier assigned by WordPress to this category. It is found in URL strings that reference this category.
<wp:category_nicename> Is the category name in a URL friendly format.
<wp:category_parent> If the category belongs to a hierarchy then the parent category is listed.
<wp:cat_name> Is the original name of the category contained within an unparsed character data enclosure.
<wp:tag> Each container holds information on a tag assigned to posts. You can view and edit the tags within the Dashboard under Posts, Tags. It contains the following 3 child elements.
<wp:term_id> Is a unique numeric identifier assigned by WordPress to this tag. It is found in URL strings that reference this tag.
<wp:tag_slug> Is the URL friendly name of the tag.
<wp:tag_name> Is the original name of the tag contained within an unparsed character data enclosure.
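The category and tag containers lend themselves to simple lookup tables keyed on the URL friendly name. Below is a sketch of that idea with invented terms, again assuming the version 1.2 namespace URI.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for WXR version 1.2.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical terms, invented for this example.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:category>
      <wp:term_id>3</wp:term_id>
      <wp:category_nicename>news</wp:category_nicename>
      <wp:category_parent></wp:category_parent>
      <wp:cat_name><![CDATA[News]]></wp:cat_name>
    </wp:category>
    <wp:category>
      <wp:term_id>7</wp:term_id>
      <wp:category_nicename>local-news</wp:category_nicename>
      <wp:category_parent>news</wp:category_parent>
      <wp:cat_name><![CDATA[Local News]]></wp:cat_name>
    </wp:category>
    <wp:tag>
      <wp:term_id>12</wp:term_id>
      <wp:tag_slug>xml</wp:tag_slug>
      <wp:tag_name><![CDATA[XML]]></wp:tag_name>
    </wp:tag>
  </channel>
</rss>"""

channel = ET.fromstring(SAMPLE).find("channel")

# Categories keyed by nicename; an empty parent means a top level category.
categories = {}
for node in channel.iterfind("wp:category", NS):
    slug = node.findtext("wp:category_nicename", default="", namespaces=NS)
    categories[slug] = {
        "id": int(node.findtext("wp:term_id", default="0", namespaces=NS)),
        "name": node.findtext("wp:cat_name", default="", namespaces=NS),
        "parent": node.findtext("wp:category_parent", default="", namespaces=NS) or None,
    }

# Tags have no hierarchy, so a slug to name mapping is enough.
tags = {
    node.findtext("wp:tag_slug", default="", namespaces=NS):
        node.findtext("wp:tag_name", default="", namespaces=NS)
    for node in channel.iterfind("wp:tag", NS)
}

print(categories)
print(tags)
```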
<generator> Is the name or a URL pointing to the homepage of the application that was used to create the RSS document.
<cloud> Is a pointer to the RssCloud API which is a blog monitoring service supported by WordPress.com. It enables a supporting client to receive instant notification when the blog is updated. http://www.rssboard.org/rsscloud-interface
<image> Is a logo belonging to the site that can be displayed by RSS clients. You can modify the logo in the Dashboard under Settings, General, Blog Picture / Icon. There are strict size and image format requirements imposed by the RSS standard. http://www.rssboard.org/rss-specification#ltimagegtSubelementOfLtchannelgt
<atom:link rel="search"> Is a URL pointing to the Open Search description document supplied by WordPress. It gives supporting RSS clients and web browsers an easy means to submit search terms to the blog and receive results in a standardised XML format. http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document
<atom:link rel="hub"> Is a URL pointing to the Google designed pubsubhubbub notification service that is supported by WordPress. In my opinion this is easier to implement and use than the alternative <cloud> service that offers similar functionality. http://code.google.com/p/pubsubhubbub/
That is the end of the RSS metadata related elements. Below is the list of child elements contained within the <item></item> elements. Items are repeated multiple times as each item holds the details of a single resource used by the WordPress site. These include Posts, Pages and Media.
<title> Is the Title for a page or a post, or the Name for media.
<link> Is the URL that points to the site page that displays the item.
<pubDate> Is the time and date that the item was posted to the site, formatted to the RFC 822 specification.
<dc:creator> Lists the author of the item using the user name found in <wp:author_login>. The element is a Dublin Core RSS extension as the RSS specification doesn’t contain any suitable elements for this role.
<guid> Is the globally unique identifier used for the identification of the item by RSS and WordPress clients. According to the RSS standard the isPermaLink=false attribute should mean that this identifier is not a legitimate website URL and is not usable in a web browser, though in WXR the URLs are valid and point to the asset.
<description> In RSS documents this element contains the synopsis of the item but in WXR it is left blank.
<content:encoded> Is the replacement for the restrictive RSS <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post or page, HTML tags and all. For media this contains the Description which is also formatted in HTML.
<excerpt:encoded> Is a summary or description of the post, often used by RSS/Atom feeds. For media this contains the Caption.
<wp:post_id> This is an auto-incremental, numeric, unique identification number given to each post, media item or page.
<wp:post_date> Time and date that the item was published to the site.
<wp:post_date_gmt> Time and date in GMT that the item was published to the site.
<wp:comment_status> A value stating whether public access for posting comments is opened or closed.
<wp:post_name> Is a unique, URL friendly nicename based on the post title at the time of the first save.
<wp:status> Publish status of the item, for example publish, draft or private.
<wp:post_parent> The numeric identification number of the item’s parent item. I think this is applicable to WordPress pages which can be nested within each other.
<wp:menu_order> I assume this is related to the menu navigation of nested pages.
<wp:post_type> The item type, for example post, page or attachment.
<wp:post_password> A non-encrypted password used by WordPress to restrict reading access to the post.
<wp:is_sticky> A numeric Boolean value (0 is false, 1 is true) to determine if the post is sticky. A sticky post will be displayed before all other non-sticky posts.
<wp:attachment_url> The URL that points to the media item source. The URL could be used to display in a browser or used in an application to download the media.
<category> Each category or tag associated with the item is given its own <category> element with 2 attributes. The domain attribute lists either post_tag or category while the nicename attribute is the URL friendly name. Media items are not given category elements.
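Putting several of the item elements together, here is a sketch of how the items could be read and their category elements split by the domain attribute. The single item is invented, read_items is a hypothetical helper, and the namespace URIs are the ones I would expect a version 1.2 export to declare.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for a WXR version 1.2 export.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Hypothetical item, invented for this example.
SAMPLE = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello world</title>
      <link>https://example.wordpress.com/2013/12/02/hello-world/</link>
      <dc:creator>admin</dc:creator>
      <content:encoded><![CDATA[<p>Welcome to my site.</p>]]></content:encoded>
      <wp:post_id>1</wp:post_id>
      <wp:post_name>hello-world</wp:post_name>
      <wp:status>publish</wp:status>
      <wp:post_type>post</wp:post_type>
      <wp:is_sticky>0</wp:is_sticky>
      <category domain="category" nicename="news"><![CDATA[News]]></category>
      <category domain="post_tag" nicename="xml"><![CDATA[XML]]></category>
    </item>
  </channel>
</rss>"""


def read_items(xml_text):
    """Return one dict per item, with its terms grouped by domain."""
    items = []
    for node in ET.fromstring(xml_text).iterfind("channel/item"):
        terms = {"category": [], "post_tag": []}
        for term in node.iterfind("category"):
            # setdefault copes with any other domain value that may turn up.
            terms.setdefault(term.get("domain", ""), []).append(term.get("nicename", ""))
        items.append({
            "id": int(node.findtext("wp:post_id", default="0", namespaces=NS)),
            "title": node.findtext("title", default=""),
            "author": node.findtext("dc:creator", default="", namespaces=NS),
            "type": node.findtext("wp:post_type", default="", namespaces=NS),
            "status": node.findtext("wp:status", default="", namespaces=NS),
            "sticky": node.findtext("wp:is_sticky", default="0", namespaces=NS) == "1",
            "html": node.findtext("content:encoded", default="", namespaces=NS),
            "categories": terms["category"],
            "tags": terms["post_tag"],
        })
    return items


items = read_items(SAMPLE)
for item in items:
    print(item["id"], item["type"], item["title"], item["categories"], item["tags"])
```

Filtering the resulting list on the type value is then enough to separate posts from pages and media.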
<wp:postmeta> Are containers for newer additions to the WXR document format that have not been given their own WXR tags. Each <wp:postmeta> element contains 2 child elements.
<wp:meta_key> Is a URL friendly reference key for the meta data element.
<wp:meta_value> Is the value for the meta data element contained within a character data enclosure.
Below are some of the <wp:meta_key> references currently used by WXR.
delicious; is data related to the Delicious social bookmarking web service. http://www.delicious.com/
geo_latitude; is the positioning location of the author when they submitted the post. The value is the latitude in degrees using the World Geodetic System 1984 (WGS84) datum. It seems to be based on the Google Gears Geolocation API. http://code.google.com/apis/gears/api_geolocation.html
geo_longitude; is the positioning location of the author when they submitted the post. The value is the longitude coordinates.
geo_accuracy; is the horizontal accuracy of the above positioning values in metres.
geo_address; is the address determined by the above geolocation data.
geo_public; is a Boolean numeric value that determines if the geolocation data should be displayed in the post.
_wpas_; keys starting with this prefix may have something to do with the WordPress Sharing services.
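Because every postmeta container is just a key and value pair, it is natural to fold them into a dictionary per item. The sketch below does that and then pulls out the geolocation values; the coordinates are invented and read_postmeta is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for WXR version 1.2.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical item carrying geolocation meta data, invented for this example.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Postcard</title>
      <wp:postmeta>
        <wp:meta_key>geo_latitude</wp:meta_key>
        <wp:meta_value><![CDATA[51.5074]]></wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>geo_longitude</wp:meta_key>
        <wp:meta_value><![CDATA[-0.1278]]></wp:meta_value>
      </wp:postmeta>
      <wp:postmeta>
        <wp:meta_key>geo_public</wp:meta_key>
        <wp:meta_value><![CDATA[1]]></wp:meta_value>
      </wp:postmeta>
    </item>
  </channel>
</rss>"""


def read_postmeta(item):
    """Collect the key and value pairs of one item into a dict."""
    meta = {}
    for node in item.iterfind("wp:postmeta", NS):
        key = node.findtext("wp:meta_key", default="", namespaces=NS)
        # If a key is repeated, this simple version keeps the last value.
        meta[key] = node.findtext("wp:meta_value", default="", namespaces=NS)
    return meta


item = ET.fromstring(SAMPLE).find("channel/item")
meta = read_postmeta(item)

# Only expose the coordinates when the author marked them as public.
location = None
if meta.get("geo_public") == "1":
    location = (float(meta["geo_latitude"]), float(meta["geo_longitude"]))

print(meta)
print(location)
```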
<wp:comment> Is a child element of the item that contains the 13 sub-elements listed below. Each <wp:comment> element set holds a single comment.
<wp:comment_id> This is an auto-incremental, numeric, unique identification number given to each comment.
<wp:comment_author> The name of the author who submitted the comment. The name value is contained within an unparsed character data enclosure.
<wp:comment_author_email> An e-mail address provided by the author of the comment.
<wp:comment_author_url> The URL of the author’s website provided by the author of the comment.
<wp:comment_author_IP> The IP address belonging to the author of the comment. The IP address is automatically recorded by WordPress.
<wp:comment_date> The date and time local to the blog that the comment was posted.
<wp:comment_date_gmt> The date and time at GMT that the comment was posted.
<wp:comment_content> The comment text enclosed within a character data enclosure.
<wp:comment_approved> A numeric Boolean value to determine if the comment is displayed.
<wp:comment_type> The type of comment. If left blank it is classed as a normal comment. A value of pingback or trackback means it is a post request notification link http://en.wikipedia.org/wiki/Trackback.
<wp:comment_parent> The numeric identification of the parent comment used when the comment is a response to a pre-existing comment.
<wp:comment_user_id> A numeric identification belonging to the author if they were logged in when they submitted the comment.
<wp:commentmeta> Seems to offer additional data much like the earlier <wp:postmeta> element.
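As a final sketch, the comments of an item can be read and grouped by their parent to rebuild the reply threads. The two comments are invented, and I am assuming that a parent value of 0 marks a top level comment.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed namespace URI for WXR version 1.2.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical item with one comment and one reply, invented for this example.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello world</title>
      <wp:comment>
        <wp:comment_id>5</wp:comment_id>
        <wp:comment_author><![CDATA[Jo]]></wp:comment_author>
        <wp:comment_date>2013-12-02 10:45:00</wp:comment_date>
        <wp:comment_content><![CDATA[Nice post!]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_type></wp:comment_type>
        <wp:comment_parent>0</wp:comment_parent>
      </wp:comment>
      <wp:comment>
        <wp:comment_id>6</wp:comment_id>
        <wp:comment_author><![CDATA[admin]]></wp:comment_author>
        <wp:comment_date>2013-12-02 11:00:00</wp:comment_date>
        <wp:comment_content><![CDATA[Thanks, Jo.]]></wp:comment_content>
        <wp:comment_approved>1</wp:comment_approved>
        <wp:comment_type></wp:comment_type>
        <wp:comment_parent>5</wp:comment_parent>
      </wp:comment>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")

comments = []
for node in item.iterfind("wp:comment", NS):
    comments.append({
        "id": int(node.findtext("wp:comment_id", default="0", namespaces=NS)),
        "parent": int(node.findtext("wp:comment_parent", default="0", namespaces=NS)),
        "author": node.findtext("wp:comment_author", default="", namespaces=NS),
        "text": node.findtext("wp:comment_content", default="", namespaces=NS),
        "approved": node.findtext("wp:comment_approved", default="", namespaces=NS) == "1",
        # A blank type is treated as a normal comment.
        "type": node.findtext("wp:comment_type", default="", namespaces=NS) or "comment",
    })

# Map each parent id to the ids of its direct replies (0 = top level, assumed).
replies = defaultdict(list)
for comment in comments:
    replies[comment["parent"]].append(comment["id"])

print(dict(replies))
```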
Hopefully that extensive list helps you out. As of December 2013 it should be current with all the main elements in a standard WordPress Extended RSS document. If you find any mistakes or know the purpose of any of the unknown elements please leave a comment.
Written by Ben Garrett