The WordPress eXtended Rss (WXR) Export/Import, XML Document Format Decoded and Explained.


This article was written in March 2011 and revised in December 2013. As there is no official documentation on WXR, this information has been reverse-engineered, so it may not be accurate and could become outdated at any time.

One of the great things about WordPress is its portability and its popularity. It is extremely easy for a WordPress owner to move their entire site, comments and all, between different hosting providers without the use of complex database languages such as SQL.

Every WordPress site provides the option to import and export data between WordPress servers. This is not restricted to the site entries themselves but can also include the post categories, tags, comments, drafts and even spam! It does all this with the WordPress Extended Rss document format, WXR.

The WXR format is based on the Really Simple Syndication or Rss specification, which is a very popular dialect of XML. It has been designed as a syndication format for websites that wish to share and serialise some of their data. http://www.rssboard.org/

A web syndication specification might seem an odd choice for a site exporting tool, but the popularity of Rss on today’s Internet, its simplicity and its expandable format through the use of 3rd party extensions make it a great choice. Being an XML dialect also means you can open up any text editor and have complete access to all data in a mark-up format that is human readable, in a layout not too different from an HTML file.

To create a WXR export file you need to log in to your WordPress Dashboard, scroll down to Tools and select Export. Select All content under the Choose what to export option and then press the Download Export File button. A filter option allows you to drill down to specific data to trim your export file size. If you are exporting the complete site I’d recommend changing the Statuses filter to ‘Published’. If left as ‘All Statuses’ the blog’s redundant auto-saved entries will be included, which needlessly duplicate the published articles.

Once you have pressed the Download Export File button and the download has finished, you should have an XML document named [site_title].wordpress-[yyyy]-[mm]-[dd].xml. You can open this with any text editor, even Windows Notepad, but it is preferable to use an editor with XML syntax colouring as it makes the document much easier to read. At the time of writing in 2011, Notepad++ http://notepad-plus-plus.org/ was a good choice for Windows users while TextMate http://macromates.com/ was probably the best choice for OS X.

As the title suggests, in this post I will attempt to decode the content of the WordPress Extended Rss document. This means I will list the Rss elements contained within a standard export, in the order they appear, and briefly describe their purpose.

This will not be a tutorial on XML or Rss and I will assume you have some understanding of both. However if this is not the case things should not be too hard to follow especially for people familiar with HTML documents.

<!-- This is a WordPress eXtended RSS file generated by WordPress as an export of your site. -->

At the top of the WXR file there is a large commented section explaining the purpose of the document and, in case you have forgotten, instructions on how to import the file to a WordPress site.

Beyond the comments is the required <rss> element containing 5 namespace extensions as well as the Rss version number. The extensions include the RDF site summary content module (xmlns:content), the well-formed web comment API (xmlns:wfw), the Dublin Core metadata element set (xmlns:dc) and 2 WordPress extensions (xmlns:excerpt, xmlns:wp). If this isn’t making too much sense then don’t worry as it is not really important unless you are developing an Rss parser.

The namespaces listed are unique, with each serving specific functions that the base Rss specification does not cover. Each XML namespace declaration starts with xmlns: and is followed by a short prefix for the namespace, which is usually an acronym. The URI that follows each prefix is required and acts as the unique identifier of the namespace. By convention it often looks like the address of a webpage that provides further information on the namespace, but it does not have to resolve, and these days the URLs usually point to non-existent pages.

xmlns:dc="http://purl.org/dc/elements/1.1/" is an example of a namespace declaration, in this case for the Dublin Core element set.
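To see which namespaces an export declares without reading the header by hand, a parser can report them. The following is a minimal Python sketch using the standard library. The sample document is a trimmed, hypothetical header: the dc URI is the one quoted above, while the wp URI is an assumption based on a version 1.2 export.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed WXR header. The wp URI is an assumption
# based on a version 1.2 export; check it against your own file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example Site</title>
  </channel>
</rss>
"""


def list_namespaces(xml_text):
    """Return a {prefix: uri} dict of every namespace declared in the document."""
    namespaces = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # The "start-ns" event yields a (prefix, uri) pair per xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces


namespaces = list_namespaces(SAMPLE)
for prefix, uri in namespaces.items():
    print(f"xmlns:{prefix} -> {uri}")
```

The resulting dictionary can be passed as the namespaces argument to find() and findall() when looking up prefixed elements such as those in the wp extension.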

<![CDATA[   ]]> Some tags in an Rss or XML document contain unparsed character data enclosures. These let the XML parser know not to process the text contained within. It is a safety measure against any illegal characters that would normally generate errors. http://www.w3schools.com/xml/xml_cdata.asp
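As a small illustration of the enclosure, the Python sketch below parses a hypothetical <title> element whose text contains an ampersand and angle brackets. Outside a character data enclosure those characters would break the document; inside one the parser hands them back untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: "&" and "<" are illegal in plain XML text, so the
# value is wrapped in a character data enclosure.
SAMPLE = "<title><![CDATA[Tips & Tricks <beta>]]></title>"

element = ET.fromstring(SAMPLE)
title = element.text  # the enclosure itself is stripped, the text is kept as-is
print(title)
```

Note that the parser does not tell you whether the text came from an enclosure or from ordinary escaped text, so a program reading WXR does not need to treat the two differently.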

Below the <rss> element is the <channel> container element. This holds all the child elements and data related to the WordPress site. You can find the closing </rss> element at the bottom of the Rss document. At the top of the <channel> we have the elements that are associated with the WordPress metadata.

<title> Contains the title of the site.

<link> Is the URL of the site as determined by WordPress.

<description> Is a tagline that can be modified in the Dashboard under Settings, General, Tagline.

<pubDate> Is the time and date that the WXR document was created. It is in the RFC 822 format http://asg.web.cmu.edu/rfc/rfc822.html as required by the Rss standard. The format should be self-explanatory except for the last numeric value, which represents the local offset from GMT in a +/-hhmm format. Plus 2 hours from GMT would be represented as +0200. The WordPress time zone can be changed in the Dashboard under Settings, General, Timezone.

<language> Is the primary language the site is written in as determined by Settings, General, Language in the WordPress Dashboard. A list of valid codes used to represent the language can be found at http://www.rssboard.org/rss-language-codes.
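The date format and its +/-hhmm offset can be read with Python's standard library. This is a minimal sketch using a hypothetical sample value, not one taken from a real export.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical <pubDate> value with a +0200 offset from GMT.
pub_date_text = "Sun, 23 Jan 2005 00:00:00 +0200"

# parsedate_to_datetime understands the RFC 822 style date used by Rss.
pub_date = parsedate_to_datetime(pub_date_text)

# The trailing +0200 becomes the timezone offset of the datetime.
offset_hours = pub_date.utcoffset().total_seconds() / 3600

# Converting to GMT/UTC subtracts the offset from the local time.
pub_date_gmt = pub_date.astimezone(timezone.utc)

print(offset_hours)
print(pub_date_gmt.strftime("%Y-%m-%d %H:%M:%S"))
```

Midnight at +0200 is 22:00 on the previous day in GMT, which is why the two printed dates differ.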

<wp:wxr_version> This is our first example of an extended Rss element. We can recognise that it does not belong to the Rss specification as the element name contains a colon. The part left of the colon is the namespace prefix of the extension while the part on the right is the element name. wp:wxr_version is the version number of the WordPress Extended Rss format. At the last update to this article in December 2013 the version number was 1.2.

<wp:base_site_url> Is the root URL of the WordPress hosting provider.

<wp:base_blog_url> Is the root URL of the WordPress site.

<wp:author> Contains details on the authors of the site. Each author gets their own <wp:author> container.
<wp:author_login> Is the author’s WordPress login user name.
<wp:author_email> Is the author’s e-mail address associated with their WordPress account.
<wp:author_display_name> Is the author’s public display name used instead of the login user name for comments and posts.
<wp:author_first_name> Is the author’s first name.
<wp:author_last_name> Is the author’s last name.
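The author details can be pulled out with a few lines of Python. This is only a sketch under stated assumptions: the sample document is hypothetical, the container is assumed to be named wp:author as in a version 1.2 export, and the wp namespace URI is likewise assumed from that version.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp prefix (version 1.2 exports).
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical, trimmed export containing a single author.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>Example Site</title>
    <wp:wxr_version>1.2</wp:wxr_version>
    <wp:author>
      <wp:author_login>jane</wp:author_login>
      <wp:author_email>jane@example.com</wp:author_email>
      <wp:author_display_name><![CDATA[Jane Doe]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[Jane]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[Doe]]></wp:author_last_name>
    </wp:author>
  </channel>
</rss>"""


def read_authors(xml_text):
    """Return one dict per author container found in the channel."""
    channel = ET.fromstring(xml_text).find("channel")
    authors = []
    for author in channel.findall("wp:author", NS):
        authors.append({
            "login": author.findtext("wp:author_login", default="", namespaces=NS),
            "email": author.findtext("wp:author_email", default="", namespaces=NS),
            "display_name": author.findtext(
                "wp:author_display_name", default="", namespaces=NS),
        })
    return authors


authors = read_authors(SAMPLE)
for author in authors:
    print(author["login"], "->", author["display_name"])
```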

<wp:category> Each container holds information on a category used by the site for the classification of posts. You can view and edit the list within the WordPress Dashboard under Posts, Categories. Each category is given its own <wp:category> element and contains the following 4 child elements.
<wp:term_id> Is a unique numeric identifier assigned by WordPress to this category. It is found in URL strings that reference this category.
<wp:category_nicename> Is the category name in a URL friendly format.
<wp:category_parent> If the category belongs to a hierarchy then the parent category is listed.
<wp:cat_name><![CDATA[]]> The original name of the category contained within an unparsed character data enclosure.

<wp:tag> Contains a complete collection of the tags assigned to posts. You can view and edit the tags within the Dashboard under Posts, Tags. It contains the following 3 child elements.
<wp:term_id> Is a unique numeric identifier assigned by WordPress to this tag. It is found in URL strings that reference this tag.
<wp:tag_slug> Is the URL friendly name of the tag.
<wp:tag_name> Is the original name of the tag contained within an unparsed character data enclosure.
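The category and tag containers described above could be read along these lines. It is a sketch only: the sample data is hypothetical, and the element names (in particular wp:term_id) and the wp namespace URI are assumptions based on a version 1.2 export.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp prefix (version 1.2 exports).
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical channel with two categories (one nested) and one tag.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:category>
      <wp:term_id>3</wp:term_id>
      <wp:category_nicename>how-to</wp:category_nicename>
      <wp:category_parent></wp:category_parent>
      <wp:cat_name><![CDATA[How To]]></wp:cat_name>
    </wp:category>
    <wp:category>
      <wp:term_id>7</wp:term_id>
      <wp:category_nicename>xml-tips</wp:category_nicename>
      <wp:category_parent>how-to</wp:category_parent>
      <wp:cat_name><![CDATA[XML Tips]]></wp:cat_name>
    </wp:category>
    <wp:tag>
      <wp:term_id>12</wp:term_id>
      <wp:tag_slug>rss</wp:tag_slug>
      <wp:tag_name><![CDATA[Rss]]></wp:tag_name>
    </wp:tag>
  </channel>
</rss>"""


def text_of(parent, path):
    """Return the text of a child element, or "" when missing or empty."""
    return parent.findtext(path, default="", namespaces=NS) or ""


def read_terms(xml_text):
    """Return (categories, tags) as two lists of dicts."""
    channel = ET.fromstring(xml_text).find("channel")
    categories = [
        {
            "id": int(text_of(cat, "wp:term_id")),
            "nicename": text_of(cat, "wp:category_nicename"),
            "parent": text_of(cat, "wp:category_parent"),  # "" for top level
            "name": text_of(cat, "wp:cat_name"),
        }
        for cat in channel.findall("wp:category", NS)
    ]
    tags = [
        {
            "id": int(text_of(tag, "wp:term_id")),
            "slug": text_of(tag, "wp:tag_slug"),
            "name": text_of(tag, "wp:tag_name"),
        }
        for tag in channel.findall("wp:tag", NS)
    ]
    return categories, tags


categories, tags = read_terms(SAMPLE)
for cat in categories:
    print(cat["name"], "(child of", cat["parent"] or "nothing", ")")
```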

<generator> Is the name or a URL pointing to the homepage of the application that was used to create the Rss document.

<cloud> Is a pointer to the RssCloud API which is a blog monitoring service supported by WordPress.com. It enables a supporting client to receive instant notification when the blog is updated. http://www.rssboard.org/rsscloud-interface

<image> Is a logo belonging to the site that can be displayed by Rss clients. You can modify the logo in the Dashboard under Settings, General, Blog Picture / Icon. There are strict size and image format requirements imposed by the Rss standard. http://www.rssboard.org/rss-specification#ltimagegtSubelementOfLtchannelgt

<atom:link rel="search"> Is a URL pointing to the Open Search description document supplied by WordPress. It gives supported Rss clients and web browsers an easy means to provide search terms to the blog and receive results in a standardised XML format. http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document

<atom:link rel="hub"> Is a URL pointing to the Google-designed PubSubHubbub notification service that is supported by WordPress. In my opinion this is easier to implement and use than the alternative <cloud> service that offers similar functionality. http://code.google.com/p/pubsubhubbub/

That is the end of the Rss metadata related elements. Below is the list of child elements contained within the <item></item> elements. Items contain the details of the unique resources used by the WordPress site, and the element is repeated once for each resource. These include Posts, Pages and Media.

<title> Is the Title for a page or a post, or the Name for media.

<link> Is the URL that points to the site page that displays the item.

<pubDate> Time and date the item was posted to the site, formatted to the RFC 822 specification.

<dc:creator> Lists the author of the item using the user name found in <wp:author_login>. The element is a Dublin Core Rss extension as the Rss specification doesn’t contain any suitable elements for this role.

<guid> Is the globally unique identifier used for the identification of the item by Rss and WordPress clients. According to the Rss standard the isPermaLink="false" attribute should mean that this identifier is not a legitimate website URL and is not usable in a web browser, though in WXR the URLs are valid and point to the asset.

<description> In Rss documents this element contains the synopsis of the item but in WXR it is left blank.

<content:encoded> Is the replacement for the restrictive Rss <description> element. Enclosed within a character data enclosure is the complete WordPress formatted blog post or page complete with HTML tags and all. For media this contains the Description which is also formatted in HTML.

<excerpt:encoded> Contains the excerpt, a summary or description of the post that is often used by RSS/Atom feeds. For media it contains the Caption.

<wp:post_id> This is an auto-incrementing, unique numeric identifier given to each post, media item or page.

<wp:post_date> Time and date that the item was published to the site.

<wp:post_date_gmt> Time and date in GMT that the item was published to the site.

<wp:comment_status> A value stating whether public access for posting comments is opened or closed.

<wp:post_name> Is a unique, URL friendly nicename based on the post title at the time of the first save.

<wp:status> Publish status of the item, one of: publish, draft, pending, private, trash, inherit.

<wp:post_parent> The numeric identification number of the item’s parent. I think this is applicable to WordPress pages which can be nested within each other.

<wp:menu_order> I assume this is related to the menu navigation of nested pages.

<wp:post_type> The item type: post, page or attachment.

<wp:post_password> A non-encrypted password used by WordPress to restrict reading access to the post.

<wp:is_sticky> A numeric Boolean value (0 is false, 1 is true) that determines if the post is sticky. A sticky post is displayed before all other non-sticky posts.

<wp:attachment_url> The URL that points to the media item source. The URL could be used to display in a browser or used in an application to download the media.

<category> Each category or tag associated with the item is given its own <category> element with 2 attributes. The domain attribute is either post_tag or category while the nicename attribute is the URL friendly name. Media items are not given category elements.
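To tie the item elements together, here is a minimal Python sketch that walks every <item>, reads a handful of the elements described above and collects the domain and nicename attributes of each <category>. The sample export is hypothetical and trimmed to the elements used, and the wp namespace URI is an assumption based on a version 1.2 export.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",  # assumed, version 1.2 exports
}

# Hypothetical export with one published post and one draft page.
SAMPLE = """<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello World</title>
      <dc:creator>jane</dc:creator>
      <wp:post_id>10</wp:post_id>
      <wp:status>publish</wp:status>
      <wp:post_type>post</wp:post_type>
      <category domain="category" nicename="how-to"><![CDATA[How To]]></category>
      <category domain="post_tag" nicename="rss"><![CDATA[Rss]]></category>
    </item>
    <item>
      <title>About</title>
      <dc:creator>jane</dc:creator>
      <wp:post_id>11</wp:post_id>
      <wp:status>draft</wp:status>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>"""


def summarise_items(xml_text):
    """Return one summary dict per <item> in the channel."""
    channel = ET.fromstring(xml_text).find("channel")
    summaries = []
    for item in channel.findall("item"):
        summaries.append({
            "id": int(item.findtext("wp:post_id", default="0", namespaces=NS)),
            "title": item.findtext("title", default=""),
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            "status": item.findtext("wp:status", default="", namespaces=NS),
            "type": item.findtext("wp:post_type", default="", namespaces=NS),
            # One (domain, nicename) pair per category or tag on the item.
            "terms": [(c.get("domain"), c.get("nicename"))
                      for c in item.findall("category")],
        })
    return summaries


items = summarise_items(SAMPLE)
published_posts = [i["title"] for i in items
                   if i["type"] == "post" and i["status"] == "publish"]
print(published_posts)
```

Filtering on the type and status values, as in the last lines, is one way to ignore drafts, pages and media when only the published posts are of interest.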

<wp:postmeta> Are containers for newer additions to the WXR document format that have not been given their own WXR tags. Each <wp:postmeta> element contains 2 child elements.
<wp:meta_key> Is a URL friendly reference key for the meta data element.
<wp:meta_value> Is the value for the meta data element contained within a character data enclosure.

Below are some of the <wp:meta_key> references currently used by WXR.

delicious; is data related to the Delicious social bookmarking web service. http://www.delicious.com/
geo_latitude; is the positioning location of the author when they submitted the post. The value is the latitude in degrees using the World Geodetic System 1984 (WGS84) datum. It seems to be based on the Google Gears Geolocation API. http://code.google.com/apis/gears/api_geolocation.html
geo_longitude; is the positioning location of the author when they submitted the post. The value is the longitude coordinates.
geo_accuracy; is the horizontal accuracy of the above positioning values in metres.
geo_address; is the address determined by the above geolocation data.
geo_public; is a Boolean numeric value that determines if the geolocation data should be displayed in the post.
_wpas_; related tags may have something to do with the WordPress Sharing services.
reddit; is data related to the reddit social news web service. http://www.reddit.com/
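Because every <wp:postmeta> container is just a key and value pair, the whole set maps naturally onto a dictionary. The Python sketch below assumes a hypothetical item carrying three of the geo keys listed above with made-up values, and again assumes the version 1.2 wp namespace URI.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp prefix (version 1.2 exports).
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical item; the coordinates are made-up illustration values.
SAMPLE = """<item xmlns:wp="http://wordpress.org/export/1.2/">
  <title>Hello World</title>
  <wp:postmeta>
    <wp:meta_key>geo_latitude</wp:meta_key>
    <wp:meta_value><![CDATA[51.5007]]></wp:meta_value>
  </wp:postmeta>
  <wp:postmeta>
    <wp:meta_key>geo_longitude</wp:meta_key>
    <wp:meta_value><![CDATA[-0.1246]]></wp:meta_value>
  </wp:postmeta>
  <wp:postmeta>
    <wp:meta_key>geo_public</wp:meta_key>
    <wp:meta_value><![CDATA[1]]></wp:meta_value>
  </wp:postmeta>
</item>"""


def read_postmeta(item):
    """Collect the postmeta containers of one item into a dict.

    If a key is repeated, only its last value is kept.
    """
    meta = {}
    for entry in item.findall("wp:postmeta", NS):
        key = entry.findtext("wp:meta_key", default="", namespaces=NS)
        value = entry.findtext("wp:meta_value", default="", namespaces=NS)
        meta[key] = value
    return meta


meta = read_postmeta(ET.fromstring(SAMPLE))
latitude = float(meta["geo_latitude"])
longitude = float(meta["geo_longitude"])
show_location = meta.get("geo_public") == "1"  # numeric Boolean, 1 is true
print(latitude, longitude, show_location)
```

All values arrive as text, so numbers and Boolean flags have to be converted by the reading program, as done in the last few lines.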

<wp:comment> Is a child element of the item that contains the 13 sub-elements listed below. Each <wp:comment> element holds a single comment on the post.
<wp:comment_id> This is an auto-incrementing, unique numeric identifier given to each comment.
<wp:comment_author> The name of the author who submitted the comment. The name value is contained within an unparsed character data enclosure.
<wp:comment_author_email> An e-mail address provided by the author of the comment.
<wp:comment_author_url> The URL of the author’s website provided by the author of the comment.
<wp:comment_author_IP> The IP address belonging to the author of the comment. The IP address is automatically recorded by WordPress.
<wp:comment_date> The date and time local to the blog that the comment was posted.
<wp:comment_date_gmt> The date and time at GMT that the comment was posted.
<wp:comment_content> The comment text enclosed within a character data enclosure.
<wp:comment_approved> A numeric Boolean value to determine if the comment is displayed.
<wp:comment_type> The type of comment. If left blank it is classed as a normal comment. A value of pingback or trackback means it is a post request notification link http://en.wikipedia.org/wiki/Trackback.
<wp:comment_parent> The numeric identification of the parent comment used when the comment is a response to a pre-existing comment.
<wp:comment_user_id> A numeric identification belonging to the author if they were logged in when they submitted the comment.
<wp:commentmeta> Seems to offer additional data much like the earlier <wp:postmeta> element.
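As a rough illustration of the comment structure, the Python sketch below reads every <wp:comment> container of an item and uses <wp:comment_parent> to tell replies from top-level comments. The sample item is hypothetical and carries only some of the sub-elements, and the wp namespace URI is an assumption based on a version 1.2 export.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp prefix (version 1.2 exports).
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical item with two comments, the second replying to the first.
SAMPLE = """<item xmlns:wp="http://wordpress.org/export/1.2/">
  <title>Hello World</title>
  <wp:comment>
    <wp:comment_id>1</wp:comment_id>
    <wp:comment_author><![CDATA[Alice]]></wp:comment_author>
    <wp:comment_content><![CDATA[Great article.]]></wp:comment_content>
    <wp:comment_approved>1</wp:comment_approved>
    <wp:comment_type></wp:comment_type>
    <wp:comment_parent>0</wp:comment_parent>
  </wp:comment>
  <wp:comment>
    <wp:comment_id>2</wp:comment_id>
    <wp:comment_author><![CDATA[Bob]]></wp:comment_author>
    <wp:comment_content><![CDATA[I agree with Alice.]]></wp:comment_content>
    <wp:comment_approved>1</wp:comment_approved>
    <wp:comment_type></wp:comment_type>
    <wp:comment_parent>1</wp:comment_parent>
  </wp:comment>
</item>"""


def text_of(parent, path):
    """Return the text of a child element, or "" when missing or empty."""
    return parent.findtext(path, default="", namespaces=NS) or ""


def read_comments(item):
    """Return one dict per <wp:comment> container inside the item."""
    comments = []
    for node in item.findall("wp:comment", NS):
        comments.append({
            "id": int(text_of(node, "wp:comment_id")),
            "author": text_of(node, "wp:comment_author"),
            "content": text_of(node, "wp:comment_content"),
            "approved": text_of(node, "wp:comment_approved") == "1",
            # A blank type is treated as a normal comment.
            "type": text_of(node, "wp:comment_type") or "comment",
            "parent": int(text_of(node, "wp:comment_parent") or "0"),
        })
    return comments


comments = read_comments(ET.fromstring(SAMPLE))
replies = [c for c in comments if c["parent"] != 0]
print(len(comments), "comments,", len(replies), "of them replies")
```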

Hopefully that extensive list helps you out. As of December 2013 it should be current with all the main elements in a standard WordPress Extended Rss document. If you find any mistakes or know the purpose of any of the unknown elements, please leave a comment.

29 thoughts on “The WordPress eXtended Rss (WXR) Export/Import, XML Document Format Decoded and Explained.”

  1. Pingback: Is comment metadata included in the export file? | WordPress Stack Exchange Monitor

  2. Hi,
    great article, got my project to import from a proprietary db to wordpress going faster than I thought it would.

    Two things though: excerpt:encoded is used for the post’s excerpt, as seen on the edit post page.
    And the title within the item is not the blog’s title but the post’s, same as link (for the sake of the newbies).

    best regards

  3. Pingback: DevHawk Has A Brand New Blog (Engine) – DevHawk

  4. Nice explanation. I would like to use this to perform a bulk import of new posts. Are all of the elements required? For example the under and ? If so how would those values be determined?

  5. I need help with translating a WP site to several different foreign languages. We have to use translation memory and be able to parse out the xml. We have tools to do this, but require a dtd spec.

    If we used the rss spec would it work sufficiently enough for translation of the content?

    Have you any experience with this or might point me in the correct direction?

    Thanks,

  6. I’m constantly searching for great information on this theme. It can be hard to locate occasionally. With thanks!. I’ll check back on your internet site from time to time for more information.

  7. Pingback: How To Spot A Psychopath :: Bye-bye, Blogsome :: November :: 2011

  8. Pingback: Hvordan vedlikeholde gamle permalenker i nytt publiseringsverktøy? | Thomas Misund

  9. Hi sir, I have followed your tutorial and it helped me a lot in understanding the WordPress and standard RSS formats. My issue is that I want to import images as well. I am importing from a custom blog and I am able to produce an XML file according to WXR, which I imported using the RSS importer, but I don’t know how to import the images as well. Can you help me?

  10. Hi, I am trying to import a WXR file to my blog. I defined a custom post type and the import works with one problem: only the last comment of a post is imported, the first ones are ignored. Here is an example code part:

    Battlefront (Xbox)

    http://www.darth-sonic.de/wordpress/videogames/battlefront-xbox/

    Sun, 23 Jan 2005 00:00:00 +0200

    <![CDATA[

    System:
    Xbox

    Genre:
    Taktik Shooter

    Entwickler:
    Pandemic Studios

    Veröffentlichung:
    09/2004

    InhaltsangabeStar Wars: Battlefront wird ein Taktik-Shooter im Stile des erfolgreichen PC-Spiels Battlefield 1942.Ihr werdet mit bis zu 32 Spielern heiße Gefechte als Soldat auskämpfen. Vier verschiedene Armeen stehen zur Auswahl. Kämpft auf 10 bekannten Planeten und nutzt ein breites Spektrum an waffen und Fahrzeugen, inklusive X-Wings, Snowspeeder und AT-ST´s.Meinung des Autors

    Noch nicht getestet!

    Grafik:
    0%

    Sound:
    0%

    Steuerung:
    0%

    Spielspaß:
    0%

    Anzeige - Diese Spiel jetzt kaufen?

    ]]>

    2005-01-23 00:00:00
    2005-01-22 22:00:00
    open
    open
    battlefront-xbox
    publish
    0
    0
    videogames

    additional_link_1

    2008-10-03 00:08:38
    2008-10-02 22:08:38

    1

    0

    2008-10-03 00:09:07
    2008-10-02 22:09:07

    1

    0

    2008-10-03 00:09:50
    2008-10-02 22:09:50

    1

    0

    2008-10-03 00:12:14
    2008-10-02 22:12:14

    1

    0

    • There would be a couple of things I would try.

      Firstly when you export your blog, make sure you select “All Content” when you “Choose what to export”

      I would then run an XML validator on your blog’s export file, as there may be some illegal characters in there that are breaking the XML syntax.

      http://www.w3schools.com/dom/dom_validate.asp

      Also it might be wise to make sure the version of WordPress you are exporting from is the same as the one you are importing to.

  11. Hi, I have a Blogger XML file that is 47.9MB. I need to convert it to a WordPress WXR file. The Google app I normally use throws me errors about the file being too large. How do I manually convert an XML file to a WXR or conversely, break up a Blogger XML successfully and then use the conversion app? I’ve Googled for nearly 3 hours and found nothing helpful. :(

    • You shouldn’t need to use a ‘conversion app’ as WordPress allows you to directly import a Blogger XML file. https:// (your word press blog url) /wp-admin/admin.php?import=blogger

      I have not downloaded or tried this program myself but this looks to do what you are requesting.

      http://sourceforge.net/projects/splitthatxml/files/

      Otherwise as a last resort an XML file is just a plain text file. You could simply use a text editor like Notepad++ http://notepad-plus-plus.org/ to manually copy parts of the XML into separate files. But a word of warning you would need to maintain a valid XML structure for each new file you created otherwise it probably wouldn’t work. http://www.w3schools.com/xml/xml_syntax.asp

      • The problem with the Blogger import tool is that not everything comes over. Converting the XML file to a WXR file, then breaking it up into smaller pieces is the most efficient way to get it all at once with no issues. The problem is that this blog has over 800 posts and the conversion tool I use says it’s too large. I’m just afraid that breaking up the XML file will make it not valid and then the WXR will not be valid. The best option for me is something that can convert such a large file to WXR and then I can split it.

    • Well, both Blogger’s exported file and a WordPress WXR file are XML files. XML is basically a machine readable text file. That means you can load either file type into a text editor and modify it as you please. I suggest making a copy of the Blogger XML for a backup and then using a text editor like Notepad++ to manually cut and paste its content into smaller files, and then individually using those on your converter. It is quite doable, otherwise you can pay someone to do it for you.

  12. Thank you for such a great post. It would be a good idea to give an example of how a particular tag is used. This would make it complete I believe.

    Thank you.
