pyMicrodata Package

This module implements the microdata->RDF algorithm, as documented by the U{W3C Semantic Web Interest Group Note<http://www.w3.org/TR/2012/NOTE-microdata-rdf-20120308/>}.

The module can be used via a stand-alone script (an example is part of the distribution) or bound to a CGI script as a Web Service. An example CGI script is also added to the distribution. Both the local script and the distribution may have to be adapted to local circumstances.

(Simple) Usage

From a Python file, using the default output format (RDF/XML, i.e., 'pretty-xml', according to the rdf_from_source signature)::

    from pyMicrodata import pyMicrodata
    print(pyMicrodata().rdf_from_source('filename'))

The output format can also be set explicitly through the outputFormat keyword. E.g., to request RDF/XML output, one could use::

    from pyMicrodata import pyMicrodata
    print(pyMicrodata().rdf_from_source('filename', outputFormat='pretty-xml'))

It is also possible to embed microdata processing into another program. E.g., using::

    from pyMicrodata import pyMicrodata
    graph = pyMicrodata().graph_from_source('filename')

returns an RDFLib.Graph object instead of a serialization thereof. See the description of the L{pyMicrodata class<pyMicrodata.pyMicrodata>} for details on further entry points.

There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.

Return formats

By default, the output format for the graph is RDF/XML ('pretty-xml'). The following keys can be passed to the package entry points to select a format: 'turtle', 'n3', 'xml', 'pretty-xml', and 'nt'. See rdf_from_source for which of these keys are synonyms.
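The relationship between the format keys can be sketched as a small lookup table. This is only an illustration: the helper `describe_format` and the table `FORMAT_FAMILIES` are hypothetical names, not part of pyMicrodata, and the grouping simply follows the synonym pairs stated in the rdf_from_source documentation.

```python
# Hypothetical helper (not part of pyMicrodata): groups the documented
# outputFormat keys by the serialization family they select.
FORMAT_FAMILIES = {
    "xml": "RDF/XML",
    "pretty-xml": "RDF/XML",   # documented as a synonym of "xml"
    "turtle": "Turtle/N3",
    "n3": "Turtle/N3",         # documented as a synonym of "turtle"
    "nt": "N-Triples",
}


def describe_format(key="pretty-xml"):
    """Return the serialization family for a documented format key."""
    try:
        return FORMAT_FAMILIES[key]
    except KeyError:
        raise ValueError("unsupported output format key: %r" % (key,))
```

Calling `describe_format()` without arguments mirrors the documented default of 'pretty-xml'.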

@summary: Microdata parser (distiller)
@requires: Python version 2.5 or up
@requires: U{RDFLib<http://rdflib.net>}
@requires: U{html5lib<http://code.google.com/p/html5lib/>} for the HTML5 parsing; note possible dependencies on Python's version on the project's web site
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
@copyright: W3C

exception rdflib.plugins.parsers.pyMicrodata.HTTPError(http_msg, http_code)[source]

Bases: rdflib.plugins.parsers.pyMicrodata.MicrodataError

Raised when HTTP problems are detected. It does not add any new functionality to the Exception class.

__init__(http_msg, http_code)[source]
__module__ = 'rdflib.plugins.parsers.pyMicrodata'
exception rdflib.plugins.parsers.pyMicrodata.MicrodataError(msg)[source]

Bases: exceptions.Exception

Superclass of the exceptions representing error conditions encountered during microdata processing. It does not add any new functionality to the Exception class.

__init__(msg)[source]
__module__ = 'rdflib.plugins.parsers.pyMicrodata'
__weakref__

list of weak references to the object (if defined)
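The hierarchy described above means that a single `except MicrodataError` clause also catches HTTP problems. The following is a minimal sketch with stand-in classes that only mirror the documented constructor signatures; the attribute names `msg` and `http_code` are assumptions, not confirmed by this reference.

```python
# Stand-in classes mirroring the documented signatures; the attribute
# names stored on the instances are assumptions made for illustration.
class MicrodataError(Exception):
    def __init__(self, msg):
        Exception.__init__(self, msg)
        self.msg = msg


class HTTPError(MicrodataError):
    def __init__(self, http_msg, http_code):
        MicrodataError.__init__(self, http_msg)
        self.http_code = http_code


def status_of(exc):
    """Pick an HTTP status for an error; 400 is an arbitrary fallback."""
    return getattr(exc, "http_code", 400)


def run_guarded(action):
    """Run action(), returning (result, status) with errors mapped to a status."""
    try:
        return action(), 200
    except MicrodataError as exc:  # also catches HTTPError
        return None, status_of(exc)
```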

rdflib.plugins.parsers.pyMicrodata.processURI(uri, outputFormat, form)[source]

The standard processing of a microdata URI, with options given in a form, i.e., as an entry point from a CGI call.

The call accepts extra form options (e.g., HTTP GET options) as follows:

@param uri: URI to access. Note that the 'text:' and 'uploaded:' values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
@param outputFormat: serialization format, as understood by RDFLib. Note that though 'turtle' is a possible parameter value, some versions of the RDFLib Turtle generation do odd (though legal) things with namespaces, defining unusual and unwanted prefixes.
@param form: extra call options (from the CGI call) to set up the local options (if any)
@type form: cgi FieldStorage instance
@return: serialized graph
@rtype: string

class rdflib.plugins.parsers.pyMicrodata.pyMicrodata(base='', vocab_expansion=False, vocab_cache=True)[source]

Main processing class for the distiller.
@ivar base: the base value for processing
@ivar http_status: HTTP status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers

__init__(base='', vocab_expansion=False, vocab_cache=True)[source]

@keyword base: URI for the default 'base' value (usually the URI of the file to be processed)
@keyword vocab_expansion: whether vocab expansion should be performed or not
@type vocab_expansion: Boolean
@keyword vocab_cache: if vocabulary expansion is done, then perform caching of the vocabulary data
@type vocab_cache: Boolean

__module__ = 'rdflib.plugins.parsers.pyMicrodata'
graph_from_DOM(dom, graph=None)[source]

Extract the RDF Graph from a DOM tree.
@param dom: a DOM Node element, the top level entry node for the whole tree (to be precise, dom.documentElement is used to initiate processing)
@keyword graph: an RDF Graph (if None, then a new one is created)
@type graph: rdflib Graph instance
@return: an RDF Graph
@rtype: rdflib Graph instance

graph_from_source(name, graph=None, rdfOutput=False)[source]

Extract an RDF graph from a microdata source. The source is parsed, the RDF extracted, and the RDF Graph is returned. This is a front-end to the L{pyMicrodata.graph_from_DOM} method.

@param name: a URI, a file name, or a file-like object
@return: an RDF Graph
@rtype: rdflib Graph instance

rdf_from_source(name, outputFormat='pretty-xml', rdfOutput=False)[source]

Extract an RDF graph from a microdata source and serialize it. The source is parsed, the RDF extracted, and serialization is done in the specified format.
@param name: a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of 'turtle', 'n3', 'xml', 'pretty-xml', 'nt'. 'xml' and 'pretty-xml', as well as 'turtle' and 'n3', are synonyms.
@return: a serialized RDF Graph
@rtype: string

rdf_from_sources(names, outputFormat='pretty-xml', rdfOutput=False)[source]

Extract an RDF graph from a list of microdata sources and serialize them in one graph. The sources are parsed, the RDF extracted, and serialization is done in the specified format.
@param names: list of sources, each can be a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of 'turtle', 'n3', 'xml', 'pretty-xml', 'nt'. 'xml' and 'pretty-xml', as well as 'turtle' and 'n3', are synonyms.
@return: a serialized RDF Graph
@rtype: string

microdata Module

The core of the Microdata->RDF conversion, a more or less verbatim implementation of the U{W3C IG Note<http://www.w3.org/TR/microdata-rdf/>}. Because the implementation was also used to check the note itself, it tries to be fairly close to the text.

@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyMicrodata.microdata.Evaluation_Context[source]

Evaluation context structure. See Section 4.1 of the U{W3C IG Note<http://www.w3.org/TR/microdata-rdf/>} for the details.

@ivar current_type: an absolute URL for the current type, used when an item does not contain an item type
@ivar memory: mapping from items to RDF subjects
@type memory: dictionary
@ivar current_name: an absolute URL for the in-scope name, used for generating URIs for properties of items without an item type
@ivar current_vocabulary: an absolute URL for the current vocabulary, from the registry

__init__()[source]
__module__ = 'rdflib.plugins.parsers.pyMicrodata.microdata'
__str__()[source]
get_memory(item)[source]

Get the memory content (i.e., RDF subject) for 'item', or None if not stored yet.
@param item: an 'item', in microdata terminology
@type item: DOM Element Node
@return: None, or an RDF Subject (URIRef or BNode)

new_copy(itype)[source]

During the generation algorithm, a new copy of the current context has to be made with a new current type.

At the moment, the content of memory is copied, i.e., a fresh dictionary is created and the content copied over. It is not clear whether that is necessary; a simple reference may be enough.
@param itype: an absolute URL for the current type
@return: a new evaluation context instance

set_memory(item, subject)[source]

Set the memory content, i.e., the subject, for 'item'.
@param item: an 'item', in microdata terminology
@type item: DOM Element Node
@param subject: RDF Subject
@type subject: URIRef or Blank Node
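The copy semantics described for new_copy can be illustrated with a small stand-in class. This is a sketch, not the package's code: the class name `EvaluationContextSketch` is hypothetical, and carrying `current_name` and `current_vocabulary` over to the copy is an assumption. Plain strings stand in for DOM nodes and RDF subjects.

```python
class EvaluationContextSketch(object):
    """Stand-in for the documented evaluation context (illustration only)."""

    def __init__(self):
        self.current_type = None
        self.memory = {}            # maps items to RDF subjects
        self.current_name = None
        self.current_vocabulary = None

    def get_memory(self, item):
        # Returns None when the item has not been stored yet.
        return self.memory.get(item)

    def set_memory(self, item, subject):
        self.memory[item] = subject

    def new_copy(self, itype):
        copy = EvaluationContextSketch()
        copy.current_type = itype
        # Assumption: the other fields are carried over unchanged.
        copy.current_name = self.current_name
        copy.current_vocabulary = self.current_vocabulary
        # A fresh dictionary: later additions to the copy do not leak back.
        copy.memory = dict(self.memory)
        return copy
```

Because the dictionary is copied rather than shared, a subject stored in the child context is not visible from the parent context.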

class rdflib.plugins.parsers.pyMicrodata.microdata.Microdata(document, base=None)[source]

This class encapsulates methods that are defined by the U{microdata spec<http://dev.w3.org/html5/md/Overview.html>}, as opposed to the RDF conversion note.

@ivar document: top of the DOM tree, as returned by the HTML5 parser
@ivar base: the base URI of the DOM tree, either set from the outside or via a <base> element

__init__(document, base=None)[source]

@param document: top of the DOM tree, as returned by the HTML5 parser
@param base: the base URI of the DOM tree, either set from the outside or via a <base> element

__module__ = 'rdflib.plugins.parsers.pyMicrodata.microdata'
getElementById(id)[source]

This is a method defined for DOM 2 HTML, but the HTML5 parser does not seem to define it.
@param id: value of an @id attribute to look for
@return: array of nodes whose @id attribute matches C{id} (formally, there should be only one)
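A lookup of this kind can be sketched as a plain tree walk. The example below uses the standard library's `xml.dom.minidom` instead of the html5lib DOM the package works with, and the function name `get_elements_by_id` is hypothetical.

```python
from xml.dom import minidom


def get_elements_by_id(document, id_value):
    """Return all elements whose id attribute equals id_value (pre-order)."""
    matches = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            if node.hasAttribute("id") and node.getAttribute("id") == id_value:
                matches.append(node)
        for child in node.childNodes:
            walk(child)

    walk(document.documentElement)
    return matches
```

Returning a list rather than a single node matches the documented return value, which tolerates documents that reuse an id.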

get_item_properties(item)[source]

Collect the item's properties, i.e., all DOM descendant nodes with @itemprop, stopping where the subtree hits another @itemscope. Elements referenced through @itemref are also added at this point.

@param item: current item
@type item: DOM Node
@return: array of items, i.e., DOM Nodes
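The traversal rule (collect descendants carrying @itemprop, but do not descend below a nested @itemscope) can be sketched as follows. The sketch uses `xml.dom.minidom` rather than the html5lib DOM, the function name `collect_properties` is hypothetical, and @itemref handling is deliberately left out.

```python
from xml.dom import minidom


def collect_properties(item):
    """Collect descendants with itemprop, stopping at nested itemscope."""
    props = []

    def walk(node):
        for child in node.childNodes:
            if child.nodeType != child.ELEMENT_NODE:
                continue
            if child.hasAttribute("itemprop"):
                props.append(child)
            # A nested item keeps its own properties: do not descend into it.
            if not child.hasAttribute("itemscope"):
                walk(child)

    walk(item)
    return props
```

Note that an element with both @itemprop and @itemscope is itself collected (its value is the nested item), while the properties inside it are not.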

get_top_level_items()[source]

A top level item is an element that has @itemscope set, but no @itemprop. Top level items have to be collected in pre-order, depth-first fashion.

@return: list of items (i.e., DOM Nodes)
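A pre-order, depth-first collection can be sketched as below. The sketch assumes the microdata specification's definition of a top level item (an element with @itemscope and without @itemprop), uses `xml.dom.minidom` rather than the html5lib DOM, and the function name `top_level_items` is hypothetical.

```python
from xml.dom import minidom


def top_level_items(document):
    """Return elements with itemscope and without itemprop, in pre-order."""
    items = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            if node.hasAttribute("itemscope") and not node.hasAttribute("itemprop"):
                items.append(node)
        # Visiting the node before its children gives pre-order.
        for child in node.childNodes:
            walk(child)

    walk(document.documentElement)
    return items
```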

class rdflib.plugins.parsers.pyMicrodata.microdata.MicrodataConversion(document, graph, base=None, vocab_expansion=False, vocab_cache=True)[source]

Bases: rdflib.plugins.parsers.pyMicrodata.microdata.Microdata

Top level class encapsulating the conversion algorithms as described in the W3C note.

@ivar graph: an RDF graph; an RDFLib Graph
@type graph: RDFLib Graph
@ivar document: top of the DOM tree, as returned by the HTML5 parser
@ivar ns_md: the Namespace for the microdata vocabulary
@ivar base: the base of the DOM tree, either set from the outside or via a <base> element

__init__(document, graph, base=None, vocab_expansion=False, vocab_cache=True)[source]

@param graph: an RDF graph; an RDFLib Graph
@type graph: RDFLib Graph
@param document: top of the DOM tree, as returned by the HTML5 parser
@keyword base: the base of the DOM tree, either set from the outside or via a <base> element
@keyword vocab_expansion: whether vocab expansion should be performed or not
@type vocab_expansion: Boolean
@keyword vocab_cache: if vocabulary expansion is done, then perform caching of the vocabulary data
@type vocab_cache: Boolean

__module__ = 'rdflib.plugins.parsers.pyMicrodata.microdata'
convert()[source]

Top level entry to convert and generate all the triples. It finds the top level items, and generates triples for each of them; additionally, it generates a top level entry point to the items from base in the form of an RDF list.

generate_predicate_URI(name, context)[source]

Generate a full URI for a predicate, using the type, the vocabulary, etc.

For details, see Section 4.4 of the W3C Note.
@param name: name of the property, i.e., what appears in @itemprop
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}

generate_property_values(subject, predicate, objects, context)[source]

Generate the property values for a specific subject and predicate. The context should specify whether the objects should be added as an RDF list or as individual triples.

@param subject: RDF subject
@type subject: RDFLib Node (URIRef or blank node)
@param predicate: RDF predicate
@type predicate: RDFLib URIRef
@param objects: RDF objects
@type objects: list of RDFLib nodes (URIRefs, Blank Nodes, or literals)
@param context: evaluation context
@type context: L{Evaluation_Context}
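The difference between the two ways of emitting values can be sketched with plain tuples standing in for RDFLib triples. The function names are hypothetical, and passing the method ('list' or 'unordered', the two ValueMethod values) directly as an argument is a simplification: in the package this choice is derived from the context.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_list(triples, vals, labels):
    """Append rdf:first/rdf:rest triples for vals; return the list head."""
    nodes = ["_:b%d" % next(labels) for _ in vals] + [RDF + "nil"]
    for i, val in enumerate(vals):
        triples.append((nodes[i], RDF + "first", val))
        triples.append((nodes[i], RDF + "rest", nodes[i + 1]))
    return nodes[0]


def property_value_triples(subject, predicate, objects, method):
    """Return triples for the objects, as one RDF list or one triple each."""
    triples = []
    if method == "list":
        # One triple pointing at the head of an ordered RDF list.
        head = build_list(triples, objects, count())
        triples.append((subject, predicate, head))
    else:
        # 'unordered': one triple per object, order not preserved in RDF.
        for obj in objects:
            triples.append((subject, predicate, obj))
    return triples
```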

generate_triples(item, context)[source]

Generate the triples for a specific item. See the W3C Note for the details.

@param item: the DOM Node for the specific item
@type item: DOM Node
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}
@return: a URIRef or a BNode for the (RDF) subject

get_property_value(node, context)[source]

Generate an RDF object, i.e., the value of a property. Note that if this element contains an @itemscope, then a recursive call to L{MicrodataConversion.generate_triples} is made and the return value of that method (i.e., the subject for the corresponding item) is returned as the object.

Otherwise, either a URIRef is created (for <a>, <img>, etc., elements) or a Literal; the latter gets a time-related datatype for the <time> element.

@param node: the DOM Node for which the property values should be generated
@type node: DOM Node
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}
@return: an RDF resource (URIRef, BNode, or Literal)

class rdflib.plugins.parsers.pyMicrodata.microdata.PropertySchemes[source]
__module__ = 'rdflib.plugins.parsers.pyMicrodata.microdata'
contextual = 'contextual'
vocabulary = 'vocabulary'
class rdflib.plugins.parsers.pyMicrodata.microdata.ValueMethod[source]
__module__ = 'rdflib.plugins.parsers.pyMicrodata.microdata'
list = 'list'
unordered = 'unordered'

registry Module

Hardcoded version of the current microdata->RDF registry. There is also a local registry to include some test cases. Finally, there is a local dictionary for prefix mapping for the registry items; these are the preferred prefixes for those vocabularies, and are used to make the output nicer.

@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

utils Module

Various utilities for pyMicrodata

@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyMicrodata.utils.URIOpener(name)[source]

A wrapper around the urllib2 method to open a resource. Beyond accessing the data itself, the class sets the content location. The class also adds an accept header to the outgoing request, namely text/html and application/xhtml+xml (unless set explicitly by the caller).

@ivar data: the real data, i.e., a file-like object
@ivar headers: the return headers as sent back by the server
@ivar location: the real location of the data (i.e., after possible redirection and content negotiation)

CONTENT_LOCATION = 'Content-Location'
__init__(name)[source]

@param name: URL to be opened
@keyword additional_headers: additional HTTP request headers to be added to the call
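The two behaviours described for this class (a default Accept header unless the caller sets one, and a location taken from the response headers) can be sketched without any network access. The sketch uses Python 3's `urllib.request`, the successor of urllib2; the function names and the exact Accept string are assumptions, as is resolving Content-Location against the requested URL.

```python
from urllib.parse import urljoin
from urllib.request import Request

# Assumed default value; the documentation only names the two media types.
DEFAULT_ACCEPT = "text/html, application/xhtml+xml"
CONTENT_LOCATION = "Content-Location"


def build_request(name, additional_headers=None):
    """Build a request, adding the default Accept header if none is given."""
    headers = dict(additional_headers or {})
    if not any(key.lower() == "accept" for key in headers):
        headers["Accept"] = DEFAULT_ACCEPT
    return Request(name, headers=headers)


def real_location(requested_url, response_headers):
    """Prefer the Content-Location response header over the requested URL."""
    if CONTENT_LOCATION in response_headers:
        return urljoin(requested_url, response_headers[CONTENT_LOCATION])
    return requested_url
```

Opening the request (for example with `urllib.request.urlopen`) is left out on purpose so the sketch stays runnable offline.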

__module__ = 'rdflib.plugins.parsers.pyMicrodata.utils'
rdflib.plugins.parsers.pyMicrodata.utils.fragment_escape(name)[source]
rdflib.plugins.parsers.pyMicrodata.utils.generate_RDF_collection(graph, vals)[source]

Generate an RDF List from vals and return the head of the list.
@param graph: RDF graph
@type graph: RDFLib Graph
@param vals: array of RDF Resources
@return: head of the List (an RDF Resource)
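The structure of an RDF List (a chain of rdf:first/rdf:rest pairs ending in rdf:nil) can be sketched with plain tuples and strings instead of an RDFLib Graph and BNodes. The function name `generate_collection_sketch` and the blank node labels are hypothetical; returning rdf:nil for an empty input is an assumption about the empty case.

```python
from itertools import count

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_labels = count()  # source of fresh blank node labels for this sketch


def generate_collection_sketch(triples, vals):
    """Append list triples for vals to triples and return the list head."""
    # One fresh node per value, followed by rdf:nil as the terminator.
    heads = ["_:l%d" % next(_labels) for _ in vals] + [RDF + "nil"]
    for i, val in enumerate(vals):
        triples.append((heads[i], RDF + "first", val))
        triples.append((heads[i], RDF + "rest", heads[i + 1]))
    return heads[0]
```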

rdflib.plugins.parsers.pyMicrodata.utils.generate_URI(base, v)[source]

Generate an (absolute) URI: if v is a fragment, it is combined with base; otherwise v is returned as is.
@param base: absolute URI for base
@param v: relative or absolute URI
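The resolution step can be sketched with the standard library. This is not the package's code: the function names are hypothetical, the absoluteness test (presence of a scheme) is an assumption, and `urljoin` resolves any relative reference, not only fragments.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(uri):
    """Treat a URI as absolute when it carries a scheme (assumption)."""
    return bool(urlparse(uri).scheme)


def resolve(base, v):
    """Return v unchanged if absolute, otherwise resolve it against base."""
    if is_absolute(v):
        return v
    return urljoin(base, v)
```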

rdflib.plugins.parsers.pyMicrodata.utils.get_Literal(Pnode)[source]

Get (recursively) the full text from a DOM Node.

@param Pnode: DOM Node
@return: string
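Recursive text extraction of this kind can be sketched as follows, using `xml.dom.minidom` rather than the html5lib DOM. The function name `full_text` is hypothetical; comments and other non-text, non-element nodes are skipped in this sketch.

```python
from xml.dom import minidom


def full_text(node):
    """Concatenate the text of node and all its descendants, in order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(full_text(child))
    return "".join(parts)
```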

rdflib.plugins.parsers.pyMicrodata.utils.get_lang(node)[source]
rdflib.plugins.parsers.pyMicrodata.utils.get_lang_from_hierarchy(document, node)[source]
rdflib.plugins.parsers.pyMicrodata.utils.get_time_type(string)[source]

Check whether the string conforms to one of the accepted time-related datatypes, and return that datatype if so.
@param string: the attribute value to be checked
@return: a datatype URI or None
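A datatype check of this kind can be sketched with regular expressions. Which datatypes the package actually accepts is not stated in this reference, so the selection below (five XML Schema types, with simplified patterns that do not validate ranges such as month 13) and the function name `time_type_sketch` are assumptions.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
_TZ = r"(Z|[+-]\d{2}:\d{2})?"

# Ordered from most to least specific; patterns are simplified lexical checks.
_PATTERNS = [
    (re.compile(r"^-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ + r"$"), XSD + "dateTime"),
    (re.compile(r"^-?\d{4,}-\d{2}-\d{2}" + _TZ + r"$"), XSD + "date"),
    (re.compile(r"^\d{2}:\d{2}:\d{2}(\.\d+)?" + _TZ + r"$"), XSD + "time"),
    (re.compile(r"^-?\d{4,}-\d{2}$"), XSD + "gYearMonth"),
    (re.compile(r"^-?\d{4,}$"), XSD + "gYear"),
]


def time_type_sketch(string):
    """Return the first matching datatype URI, or None if nothing matches."""
    for pattern, datatype in _PATTERNS:
        if pattern.match(string):
            return datatype
    return None
```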

rdflib.plugins.parsers.pyMicrodata.utils.is_absolute_URI(uri)[source]