pyMicrodata Package¶
pyMicrodata
Package¶
This module implements the microdata->RDF algorithm, as documented by the U{W3C Semantic Web Interest Group Note<http://www.w3.org/TR/2012/NOTE-microdata-rdf-20120308/>}.
The module can be used via a stand-alone script (an example is part of the distribution) or bound to a CGI script as a Web Service. An example CGI script is also added to the distribution. Both the local script and the distribution may have to be adapted to local circumstances.
(Simple) Usage¶
- From a Python file, expecting Turtle output::

    from pyMicrodata import pyMicrodata
    print(pyMicrodata().rdf_from_source('filename', outputFormat='turtle'))

- Other output formats are also possible. For example, to produce RDF/XML output, one could use::

    from pyMicrodata import pyMicrodata
    print(pyMicrodata().rdf_from_source('filename', outputFormat='pretty-xml'))

- It is also possible to embed microdata processing in another program. For example, using::

    from pyMicrodata import pyMicrodata
    graph = pyMicrodata().graph_from_source('filename')

returns an RDFLib.Graph object instead of a serialization thereof. See the description of the L{pyMicrodata class<pyMicrodata.pyMicrodata>} for details of the other entry points.
There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.
Return formats¶
By default, the output format for the graph is RDF/XML. At present, the following formats are available (with the corresponding key to be used in the package entry points):
- “xml”: U{RDF/XML<http://www.w3.org/TR/rdf-syntax-grammar/>}
- “turtle”: U{Turtle<http://www.w3.org/TR/turtle/>}
- “nt”: U{N-Triples<http://www.w3.org/TR/rdf-testcases/#ntriples>}
- “json”: U{JSON-LD<http://json-ld.org/spec/latest/json-ld-syntax/>}
@summary: Microdata parser (distiller)
@requires: Python version 2.5 or up
@requires: U{RDFLib<http://rdflib.net>}
@requires: U{html5lib<http://code.google.com/p/html5lib/>} for the HTML5 parsing; note possible dependencies on Python’s version on the project’s web site
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
@copyright: W3C
-
exception
rdflib.plugins.parsers.pyMicrodata.
HTTPError
(http_msg, http_code)[source]¶ Bases:
rdflib.plugins.parsers.pyMicrodata.MicrodataError
Raised when HTTP problems are detected. It does not add any new functionality to the Exception class.
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata'¶
-
-
exception
rdflib.plugins.parsers.pyMicrodata.
MicrodataError
(msg)[source]¶ Bases:
exceptions.Exception
Superclass of the exceptions representing error conditions raised during microdata processing. It does not add any new functionality to the Exception class.
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata'¶
-
__weakref__
¶ list of weak references to the object (if defined)
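Because HTTPError derives from MicrodataError, a caller can handle both with a single except clause. The sketch below uses stand-in classes that mirror only the documented constructor signatures and base classes. The attribute names (msg, http_code) and the helper describe are assumptions made for illustration, not the module's own code.

```python
class MicrodataError(Exception):
    """Stand-in for the documented base exception, constructed from (msg)."""

    def __init__(self, msg):
        super().__init__(msg)
        self.msg = msg  # attribute name is an assumption


class HTTPError(MicrodataError):
    """Stand-in for the documented HTTP exception, constructed from (http_msg, http_code)."""

    def __init__(self, http_msg, http_code):
        super().__init__(http_msg)
        self.http_code = http_code  # attribute name is an assumption


def describe(exc):
    """Hypothetical helper: turn either exception into a short report string."""
    if isinstance(exc, HTTPError):
        return "HTTP problem %s: %s" % (exc.http_code, exc.msg)
    return "Microdata problem: %s" % exc.msg


try:
    raise HTTPError("Not Found", 404)
except MicrodataError as err:  # the base class also catches HTTPError
    report = describe(err)
```

Catching the base class first and inspecting the subclass afterwards keeps the calling code short when HTTP failures need only slightly different handling.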
-
-
rdflib.plugins.parsers.pyMicrodata.
processURI
(uri, outputFormat, form)[source]¶ The standard processing of a microdata URI with options passed in a form, i.e., as an entry point from a CGI call.
The call accepts extra form options (e.g., HTTP GET options) as follows:
@param uri: URI to access. Note that the “text:” and “uploaded:” values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
@param outputFormat: serialization format, as understood by RDFLib. Note that though “turtle” is a possible parameter value, some versions of the RDFLib Turtle serializer do odd (though legal) things with namespaces, defining unusual and unwanted prefixes.
@param form: extra call options (from the CGI call) to set up the local options (if any)
@type form: cgi FieldStorage instance
@return: serialized graph
@rtype: string
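The special handling of the “text:” and “uploaded:” values can be pictured as a small dispatch step that runs before parsing. The following is a minimal sketch, not the module's code: the function select_input is hypothetical, a plain dict stands in for the cgi FieldStorage instance, and the keys "text" and "uploaded" are assumed names.

```python
import io


def select_input(uri, form):
    """Return (kind, source) for the three documented cases.

    `form` is a plain dict standing in for a FieldStorage object;
    the keys 'text' and 'uploaded' are hypothetical.
    """
    if uri == "text:":
        # Textual input: wrap the submitted string in a file-like object.
        return "text", io.StringIO(form["text"])
    if uri == "uploaded:":
        # Uploaded file: the form already gives access to a file-like object.
        return "uploaded", form["uploaded"]
    # Anything else is treated as an ordinary URI to be dereferenced later.
    return "uri", uri


kind, source = select_input("text:", {"text": "<p itemscope>hello</p>"})
```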
-
class
rdflib.plugins.parsers.pyMicrodata.
pyMicrodata
(base='', vocab_expansion=False, vocab_cache=True)[source]¶ Main processing class for the distiller.
@ivar base: the base value for processing
@ivar http_status: HTTP status, to be returned when the package is used via a CGI entry. Initially set to 200; may be modified by exception handlers
-
__init__
(base='', vocab_expansion=False, vocab_cache=True)[source]¶
@keyword base: URI for the default “base” value (usually the URI of the file to be processed)
@keyword vocab_expansion: whether vocab expansion should be performed or not
@type vocab_expansion: Boolean
@keyword vocab_cache: if vocabulary expansion is done, then perform caching of the vocabulary data
@type vocab_cache: Boolean
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata'¶
-
graph_from_DOM
(dom, graph=None)[source]¶ Extract the RDF Graph from a DOM tree.
@param dom: a DOM Node element, the top level entry node for the whole tree (to be clear, dom.documentElement is used to initiate processing)
@keyword graph: an RDF Graph (if None, then a new one is created)
@type graph: rdflib Graph instance
@return: an RDF Graph
@rtype: rdflib Graph instance
-
graph_from_source
(name, graph=None, rdfOutput=False)[source]¶ Extract an RDF graph from a microdata source. The source is parsed, the RDF extracted, and the RDF Graph is returned. This is a front-end to the L{pyMicrodata.graph_from_DOM} method.
@param name: a URI, a file name, or a file-like object
@return: an RDF Graph
@rtype: rdflib Graph instance
-
rdf_from_source
(name, outputFormat='pretty-xml', rdfOutput=False)[source]¶ Extract an RDF graph from a microdata source and serialize it. The source is parsed, the RDF extracted, and serialization is done in the specified format.
@param name: a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of “turtle”, “n3”, “xml”, “pretty-xml”, “nt”. “xml” and “pretty-xml”, as well as “turtle” and “n3” are synonyms.
@return: a serialized RDF Graph
@rtype: string
-
rdf_from_sources
(names, outputFormat='pretty-xml', rdfOutput=False)[source]¶ Extract an RDF graph from a list of microdata sources and serialize them in one graph. The sources are parsed, the RDF extracted, and serialization is done in the specified format.
@param names: list of sources, each can be a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of “turtle”, “n3”, “xml”, “pretty-xml”, “nt”. “xml” and “pretty-xml”, as well as “turtle” and “n3” are synonyms.
@return: a serialized RDF Graph
@rtype: string
-
microdata
Module¶
The core of the Microdata->RDF conversion, a more or less verbatim implementation of the U{W3C IG Note<http://www.w3.org/TR/microdata-rdf/>}. Because the implementation was also used to check the note itself, it tries to be fairly close to the text.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
-
class
rdflib.plugins.parsers.pyMicrodata.microdata.
Evaluation_Context
[source]¶ Evaluation context structure. See Section 4.1 of the U{W3C IG Note<http://www.w3.org/TR/microdata-rdf/>} for the details.
@ivar current_type: an absolute URL for the current type, used when an item does not contain an item type
@ivar memory: mapping from items to RDF subjects
@type memory: dictionary
@ivar current_name: an absolute URL for the in-scope name, used for generating URIs for properties of items without an item type
@ivar current_vocabulary: an absolute URL for the current vocabulary, from the registry
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata.microdata'¶
-
get_memory
(item)[source]¶ Get the memory content (i.e., the RDF subject) for ‘item’, or None if not stored yet.
@param item: an ‘item’, in microdata terminology
@type item: DOM Element Node
@return: None, or an RDF Subject (URIRef or BNode)
-
new_copy
(itype)[source]¶ During the generation algorithm, a new copy of the current context has to be made with a new current type.
At the moment, the content of memory is copied, i.e., a fresh dictionary is created and the content copied over. It is not clear whether that is necessary; a simple reference may be enough.
@param itype: an absolute URL for the current type
@return: a new evaluation context instance
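The difference between copying the memory dictionary and sharing a reference can be seen in a small stand-in class. This is only a sketch based on the attributes documented above; the class name EvaluationContextSketch and the set_memory helper are hypothetical, and plain strings stand in for DOM nodes and RDF subjects.

```python
class EvaluationContextSketch:
    """Illustrative stand-in holding the four documented attributes."""

    def __init__(self):
        self.current_type = None
        self.memory = {}
        self.current_name = None
        self.current_vocabulary = None

    def get_memory(self, item):
        # None when the item has no RDF subject stored yet.
        return self.memory.get(item)

    def set_memory(self, item, subject):  # hypothetical helper
        self.memory[item] = subject

    def new_copy(self, itype):
        copy = EvaluationContextSketch()
        copy.current_type = itype
        copy.current_name = self.current_name
        copy.current_vocabulary = self.current_vocabulary
        # Fresh dictionary: later additions in the copy stay out of the parent.
        copy.memory = dict(self.memory)
        return copy


parent = EvaluationContextSketch()
parent.set_memory("item-1", "_:subject1")
child = parent.new_copy("http://schema.org/Person")
child.set_memory("item-2", "_:subject2")
```

With a shared reference instead of a copy, "item-2" would also become visible through parent, which is the trade-off the note above leaves open.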
-
-
class
rdflib.plugins.parsers.pyMicrodata.microdata.
Microdata
(document, base=None)[source]¶ This class encapsulates methods that are defined by the U{microdata spec<http://dev.w3.org/html5/md/Overview.html>}, as opposed to the RDF conversion note.
@ivar document: top of the DOM tree, as returned by the HTML5 parser
@ivar base: the base URI of the DOM tree, either set from the outside or via a @base element
-
__init__
(document, base=None)[source]¶
@param document: top of the DOM tree, as returned by the HTML5 parser
@param base: the base URI of the DOM tree, either set from the outside or via a @base element
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata.microdata'¶
-
getElementById
(id)[source]¶ This is a method defined for DOM 2 HTML, but the HTML5 parser does not seem to define it. Oh well...
@param id: value of an @id attribute to look for
@return: array of nodes whose @id attribute matches C{id} (formally, there should be only one)
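A lookup of this kind can be written as a plain recursive walk over the DOM tree. The sketch below uses xml.dom.minidom from the standard library rather than the html5lib tree the module works on, and the function name find_by_id is hypothetical; it illustrates the idea rather than the module's implementation.

```python
from xml.dom import minidom


def find_by_id(document, wanted):
    """Collect every element whose id attribute equals `wanted`."""
    matches = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            if node.hasAttribute("id") and node.getAttribute("id") == wanted:
                matches.append(node)
        for child in node.childNodes:
            walk(child)

    walk(document)
    return matches


doc = minidom.parseString(
    '<html><body><p id="a">first</p><p id="b">second</p></body></html>'
)
found = find_by_id(doc, "b")
```

Returning a list rather than a single node matches the documented return value, which tolerates documents that reuse an id.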
-
-
class
rdflib.plugins.parsers.pyMicrodata.microdata.
MicrodataConversion
(document, graph, base=None, vocab_expansion=False, vocab_cache=True)[source]¶ Bases:
rdflib.plugins.parsers.pyMicrodata.microdata.Microdata
Top level class encapsulating the conversion algorithms as described in the W3C note.
@ivar graph: an RDF graph; an RDFLib Graph
@type graph: RDFLib Graph
@ivar document: top of the DOM tree, as returned by the HTML5 parser
@ivar ns_md: the Namespace for the microdata vocabulary
@ivar base: the base of the DOM tree, either set from the outside or via a @base element
-
__init__
(document, graph, base=None, vocab_expansion=False, vocab_cache=True)[source]¶
@param graph: an RDF graph; an RDFLib Graph
@type graph: RDFLib Graph
@param document: top of the DOM tree, as returned by the HTML5 parser
@keyword base: the base of the DOM tree, either set from the outside or via a @base element
@keyword vocab_expansion: whether vocab expansion should be performed or not
@type vocab_expansion: Boolean
@keyword vocab_cache: if vocabulary expansion is done, then perform caching of the vocabulary data
@type vocab_cache: Boolean
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata.microdata'¶
-
convert
()[source]¶ Top level entry to convert and generate all the triples. It finds the top level items, and generates triples for each of them; additionally, it generates a top level entry point to the items from base in the form of an RDF list.
-
generate_predicate_URI
(name, context)[source]¶ Generate a full URI for a predicate, using the type, the vocabulary, etc.
For details of this entry, see Section 4.4.
@param name: name of the property, i.e., what appears in @itemprop
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}
-
generate_property_values
(subject, predicate, objects, context)[source]¶ Generate the property values for a specific subject and predicate. The context should specify whether the objects are added as an RDF list or as individual triples.
@param subject: RDF subject
@type subject: RDFLib Node (URIRef or blank node)
@param predicate: RDF predicate
@type predicate: RDFLib URIRef
@param objects: RDF objects
@type objects: list of RDFLib nodes (URIRefs, Blank Nodes, or literals)
@param context: evaluation context
@type context: L{Evaluation_Context}
-
generate_triples
(item, context)[source]¶ Generate the triples for a specific item. See the W3C Note for the details.
@param item: the DOM Node for the specific item
@type item: DOM Node
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}
@return: a URIRef or a BNode for the (RDF) subject
-
get_property_value
(node, context)[source]¶ Generate an RDF object, i.e., the value of a property. Note that if this element contains an @itemscope, then a recursive call to L{MicrodataConversion.generate_triples} is made and the return value of that method (i.e., the subject for the corresponding item) is returned as the object.
Otherwise, either a URIRef is created (for <a>, <img>, etc., elements) or a Literal; the latter gets a time-related datatype for the <time> element.
@param node: the DOM Node for which the property values should be generated
@type node: DOM Node
@param context: an instance of an evaluation context
@type context: L{Evaluation_Context}
@return: an RDF resource (URIRef, BNode, or Literal)
-
registry
Module¶
Hardcoded version of the current microdata->RDF registry. There is also a local registry to include some test cases. Finally, there is a local dictionary for prefix mapping for the registry items; these are the preferred prefixes for those vocabularies, and are used to make the output nicer.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
utils
Module¶
Various utilities for pyMicrodata
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
-
class
rdflib.plugins.parsers.pyMicrodata.utils.
URIOpener
(name)[source]¶ A wrapper around the urllib2 method to open a resource. Beyond accessing the data itself, the class sets the content location. The class also adds an accept header to the outgoing request, namely text/html and application/xhtml+xml (unless set explicitly by the caller).
@ivar data: the real data, i.e., a file-like object
@ivar headers: the return headers as sent back by the server
@ivar location: the real location of the data (i.e., after possible redirection and content negotiation)
-
CONTENT_LOCATION
= 'Content-Location'¶
-
__init__
(name)[source]¶
@param name: URL to be opened
@keyword additional_headers: additional HTTP request headers to be added to the call
-
__module__
= 'rdflib.plugins.parsers.pyMicrodata.utils'¶
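The Accept header behaviour described for URIOpener can be sketched without touching the network. The example below uses urllib.request (the Python 3 successor of urllib2) only to build a request object; the function build_request and the exact header string are assumptions for illustration, not the class's own code.

```python
from urllib.request import Request


def build_request(url, additional_headers=None):
    """Build a request, adding a default Accept header unless the caller set one."""
    headers = dict(additional_headers or {})
    if not any(key.lower() == "accept" for key in headers):
        # Assumed formatting of the two media types named in the description.
        headers["Accept"] = "text/html, application/xhtml+xml"
    return Request(url, headers=headers)


default_request = build_request("http://example.org/page.html")
custom_request = build_request(
    "http://example.org/page.html", {"Accept": "text/plain"}
)
```

Only the request is constructed here; opening it and reading the Content-Location header of the response would require a live server.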
-
-
rdflib.plugins.parsers.pyMicrodata.utils.
generate_RDF_collection
(graph, vals)[source]¶ Generate an RDF List from vals; returns the head of the list.
@param graph: RDF graph
@type graph: RDFLib Graph
@param vals: array of RDF Resources
@return: head of the List (an RDF Resource)
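An RDF list is a chain of cells, each carrying an rdf:first value and an rdf:rest link, terminated by rdf:nil. The sketch below shows that structure with plain tuples and strings instead of an RDFLib Graph; build_collection and new_bnode are hypothetical names, and the blank node labels are made up for the example.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_counter = itertools.count()


def new_bnode():
    """Hypothetical blank node generator producing labels such as '_:b0'."""
    return "_:b%d" % next(_counter)


def build_collection(triples, vals):
    """Append the list triples for `vals` to `triples`; return the list head."""
    if not vals:
        return RDF + "nil"  # an empty list is just rdf:nil
    cells = [new_bnode() for _ in vals]
    for index, (cell, value) in enumerate(zip(cells, vals)):
        triples.append((cell, RDF + "first", value))
        is_last = index + 1 == len(cells)
        rest = RDF + "nil" if is_last else cells[index + 1]
        triples.append((cell, RDF + "rest", rest))
    return cells[0]


triples = []
head = build_collection(triples, ["http://example.org/a", "http://example.org/b"])
```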
-
rdflib.plugins.parsers.pyMicrodata.utils.
generate_URI
(base, v)[source]¶ Generate an (absolute) URI; if v is a fragment, it is combined with base, otherwise the value is returned unchanged.
@param base: absolute URI for base
@param v: relative or absolute URI
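The same effect can be approximated with the standard library's URI resolution. This is a sketch that swaps in urllib.parse.urljoin rather than reproducing the module's logic; the function name resolve and the example.org URLs are placeholders.

```python
from urllib.parse import urljoin, urlsplit


def resolve(base, value):
    """Return `value` unchanged when it is absolute, else resolve it against `base`."""
    if urlsplit(value).scheme:
        return value
    return urljoin(base, value)


fragment = resolve("http://example.org/doc.html", "#me")
relative = resolve("http://example.org/dir/doc.html", "img.png")
absolute = resolve("http://example.org/doc.html", "http://other.example/x")
```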
-
rdflib.plugins.parsers.pyMicrodata.utils.
get_Literal
(Pnode)[source]¶ Get (recursively) the full text from a DOM Node.
@param Pnode: DOM Node
@return: string
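Collecting the full text of a node means concatenating its own text children with the text of every descendant element, in document order. A minimal sketch using xml.dom.minidom follows; the function name full_text is hypothetical and the module's own traversal may differ in detail.

```python
from xml.dom import minidom


def full_text(node):
    """Concatenate all text found below `node`, in document order."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(full_text(child))  # recurse into nested elements
    return "".join(parts)


doc = minidom.parseString("<p>Hello <b>big <i>wide</i></b> world</p>")
text = full_text(doc.documentElement)
```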