pyRdfa Package

RDFa 1.1 parser, also referred to as an “RDFa Distiller”. It is deployed, via a CGI front-end, on the U{W3C RDFa 1.1 Distiller page<http://www.w3.org/2012/pyRdfa/>}.

For details on RDFa, the reader should consult the U{RDFa Core 1.1<http://www.w3.org/TR/rdfa-core/>}, U{XHTML+RDFa1.1<http://www.w3.org/TR/2010/xhtml-rdfa>}, and the U{RDFa 1.1 Lite<http://www.w3.org/TR/rdfa-lite/>} documents. The U{RDFa 1.1 Primer<http://www.w3.org/TR/rdfa-primer/>} may also prove helpful.

This package can also be downloaded U{from GitHub<https://github.com/RDFLib/pyrdfa3>}. The distribution also includes the CGI front-end and a separate utility script to be run locally.

Note that this package is an updated version of a U{previous RDFa distiller<http://www.w3.org/2007/08/pyRdfa>} that was developed for RDFa 1.0. Although it reuses large portions of that code, it has been quite thoroughly rewritten, hence put in a completely different project. (The version numbering has been continued, though, to avoid any kind of misunderstandings. This version has version numbers “3.0.0” or higher.)

(Simple) Usage

From a Python file, expecting a Turtle output::

    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename'))

Other output formats are also possible. E.g., to produce RDF/XML output, one could use::

    from pyRdfa import pyRdfa
    print(pyRdfa().rdf_from_source('filename', outputFormat='pretty-xml'))

It is also possible to embed RDFa processing in another application. E.g., using::

    from pyRdfa import pyRdfa
    graph = pyRdfa().graph_from_source('filename')

returns an RDFLib.Graph object instead of a serialization thereof. See the description of the L{pyRdfa class<pyRdfa.pyRdfa>} for details on further possible entry points.

There is also, as part of this module, a L{separate entry for CGI calls<processURI>}.

Return (serialization) formats

The package relies on RDFLib and therefore, by default, on the serializers that come with the local RDFLib distribution. However, there have been some issues with the serializers of older RDFLib releases; also, some output formats, like JSON-LD, are not (yet) part of the standard RDFLib distribution. A companion package, called pyRdfaExtras, is part of the download and includes some of those extra serializers. The extra format (not part of the RDFLib core) is U{JSON-LD<http://json-ld.org/spec/latest/json-ld-syntax/>}, whose ‘key’ is ‘json’ when used in the ‘serialize’ method of an RDFLib graph.

Options

The package also implements some optional features that are not part of the RDFa recommendations. At the moment these are:

  • possibility for plain literals to be normalized in terms of white spaces. Default: false. (The RDFa specification requires keeping the white spaces and leaves it to applications to normalize them, if needed)
  • inclusion of embedded RDF: Turtle content may be enclosed in a C{script} element and typed as C{text/turtle}, U{defined by the RDF Working Group<http://www.w3.org/TR/turtle/>}. Alternatively, some XML dialects (e.g., SVG) allow the usage of RDF/XML as part of their core content to define metadata in RDF. In both cases pyRdfa parses the serialized RDF content and adds the resulting triples to the output Graph. Default: true.
  • extra, built-in transformers are executed on the DOM tree prior to RDFa processing (see below). These transformers can be provided by the end user.
Options are collected in an instance of the L{Options} class and may be passed to the processing functions as an extra argument. E.g., to allow the inclusion of embedded content::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(embedded_rdf=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))

See the description of the L{Options} class for the details.
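
To make the white space option more concrete, here is a minimal sketch of one plausible reading of “normalizing” a plain literal: runs of white space are collapsed into a single space and the ends are trimmed. The function name C{normalize_literal} is hypothetical and the exact rule is an assumption made for this example; the normalization performed by the package may differ in detail.

```python
def normalize_literal(text, space_preserve=True):
    """Return the literal unchanged, or with its white space collapsed.

    Hypothetical helper for illustration only; not part of pyRdfa.
    """
    if space_preserve:
        # Behaviour required by the RDFa specification: keep white space as is
        return text
    # str.split() without arguments splits on any run of white space
    return " ".join(text.split())
```

With C{space_preserve=False}, a literal spread over several indented source lines would come out as a single line of text.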

Host Languages

RDFa 1.1 Core is defined for generic XML; there are specific documents to describe how the generic specification is applied to XHTML and HTML5.

pyRdfa makes an automatic switch among these based on the content type of the source as returned by an HTTP request. The following are the possible host languages:

  • if the content type is C{text/html}, the content is HTML5
  • if the content type is C{application/xhtml+xml} I{and} the right DTD is used, the content is XHTML1
  • if the content type is C{application/xhtml+xml} and no or an unknown DTD is used, the content is XHTML5
  • if the content type is C{application/svg+xml}, the content is SVG
  • if the content type is C{application/atom+xml}, the content is Atom
  • if the content type is C{application/xml} or C{application/xxx+xml} (but ‘xxx’ is not ‘atom’ or ‘svg’), the content is XML

If local files are used, pyRdfa makes a guess on the content type based on the file name suffix: C{.html} is for HTML5, C{.xhtml} for XHTML1, C{.svg} for SVG, anything else is considered to be general XML. Finally, the content type may be set by the caller when initializing the L{pyRdfa class<pyRdfa.pyRdfa>}.
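
The selection rules above can be summarized in a short sketch. It is only an illustration: the function name C{guess_host_language}, the string labels it returns, and the C{known_dtd} flag are assumptions made for this example (the package keeps its own constants in the L{host} module), and Atom is given its own label here.

```python
import os

# Hypothetical labels for local files, keyed by file name suffix
SUFFIX_TO_HOST = {".html": "html5", ".xhtml": "xhtml1", ".svg": "svg"}


def guess_host_language(content_type=None, file_name=None, known_dtd=False):
    """Pick a host language label from a content type or a file name."""
    if content_type is None:
        # Local file: guess from the suffix, anything else is generic XML
        suffix = os.path.splitext(file_name or "")[1].lower()
        return SUFFIX_TO_HOST.get(suffix, "xml")
    # Drop parameters such as "; charset=utf-8"
    content_type = content_type.split(";")[0].strip().lower()
    if content_type == "text/html":
        return "html5"
    if content_type == "application/xhtml+xml":
        # The DTD decides between XHTML1 and XHTML5
        return "xhtml1" if known_dtd else "xhtml5"
    if content_type == "application/svg+xml":
        return "svg"
    if content_type == "application/atom+xml":
        return "atom"
    # application/xml, application/xxx+xml, and unrecognized types
    return "xml"
```

A caller that already knows the content type can skip this guess entirely by passing C{media_type} when creating the L{pyRdfa class<pyRdfa.pyRdfa>} instance.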

Beyond the differences described in the RDFa specification, the main difference is the parser used to parse the source. In the case of HTML5, pyRdfa uses an U{HTML5 parser<http://code.google.com/p/html5lib/>}; for all other cases the simple XML parser, part of the core Python environment, is used. This may be significant in the case of erroneous sources: indeed, the HTML5 parser may make adjustments to the DOM tree before handing it over to the distiller. Furthermore, SVG is also recognized as a type that allows embedded RDF in the form of RDF/XML.

See the variables in the L{host} module if a new host language is added to the system. The current host language information is available for transformers via the option argument, too, and can be used to control the effect of the transformer.

Vocabularies

RDFa 1.1 has the notion of vocabulary files (using the C{@vocab} attribute) that may be used to expand the generated RDF graph. Expansion is based on some very simple RDF Schema and OWL statements on sub-properties and sub-classes, and equivalences.

pyRdfa implements this feature, although it does not do this by default. The extra C{vocab_expansion} parameter should be used for this extra step, for example::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    options = Options(vocab_expansion=True)
    print(pyRdfa(options=options).rdf_from_source('filename'))

The triples in the vocabulary files themselves (i.e., the small ontology in RDF Schema and OWL) are removed from the result, leaving the inferred property and type relationships only (additionally to the “core” RDF content).
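
The idea of the expansion can be sketched on plain tuples, without any RDF library. This is a simplified illustration, not the package's implementation: it only handles sub-property and sub-class statements (equivalences are left out), triples are C{(subject, predicate, object)} string tuples, and the name C{expand} is hypothetical.

```python
RDF_TYPE = "rdf:type"
SUB_PROPERTY = "rdfs:subPropertyOf"
SUB_CLASS = "rdfs:subClassOf"


def expand(data, vocab):
    """Return the data triples plus the triples inferred from the vocabulary.

    The vocabulary triples themselves are never copied into the result.
    """
    result = set(data)
    changed = True
    while changed:  # repeat until no new triple appears
        changed = False
        for (s, p, o) in list(result):
            for (vs, vp, vo) in vocab:
                new = None
                if vp == SUB_PROPERTY and p == vs:
                    # s p o, p subPropertyOf q  =>  s q o
                    new = (s, vo, o)
                elif vp == SUB_CLASS and p == RDF_TYPE and o == vs:
                    # s type C, C subClassOf D  =>  s type D
                    new = (s, RDF_TYPE, vo)
                if new is not None and new not in result:
                    result.add(new)
                    changed = True
    return result
```

Because the loop runs until nothing changes, chains of sub-class or sub-property statements are followed to their end.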

Vocabulary caching

By default, pyRdfa uses a caching mechanism instead of fetching the vocabulary files each time their URI is met as a C{@vocab} attribute value. (This behavior can be switched off by setting the C{vocab_cache} option to false.)

Caching happens in a file system directory. The directory itself is determined by the platform the tool is used on, namely:
  • On Windows, it is the C{pyRdfa-cache} subdirectory of the C{%APPDATA%} environment variable
  • On MacOS, it is the C{~/Library/Application Support/pyRdfa-cache}
  • Otherwise, it is the C{~/.pyRdfa-cache}

This automatic choice can be overridden by the C{PyRdfaCacheDir} environment variable.
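
The directory selection described above can be sketched as follows. The function name C{cache_directory} and its parameters are assumptions made for this example; passing the platform string, the environment, and the home directory explicitly simply makes the rule easy to try out.

```python
import os


def cache_directory(platform, environ, home):
    """Return the vocabulary cache directory for the given platform.

    platform: a value such as sys.platform ("win32", "darwin", "linux", ...)
    environ:  a mapping such as os.environ
    home:     the user's home directory
    """
    override = environ.get("PyRdfaCacheDir")
    if override:
        # The environment variable takes precedence over the automatic choice
        return override
    if platform.startswith("win"):
        return os.path.join(environ.get("APPDATA", ""), "pyRdfa-cache")
    if platform == "darwin":
        return os.path.join(home, "Library", "Application Support", "pyRdfa-cache")
    return os.path.join(home, ".pyRdfa-cache")
```

A typical call would be C{cache_directory(sys.platform, os.environ, os.path.expanduser("~"))}.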

Caching can be set to be read-only, i.e., the setup might generate the cache files off-line instead of letting the tool write its own cache when operating, e.g., as a service on the Web. This can be achieved by making the cache directory read-only.

If the directories are neither readable nor writable, the vocabulary files are retrieved via HTTP every time they are hit. This may slow down processing, so it is advisable to avoid such a setup.

The cache includes a separate index file and a file for each vocabulary file. Cache control is based upon the C{EXPIRES} header of a vocabulary file’s HTTP return header: when first seen, this date is stored in the index file and controls whether the cache has to be renewed or not. If the HTTP return header does not have this entry, the date is artificially set to the current date plus one day.
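
The expiration rule can be illustrated with a short sketch for Python 3. The names C{expiration_date} and C{needs_refresh} are hypothetical, the headers are assumed to be available as a plain dictionary, and treating an unparsable header like a missing one is an assumption of this example rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def expiration_date(headers, now):
    """Return the date until which a cached vocabulary is considered valid."""
    value = headers.get("Expires")
    if value:
        try:
            expires = parsedate_to_datetime(value)
            if expires.tzinfo is None:
                # Treat a date without zone information as UTC
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
        except (TypeError, ValueError):
            # Unparsable value (e.g. "0"): fall through to the default
            pass
    # No usable header: the current date plus one day
    return now + timedelta(days=1)


def needs_refresh(stored_expiration, now):
    """True if the cached copy has expired and should be fetched again."""
    return now >= stored_expiration
```

Here C{now} would typically be C{datetime.now(timezone.utc)}; passing it in explicitly keeps both functions deterministic.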

(The cache files themselves are dumped and loaded using U{Python’s built in cPickle package<http://docs.python.org/release/2.7/library/pickle.html#module-cPickle>}. These are binary files. Care should be taken if they are managed by CVS: they must be declared as binary files when adding them to the repository.)

RDFa 1.1 vs. RDFa 1.0

Unfortunately, RDFa 1.1 is I{not} fully backward compatible with RDFa 1.0, meaning that, in a few cases, the triples generated from an RDFa 1.1 source are not the same as for RDFa 1.0. (See the separate U{section in the RDFa 1.1 specification<http://www.w3.org/TR/rdfa-core/#major-differences-with-rdfa-syntax-1.0>} for some further details.)

This distiller’s default behavior is RDFa 1.1. However, if the source includes, in the top element of the file (e.g., the C{html} element), a C{@version} attribute whose value contains the C{RDFa 1.0} string, then the distiller switches to an RDFa 1.0 mode. (Although the C{@version} attribute is not required in RDFa 1.0, it is fairly commonly used.) Similarly, if the RDFa 1.0 DTD is used in the XHTML source, it will be taken into account. (A very frequent setup is that an XHTML file is defined with that DTD and is served as text/html; pyRdfa will consider that file as XHTML5, i.e., parse it with the HTML5 parser, but interpret the RDFa attributes under the RDFa 1.0 rules.)
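
The C{@version} check can be sketched with the standard library's DOM implementation. This is an illustration only: the function name C{detect_rdfa_version} is hypothetical, the source is assumed to be well-formed XML, and the DTD-based detection mentioned above is not covered.

```python
from xml.dom import minidom


def detect_rdfa_version(source, default="1.1"):
    """Return "1.0" if the top element announces RDFa 1.0, else the default."""
    top = minidom.parseString(source).documentElement
    # getAttribute returns an empty string when the attribute is absent
    if "RDFa 1.0" in top.getAttribute("version"):
        return "1.0"
    return default
```

For example, a document whose top element is C{<html version="XHTML+RDFa 1.0">} would be processed in RDFa 1.0 mode, while a document without the attribute keeps the default.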

Transformers

The package uses the concept of ‘transformers’: the parsed DOM tree is possibly transformed I{before} performing the real RDFa processing. This transformer structure makes it possible to add additional ‘services’ without distorting the core code of RDFa processing.

A transformer is a function with three arguments:

  • C{node}: a DOM node for the top level element of the DOM tree
  • C{options}: the current L{Options} instance
  • C{state}: the current L{ExecutionContext} instance, corresponding to the top level DOM Tree element

The function may perform any type of change on the DOM tree; the typical behaviour is to add or remove attributes on specific elements. Some transformations are included in the package and can be used as examples; see the L{transform} module of the distribution. These are:

  • The C{@name} attribute of the C{meta} element is copied into a C{@property} attribute of the same element
  • Interpreting the ‘openid’ references in the header. See L{transform.OpenID} for further details.
  • Implementing the Dublin Core dialect to include DC statements from the header. See L{transform.DublinCore} for further details.

The user of the package may add these transformers to an L{Options} instance. Here is a possible usage with the “openid” transformer added to the call::

    from pyRdfa import pyRdfa
    from pyRdfa.options import Options
    from pyRdfa.transform.OpenID import OpenID_transform
    options = Options(transformers=[OpenID_transform])
    print(pyRdfa(options=options).rdf_from_source('filename'))
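
A user-defined transformer only needs to follow the three-argument signature described above. The sketch below mimics the first built-in transformation (copying C{@name} into C{@property} on C{meta} elements) using the standard library DOM. It is written for illustration and is not the package's own implementation; the function name and the choice to leave an existing C{@property} untouched are assumptions.

```python
from xml.dom import minidom


def meta_name_to_property(node, options, state):
    """Copy @name into @property on every meta element below node.

    node:    top level DOM element
    options: the current options instance (unused in this sketch)
    state:   the current execution context (unused in this sketch)
    """
    for meta in node.getElementsByTagName("meta"):
        if meta.hasAttribute("name") and not meta.hasAttribute("property"):
            meta.setAttribute("property", meta.getAttribute("name"))
```

Such a function could then be listed in the C{transformers} argument of L{Options}, in the same way as the “openid” transformer.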

@summary: RDFa parser (distiller)
@requires: Python version 2.5 or up; 2.7 is preferred
@requires: U{RDFLib<http://rdflib.net>}; version 3.X is preferred.
@requires: U{html5lib<http://code.google.com/p/html5lib/>} for the HTML5 parsing.
@requires: U{httpheader<http://deron.meranda.us/python/httpheader/>}; however, a small modification had to be made to the original file, so for this reason, and to make distribution easier, this module (a single file) is included in the package.
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
@copyright: W3C

@var builtInTransformers: List of built-in transformers that are to be run regardless, because they are part of the RDFa spec
@var CACHE_DIR_VAR: Environment variable used to define cache directories for RDFa vocabularies in case the default setting does not work or is not appropriate.
@var rdfa_current_version: Current “official” version of RDFa that this package implements by default. This can be changed at the invocation of the package
@var uri_schemes: List of registered (or widely used) URI schemes; used for warnings...

exception rdflib.plugins.parsers.pyRdfa.FailedSource(msg, http_code=None)[source]

Bases: rdflib.plugins.parsers.pyRdfa.RDFaError

Raised when the original source cannot be accessed. It does not add any new functionality to the Exception class.

__init__(msg, http_code=None)[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa'
exception rdflib.plugins.parsers.pyRdfa.HTTPError(http_msg, http_code)[source]

Bases: rdflib.plugins.parsers.pyRdfa.RDFaError

Raised when HTTP problems are detected. It does not add any new functionality to the Exception class.

__init__(http_msg, http_code)[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa'
exception rdflib.plugins.parsers.pyRdfa.ProcessingError(msg)[source]

Bases: rdflib.plugins.parsers.pyRdfa.RDFaError

Error found during processing. It does not add any new functionality to the Exception class.

__module__ = 'rdflib.plugins.parsers.pyRdfa'
exception rdflib.plugins.parsers.pyRdfa.RDFaError(msg)[source]

Bases: exceptions.Exception

Superclass of the exceptions representing error conditions defined by the RDFa 1.1 specification. It does not add any new functionality to the Exception class.

__init__(msg)[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa'
__weakref__

list of weak references to the object (if defined)

rdflib.plugins.parsers.pyRdfa.processURI(uri, outputFormat, form={})[source]

The standard processing of an RDFa URI, with the options given in a form; used as an entry point from a CGI call.

The call accepts extra form options (i.e., HTTP GET options) as follows:

  • C{graph=[output|processor|output,processor|processor,output]} specifying which graphs are returned. Default: C{output}
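  • C{space_preserve=[true|false]} controls white spaces in plain literals: if C{false}, the white spaces are normalized. Normalization is off by default, i.e., white spaces are preserved (cf. C{space_preserve=True} in the L{Options} signature)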
  • C{rdfa_version} provides the RDFa version that should be used for distilling. The string should be of the form “1.0” or “1.1”. Default is the highest version the current package implements, currently “1.1”
  • C{host_language=[xhtml,html,xml]} : the host language. Used when files are uploaded or text is added verbatim, otherwise the HTTP return header should be used. Default C{xml}
  • C{embedded_rdf=[true|false]} : whether embedded turtle or RDF/XML content should be added to the output graph. Default: C{false}
  • C{vocab_expansion=[true|false]} : whether the vocabularies should be expanded through the restricted RDFS entailment. Default: C{false}
  • C{vocab_cache=[true|false]} : whether vocab caching should be performed or whether it should be ignored and vocabulary files should be picked up every time. Default: C{false}
  • C{vocab_cache_report=[true|false]} : whether vocab caching details should be reported. Default: C{false}
  • C{vocab_cache_bypass=[true|false]} : whether vocab caches have to be regenerated every time. Default: C{false}
  • C{rdfa_lite=[true|false]} : whether warnings should be generated for non RDFa Lite attribute usage. Default: C{false}

@param uri: URI to access. Note that the C{text:} and C{uploaded:} fake URI values are treated separately; the former is for textual input (in which case a StringIO is used to get the data) and the latter is for an uploaded file, where the form gives access to the file directly.
@param outputFormat: serialization format, as defined by the package. Currently “xml”, “turtle”, “nt”, or “json”. Default is “turtle”, also used if any other string is given.
@param form: extra call options (from the CGI call) to set up the local options
@type form: cgi FieldStorage instance
@return: serialized graph
@rtype: string
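
As an illustration of how such form options might be interpreted, the sketch below turns the C{graph} option and a boolean option into Python values. It uses a plain dictionary in place of a cgi FieldStorage instance; the helper names C{graphs_requested} and C{flag} are hypothetical, and falling back to the output graph when no known name is given is an assumption.

```python
def graphs_requested(form):
    """Return (output_default_graph, output_processor_graph) for a form."""
    value = form.get("graph", "output")
    names = set(part.strip().lower() for part in value.split(","))
    output = "output" in names
    processor = "processor" in names
    if not (output or processor):
        # Assumption: an unrecognized value falls back to the default
        output = True
    return (output, processor)


def flag(form, name, default=False):
    """Read a [true|false] option, using the default when it is absent."""
    value = form.get(name)
    if value is None:
        return default
    return value.strip().lower() == "true"
```

The two booleans returned by C{graphs_requested} correspond to the C{output_default_graph} and C{output_processor_graph} arguments of L{Options}.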

class rdflib.plugins.parsers.pyRdfa.pyRdfa(options=None, base='', media_type='', rdfa_version=None)[source]

Main processing class for the distiller

@ivar options: an instance of the L{Options} class
@ivar media_type: the preferred default media type, possibly set at initialization
@ivar base: the base value, possibly set at initialization
@ivar http_status: HTTP Status, to be returned when the package is used via a CGI entry. Initially set to 200, may be modified by exception handlers

__init__(options=None, base='', media_type='', rdfa_version=None)[source]

@keyword options: Options for the distiller
@type options: L{Options}
@keyword base: URI for the default “base” value (usually the URI of the file to be processed)
@keyword media_type: explicit setting of the preferred media type (a.k.a. content type) of the RDFa source
@keyword rdfa_version: the RDFa version that should be used. If not set, the value of the global L{rdfa_current_version} variable is used

__module__ = 'rdflib.plugins.parsers.pyRdfa'
graph_from_DOM(dom, graph=None, pgraph=None)[source]

Extract the RDF Graph from a DOM tree. This is where the real processing happens. All other methods get down to this one, eventually (e.g., after opening a URI and parsing it into a DOM).

@param dom: a DOM Node element, the top level entry node for the whole tree (i.e., the C{dom.documentElement} is used to initiate processing down the node hierarchy)
@keyword graph: an RDF Graph (if None, then a new one is created)
@type graph: rdflib Graph instance.
@keyword pgraph: an RDF Graph to hold (possibly) the processor graph content. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@type pgraph: rdflib Graph instance
@return: an RDF Graph
@rtype: rdflib Graph instance

graph_from_source(name, graph=None, rdfOutput=False, pgraph=None)[source]

Extract an RDF graph from an RDFa source. The source is parsed, the RDF extracted, and the RDFa Graph is returned. This is a front-end to the L{pyRdfa.graph_from_DOM} method.

@param name: a URI, a file name, or a file-like object
@param graph: rdflib Graph instance. If None, a new one is created.
@param pgraph: rdflib Graph instance for the processor graph. If None, and the error/warning triples are to be generated, they will be added to the returned graph. Otherwise they are stored in this graph.
@param rdfOutput: whether runtime exceptions should be turned into RDF and returned as part of the processor graph
@return: an RDF Graph
@rtype: rdflib Graph instance

rdf_from_source(name, outputFormat='turtle', rdfOutput=False)[source]

Extract an RDF graph from an RDFa source and serialize it in one graph. The source is parsed, the RDF extracted, and serialization is done in the specified format.

@param name: a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of “turtle”, “n3”, “xml”, “pretty-xml”, “nt”, “json” or “json-ld”. “turtle” and “n3”, “xml” and “pretty-xml”, and “json” and “json-ld” are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string
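
The synonym pairs listed for C{outputFormat} can be pictured as a small lookup table. The sketch below is an assumption-laden illustration: the name C{canonical_format}, the choice of which member of each pair is “canonical”, and the fallback to “turtle” for unknown names (documented for L{processURI}, assumed here) are not taken from the package's code.

```python
# Each synonym is mapped to one (arbitrarily chosen) representative
SYNONYMS = {"n3": "turtle", "pretty-xml": "xml", "json-ld": "json"}
KNOWN = {"turtle", "xml", "nt", "json"}


def canonical_format(name, default="turtle"):
    """Map an outputFormat value to one representative per synonym pair."""
    name = name.strip().lower()
    name = SYNONYMS.get(name, name)
    # Unknown names fall back to the default format
    return name if name in KNOWN else default
```

With this table, “pretty-xml” and “xml” select the same serialization, as do “json” and “json-ld”.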

rdf_from_sources(names, outputFormat='turtle', rdfOutput=False)[source]

Extract an RDF graph from a list of RDFa sources and serialize them in one graph. The sources are parsed, the RDF extracted, and serialization is done in the specified format.

@param names: list of sources, each can be a URI, a file name, or a file-like object
@keyword outputFormat: serialization format. Can be one of “turtle”, “n3”, “xml”, “pretty-xml”, “nt”, “json” or “json-ld”. “turtle” and “n3”, “xml” and “pretty-xml”, and “json” and “json-ld” are synonyms, respectively. Note that the JSON-LD serialization works with RDFLib 3.* only.
@keyword rdfOutput: controls what happens in case an exception is raised. If the value is False, the caller is responsible for handling it; otherwise a graph is returned with an error message included in the processor graph
@type rdfOutput: boolean
@return: a serialized RDF Graph
@rtype: string

exception rdflib.plugins.parsers.pyRdfa.pyRdfaError[source]

Bases: exceptions.Exception

Superclass of the exceptions representing error conditions outside the RDFa 1.1 specification.

__module__ = 'rdflib.plugins.parsers.pyRdfa'
__weakref__

list of weak references to the object (if defined)

embeddedRDF Module

Extracts possible embedded RDF/XML content from the file and parses it separately into the Graph. This is used, for example, by U{SVG 1.2 Tiny<http://www.w3.org/TR/SVGMobile12/>}.

@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}
@contact: Ivan Herman, ivan@w3.org
@version: $Id: embeddedRDF.py,v 1.15 2012/11/16 17:51:53 ivan Exp $

rdflib.plugins.parsers.pyRdfa.embeddedRDF.handle_embeddedRDF(node, graph, state)[source]

Handles embedded RDF. There are two possibilities:

  • the file is one of the XML dialects that allows for an embedded RDF/XML portion. See the L{host.accept_embedded_rdf_xml} for those (a typical example is SVG).
  • the file is HTML and there is a turtle portion in the C{<script>} element with type text/turtle.

@param node: a DOM node for the top level element
@param graph: target rdf graph
@type graph: RDFLib’s Graph object instance
@param state: the inherited state (namespaces, lang, etc)
@type state: L{state.ExecutionContext}
@return: whether RDF/XML or Turtle content has been detected or not. If True, the RDFa processing should not occur on the node and its descendants.
@rtype: Boolean

initialcontext Module

Built-in version of the initial contexts for RDFa Core, and RDFa + HTML

@summary: Management of vocabularies, terms, and their mapping to URI-s.
@requires: U{RDFLib package<http://rdflib.net>}
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

@var initial_context: dictionary for all the initial context data, keyed through the context URI-s

class rdflib.plugins.parsers.pyRdfa.initialcontext.Wrapper[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa.initialcontext'

options Module

L{Options} class: collect the possible options that govern the parsing possibilities. The module also includes the L{ProcessorGraph} class that handles the processor graph, per RDFa 1.1 (i.e., the graph containing errors and warnings).

@summary: RDFa parser (distiller)
@requires: U{RDFLib<http://rdflib.net>}
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyRdfa.options.Options(output_default_graph=True, output_processor_graph=False, space_preserve=True, transformers=[], embedded_rdf=True, vocab_expansion=False, vocab_cache=True, vocab_cache_report=False, refresh_vocab_cache=False, add_informational_messages=False, check_lite=False, experimental_features=False)[source]

Settable options. An instance of this class is stored in the L{execution context<ExecutionContext>} of the parser.

@ivar space_preserve: whether plain literals should preserve spaces at output or not @type space_preserve: Boolean

@ivar output_default_graph: whether the ‘default’ graph should be returned to the user @type output_default_graph: Boolean

@ivar output_processor_graph: whether the ‘processor’ graph should be returned to the user @type output_processor_graph: Boolean

@ivar processor_graph: the ‘processor’ Graph @type processor_graph: L{ProcessorGraph}

@ivar transformers: extra transformers @type transformers: list

@ivar vocab_cache_report: whether the details of vocabulary file caching process should be reported as information (mainly for debug) @type vocab_cache_report: Boolean

@ivar refresh_vocab_cache: whether the caching checks of vocabs should be by-passed, ie, if caches should be re-generated regardless of the stored date (important for vocab development) @type refresh_vocab_cache: Boolean

@ivar embedded_rdf: whether embedded RDF (ie, turtle in an HTML script element or an RDF/XML content in SVG) should be extracted and added to the final graph. This is a non-standard option... @type embedded_rdf: Boolean

@ivar vocab_expansion: whether the @vocab elements should be expanded and a mini-RDFS processing should be done on the merged graph @type vocab_expansion: Boolean

@ivar vocab_cache: whether the system should use the vocabulary caching mechanism when expanding via the mini-RDFS, or should just fetch the graphs every time @type vocab_cache: Boolean

@ivar host_language: the host language for the RDFa attributes. Default is HostLanguage.xhtml, but it can be HostLanguage.rdfa_core and HostLanguage.html5, or others... @type host_language: integer (logically: an enumeration)

@ivar content_type: the content type of the host file. Default is None @type content_type: string (logically: an enumeration)

@ivar add_informational_messages: whether informational messages should also be added to the processor graph, or only errors and warnings

@ivar experimental_features: whether experimental features should be activated; that is a developer’s option...

@ivar check_lite: whether RDFa Lite should be checked, to generate warnings.

__init__(output_default_graph=True, output_processor_graph=False, space_preserve=True, transformers=[], embedded_rdf=True, vocab_expansion=False, vocab_cache=True, vocab_cache_report=False, refresh_vocab_cache=False, add_informational_messages=False, check_lite=False, experimental_features=False)[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa.options'
__str__()[source]
add_error(txt, err_type=None, context=None, node=None, buggy_value=None)[source]

Add an error to the processor graph.

@param txt: the information text.
@keyword err_type: Error Class
@type err_type: URIRef
@keyword context: possible context to be added to the processor graph
@type context: URIRef or String
@keyword buggy_value: a special case when a ‘term’ is not recognized; no error is generated for that case if the value is part of the ‘usual’ XHTML terms, because almost all RDFa files contain some of those and that would pollute the output
@type buggy_value: String

add_info(txt, info_type=None, context=None, node=None, buggy_value=None)[source]

Add an informational comment to the processor graph.

@param txt: the information text.
@keyword info_type: Info Class
@type info_type: URIRef
@keyword context: possible context to be added to the processor graph
@type context: URIRef or String
@keyword buggy_value: a special case when a ‘term’ is not recognized; no information is generated for that case if the value is part of the ‘usual’ XHTML terms, because almost all RDFa files contain some of those and that would pollute the output
@type buggy_value: String

add_warning(txt, warning_type=None, context=None, node=None, buggy_value=None)[source]

Add a warning to the processor graph.

@param txt: the warning text.
@keyword warning_type: Warning Class
@type warning_type: URIRef
@keyword context: possible context to be added to the processor graph
@type context: URIRef or String
@keyword buggy_value: a special case when a ‘term’ is not recognized; no warning is generated for that case if the value is part of the ‘usual’ XHTML terms, because almost all RDFa files contain some of those and that would pollute the output
@type buggy_value: String

reset_processor_graph()[source]

Empty the processor graph. This is necessary if the same options instance is reused for several RDFa sources, and new error messages should be generated.

set_host_language(content_type)[source]

Set the host language for processing, based on the recognized types. If this is not a recognized content type, it falls back to RDFa core (i.e., XML).

@param content_type: content type
@type content_type: string

class rdflib.plugins.parsers.pyRdfa.options.ProcessorGraph[source]

Wrapper around the ‘processor graph’, ie, the (RDF) Graph containing the warnings, error messages, and informational messages.

__init__()[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa.options'
add_http_context(subj, http_code)[source]

Add an additional HTTP context to a message with subject in C{subj}, using the U{<http://www.w3.org/2006/http#>} vocabulary. Typically used to extend an error structure, as created by L{add_triples}.

@param subj: an RDFLib resource, typically a blank node
@param http_code: HTTP status code

add_triples(msg, top_class, info_class, context, node)[source]

Add an error structure to the processor graph: a bnode with a number of predicates. The structure follows U{the processor graph vocabulary<http://www.w3.org/2010/02/rdfa/wiki/Processor_Graph_Vocabulary>} as described on the RDFa WG Wiki page.

@param msg: the core error message, added as an object to a dc:description
@param top_class: Error, Warning, or Info; an explicit rdf:type added to the bnode
@type top_class: URIRef
@param info_class: An additional error class, added as an rdf:type to the bnode in case it is not None
@type info_class: URIRef
@param context: An additional information added, if not None, as an object with rdfa:context as a predicate
@type context: either an URIRef or a URI String (an URIRef will be created in the second case)
@param node: The node’s element name that contains the error
@type node: string
@return: the bnode that serves as a subject for the errors. The caller may add additional information
@rtype: BNode

parse Module

The core parsing function of RDFa. Some details are put into other modules to make it clearer to update/modify (e.g., generation of C{@property} values, or managing the current state).

Note that the entry point (L{parse_one_node}) bifurcates into an RDFa 1.0 and an RDFa 1.1 version, i.e., into L{_parse_1_0} and L{_parse_1_1}. Some of the parsing details (management of C{@property}, list facilities, changed behavior on C{@typeof}) have changed between versions, and forcing the two into one function would be counterproductive.

@summary: RDFa core parser processing step
@organization: U{World Wide Web Consortium<http://www.w3.org>}
@author: U{Ivan Herman<http://www.w3.org/People/Ivan/>}
@license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

rdflib.plugins.parsers.pyRdfa.parse.handle_role_attribute(node, graph, state)[source]

Handling the role attribute, according to http://www.w3.org/TR/role-attribute/#using-role-in-conjunction-with-rdfa

@param node: the DOM node to handle
@param graph: the RDF graph
@type graph: RDFLib’s Graph object instance
@param state: the inherited state (namespaces, lang, etc.)
@type state: L{state.ExecutionContext}

rdflib.plugins.parsers.pyRdfa.parse.parse_one_node(node, graph, parent_object, incoming_state, parent_incomplete_triples)[source]

The (recursive) step of handling a single node.

This entry just switches between the RDFa 1.0 and RDFa 1.1 versions for parsing. This method is only invoked once, actually, from the top level; the recursion then happens in the L{_parse_1_0} and L{_parse_1_1} methods for RDFa 1.0 and RDFa 1.1, respectively.

@param node: the DOM node to handle
@param graph: the RDF graph
@type graph: RDFLib’s Graph object instance
@param parent_object: the parent’s object, as an RDFLib URIRef
@param incoming_state: the inherited state (namespaces, lang, etc.)
@type incoming_state: L{state.ExecutionContext}
@param parent_incomplete_triples: list of hanging triples (the missing resource set to None) to be handled (or not) by the current node.
@return: whether the caller has to complete its parent’s incomplete triples
@rtype: Boolean

property Module

Implementation of the C{@property} value handling.

RDFa 1.0 and RDFa 1.1 are fairly different. RDFa 1.0 generates only literals, see U{RDFa Task Force’s wiki page<http://www.w3.org/2006/07/SWD/wiki/RDFa/LiteralObject>} for the details. On the other hand, RDFa 1.1, beyond literals, can also generate URI references. Hence the duplicate method in the L{ProcessProperty} class, one for RDFa 1.0 and the other for RDFa 1.1.

@summary: RDFa Literal generation @requires: U{RDFLib package<http://rdflib.net>} @organization: U{World Wide Web Consortium<http://www.w3.org>} @author: U{Ivan Herman<http://www.w3.org/People/Ivan/>} @license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyRdfa.property.ProcessProperty(node, graph, subject, state, typed_resource=None)[source]

Generate the value for C{@property} taking into account datatype, etc. Note: this class is created only if the C{@property} is indeed present, no need to check.

@ivar node: DOM element node @ivar graph: the (RDF) graph to add the properties to @ivar subject: the RDFLib URIRef serving as a subject for the generated triples @ivar state: the current state to be used for the CURIE-s @type state: L{state.ExecutionContext} @ivar typed_resource: Typically the bnode generated by a @typeof

__init__(node, graph, subject, state, typed_resource=None)[source]

@param node: DOM element node @param graph: the (RDF) graph to add the properties to @param subject: the RDFLib URIRef serving as a subject for the generated triples @param state: the current state to be used for the CURIE-s @type state: L{state.ExecutionContext} @param typed_resource: Typically the bnode generated by a @typeof; in RDFa 1.1, that becomes the object for C{@property}

__module__ = 'rdflib.plugins.parsers.pyRdfa.property'
generate()[source]

Common entry point for the RDFa 1.0 and RDFa 1.1 versions; bifurcates based on the RDFa version, as retrieved from the state object.
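The version switch described above can be pictured with a minimal sketch. The names C{StateStub} and C{PropertySketch} are hypothetical stand-ins, and the string comparison against "1.1" is an assumption about how the version might be tested; the returned strings only mark which branch was taken.

```python
class StateStub:
    """Hypothetical stand-in for the execution context; it only carries the version."""

    def __init__(self, rdfa_version):
        self.rdfa_version = rdfa_version


class PropertySketch:
    """Illustrative dispatcher, not the real ProcessProperty class."""

    def __init__(self, state):
        self.state = state

    def generate(self):
        # Bifurcate on the RDFa version retrieved from the state object.
        if self.state.rdfa_version >= "1.1":
            return self.generate_1_1()
        return self.generate_1_0()

    def generate_1_0(self):
        # RDFa 1.0: @property only yields literals.
        return "1.0 rules: literal objects only"

    def generate_1_1(self):
        # RDFa 1.1: @property may also yield URI references.
        return "1.1 rules: literals or URI references"
```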

generate_1_0()[source]

Generate the property object, 1.0 version

generate_1_1()[source]

Generate the property object, 1.1 version

state Module

Parser’s execution context (a.k.a. state) object and handling. The state includes:

  • language, retrieved from C{@xml:lang} or C{@lang}
  • URI base, determined by C{<base>} or set explicitly. This is a little bit superfluous, because the current RDFa syntax does not make use of C{@xml:base}; i.e., this could be a global value. But the structure is prepared to add C{@xml:base} easily, if needed.
  • options, in the form of an L{options<pyRdfa.options>} instance
  • a separate vocabulary/CURIE handling resource, in the form of an L{termorcurie<pyRdfa.TermOrCurie>} instance

The execution context object is also used to handle URI-s, CURIE-s, terms, etc.

@summary: RDFa parser execution context @organization: U{World Wide Web Consortium<http://www.w3.org>} @author: U{Ivan Herman<http://www.w3.org/People/Ivan/>} @license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyRdfa.state.ExecutionContext(node, graph, inherited_state=None, base='', options=None, rdfa_version=None)[source]

State at a specific node, including the current set of namespaces in the RDFLib sense, current language, the base, vocabularies, etc. The class is also used to interpret URI-s and CURIE-s to produce URI references for RDFLib.

@ivar options: reference to the overall options @type options: L{Options} @ivar base: the ‘base’ URI @ivar parsedBase: the parsed version of base, as produced by urlparse.urlsplit @ivar defaultNS: default namespace (if defined via @xmlns) to be used for XML Literals @ivar lang: language tag (possibly None) @ivar term_or_curie: vocabulary management class instance @type term_or_curie: L{termorcurie.TermOrCurie} @ivar list_mapping: dictionary of arrays, containing a list of URIs keyed via properties for lists @ivar node: the node to which this state belongs @type node: DOM node instance @ivar rdfa_version: RDFa version of the content @type rdfa_version: String @ivar supress_lang: in some cases, the effect of the lang attribute should be suppressed for the given node, although it should be inherited down below (example: @value attribute of the data element in HTML5) @type supress_lang: Boolean @cvar _list: list of attributes that allow for lists of values and should be treated as such @cvar _resource_type: dictionary; mapping table from attribute name to the exact method to retrieve the URI(s). It is initialized at first instantiation.

__init__(node, graph, inherited_state=None, base='', options=None, rdfa_version=None)[source]

@param node: the current DOM Node @param graph: the RDFLib Graph @keyword inherited_state: the state as inherited from upper layers. This inherited_state is mixed with the state information retrieved from the current node. @type inherited_state: L{state.ExecutionContext} @keyword base: string denoting the base URI for the specific node. This overrides the possible base inherited from the upper layers. The current XHTML+RDFa syntax does not allow the usage of C{@xml:base}, but SVG1.2 does, so this is necessary for SVG (and other possible XML dialects that accept C{@xml:base}) @keyword options: invocation options, and references to warning graphs @type options: L{Options<pyRdfa.options>}

__module__ = 'rdflib.plugins.parsers.pyRdfa.state'
add_to_list_mapping(property, resource)[source]

Add a new property-resource on the list mapping structure. The latter is a dictionary of arrays; if the array does not exist yet, it will be created on the fly.

@param property: the property URI, used as a key in the dictionary @param resource: the resource to be added to the relevant array in the dictionary. Can be None; this is a dummy placeholder for C{<span rel="property" inlist>...</span>} constructions that may be filled in by children or siblings; if it is not filled in, an empty list has to be generated.
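A minimal sketch of the "dictionary of arrays" idea follows. The class name C{ListMappingSketch} is hypothetical, and representing the None placeholder as an empty array is an assumption made for brevity; the module may store the placeholder differently.

```python
class ListMappingSketch:
    """Hypothetical, simplified model of the list mapping structure."""

    def __init__(self):
        self.mapping = {}

    def add_to_list_mapping(self, prop, resource):
        # Create the array on the fly the first time a property is seen.
        values = self.mapping.setdefault(prop, [])
        # None is only a placeholder: it makes sure the key exists, so that
        # an empty list can still be generated, but it adds no value.
        if resource is not None:
            values.append(resource)


lists = ListMappingSketch()
lists.add_to_list_mapping("http://example.org/p", "http://example.org/a")
lists.add_to_list_mapping("http://example.org/p", "http://example.org/b")
lists.add_to_list_mapping("http://example.org/q", None)
```

After these calls the first property maps to two resources, while the second one maps to an empty array that children or siblings could still fill in.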

getResource(*args)[source]

Get single resources from several different attributes. The first one that returns a valid URI wins. @param args: variable list of attribute names, or a single attribute being a list itself. @return: an RDFLib URIRef instance (or None)
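The "first valid URI wins" rule can be sketched as below. The function C{first_resource} and its C{lookup} argument are hypothetical; plain strings stand in for URIRef instances so that the example does not depend on RDFLib.

```python
def first_resource(lookup, *args):
    """Return the first non-None result among the named attributes.

    `lookup` is a hypothetical callable mapping an attribute name to a
    URI (or None); in the real class this role is played by getURI.
    """
    # Accept either several names or a single list/tuple of names.
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        names = args[0]
    else:
        names = args
    for name in names:
        uri = lookup(name)
        if uri is not None:
            return uri
    return None


# Example attribute values, as they might be resolved on one element.
attributes = {
    "about": None,
    "href": "http://example.org/doc",
    "src": "http://example.org/img",
}
```

With these values, C{first_resource(attributes.get, "about", "href", "src")} skips the unset C{about} and returns the C{href} value.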

getURI(attr)[source]

Get the URI(s) for the attribute. The name of the attribute determines whether the value should be a pure URI, a CURIE, etc., and whether the return is a single element or a list of those. This is done using the L{ExecutionContext._resource_type} table. @param attr: attribute name @type attr: string @return: an RDFLib URIRef instance (or None) or a list of those

get_list_origin()[source]

Return the origin of the list, ie, the subject to attach the final list(s) to @return: URIRef

get_list_props()[source]

Return the list of property values in the list structure @return: list of URIRef

get_list_value(prop)[source]

Return the list of values in the list structure for a specific property @return: list of RDF nodes

list_empty()[source]

Checks whether the list is empty. @return: Boolean

reset_list_mapping(origin=None)[source]

Reset, ie, create a new empty dictionary for the list mapping.

set_list_origin(origin)[source]

Set the origin of the list, ie, the subject to attach the final list(s) to @param origin: URIRef

class rdflib.plugins.parsers.pyRdfa.state.ListStructure[source]

Special class to handle the C{@inlist} type structures in RDFa 1.1; stores the “origin”, i.e, where the list will be attached to, and the mappings as defined in the spec.

__init__()[source]
__module__ = 'rdflib.plugins.parsers.pyRdfa.state'

termorcurie Module

Management of vocabularies, terms, and their mapping to URI-s. The main class of this module (L{TermOrCurie}) is, conceptually, part of the overall state of processing at a node (L{state.ExecutionContext}), but putting it into a separate module makes it easier to maintain.

@summary: Management of vocabularies, terms, and their mapping to URI-s. @requires: U{RDFLib package<http://rdflib.net>} @organization: U{World Wide Web Consortium<http://www.w3.org>} @author: U{Ivan Herman<http://www.w3.org/People/Ivan/>} @license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

@var XHTML_PREFIX: prefix for the XHTML vocabulary URI (set to ‘xhv’) @var XHTML_URI: URI prefix of the XHTML vocabulary @var ncname: Regular expression object for NCNAME @var termname: Regular expression object for a term @var xml_application_media_type: Regular expression object for a general XML application media type

class rdflib.plugins.parsers.pyRdfa.termorcurie.InitialContext(state, top_level)[source]

Get the initial context values. In most cases this class has an empty content, except for the top level (in case of RDFa 1.1). Each L{TermOrCurie} class has one instance of this class. It provides initial mappings for terms, namespace prefixes, etc, that the top level L{TermOrCurie} instance uses for its own initialization.

@ivar terms: collection of all term mappings @type terms: dictionary @ivar ns: namespace mapping @type ns: dictionary @ivar vocabulary: default vocabulary @type vocabulary: string

__init__(state, top_level)[source]

@param state: the state behind this term mapping @type state: L{state.ExecutionContext} @param top_level : whether this is the top node of the DOM tree (the only place where initial contexts are handled) @type top_level : boolean

__module__ = 'rdflib.plugins.parsers.pyRdfa.termorcurie'
class rdflib.plugins.parsers.pyRdfa.termorcurie.TermOrCurie(state, graph, inherited_state)[source]

Wrapper around vocabulary management, ie, mapping a term to a URI, as well as a CURIE to a URI. Each instance of this class belongs to a “state”, instance of L{state.ExecutionContext}. Context definitions are managed at initialization time.

(In fact, this class is, conceptually, part of the overall state at a node, and has been separated here for an easier maintenance.)

The class takes care of the stack-like behavior of vocabulary items, i.e., inheriting everything that is possible from the “parent”. At initialization time, this works through the prefix definitions (i.e., C{@prefix} or C{@xmlns:} attributes) and/or C{@vocab} attributes.

@ivar state: State to which this instance belongs @type state: L{state.ExecutionContext} @ivar graph: The RDF Graph under generation @type graph: rdflib.Graph @ivar terms: mapping from terms to URI-s @type terms: dictionary @ivar ns: namespace declarations, ie, mapping from prefixes to URIs @type ns: dictionary @ivar default_curie_uri: URI for a default CURIE
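The stack-like inheritance of prefix definitions can be illustrated with the standard library's C{collections.ChainMap}. This is a swapped-in technique chosen for brevity, not a description of how the class stores its dictionaries; the prefixes and URIs are example values.

```python
from collections import ChainMap

# Prefix mappings visible at the top-level node (example values).
parent_ns = ChainMap({"dc": "http://purl.org/dc/terms/"})

# A child node adds its own declarations on top of the inherited ones;
# lookups fall through to the parent when the child has no entry.
child_ns = parent_ns.new_child({"foaf": "http://xmlns.com/foaf/0.1/"})

# A grandchild may redefine a prefix without touching its ancestors.
grandchild_ns = child_ns.new_child({"dc": "http://purl.org/dc/elements/1.1/"})
```

The child sees both C{dc} and C{foaf}, the parent never sees C{foaf}, and the redefinition of C{dc} is visible only from the grandchild downwards.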

CURIE_to_URI(val)[source]

CURIE to URI mapping.

This method does I{not} take care of the last step of CURIE processing, i.e., the fact that if the value is not a CURIE then it is used as a URI. This is done on the caller’s side, because this has to be combined with the base, for example. The method I{does} take care of BNode processing, though, i.e., CURIE-s of the form “_:XXX”.

@param val: the full CURIE @type val: string @return: URIRef of a URI or None.
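The division of labour between this method and its caller can be sketched as follows. The functions C{curie_to_uri} and C{resolve} are hypothetical, plain strings replace URIRef instances, and the tuple used for blank nodes is an assumption made only so that the sketch stays free of RDFLib.

```python
from urllib.parse import urljoin


def curie_to_uri(val, ns):
    """Resolve "prefix:reference" against `ns`; return None if that fails."""
    if ":" not in val:
        return None
    prefix, reference = val.split(":", 1)
    if prefix == "_":
        # Blank node CURIE of the form "_:XXX".
        return ("bnode", reference)
    if prefix in ns:
        return ns[prefix] + reference
    # Unknown prefix: the caller decides what to do with the value.
    return None


def resolve(val, ns, base):
    """Caller-side last step: fall back to a URI resolved against the base."""
    uri = curie_to_uri(val, ns)
    return uri if uri is not None else urljoin(base, val)


ns = {"foaf": "http://xmlns.com/foaf/0.1/"}
```

Keeping the fallback in C{resolve} mirrors the reasoning above: only the caller knows the base against which a plain URI has to be resolved.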

__init__(state, graph, inherited_state)[source]

Initialize the vocab bound to a specific state. @param state: the state to which this vocab instance belongs @type state: L{state.ExecutionContext} @param graph: the RDF graph being worked on @type graph: rdflib.Graph @param inherited_state: the state inherited by the current state. ‘None’ if this is the top level state. @type inherited_state: L{state.ExecutionContext}

__module__ = 'rdflib.plugins.parsers.pyRdfa.termorcurie'
term_to_URI(term)[source]

A term to URI mapping, where term is a simple string and the corresponding URI is defined via the @vocab (ie, default term uri) mechanism. Returns None if term is not defined @param term: string @return: an RDFLib URIRef instance (or None)
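A minimal sketch of term resolution is given below, assuming the order of RDFa Core 1.1 as I understand it: a default vocabulary, if set, is concatenated with the term; otherwise the term mappings are consulted. The function name and its parameters are hypothetical, and the case-insensitive fallback of the specification is left out.

```python
def term_to_uri(term, terms, vocabulary=None):
    """Map a simple term to a URI string, or return None if it is undefined."""
    if vocabulary:
        # A default vocabulary (set via @vocab) takes the term as a suffix.
        return vocabulary + term
    # Otherwise only explicitly known terms resolve.
    return terms.get(term)


# Example term mapping, in the style of an initial context.
known_terms = {"license": "http://www.w3.org/1999/xhtml/vocab#license"}
```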

utils Module

Various utilities for pyRdfa.

Most of the utilities are straightforward.

@organization: U{World Wide Web Consortium<http://www.w3.org>} @author: U{Ivan Herman<http://www.w3.org/People/Ivan/>} @license: This software is available for use under the U{W3C® SOFTWARE NOTICE AND LICENSE<http://www.w3.org/Consortium/Legal/2002/copyright-software-20021231>}

class rdflib.plugins.parsers.pyRdfa.utils.URIOpener(name, additional_headers={})[source]

A wrapper around the urllib2 method to open a resource. Beyond accessing the data itself, the class sets a number of instance variables that might be relevant for processing. The class also adds an accept header to the outgoing request, namely text/html and application/xhtml+xml (unless set explicitly by the caller).

If the content type is set by the server, the relevant HTTP response field is used. Otherwise, common suffixes are used (see L{host.preferred_suffixes}) to set the content type (this is really of importance for C{file:///} URI-s). If none of these works, the content type is empty.

Interpretation of the content type for the return is done by Deron Meranda’s U{httpheader module<http://deron.meranda.us/>}.

@ivar data: the real data, ie, a file-like object @ivar headers: the return headers as sent back by the server @ivar content_type: the content type of the resource or the empty string, if the content type cannot be determined @ivar location: the real location of the data (ie, after possible redirection and content negotiation) @ivar last_modified_date: sets the last modified date if set in the header, None otherwise @ivar expiration_date: sets the expiration date if set in the header, I{current UTC plus one day} otherwise (this is used for caching purposes, hence this artificial setting)
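The two fallbacks described above (content type from the suffix, expiration one day ahead) can be sketched as follows. The table C{SUFFIXES} is a hypothetical stand-in for L{host.preferred_suffixes}, and both function names are made up for this illustration.

```python
import datetime

# Hypothetical suffix table; the real one lives in host.preferred_suffixes.
SUFFIXES = {
    "html": "text/html",
    "xhtml": "application/xhtml+xml",
    "svg": "image/svg+xml",
}


def guess_content_type(headers, location):
    """Prefer the server's header, then the suffix, else the empty string."""
    if "Content-Type" in headers:
        # Drop parameters such as "; charset=utf-8".
        return headers["Content-Type"].split(";")[0].strip()
    suffix = location.rsplit(".", 1)[-1].lower() if "." in location else ""
    return SUFFIXES.get(suffix, "")


def default_expiration(now=None):
    """Artificial expiration date: current UTC time plus one day."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return now + datetime.timedelta(days=1)
```

The suffix lookup matters mostly for C{file:///} URI-s, where no server is there to set a header.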

CONTENT_LOCATION = 'Content-Location'
CONTENT_TYPE = 'Content-Type'
EXPIRES = 'Expires'
LAST_MODIFIED = 'Last-Modified'
__init__(name, additional_headers={})[source]

@param name: URL to be opened @keyword additional_headers: additional HTTP request headers to be added to the call

__module__ = 'rdflib.plugins.parsers.pyRdfa.utils'
rdflib.plugins.parsers.pyRdfa.utils.create_file_name(uri)[source]

Create a suitable file name from an (absolute) URI. Used, eg, for the generation of a file name for a cached vocabulary file.
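One possible way to derive a file name from a URI is sketched below. The replacement rule (everything except letters, digits and dots becomes an underscore) is an assumption for illustration; the characters the function actually replaces may differ.

```python
import re


def file_name_from_uri(uri):
    """Hypothetical helper: turn an absolute URI into a flat file name."""
    # Replace every character that could be awkward in a file name.
    return re.sub(r"[^A-Za-z0-9.]", "_", uri.strip())
```

For example, C{file_name_from_uri("http://www.w3.org/ns/rdfa#")} yields C{http___www.w3.org_ns_rdfa_}.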

rdflib.plugins.parsers.pyRdfa.utils.dump(node)[source]

This is just for debug purposes: it prints the essential content of the node in the tree starting at node.

@param node: DOM node

rdflib.plugins.parsers.pyRdfa.utils.has_one_of_attributes(node, *args)[source]

Check whether one of the listed attributes is present on a (DOM) node. @param node: DOM element node @param args: possible attribute names @return: True or False @rtype: Boolean
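The check can be reproduced with the standard C{xml.dom.minidom} API, as in the following sketch. The function name C{has_any_attribute} is hypothetical, and the sample element is made up.

```python
from xml.dom import minidom


def has_any_attribute(node, *names):
    """Return True if the element carries at least one of the named attributes."""
    return any(node.hasAttribute(name) for name in names)


doc = minidom.parseString('<div about="#me" property="foaf:name">Ivan</div>')
element = doc.documentElement
```

Here C{has_any_attribute(element, "rel", "property")} is true because of C{@property}, while C{has_any_attribute(element, "rel", "rev")} is false.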

rdflib.plugins.parsers.pyRdfa.utils.quote_URI(uri, options=None)[source]

‘Quote’ a URI, i.e., exchange special characters for their ‘%..’ equivalents. Some of the characters may stay as they are (listed in L{_unquotedChars}). If one of the characters listed in L{_warnChars} is also in the URI, an extra warning is also generated. @param uri: URI @param options: @type options: L{Options<pyRdfa.Options>}
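A sketch of the quoting step, using Python 3's C{urllib.parse.quote}, is shown below. The contents of C{SAFE_CHARS} and C{WARN_CHARS} are assumptions standing in for L{_unquotedChars} and L{_warnChars}, and a plain list replaces the options object as the place where warnings are collected.

```python
from urllib.parse import quote

# Assumed character sets; the module's own tables may differ.
SAFE_CHARS = ":/?=#~"
WARN_CHARS = " \n\r\t"


def quote_uri_sketch(uri, warnings=None):
    """Percent-encode special characters, optionally recording warnings."""
    uri = uri.strip()
    for char in WARN_CHARS:
        if char in uri and warnings is not None:
            warnings.append("Unusual character %r in URI %r" % (char, uri))
    return quote(uri, safe=SAFE_CHARS)
```

Calling C{quote_uri_sketch("http://example.org/a b", found)} with an empty list C{found} returns C{http://example.org/a%20b} and leaves one warning in the list.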

rdflib.plugins.parsers.pyRdfa.utils.return_XML(state, inode, base=True, xmlns=True)[source]

Get (recursively) the XML Literal content of a DOM Element Node. (Most of the processing is done via a C{node.toxml} call of the xml minidom implementation.)

@param inode: DOM Node @param state: L{pyRdfa.state.ExecutionContext} @param base: whether the base element should be added to the output @type base: Boolean @param xmlns: whether the namespace declarations should be repeated in the generated node @type xmlns: Boolean @return: string
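The role of C{node.toxml} can be seen in a small minidom sketch. The function C{xml_literal} is hypothetical and covers only the optional base decoration; namespace handling is left out.

```python
from xml.dom import minidom


def xml_literal(node, base=None):
    """Serialize a deep copy of the element, optionally decorated with xml:base."""
    copy = node.cloneNode(True)  # deep copy, so the source tree stays untouched
    if base is not None:
        copy.setAttribute("xml:base", base)
    return copy.toxml()


doc = minidom.parseString("<p>Hello <em>RDFa</em> world</p>")
paragraph = doc.documentElement
```

Working on a clone means that the decoration shows up in the generated literal only, not in the DOM tree being parsed.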

rdflib.plugins.parsers.pyRdfa.utils.traverse_tree(node, func)[source]

Traverse the whole element tree, and perform the function C{func} on all the elements. @param node: DOM element node @param func: function to be called on the node. Input parameter is a DOM Element Node. If the function returns a boolean True, the recursion is stopped.
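A minimal sketch of such a traversal over a minidom tree follows. It assumes that "the recursion is stopped" means that the children of the element for which C{func} returned True are not visited, while its siblings still are; the names C{traverse} and C{record} are hypothetical.

```python
from xml.dom import minidom


def traverse(node, func):
    """Call func on node, then on its element descendants, depth first."""
    if func(node):
        # func asked to stop: do not descend below this element.
        return
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            traverse(child, func)


doc = minidom.parseString("<a><b><c/></b><d/></a>")
visited = []


def record(element):
    visited.append(element.tagName)
    return element.tagName == "b"  # stop descending below <b>


traverse(doc.documentElement, record)
```

Under that assumption C{visited} ends up as C{["a", "b", "d"]}: the element C{c} is skipped because it sits below C{b}.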