Extract embedded metadata from HTML markup
Project description
extruct is a library for extracting embedded metadata from HTML markup.
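Currently, extruct supports:
Microdata
JSON-LD
Microformat via mf2py
Open Graph
RDFa (experimental)
Dublin Core
The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.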
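Installation
pip install extruct
Usage
All-in-one extraction
The simplest way to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with an HTML string and an optional base URL.
Let's try this on a web page that uses all the supported syntaxes (RDFa with ogp).
First fetch the HTML using python-requests, then pass the response body to extruct: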
>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
'content': 'What is Open Graph Protocol '
'and why you need it? Learn to '
'implement Open Graph Protocol '
'for Facebook on your website. '
'Open Graph Protocol Meta Tags.',
'name': 'description'}],
'namespaces': {},
'terms': []}],
'json-ld': [ { '@context': 'https://schema.org',
'@id': '#organization',
'@type': 'Organization',
'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
'name': 'Optimize Smart',
'sameAs': [ 'https://#/optimizesmart/',
'https://uk.linkedin.com/in/analyticsnerd',
'https://www.youtube.com/user/optimizesmart',
'https://twitter.com/analyticsnerd'],
'url': 'https://www.optimizesmart.com/'}],
'microdata': [ { 'properties': {'headline': ''},
'type': 'http://schema.org/WPHeader'}],
'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
'name': [ 'Open Graph '
'Protocol for '
'Facebook '
'explained with '
'examples\n'
'\n'
'Specialized '
'Tracking\n'
'\n'
'\n'
(...)
'Follow '
'@analyticsnerd\n'
'!function(d,s,id){var '
"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
"'script', "
"'twitter-wjs');"]},
'type': ['h-entry']}],
'properties': { 'name': [ 'Open Graph Protocol for '
'Facebook explained with '
'examples\n'
(...)
'Follow @analyticsnerd\n'
'!function(d,s,id){var '
"js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
"'script', 'twitter-wjs');"]},
'type': ['h-feed']}],
'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
'properties': [ ('og:locale', 'en_US'),
('og:type', 'article'),
( 'og:title',
'Open Graph Protocol for Facebook '
'explained with examples'),
( 'og:description',
'What is Open Graph Protocol and why you '
'need it? Learn to implement Open Graph '
'Protocol for Facebook on your website. '
'Open Graph Protocol Meta Tags.'),
( 'og:url',
'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
('og:site_name', 'Optimize Smart'),
( 'og:updated_time',
'2018-03-09T16:26:35+00:00'),
( 'og:image',
'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
( 'og:image:secure_url',
'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
{ '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
'article:publisher': [ { '@value': 'https://#/optimizesmart/'}],
'article:section': [{'@value': 'Specialized Tracking'}],
'http://ogp.me/ns#description': [ { '@value': 'What is Open '
'Graph Protocol '
'and why you need '
'it? Learn to '
'implement Open '
'Graph Protocol '
'for Facebook on '
'your website. '
'Open Graph '
'Protocol Meta '
'Tags.'}],
'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
'Facebook explained with '
'examples'}],
'http://ogp.me/ns#type': [{'@value': 'article'}],
'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}
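The 'opengraph' entry keeps its properties as a list of (name, value) pairs, since a name such as og:image may appear more than once. As a minimal sketch, the helper below groups those pairs into a dictionary of lists. The name og_properties_to_dict is hypothetical and not part of extruct; the sample is a shortened copy of the output above.

```python
from collections import defaultdict


def og_properties_to_dict(og_item):
    """Group (name, value) pairs from one 'opengraph' item by property name."""
    grouped = defaultdict(list)
    for name, value in og_item["properties"]:
        grouped[name].append(value)
    return dict(grouped)


# Shortened sample copied from the output above.
sample = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:locale", "en_US"),
        ("og:type", "article"),
        ("og:site_name", "Optimize Smart"),
    ],
}

og = og_properties_to_dict(sample)
print(og["og:type"])
```

Keeping every value in a list avoids silently dropping repeated properties.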
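Select syntaxes
You can choose which syntaxes to extract by passing a list of syntax names. Valid values are 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed, all syntaxes are extracted and returned: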
>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
'fb': 'https://#/2008/fbml',
'og': 'http://ogp.me/ns#'},
'properties': [ ('fb:app_id', '308540029359'),
('og:site_name', 'Songkick'),
('og:type', 'songkick-concerts:artist'),
('og:title', 'Elysian Fields'),
( 'og:description',
'Find out when Elysian Fields is next '
'playing live near you. List of all '
'Elysian Fields tour dates and concerts.'),
( 'og:url',
'https://www.songkick.com/artists/236156-elysian-fields'),
( 'og:image',
'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
'al:ios:app_store_id': [{'@value': '438690886'}],
'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
'http://ogp.me/ns#description': [ { '@value': 'Find out when '
'Elysian Fields is '
'next playing live '
'near you. List of '
'all Elysian '
'Fields tour dates '
'and concerts.'}],
'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'https://#/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
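Since syntax names have to match the valid values exactly, it can help to check a list before passing it as syntaxes. The sketch below is only an illustration: VALID_SYNTAXES is copied from the list of valid values above, and check_syntaxes is a hypothetical helper, not an extruct function.

```python
# Names copied from the list of valid values documented above.
VALID_SYNTAXES = ("microdata", "json-ld", "opengraph", "microformat", "rdfa", "dublincore")


def check_syntaxes(requested):
    """Return the requested names as a list, or raise ValueError on unknown names."""
    requested = list(requested)
    unknown = [name for name in requested if name not in VALID_SYNTAXES]
    if unknown:
        raise ValueError("unknown syntaxes: " + ", ".join(unknown))
    return requested


syntaxes = check_syntaxes(["microdata", "opengraph", "rdfa"])
print(syntaxes)
```

The returned list can then be passed as the syntaxes argument shown in the example above.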
Alternatively, if you have already parsed the HTML before calling extruct, you can pass the tree instead of the HTML string:
>>> # using the request from the previous example
>>> base_url = get_base_url(r.text, r.url)
>>> from extruct.utils import parse_html
>>> tree = parse_html(r.text)
>>> data = extruct.extract(tree, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
The Microformat syntax does not support HTML trees, so you need to pass an HTML string for it.
Uniform
Another option is to uniform the output of the microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure:
{'@context': 'http://example.com',
'@type': 'example_type',
/* All other properties as keys here */
}
To do so, set uniform=True when calling extract. It defaults to False for backward compatibility. Here is the same example as before, with uniform set to True:
>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
'fb': 'https://#/2008/fbml',
'og': 'http://ogp.me/ns#'},
'@type': 'songkick-concerts:artist',
'fb:app_id': '308540029359',
'og:description': 'Find out when Elysian Fields is next '
'playing live near you. List of all '
'Elysian Fields tour dates and concerts.',
'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
'og:site_name': 'Songkick',
'og:title': 'Elysian Fields',
'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
'al:ios:app_store_id': [{'@value': '438690886'}],
'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
'http://ogp.me/ns#description': [ { '@value': 'Find out when '
'Elysian Fields is '
'next playing live '
'near you. List of '
'all Elysian '
'Fields tour dates '
'and concerts.'}],
'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
'https://#/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}
NB: the rdfa structure is not uniformed yet.
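To make the mapping concrete, the sketch below rebuilds a uniform 'opengraph' item from a non-uniform one, following the two outputs shown in this document: 'namespace' becomes '@context', og:type becomes '@type', and the remaining properties become keys. This is not extruct's implementation. The function name uniform_opengraph is hypothetical, and keeping the first value when a name repeats is an assumption.

```python
def uniform_opengraph(og_item):
    """Approximate the uniform layout for one non-uniform 'opengraph' item."""
    result = {"@context": og_item["namespace"]}
    for name, value in og_item["properties"]:
        if name == "og:type":
            result["@type"] = value
        else:
            # Assumption: when a property name repeats, the first value is kept.
            result.setdefault(name, value)
    return result


# Shortened sample copied from the non-uniform Songkick output.
plain = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:site_name", "Songkick"),
        ("og:type", "songkick-concerts:artist"),
        ("og:title", "Elysian Fields"),
    ],
}

uniform = uniform_opengraph(plain)
print(uniform["@type"])
```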
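Returning HTML node
It is also possible to get a reference to the HTML node of each extracted metadata item. This feature is supported only by the microdata syntax.
To use it, set the return_html_node option of the extract method to True. As a result, each item in the result includes an additional "htmlNode" key. Each node is of type lxml.etree.Element: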
>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
'Not your thin sticky pad, '
'No-Muv is truly the best!',
'image': ['', ''],
'name': ['No-Muv', 'No-Muv'],
'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
'properties': { 'availability': 'http://schema.org/InStock',
'price': 'Price: '
'$45'},
'type': 'http://schema.org/Offer'},
{ 'htmlNode': <Element div at 0x7f10f8e60f48>,
'properties': { 'availability': 'http://schema.org/InStock',
'price': '(Select '
'Size/Shape '
'for '
'Pricing)'},
'type': 'http://schema.org/Offer'}],
'ratingValue': ['5.00', '5.00']},
'type': 'http://schema.org/Product'}]}
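Element objects cannot be passed to json.dumps, so a result that contains 'htmlNode' keys needs some cleanup before it is serialized. A minimal sketch, assuming you simply want to drop the nodes: drop_html_nodes is a hypothetical helper, and object() stands in for the lxml elements so the example runs without lxml.

```python
import json


def drop_html_nodes(value):
    """Return a copy of extracted data without any 'htmlNode' keys."""
    if isinstance(value, dict):
        return {k: drop_html_nodes(v) for k, v in value.items() if k != "htmlNode"}
    if isinstance(value, list):
        return [drop_html_nodes(v) for v in value]
    return value


# object() is a placeholder for the lxml elements shown in the output above.
sample = {
    "microdata": [
        {
            "htmlNode": object(),
            "type": "http://schema.org/Product",
            "properties": {
                "name": ["No-Muv", "No-Muv"],
                "offers": [
                    {
                        "htmlNode": object(),
                        "type": "http://schema.org/Offer",
                        "properties": {"availability": "http://schema.org/InStock"},
                    }
                ],
            },
        }
    ]
}

clean = drop_html_nodes(sample)
print(json.dumps(clean, indent=2))
```

The helper walks nested items as well, because offers and other nested items carry their own 'htmlNode' key.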
Single extractors
You can also use each extractor individually. See below.
Microdata extraction
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
... <head>
... <title>Photo gallery</title>
... </head>
... <body>
... <h1>My photos</h1>
... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
... <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
... <figcaption itemprop="title">The house I found.</figcaption>
... </figure>
... <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
... <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
... <figcaption itemprop="title">The mailbox.</figcaption>
... </figure>
... <footer>
... <p id="licenses">All images licensed under the <a itemprop="license"
... href="https://open-source.org.cn/licenses/mit-license.php">MIT
... license</a>.</p>
... </footer>
... </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html, base_url='http://www.example.com/')
>>> pp.pprint(data)
[{'properties': {'license': 'https://open-source.org.cn/licenses/mit-license.php',
'title': 'The house I found.',
'work': 'http://www.example.com/images/house.jpeg'},
'type': 'http://n.whatwg.org/work'},
{'properties': {'license': 'https://open-source.org.cn/licenses/mit-license.php',
'title': 'The mailbox.',
'work': 'http://www.example.com/images/mailbox.jpeg'},
'type': 'http://n.whatwg.org/work'}]
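Microdata items can nest other items inside 'properties', so picking out every item of one type takes a recursive walk. The sketch below is one way to do it. iter_items is a hypothetical helper, not part of extruct, and the sample is copied from the output above.

```python
def iter_items(items, wanted_type):
    """Yield microdata items of wanted_type, including items nested in properties."""
    for item in items:
        if not isinstance(item, dict):
            continue
        types = item.get("type", [])
        if isinstance(types, str):
            types = [types]
        if wanted_type in types:
            yield item
        for value in item.get("properties", {}).values():
            nested = value if isinstance(value, list) else [value]
            yield from iter_items(nested, wanted_type)


data = [
    {"properties": {"title": "The house I found.",
                    "work": "http://www.example.com/images/house.jpeg"},
     "type": "http://n.whatwg.org/work"},
    {"properties": {"title": "The mailbox.",
                    "work": "http://www.example.com/images/mailbox.jpeg"},
     "type": "http://n.whatwg.org/work"},
]

titles = [item["properties"]["title"] for item in iter_items(data, "http://n.whatwg.org/work")]
print(titles)
```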
JSON-LD extraction
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
... <head>
... <title>Some Person Page</title>
... </head>
... <body>
... <h1>This guys</h1>
... <script type="application/ld+json">
... {
... "@context": "http://schema.org",
... "@type": "Person",
... "name": "John Doe",
... "jobTitle": "Graduate research assistant",
... "affiliation": "University of Dreams",
... "additionalName": "Johnny",
... "url": "http://www.example.com",
... "address": {
... "@type": "PostalAddress",
... "streetAddress": "1234 Peach Drive",
... "addressLocality": "Wonderland",
... "addressRegion": "Georgia"
... }
... }
... </script>
... </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
'@type': 'Person',
'additionalName': 'Johnny',
'address': {'@type': 'PostalAddress',
'addressLocality': 'Wonderland',
'addressRegion': 'Georgia',
'streetAddress': '1234 Peach Drive'},
'affiliation': 'University of Dreams',
'jobTitle': 'Graduate research assistant',
'name': 'John Doe',
'url': 'http://www.example.com'}]
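Conceptually, JSON-LD extraction means finding each script element of type application/ld+json and decoding its body as JSON. The standard-library sketch below illustrates that idea only. It is not how extruct implements JsonLdExtractor, and the class name JsonLdScriptParser is hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect and decode the bodies of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.items.append(json.loads("".join(self._buffer)))
            self._buffer = None


html = """<html><body>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "John Doe"}
</script>
</body></html>"""

parser = JsonLdScriptParser()
parser.feed(html)
print(parser.items)
```

A real page may contain malformed JSON, which this sketch does not try to recover from: json.loads would raise an error.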
RDFa extraction (experimental)
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
'parsers will not be available.')
>>>
>>> html = """<html>
... <head>
... ...
... </head>
... <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
... <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
... <h2 property="dc:title">The trouble with Bob</h2>
... ...
... <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
... <div property="schema:articleBody">
... <p>The trouble with Bob is that he takes much better photos than I do:</p>
... </div>
... ...
... </div>
... </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
'@type': ['http://schema.org/BlogPosting'],
'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
'http://schema.org/articleBody': [{'@value': '\n'
' The trouble with Bob '
'is that he takes much better '
'photos than I do:\n'
' '}],
'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]
You'll get a list of expanded JSON-LD nodes.
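In the expanded form, every predicate maps to a list of small dictionaries holding either a literal under '@value' or a link under '@id'. If plain values are enough for your use, a helper along these lines can flatten a node. simplify_node is a hypothetical name, and the sample is a shortened copy of the output above.

```python
def simplify_node(node):
    """Map each predicate of an expanded JSON-LD node to a list of plain values."""
    simple = {}
    for key, values in node.items():
        if key == "@id":
            simple[key] = values
        elif key == "@type":
            simple[key] = list(values)
        else:
            # Each entry holds either a literal ('@value') or a link ('@id').
            simple[key] = [v.get("@value", v.get("@id")) for v in values]
    return simple


node = {
    "@id": "http://www.example.com/alice/posts/trouble_with_bob",
    "@type": ["http://schema.org/BlogPosting"],
    "http://purl.org/dc/terms/creator": [{"@id": "http://www.example.com/index.html#me"}],
    "http://purl.org/dc/terms/title": [{"@value": "The trouble with Bob"}],
}

simple = simplify_node(node)
print(simple["http://purl.org/dc/terms/title"])
```

Flattening loses the distinction between literals and links, so keep the expanded nodes when that distinction matters.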
Open Graph extraction
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://#/2008/fbml">
... <head>
... <title>Himanshu's Open Graph Protocol</title>
... <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
... <meta http-equiv="Content-Language" content="en-us" />
... <link rel="stylesheet" type="text/css" href="event-education.css" />
... <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
... <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
... <meta property="og:type" content="article"/>
... <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
... <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
... <meta property="fb:admins" content="himanshu160"/>
... <meta property="og:site_name" content="Event Education"/>
... <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
... </head>
... <body>
... <div id="fb-root"></div>
... <script>(function(d, s, id) {
... var js, fjs = d.getElementsByTagName(s)[0];
... if (d.getElementById(id)) return;
... js = d.createElement(s); js.id = id;
... js.src = "//#/en_US/all.js#xfbml=1&appId=501839739845103";
... fjs.parentNode.insertBefore(js, fjs);
... }(document, 'script', 'facebook-jssdk'));</script>
... </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
"og": "http://ogp.me/ns#"
},
"properties": [
[
"og:title",
"Himanshu's Open Graph Protocol"
],
[
"og:type",
"article"
],
[
"og:url",
"https://www.eventeducation.com/test.php"
],
[
"og:image",
"https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
],
[
"og:site_name",
"Event Education"
],
[
"og:description",
"Event Education provides free courses on event planning and management to event professionals worldwide."
]
]
}]
Microformat extraction
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://#/2008/fbml">
... <head>
... <title>Himanshu's Open Graph Protocol</title>
... <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
... <meta http-equiv="Content-Language" content="en-us" />
... <link rel="stylesheet" type="text/css" href="event-education.css" />
... <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
... <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
... <article class="h-entry">
... <h1 class="p-name">Microformats are amazing</h1>
... <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
... on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
... <p class="p-summary">In which I extoll the virtues of using microformats.</p>
... <div class="e-content">
... <p>Blah blah blah</p>
... </div>
... </article>
... </head>
... <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
"h-entry"
],
"properties": {
"name": [
"Microformats are amazing"
],
"author": [
{
"type": [
"h-card"
],
"properties": {
"name": [
"W. Developer"
],
"url": [
"http://example.com"
]
},
"value": "W. Developer"
}
],
"published": [
"2013-06-13 12:00:00"
],
"summary": [
"In which I extoll the virtues of using microformats."
],
"content": [
{
"html": "\n<p>Blah blah blah</p>\n",
"value": "\nBlah blah blah\n"
}
]
}
}]
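In this structure every property is a list, and nested items or e-* properties are dictionaries that carry their text under 'value'. A small accessor, sketched here under the hypothetical name first_value, makes the common case of reading one plain-text value less verbose. The sample is a shortened copy of the output above.

```python
def first_value(item, name, default=None):
    """Return the first value of a microformats2 property as plain text."""
    values = item.get("properties", {}).get(name, [])
    if not values:
        return default
    value = values[0]
    # Nested items and e-* properties are dicts that carry a plain-text 'value'.
    if isinstance(value, dict):
        return value.get("value", default)
    return value


entry = {
    "type": ["h-entry"],
    "properties": {
        "name": ["Microformats are amazing"],
        "author": [
            {
                "type": ["h-card"],
                "properties": {"name": ["W. Developer"], "url": ["http://example.com"]},
                "value": "W. Developer",
            }
        ],
        "published": ["2013-06-13 12:00:00"],
    },
}

print(first_value(entry, "name"))
print(first_value(entry, "author"))
```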
DublinCore extraction
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML <meta> elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
'content': 'Expressing Dublin Core\n'
'in HTML/XHTML meta and link elements',
'lang': 'en',
'name': 'DC.title'},
{ 'URI': 'http://purl.org/dc/elements/1.1/creator',
'content': 'Andy Powell, UKOLN, University of Bath',
'name': 'DC.creator'},
{ 'URI': 'http://purl.org/dc/elements/1.1/identifier',
'content': 'http://dublincore.org/documents/dcq-html/',
'name': 'DC.identifier',
'scheme': 'DCTERMS.URI'},
{ 'URI': 'http://purl.org/dc/elements/1.1/format',
'content': 'text/html',
'name': 'DC.format',
'scheme': 'DCTERMS.IMT'},
{ 'URI': 'http://purl.org/dc/elements/1.1/type',
'content': 'Text',
'name': 'DC.type',
'scheme': 'DCTERMS.DCMIType'}],
'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
'DCTERMS': 'http://purl.org/dc/terms/'},
'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
'content': '2003-11-01',
'name': 'DCTERMS.issued',
'scheme': 'DCTERMS.W3CDTF'},
{ 'URI': 'http://purl.org/dc/terms/abstract',
'content': 'This document describes how\n'
'qualified Dublin Core metadata can be encoded\n'
'in HTML/XHTML <meta> elements',
'name': 'DCTERMS.abstract'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DC.Date.modified'},
{ 'URI': 'http://purl.org/dc/terms/modified',
'content': '2001-07-18',
'name': 'DCTERMS.modified'},
{ 'URI': 'http://purl.org/dc/terms/replaces',
'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
'hreflang': 'en',
'rel': 'DCTERMS.replaces'}]}]
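The result separates 'elements' from 'terms', and one URI can occur more than once (DC.Date.modified and DCTERMS.modified both resolve to the same term above). As a sketch, the hypothetical helper dublincore_by_uri below indexes both lists by URI. The sample is a shortened copy of the output above.

```python
def dublincore_by_uri(dc_item):
    """Index 'elements' and 'terms' of one Dublin Core item by their URI."""
    index = {}
    for entry in dc_item.get("elements", []) + dc_item.get("terms", []):
        # Meta tags carry 'content'; link tags carry 'href' instead.
        value = entry.get("content", entry.get("href"))
        index.setdefault(entry["URI"], []).append(value)
    return index


dc_item = {
    "elements": [
        {"URI": "http://purl.org/dc/elements/1.1/creator",
         "content": "Andy Powell, UKOLN, University of Bath",
         "name": "DC.creator"},
    ],
    "namespaces": {"DC": "http://purl.org/dc/elements/1.1/",
                   "DCTERMS": "http://purl.org/dc/terms/"},
    "terms": [
        {"URI": "http://purl.org/dc/terms/modified",
         "content": "2001-07-18", "name": "DC.Date.modified"},
        {"URI": "http://purl.org/dc/terms/modified",
         "content": "2001-07-18", "name": "DCTERMS.modified"},
        {"URI": "http://purl.org/dc/terms/replaces",
         "href": "http://dublincore.org/documents/2000/08/15/dcq-html/",
         "hreflang": "en", "rel": "DCTERMS.replaces"},
    ],
}

index = dublincore_by_uri(dc_item)
print(index["http://purl.org/dc/terms/modified"])
```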
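Command Line Tool
extruct provides a command line tool that lets you fetch a page and extract its metadata directly from the command line.
Dependencies
The command line tool depends on requests, which is not installed by default when you install extruct. To use the command line tool, install extruct with the cli extra: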
pip install 'extruct[cli]'
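Usage
extruct "http://example.com"
Downloads "http://example.com" and outputs the Microdata, JSON-LD, RDFa, Open Graph and Microformat metadata to stdout.
Supported parameters
By default, the command line tool tries to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). To restrict the output to one of these, or to a subset, pass their names through the --syntaxes argument.
For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":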
extruct "http://example.com" --syntaxes microdata json-ld
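NB: the syntax names passed must be among these: microdata, json-ld, rdfa, opengraph, microformat
Development version
mkvirtualenv extruct
pip install -r requirements-dev.txt
Tests
Run the tests in the current environment:
py.test tests
Use tox to run the tests with different Python versions:
tox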
Hashes for extruct-0.17.0.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | a94c0be5b5fd95a8370204ecc02687bd27845d536055d8d1c69a0a30da0420c7 |
| MD5 | 07c96bc8744a8f282844e80f6f2b2b93 |
| BLAKE2b-256 | 0f11de0fd08fb77e2d079efce6e9da679327a26594c0a2b30bdf3517273ddc88 |
Hashes for extruct-0.17.0-py2.py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5f1d8e307fbb0c41f64ce486ddfaf16dc67e4b8f6e9570c57b123409ee37a307 |
| MD5 | 645292797ec782f1383700493d58954d |
| BLAKE2b-256 | 60f003a24fd454cf7708f307b26b1c110ca39c2d46a249cc7f9be5738d48168b |