A helper library full of URL-related heuristics.

Ural

A Python helper library full of URL-related heuristics.

Installation

You can install ural with pip using the following command:

pip install ural

How to cite?

ural is published on Zenodo as https://doi.org/10.5281/zenodo.8160139

You can cite it thusly:

Guillaume Plique, Jules Farjas, Oubine Perrin, Benjamin Ooghe-Tabanou, Martin Delabre, Pauline Breteau, Jean Descamps, Béatrice Mazoyer, Amélie Pellé, Laura Miguel, & César Pichon. Ural, a python helper library full of URL-related heuristics. (2018). Zenodo. https://doi.org/10.5281/zenodo.8160139

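Usage

Generic functions

canonicalize_url
could_be_html
could_be_rss
ensure_protocol
fingerprint_hostname
fingerprint_url
force_protocol
format_url
get_domain_name
get_hostname
get_fingerprinted_hostname
get_normalized_hostname
has_special_host
has_valid_suffix
has_valid_tld
infer_redirection
is_homepage
is_shortened_url
is_special_host
is_typo_url
is_url
is_valid_tld
links_from_html
normalize_hostname
normalize_url
should_follow_href
should_resolve
split_suffix
strip_protocol
urlpathsplit
urls_from_html
urls_from_text

Utilities

Upgrading suffixes and TLDs

Classes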

HostnameTrieSet
- #.add
- #.match

LRU-related functions (What is a LRU?)

lru.url_to_lru
lru.lru_to_url
lru.lru_stems
lru.canonicalized_lru_stems
lru.normalized_lru_stems
lru.fingerprinted_lru_stems
lru.serialize_lru
lru.unserialize_lru

LRU-related classes

LRUTrie
- #.set
- #.set_lru
- #.match
- #.match_lru
CanonicalizedLRUTrie
NormalizedLRUTrie
FingerprintedLRUTrie

Platform-specific functions

facebook
google
instagram
telegram
twitter
youtube

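Differences between canonicalize_url, normalize_url and fingerprint_url

ural comes with three different url deduplication schemes, targeted at different use-cases and listed below in ascending order of aggressiveness:

canonicalize_url: we clean the url by performing the lightweight preprocessing web browsers typically apply before hitting it, e.g. lowercasing the hostname, decoding punycode, ensuring we have a protocol, dropping leading and trailing whitespace etc. The cleaned url is guaranteed to still lead to the same place.
normalize_url: we apply more advanced preprocessing that drops parts of the url that are irrelevant to where it leads, such as technical artifacts and SEO tricks. For instance, we drop typical query items used by marketing campaigns, reorder the query items, infer some redirections, and strip trailing slashes or fragments when advisable etc. At that point, the url should be clean enough to support meaningful statistical aggregation when counting, while it very probably remains valid and still leads to the same place, at least if the target server follows the most common conventions.
fingerprint_url: we go a step further and perform destructive preprocessing that cannot guarantee the resulting url is still valid. But the result may be even more useful for statistical aggregation, especially when counting urls from large platforms having multiple domains (e.g. facebook.com, facebook.fr etc.).

Function	Use-cases	Url validity	Deduplication strength
canonicalize_url	web crawler	Technically the same	+
normalize_url	web crawler, statistical aggregation	Probably the same	++
fingerprint_url	statistical aggregation	Potentially invalid	+++

Example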

from ural import canonicalize_url, normalize_url, fingerprint_url

url = 'https://www.facebook.com:80/index.html?utm_campaign=3&id=34'

canonicalize_url(url)
>>> 'https://www.facebook.com/index.html?utm_campaign=3&id=34'
# The same url, cleaned up a little

normalize_url(url)
>>> 'facebook.com?id=34'
# Still a valid url, with implicit protocol, where all the cruft has been discarded

fingerprint_url(url, strip_suffix=True)
>>> 'facebook?id=34'
# Not a valid url anymore, but useful to match more potential
# candidates such as: http://facebook.co.uk/index.html?id=34
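
To make the deduplication idea above more concrete, here is a minimal standard-library sketch of the kind of cleanup a normalization step performs. It is not ural's implementation: the toy_normalize name and the TRACKING_PREFIXES list are assumptions made up for illustration, and the real normalize_url applies many more heuristics.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical, deliberately tiny list of marketing-related prefixes
TRACKING_PREFIXES = ('utm_',)

def toy_normalize(url):
    """Lowercase the host, drop tracking query items, sort the rest, drop the fragment."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    query.sort()
    return urlunsplit((
        parts.scheme,
        parts.netloc.lower(),
        parts.path,
        urlencode(query),
        ''  # the fragment is discarded
    ))

a = toy_normalize('https://LeMonde.fr/article.html?utm_source=twitter&id=34&b=1')
b = toy_normalize('https://lemonde.fr/article.html?b=1&id=34#comments')

# Both variants collapse to the same key, so they would be counted together
print(a == b)
```

Two urls shared in slightly different forms end up with the same key, which is what makes counting meaningful.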

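canonicalize_url

Function returning a clean and safe version of the url by performing the same kind of preprocessing as web browsers.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.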

from ural import canonicalize_url

canonicalize_url('www.LEMONDE.fr')
>>> 'https://lemonde.fr'

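Arguments

url string: url to canonicalize.
quoted ?bool [False]: by default the function will unquote the url as much as possible while keeping it safe. If this kwarg is set to True, the function will instead quote the url as much as possible while ensuring nothing gets double-quoted.
default_protocol ?str [https]: default protocol to add when the url has none.
strip_fragment ?bool [False]: whether to strip the url's fragment.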

could_be_html

Function returning whether the url could return HTML.

from ural import could_be_html

could_be_html('https://www.lemonde.fr')
>>> True

could_be_html('https://www.lemonde.fr/articles/page.php')
>>> True

could_be_html('https://www.lemonde.fr/data.json')
>>> False

could_be_html('https://www.lemonde.fr/img/figure.jpg')
>>> False

could_be_rss

Function returning whether the given url could be a RSS feed url.

from ural import could_be_rss

could_be_rss('https://www.lemonde.fr/cyclisme/rss_full.xml')
>>> True

could_be_rss('https://www.lemonde.fr/cyclisme/')
>>> False

could_be_rss('https://www.ecorce.org/spip.php?page=backend')
>>> True

could_be_rss('https://feeds.feedburner.com/helloworld')
>>> True

ensure_protocol

Function checking whether the url has a protocol, and adding the given one if there is none.

from ural import ensure_protocol

ensure_protocol('www.lemonde.fr', protocol='https')
>>> 'https://www.lemonde.fr'

Arguments

url string: url to format.
protocol string: protocol to use if there is none in the url. Is 'http' by default.
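
The behaviour described above can be approximated with a short sketch. This is only an illustration under simplifying assumptions (the toy_ensure_protocol name and the regular expression are made up here), not the way ural detects protocols.

```python
import re

# A scheme such as "http://" or "ftp://" at the very start of the string
PROTOCOL_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.-]*://')

def toy_ensure_protocol(url, protocol='http'):
    """Prepend `protocol` only when the url does not already carry one."""
    if PROTOCOL_RE.match(url):
        return url
    # Stripping leading slashes also covers protocol-relative urls ("//lemonde.fr")
    return protocol + '://' + url.lstrip('/')

print(toy_ensure_protocol('www.lemonde.fr', protocol='https'))
print(toy_ensure_protocol('ftp://lemonde.fr'))
```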

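fingerprint_hostname

Function returning the "fingerprinted" version of the given hostname by stripping subdomains that are irrelevant for statistical aggregation. Note that this function is more aggressive than normalize_hostname and that the resulting hostname might not be valid anymore.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.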

from ural import fingerprint_hostname

fingerprint_hostname('www.lemonde.fr')
>>> 'lemonde.fr'

fingerprint_hostname('fr-FR.facebook.com')
>>> 'facebook.com'

fingerprint_hostname('fr-FR.facebook.com', strip_suffix=True)
>>> 'facebook'

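Arguments

hostname string: target hostname.
strip_suffix ?bool [False]: whether to strip the hostname suffix such as .com or .co.uk. This can be useful to aggregate the different domains of a same platform.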

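fingerprint_url

Function returning the "fingerprinted" version of the given url, which can be useful for statistical aggregation. Note that this function is more aggressive than normalize_url and that the resulting url might not be valid anymore.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.

from ural import fingerprint_url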

fingerprint_url('www.lemonde.fr/article.html')
>>> 'lemonde.fr/article.html'

fingerprint_url('fr-FR.facebook.com/article.html')
>>> 'facebook.com/article.html'

fingerprint_url('fr-FR.facebook.com/article.html', strip_suffix=True)
>>> 'facebook/article.html'

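Arguments

url string: target url.
strip_suffix ?bool [False]: whether to strip the hostname suffix such as .com or .co.uk. This can be useful to aggregate the different domains of a same platform.
platform_aware ?bool [False]: whether to take well-known platforms supported by ural (such as facebook, youtube etc.) into account when normalizing the url.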

force_protocol

Function force-replacing the protocol of the given url.

from ural import force_protocol

force_protocol('https://www2.lemonde.fr', protocol='ftp')
>>> 'ftp://www2.lemonde.fr'

Arguments

url string: url to format.
protocol string: protocol wanted in the output url. Is 'http' by default.

format_url

Function formatting a url given some typical parameters.

from ural import format_url

format_url(
  'https://lemonde.fr',
  path='/article.html',
  args={'id': '48675'},
  fragment='title-2'
)
>>> 'https://lemonde.fr/article.html?id=48675#title-2'

# Path can be given as an iterable
format_url('https://lemonde.fr', path=['articles', 'one.html'])
>>> 'https://lemonde.fr/articles/one.html'

# Extension
format_url('https://lemonde.fr', path=['article'], ext='html')
>>> 'https://lemonde.fr/article.html'

# Query args are formatted/quoted and/or skipped if None/False
format_url(
  "http://lemonde.fr",
  path=["business", "articles"],
  args={
    "hello": "world",
    "number": 14,
    "boolean": True,
    "skipped": None,
    "also-skipped": False,
    "quoted": "test=ok",
  },
  fragment="#test",
)
>>> 'http://lemonde.fr/business/articles?boolean&hello=world&number=14&quoted=test%3Dok#test'

# Custom argument value formatting
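def format_arg_value(key, value):
  if key == 'ids':
    # Items must be strings before they can be joined
    return ','.join(str(v) for v in value)

  # Any other value is returned untouched
  return value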

format_url(
  'https://lemonde.fr',
  args={'ids': [1, 2]},
  format_arg_value=format_arg_value
)
>>> 'https://lemonde.fr?ids=1%2C2'

# Formatter class
from ural import URLFormatter

formatter = URLFormatter('https://lemonde.fr', args={'id': 'one'})

formatter(path='/article.html')
>>> 'https://lemonde.fr/article.html?id=one'

# same as:
formatter.format(path='/article.html')
>>> 'https://lemonde.fr/article.html?id=one'

# Query arguments are merged
formatter(path='/article.html', args={"user_id": "two"})
>>> 'https://lemonde.fr/article.html?id=one&user_id=two'

# Easy subclassing
class MyCustomFormatter(URLFormatter):
  BASE_URL = 'https://lemonde.fr/api'

  def format_api_call(self, token):
    return self.format(args={'token': token})

formatter = MyCustomFormatter()

formatter.format_api_call('2764753')
>>> 'https://lemonde.fr/api?token=2764753'

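Arguments

base_url str: base url.
path ?str|list: the url's path.
args ?dict: query arguments as a dictionary.
format_arg_value ?callable: function taking a query argument key and value and returning the formatted value.
fragment ?str: the url's fragment.
ext ?str: path extension such as .html.

get_domain_name

Function returning a url's domain name. This function is of course tld-aware and will return None if no valid domain name can be found.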

from ural import get_domain_name

get_domain_name('https://facebook.com/path')
>>> 'facebook.com'

get_hostname

Function returning the full hostname of the given url. It can work on scheme-less urls.

from ural import get_hostname

get_hostname('https://www.facebook.com/path')
>>> 'www.facebook.com'

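get_fingerprinted_hostname

Function returning the "fingerprinted" hostname of the given url by stripping subdomains that are irrelevant for statistical aggregation. Note that this function is more aggressive than get_normalized_hostname and that the resulting hostname might not be valid anymore.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.

from ural import get_fingerprinted_hostname

get_fingerprinted_hostname('https://www.lemonde.fr/article.html')
>>> 'lemonde.fr'

get_fingerprinted_hostname('https://fr-FR.facebook.com/article.html')
>>> 'facebook.com'

get_fingerprinted_hostname('https://fr-FR.facebook.com/article.html', strip_suffix=True)
>>> 'facebook'

Arguments

url string: target url.
strip_suffix ?bool [False]: whether to strip the hostname suffix such as .com or .co.uk. This can be useful to aggregate the different domains of a same platform.

get_normalized_hostname

Function returning the normalized hostname of the given url, i.e. without usually irrelevant subdomains etc. It works a lot like normalize_url.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.

from ural import get_normalized_hostname

get_normalized_hostname('https://www.facebook.com/path')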
>>> 'facebook.com'

get_normalized_hostname('http://fr-FR.facebook.com/path')
>>> 'facebook.com'

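Arguments

url str: target url.
infer_redirection bool [True]: whether to attempt resolving common redirects by leveraging well-known GET parameters.
normalize_amp ?bool [True]: whether to attempt to normalize Google AMP subdomains.

has_special_host

Function returning whether the given url looks like it has a special host.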

from ural import has_special_host

has_special_host('http://104.19.154.83')
>>> True

has_special_host('http://youtube.com')
>>> False

has_valid_suffix

Function returning whether the given url has a valid suffix as per Mozilla's Public Suffix List.

from ural import has_valid_suffix

has_valid_suffix('http://lemonde.fr')
>>> True

has_valid_suffix('http://lemonde.doesnotexist')
>>> False

# Also works with hostnames
has_valid_suffix('lemonde.fr')
>>> True

has_valid_tld

Function returning whether the given url has a valid Top Level Domain (TLD) as per IANA's list.

from ural import has_valid_tld

has_valid_tld('http://lemonde.fr')
>>> True

has_valid_tld('http://lemonde.doesnotexist')
>>> False

# Also works with hostnames
has_valid_tld('lemonde.fr')
>>> True

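infer_redirection

Function attempting to find obvious clues in the given url that it is in fact a redirection, and resolving the redirection automatically without firing any HTTP request. If nothing is found, the given url is returned as-is.

The function is recursive by default and will keep inferring redirections until none is found, but you can disable this behavior if you need to.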

from ural import infer_redirection

infer_redirection('https://www.google.com/url?sa=t&source=web&rct=j&url=https%3A%2F%2Fm.youtube.com%2Fwatch%3Fv%3D4iJBsjHMviQ&ved=2ahUKEwiBm-TO3OvkAhUnA2MBHQRPAR4QwqsBMAB6BAgDEAQ&usg=AOvVaw0i7y2_fEy3nwwdIZyo_qug')
>>> 'https://m.youtube.com/watch?v=4iJBsjHMviQ'

infer_redirection('https://test.com?url=http%3A%2F%2Flemonde.fr%3Fnext%3Dhttp%253A%252F%252Ftarget.fr')
>>> 'http://target.fr'

infer_redirection(
  'https://test.com?url=http%3A%2F%2Flemonde.fr%3Fnext%3Dhttp%253A%252F%252Ftarget.fr',
  recursive=False
)
>>> 'http://lemonde.fr?next=http%3A%2F%2Ftarget.fr'
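
The recursive behaviour shown above can be sketched with the standard library only. This is a simplified illustration, not ural's code: the toy_infer_redirection name and the REDIRECT_PARAMS tuple are assumptions, and the real function relies on a richer set of clues.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical list of query parameters that often carry a redirection target
REDIRECT_PARAMS = ('url', 'next', 'u')

def toy_infer_redirection(url, recursive=True):
    """Return the url embedded in a redirect-like query parameter, if any."""
    query = parse_qs(urlsplit(url).query)

    for name in REDIRECT_PARAMS:
        for candidate in query.get(name, []):
            # parse_qs already percent-decoded the value once
            if candidate.startswith(('http://', 'https://')):
                if recursive:
                    return toy_infer_redirection(candidate)
                return candidate

    # No clue found: return the url untouched
    return url

nested = 'https://test.com?url=http%3A%2F%2Flemonde.fr%3Fnext%3Dhttp%253A%252F%252Ftarget.fr'
print(toy_infer_redirection(nested))
print(toy_infer_redirection(nested, recursive=False))
```

Each recursion level removes one layer of percent-encoding, which is why the doubly-encoded target only surfaces on the second pass.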

is_homepage

Function returning whether the given url is probably a website's homepage, based on its path.

from ural import is_homepage

is_homepage('http://lemonde.fr')
>>> True

is_homepage('http://lemonde.fr/index.html')
>>> True

is_homepage('http://lemonde.fr/business/article5.html')
>>> False

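is_shortened_url

Function returning whether the given url is probably a shortened url. It works by matching the given url's domain against the most prominent shortener domains. So the result could be a false negative.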

from ural import is_shortened_url

is_shortened_url('http://lemonde.fr')
>>> False

is_shortened_url('http://bit.ly/1sNZMwL')
>>> True

is_special_host

Function returning whether the given hostname looks like a special host.

from ural import is_special_host

is_special_host('104.19.154.83')
>>> True

is_special_host('youtube.com')
>>> False
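
As a rough illustration of what "special" can mean here, the sketch below flags bare IP addresses and localhost. This definition is an assumption made for the example (as is the toy_is_special_host name); ural's own rules may differ.

```python
import ipaddress

def toy_is_special_host(hostname):
    """Assumption: a host is "special" when it is localhost or a bare IP address."""
    if hostname.lower() == 'localhost':
        return True

    try:
        # Raises ValueError for anything that is not an IPv4/IPv6 address
        ipaddress.ip_address(hostname)
    except ValueError:
        return False

    return True

print(toy_is_special_host('104.19.154.83'))
print(toy_is_special_host('youtube.com'))
```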

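is_typo_url

Function returning whether the given string is probably a typo. This function does not test whether the given string is a valid url. It works by matching the given url's TLD against the most prominent typo-like TLDs, or by matching the given string against the most prominent inclusive-language terminations. So the result could be a false negative.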

from ural import is_typo_url

is_typo_url('http://dirigeants.es')
>>> True

is_typo_url('https://www.instagram.com')
>>> False

is_url

Function returning whether the given string is a valid url.

from ural import is_url

is_url('https://www2.lemonde.fr')
>>> True

is_url('lemonde.fr/economie/article.php', require_protocol=False)
>>> True

is_url('lemonde.falsetld/whatever.html', tld_aware=True)
>>> False

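Arguments

string string: string to test.
require_protocol bool [True]: whether the argument must have a protocol to be considered a url.
tld_aware bool [False]: whether to check that the url's tld actually exists.
allow_spaces_in_path bool [False]: whether to allow spaces in the url's path.
only_http_https bool [True]: whether to only allow the http and https protocols.

is_valid_tld

Function returning whether the given Top Level Domain (TLD) is valid as per IANA's list.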

from ural import is_valid_tld

is_valid_tld('.fr')
>>> True

is_valid_tld('com')
>>> True

is_valid_tld('.doesnotexist')
>>> False

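links_from_html

Function returning an iterator over the valid outgoing links present in the given HTML text.

This is a variant of urls_from_html suited to web crawlers. It can deduplicate the urls, canonicalize them, join them with a base url and filter out items that should not be followed, such as mailto: or javascript: href links. It will also skip any url equivalent to the given base url.

Note that this function is able to work seamlessly on both strings and bytes.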

from ural import links_from_html

html = b"""
<p>
  Hey! Check this site:
  <a href="https://medialab.sciencespo.fr/">médialab</a>
  And also this page:
  <a href="article.html">article</a>
  Or click on this:
  <a href="javascript:alert('hello');">link</a>
</p>
"""

for link in links_from_html('http://lemonde.fr', html):
    print(link)
>>> 'https://medialab.sciencespo.fr/'
>>> 'http://lemonde.fr/article.html'

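Arguments

base_url string: the HTML's url.
string string|bytes: html string or bytes.
encoding ?string [utf-8]: if given binary, this encoding will be used to decode the found urls.
canonicalize ?bool [False]: whether to canonicalize the urls using canonicalize_url.
strip_fragment ?bool [False]: whether to strip the url fragments when using canonicalize.
unique ?bool [False]: whether to deduplicate the urls.

normalize_hostname

Function normalizing the given hostname, i.e. stripping usually irrelevant subdomains etc. It works a lot like normalize_url.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.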

from ural import normalize_hostname

normalize_hostname('www.facebook.com')
>>> 'facebook.com'

normalize_hostname('fr-FR.facebook.com')
>>> 'facebook.com'

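normalize_url

Function normalizing the given url by stripping the parts that are usually non-discriminant, such as irrelevant query items or subdomains etc.

This is a very useful utility when attempting to match similar urls shared in slightly different forms, on social media for instance.

For more details about this, be sure to read the section of these docs on the differences between canonicalize_url, normalize_url and fingerprint_url.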

from ural import normalize_url

normalize_url('https://www2.lemonde.fr/index.php?utm_source=google')
>>> 'lemonde.fr'

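Arguments

url string: url to normalize.
infer_redirection ?bool [True]: whether to attempt resolving common redirects by leveraging well-known GET parameters.
fix_common_mistakes ?bool [True]: whether to attempt to fix common url mistakes.
normalize_amp ?bool [True]: whether to attempt to normalize Google AMP urls.
sort_query ?bool [True]: whether to sort query items.
strip_authentication ?bool [True]: whether to strip authentication.
strip_fragment ?bool|str ['except-routing']: whether to strip the url's fragment. If set to except-routing, the fragment is only stripped when it is not deemed to be js routing (js routing being recognized by the presence of a / in the fragment).
strip_index ?bool [True]: whether to strip trailing index.
strip_irrelevant_subdomains ?bool [False]: whether to strip irrelevant subdomains such as www etc.
strip_protocol ?bool [True]: whether to strip the url's protocol.
strip_trailing_slash ?bool [True]: whether to strip trailing slash.
quoted ?bool [False]: by default the function will unquote the url as much as possible while keeping it safe. If this kwarg is set to True, the function will instead quote the url as much as possible while ensuring nothing gets double-quoted.
platform_aware ?bool [False]: whether to take well-known platforms supported by ural (such as facebook, youtube etc.) into account when normalizing the url.

should_follow_href

Function returning whether the given href should be followed (usually from a crawler's context). This means it will filter out anchors, and urls having a scheme which is not http/https.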

from ural import should_follow_href

should_follow_href('#top')
>>> False

should_follow_href('http://lemonde.fr')
>>> True

should_follow_href('/article.html')
>>> True

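should_resolve

Function returning whether the given url looks like something you would want to resolve, because it will probably lead to some redirection.

It is quite similar to is_shortened_url but covers more ground since it also deals with url patterns which are not shortened per se.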

from ural import should_resolve

should_resolve('http://lemonde.fr')
>>> False

should_resolve('http://bit.ly/1sNZMwL')
>>> True

should_resolve('https://doi.org/10.4000/vertigo.26405')
>>> True

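split_suffix

Function splitting a hostname or a url's hostname into the domain part and the suffix part (while respecting Mozilla's Public Suffix List).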

from ural import split_suffix

split_suffix('http://www.bbc.co.uk/article.html')
>>> ('www.bbc', 'co.uk')

split_suffix('http://www.bbc.idontexist')
>>> None

split_suffix('lemonde.fr')
>>> ('lemonde', 'fr')
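
The principle behind suffix splitting can be sketched as a longest-suffix lookup. The snippet below is only an illustration: SUFFIXES is a tiny hypothetical excerpt (the real Public Suffix List has thousands of entries, plus wildcard and exception rules that are ignored here), toy_split_suffix is a made-up name, and only bare hostnames are handled.

```python
# Tiny hypothetical excerpt of a suffix list, for illustration only
SUFFIXES = {'fr', 'com', 'uk', 'co.uk'}

def toy_split_suffix(hostname):
    """Split a hostname into (domain part, suffix), preferring the longest known suffix."""
    labels = hostname.lower().split('.')

    # Starting at i = 1 tries the longest candidate suffix first
    # and guarantees that the domain part is never empty
    for i in range(1, len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in SUFFIXES:
            return '.'.join(labels[:i]), candidate

    return None

print(toy_split_suffix('www.bbc.co.uk'))
print(toy_split_suffix('www.bbc.idontexist'))
```

Trying the longest candidate first is what keeps co.uk from being split into a domain "co" and a suffix "uk".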

strip_protocol

Function removing the protocol from the url.

from ural import strip_protocol

strip_protocol('https://www2.lemonde.fr/index.php')
>>> 'www2.lemonde.fr/index.php'

Arguments

url string: url to format.

urlpathsplit

Function taking a url and returning its path, tokenized as a list.

from ural import urlpathsplit

urlpathsplit('http://lemonde.fr/section/article.html')
>>> ['section', 'article.html']

urlpathsplit('http://lemonde.fr/')
>>> []

# If you want to split a path directly
from ural import pathsplit

pathsplit('/section/articles/')
>>> ['section', 'articles']
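
For readers curious about what path tokenization boils down to, here is a minimal standard-library sketch. The toy_urlpathsplit name is made up, and the real function may treat edge cases (quoting, repeated slashes etc.) differently.

```python
from urllib.parse import urlsplit

def toy_urlpathsplit(url):
    """Return the non-empty segments of the url's path."""
    path = urlsplit(url).path
    # Filtering empty tokens absorbs leading, trailing and doubled slashes
    return [token for token in path.split('/') if token]

print(toy_urlpathsplit('http://lemonde.fr/section/article.html'))
print(toy_urlpathsplit('http://lemonde.fr/'))
```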

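urls_from_html

Function returning an iterator over the urls found in the links of the given HTML text.

Note that this function is able to work seamlessly on both strings and bytes.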

from ural import urls_from_html

html = """<p>Hey! Check this site: <a href="https://medialab.sciencespo.fr/">médialab</a></p>"""

for url in urls_from_html(html):
    print(url)
>>> 'https://medialab.sciencespo.fr/'

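Arguments

string string|bytes: html string or bytes.
encoding ?string [utf-8]: if given binary, this encoding will be used to decode the found urls.
errors ?string [strict]: policy on decode errors.

urls_from_text

Function returning an iterator over the urls present in the string argument. Only urls having a protocol are extracted.

Note that this function is somewhat markdown-aware and punctuation-aware.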

from ural import urls_from_text

text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"

for url in urls_from_text(text):
    print(url)

>>> 'https://medialab.sciencespo.fr/'
>>> 'https://github.com/'

Arguments

string string: source string.
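
To show why punctuation awareness matters, here is a deliberately naive sketch based on a regular expression. It is not how ural extracts urls: URL_RE, TRAILING_PUNCTUATION and toy_urls_from_text are assumptions made up for the example, and markdown handling is left out entirely.

```python
import re

# Naive pattern: a http(s) scheme followed by anything that is not whitespace, a quote or an angle bracket
URL_RE = re.compile(r'https?://[^\s<>"\']+')

# Characters that usually belong to the sentence rather than to the url
TRAILING_PUNCTUATION = '.,;:!?)]}'

def toy_urls_from_text(text):
    """Yield urls having a protocol, trimmed of trailing punctuation."""
    for match in URL_RE.finditer(text):
        yield match.group(0).rstrip(TRAILING_PUNCTUATION)

text = "Hey! Check this site: https://medialab.sciencespo.fr/, it looks really cool. They're developing many tools on https://github.com/"

for url in toy_urls_from_text(text):
    print(url)
```

Without the trailing-punctuation trim, the first url would wrongly keep the comma that follows it in the sentence.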

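Upgrading suffixes and TLDs

If you want to upgrade the package's data with respect to Mozilla's suffixes and IANA's TLDs, you can do so by running the following command: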

python -m ural upgrade

Or directly in your python code:

from ural.tld import upgrade

upgrade()

# Or if you want to patch runtime only this time, or regularly
# (for long running programs or to avoid rights issues etc.):
upgrade(transient=True)

HostnameTrieSet

Class implementing a hierarchic set of hostnames so you can efficiently query whether urls match hostnames in the set.

from ural import HostnameTrieSet

trie = HostnameTrieSet()

trie.add('lemonde.fr')
trie.add('business.lefigaro.fr')

trie.match('https://liberation.fr/article1.html')
>>> False

trie.match('https://lemonde.fr/article1.html')
>>> True

trie.match('https://www.lemonde.fr/article1.html')
>>> True

trie.match('https://lefigaro.fr/article1.html')
>>> False

trie.match('https://business.lefigaro.fr/article1.html')
>>> True
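
The hierarchical matching shown above can be pictured as a trie indexed by reversed hostname labels. The following sketch is an assumption about the general technique, not a copy of ural's class, and ToyHostnameSet is a made-up name.

```python
from urllib.parse import urlsplit

class ToyHostnameSet:
    """Set of hostnames stored as a trie of reversed labels (fr -> lemonde -> www)."""

    def __init__(self):
        self.root = {}

    def add(self, hostname):
        node = self.root
        for label in reversed(hostname.lower().split('.')):
            node = node.setdefault(label, {})
        node[None] = True  # the None key marks the end of a stored hostname

    def match(self, url):
        hostname = urlsplit(url).hostname or ''
        node = self.root
        for label in reversed(hostname.split('.')):
            if None in node:
                # A stored hostname is a parent of the queried one
                return True
            node = node.get(label)
            if node is None:
                return False
        return None in node

trie = ToyHostnameSet()
trie.add('lemonde.fr')
trie.add('business.lefigaro.fr')

print(trie.match('https://www.lemonde.fr/article1.html'))
print(trie.match('https://lefigaro.fr/article1.html'))
```

Reversing the labels puts the most general part of the hostname first, so a stored domain naturally matches all of its subdomains while a stored subdomain does not match its parent.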

#.add

Method adding a single hostname to the set.

from ural import HostnameTrieSet

trie = HostnameTrieSet()
trie.add('lemonde.fr')

Arguments

hostname string: hostname to add to the set.

#.match

Method returning whether the given url matches any of the set's hostnames.

from ural import HostnameTrieSet

trie = HostnameTrieSet()
trie.add('lemonde.fr')

trie.match('https://liberation.fr/article1.html')
>>> False

trie.match('https://lemonde.fr/article1.html')
>>> True

Arguments

url string|urllib.parse.SplitResult: url to match.

lru.url_to_lru

Function converting the given url to a serialized LRU.

from ural.lru import url_to_lru

url_to_lru('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> 's:http|t:8000|h:fr|h:lemonde|h:www|p:article|p:1234|p:index.html|q:field=value|f:2|'

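Arguments

url string: url to convert.
suffix_aware ?bool: whether to be mindful of suffixes when converting (e.g. considering "co.uk" as a single token).

lru.lru_to_url

Function converting the given serialized LRU or LRU stems back to a proper url.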
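
To clarify what a LRU is, here is a simplified sketch that reorders a url's parts from the most general to the most specific. The toy_lru_stems and toy_serialize_lru names are assumptions; the real functions also deal with suffix awareness, escaping and other edge cases ignored here.

```python
from urllib.parse import urlsplit

def toy_lru_stems(url):
    """Return url parts in hierarchical order, each prefixed by its kind."""
    parts = urlsplit(url)
    stems = []

    if parts.scheme:
        stems.append('s:' + parts.scheme)
    if parts.port is not None:
        stems.append('t:%d' % parts.port)
    if parts.hostname:
        # Hostname labels are reversed: the tld comes first
        for label in reversed(parts.hostname.split('.')):
            stems.append('h:' + label)
    for token in parts.path.split('/'):
        if token:
            stems.append('p:' + token)
    if parts.query:
        stems.append('q:' + parts.query)
    if parts.fragment:
        stems.append('f:' + parts.fragment)

    return stems

def toy_serialize_lru(stems):
    """Join stems with pipes, keeping a trailing pipe."""
    return '|'.join(stems) + '|'

stems = toy_lru_stems('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
print(toy_serialize_lru(stems))
```

Because every stem goes from general to specific, two urls of the same site share a common prefix, which is what makes prefix-based matching possible.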

from ural.lru import lru_to_url

lru_to_url('s:http|t:8000|h:fr|h:lemonde|h:www|p:article|p:1234|p:index.html|')
>>> 'http://www.lemonde.fr:8000/article/1234/index.html'

lru_to_url(['s:http', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html'])
>>> 'http://www.lemonde.fr/article/1234/index.html'

lru.lru_stems

Function returning url parts in hierarchical order.

from ural.lru import lru_stems

lru_stems('http://www.lemonde.fr:8000/article/1234/index.html?field=value#2')
>>> ['s:http', 't:8000', 'h:fr', 'h:lemonde', 'h:www', 'p:article', 'p:1234', 'p:index.html', 'q:field=value', 'f:2']

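Arguments

url string: url to parse.
suffix_aware ?bool: whether to be mindful of suffixes when converting (e.g. considering "co.uk" as a single token).

lru.canonicalized_lru_stems

Function canonicalizing the url and returning its parts in hierarchical order.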

from ural.lru import canonicalized_lru_stems

canonicalized_lru_stems('http://www.lemonde.fr/article/1234/index.html?field=value#2')
>>> ['s:http', 'h:fr', 'h:lemonde', 'p:article', 'p:1234', 'q:field=value', 'f:2']

Arguments

This function accepts the same arguments as canonicalize_url.

lru.normalized_lru_stems

Function normalizing the url and returning its parts in hierarchical order.

from ural.lru import normalized_lru_stems

normalized_lru_stems('http://www.lemonde.fr/article/1234/index.html?field=value#2')
>>> ['h:fr', 'h:lemonde', 'p:article', 'p:1234', 'q:field=value']

Arguments

This function accepts the same arguments as normalize_url.

lru.fingerprinted_lru_stems

Function fingerprinting the url and returning its parts in hierarchical order.

from ural.lru import fingerprinted_lru_stems

fingerprinted_lru_stems('http://www.lemonde.fr/article/1234/index.html?field=value#2', strip_suffix=True)
>>> ['h:lemonde', 'p:article', 'p:1234', 'q:field=value']

Arguments

This function accepts the same arguments as fingerprint_url.

lru.serialize_lru

Function serializing LRU stems to a string.

from ural.lru import serialize_lru

serialize_lru(['s:https', 'h:fr', 'h:lemonde'])
>>> 's:https|h:fr|h:lemonde|'

lru.unserialize_lru

Function unserializing a stringified LRU to a list of stems.

from ural.lru import unserialize_lru

unserialize_lru('s:https|h:fr|h:lemonde|')
>>> ['s:https', 'h:fr', 'h:lemonde']

LRUTrie

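Class implementing a prefix tree (Trie) storing URLs hierarchically, as LRUs, along with some arbitrary metadata. It is very useful when needing to match URLs by longest common prefix.

Note that this class directly inherits from the phylactery library's TrieDict, so you can also use any of its methods.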

from ural.lru import LRUTrie

trie = LRUTrie()

# To respect suffixes
trie = LRUTrie(suffix_aware=True)

#.set

Method storing a URL in a LRUTrie along with its metadata.

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'type': 'general press'})

trie.match('http://www.lemonde.fr')
>>> {'type': 'general press'}

Arguments

url string: url to store in the LRUTrie.
metadata any: metadata of the url.

#.set_lru

Method storing a URL already represented as a LRU or LRU stems, along with its metadata.

from ural.lru import LRUTrie

trie = LRUTrie()

# Using stems
trie.set_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www'], {'type': 'general press'})

# Using serialized lru
trie.set_lru('s:http|h:fr|h:lemonde|h:www|', {'type': 'general_press'})

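Arguments

lru string|list: lru to store in the Trie.
metadata any: metadata to attach to the lru.

#.match

Method returning the metadata attached to the longest prefix match of the queried URL. Returns None if no common prefix can be found.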

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set('http://www.lemonde.fr', {'media': 'lemonde'})

trie.match('http://www.lemonde.fr')
>>> {'media': 'lemonde'}
trie.match('http://www.lemonde.fr/politique')
>>> {'media': 'lemonde'}

trie.match('http://www.lefigaro.fr')
>>> None

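Arguments

url string: url to match in the LRUTrie.

#.match_lru

Method returning the metadata attached to the longest prefix match of the queried LRU. Returns None if no common prefix can be found.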

from ural.lru import LRUTrie

trie = LRUTrie()
trie.set_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www'], {'media': 'lemonde'})

trie.match_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www'])
>>> {'media': 'lemonde'}
trie.match_lru('s:http|h:fr|h:lemonde|h:www|p:politique|')
>>> {'media': 'lemonde'}

trie.match_lru(['s:http', 'h:fr', 'h:lefigaro', 'h:www'])
>>> None

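Arguments

lru string|list: lru to match in the LRUTrie.

CanonicalizedLRUTrie

The CanonicalizedLRUTrie is nearly identical to the standard LRUTrie, except that it canonicalizes the given urls with the canonicalize_url function before attempting any operation.

Its constructor therefore takes the same arguments as the aforementioned function.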
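
Longest prefix matching over LRU stems can be sketched with nested dictionaries. This is an illustration of the general technique only: ToyLRUTrie is a made-up name, it only accepts lists of stems, and it does not reflect how the phylactery-based class is actually implemented.

```python
# Sentinel key under which a node stores its metadata
VALUE = object()

class ToyLRUTrie:
    """Prefix tree over LRU stems returning the metadata of the longest matching prefix."""

    def __init__(self):
        self.root = {}

    def set_lru(self, stems, metadata):
        node = self.root
        for stem in stems:
            node = node.setdefault(stem, {})
        node[VALUE] = metadata

    def match_lru(self, stems):
        node = self.root
        best = None
        for stem in stems:
            node = node.get(stem)
            if node is None:
                break
            if VALUE in node:
                # Remember the deepest prefix carrying metadata so far
                best = node[VALUE]
        return best

trie = ToyLRUTrie()
trie.set_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www'], {'media': 'lemonde'})

print(trie.match_lru(['s:http', 'h:fr', 'h:lemonde', 'h:www', 'p:politique']))
print(trie.match_lru(['s:http', 'h:fr', 'h:lefigaro', 'h:www']))
```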

from ural.lru import CanonicalizedLRUTrie

trie = CanonicalizedLRUTrie(strip_fragment=False)

NormalizedLRUTrie

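The NormalizedLRUTrie is nearly identical to the standard LRUTrie, except that it normalizes the given urls with the normalize_url function before attempting any operation.

Its constructor therefore takes the same arguments as the aforementioned function.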

from ural.lru import NormalizedLRUTrie

trie = NormalizedLRUTrie(normalize_amp=False)

FingerprintedLRUTrie

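The FingerprintedLRUTrie is nearly identical to the standard LRUTrie, except that it fingerprints the given urls with the fingerprint_url function before attempting any operation.

Its constructor therefore takes the same arguments as the aforementioned function.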

from ural.lru import FingerprintedLRUTrie

trie = FingerprintedLRUTrie(strip_suffix=False)

Facebook

has_facebook_comments

Function returning whether the given url points to a Facebook resource potentially having comments (such as a post, photo or video for instance).

from ural.facebook import has_facebook_comments

has_facebook_comments('https://www.facebook.com/permalink.php?story_fbid=1354978971282622&id=598338556946671')
>>> True

has_facebook_comments('https://www.facebook.com/108824017345866/videos/311658803718223')
>>> True

has_facebook_comments('https://www.facebook.com/astucerie/')
>>> False

has_facebook_comments('https://www.lemonde.fr')
>>> False

has_facebook_comments('/permalink.php?story_fbid=1354978971282622&id=598338556946671', allow_relative_urls=True)
>>> True

is_facebook_id

Function returning whether the given string is a valid Facebook id.

from ural.facebook import is_facebook_id

is_facebook_id('974583586343')
>>> True

is_facebook_id('whatever')
>>> False

is_facebook_full_id

Function returning whether the given string is a valid Facebook full post id.

from ural.facebook import is_facebook_full_id

is_facebook_full_id('974583586343_9749757953')
>>> True

is_facebook_full_id('974583586343')
>>> False

is_facebook_full_id('whatever')
>>> False

is_facebook_url

Function returning whether the given url is from Facebook.

from ural.facebook import is_facebook_url

is_facebook_url('https://www.facebook.com/post/974583586343')
>>> True

is_facebook_url('https://fb.me/846748464')
>>> True

is_facebook_url('https://www.lemonde.fr')
>>> False

is_facebook_post_url

Function returning whether the given url is a Facebook post.

from ural.facebook import is_facebook_post_url

is_facebook_post_url('https://www.facebook.com/post/974583586343')
>>> True

is_facebook_post_url('https://www.facebook.com')
>>> False

is_facebook_post_url('https://www.lemonde.fr')
>>> False

is_facebook_link

Function returning whether the given url is a Facebook redirection link.

from ural.facebook import is_facebook_link

is_facebook_link('https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.chaos-controle.com%2Farchives%2F2013%2F10%2F14%2F28176300.html&amp;h=AT0iUqJpUTMzHAH8HAXwZ11p8P3Z-SrY90wIXZhcjMnxBTHMiau8Fv1hvz00ZezRegqmF86SczyUXx3Gzdt_MdFH-I4CwHIXKKU9L6w522xwOqkOvLAylxojGEwrp341uC-GlVyGE2N7XwTPK9cpP0mQ8PIrWh8Qj2gHIIR08Js0mUr7G8Qe9fx66uYcfnNfTTF1xi0Us8gTo4fOZxAgidGWXsdgtU_OdvQqyEm97oHzKbWfXjkhsrzbtb8ZNMDwCP5099IMcKRD8Hi5H7W3vwh9hd_JlRgm5Z074epD_mGAeoEATE_QUVNTxO0SHO4XNn2Z7LgBamvevu1ENBcuyuSOYA0BsY2cx8mPWJ9t44tQcnmyQhBlYm_YmszDaQx9IfVP26PRqhsTLz-kZzv0DGMiJFU78LVWVPc9QSw2f9mA5JYWr29w12xJJ5XGQ6DhJxDMWRnLdG8Tnd7gZKCaRdqDER1jkO72u75-o4YuV3CLh4j-_4u0fnHSzHdVD8mxr9pNEgu8rvJF1E2H3-XbzA6F2wMQtFCejH8MBakzYtTGNvHSexSiKphE04Ci1Z23nBjCZFsgNXwL3wbIXWfHjh2LCKyihQauYsnvxp6fyioStJSGgyA9GGEswizHa20lucQF0S0F8H9-')
>>> True

is_facebook_link('https://lemonde.fr')
>>> False

convert_facebook_url_to_mobile

Function returning the mobile version of the given Facebook url. Will raise an exception if a non-Facebook url is given.

from ural.facebook import convert_facebook_url_to_mobile

convert_facebook_url_to_mobile('http://www.facebook.com/post/974583586343')
>>> 'http://m.facebook.com/post/974583586343'

parse_facebook_url

Parses the given Facebook url.

from ural.facebook import parse_facebook_url

# Importing related classes if you need to perform tests
from ural.facebook import (
  FacebookHandle,
  FacebookUser,
  FacebookGroup,
  FacebookPost,
  FacebookPhoto,
  FacebookVideo
)

parse_facebook_url('https://www.facebook.com/people/Sophia-Aman/102016783928989')
>>> FacebookUser(id='102016783928989')

parse_facebook_url('https://www.facebook.com/groups/159674260452951')
>>> FacebookGroup(id='159674260452951')

parse_facebook_url('https://www.facebook.com/groups/159674260852951/permalink/1786992671454427/')
>>> FacebookPost(id='1786992671454427', group_id='159674260852951')

parse_facebook_url('https://www.facebook.com/108824017345866/videos/311658803718223')
>>> FacebookVideo(id='311658803718223', parent_id='108824017345866')

parse_facebook_url('https://www.facebook.com/photo.php?fbid=10222721681573727')
>>> FacebookPhoto(id='10222721681573727')

parse_facebook_url('/annelaure.rivolu?rc=p&__tn__=R', allow_relative_urls=True)
>>> FacebookHandle(handle='annelaure.rivolu')

parse_facebook_url('https://lemonde.fr')
>>> None

extract_url_from_facebook_link

Function extracting the target url from a Facebook redirection link.

from ural.facebook import extract_url_from_facebook_link

extract_url_from_facebook_link('https://l.facebook.com/l.php?u=http%3A%2F%2Fwww.chaos-controle.com%2Farchives%2F2013%2F10%2F14%2F28176300.html&amp;h=AT0iUqJpUTMzHAH8HAXwZ11p8P3Z-SrY90wIXZhcjMnxBTHMiau8Fv1hvz00ZezRegqmF86SczyUXx3Gzdt_MdFH-I4CwHIXKKU9L6w522xwOqkOvLAylxojGEwrp341uC-GlVyGE2N7XwTPK9cpP0mQ8PIrWh8Qj2gHIIR08Js0mUr7G8Qe9fx66uYcfnNfTTF1xi0Us8gTo4fOZxAgidGWXsdgtU_OdvQqyEm97oHzKbWfXjkhsrzbtb8ZNMDwCP5099IMcKRD8Hi5H7W3vwh9hd_JlRgm5Z074epD_mGAeoEATE_QUVNTxO0SHO4XNn2Z7LgBamvevu1ENBcuyuSOYA0BsY2cx8mPWJ9t44tQcnmyQhBlYm_YmszDaQx9IfVP26PRqhsTLz-kZzv0DGMiJFU78LVWVPc9QSw2f9mA5JYWr29w12xJJ5XGQ6DhJxDMWRnLdG8Tnd7gZKCaRdqDER1jkO72u75-o4YuV3CLh4j-_4u0fnHSzHdVD8mxr9pNEgu8rvJF1E2H3-XbzA6F2wMQtFCejH8MBakzYtTGNvHSexSiKphE04Ci1Z23nBjCZFsgNXwL3wbIXWfHjh2LCKyihQauYsnvxp6fyioStJSGgyA9GGEswizHa20lucQF0S0F8H9-')
>>> 'http://www.chaos-controle.com/archives/2013/10/14/28176300.html'

extract_url_from_facebook_link('http://lemonde.fr')
>>> None

Google

is_amp_url

Returns whether the given url is probably a Google AMP url.

from ural.google import is_amp_url

is_amp_url('http://www.europe1.fr/sante/les-onze-vaccins.amp')
>>> True

is_amp_url('https://www.lemonde.fr')
>>> False

is_google_link

Returns whether the given url is a Google search link.

from ural.google import is_google_link

is_google_link('https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&cad=rja&uact=8&ved=2ahUKEwjp8Lih_LnmAhWQlxQKHVTmCJYQFjADegQIARAB&url=http%3A%2F%2Fwww.mon-ip.com%2F&usg=AOvVaw0sfeZJyVtUS2smoyMlJmes')
>>> True

is_google_link('https://www.lemonde.fr')
>>> False

extract_url_from_google_link

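Extracts the url from the given Google search link. This is useful to "resolve" the links scraped from Google's search results. Returns None if the given url is invalid or irrelevant.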

from ural.google import extract_url_from_google_link

extract_url_from_google_link('https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=2ahUKEwicu4K-rZzmAhWOEBQKHRNWA08QFjAAegQIARAB&url=https%3A%2F%2Fwww.facebook.com%2Fieff.ogbeide&usg=AOvVaw0vrBVCiIHUr5pncjeLpPUp')
>>> 'https://www.facebook.com/ieff.ogbeide'

extract_url_from_google_link('https://www.lemonde.fr')
>>> None

extract_id_from_google_drive_url

Extracts a file id from the given Google drive url. Returns None if the given url is invalid or irrelevant.

from ural.google import extract_id_from_google_drive_url

extract_id_from_google_drive_url('https://docs.google.com/spreadsheets/d/1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg/edit#gid=0')
>>> '1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg'

extract_id_from_google_drive_url('https://www.lemonde.fr')
>>> None

parse_google_drive_url

Parses the given Google drive url. Returns None if the given url is invalid or irrelevant.

from ural.google import (
  parse_google_drive_url,
  GoogleDriveFile,
  GoogleDrivePublicLink
)

parse_google_drive_url('https://docs.google.com/spreadsheets/d/1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg/edit#gid=0')
>>> GoogleDriveFile('spreadsheets', '1Q9sJtAb1BZhUMjxCLMrVASx3AoNDp5iV3VkbPjlg')

parse_google_drive_url('https://www.lemonde.fr')
>>> None

Instagram

is_instagram_post_shortcode

Function returning whether the given string is a valid Instagram post shortcode.

from ural.instagram import is_instagram_post_shortcode

is_instagram_post_shortcode('974583By-5_86343')
>>> True

is_instagram_post_shortcode('whatever!!')
>>> False

is_instagram_username

Function returning whether the given string is a valid Instagram username.

from ural.instagram import is_instagram_username

is_instagram_username('97458.3By-5_86343')
>>> True

is_instagram_username('whatever!!')
>>> False

is_instagram_url

Returns whether the given url is from Instagram.

from ural.instagram import is_instagram_url

is_instagram_url('https://lemonde.fr')
>>> False

is_instagram_url('https://www.instagram.com/guillaumelatorre')
>>> True

extract_username_from_instagram_url

Returns a username from the given Instagram url, or None if we could not find one.

from ural.instagram import extract_username_from_instagram_url

extract_username_from_instagram_url('https://www.instagram.com/martin_dupont/p/BxKRx5CHn5i/')
>>> 'martin_dupont'

extract_username_from_instagram_url('https://lemonde.fr')
>>> None

parse_instagram_url

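Returns parsed information about the given Instagram url: either about the post, the user or the reel. If the url is an invalid Instagram url or is not an Instagram url at all, the function returns None.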

from ural.instagram import (
  parse_instagram_url,

  # You can also import the named tuples if you need them
  InstagramPost,
  InstagramUser,
  InstagramReel
)

parse_instagram_url('https://www.instagram.com/martin_dupont/p/BxKRx5CHn5i/')
>>> InstagramPost(id='BxKRx5CHn5i', name='martin_dupont')

parse_instagram_url('https://lemonde.fr')
>>> None

parse_instagram_url('https://www.instagram.com/p/BxKRx5-Hn5i/')
>>> InstagramPost(id='BxKRx5-Hn5i', name=None)

parse_instagram_url('https://www.instagram.com/martin_dupont')
>>> InstagramUser(name='martin_dupont')

parse_instagram_url('https://www.instagram.com/reels/BxKRx5-Hn5i')
>>> InstagramReel(id='BxKRx5-Hn5i')

Arguments

url str: Instagram url to parse.

Telegram

is_telegram_message_id

Function returning whether the given string is a valid Telegram message id.

from ural.telegram import is_telegram_message_id

is_telegram_message_id('974583586343')
>>> True

is_telegram_message_id('whatever')
>>> False

is_telegram_url

Returns whether the given url is from Telegram.

from ural.telegram import is_telegram_url

is_telegram_url('https://lemonde.fr')
>>> False

is_telegram_url('https://telegram.me/guillaumelatorre')
>>> True

is_telegram_url('https://t.me/s/jesstern')
>>> True

convert_telegram_url_to_public

Function returning the public version of the given Telegram url. Will raise an exception if a non-Telegram url is given.

from ural.telegram import convert_telegram_url_to_public

convert_telegram_url_to_public('https://t.me/jesstern')
>>> 'https://t.me/s/jesstern'

extract_channel_name_from_telegram_url

Returns a channel name from the given Telegram url, or None if we could not find one.

from ural.telegram import extract_channel_name_from_telegram_url

extract_channel_name_from_telegram_url('https://t.me/s/jesstern/345')
>>> 'jesstern'

extract_channel_name_from_telegram_url('https://lemonde.fr')
>>> None

parse_telegram_url

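Returns parsed information about the given Telegram url: either about the channel, the message or the user. If the url is an invalid Telegram url or is not a Telegram url at all, the function returns None.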

from ural.telegram import (
  parse_telegram_url,

  # You can also import the named tuples if you need them
  TelegramMessage,
  TelegramChannel,
  TelegramGroup
)

parse_telegram_url('https://t.me/s/jesstern/76')
>>> TelegramMessage(name='jesstern', id='76')

parse_telegram_url('https://lemonde.fr')
>>> None

parse_telegram_url('https://telegram.me/rapsocialclub')
>>> TelegramChannel(name='rapsocialclub')

parse_telegram_url('https://t.me/joinchat/AAAAAE9B8u_wO9d4NiJp3w')
>>> TelegramGroup(id='AAAAAE9B8u_wO9d4NiJp3w')

Arguments

url str: Telegram url to parse.

Twitter

is_twitter_url

Returns whether the given url is from Twitter.

from ural.twitter import is_twitter_url

is_twitter_url('https://lemonde.fr')
>>> False

is_twitter_url('https://www.twitter.com/Yomguithereal')
>>> True

is_twitter_url('https://twitter.com')
>>> True

extract_screen_name_from_twitter_url

Extracts a normalized user's screen name from a Twitter url. If given an irrelevant url, the function will return None.

from ural.twitter import extract_screen_name_from_twitter_url

extract_screen_name_from_twitter_url('https://www.twitter.com/Yomguithereal')
>>> 'yomguithereal'

extract_screen_name_from_twitter_url('https://twitter.fr')
>>> None

parse_twitter_url

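Takes a Twitter url and returns either a TwitterUser namedtuple (containing a screen_name) if the given url is a link to a Twitter user, a TwitterTweet namedtuple (containing a user_screen_name and an id) if the given url is a tweet's url, a TwitterList namedtuple (containing an id) if the given url is a list's url, or None if the given url is irrelevant.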

from ural.twitter import parse_twitter_url

parse_twitter_url('https://twitter.com/Yomguithereal')
>>> TwitterUser(screen_name='yomguithereal')

parse_twitter_url('https://twitter.com/medialab_ScPo/status/1284154793376784385')
>>> TwitterTweet(user_screen_name='medialab_scpo', id='1284154793376784385')

parse_twitter_url('https://twitter.com/i/lists/15512656222798157826')
>>> TwitterList(id='15512656222798157826')

parse_twitter_url('https://twitter.com/home')
>>> None

Youtube

is_youtube_url

Returns whether the given url is from YouTube.

from ural.youtube import is_youtube_url

is_youtube_url('https://lemonde.fr')
>>> False

is_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> True

is_youtube_url('https://youtu.be/otRTOE9i51o')
>>> True

is_youtube_channel_id

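Returns whether the given string is a formally valid YouTube channel id. Note that it does not check whether the id actually refers to an existing channel. You would need to call YouTube's servers for that.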

from ural.youtube import is_youtube_channel_id

is_youtube_channel_id('UCCCPCZNChQdGa9EkATeye4g')
>>> True

is_youtube_channel_id('@France24')
>>> False

is_youtube_video_id

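Returns whether the given string is a formally valid YouTube video id. Note that it does not check whether the id actually refers to an existing video. You would need to call YouTube's servers for that.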

from ural.youtube import is_youtube_video_id

is_youtube_video_id('otRTOE9i51o')
>>> True

is_youtube_video_id('bDYTYET')
>>> False
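
"Formally valid" means the check is purely syntactic. As a rough illustration, the sketch below tests the shapes commonly observed for these ids: 11 characters from the URL-safe base64 alphabet for videos, and "UC" followed by 22 such characters for channels. These patterns and the function names are assumptions made for this example; ural's own rules may be stricter or looser.

```python
import re

# Assumption: commonly observed id shapes, not an official specification.
VIDEO_ID_RE = re.compile(r'[A-Za-z0-9_-]{11}')
CHANNEL_ID_RE = re.compile(r'UC[A-Za-z0-9_-]{22}')


def looks_like_video_id(value):
    """Syntactic check only: says nothing about whether the video exists."""
    return VIDEO_ID_RE.fullmatch(value) is not None


def looks_like_channel_id(value):
    """Syntactic check only: says nothing about whether the channel exists."""
    return CHANNEL_ID_RE.fullmatch(value) is not None
```

With these patterns, a handle such as '@France24' is rejected as a channel id because it contains '@' and has the wrong length.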

parse_youtube_url

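Returns parsed information about the given YouTube URL: either about the linked video, short, user or channel. If the URL is an invalid YouTube URL, or is not a YouTube URL at all, the function returns None.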

from ural.youtube import (
  parse_youtube_url,

  # You can also import the named tuples if you need them
  YoutubeVideo,
  YoutubeUser,
  YoutubeChannel,
  YoutubeShort,
)

parse_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> YoutubeVideo(id='otRTOE9i51o')

parse_youtube_url('https://www.youtube.com/shorts/GINlKobb41w')
>>> YoutubeShort(id='GINlKobb41w')

parse_youtube_url('https://lemonde.fr')
>>> None

parse_youtube_url('http://www.youtube.com/channel/UCWvUxN9LAjJ-sTc5JJ3gEyA/videos')
>>> YoutubeChannel(id='UCWvUxN9LAjJ-sTc5JJ3gEyA', name=None)

parse_youtube_url('http://www.youtube.com/user/ojimfrance')
>>> YoutubeUser(id=None, name='ojimfrance')

parse_youtube_url('https://www.youtube.com/taranisnews')
>>> YoutubeChannel(id=None, name='taranisnews')

Arguments

url str: YouTube URL to parse.
fix_common_mistakes bool [True]: Whether to fix common mistakes found in YouTube URLs, such as those encountered when crawling the web.

extract_video_id_from_youtube_url

Returns the video id from the given YouTube URL, or None if none could be found. Note that this also works with YouTube shorts.

from ural.youtube import extract_video_id_from_youtube_url

extract_video_id_from_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> 'otRTOE9i51o'

extract_video_id_from_youtube_url('https://lemonde.fr')
>>> None

extract_video_id_from_youtube_url('http://youtu.be/afa-5HQHiAs')
>>> 'afa-5HQHiAs'

normalize_youtube_url

Returns a normalized version of the given YouTube URL. It normalizes video, user and channel URLs so that you can easily match them.

from ural.youtube import normalize_youtube_url

normalize_youtube_url('https://www.youtube.com/watch?v=otRTOE9i51o')
>>> 'https://www.youtube.com/watch?v=otRTOE9i51o'

normalize_youtube_url('http://youtu.be/afa-5HQHiAs')
>>> 'https://www.youtube.com/watch?v=afa-5HQHiAs'
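
The two examples above suggest that, for videos, normalization amounts to finding the video id and rebuilding one fixed URL form. The following standard-library sketch shows that idea for watch, youtu.be and shorts links only. It is an approximation written for this page, not ural's implementation: the host list and the names sketch_video_id and sketch_normalize_video_url are assumptions, and user and channel URLs are not handled.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: illustrative, non-exhaustive host list.
YOUTUBE_HOSTS = {'youtube.com', 'www.youtube.com', 'm.youtube.com'}


def sketch_video_id(url):
    """Return the video id found in the URL, or None."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split('/') if s]

    # Short links: https://youtu.be/<id>
    if parts.hostname == 'youtu.be':
        return segments[0] if segments else None

    if parts.hostname in YOUTUBE_HOSTS:
        # Regular links: /watch?v=<id>
        if segments[:1] == ['watch']:
            values = parse_qs(parts.query).get('v')
            return values[0] if values else None

        # Shorts: /shorts/<id>
        if len(segments) >= 2 and segments[0] == 'shorts':
            return segments[1]

    return None


def sketch_normalize_video_url(url):
    """Rebuild a single canonical watch URL, or return None."""
    video_id = sketch_video_id(url)
    if video_id is None:
        return None
    return 'https://www.youtube.com/watch?v=' + video_id
```

Mapping every variant to one form means two links to the same video compare equal as plain strings, which is what makes counting and deduplication straightforward.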

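Miscellaneous

About LRUs

TL;DR: an LRU is a hierarchical reordering of a URL that makes it possible to run meaningful prefix queries on URLs.

If you look at many URLs, you will quickly notice that they are not written in sound hierarchical order. Take this URL, for instance: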

http://business.lemonde.fr/articles/money.html?id=34#content

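Some parts, such as the subdomain, are written in the "wrong order". This is fine, really: it is how URLs have always worked.

But if what you really want is to match URLs, you need to reorder them so that their order closely reflects the hierarchy of the content they target. This is exactly what LRUs do (that, and a bad pun on URL, since an LRU is basically a "reversed" URL).

Now look at how the aforementioned URL can be split into LRU stems: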

[
  's:http',
  'h:fr',
  'h:lemonde',
  'h:business',
  'p:articles',
  'p:money.html',
  'q:id=34',
  'f:content'
]
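
The stem list above can be reproduced with a few lines of standard-library Python. This is a simplified sketch meant only to show the reordering: the function name sketch_lru_stems is hypothetical, and ports, credentials and other edge cases that ural's lru functions deal with are ignored here.

```python
from urllib.parse import urlsplit


def sketch_lru_stems(url):
    """Split a URL into LRU-style stems (simplified illustration)."""
    parts = urlsplit(url)
    stems = []

    if parts.scheme:
        stems.append('s:' + parts.scheme)

    # Hostname labels are reversed so that the most general part comes first
    if parts.hostname:
        for label in reversed(parts.hostname.split('.')):
            stems.append('h:' + label)

    # Path segments are already in hierarchical order
    for segment in parts.path.split('/'):
        if segment:
            stems.append('p:' + segment)

    if parts.query:
        stems.append('q:' + parts.query)

    if parts.fragment:
        stems.append('f:' + parts.fragment)

    return stems


stems = sketch_lru_stems('http://business.lemonde.fr/articles/money.html?id=34#content')
```

Only the hostname needs reversing; scheme, path, query and fragment already appear from most general to most specific.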

Typically, this list of stems is serialized as follows:

s:http|h:fr|h:lemonde|h:business|p:articles|p:money.html|q:id=34|f:content|

Note that the trailing separator (|) is added so that serialized LRUs can be prefix-free.
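
The effect of the trailing separator is easiest to see on a lookalike hostname. In the sketch below, lemondeblog.fr is a made-up host used only for illustration and the serialize helper is a hypothetical stand-in written for this example. With the trailing separator, a plain string prefix test matches only true descendants; without it, the lookalike host matches by accident.

```python
def serialize(stems):
    """Join stems, terminating every stem (including the last) with '|'."""
    return ''.join(stem + '|' for stem in stems)


lemonde = ['s:http', 'h:fr', 'h:lemonde']
business = ['s:http', 'h:fr', 'h:lemonde', 'h:business', 'p:articles']
lookalike = ['s:http', 'h:fr', 'h:lemondeblog']  # hypothetical host

prefix = serialize(lemonde)

# With the trailing separator, only the real subdomain matches
matches_subdomain = serialize(business).startswith(prefix)
matches_lookalike = serialize(lookalike).startswith(prefix)

# Without it, 'h:lemonde' is also a string prefix of 'h:lemondeblog'
naive_prefix = '|'.join(lemonde)
naive_matches_lookalike = '|'.join(lookalike).startswith(naive_prefix)
```

Here matches_subdomain is True, matches_lookalike is False, and naive_matches_lookalike is True, which is the false positive the trailing separator prevents.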

