Scrape Facebook public pages without an API key

Facebook Scraper

Scrape Facebook public pages without an API key. Inspired by twitter-scraper.

Install

To install the latest release from PyPI:

pip install facebook-scraper

Or, to install the latest master branch:

pip install git+https://github.com/kevinzg/facebook-scraper.git

Usage

Pass the unique page name, profile name, or ID as the first parameter and you're good to go:

>>> from facebook_scraper import get_posts

>>> for post in get_posts('nintendo', pages=1):
...     print(post['text'][:50])
...
The final step on the road to the Super Smash Bros
We’re headed to PAX East 3/28-3/31 with new games

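Optional parameters

(For the get_posts function.)

group: group ID, to scrape a group instead of a page. Default is None.
pages: number of pages of posts to request. The first two pages may have no results, so try a number greater than 2. Default is 10.
timeout: how many seconds to wait before timing out. Default is 30.
credentials: tuple of username and password used to log in before requesting the posts. Default is None.
extra_info: bool, if true the function will try to make an extra request to get the post reactions. Default is False.
youtube_dl: bool, use Youtube-DL for (high-quality) video extraction. You need to have youtube-dl installed in your environment. Default is False.
post_urls: list, URLs or post IDs to extract posts from. An alternative to fetching based on username.
cookies: one of:
- The path to a file containing cookies in Netscape or JSON format. You can extract cookies from your browser with an extension such as Get Cookies.txt (Chrome) or Cookie Quick Manager (Firefox). Make sure you include both the c_user cookie and the xs cookie; if you don't, you will get an InvalidCookies exception.
- A CookieJar
- A dictionary that can be converted to a CookieJar with cookiejar_from_dict
- The string "from_browser", to try to extract Facebook cookies from your browser
options: dictionary of options. Set options={"comments": True} to extract comments, and options={"reactors": True} to extract the people who reacted to the post. Both comments and reactors can also be set to a number to limit how many comments/reactors are retrieved. Set options={"progress": True} to get a tqdm progress bar while extracting comments and replies. Set options={"allow_extra_requests": False} to disable making extra requests when extracting post data (these extra requests are required for some fields, such as full text and image links). Set options={"posts_per_page": 200} to request 200 posts per page. The default is 4 posts per page.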
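
The options described above can be combined in a single call. The following is a minimal sketch rather than part of the library: build_options and fetch_posts are hypothetical helper names, and only the get_posts keyword arguments listed above (pages, cookies, options) are assumed to exist.

```python
from typing import Any, Dict, Optional


def build_options(comments=False, reactors=False, progress=False,
                  posts_per_page: Optional[int] = None) -> Dict[str, Any]:
    """Assemble the `options` dict, leaving out anything that was not requested."""
    options: Dict[str, Any] = {}
    if comments:
        options["comments"] = comments  # True, or a number used as a limit
    if reactors:
        options["reactors"] = reactors  # True, or a number used as a limit
    if progress:
        options["progress"] = True
    if posts_per_page is not None:
        options["posts_per_page"] = posts_per_page
    return options


def fetch_posts(account, pages=3, cookies=None, **option_flags):
    """Hypothetical wrapper around get_posts; returns its generator of posts."""
    # Imported lazily so build_options can be used without the package installed.
    from facebook_scraper import get_posts

    kwargs = {"pages": pages, "options": build_options(**option_flags)}
    if cookies is not None:
        kwargs["cookies"] = cookies  # e.g. a path to a cookies file
    return get_posts(account, **kwargs)
```

For example, fetch_posts('nintendo', pages=3, comments=True, posts_per_page=200) would request three pages with comment extraction enabled.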

CLI usage

$ facebook-scraper --filename nintendo_page_posts.csv --pages 10 nintendo

Run facebook-scraper --help for more details on CLI usage.

Note: If you get a UnicodeEncodeError, try adding --encoding utf-8.
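
Once the CLI has written a CSV file, it can be read back with the standard library. This is a sketch under an assumption the text above does not confirm: that the file starts with a header row whose column names match the post fields (post_id, likes, text, and so on). read_posts and load_posts_csv are hypothetical helper names, and the sample row is made up for illustration.

```python
import csv
import io


def read_posts(handle):
    """Yield one dict per CSV row, keyed by the header row."""
    for row in csv.DictReader(handle):
        yield row


def load_posts_csv(path, encoding="utf-8"):
    """Read a whole CSV export into a list of dicts."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(read_posts(handle))


# Made-up sample standing in for a real export such as nintendo_page_posts.csv.
sample = io.StringIO("post_id,likes,text\n2257188721032235,3509,Young Link joins\n")
rows = list(read_posts(sample))

# CSV values come back as strings, so numeric fields need converting.
print(rows[0]["post_id"], int(rows[0]["likes"]))
```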

Post example

{'available': True,
 'comments': 459,
 'comments_full': None,
 'factcheck': None,
 'fetched_time': datetime.datetime(2021, 4, 20, 13, 39, 53, 651417),
 'image': 'https://scontent.fhlz2-1.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/58745049_2257182057699568_1761478225390731264_n.jpg?_nc_cat=111&ccb=1-3&_nc_sid=8024bb&_nc_ohc=ygH2fPmfQpAAX92ABYY&_nc_ht=scontent.fhlz2-1.fna&tp=14&oh=7a8a7b4904deb55ec696ae255fff97dd&oe=60A36717',
 'images': ['https://scontent.fhlz2-1.fna.fbcdn.net/v/t1.6435-9/fr/cp0/e15/q65/58745049_2257182057699568_1761478225390731264_n.jpg?_nc_cat=111&ccb=1-3&_nc_sid=8024bb&_nc_ohc=ygH2fPmfQpAAX92ABYY&_nc_ht=scontent.fhlz2-1.fna&tp=14&oh=7a8a7b4904deb55ec696ae255fff97dd&oe=60A36717'],
 'is_live': False,
 'likes': 3509,
 'link': 'https://www.nintendo.com/amiibo/line-up/',
 'post_id': '2257188721032235',
 'post_text': 'Don’t let this diminutive version of the Hero of Time fool you, '
              'Young Link is just as heroic as his fully grown version! Young '
              'Link joins the Super Smash Bros. series of amiibo figures!\n'
              '\n'
              'https://www.nintendo.com/amiibo/line-up/',
 'post_url': 'https://facebook.com/story.php?story_fbid=2257188721032235&id=119240841493711',
 'reactions': {'haha': 22, 'like': 2657, 'love': 706, 'sorry': 1, 'wow': 123}, # if `extra_info` was set
 'reactors': None,
 'shared_post_id': None,
 'shared_post_url': None,
 'shared_text': '',
 'shared_time': None,
 'shared_user_id': None,
 'shared_username': None,
 'shares': 441,
 'text': 'Don’t let this diminutive version of the Hero of Time fool you, '
         'Young Link is just as heroic as his fully grown version! Young Link '
         'joins the Super Smash Bros. series of amiibo figures!\n'
         '\n'
         'https://www.nintendo.com/amiibo/line-up/',
 'time': datetime.datetime(2019, 4, 30, 5, 0, 1),
 'user_id': '119240841493711',
 'username': 'Nintendo',
 'video': None,
 'video_id': None,
 'video_thumbnail': None,
 'w3_fb_url': 'https://www.facebook.com/Nintendo/posts/2257188721032235'}
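
Because the time and fetched_time fields are datetime objects, a post dict like the one above cannot be passed to json.dumps directly. One possible way to store posts as JSON is sketched below; post_to_json is a hypothetical helper, and the sample dict is a shortened copy of the example above.

```python
import datetime
import json


def post_to_json(post):
    """Serialize a post dict; values json cannot encode (such as datetimes) are stringified."""
    return json.dumps(post, default=str, ensure_ascii=False, sort_keys=True)


sample_post = {
    "post_id": "2257188721032235",
    "time": datetime.datetime(2019, 4, 30, 5, 0, 1),
    "likes": 3509,
    "video": None,
}
encoded = post_to_json(sample_post)
print(encoded)
```

Using default=str is the simplest choice; a stricter alternative would be a custom function that calls isoformat() on datetimes and raises for anything else.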

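Notes

There is no guarantee that every field will be extracted (they might be None).
Group posts may be missing some fields, such as time and post_url.
Group scraping may return only one page, and does not work on private groups.
If you scrape too much, Facebook might temporarily ban your IP.
The vast majority of unique IDs on Facebook (post IDs, video IDs, photo IDs, comment IDs, profile IDs, etc.) can be appended to https://www.facebook.com/ to get a redirect to the corresponding object.
Some functions (such as extracting reactions) require you to be logged into Facebook (pass cookies). If something isn't working, try passing cookies and see whether that fixes it.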
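
Two of the notes above lend themselves to small defensive helpers: fields may be None, and scraping too fast may get your IP banned. The sketch below is illustrative only. iter_throttled and summarize are hypothetical names, the two-second delay is an arbitrary choice, and nothing here guarantees that Facebook will not block you.

```python
import time


def iter_throttled(posts, delay_seconds=2.0, sleep=time.sleep):
    """Yield posts one by one, pausing before every item except the first."""
    for index, post in enumerate(posts):
        if index:
            sleep(delay_seconds)
        yield post


def summarize(post, width=50):
    """Build a small summary that tolerates missing or None fields."""
    text = post.get("text") or ""
    when = post.get("time")
    return {
        "post_id": post.get("post_id"),
        "time": when.isoformat() if when is not None else None,
        "preview": text[:width],
    }
```

Since get_posts returns a generator, wrapping it as iter_throttled(get_posts('nintendo', pages=3)) should also spread out the underlying page requests, although the source does not state how much delay is enough.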

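Profiles

The get_profile function can extract information from a profile's about section. Pass in the account name or ID as the first parameter.
Note that Facebook serves different information, such as date of birth and gender, depending on whether you are logged in (cookies parameter).

Usage: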

from facebook_scraper import get_profile
get_profile("zuck") # Or get_profile("zuck", cookies="cookies.txt")

Outputs:

{'About': "I'm trying to make the world a more open place.",
 'Education': 'Harvard University\n'
              'Computer Science and Psychology\n'
              '30 August 2002 - 30 April 2004\n'
              'Phillips Exeter Academy\n'
              'Classics\n'
              'School year 2002\n'
              'Ardsley High School\n'
              'High School\n'
              'September 1998 - June 2000',
 'Favourite Quotes': '"Fortune favors the bold."\n'
                     '- Virgil, Aeneid X.284\n'
                     '\n'
                     '"All children are artists. The problem is how to remain '
                     'an artist once you grow up."\n'
                     '- Pablo Picasso\n'
                     '\n'
                     '"Make things as simple as possible but no simpler."\n'
                     '- Albert Einstein',
 'Name': 'Mark Zuckerberg',
 'Places lived': [{'link': '/profile.php?id=104022926303756&refid=17',
                   'text': 'Palo Alto, California',
                   'type': 'Current town/city'},
                  {'link': '/profile.php?id=105506396148790&refid=17',
                   'text': 'Dobbs Ferry, New York',
                   'type': 'Home town'}],
 'Work': 'Chan Zuckerberg Initiative\n'
         '1 December 2015 - Present\n'
         'Facebook\n'
         'Founder and CEO\n'
         '4 February 2004 - Present\n'
         'Palo Alto, California\n'
         'Bringing the world closer together.'}

To extract friends, pass the argument friends=True. To limit the number of friends retrieved, set friends to the desired number.
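
In the output above, sections such as Work and Education are newline-separated strings, while Places lived is a list of dicts. A small helper can normalize the string sections into lists of lines. This is a sketch: section_lines is a hypothetical name, and the sample dict is a shortened copy of the output shown earlier.

```python
def section_lines(profile, key):
    """Return the non-empty lines of a string section, or [] if it is missing or not a string."""
    value = profile.get(key)
    if not isinstance(value, str):
        return []
    return [line for line in value.split("\n") if line.strip()]


sample_profile = {
    "Name": "Mark Zuckerberg",
    "Work": "Chan Zuckerberg Initiative\n1 December 2015 - Present\nFacebook\nFounder and CEO",
}

for line in section_lines(sample_profile, "Work"):
    print(line)
```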

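Group info

The get_group_info function can extract information about a group. Pass in the group name or ID as the first parameter.
Note that you must be logged in (cookies parameter) to see the list of admins.

Usage: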

from facebook_scraper import get_group_info
get_group_info("makeupartistsgroup") # or get_group_info("makeupartistsgroup", cookies="cookies.txt")

Output:

{'admins': [{'link': '/africanstylemagazinecom/?refid=18',
             'name': 'African Style Magazine'},
            {'link': '/connectfluencer/?refid=18',
             'name': 'Everythingbrightandbeautiful'},
            {'link': '/Kaakakigroup/?refid=18', 'name': 'Kaakaki Group'},
            {'link': '/opentohelp/?refid=18', 'name': 'Open to Help'}],
 'id': '579169815767106',
 'members': 6814229,
 'name': 'HAIRSTYLES',
 'type': 'Public group'}
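
The admin links in the output above are relative paths. They can be turned into absolute URLs with urllib.parse.urljoin, as sketched below. admin_urls is a hypothetical helper, and the base URL https://www.facebook.com is an assumption; the links may be intended for a different Facebook host.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.facebook.com"  # assumption: the host the relative links resolve against


def admin_urls(group_info, base_url=BASE_URL):
    """Map each admin name to an absolute URL; returns {} when admins is missing or None."""
    admins = group_info.get("admins") or []
    return {admin["name"]: urljoin(base_url, admin["link"]) for admin in admins}


sample_group = {
    "admins": [{"link": "/opentohelp/?refid=18", "name": "Open to Help"}],
    "name": "HAIRSTYLES",
}
print(admin_urls(sample_group))
```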

To-Do

Async support
~~Image galleries~~ (images entry)
~~Profiles or post authors~~ (get_profile())
Comments (with options={'comments': True})

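Alternatives and related projects

facebook-post-scraper. Has comments. Uses Selenium.
facebook-scraper-selenium. "Scrape posts from any group or user into a .csv file without needing to register for any API access".
Ultimate Facebook Scraper. "Scrapes almost everything about a Facebook user's profile". Uses Selenium.
Unofficial APIs. List of unofficial APIs for various services; none for Facebook for now, but it might be worth checking in the future.
major-scrapy-spiders. Has a profile spider for Scrapy.
facebook-page-post-scraper. Seems abandoned.
- FBLYZE. Fork (?).
RSSHub. Generates an RSS feed from Facebook pages.
RSS-Bridge. Also generates RSS feeds from Facebook pages.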