
langchain-pull-md is a Python package that extends LangChain with a loader that converts URLs into Markdown. It handles JavaScript-rendered pages, such as those built with React, Angular, or Vue.js, by delegating rendering and conversion to the pull.md service, so no local rendering resources are needed.


langchain-pull-md

langchain-pull-md is a Python package that extends LangChain with a loader that converts URLs to Markdown using the pull.md service. The loader fetches fully rendered Markdown content, which is especially useful for web pages built with JavaScript frameworks such as React, Angular, and Vue.js.


Key Features

  • Convert URLs to Markdown directly, supporting pages rendered with JavaScript frameworks.
  • Fetch Markdown through the external pull.md service, so rendering consumes no local server resources.

Installation

To install the package, use:

pip install langchain-pull-md

Usage

Here’s how you can use the PullMdLoader from langchain-pull-md:

Basic Example

from langchain_pull_md import PullMdLoader

# Initialize using a URL
loader = PullMdLoader(url="http://example.com")

documents = loader.load()
print(documents)
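
Because the loader depends on an external service, it can help to wrap the call and extract only the Markdown text. The following is a minimal sketch, not part of the package: `load_markdown` and its `loader_factory` parameter are hypothetical names, and it assumes that `load()` returns objects following LangChain's `Document` convention, with the text in `page_content`.

```python
from typing import Any, Callable, List, Optional


def load_markdown(
    url: str,
    loader_factory: Optional[Callable[..., Any]] = None,
) -> List[str]:
    """Return the Markdown text of each loaded document, or [] on failure.

    Hypothetical helper: `loader_factory` defaults to PullMdLoader and can be
    swapped for a stub when testing without network access.
    """
    if loader_factory is None:
        # Imported lazily so the helper can be exercised with a stub loader.
        from langchain_pull_md import PullMdLoader
        loader_factory = PullMdLoader

    try:
        documents = loader_factory(url=url).load()
    except Exception as exc:
        # The exact exception types raised on network or service errors are
        # not documented here, so this sketch catches broadly.
        print(f"Could not load {url}: {exc}")
        return []

    # Assumption: each item exposes its text as `page_content`,
    # as LangChain Document objects do.
    return [doc.page_content for doc in documents]
```

Calling `load_markdown("http://example.com")` would then give a list of Markdown strings, one per document, or an empty list if the request fails.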

Parameters

PullMdLoader Constructor

Parameter  Type  Default  Description
url        str   None     The URL to fetch and convert to Markdown.
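
Since `url` is the only parameter, a quick check before constructing the loader can catch malformed input early. This sketch uses only the standard library; `is_http_url` is a hypothetical helper, and the package itself may or may not perform similar validation.

```python
from urllib.parse import urlparse


def is_http_url(url: str) -> bool:
    """Return True if `url` has an http(s) scheme and a host part."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Only pass the URL to the loader when it looks well formed.
candidate = "http://example.com"
if is_http_url(candidate):
    print(f"OK to load: {candidate}")
else:
    print(f"Skipping malformed URL: {candidate}")
```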

Testing

To run the tests:

  1. Clone the repository:

    git clone https://github.com/chigwell/langchain-pull-md
    cd langchain-pull-md
    
  2. Install development dependencies:

    pip install -r requirements.txt
    
  3. Run the tests:

    pytest tests/test_markdown_loader.py
    

Contributing

Contributions are welcome! If you have ideas for new features or spot a bug, feel free to:

  • Open an issue on GitHub.
  • Submit a pull request.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for details.


Acknowledgements

  • LangChain for providing the base integration framework.
  • pull.md for enabling efficient Markdown extraction from dynamic web pages.
