{"id":1405,"date":"2024-05-09T19:32:56","date_gmt":"2024-05-09T19:32:56","guid":{"rendered":"https:\/\/www.nicekj.com\/?p=1405"},"modified":"2024-05-09T19:33:01","modified_gmt":"2024-05-09T19:33:01","slug":"yiwengaodonglangchain-document-loaderer","status":"publish","type":"post","link":"https:\/\/www.nicekj.com\/yiwengaodonglangchain-document-loaderer.html","title":{"rendered":"\ud83d\udd25\ud83d\udd25\ud83d\udd25Understanding LangChain Document Loaders in One Article (Part 2)"},"content":{"rendered":"<h2 data-id=\"heading-0\">Transform Loaders: Loading Data from a Specific Format into the Document Format<\/h2>\n<p>Transform loaders work like the <code>TextLoader<\/code> covered in the previous part: they convert an input format into our Document format. <code>LangChain<\/code> has a growing number of transform loaders, including but not limited to the following:<\/p>\n<ul>\n<li>CSV<\/li>\n<li>Email<\/li>\n<li>HTML<\/li>\n<li>Markdown<\/li>\n<li>Microsoft Word\/PowerPoint<\/li>\n<li>Notion (raw files or through API integration)<\/li>\n<li>Reddit<\/li>\n<li>PDF<\/li>\n<\/ul>\n<p>Many of these loaders are built on the <code>Unstructured<\/code> Python library, which is very good at converting a wide variety of file types into the text data our Documents need.<\/p>\n<h2 data-id=\"heading-1\">Unstructured 
Partitions<\/h2>\n<p>The core concept of the <code>Unstructured<\/code> library is partitioning a document into elements. When you pass it a file, the library reads the source document, splits it into sections, classifies those sections, and then extracts the text of each one. After partitioning, it returns a list of document elements.<\/p>\n<p>Here is an example of using the library directly:<\/p>\n<pre><code class=\"hljs language-python code-block-extension-codeShowNum\" lang=\"python\"><span class=\"code-block-extension-codeLine\" data-line-num=\"1\"><span class=\"hljs-keyword\">from<\/span> unstructured.partition.auto <span class=\"hljs-keyword\">import<\/span> partition<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"2\">elements = partition(filename=<span class=\"hljs-string\">\"dashboard.html\"<\/span>)<\/span>\n<\/code><\/pre>\n<p>Under the hood, the library uses several tools to detect the file type automatically and partition the file accordingly.<\/p>\n<h2 data-id=\"heading-2\">Example: Loading a Microsoft Word Document<\/h2>\n<p>Let's walk through what loading a Microsoft Word document looks like.<\/p>\n<p>Here is our sample Word document:<\/p>\n<p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" 
src=\"https:\/\/www.nicekj.com\/wp-content\/uploads\/replace\/660161161c7e6bd26ed8efee32461292.png\" alt=\"image.png\" \/><\/figure>\n<\/p>\n<p>Now we can use LangChain's <code>UnstructuredWordDocumentLoader<\/code> to partition this document.<\/p>\n<pre><code class=\"hljs language-python code-block-extension-codeShowNum\" lang=\"python\"><span class=\"code-block-extension-codeLine\" data-line-num=\"1\"><span class=\"hljs-keyword\">from<\/span> langchain.document_loaders <span class=\"hljs-keyword\">import<\/span> UnstructuredWordDocumentLoader<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"2\"><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"3\"><span class=\"hljs-comment\"># use mode=\"elements\" to return each Element as a Document<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"4\"><span class=\"hljs-comment\"># otherwise it defaults to the \"single\" option, which returns a single Document<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"5\">loader = UnstructuredWordDocumentLoader(file_path=<span class=\"hljs-string\">\"test_doc.docx\"<\/span>, mode=<span class=\"hljs-string\">\"elements\"<\/span>)<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"6\"><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"7\">data = loader.load()<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"8\"><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"9\"><span 
class=\"hljs-built_in\">print<\/span>(data)<\/span>\n<\/code><\/pre>\n<p>With <code>mode=\"elements\"<\/code>, the loader returns one Document for each element in the source document:<\/p>\n<pre><code class=\"hljs language-python code-block-extension-codeShowNum\" lang=\"python\"><span class=\"code-block-extension-codeLine\" data-line-num=\"1\">[<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"2\">    Document(page_content = <span class=\"hljs-string\">'Title Text'<\/span>, metadata = {<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"3\">        <span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"4\">        <span class=\"hljs-string\">'filename'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"5\">        <span class=\"hljs-string\">'filetype'<\/span>: <span class=\"hljs-string\">'application\/vnd.openxmlformats-officedocument.wordprocessingml.document'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"6\">        <span class=\"hljs-string\">'page_number'<\/span>: <span class=\"hljs-number\">1<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"7\">        <span class=\"hljs-string\">'category'<\/span>: <span class=\"hljs-string\">'Title'<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"8\">    }),<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"9\">    Document(page_content = <span class=\"hljs-string\">'Heading 1'<\/span>, metadata 
= {<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"10\">        <span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"11\">        <span class=\"hljs-string\">'filename'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"12\">        <span class=\"hljs-string\">'filetype'<\/span>: <span class=\"hljs-string\">'application\/vnd.openxmlformats-officedocument.wordprocessingml.document'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"13\">        <span class=\"hljs-string\">'page_number'<\/span>: <span class=\"hljs-number\">1<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"14\">        <span class=\"hljs-string\">'category'<\/span>: <span class=\"hljs-string\">'Title'<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"15\">    }),<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"16\">    Document(page_content = <span class=\"hljs-string\">'This is paragraph 1'<\/span>, metadata = {<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"17\">        <span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"18\">        <span class=\"hljs-string\">'filename'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"19\">        <span class=\"hljs-string\">'filetype'<\/span>: <span class=\"hljs-string\">'application\/vnd.openxmlformats-officedocument.wordprocessingml.document'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"20\">        <span class=\"hljs-string\">'page_number'<\/span>: <span 
class=\"hljs-number\">1<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"21\">        <span class=\"hljs-string\">'category'<\/span>: <span class=\"hljs-string\">'NarrativeText'<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"22\">    }),<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"23\">    Document(page_content = <span class=\"hljs-string\">'Heading 2'<\/span>, metadata = {<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"24\">        <span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"25\">        <span class=\"hljs-string\">'filename'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"26\">        <span class=\"hljs-string\">'filetype'<\/span>: <span class=\"hljs-string\">'application\/vnd.openxmlformats-officedocument.wordprocessingml.document'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"27\">        <span class=\"hljs-string\">'page_number'<\/span>: <span class=\"hljs-number\">1<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"28\">        <span class=\"hljs-string\">'category'<\/span>: <span class=\"hljs-string\">'Title'<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"29\">    }),<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"30\">    Document(page_content = <span class=\"hljs-string\">'This is paragraph 2'<\/span>, metadata = {<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"31\">        <span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"32\">        <span 
class=\"hljs-string\">'filename'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"33\">        <span class=\"hljs-string\">'filetype'<\/span>: <span class=\"hljs-string\">'application\/vnd.openxmlformats-officedocument.wordprocessingml.document'<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"34\">        <span class=\"hljs-string\">'page_number'<\/span>: <span class=\"hljs-number\">1<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"35\">        <span class=\"hljs-string\">'category'<\/span>: <span class=\"hljs-string\">'NarrativeText'<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"36\">    })<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"37\">]<\/span>\n<\/code><\/pre>\n<p>With the default <code>mode=\"single\"<\/code>, the loader returns a single Document containing all of the text in the source document:<\/p>\n<pre><code class=\"hljs language-python code-block-extension-codeShowNum\" lang=\"python\"><span class=\"code-block-extension-codeLine\" data-line-num=\"1\">[<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"2\">\tDocument(<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"3\">\t\tpage_content=<span class=\"hljs-string\">'Title Text\\n\\nHeading 1\\n\\nThis is paragraph 1\\n\\nHeading 2\\n\\nThis is paragraph 2'<\/span>, <\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"4\">\t\tmetadata={<span class=\"hljs-string\">'source'<\/span>: <span class=\"hljs-string\">'test_doc.docx'<\/span>}<\/span>\n<span 
class=\"code-block-extension-codeLine\" data-line-num=\"5\">\t)<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"6\">]<\/span>\n<\/code><\/pre>\n<p>To summarize, in <code>\"single\"<\/code> mode the elements are joined with the <code>\"\\n\\n\"<\/code> separator. This is also the default separator of the character text splitter, which we cover next when we introduce text splitters.<\/p>","protected":false},"excerpt":{"rendered":"<p>Transform loaders work like the `TextLoader` covered in the previous part: they convert an input format into our Document format. `LangChain` has a growing number of transform loaders<\/p>\n","protected":false},"author":1,"featured_media":8279,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"rank_math_title":"","rank_math_description":"","rank_math_focus_keyword":"","views":"4","footnotes":""},"categories":[3],"tags":[127,128,129,136,126],"collection":[],"class_list":["post-1405","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-fenlei2","tag-ai","tag-128","tag-129","tag-136","tag-gpt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/posts\/1405","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/comments?post=1405"}],"version-history":[{"count":0,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/post
s\/1405\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/media\/8279"}],"wp:attachment":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/media?parent=1405"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/categories?post=1405"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/tags?post=1405"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/collection?post=1405"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}