{"id":1404,"date":"2024-05-09T17:31:56","date_gmt":"2024-05-09T17:31:56","guid":{"rendered":"https:\/\/www.nicekj.com\/?p=1404"},"modified":"2024-05-09T17:31:56","modified_gmt":"2024-05-09T17:31:56","slug":"yiwengaodonglangchain-document-loaderyi","status":"publish","type":"post","link":"https:\/\/www.nicekj.com\/yiwengaodonglangchain-document-loaderyi.html","title":{"rendered":"\ud83d\udd25\ud83d\udd25\ud83d\udd25LangChain Document Loaders Explained in One Article (Part 1)"},"content":{"rendered":"<h2 data-id=\"heading-0\">Preface<\/h2>\n<p>Language models such as GPT-3 have been trained on large amounts of data, including hundreds of GB and billions of words. As a result, they have a solid knowledge base and perform well in areas such as history and science. These models do have limitations, however. Once they reach a certain point in training, they cannot absorb any new information unless they have access to the internet. They also cannot access the large volume of data held in private and corporate documents.<\/p>\n<p>To address this problem, it is essential to understand the concept of an \u201cindex\u201d. Indexes structure documents so that LLMs can work with them. LangChain provides four kinds of tools for building indexes:<\/p>\n<ul>\n<li>Document loading (Document Loaders)<\/li>\n<li>Text splitting (Text Splitters)<\/li>\n<li>Vector storage (Vector 
Stores)<\/li>\n<li>Retrieval (Retrievers)<\/li>\n<\/ul>\n<p>This guide explains LangChain document loaders (Document Loaders) in depth so that you can make full use of them in your LLM applications. As the name suggests, document loaders load documents from a variety of sources. They are versatile tools that handle many data formats and convert them into a standard structure that language models can process easily.<\/p>\n<h2 data-id=\"heading-1\">Understanding LangChain Document Loaders<\/h2>\n<p>The first concept to understand is what LangChain calls a document (Document). A document is very simple and has two fields:<\/p>\n<ul>\n<li>page_content (string): the raw text of the document<\/li>\n<li>metadata (dict): a key\/value store for any metadata about the text (source URL, author, and so on)<\/li>\n<\/ul>\n<p>Let's look at the most basic document loader, TextLoader, which opens a text file and loads its text into a document.<\/p>\n<pre><code class=\"hljs language-python code-block-extension-codeShowNum\" lang=\"python\"><span class=\"code-block-extension-codeLine\" 
data-line-num=\"1\"><span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title class_\">TextLoader<\/span>(<span class=\"hljs-title class_ inherited__\">BaseLoader<\/span>):<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"2\">    <span class=\"hljs-string\">\"\"\"Load text files.\"\"\"<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"3\"><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"4\">    <span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">__init__<\/span>(<span class=\"hljs-params\"><\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"5\">        self,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"6\">        file_path: <span class=\"hljs-built_in\">str<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"7\">        encoding: <span class=\"hljs-type\">Optional<\/span>[<span class=\"hljs-built_in\">str<\/span>] = <span class=\"hljs-literal\">None<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"8\">        autodetect_encoding: <span class=\"hljs-built_in\">bool<\/span> = <span class=\"hljs-literal\">False<\/span>,<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"9\">    ):<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"10\">        <span class=\"hljs-string\">\"\"\"Initialize with file path.\"\"\"<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"11\">        self.file_path = file_path<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"12\">        self.encoding = encoding<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"13\">        self.autodetect_encoding = autodetect_encoding<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"14\"><\/span>\n<span class=\"code-block-extension-codeLine\" 
data-line-num=\"15\">    <span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">load<\/span>(<span class=\"hljs-params\">self<\/span>) -&gt; <span class=\"hljs-type\">List<\/span>[Document]:<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"16\">        <span class=\"hljs-string\">\"\"\"Load from file path.\"\"\"<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"17\">        text = <span class=\"hljs-string\">\"\"<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"18\">        <span class=\"hljs-keyword\">try<\/span>:<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"19\">            <span class=\"hljs-keyword\">with<\/span> <span class=\"hljs-built_in\">open<\/span>(self.file_path, encoding=self.encoding) <span class=\"hljs-keyword\">as<\/span> f:<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"20\">                text = f.read()<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"21\">        <span class=\"hljs-keyword\">except<\/span> UnicodeDecodeError <span class=\"hljs-keyword\">as<\/span> e:<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"22\">            <span class=\"hljs-keyword\">pass<\/span>  <span class=\"hljs-comment\"># code to handle decoding errors (elided in this excerpt)<\/span><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"23\">        <span class=\"hljs-keyword\">except<\/span> Exception <span class=\"hljs-keyword\">as<\/span> e:<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"24\">            <span class=\"hljs-keyword\">raise<\/span> RuntimeError(<span class=\"hljs-string\">f\"Error loading <span class=\"hljs-subst\">{self.file_path}<\/span>\"<\/span>) <span class=\"hljs-keyword\">from<\/span> e<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"25\"><\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"26\">        metadata = {<span 
class=\"hljs-string\">\"source\"<\/span>: self.file_path}<\/span>\n<span class=\"code-block-extension-codeLine\" data-line-num=\"27\">        <span class=\"hljs-keyword\">return<\/span> [Document(page_content=text, metadata=metadata)]<\/span>\n<\/code><\/pre>\n<p>TextLoader sets the document's page_content to the text of the file, and metadata stores the file path under the \u201csource\u201d key.<\/p>\n<p>As data sources become more complex, you will find that creating these documents takes more logic. Ultimately, the core goal is to convert data into this standard format so that it can be processed further in the indexing system.<\/p>\n<p>LangChain has three main types of document loaders: Transform, Public Datasets\/Services, and Proprietary Datasets\/Services.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Language models such as GPT-3 have been trained on large amounts of data, including hundreds of GB 
and billions of words. As a result, they have a solid knowledge base and perform well in areas such as history and science. These models do have limitations, however. Once they<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"rank_math_title":"","rank_math_description":"","rank_math_focus_keyword":"","views":"3","footnotes":""},"categories":[3],"tags":[126,127,128,129,136],"collection":[],"class_list":["post-1404","post","type-post","status-publish","format-standard","hentry","category-fenlei2","tag-gpt","tag-ai","tag-128","tag-129","tag-136"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/posts\/1404","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/comments?post=1404"}],"version-history":[{"count":0,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/posts\/1404\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/media?parent=1404"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/categories?post=1404"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/tags?post=1404"},{"taxonomy":"collection","embeddable":true,"href":"https:\/\/www.nicekj.com\/nicekj2024\/wp\/v2\/collection?post=1404"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}