Using Kernel Memory's Decoders to Extract Text from Files, Web Pages, and More

2025-03-27

Introduction

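In the earlier post, Using Kernel Memory's TextChunker to Split Text into Chunks, we split text into segments.
But how do we get the text out of a file in the first place?
Kernel Memory also ships basic decoders for exactly this purpose.
Below, we build an ExtractFile method that pulls the text out of a file or a web page.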

Implementation

1. Add the Microsoft.KernelMemory NuGet package

2. Create an ExtractFile method

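// Place these using directives at the top of the file. The namespaces reflect the
// Microsoft.KernelMemory package layout as assumed here and may differ between versions.
using System.Text;
using Microsoft.Extensions.Logging;
using Microsoft.KernelMemory.DataFormats;
using Microsoft.KernelMemory.DataFormats.Office;
using Microsoft.KernelMemory.DataFormats.Pdf;
using Microsoft.KernelMemory.DataFormats.Text;
using Microsoft.KernelMemory.DataFormats.WebPages;
using Microsoft.KernelMemory.Pipeline;
using Microsoft.KernelMemory.Text;

// Suppress the KMEXP00 diagnostic raised by Kernel Memory APIs marked as experimental.
#pragma warning disable KMEXP00
// Never assigned in this sample, so it stays null; the decoder constructors take an optional logger factory.
private static ILoggerFactory? loggerFactory;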

static async Task<string> ExtractFile(string docPath, bool isUrl = false)
{
    var mimeTypeDetection = new MimeTypesDetection();
    string mimeType;
    BinaryData? fileBinary;
    if (isUrl)
    {
        var webscraper = new WebScraper();
        var urlDownloadResult = await webscraper.GetContentAsync(docPath);
        if (!urlDownloadResult.Success)
        {
            Console.WriteLine(urlDownloadResult.Error);
            return "";
        }
        mimeType = urlDownloadResult.ContentType;
        fileBinary = urlDownloadResult.Content;
    }
    else
    {
        // Local file: look up the MIME type for the file, then load its bytes.
        mimeType = mimeTypeDetection.GetFileType(docPath);
        byte[] fileBytes = await File.ReadAllBytesAsync(docPath);
        fileBinary = System.BinaryData.FromBytes(fileBytes);
    }
    var msExcelDecoderConfig = new MsExcelDecoderConfig();
    var msPowerPointDecoderConfig = new MsPowerPointDecoderConfig();
    var decoders = new List<IContentDecoder>
    {
        new TextDecoder(loggerFactory),
        new HtmlDecoder(loggerFactory),
        new MarkDownDecoder(loggerFactory),
        new PdfDecoder(loggerFactory),
        new MsWordDecoder(loggerFactory),
        new MsExcelDecoder(msExcelDecoderConfig, loggerFactory),
        new MsPowerPointDecoder(msPowerPointDecoderConfig, loggerFactory),
        //new ImageDecoder(ocrEngine, loggerFactory),
    };
    // Pick the last registered decoder that supports this MIME type.
    var decoder = decoders.LastOrDefault(d => d.SupportsMimeType(mimeType));
    if (decoder is null)
    {
        Console.WriteLine($"Cannot read files of type {mimeType}");
        return "";
    }

    // Decode the binary content into sections, then join them into a single string.
    var content = await decoder.DecodeAsync(fileBinary);
    Console.WriteLine("Extracted text follows .....");
    var textBuilder = new StringBuilder();
    foreach (var section in content.Sections)
    {
        var sectionContent = section.Content.Trim();
        if (string.IsNullOrEmpty(sectionContent)) { continue; }

        textBuilder.Append(sectionContent);

        // Add a clean page separation
        if (section.SentencesAreComplete)
        {
            textBuilder.AppendLineNix();
            textBuilder.AppendLineNix();
        }
    }
    var fileText = textBuilder.ToString().Trim();
    return fileText;
}
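
The section-joining loop at the end of ExtractFile can be looked at in isolation. The following is a minimal sketch, not Kernel Memory code: Section and SectionJoiner are hypothetical stand-ins introduced only for illustration, and a plain '\n' is appended on the assumption that AppendLineNix() writes a Unix-style newline.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Usage: the whitespace-only section is skipped; the others are separated by a blank line.
var joined = SectionJoiner.Join(new[]
{
    new Section("  Page one.  ", true),
    new Section("   ", true),
    new Section("Page two.", true),
});
Console.WriteLine(joined);

// Hypothetical stand-in for a decoded section; NOT the Kernel Memory type.
public record Section(string Content, bool SentencesAreComplete);

public static class SectionJoiner
{
    public static string Join(IEnumerable<Section> sections)
    {
        var sb = new StringBuilder();
        foreach (var section in sections)
        {
            // Trim each section and skip the ones that carry no text.
            var text = section.Content.Trim();
            if (string.IsNullOrEmpty(text)) { continue; }

            sb.Append(text);

            // Sections that end on a complete sentence are followed by a blank line.
            // '\n' is used here in place of the AppendLineNix() extension (assumption).
            if (section.SentencesAreComplete)
            {
                sb.Append('\n');
                sb.Append('\n');
            }
        }

        // The final Trim removes the separator left after the last section.
        return sb.ToString().Trim();
    }
}
```

Sections whose sentences are not complete are concatenated directly with the next one, which is why the separator is conditional rather than appended after every section.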

3. Test the extraction

var docPath = @"new1.docx";
var docBody = await ExtractFile(docPath);
Console.WriteLine($"{docPath} ========");
Console.Write(docBody);

var pdfPath = @"pdf1.pdf";
var pdfBody = await ExtractFile(pdfPath);
Console.WriteLine($"{pdfPath} ========");
Console.Write(pdfBody);

var urlPath = "https://www.taisugar.com.tw/resting/hualian/CP2.aspx?n=12036";
var urlBody = await ExtractFile(urlPath, true);
Console.WriteLine($"{urlPath} ========");
Console.Write(urlBody);
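
The calls above set isUrl by hand. When inputs come from a mixed list, a small helper could infer the flag instead. This is only a sketch: InputKind and IsHttpUrl are hypothetical names that are not part of Kernel Memory, and the check relies solely on System.Uri.

```csharp
using System;

Console.WriteLine(InputKind.IsHttpUrl("https://www.taisugar.com.tw/resting/hualian/CP2.aspx?n=12036")); // True
Console.WriteLine(InputKind.IsHttpUrl("new1.docx")); // False

// Hypothetical helper, not part of Kernel Memory.
public static class InputKind
{
    // Returns true only for absolute http/https URLs; anything else is treated as a file path.
    public static bool IsHttpUrl(string input) =>
        Uri.TryCreate(input, UriKind.Absolute, out var uri)
        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
}
```

With such a helper, a call could be written as await ExtractFile(path, InputKind.IsHttpUrl(path)) for any entry in the list.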

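The results are as follows.

Reading the docx file

Reading the pdf file. Notice that the table under "請假天數，核准主管權限及事前天數辨法如下:" (roughly, "Leave days, approving supervisor authority, and advance-notice days are as follows:") ends up much further down in the extracted text.

Reading the web page content

Reading a multi-column PDF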

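Note: Judging from these results, tables and multi-column layouts in PDFs are not handled very well. Keeping documents single-column and simple should therefore give better RAG results.
Note: Even though the table content is not extracted cleanly, the answers the LLM gave in Quickly Building a RAG Service with Kernel Memory and MSSQL were still correct.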

References

Extract Text From a Multi-Column Document Using PyMuPDF in Python
Quickly Building a RAG Service with Kernel Memory and MSSQL
