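A friend recently asked me how to search directory names, file names, and file contents across a large volume of files on disk. The solution had to be simple, efficient, easy to maintain, and cheap. After some thought I decided that using ES to index and search the documents was a suitable choice, so I wrote some code to implement it. The design and the implementation are described below.
Overall architecture
Because the files are spread across different machines, the system is built around disk-scanning agents: the scanning service is deployed as an agent on each server that hosts a target disk and runs there as a scheduled task, while all index entries are written to one shared ES, which should itself be deployed in a distributed, highly available configuration. The search service is deployed together with the scanning agent, which keeps the architecture simple and lets it scale out.
Deploying ES
ES (Elasticsearch) is the only third-party software this project depends on. It can be deployed with Docker as follows: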
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2
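Once the container is running, open http://localhost:9200 in a browser. If the page loads and shows the ES node information, the deployment succeeded.
Project structure
Dependencies
Besides the basic Spring Boot starters, the project needs the ES client packages and jmimemagic for file type detection. The code below also uses Lombok, Apache POI, and PDFBox, whose dependency entries are not shown here.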
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>
    <dependency>
        <groupId>io.searchbox</groupId>
        <artifactId>jest</artifactId>
        <version>5.3.3</version>
    </dependency>
    <dependency>
        <groupId>net.sf.jmimemagic</groupId>
        <artifactId>jmimemagic</artifactId>
        <version>0.1.4</version>
    </dependency>
</dependencies>
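Configuration file
The ES address goes into application.yml. To keep the program simple, the root directory to scan (index-root) is configured there as well; the scan task recursively walks every indexable file under that directory.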
server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace
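Index document structure
Because the directory, the file name, and the file body all have to be searchable, each of them is defined as an index field. The id field carries the @JestId annotation that the ES client (Jest) requires.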
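package com.crazyice.lee.accumulation.search.data;
import io.searchbox.annotations.JestId;
import lombok.Data;
@Data
public class Article {
    @JestId
    private Integer id;
    // Holds the parent directory of the file (see ReadFileContent.Read)
    private String author;
    // File name
    private String title;
    // file:// URL of the file, including the IP of the server that hosts it
    private String path;
    // Extracted text of the file
    private String content;
    // MD5 of the file content, used to detect files that are already indexed
    private String fileFingerprint;
    // Highlighted fragments filled in by SearchController and shown by the result template
    private String highlightContent;
}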
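Scanning the disk and building the index
Since every file under the configured directory has to be scanned, the directory is traversed recursively, and files that have already been processed are marked so that they are not processed again. File types can be identified in one of two ways: a more precise check based on the file content (Magic), or a rough check based on the file extension. This part is the core component of the whole system.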
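The author's fileType(...) helper is not shown in this post. As a rough illustration of the two detection modes, here is a minimal standard-library sketch. The class name FileTypeSketch, its method names, and the returned labels are all hypothetical; the real project uses jmimemagic for the content-based mode.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper for illustration; not the author's fileType(...) method.
final class FileTypeSketch {

    // Rough detection: lower-cased text after the last dot, or "" if there is none.
    static String byExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Content-based detection from the first bytes of a file.
    // Only three well-known signatures are checked here.
    static String byMagic(byte[] head, int len) {
        if (startsWith(head, len, 0x25, 0x50, 0x44, 0x46)) {
            return "pdf";   // "%PDF"
        }
        if (startsWith(head, len, 0x50, 0x4B, 0x03, 0x04)) {
            return "zip";   // ZIP container; .docx is one, but so are many other formats
        }
        if (startsWith(head, len, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return "ole2";  // OLE2 compound file, the container of legacy .doc
        }
        return "unknown";
    }

    // Reads up to 8 bytes from the file and classifies them.
    static String byMagic(Path file) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        }
        return byMagic(head, n);
    }

    private static boolean startsWith(byte[] data, int len, int... signature) {
        if (len < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The extension check is cheap but trusts the file name; the signature check reads the file but survives renamed files, which is the trade-off the paragraph above describes.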
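A small trick
Compute the MD5 of each file's content and store it in an ES index field as the file's fingerprint. Whenever the index is rebuilt, check whether that MD5 already exists; if it does, the file does not need to be indexed again. This prevents duplicate index entries and avoids re-indexing unchanged files after the system restarts.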
// Check whether the file has already been indexed
private JSONObject isIndex(File file) {
    JSONObject result = new JSONObject();
    // Use the MD5 of the file as its fingerprint and look that fingerprint up in the index
    String fileFingerprint = Md5CaculateUtil.getMD5(file);
    result.put("fileFingerprint", fileFingerprint);
    // Default to "not indexed" so the key is always present, even if the search fails
    result.put("isIndex", false);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
    Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex("diskfile").addType("files").build();
    try {
        // Run the search
        SearchResult searchResult = jestClient.execute(search);
        // getTotal() can be null when the request did not succeed
        if (searchResult.isSucceeded() && searchResult.getTotal() != null && searchResult.getTotal() > 0) {
            result.put("isIndex", true);
        }
    } catch (IOException e) {
        log.error("{}", e.getLocalizedMessage());
    }
    return result;
}
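The Md5CaculateUtil.getMD5 helper used above is not shown in this post. A minimal sketch of how such a fingerprint could be computed with the standard library follows; the class name Md5Sketch and its method names are hypothetical, and the author's helper may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the author's Md5CaculateUtil.
final class Md5Sketch {

    // Streams the input through MD5 and returns the digest as 32 lowercase hex characters.
    static String md5Hex(InputStream in) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Convenience overload for a file on disk.
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Hex(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Streaming the file in fixed-size chunks keeps memory use constant regardless of file size. MD5 is adequate here as a change detector; it is not meant as a security measure.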
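Reading different file types through an abstract class
Each file type is read differently, but the processing logic is largely the same, so an abstract class is used: the shared logic lives in the parent class and the type-specific handling in the corresponding subclass. The parent class converts a file into an index object and normalizes the content format; each subclass reads the file content into a String.
1. Abstract parent class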
package com.crazyice.lee.accumulation.search.inter;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import java.io.File;
public abstract class ReadFileContent {
    // Convert a file into an index object
    public Article Read(File file, String serviceIP) {
        Article article = new Article();
        article.setTitle(file.getName());
        // The author field stores the directory that contains the file
        article.setAuthor(file.getParent());
        article.setPath("file://" + serviceIP + ":" + file.getPath());
        article.setContent(readToString(file));
        article.setFileFingerprint(Md5CaculateUtil.getMD5(file));
        return article;
    }
    public String charFilter(String s) {
        if (s == null || s.isEmpty()) {
            return "";
        }
        // Replace each run of line breaks with <br> and each tab with a space for display in a web page
        return s.replaceAll("(\r\n|\r|\n)+", "<br>").replaceAll("\t", " ");
    }
    // Read the file content into a String
    public abstract String readToString(File file);
}
2. Subclass for reading text files
package com.crazyice.lee.accumulation.search.impl;
import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
@Slf4j
public class TxtFile extends ReadFileContent {
    @Override
    public String readToString(File file) {
        String text = "";
        try {
            // Decode the whole file in one step: decoding fixed-size chunks separately
            // can split a multi-byte UTF-8 character at a chunk boundary and corrupt it
            text = new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(text);
    }
}
3. Subclass for reading .doc files
package com.crazyice.lee.accumulation.search.impl;
import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.openxml4j.util.ZipSecureFile;
import java.io.File;
import java.io.FileInputStream;
@Slf4j
public class DocFile extends ReadFileContent {
    @Override
    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        // Use the WordExtractor class of the HWPF component to extract the text of the Word document
        try (FileInputStream in = new FileInputStream(file)) {
            ZipSecureFile.setMinInflateRatio(-1.0d);
            WordExtractor extractor = new WordExtractor(in);
            result.append(extractor.getText());
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
4. Subclass for reading .docx files
package com.crazyice.lee.accumulation.search.impl;
import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.openxml4j.util.ZipSecureFile;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.File;
import java.io.FileInputStream;
@Slf4j
public class DocxFile extends ReadFileContent {
    @Override
    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        // Relax the inflate-ratio check before the document is opened; set afterwards it
        // would not apply to this file. Note that -1 turns off the zip-bomb protection.
        ZipSecureFile.setMinInflateRatio(-1.0d);
        try (FileInputStream in = new FileInputStream(file); XWPFDocument doc = new XWPFDocument(in)) {
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            result.append(extractor.getText());
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
5. Subclass for reading .pdf files
package com.crazyice.lee.accumulation.search.impl;
import com.crazyice.lee.accumulation.search.inter.ReadFileContent;
import lombok.extern.slf4j.Slf4j;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
@Slf4j
public class PdfFile extends ReadFileContent {
    @Override
    public String readToString(File file) {
        StringBuilder result = new StringBuilder();
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);
            // PDFTextStripper page numbers are 1-based and inclusive. With the default
            // range, one call extracts every page exactly once.
            result.append(stripper.getText(document));
        } catch (Exception e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return charFilter(result.toString());
    }
}
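The post does not show how the scanner picks a reader subclass for a given file. One way to do it is a lookup table keyed by file extension, sketched below. ReaderRegistrySketch and its methods are hypothetical names, and plain functions stand in for the reader classes so that the sketch compiles on its own.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical dispatch table; the author's createIndex(...) may select readers differently.
final class ReaderRegistrySketch {

    private final Map<String, Function<File, String>> readers = new HashMap<>();

    // Associates a lower-cased extension with a function that reads a file into a String.
    void register(String extension, Function<File, String> reader) {
        readers.put(extension.toLowerCase(Locale.ROOT), reader);
    }

    // Returns the extracted text, or empty if no reader is registered for the extension.
    Optional<String> read(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        String extension = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<File, String> reader = readers.get(extension);
        return reader == null ? Optional.empty() : Optional.ofNullable(reader.apply(file));
    }
}
```

With the classes above, registration could look like registry.register("pdf", new PdfFile()::readToString). Supporting a new file type then means adding one subclass and one register call, with no change to the scanning code.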
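Building the index in parallel with multiple threads
The disk scan runs in two steps.
1. Recursively scan all indexable files under the given directory and collect the files to process in a List.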
// Walk every file under the given directory
public void find(String pathName) {
    // File object for pathName
    File dirFile = new File(pathName);
    // Skip anything we are not allowed to read
    if (!dirFile.canRead()) {
        log.info("cannot read: {}", pathName);
        return;
    }
    if (!dirFile.isDirectory()) {
        String fileType = fileType(dirFile, JUDGE);
        if (FILETYPE.contains(fileType)) {
            destFile.add(dirFile);
        }
    } else {
        // Names of all files and subdirectories; list() returns null on an I/O error
        String[] fileList = dirFile.list();
        if (fileList == null) {
            return;
        }
        for (String subFile : fileList) {
            File file = new File(dirFile.getPath(), subFile);
            if (!file.canRead()) {
                continue;
            }
            if (file.isDirectory()) {
                // Do not follow links
                if ("link".equals(fileType(file, JUDGE))) continue;
                // Recurse into the subdirectory
                try {
                    find(file.getCanonicalPath());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
            } else {
                // Skip temporary files, whose names start with ~$
                if (file.getName().startsWith("~$")) continue;
                if (FILETYPE.contains(fileType(file, JUDGE))) {
                    destFile.add(file);
                }
            }
        }
    }
    log.info("Files scanned so far: {}", destFile.size());
}
2. Process the collected files in parallel with a stream
// Process the files in parallel with a stream
public void doneFile(String method,Boolean onlyFileType){
destFile.parallelStream().forEach(file -> createIndex(file,method,onlyFileType));
}
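As an alternative to hand-written recursion, the first step can also be expressed with java.nio's Files.walk. The sketch below is an assumption-laden illustration, not the author's code: WalkSketch and findIndexable are hypothetical names, and file types are judged by extension only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical alternative to the recursive find(...) method.
final class WalkSketch {

    // Collects readable regular files under root whose extension is in the given set.
    static List<Path> findIndexable(Path root, Set<String> extensions) {
        // Files.walk does not follow symbolic links unless FOLLOW_LINKS is passed
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    // Skip temporary files, whose names start with ~$
                    .filter(p -> !p.getFileName().toString().startsWith("~$"))
                    .filter(p -> extensions.contains(extensionOf(p)))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String extensionOf(Path p) {
        String name = p.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

One caveat: Files.walk aborts with an UncheckedIOException when it meets a directory it cannot open, whereas the recursive version skips unreadable entries. For disks with mixed permissions, Files.walkFileTree with a visitor that ignores failures may be the better fit.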
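Scan task
A scheduled task scans the configured directory so that the index grows incrementally as files are added. Each run executes the two steps above in order: collect the files, then index them in parallel across multiple threads.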
package com.crazyice.lee.accumulation.search.service;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
@Configuration
@Component
@Slf4j
public class CreateIndexTask {
    @Autowired
    private DirectoryRecurse directoryRecurse;
    @Value("${index-root}")
    private String indexRoot;
    // Runs at second 0 of minutes 5 and 35 of every hour.
    // A * in the seconds field would fire every second during those minutes.
    @Scheduled(cron = "0 5/30 * * * ?")
    private void addIndex() {
        log.info("Root directory: {}", indexRoot);
        directoryRecurse.find(indexRoot);
        directoryRecurse.doneFile("ext", false);
        // Sort the file types by frequency
        List<Map.Entry<String, Integer>> list = new ArrayList<>(directoryRecurse.getFileTypes().entrySet());
        // Descending order
        Collections.sort(list, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));
        log.info("File types: {}", list);
        // Free the collected state for the next run
        directoryRecurse.getDestFile().clear();
        directoryRecurse.getFileTypes().clear();
    }
}
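The task reads a per-type counter (getFileTypes()) after a parallel pass over the files. How DirectoryRecurse maintains that counter is not shown; if it is updated from inside the parallel stream, it has to be thread-safe. A minimal sketch under that assumption, with hypothetical names (TypeCountSketch, countTypes, byFrequency):

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical illustration of counting file types safely from a parallel stream.
final class TypeCountSketch {

    // Counts file names per extension; merge() on ConcurrentHashMap is atomic,
    // so concurrent updates from several worker threads are not lost.
    static Map<String, Integer> countTypes(List<String> fileNames) {
        Map<String, Integer> counts = new ConcurrentHashMap<>();
        fileNames.parallelStream()
                .forEach(name -> counts.merge(extensionOf(name), 1, Integer::sum));
        return counts;
    }

    // Returns the entries sorted by count, highest first.
    static List<Map.Entry<String, Integer>> byFrequency(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    private static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

A plain HashMap with get-then-put would lose increments under parallelStream, which is why the sketch uses ConcurrentHashMap.merge.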
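Search service
The search service and its UI are implemented with Thymeleaf templates. Matched keywords are highlighted before the results are handed to the front-end UI.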
package com.crazyice.lee.accumulation.search.web;
import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
@Controller
@Slf4j
class SearchController {
    @Autowired
    private JestClient jestClient;
    @GetMapping("/")
    public String index() {
        return "index";
    }
    @RequestMapping(value = "/search", method = RequestMethod.GET)
    public String search(@RequestParam String keyword, Model model) {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));
        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);
        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);
        log.info("Search request: {}", searchSourceBuilder.toString());
        // Build the search against the diskfile index
        Search search = new Search.Builder(searchSourceBuilder.toString()).addIndex("diskfile").addType("files").build();
        try {
            // Run the search
            SearchResult result = jestClient.execute(search);
            List<Article> articles = new ArrayList<>();
            result.getHits(Article.class).forEach((value) -> {
                // Concatenate the highlighted content fragments, if any
                if (value.highlight != null && value.highlight.get("content") != null) {
                    value.source.setHighlightContent(String.join("", value.highlight.get("content")));
                }
                articles.add(value.source);
            });
            model.addAttribute("articles", articles);
            model.addAttribute("keyword", keyword);
            return "search";
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return "search";
    }
}
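The controller passes the user's keyword straight into a query_string query, which treats characters such as quotes, colons, and parentheses as query syntax, so a keyword like a Windows path or an unbalanced quote can produce a parse error. If the query operators are not needed, the keyword can be escaped first. The sketch below follows the reserved-character list in the Elasticsearch query_string documentation; QueryEscapeSketch is a hypothetical helper and is not part of the original project.

```java
// Hypothetical helper; apply as QueryBuilders.queryStringQuery(QueryEscapeSketch.escape(keyword)).
final class QueryEscapeSketch {

    // Characters that query_string treats as syntax and that can be backslash-escaped.
    // && and || are covered by escaping each & and | individually.
    private static final String RESERVED = "+-=&|!(){}[]^\"~*?:\\/";

    static String escape(String keyword) {
        StringBuilder out = new StringBuilder(keyword.length() + 8);
        for (int i = 0; i < keyword.length(); i++) {
            char c = keyword.charAt(i);
            // < and > cannot be escaped in query_string, so they are dropped
            if (c == '<' || c == '>') {
                continue;
            }
            if (RESERVED.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Escaping turns the input into literal text, so users lose operators such as AND, wildcards, and field:value. Whether that trade-off is acceptable depends on who uses the search page.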
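Testing the search REST results
Generating the UI with Thymeleaf
The Thymeleaf template engine is integrated to render the search results directly as web pages. There are two templates, the main search page and the results page, served through the @Controller annotation and the Model object.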
<body>
<div class="container">
    <div class="header">
        <form action="./search" class="parent">
            <input type="text" name="keyword" th:value="${keyword}">
            <input type="submit" value="Search">
        </form>
    </div>
    <div class="content" th:each="article,memberStat:${articles}">
        <div class="c_left">
            <p class="con-title" th:text="${article.title}"></p>
            <p class="con-path" th:text="${article.path}"></p>
            <p class="con-preview" th:utext="${article.highlightContent}"></p>
            <a class="con-more">More</a>
        </div>
        <div class="c_right">
            <p class="con-all" th:utext="${article.content}"></p>
        </div>
    </div>
    <script language="JavaScript">
        document.querySelectorAll('.con-more').forEach(item => {
            item.onclick = () => {
                item.style.cssText = 'display: none';
                item.parentNode.querySelector('.con-preview').style.cssText = 'max-height: none;';
            }});
    </script>
</div>
</body>