Reading docx content in order with Java (text, tables, and images)
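Background:
While working on document parsing for a large language model application, I needed to parse Word files so that the model could answer questions precisely, and ran into some problems. The files contain text, tables, and images. Most of the Word-parsing articles I found online handle each content type separately, but I need to parse the content in document order and know exactly where each piece sits, so handling them separately clearly does not work. I started with Python's python-docx library, which had a few problems for this use case, and then switched to Java, where Apache POI turned out to be surprisingly handy. So here is what I learned; let's get started: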
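First, a look at the source document, which is indeed a bit tricky:
1. Complex text
2. Complex tables
Approach
I. Read the docx file with Apache POI; add the Maven dependencies: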
<!-- Apache POI -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- Additional jar required for handling Word documents -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.0</version>
</dependency>
<!-- Additional jar required for handling Word documents -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>4.1.0</version>
</dependency>
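Note: use version 4.1.0 for all three artifacts. I previously used 5.1.0 together with the poi-ooxml-full dependency and kept getting the error below. An error of this kind usually points to mismatched versions of the POI, XMLBeans, and schema jars on the classpath, so keep every POI artifact on the same version: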
java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTPictureBaseImpl.<init>(org.apache.xmlbeans.SchemaType, boolean)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getDeclaredConstructor(Class.java:2178)
at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.getJavaImplConstructor2(SchemaTypeImpl.java:1817)
at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedSubclass(SchemaTypeImpl.java:1961)
at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createUnattachedNode(SchemaTypeImpl.java:1950)
at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:1051)
at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:938)
at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1675)
at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2659)
at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2652)
at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:995)
at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2904)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:162)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:169)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:60)
at com.programmersupermarket.autocreateproject.common.util.poi.testRead.main(testRead.java:17)
https://blog.csdn.net/qq_16307345/article/details/79076657
https://zhuanlan.zhihu.com/p/556486563
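When an error like the one above appears, one quick check is to print which jar each of the involved classes was actually loaded from. The following is a minimal sketch that uses only the JDK; `JarLocator` and `locationOf` are hypothetical names introduced for illustration, and the class names you pass in are up to you.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
class JarLocator {

    /** Returns the code-source location, or a marker for JDK core classes that have none. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap/unknown)";
        }
        return source.getLocation().toString();
    }

    /** Looks the class up by name, so this helper compiles without POI on the classpath. */
    static String locationOf(String className) {
        try {
            return locationOf(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }
}
```

Calling `JarLocator.locationOf("org.apache.poi.xwpf.usermodel.XWPFDocument")` and `JarLocator.locationOf("org.apache.xmlbeans.XmlObject")` in the failing project should show whether jars from different releases are being mixed.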
II. Getting the content: method 1
1. Paragraphs and tables
public static void parseFile(String path) {
    // String path = "C:\\Users\\admin\\Desktop\\first.docx";
    // try-with-resources closes the stream and the document even if parsing fails
    try (FileInputStream fis = new FileInputStream(path);
         XWPFDocument document = new XWPFDocument(fis)) {
        // Iterate over all body elements (paragraphs and tables) in document order
        List<IBodyElement> bodyElements = document.getBodyElements();
        for (IBodyElement bodyElement : bodyElements) {
            if (bodyElement instanceof XWPFParagraph) {
                XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
                System.out.println(paragraph.getStyleID() + ":" + paragraph.getText());
            } else if (bodyElement instanceof XWPFTable) {
                System.out.println(((XWPFTable) bodyElement).getText());
            } else if (bodyElement instanceof XWPFPicture) {
                // Never reached: pictures are not body elements (see the note below)
                System.out.println(Arrays.toString(((XWPFPicture) bodyElement).getPictureData().getData()));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
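Iterating over the body elements this way yields paragraphs and tables, but never images. The only alternative at this level is the getAllPictures method, which returns every image but loses the order between paragraphs and images.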
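The reason shows up in the underlying XML: the direct children of `w:body` are paragraphs (`w:p`) and tables (`w:tbl`), while a picture is a `w:drawing` nested inside a run of some paragraph. The sketch below walks a heavily simplified `document.xml` string with the JDK DOM parser only, independent of POI. `BodyOrder` and `describe` are hypothetical names, and older documents that store pictures as VML (`w:pict`) are not covered.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper: lists the children of w:body in document order. */
class BodyOrder {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> describe(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));
            Element body = (Element) doc.getElementsByTagNameNS(W, "body").item(0);
            List<String> order = new ArrayList<>();
            for (Node child = body.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (!(child instanceof Element) || !W.equals(child.getNamespaceURI())) {
                    continue; // skip whitespace and foreign elements
                }
                Element element = (Element) child;
                if ("tbl".equals(element.getLocalName())) {
                    order.add("table");
                } else if ("p".equals(element.getLocalName())) {
                    // Pictures are descendants of the paragraph, not siblings of it
                    int drawings = element.getElementsByTagNameNS(W, "drawing").getLength();
                    order.add(drawings > 0 ? "paragraph with " + drawings + " picture(s)" : "paragraph");
                }
            }
            return order;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the document XML", e);
        }
    }
}
```

This is why an ordered traversal has to descend from each paragraph into its runs to find the images.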
2. Paragraphs, tables, and images
public static void parseFile2(String path) {
    // String path = "C:\\Users\\admin\\Desktop\\first.docx";
    // try-with-resources closes the stream and the document even if parsing fails
    try (FileInputStream fis = new FileInputStream(path);
         XWPFDocument document = new XWPFDocument(fis)) {
        // Iterate over all body elements (paragraphs and tables) in document order
        List<IBodyElement> bodyElements = document.getBodyElements();
        for (IBodyElement element : bodyElements) {
            if (element instanceof XWPFParagraph) {
                XWPFParagraph paragraph = (XWPFParagraph) element;
                String text = paragraph.getText();
                if (text != null && !text.isEmpty()) {
                    // Paragraph with text: handle the heading or body text
                    System.out.println(text);
                } else {
                    // Paragraph without text: walk its runs in order and pick up the images
                    paragraph.getIRuns().forEach(run -> {
                        if (run instanceof XWPFRun) {
                            XWPFRun xWPFRun = (XWPFRun) run;
                            for (XWPFPicture picture : xWPFRun.getEmbeddedPictures()) {
                                XWPFPictureData pictureData = picture.getPictureData();
                                // Note: the MIME type is hard-coded to PNG here
                                String base64Image = "<img src='data:image/png;base64," + Base64.getEncoder().encodeToString((pictureData.getData())) + "'/>";
                                System.out.println(base64Image);
                            }
                        }
                    });
                }
            } else if (element instanceof XWPFTable) {
                // Handle the table
                XWPFTable table = (XWPFTable) element;
                String text = table.getText();
                System.out.println(text);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Traversing this way reads the contents of the docx file in order.
https://blog.csdn.net/gloriaied/article/details/135405896
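One caveat with parseFile2: the data URI always declares `image/png`, although a docx can embed JPEG, GIF, and other formats. Below is a minimal sketch of building the tag from the file extension instead. It assumes the extension comes from the picture data (POI's `XWPFPictureData` offers `suggestFileExtension()` for that); `ImgTag`, `mimeFor`, and `toImgTag` are hypothetical names.

```java
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper: builds an inline <img> tag with a MIME type matching the image format. */
class ImgTag {

    /** Maps a file extension to a MIME type; unknown extensions fall back to a generic type. */
    static String mimeFor(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "jpg":
            case "jpeg":
                return "image/jpeg";
            case "gif":
                return "image/gif";
            case "bmp":
                return "image/bmp";
            case "svg":
                return "image/svg+xml";
            case "png":
                return "image/png";
            default:
                return "application/octet-stream";
        }
    }

    /** Encodes the raw image bytes as a base64 data URI inside an <img> tag. */
    static String toImgTag(byte[] data, String extension) {
        return "<img src='data:" + mimeFor(extension) + ";base64,"
                + Base64.getEncoder().encodeToString(data) + "'/>";
    }
}
```

Inside the loop this would be called as `ImgTag.toImgTag(pictureData.getData(), pictureData.suggestFileExtension())`. Formats such as EMF or WMF fall into the generic branch and will not render in a browser without conversion.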
III. Getting the content: method 2
1. Reading Word text
package docx;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.apache.poi.xwpf.usermodel.BodyElementType;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

/**
 * Reads text
 */
public class TextRead {
    public static void main(String[] args) throws Exception {
        readDocx();
    }

    public static void readDocx() throws Exception {
        // try-with-resources closes the stream and the document even if parsing fails
        try (InputStream is = new FileInputStream("test.docx");
             XWPFDocument xwpf = new XWPFDocument(is)) {
            List<IBodyElement> ibs = xwpf.getBodyElements();
            for (IBodyElement ib : ibs) {
                BodyElementType elementType = ib.getElementType();
                if (elementType == BodyElementType.TABLE) {
                    // Table
                    System.out.println("table" + ib.getPart());
                } else if (elementType == BodyElementType.PARAGRAPH) {
                    // Paragraph (other element types, such as content controls, are skipped)
                    XWPFParagraph para = (XWPFParagraph) ib;
                    System.out.println("It is a new paragraph...The indention is " + para.getFirstLineIndent());
                    List<XWPFRun> runs = para.getRuns();
                    System.out.println("run");
                    if (runs.size() <= 0) {
                        System.out.println("empty line");
                    }
                    for (XWPFRun run : runs) {
                        // A run without text may be an image
                        if (StringUtils.isEmpty(run.text())) {
                            if (run.getEmbeddedPictures().size() > 0) {
                                // The run is an image
                                System.out.println("image***" + run.getEmbeddedPictures());
                            } else {
                                System.out.println("objects:" + run.getCTR().getObjectList());
                                // Formula stored as a field code
                                if (run.getCTR().xmlText().indexOf("instrText") > 0) {
                                    System.out.println("there is an equation field");
                                }
                            }
                        } else {
                            System.out.println("===" + run.getCharacterSpacing() + run.text());
                        }
                    }
                }
            }
        }
    }
}
Note: the element type is what tells tables and paragraphs apart here:
BodyElementType elementType = ib.getElementType();
if (elementType.equals(BodyElementType.TABLE)) { // table
} else if (elementType.equals(BodyElementType.PARAGRAPH)) { // paragraph
}
https://blog.csdn.net/qq_38759137/article/details/130083388
2. Word paragraphs and formula XML tags
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.xwpf.usermodel.*;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.springframework.web.multipart.MultipartFile;
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
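/**
 * Word paragraphs and formula XML tags
 * Plain text:
 * <w:r> corresponds to one XWPFRun object
 * <w:t xml:space="preserve"> holds a piece of text in Word (possibly a single character)
 * Formulas:
 * <w:object> corresponds to one formula (only formulas are covered here; the tag can also hold an embedded Excel sheet, a PPT, and so on)
 * <v:shape> has a style attribute that gives the width and height at which the picture is displayed in Word
 * <v:imagedata> references the displayed picture (<v:imagedata> is a child of <v:shape>)
 * <o:OLEObject> references the binary file behind the formula picture (this binary is the most important part: without it, double-clicking the formula in Word cannot open the third-party formula editor)
 */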
/**
* @description:
* @author: Han LiDong
* @create: 2021/5/7 10:33
* @update: 2021/5/7 10:33
*/
@Slf4j
public class WordUtil {
    public static void main(String[] args) throws Exception {
        // MultipartFile file = null;
        // InputStream inputStream = file.getInputStream();
        // Collect the LaTeX expression of every formula; MathJax can render them on a page
        String path = "C:\\Users\\allen_sun\\Desktop\\first.docx";
        List<String> formulas = getFormulaMap(new FileInputStream(path));
        log.info("Parsed {} formulas", formulas.size());
        // Parse the Word file: formulas plus text
        wordAnalysis(new FileInputStream("C:\\D\\算法-β系数.docx"));
    }
    /**
     * Parses all content of a Word file (formulas and text)
     * @param inputStream
     * @throws Exception
     */
    public static void wordAnalysis(InputStream inputStream) throws Exception {
        XWPFDocument word = new XWPFDocument(inputStream);
        try {
            for (IBodyElement ibodyelement : word.getBodyElements()) {
                if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) { // paragraph
                    XWPFParagraph paragraph = (XWPFParagraph) ibodyelement;
                    // Parse the paragraph
                    String paragraphStr = parseParagraph(paragraph);
                } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) { // table
                    XWPFTable table = (XWPFTable) ibodyelement;
                    for (XWPFTableRow row : table.getRows()) { // row
                        for (XWPFTableCell cell : row.getTableCells()) { // cell
                            List<String> cellMath = new ArrayList<>(16);
                            for (XWPFParagraph paragraph : cell.getParagraphs()) { // paragraph
                                // Parse the paragraph inside the cell
                                String paragraphStr = parseParagraph(paragraph);
                                if (!"".equals(paragraphStr.trim())) {
                                    cellMath.add(paragraphStr);
                                }
                            }
                            log.info("The current cell has {} non-empty paragraphs", cellMath.size());
                        }
                    }
                }
            }
        } finally {
            word.close();
        }
    }
    /**
     * Extracts the formulas in a Word file (converted to LaTeX expressions)
     *
     * @param inputStream file stream
     * @return the LaTeX expressions in document order
     */
    public static List<String> getFormulaMap(InputStream inputStream) throws IOException, DocumentException {
        XWPFDocument word = new XWPFDocument(inputStream);
        // Stores the converted formulas in an ArrayList of strings
        List<String> mathMLList = new ArrayList<String>(16);
        try {
            for (IBodyElement ibodyelement : word.getBodyElements()) {
                if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) { // paragraph
                    XWPFParagraph paragraph = (XWPFParagraph) ibodyelement;
                    // Parse the paragraph
                    List<String> mathList = parseMathParagraph(paragraph);
                    mathMLList.addAll(mathList);
                } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) { // Word table
                    XWPFTable table = (XWPFTable) ibodyelement;
                    for (XWPFTableRow row : table.getRows()) {
                        for (XWPFTableCell cell : row.getTableCells()) {
                            for (XWPFParagraph paragraph : cell.getParagraphs()) {
                                // Parse the paragraph inside the cell
                                List<String> mathList = parseMathParagraph(paragraph);
                                mathMLList.addAll(mathList);
                            }
                        }
                    }
                }
            }
        } finally {
            word.close();
        }
        log.info("The document contains {} formulas in total", mathMLList.size());
        return mathMLList;
    }
    /**
     * Parses the formulas of a paragraph
     * @param xwpfParagraph
     * @throws DocumentException
     */
    public static List<String> parseMathParagraph(XWPFParagraph xwpfParagraph) throws DocumentException {
        CTP ctp = xwpfParagraph.getCTP();
        String xmlText = ctp.xmlText();
        List<String> mathList = new ArrayList<>();
        if (xmlText.contains("<m:oMath>")) {
            SAXReader saxReader = new SAXReader();
            // Parse the XML string into a document object (encode explicitly as UTF-8)
            Document doc = saxReader.read(new ByteArrayInputStream(xmlText.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
            // Get the root element
            Element root = doc.getRootElement();
            // A paragraph may contain several expressions
            List<Element> omMaths = root.selectNodes("//m:oMath"); // select the OMML nodes with XPath
            for (Element ele : omMaths) {
                /**
                 * OMML -> MathML -> LaTeX
                 * Office ships an XSL stylesheet in its install directory that converts OMML to MathML: OMML2MML.XSL
                 * MathML to LaTeX uses another stylesheet found online: mmltex.xsl
                 */
                String xml = ele.asXML();
                // OMML XML to MathML
                String mml = convertOMML2MML(xml);
                // MathML to LaTeX
                String latex = convertMML2Latex(mml);
                mathList.add(latex);
                log.info("LaTeX expression: {}", latex);
            }
        }
        return mathList;
    }
    /**
     * Parses a paragraph (text plus formulas)
     * @param xwpfParagraph
     * @throws DocumentException
     */
    public static String parseParagraph(XWPFParagraph xwpfParagraph) throws DocumentException {
        CTP ctp = xwpfParagraph.getCTP();
        String xmlText = ctp.xmlText();
        StringBuilder sb = new StringBuilder();
        // if (xmlText.contains("<m:oMath>")) {
        // Text content of the paragraph
        sb.append(xwpfParagraph.getParagraphText());
        // Formulas of the paragraph
        SAXReader saxReader = new SAXReader();
        // Parse the XML string into a document object (encode explicitly as UTF-8)
        Document doc = saxReader.read(new ByteArrayInputStream(xmlText.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
        // Get the root element
        Element root = doc.getRootElement();
        // A paragraph may contain several expressions
        List<Element> omMaths = root.selectNodes("//m:oMath"); // select the OMML nodes with XPath
        if (omMaths != null && !omMaths.isEmpty()) {
            for (Element ele : omMaths) {
                /**
                 * OMML -> MathML -> LaTeX
                 * Office ships an XSL stylesheet in its install directory that converts OMML to MathML: OMML2MML.XSL
                 * MathML to LaTeX uses another stylesheet found online: mmltex.xsl
                 */
                String xml = ele.asXML();
                // OMML XML to MathML
                String mathml = convertOMML2MML(xml);
                // MathML to LaTeX
                String latex = convertMML2Latex(mathml);
                sb.append(latex);
                log.info("LaTeX expression: {}", latex);
            }
        }
        log.info("Formula count: {}, parsed content: {}", omMaths.size(), sb.toString());
        return sb.toString();
    }
    /**
     * XSL converter
     * @param s formula XML string
     * @param xslpath path of the stylesheet (classpath resource or file system path)
     * @param uriResolver resolver for the files the XSL depends on
     * @return the transformation result
     */
    public static String xslConvert(String s, String xslpath, URIResolver uriResolver) {
        TransformerFactory tFac = TransformerFactory.newInstance();
        if (uriResolver != null) {
            tFac.setURIResolver(uriResolver);
        }
        // Try the classpath first; fall back to the file system, because an absolute path
        // such as the Office install directory is not a classpath resource
        InputStream xslStream = WordUtil.class.getResourceAsStream(xslpath);
        StreamSource xslSource = xslStream != null
                ? new StreamSource(xslStream)
                : new StreamSource(new File(xslpath));
        StringWriter writer = new StringWriter();
        try {
            Transformer t = tFac.newTransformer(xslSource);
            Source source = new StreamSource(new StringReader(s));
            Result result = new StreamResult(writer);
            t.transform(source, result);
        } catch (TransformerException e) {
            log.error(e.getMessage(), e);
        }
        return writer.getBuffer().toString();
    }
    /**
     * <p>Description: converts MathML to LaTeX</p>
     * @param mml MathML string
     * @return the LaTeX expression
     */
    public static String convertMML2Latex(String mml) {
        // Strip the XML declaration, if there is one
        int declEnd = mml.indexOf("?>");
        if (declEnd >= 0) {
            mml = mml.substring(declEnd + 2);
        }
        URIResolver r = new URIResolver() { // resolves the files that mmltex.xsl depends on
            @Override
            public Source resolve(String href, String base) throws TransformerException {
                InputStream inputStream = WordUtil.class.getResourceAsStream("/conventer/mml2tex/" + href);
                return new StreamSource(inputStream);
            }
        };
        String latex = xslConvert(mml, "/conventer/mml2tex/mmltex.xsl", r);
        if (latex != null && latex.length() > 1) {
            // Drop the first and last character of the result
            latex = latex.substring(1, latex.length() - 1);
        }
        return latex;
    }
    /**
     * <p>Description: converts Office math XML (OMML) to MathML</p>
     * @param xml formula XML
     * @return the MathML string
     */
    public static String convertOMML2MML(String xml) {
        // The conversion needs this stylesheet; it normally ships with a local Office installation, so locate it on your machine
        String result = xslConvert(xml, "C:\\Program Files (x86)\\Microsoft Office\\root\\Office16\\OMML2MML.XSL", null);
        return result;
    }
}
https://blog.csdn.net/han949417140/article/details/116521013
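The `xmlText.contains("<m:oMath>")` check in parseMathParagraph depends on the prefix being exactly `m` and on the tag carrying no attributes. A hedged alternative is to look the elements up by namespace URI. The sketch below does this with the JDK DOM parser rather than dom4j; `MathCounter` and `countOMath` are hypothetical names, and the input is assumed to be the XML text of one paragraph with its namespace declarations intact.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Hypothetical helper: counts Office Math (OMML) formulas regardless of the prefix used. */
class MathCounter {
    // Namespace of Office Math Markup Language elements such as oMath
    static final String M = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    static int countOMath(String paragraphXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for lookups by namespace URI
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)));
            // Matches only elements whose local name is exactly "oMath" (not "oMathPara")
            return doc.getElementsByTagNameNS(M, "oMath").getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }
}
```

A result greater than zero can replace the string check before the XSL conversion is run.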
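IV. Extended variant
My Word data not only has to be preprocessed and stored on the server but also displayed on a front-end page, so here is a simple back-end demonstration for anyone who needs it.
Note: the method below has not been split up or encapsulated, so please bear with it.
1. Processing method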
public static void parseFile3(String path) {
    // try-with-resources closes the stream and the document even if parsing fails
    try (FileInputStream fis = new FileInputStream(path);
         XWPFDocument document = new XWPFDocument(fis)) {
        // Iterate over all body elements (paragraphs and tables) in document order
        List<IBodyElement> bodyElements = document.getBodyElements();
        for (IBodyElement element : bodyElements) {
            if (element instanceof XWPFParagraph) {
                XWPFParagraph paragraph = (XWPFParagraph) element;
                String text = paragraph.getText();
                if (text != null && !text.isEmpty()) {
                    // Handle the heading or body text
                    System.out.println(text);
                } else {
                    // Walk the runs in order to pick up images
                    paragraph.getIRuns().forEach(run -> {
                        if (run instanceof XWPFFieldRun) {
                            // Field run (formula). XWPFFieldRun extends XWPFRun, so it has to be tested first.
                            // Placeholder: field runs are not processed yet
                            XWPFFieldRun fieldRun = (XWPFFieldRun) run;
                        } else if (run instanceof XWPFRun) {
                            XWPFRun xWPFRun = (XWPFRun) run;
                            // A run without text may be an image
                            if (StringUtils.isEmpty(xWPFRun.text())) {
                                if (xWPFRun.getEmbeddedPictures().size() > 0) {
                                    // The run is an image
                                    for (XWPFPicture picture : xWPFRun.getEmbeddedPictures()) {
                                        XWPFPictureData pictureData = picture.getPictureData();
                                        String base64Image = "<img src='data:image/png;base64," + Base64.getEncoder().encodeToString((pictureData.getData())) + "'/>";
                                        System.out.println(base64Image);
                                    }
                                } else {
                                    if (xWPFRun.getCTR().xmlText().indexOf("instrText") > 0) {
                                        System.out.println("there is an equation field");
                                    }
                                }
                            }
                        }
                    });
                }
            } else if (element instanceof XWPFTable) {
                // Handle the table: build an HTML table
                XWPFTable table = (XWPFTable) element;
                StringBuilder tableBuilder = new StringBuilder();
                tableBuilder.append("<table border ='1' >");
                // tableBuilder.append("<table style='border: 1px solid #000000' >");
                for (XWPFTableRow row : table.getRows()) {
                    tableBuilder.append("<tr>");
                    for (XWPFTableCell cell : row.getTableCells()) {
                        tableBuilder.append("<td");
                        String text = cell.getText();
                        // cell.getParagraphs().get(0).getIRuns()
                        String color = cell.getColor();
                        color = StringUtils.isEmpty(color) ? "#FFFFFF" : "#" + color;
                        tableBuilder.append(" style='border: 1px solid #000000 ; background-color:" + color + "' >");
                        if (StringUtils.isEmpty(text)) {
                            // No cell text: look for images in the paragraphs of the cell
                            List<XWPFParagraph> cellParagraphs = cell.getParagraphs();
                            for (XWPFParagraph cellsTemp : cellParagraphs) {
                                cellsTemp.getIRuns().forEach(run -> {
                                    if (run instanceof XWPFRun) {
                                        XWPFRun xWPFRun = (XWPFRun) run;
                                        for (XWPFPicture picture : xWPFRun.getEmbeddedPictures()) {
                                            XWPFPictureData pictureData = picture.getPictureData();
                                            String base64Image = "<img src='data:image/png;base64," + Base64.getEncoder().encodeToString((pictureData.getData())) + "'/>";
                                            tableBuilder.append(base64Image);
                                        }
                                    }
                                });
                            }
                        } else {
                            tableBuilder.append(text);
                        }
                        tableBuilder.append("</td>");
                    }
                    tableBuilder.append("</tr>");
                }
                tableBuilder.append("</table>");
                System.out.println(tableBuilder.toString());
                // String text = table.getText();
                // System.out.println(text);
            }
            System.out.println("<br/>");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
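parseFile3 appends the cell text to the HTML unchanged, so a cell containing `<` or `&` would corrupt the markup. The sketch below adds an escaping step, using plain JDK classes and a list of rows in place of POI objects; `HtmlTable`, `escape`, and `render` are hypothetical names.

```java
import java.util.List;

/** Hypothetical helper: renders rows of cell text as an HTML table with escaped content. */
class HtmlTable {

    /** Escapes the characters that have a special meaning in HTML text and attributes. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds the table row by row, escaping every cell before it is appended. */
    static String render(List<List<String>> rows) {
        StringBuilder html = new StringBuilder("<table border='1'>");
        for (List<String> row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }
}
```

In parseFile3 the same effect would come from wrapping the text in `HtmlTable.escape(text)` before `tableBuilder.append(...)`.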
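2. Conversion result
Original:
After conversion:
There is a lot more that can be done here. For example, each table cell carries properties such as color, height, width, and alignment. If you need a more faithful result, assemble the corresponding HTML attributes yourself, and the output will look the same as the original document.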
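As one example of such a property, parseFile3 turns the value of `cell.getColor()` into a CSS color by prefixing `#`. The shading fill in a docx may also be the literal `auto`, which would yield the invalid color `#auto`. Below is a minimal sketch of a stricter conversion; `CellColor` and `cssColor` are hypothetical names, and white is assumed as the fallback, as in the method above.

```java
/** Hypothetical helper: converts a docx shading fill value into a CSS color. */
class CellColor {

    /** Accepts only six-digit hex values; null, empty, "auto", and anything else fall back to white. */
    static String cssColor(String fill) {
        if (fill != null && fill.matches("[0-9A-Fa-f]{6}")) {
            return "#" + fill;
        }
        return "#FFFFFF";
    }
}
```

The call site would then read `String color = CellColor.cssColor(cell.getColor());`.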
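3. Key classes
The key classes here include XWPFParagraph (text), XWPFRun (runs, which also carry the images), XWPFTable (tables), and XWPFFieldRun (fields such as formulas).
XWPFDocument: represents a docx document
XWPFParagraph: represents a paragraph in the document body, a table, a heading, and so on; it consists of multiple XWPFRun objects
XWPFRun: represents a run of text that shares the same formatting
XWPFTable: represents a table
XWPFTableRow: represents a table row
XWPFTableCell: represents a table cell
XWPFChart: represents a chart in the .docx file
XWPFHyperlink: represents a hyperlink
XWPFPicture: represents an image
XWPFComment: represents a comment
XWPFFooter: represents a footer
XWPFHeader: represents a header
See also: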
https://blog.csdn.net/xl_21/article/details/125205284
https://www.cnblogs.com/unruly/p/7483858.html
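V. Python approach
The core library is python-docx.
1. Install python-docx
pip install python-docx
2. python-docx: basic usage
Start with this basic approach. It does not fully solve the problem: text, tables, and images are all read separately, which clearly loses the order.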
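from docx import Document
from docx.enum.shape import WD_INLINE_SHAPE_TYPE
from docx.shared import Inches

# Load an existing Word document
doc = Document('example_with_table_and_image.docx')

# Read the tables and images in the document
for table in doc.tables:
    for row in table.rows:
        for cell in row.cells:
            print(cell.text)

for shape in doc.inline_shapes:
    if shape.type == WD_INLINE_SHAPE_TYPE.PICTURE:
        # Resolve the embedded image part through its relationship id
        r_id = shape._inline.graphic.graphicData.pic.blipFill.blip.embed
        print(f"Image name: {doc.part.related_parts[r_id].filename}")
        # The image can be modified here
        # shape.width = Inches(2)   # set the image width to 2 inches
        # shape.height = Inches(2)  # set the image height to 2 inches

# Save the modified document
doc.save('modified_document.docx')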
3. python-docx: advanced approach
# docx
import docx
from docx.document import Document
from docx.text.paragraph import Paragraph
from docx.parts.image import ImagePart
from docx.table import _Cell, Table
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P


# Check whether the paragraph contains an image (assumes at most one image per paragraph)
def is_image(graph: Paragraph, doc: Document):
    images = graph._element.xpath('.//pic:pic')  # all pictures in the paragraph
    for image in images:
        for img_id in image.xpath('.//a:blip/@r:embed'):  # relationship id of the picture
            part = doc.part.related_parts[img_id]  # look up the image part by id
            if isinstance(part, ImagePart):
                return True
    return False


# Get the image part (assumes at most one image per paragraph)
def get_ImagePart(graph: Paragraph, doc: Document):
    images = graph._element.xpath('.//pic:pic')  # all pictures in the paragraph
    for image in images:
        for img_id in image.xpath('.//a:blip/@r:embed'):  # relationship id of the picture
            part = doc.part.related_parts[img_id]  # look up the image part by id
            if isinstance(part, ImagePart):
                return part
    return None


def iter_block_items(parent):
    """
    Yield each paragraph and table child within *parent*, in document order.
    Each returned value is an instance of either Table or Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            paragraph = Paragraph(child, parent)
            # A paragraph that holds an image yields the image part first, then the paragraph itself
            if is_image(paragraph, parent):
                print('[Image] ')
                yield get_ImagePart(paragraph, parent)
            print('[Text] ')
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            print('[Table] ')
            yield Table(child, parent)


def parse_word(word_path):
    doc = docx.Document(word_path)
    for part in iter_block_items(doc):
        print(part)
https://blog.csdn.net/qq_39600166/article/details/101537368
https://blog.csdn.net/GstGxf/article/details/133944790
https://blog.csdn.net/weixin_47663607/article/details/128669041
Python: Extract Text and Images from Word Documents
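VI. Summary
Python's python-docx library is quite capable, but while handling this particular case I gradually found that it has shortcomings when reading Word files:
1. Some chapter heading numbers are generated automatically through style-based numbering, and these numbers are missing from the text that is read.
2. Table parsing errors: the library treats every table as a two-dimensional grid by default, and the result is not always accurate.
The Java approach calls the POI API directly, so it is more complete. Its advantages:
1. All content is accessible, including but not limited to text, tables, images, and charts.
2. Style information is accessible, including style-based numbering, and chapter headings can be identified (the Python approach can do this too, but I never solved the style-numbering problem there).
3. The data stays in its original form: tables are stored as XML, so you can write custom post-processing code and work around errors introduced by the library.
That is also why the Python approach comes last.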