Extracting Information from HTML

Preface

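Sometimes I need to pull the content of specific tags out of an HTML file. I used to do this with QString string operations, which is tedious, so I looked online for a few libraries and am recording them here. The platform is Windows with Qt and MinGW (32-bit).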

HTMLCXX

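After trying it, I found that this library is simple to use and runs efficiently. Just add its html and css folders to the project. See the example below for usage; that much is already enough to read the content of tags. If you also need to handle CSS, look at htmlcxx.cc, which contains a fairly complete example. The library can also be compiled into a .lib and linked into the project.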

Manual: I could not find the original source, so here is a blog post instead: https://blog.csdn.net/ictextr9/article/details/6893085

#include <iostream>
#include <string>
#include <strings.h>  // strcasecmp (POSIX)
#include <htmlcxx/html/ParserDom.h>
  ...
  using namespace std;
  using namespace htmlcxx;
  
  //Parse some html code
  string html = "<html><body>hey</body></html>";
  HTML::ParserDom parser;
  tree<HTML::Node> dom = parser.parseTree(html);
  
  //Print whole DOM tree
  cout << dom << endl;
  
  //Dump all links in the tree
  tree<HTML::Node>::iterator it = dom.begin();
  tree<HTML::Node>::iterator end = dom.end();
  for (; it != end; ++it)
  {
     if (strcasecmp(it->tagName().c_str(), "A") == 0)
     {
       it->parseAttributes();
       cout << it->attribute("href").second << endl;
     }
  }
  
  //Dump all text of the document
  it = dom.begin();
  end = dom.end();
  for (; it != end; ++it)
  {
    if ((!it->isTag()) && (!it->isComment()))
    {
      cout << it->text();
    }
  }
  cout << endl;

GUMBO

The library is written entirely in C99. There is also a C++ version (https://github.com/lazytiger/gumbo-query).

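Features listed by the project:

Fully conformant with the HTML5 spec.
Robust and resilient to bad input.
Simple API that can be easily wrapped by other languages.
Support for source locations and pointers back to the original text.
Support for fragment parsing.
Relatively lightweight, with no outside dependencies.
Passes all html5lib tests, including the template tag.
Tested on over 2.5 billion pages from Google's index.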

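Build it following the official instructions. With Qt and MinGW (32-bit), adding the source files directly to the project also works. Alternatively, build the .h, lib*.a, and .dll files and add those to the project. The figure below gives a rough picture of the data structures. It omits many of the smaller tags, but the structure is well worth studying.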

[Figure: overview of the Gumbo data structures, reproduced from another source]

Pass the HTML string in using either of the two ways below to get a pointer to a GumboOutput struct, from which the values you want can be read.
Method 1:
GumboOutput* output = gumbo_parse(contents.c_str());
// do something
gumbo_destroy_output(&kGumboDefaultOptions, output);
Method 2:
GumboOutput* output = gumbo_parse_with_options(
    &kGumboDefaultOptions, contents.data(), contents.length());
// do something
gumbo_destroy_output(&kGumboDefaultOptions, output);

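The official example follows. I have not used the library in depth yet and will write a separate post about it later. Briefly, the example finds the href attribute value of every A tag.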
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include <string>
#include "gumbo.h"

static void search_for_links(GumboNode* node) {
  // Not an element node: nothing to search, end this branch of the recursion.
  if (node->type != GUMBO_NODE_ELEMENT) {
    return;
  }
  GumboAttribute* href;
  // For an A tag, gumbo_get_attribute looks up href and returns the attribute
  // if found, or NULL otherwise. The name match is case-insensitive.
  if (node->v.element.tag == GUMBO_TAG_A &&
      (href = gumbo_get_attribute(&node->v.element.attributes, "href"))) {
    std::cout << href->value << std::endl;
  }

  // Pointer to this element's child nodes.
  GumboVector* children = &node->v.element.children;
  for (unsigned int i = 0; i < children->length; ++i) {
    // Search each child recursively.
    search_for_links(static_cast<GumboNode*>(children->data[i]));
  }
}

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cout << "Usage: find_links <html filename>.\n";
    exit(EXIT_FAILURE);
  }
  const char* filename = argv[1];

  std::ifstream in(filename, std::ios::in | std::ios::binary);
  if (!in) {
    std::cout << "File " << filename << " not found!\n";
    exit(EXIT_FAILURE);
  }

  std::string contents;
  in.seekg(0, std::ios::end);
  contents.resize(in.tellg());
  in.seekg(0, std::ios::beg);
  in.read(&contents[0], contents.size());
  in.close();

  // Parse the HTML string read from the file to get a GumboOutput struct.
  GumboOutput* output = gumbo_parse(contents.c_str());
  search_for_links(output->root);
  // Free the parse tree.
  gumbo_destroy_output(&kGumboDefaultOptions, output);
}
