Read a file and extract only a certain part

Problem description

ifstream toOpen;
openFile.open("sample.html", ios::in);

if(toOpen.is_open()){
    while(!toOpen.eof()){
        getline(toOpen,line);
        if(line.find("href=") && !line.find(".pdf")){
            start_pos = line.find("href");
            tempString = line.substr(start_pos+1); // i dont want the quote
            stop_pos = tempString.find("\"");
            string testResult = tempString.substr(start_pos, stop_pos);
            cout << testResult << endl;
        }
    }

    toOpen.close();
}

What I am trying to do is extract the "href" value, but I can't get it to work.

Edit:

Thanks to Tony's hint, I now use this:

if(line.find("href=") != std::string::npos ){   
    // Process
}

It works!
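
For readers who want the whole loop rather than just the condition, here is a minimal sketch of how the hint could be applied. It is not the asker's final code. The function names `extractHref` and `printHrefs` are made up for illustration, and the sketch assumes the attribute is written exactly as href="..." (lowercase, double quotes, on a single line) and that only the first link on each line matters. It compares the result of `find` against `std::string::npos` and loops on `getline` itself instead of testing `eof()`.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical helper: returns the text between the double quotes that
// follow the first href=" in `line`, or an empty string when the pattern
// is missing or the closing quote is not on this line.
std::string extractHref(const std::string& line) {
    const std::string key = "href=\"";
    const std::string::size_type start = line.find(key);
    if (start == std::string::npos) return "";

    // The value begins right after the opening quote.
    const std::string::size_type valueStart = start + key.size();
    const std::string::size_type valueEnd = line.find('"', valueStart);
    if (valueEnd == std::string::npos) return "";

    // substr takes (position, length), not (start, stop).
    return line.substr(valueStart, valueEnd - valueStart);
}

// Hypothetical helper: prints the first href value of every line in `path`.
void printHrefs(const std::string& path) {
    std::ifstream in(path);
    std::string line;
    // getline returns the stream, which tests false once reading fails,
    // so no separate eof() check is needed.
    while (std::getline(in, line)) {
        const std::string href = extractHref(line);
        if (!href.empty()) std::cout << href << '\n';
    }
}
```

Calling `printHrefs("sample.html")` would print one value per matching line. Anything outside the stated assumptions (single quotes, uppercase HREF, attributes split across lines) is silently missed, which is the kind of fragility the answer below warns about.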

Recommended answer

I'd advise against trying to parse HTML like this. Unless you know a lot about the source and are quite certain about how it'll be formatted, chances are that anything you do will have problems. HTML is an ugly language with an (almost) self-contradictory specification that (for example) says particular things are not allowed -- but then goes on to tell you how you're required to interpret them anyway.

Worse, almost any character can (at least potentially) be encoded in any of at least three or four different ways, so unless you scan for (and carry out) the right conversions (in the right order) first, you can end up missing legitimate links and/or including "phantom" links.
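
To make the encoding point concrete, here is a rough sketch, assuming the encodings in question include HTML character references (this is one reading of the answer, not something it spells out). The hypothetical function `decodeSomeReferences` handles only decimal and hexadecimal numeric references for ASCII code points plus the named reference `&amp;`. A real parser deals with far more than this.

```cpp
#include <string>

// Hypothetical helper: decodes &#NN; and &#xHH; references for ASCII code
// points (1..127) and the named reference &amp;. Any other text, including
// references it does not recognise, is copied through unchanged.
std::string decodeSomeReferences(const std::string& s) {
    std::string out;
    std::string::size_type i = 0;
    while (i < s.size()) {
        // Only look for a terminating ';' when we are standing on '&'.
        const std::string::size_type semi =
            (s[i] == '&') ? s.find(';', i) : std::string::npos;
        if (semi == std::string::npos) { out += s[i++]; continue; }

        const std::string body = s.substr(i + 1, semi - i - 1);
        if (body == "amp") { out += '&'; i = semi + 1; continue; }

        if (body.size() > 1 && body[0] == '#') {
            const bool hex = body.size() > 2 && (body[1] == 'x' || body[1] == 'X');
            const std::string digits = body.substr(hex ? 2 : 1);
            unsigned long code = 0;
            bool ok = !digits.empty();
            for (char c : digits) {
                int d;
                if (c >= '0' && c <= '9') d = c - '0';
                else if (hex && c >= 'a' && c <= 'f') d = c - 'a' + 10;
                else if (hex && c >= 'A' && c <= 'F') d = c - 'A' + 10;
                else { ok = false; break; }
                code = code * (hex ? 16 : 10) + d;
                if (code > 127) { ok = false; break; }  // ASCII only in this sketch
            }
            if (ok && code > 0) {
                out += static_cast<char>(code);
                i = semi + 1;
                continue;
            }
        }
        out += s[i++];  // not a reference we handle: keep the '&' literally
    }
    return out;
}
```

With this helper, `a/b.pdf`, `a&#47;b.pdf` and `a&#x2F;b.pdf` all decode to the same string, although a plain text search on the raw file treats them as three different values. That is why a search for a fixed substring can miss legitimate links unless the conversions are carried out first.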

You might want to look at the answers to this previous question for suggestions about an HTML parser to use.
