.NET Core: Periodically Scraping Site Articles and Sending Them by Email

Source: ZzWww  Editor: ZzWww  Date: 2019-11-25

Preface

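Hi everyone, I'm Xiaochen (晓晨). I haven't updated the blog in a while, so today I'm sharing a practical post: a small tool that scrapes the article list on the cnblogs (博客园) home page every 5 minutes and emails it to you at 9 a.m. the next day. For example, when I get to the office at 9 a.m. on February 14, 2018, I receive an email containing the cnblogs home page articles from February 13, 2018. I wrote the tool because I have a habit of reading blogs, but lately, for various reasons, I can go several days without opening one, and missing a good article along the way would really hurt. So I built a tool that archives each day's articles and sends them to my inbox, and now I no longer have to worry about missing good posts. Why scrape only the home page? Because articles on the cnblogs home page tend to be of higher quality.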

Preparation

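A tool that runs continuously needs logging. I use NLog for that; its log archiving feature is very handy. HTTP requests can fail because of network problems, so I use Polly to retry them. HtmlAgilityPack parses the pages, which requires some familiarity with XPath. The details are listed below: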

Component         Purpose                         GitHub
NLog              Logging                         https://github.com/NLog/NLog
Polly             Retrying failed HTTP requests   https://github.com/App-vNext/Polly
HtmlAgilityPack   Web page parsing                https://github.com/zzzprojects/html-agility-pack
MailKit           Sending email                   https://github.com/jstedfast/MailKit

If any of these components is unfamiliar, see its GitHub repository for documentation.
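
As a rough illustration of the retry idea, here is a minimal sketch that assumes Polly's classic Policy API (v7 style). FlakyFetch is a hypothetical stand-in for a real HTTP call, and the retry count and delays are arbitrary assumptions, not values taken from the original tool.

```csharp
using System;
using Polly;

public static class RetryDemo
{
    private static int _calls;

    // Hypothetical stand-in for a real HTTP request:
    // it fails on the first two calls and succeeds on the third.
    public static string FlakyFetch(string url)
    {
        _calls++;
        if (_calls < 3)
        {
            throw new Exception("Simulated network failure #" + _calls);
        }
        return "<html>ok</html>";
    }

    public static string FetchWithRetry(string url)
    {
        // Retry up to 3 times on any exception, waiting 200 ms, 400 ms,
        // then 600 ms between attempts (assumed values for illustration).
        var policy = Policy
            .Handle<Exception>()
            .WaitAndRetry(
                3,
                attempt => TimeSpan.FromMilliseconds(200 * attempt),
                (exception, delay) =>
                    Console.WriteLine("Retrying in " + delay.TotalMilliseconds + " ms: " + exception.Message));

        // Execute runs the delegate under the policy and returns its result.
        return policy.Execute(() => FlakyFetch(url));
    }

    public static void Main()
    {
        Console.WriteLine(FetchWithRetry("https://www.cnblogs.com"));
    }
}
```

In a real tool the delegate passed to Execute would wrap the actual request, and the onRetry callback would be a natural place to write a log entry.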

Reference article

http://www.zzwww.in/article/112595.htm

Fetching and Parsing the cnblogs Home Page Data

I use HttpWebRequest for HTTP requests. Here is the simple wrapper class I wrote:

using System;
using System.IO;
using System.Net;
using System.Text;

namespace CnBlogSubscribeTool
{
 /// <summary>
 /// Simple Http Request Class
 /// .NET Framework >= 4.5 (WebRequest.CreateHttp is not available in 4.0)
 /// Author:stulzq
 /// CreatedTime:2017-12-12 15:54:47
 /// </summary>
 public class HttpUtil
 {
 static HttpUtil()
 {
  //Set connection limit ,Default limit is 2
  ServicePointManager.DefaultConnectionLimit = 1024;
 }

 /// <summary>
 /// Default Timeout 20s
 /// </summary>
 public static int DefaultTimeout = 20000;

 /// <summary>
 /// Is Auto Redirect
 /// </summary>
 public static bool DefalutAllowAutoRedirect = true;

 /// <summary>
 /// Default Encoding
 /// </summary>
 public static Encoding DefaultEncoding = Encoding.UTF8;

 /// <summary>
 /// Default UserAgent
 /// </summary>
 public static string DefaultUserAgent =
  "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
  ;

 /// <summary>
 /// Default Referer
 /// </summary>
 public static string DefaultReferer = "";

 /// <summary>
 /// httpget request
 /// </summary>
 /// <param name="url">Internet Address</param>
 /// <returns>string</returns>
 public static string GetString(string url)
 {
  var stream = GetStream(url);
  string result;
  // Decode with the configured encoding instead of the StreamReader default
  using (StreamReader sr = new StreamReader(stream, DefaultEncoding))
  {
  result = sr.ReadToEnd();
  }
  return result;

 }

 /// <summary>
 /// httppost request
 /// </summary>
 /// <param name="url">Internet Address</param>
 /// <param name="postData">Post request data</param>
 /// <returns>string</returns>
 public static string PostString(string url, string postData)
 {
  var stream = PostStream(url, postData);
  string result;
  // Decode with the configured encoding instead of the StreamReader default
  using (StreamReader sr = new StreamReader(stream, DefaultEncoding))
  {
  result = sr.ReadToEnd();
  }
  return result;

 }

 /// <summary>
 /// Create Response
 /// </summary>
 /// <param name="url"></param>
 /// <param name="post">Is post Request</param>
 /// <param name="postData">Post request data</param>
 /// <returns></returns>
 public static WebResponse CreateResponse(string url, bool post, string postData = "")
 {
  var httpWebRequest = WebRequest.CreateHttp(url);
  httpWebRequest.Timeout = DefaultTimeout;
  httpWebRequest.AllowAutoRedirect = DefalutAllowAutoRedirect;
  httpWebRequest.UserAgent = DefaultUserAgent;
  httpWebRequest.Referer = DefaultReferer;
  if (post)
  {

  var data = DefaultEncoding.GetBytes(postData);
  httpWebRequest.Method = "POST";
  httpWebRequest.ContentType = "application/x-www-form-urlencoded;charset=utf-8";
  httpWebRequest.ContentLength = data.Length;
  using (var stream = httpWebRequest.GetRequestStream())
  {
   stream.Write(data, 0, data.Length);
  }
  }

  try
  {
  var response = httpWebRequest.GetResponse();
  return response;
  }
  catch (Exception e)
  {
  throw new Exception(string.Format("Request error,url:{0},IsPost:{1},Data:{2},Message:{3}", url, post, postData, e.Message), e);
  }
 }

 /// <summary>
 /// http get request
 /// </summary>
 /// <param name="url"></param>
 /// <returns>Response Stream</returns>
 public static Stream GetStream(string url)
 {
  var stream = CreateResponse(url, false).GetResponseStream();
  if (stream == null)
  {

  throw new Exception("Response error,the response stream is null");
  }
  else
  {
  return stream;

  }
 }

 /// <summary>
 /// http post request
 /// </summary>
 /// <param name="url"></param>
 /// <param name="postData">post data</param>
 /// <returns>Response Stream</returns>
 public static Stream PostStream(string url, string postData)
 {
  var stream = CreateResponse(url, true, postData).GetResponseStream();
  if (stream == null)
  {

  throw new Exception("Response error,the response stream is null");
  }
  else
  {
  return stream;

  }
 }


 }
}

Fetching the home page data

string html = HttpUtil.GetString("https://www.cnblogs.com");

Parsing the data

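We now have the HTML, but how do we extract the information we need (article title, URL, summary, author, and publish time)? This is where HtmlAgilityPack comes in: it is a component that parses web pages using XPath.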

Load the HTML we fetched earlier:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
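
To show how XPath extraction with HtmlAgilityPack might look once the document is loaded, here is a minimal sketch. The sample markup, its class names (post_item, titlelnk, post_item_summary, post_item_foot), and the Article type are hypothetical; the real cnblogs home page structure may differ, so the XPath expressions would need to be adapted to the actual HTML.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical container for the extracted fields.
public class Article
{
    public string Title;
    public string Url;
    public string Summary;
    public string Author;
}

public static class ParseDemo
{
    // Hypothetical markup used only for this illustration.
    public const string SampleHtml = @"
<div class='post_item'>
  <h3><a class='titlelnk' href='https://www.cnblogs.com/example/p/1.html'>Sample title</a></h3>
  <p class='post_item_summary'>Sample summary text.</p>
  <div class='post_item_foot'><a href='#'>SampleAuthor</a></div>
</div>";

    public static List<Article> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<Article>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var items = doc.DocumentNode.SelectNodes("//div[@class='post_item']");
        if (items == null)
        {
            return result;
        }

        foreach (var item in items)
        {
            // A leading "." keeps each query relative to the current item node.
            var link = item.SelectSingleNode(".//a[@class='titlelnk']");
            var summary = item.SelectSingleNode(".//p[@class='post_item_summary']");
            var author = item.SelectSingleNode(".//div[@class='post_item_foot']/a");

            result.Add(new Article
            {
                Title = link == null ? "" : HtmlEntity.DeEntitize(link.InnerText).Trim(),
                Url = link == null ? "" : link.GetAttributeValue("href", ""),
                Summary = summary == null ? "" : HtmlEntity.DeEntitize(summary.InnerText).Trim(),
                Author = author == null ? "" : HtmlEntity.DeEntitize(author.InnerText).Trim()
            });
        }

        return result;
    }

    public static void Main()
    {
        foreach (var article in Parse(SampleHtml))
        {
            Console.WriteLine(article.Title + " | " + article.Url + " | " + article.Author);
        }
    }
}
```

Checking each node for null before reading it keeps the parser from crashing when the page layout changes, which matters for a tool that runs unattended.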

