专注收集记录技术开发学习笔记、技术难点、解决方案
网站信息搜索 >> 请输入关键词:
您当前的位置: 首页 > Go

goodarticle.4-Regular Expression Matching

发布时间:2011-06-29 18:23:54 文章来源:www.iduyao.cn 采编人员:星星草
goodarticle.4-----------------Regular Expression Matching

Implement regular expression matching with support for ‘.’ and ‘*’.

‘.’ Matches any single character.
‘*’ Matches zero or more of the preceding element.

The matching should cover the entire input string (not partial).

The function prototype should be:
bool isMatch(const char *s, const char *p)

Some examples:
isMatch(“aa”,”a”) → false
isMatch(“aa”,”aa”) → true
isMatch(“aaa”,”aa”) → false
isMatch(“aa”, “a*”) → true
isMatch(“aa”, “.*”) → true
isMatch(“ab”, “.*”) → true
isMatch(“aab”, “c*a*b”) → true

Online Judge
This problem is available at Online Judge. Head over there and it will judge your solution. Currently only able to compile C++/Java code. If you are using other languages, you can still verify your solution by looking at the judge’s test cases and its expected output.

Background:
It might seem deceptively easy even you know the general idea, but programming it correctly with all the details require careful thought.

Edit:
It seems that some readers are confused about why the regex pattern “.*” matches the string “ab”. “.*” means repeat thepreceding element 0 or more times. Here, the “preceding” element is the dot character in the pattern, which can match any characters. Therefore, the regex pattern “.*” allows the dot to be repeated any number of times, which matches any string (even an empty string).

Hints:
Think carefully how you would do matching of ‘*’. Please note that ‘*’ in regular expression is different from wildcard matching, as we match the previous character 0 or more times. But, how many times? If you are stuck, recursion is your friend.

This problem is a tricky one. Due to the huge number of edge cases, many people would write lengthy code and have numerous bugs on their first try. Try your best getting your code correct first, then refactor mercilessly to as clean and concise as possible!


A sample diagram of a deterministic finite state automata (DFA). DFAs are useful for doing lexical analysis and pattern matching. An example is UNIX’s grep tool. Please note that this post does not attempt to descibe a solution using DFA.

Solution:
This looks just like a straight forward string matching, isn’t it? Couldn’t we just match the pattern and the input string character by character? The question is, how to match a ‘*’?

A natural way is to use a greedy approach; that is, we attempt to match the previous character as many as we can. Does this work? Let us look at some examples.

s = “abbbc”, p = “ab*c”
Assume we have matched the first ‘a’ on both s and p. When we see “b*” in p, we skip all b’s in s. Since the last ‘c’ matches on both side, they both match.

s = “ac”, p = “ab*c”
After the first ‘a’, we see that there is no b’s to skip for “b*”. We match the last ‘c’ on both side and conclude that they both match.

It seems that being greedy is good. But how about this case?

s = “abbc”, p = “ab*bbc”
When we see “b*” in p, we would have skip all b’s in s. They both should match, but we have no more b’s to match. Therefore, the greedy approach fails in the above case.

One might be tempted to think of a quick workaround. How about counting the number of consecutive b’s in s? If it is smaller or equal to the number of consecutive b’s after “b*” in p, we conclude they both match and continue from there. For the opposite, we conclude there is not a match.

This seem to solve the above problem, but how about this case:
s = “abcbcd”, p = “a.*c.*d”

Here, “.*” in p means repeat ‘.’ 0 or more times. Since ‘.’ can match any character, it is not clear how many times ‘.’ should be repeated. Should the ‘c’ in p matches the first or second ‘c’ in s? Unfortunately, there is no way to tell without using some kind of exhaustive search.

We need some kind of backtracking mechanism such that when a matching fails, we return to the last successful matching state and attempt to match more characters in s with ‘*’. This approach leads naturally to recursion.

The recursion mainly breaks down elegantly to the following two cases:

  1. If the next character of p is NOT ‘*’, then it must match the current character of s. Continue pattern matching with the next character of both s and p.
  2. If the next character of p is ‘*’, then we do a brute force exhaustive matching of 0, 1, or more repeats of current character of p… Until we could not match any more characters.

You would need to consider the base case carefully too. That would be left as an exercise to the reader. :)

Below is the extremely concise code (Excluding comments and asserts, it’s about 10 lines of code).

bool isMatch(const char *s, const char *p) {
	assert(s && p);
	if (*p == '\0') return *s == '\0';

	// next char is not '*': must match current character
	if (*(p + 1) != '*') {
		assert(*p != '*');
		return ((*p == *s) || (*p == '.' && *s != '\0')) && isMatch(s + 1, p + 1);
	}
	// next char is '*'
	while ((*p == *s) || (*p == '.' && *s != '\0')) {
		if (isMatch(s, p + 2)) return true;
		s++;
	}
	return isMatch(s, p + 2);
}


Further Thoughts:
Some extra exercises to this problem:

  1. If you think carefully, you can exploit some cases that the above code runs in exponential complexity. Could you think of some examples? How would you make the above code more efficient?
  2. Try to implement partial matching instead of full matching. In addition, add ‘^’ and ‘$’ to the rule. ‘^’ matches the starting position within the string, while ‘$’ matches the ending position of the string.
  3. Try to implement wildcard matching where ‘*’ means any sequence of zero or more characters.

For the interested reader, real world regular expression matching (such as the grep tool) are usually implemented by applying formal language theory. To understand more about it, you may read this article.

友情提示:
信息收集于互联网,如果您发现错误或造成侵权,请及时通知本站更正或删除,具体联系方式见页面底部联系我们,谢谢。

其他相似内容:

  • ModernUI课程:定义一个Logo

    ModernUI教程:定义一个Logo ModernWindow的标题栏包含了一块区域用来显示自定义的窗体Logo: 这个窗体logo通过ModernWindow.LogoD...

  • Django忘记管理员账号和密码的解决方法

    Django忘记管理员账号和密码的解决办法 看着Django的教程学习搭建网站,结果忘记第一次创建的账号和密码了。结果搭建成功以后,一直...

  • GO语言小结(1)——基本知识

    GO语言总结(1)——基本知识 1、注释(与C++一样)   行注释://  块注释:/*   ...  */ 2、标识符   可以这么说,除了数字开头...

  • golang 惯用的文件读取方式

    golang 常用的文件读取方式 Golang 的文件读取方法很多,刚上手时不知道怎么选择,所以贴在此处便后速查。 一次性读取 小文件推荐一...

  • 查询深圳市通相关信息

    查询深圳通相关信息 用 HTTP.GET 从开放 API 中查询深圳通信息,然后将 JSON 数据存入结构体中,再格式化输出。 注意:获取的并不是实...

  • Go语言设计模式实践:结合(Composite)

    Go语言设计模式实践:组合(Composite) 关于本系列 这个系列首先是关于Go语言实践的。在项目中实际使用Go语言也有段时间了,一个体会就...

  • 列出索引和遍历目录

    列出目录和遍历目录 获取目录列表用 ioutil.ReadDir(),遍历目录用 filepath.Walk(),使用方法请参考文章示例。 示例代码: package ma...

  • io 包的惯用接口速记

    io 包的常用接口速记 我没有 C/C++ 基础,没有接口的概念,且从 Python 投奔而来,Python 的极简主义(一个结果往往只提供一个方法),让我在...

  • 代理服务扩充

    代理服务扩展 之前自己实现了一个代理服务,当时考虑的是只要支持SOCKS5就好了,因为我经常用CHROME,配合着SwitchySharp,体验还是很棒...

  • 文件的创造与打开

    文件的创建与打开 文件操作是个很重要的话题,使用也非常频繁,熟悉如何操作文件是必不可少的。Golang 对文件的支持是在 os package ...

热门推荐: