BlogJava - Exploration and Discovery (探索与发现) - Post category: C#

Split String Examples in C#

蜘蛛 — Fri, 10 Jul 2009 06:41:00 GMT

Problem. You want to split strings on different delimiters, given either as single characters or as strings. For example, you want to split a string that contains "\r\n" sequences, which are Windows newlines. Solution. This document collects several tips for the Split method on the string type in the C# programming language.

Input string: One,Two,Three,Four,Five
Delimiter:    ,     (char)
Array:        One   (string array)
              Two
              Three
              Four
              Five

1. Using Split

Here we see the basic Split method overload. You already know the general way to do this, but it is good to look at the basic syntax before we move on. This example splits on a single character.

=== Example program for splitting on spaces ===



using System;

class Program
{
    static void Main()
    {
        string s = "there is a cat";
        //
        // Split string on spaces.
        // This will separate all the words.
        //
        string[] words = s.Split(' ');
        foreach (string word in words)
        {
            Console.WriteLine(word);
        }
    }
}

=== Output of the program ===

there
is
a
cat

Description. The input string, which contains four words, is split on spaces and the foreach loop then displays each word. The result value from Split is a string[] array.

2. Multiple characters

Here we use either the static Regex.Split method or the C# array-creation syntax with string.Split. Note that the string.Split examples create a new char[] or string[] array to hold the delimiters. Those overloads also take a StringSplitOptions argument, which can be used to remove empty strings from the result.

=== Program that splits on lines with Regex ===



using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string value = "cat\r\ndog\r\nanimal\r\nperson";
        //
        // Split the string on line breaks.
        // The return value from Split is a string[] array.
        //
        string[] lines = Regex.Split(value, "\r\n");

        foreach (string line in lines)
        {
            Console.WriteLine(line);
        }
    }
}

=== Output of the program ===

cat
dog
animal
person

Description. The first example uses Regex. The Regex class contains a static Split method. It can be used to split strings, although its performance characteristics differ from those of string.Split. The next two examples show how you can pass an array as the first parameter to string.Split.

=== Program that splits on multiple characters ===



using System;

class Program
{
    static void Main()
    {
        //
        // This string is also separated by Windows line breaks.
        //
        string value = "shirt\r\ndress\r\npants\r\njacket";

        //
        // Use a new char[] array of two characters (\r and \n) to break
        // the lines into separate strings. Use "RemoveEmptyEntries"
        // to make sure no empty strings get put in the string[] array.
        //
        char[] delimiters = new char[] { '\r', '\n' };
        string[] parts = value.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine(parts[i]);
        }

        //
        // Same as the previous example, but uses a string[] array
        // holding one two-character string.
        //
        parts = value.Split(new string[] { "\r\n" }, StringSplitOptions.None);
        for (int i = 0; i < parts.Length; i++)
        {
            Console.WriteLine(parts[i]);
        }
    }
}

=== Output of the program ===
(Repeated two times)

shirt
dress
pants
jacket

Overview. One useful overload of Split receives char[] arrays. The string Split method can receive a character array as the first parameter. Each char in the array designates a new block.

Using string arrays. Another overload of Split receives string[] arrays. This means a string array can also be passed to the Split method. In the example, the new string[] array is created inline with the Split call.

Explanation of StringSplitOptions. The example specifies the RemoveEmptyEntries value of the StringSplitOptions enum. When two delimiters are adjacent, Split produces an empty string between them. Passing RemoveEmptyEntries as the second parameter drops those empty strings from the result. [C# StringSplitOptions Enumeration - dotnetperls.com]
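The effect of adjacent delimiters can be seen in a small sketch. The class and method names here (EmptyEntriesDemo, SplitOnCommas) are made up for illustration and do not come from the original article.

```csharp
using System;

public static class EmptyEntriesDemo
{
    // Splits the input on commas using the given StringSplitOptions value.
    public static string[] SplitOnCommas(string input, StringSplitOptions options)
    {
        return input.Split(new char[] { ',' }, options);
    }

    public static void Main()
    {
        // Two adjacent commas and a trailing comma each yield an empty string.
        string input = "a,,b,";

        string[] all = SplitOnCommas(input, StringSplitOptions.None);
        string[] nonEmpty = SplitOnCommas(input, StringSplitOptions.RemoveEmptyEntries);

        Console.WriteLine(all.Length);      // 4: "a", "", "b", ""
        Console.WriteLine(nonEmpty.Length); // 2: "a", "b"
    }
}
```

With StringSplitOptions.None the empty strings are kept, which matters when the position of a field is significant, as in CSV data.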

3. Separating words

Here we see how you can separate words with Split. Usually, the best way to separate words is to use a Regex that specifies non-word chars. This example separates words in a string based on non-word characters. It eliminates punctuation and whitespace from the return array.

=== Program that separates on non-word pattern ===



using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string[] w = SplitWords("That is a cute cat, man");
        foreach (string s in w)
        {
            Console.WriteLine(s);
        }
        Console.ReadLine();
    }

    /// <summary>
    /// Take all the words in the input string and separate them.
    /// </summary>
    static string[] SplitWords(string s)
    {
        //
        // Split on all non-word characters.
        // Returns an array of all the words.
        //
        return Regex.Split(s, @"\W+");
        // @      special verbatim string syntax
        // \W+    one or more non-word characters together
    }
}

=== Output of the program ===

That
is
a
cute
cat
man

Word splitting example. Here you can separate parts of your input string based on any character set or range with Regex. Overall, this provides more power than the string Split methods. [C# Regex.Split Method Examples - dotnetperls.com]
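One detail worth knowing when splitting words this way: if the input begins or ends with a non-word character, Regex.Split returns an empty string at that end of the array. The sketch below filters those out. WordSplitDemo and its SplitWords method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordSplitDemo
{
    // Splits on runs of non-word characters and skips empty results.
    public static List<string> SplitWords(string s)
    {
        List<string> words = new List<string>();
        foreach (string part in Regex.Split(s, @"\W+"))
        {
            if (part.Length > 0)
            {
                words.Add(part);
            }
        }
        return words;
    }

    public static void Main()
    {
        // The trailing "!" is a delimiter at the end of the input,
        // so the raw result has an extra empty string.
        Console.WriteLine(Regex.Split("Hello, world!", @"\W+").Length); // 3
        Console.WriteLine(SplitWords("Hello, world!").Count);           // 2
    }
}
```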

4. Splitting text files

Here you have a text file containing comma-delimited lines of values. This is called a CSV file, and it is easily dealt with in C#. We use the File.ReadAllLines method here, but you may want StreamReader instead.

Reading the following code. The C# code next reads in both of those lines, parses them, and displays the values of each line after the line number. The final comment shows how the file was parsed into the strings.

=== Contents of input file (TextFile1.txt) ===



Dog,Cat,Mouse,Fish,Cow,Horse,Hyena
Programmer,Wizard,CEO,Rancher,Clerk,Farmer



=== Program that splits lines in file (C#) ===



using System;
using System.IO;

class Program
{
    static void Main()
    {
        int i = 0;
        foreach (string line in File.ReadAllLines("TextFile1.txt"))
        {
            string[] parts = line.Split(',');
            foreach (string part in parts)
            {
                Console.WriteLine("{0}:{1}",
                    i,
                    part);
            }
            i++; // For demo only
        }
    }
}

=== Output of the program ===

0:Dog
0:Cat
0:Mouse
0:Fish
0:Cow
0:Horse
0:Hyena
1:Programmer
1:Wizard
1:CEO
1:Rancher
1:Clerk
1:Farmer

5. Splitting directory paths

Here we see how you can split the segments of a local Windows directory path into separate strings. Note that directory paths are complex and this may not handle all cases correctly. It is also platform-specific, and you could use System.IO.Path.DirectorySeparatorChar for more flexibility. [C# Path Examples - dotnetperls.com]

=== Program that splits Windows directories (C#) ===



using System;

class Program
{
    static void Main()
    {
        // The directory from Windows
        const string dir = @"C:\Users\Sam\Documents\Perls\Main";
        // Split on directory separator
        string[] parts = dir.Split('\\');
        foreach (string part in parts)
        {
            Console.WriteLine(part);
        }
    }
}

=== Output of the program ===

C:
Users
Sam
Documents
Perls
Main
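
A less platform-specific variant can be sketched with the separator characters that System.IO.Path exposes. PathSplitDemo and SplitPath are illustrative names, and the sample path is built with Path.Combine so that it uses whatever separator the current platform prefers.

```csharp
using System;
using System.IO;

public static class PathSplitDemo
{
    // Splits a path on the platform's primary and alternate separators.
    public static string[] SplitPath(string path)
    {
        char[] separators = new char[]
        {
            Path.DirectorySeparatorChar,
            Path.AltDirectorySeparatorChar
        };
        return path.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Path.Combine inserts the separator used by the current platform.
        string path = Path.Combine(Path.Combine("Users", "Sam"), "Documents");
        foreach (string part in SplitPath(path))
        {
            Console.WriteLine(part);
        }
    }
}
```

As with the original example, this is only a sketch: it does not interpret drive letters, UNC prefixes, or relative segments such as "..".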

6. Split internal logic

The logic internal to the .NET framework for Split is implemented in managed code. The methods call into the overload with three parameters. The parameters are next checked for validity. Finally, it uses unsafe code to create the separator list, and then a for loop combined with Substring to return the array.
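
The two-phase approach described above (collect the separator positions, then cut the string with Substring) can be sketched as follows. This is a simplified illustration for a single char separator, not the framework's actual source, and SimpleSplit is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class SimpleSplit
{
    public static string[] Split(string s, char separator)
    {
        if (s == null)
        {
            throw new ArgumentNullException("s");
        }

        // Phase 1: record the index of every separator.
        List<int> positions = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == separator)
            {
                positions.Add(i);
            }
        }

        // Phase 2: copy out the text between consecutive separators.
        string[] result = new string[positions.Count + 1];
        int start = 0;
        for (int i = 0; i < positions.Count; i++)
        {
            result[i] = s.Substring(start, positions[i] - start);
            start = positions[i] + 1;
        }
        result[positions.Count] = s.Substring(start);
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Split("a,b,,c", ','))); // a|b||c
    }
}
```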

7. Benchmarks

The author tested a long string and a short string, having 1200 and 40 chars respectively. Splitting speed varies with the kind of string being split. The length of the blocks, the number of delimiters, and the total size of the string all factor into performance.

Results. The Regex.Split option generally performed the worst. The author felt that the second or third methods would be the best, after observing performance problems with regular expressions in other situations.

=== Strings used in test ===

//
// Build long string (120 lines of 10 chars = 1200 chars).
// _test is a static string field shared by the test methods.
//
_test = string.Empty;
for (int i = 0; i < 120; i++)
{
    _test += "01234567\r\n";
}
//
// Build short string (10 lines of 4 chars = 40 chars).
//
_test = string.Empty;
for (int i = 0; i < 10; i++)
{
    _test += "ab\r\n";
}

=== Example methods tested (100000 iterations) ===

static void Test1()
{
    string[] arr = Regex.Split(_test, "\r\n", RegexOptions.Compiled);
}

static void Test2()
{
    string[] arr = _test.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
}

static void Test3()
{
    string[] arr = _test.Split(new string[] { "\r\n" }, StringSplitOptions.None);
}

Longer strings: 1200 chars. On the long strings the three methods are closer together: Regex.Split takes roughly 2.8 times as long as the char[] overload, compared with roughly 6.9 times on the short strings. It may be that for very long strings, such as entire files, the Regex method is equivalent or even faster. In these measurements, however, Regex is the slowest method at both lengths.

=== Benchmark of Split on long strings ===



[1] Regex.Split:    3470 ms
[2] char[] Split:   1255 ms [fastest]
[3] string[] Split: 1449 ms

=== Benchmark of Split on short strings ===

[1] Regex.Split:     434 ms
[2] char[] Split:     63 ms [fastest]
[3] string[] Split:   83 ms

Short strings: 40 chars. This shows the three methods compared to each other on short strings. Method 1 is the Regex method, and it is by far the slowest on the short strings. This may be because of the compilation time. Smaller is better. [This article was last updated for .NET 3.5 SP1.]

Performance recommendation. For programs that use shorter strings, the methods that split based on arrays are faster and simpler, and they will avoid Regex compilation. For somewhat longer strings or files that contain more lines, Regex is appropriate. I show some Split improvements that can improve your program. [C# Split Improvement - dotnetperls.com]
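
The article does not show its timing harness, so the following is only a sketch of how such a comparison might be set up with System.Diagnostics.Stopwatch. The names SplitBenchmark, BuildInput, and TimeMs are assumptions made for this example, and the numbers it prints will differ from the figures above depending on the machine and the .NET version.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitBenchmark
{
    static readonly char[] CharDelims = { '\r', '\n' };
    static readonly string[] StringDelims = { "\r\n" };

    // Builds a test string of `count` lines, each terminated by "\r\n".
    public static string BuildInput(string line, int count)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append(line).Append("\r\n");
        }
        return sb.ToString();
    }

    // Runs the action `iterations` times and returns elapsed milliseconds.
    public static long TimeMs(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string test = BuildInput("01234567", 120); // 1200 chars
        const int n = 100000;

        Console.WriteLine("Regex.Split:    {0} ms",
            TimeMs(() => Regex.Split(test, "\r\n"), n));
        Console.WriteLine("char[] Split:   {0} ms",
            TimeMs(() => test.Split(CharDelims, StringSplitOptions.RemoveEmptyEntries), n));
        Console.WriteLine("string[] Split: {0} ms",
            TimeMs(() => test.Split(StringDelims, StringSplitOptions.None), n));
    }
}
```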

8. Escaped characters

You can use Replace on your string input to substitute special characters in for any escaped characters. This can solve lots of problems on parsing computer-generated code or data. [C# Split Method and Escape Characters - dotnetperls.com]
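
One way this idea might look in practice is sketched below: an escaped delimiter (here a comma preceded by a backslash) is swapped for a placeholder before splitting and restored afterwards. The linked article may do this differently. The placeholder character and the names EscapedSplit and Placeholder are assumptions for the example, and the sketch assumes the placeholder never occurs in real input.

```csharp
using System;

public static class EscapedSplit
{
    // Assumption: this control character never appears in the input.
    const char Placeholder = '\u0001';

    // Splits on commas, treating "\," as a literal comma.
    public static string[] Split(string input)
    {
        // Hide escaped commas so Split does not see them.
        string hidden = input.Replace("\\,", Placeholder.ToString());
        string[] parts = hidden.Split(',');

        // Restore the hidden commas in each part.
        for (int i = 0; i < parts.Length; i++)
        {
            parts[i] = parts[i].Replace(Placeholder, ',');
        }
        return parts;
    }

    public static void Main()
    {
        foreach (string part in Split("Smith\\, John,42,London"))
        {
            Console.WriteLine(part);
        }
    }
}
```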

9. Caching delimiters

The author's further research into Split and its performance shows that it is worthwhile to create the char[] array of delimiters once, outside the loop, and reuse it. This reduces memory pressure and improves runtime performance.

=== Slow version - before ===



//
// Split on multiple characters using new char[] inline.
//
string t = "string to split, ok";

for (int i = 0; i < 10000000; i++)
{
    string[] s = t.Split(new char[] { ' ', ',' });
}

=== Fast version - after ===

//
// Split on multiple characters using new char[] already created.
//
string t = "string to split, ok";
char[] c = new char[] { ' ', ',' }; // <-- Cache this

for (int i = 0; i < 10000000; i++)
{
    string[] s = t.Split(c);
}

Interpretation of the above code. We see that storing the array of delimiters separately is good. My measurements show that the code is faster when the array is stored outside the loop, though by less than 10%.

10. Rewriting PHP explode

C# has no explode method exactly like PHP explode, but you can gain the functionality quite easily with Split, for the most part. You can replace explode with the Split method that receives a string[] array. [C# PHP explode Function - dotnetperls.com]
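
A minimal sketch of such a replacement is shown below. PhpCompat and Explode are hypothetical names. The sketch covers only the delimiter and a positive limit; PHP's handling of zero or negative limits is not reproduced, and string.Split throws an exception for a negative count.

```csharp
using System;

public static class PhpCompat
{
    // Like PHP explode(delimiter, input): split on a string delimiter.
    public static string[] Explode(string delimiter, string input)
    {
        return input.Split(new string[] { delimiter }, StringSplitOptions.None);
    }

    // Like PHP explode with a positive limit: at most `limit` elements,
    // with the last element holding the rest of the string.
    public static string[] Explode(string delimiter, string input, int limit)
    {
        return input.Split(new string[] { delimiter }, limit, StringSplitOptions.None);
    }

    public static void Main()
    {
        foreach (string part in Explode(", ", "a, b, c", 2))
        {
            Console.WriteLine(part); // "a", then "b, c"
        }
    }
}
```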

11. Summary

Here we saw several examples and two benchmarks of the Split method in the C# programming language. You can use Split to divide or separate your strings while keeping your code as simple as possible. Sometimes, using IndexOf and Substring together to parse your strings can be more precise and less error-prone. [C# IndexOf String Examples - dotnetperls.com]

Posted by 蜘蛛, 2009-07-10 14:41

Repost: C# Regular Expressions

蜘蛛 — Fri, 15 May 2009 22:54:00 GMT

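C# Regular Expressions

Digits only: "^[0-9]*$".
Exactly n digits: "^\d{n}$".
At least n digits: "^\d{n,}$".
Between m and n digits: "^\d{m,n}$".
Zero, or a number that does not start with zero: "^(0|[1-9][0-9]*)$".
Positive real numbers with two decimal places: "^[0-9]+(\.[0-9]{2})?$".
Positive real numbers with 1 to 3 decimal places: "^[0-9]+(\.[0-9]{1,3})?$".
Non-zero positive integers: "^\+?[1-9][0-9]*$".
Non-zero negative integers: "^-[1-9][0-9]*$".
Exactly 3 characters: "^.{3}$".
Strings made up of the 26 English letters: "^[A-Za-z]+$".
Strings made up of the 26 uppercase English letters: "^[A-Z]+$".
Strings made up of the 26 lowercase English letters: "^[a-z]+$".
Strings made up of digits and the 26 English letters: "^[A-Za-z0-9]+$".
Strings made up of digits, the 26 English letters, or underscores: "^\w+$".
Validate a user password: "^[a-zA-Z]\w{5,17}$". The valid format starts with a letter, is 6 to 18 characters long, and contains only letters, digits, and underscores.
Check for characters such as ^%&',;=?$\": "[^%&',;=?$\x22]+". Because the leading ^ negates the character class, the pattern matches a run of characters other than % & ' , ; = ? $ and ".
Chinese characters only: "^[\u4e00-\u9fa5]{0,}$".
Validate an email address: "^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$".
Validate an Internet URL: "^http://([\w-]+\.)+[\w-]+(/[\w-./?%&=]*)?$".
Validate a telephone number: "^(\(\d{3,4}\)|\d{3,4}-)?\d{7,8}$". Valid formats include "XXX-XXXXXXX", "XXX-XXXXXXXX", "XXXX-XXXXXXX", "XXXX-XXXXXXXX", "XXXXXXX", and "XXXXXXXX".
Validate an ID card number (15 or 18 digits): "^(\d{15}|\d{18})$".
Validate the 12 months of a year: "^(0?[1-9]|1[0-2])$". Valid formats are "01" to "09" and "1" to "12".
Validate the 31 days of a month: "^((0?[1-9])|((1|2)[0-9])|30|31)$". Valid formats are "01" to "09" and "1" to "31".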
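
In C#, patterns like these would typically be applied with Regex.IsMatch. The sketch below wraps four of them in helper methods. The class and method names are made up for illustration, and the decimal pattern escapes the dot so that it matches only a literal decimal point.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputValidation
{
    // Digits only (an empty string also passes, because of the * quantifier).
    public static bool IsDigits(string s)
    {
        return Regex.IsMatch(s, @"^[0-9]*$");
    }

    // A whole number, optionally followed by exactly two decimal places.
    public static bool IsTwoDecimalNumber(string s)
    {
        return Regex.IsMatch(s, @"^[0-9]+(\.[0-9]{2})?$");
    }

    // A month number from 1 to 12, with an optional leading zero.
    public static bool IsMonth(string s)
    {
        return Regex.IsMatch(s, @"^(0?[1-9]|1[0-2])$");
    }

    // A letter followed by 5 to 17 word characters (6 to 18 in total).
    public static bool IsPassword(string s)
    {
        return Regex.IsMatch(s, @"^[a-zA-Z]\w{5,17}$");
    }

    public static void Main()
    {
        Console.WriteLine(IsDigits("12345"));          // True
        Console.WriteLine(IsTwoDecimalNumber("3.14")); // True
        Console.WriteLine(IsMonth("13"));              // False
        Console.WriteLine(IsPassword("abc123"));       // True
    }
}
```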
Using regular expressions to restrict what can be entered into text boxes in web forms:

Allow only Chinese characters: onkeyup="value=value.replace(/[^\u4E00-\u9FA5]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\u4E00-\u9FA5]/g,''))"

Allow only full-width characters: onkeyup="value=value.replace(/[^\uFF00-\uFFFF]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\uFF00-\uFFFF]/g,''))"

Allow only digits: onkeyup="value=value.replace(/[^\d]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[^\d]/g,''))"

Allow only digits and English letters: onkeyup="value=value.replace(/[\W]/g,'')" onbeforepaste="clipboardData.setData('text',clipboardData.getData('text').replace(/[\W]/g,''))"

A JavaScript program that uses a regular expression to extract the file name from a URL; the result below is page1:

s="http://www.9499.net/page1.htm"
s=s.replace(/(.*\/){0,}([^\.]+).*/ig,"$2")
alert(s)

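Match double-byte characters (including Chinese characters): [^\x00-\xff]

Application: compute the length of a string, counting each double-byte character as 2 and each ASCII character as 1:

String.prototype.len=function(){return this.replace(/[^\x00-\xff]/g,"aa").length;}

Regular expression matching blank lines: \n[\s| ]*\r

Regular expression matching HTML tags: /<(.*)>.*<\/\1>|<(.*) \/>/

Regular expression matching leading and trailing whitespace: (^\s*)|(\s*$)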

String.prototype.trim = function()
{
return this.replace(/(^\s*)|(\s*$)/g, "");
}

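Using regular expressions to parse and convert IP addresses:

The following JavaScript program uses a regular expression to match an IP address and convert it to the corresponding numeric value:

function IP2V(ip)
{
re=/(\d+)\.(\d+)\.(\d+)\.(\d+)/g // regular expression matching an IP address
if(re.test(ip))
{
// each octet is one base-256 digit of the result
return RegExp.$1*Math.pow(256,3)+RegExp.$2*Math.pow(256,2)+RegExp.$3*256+RegExp.$4*1
}
else
{
throw new Error("Not a valid IP address!")
}
}

However, the program above may be simpler if it uses the split function directly instead of a regular expression, as follows:

var ip="10.100.20.168"
ip=ip.split(".")
alert("The IP value is: "+(ip[0]*256*256*256+ip[1]*256*256+ip[2]*256+ip[3]*1))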
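
The same split-based conversion can be sketched in C# with string.Split. IpConverter and ToNumber are illustrative names. The sketch weights each octet by a power of 256, which is the usual 32-bit interpretation of an IPv4 address, and relies on byte.Parse to reject octets outside 0 to 255.

```csharp
using System;

public static class IpConverter
{
    // Converts a dotted IPv4 address to its numeric value.
    public static long ToNumber(string ip)
    {
        string[] parts = ip.Split('.');
        if (parts.Length != 4)
        {
            throw new FormatException("Not a valid IP address!");
        }

        long value = 0;
        foreach (string part in parts)
        {
            // byte.Parse throws if the octet is not a number in 0..255.
            value = value * 256 + byte.Parse(part);
        }
        return value;
    }

    public static void Main()
    {
        Console.WriteLine(ToNumber("10.100.20.168")); // 174331048
    }
}
```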

Symbol reference:

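Character
Description

\
Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and "\(" matches "(".

^
Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.

$
Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.

*
Matches the preceding subexpression zero or more times. For example, zo* matches "z" as well as "zoo". * is equivalent to {0,}.

+
Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.

?
Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" as well as "does". ? is equivalent to {0,1}.

{n}
n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".

{n,}
n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.

{n,m}
m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the two numbers.

?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the 'o' characters.

.
Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' (inside square brackets a dot is a literal dot, so '[.\n]' does not do this).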

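(pattern)
Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.

(?:pattern)
Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more compact expression than 'industry|industries'.

(?=pattern)
Positive lookahead. Matches the search string at any point where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.

(?!pattern)
Negative lookahead. Matches the search string at any point where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.

x|y
Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".

[xyz]
A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".

[^xyz]
A negated character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".

[a-z]
A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' to 'z'.

[^a-z]
A negated character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.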

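\b
Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".

\B
Matches a non-word boundary. 'er\B' matches the 'er' in "verb" but not the 'er' in "never".

\cx
Matches the control character indicated by x. For example, \cM matches Control-M, or a carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.

\d
Matches a digit character. Equivalent to [0-9].

\D
Matches a non-digit character. Equivalent to [^0-9].

\f
Matches a form feed. Equivalent to \x0c and \cL.

\n
Matches a newline. Equivalent to \x0a and \cJ.

\r
Matches a carriage return. Equivalent to \x0d and \cM.

\s
Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].

\S
Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].

\t
Matches a tab. Equivalent to \x09 and \cI.

\v
Matches a vertical tab. Equivalent to \x0b and \cK.

\w
Matches any word character, including underscore. Equivalent to '[A-Za-z0-9_]'.

\W
Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.

\xn
Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.

\num
Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.

\n
Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.

\nm
Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.

\nml
Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).

\un
Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).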
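
A few of the entries above (greedy versus non-greedy quantifiers, positive lookahead, and backreferences) can be tried out in C# with Regex.Match. This is a small sketch using the table's own examples; RegexSyntaxDemo and FirstMatch are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSyntaxDemo
{
    // Returns the text of the first match, or "" when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Greedy: as many o's as possible. Non-greedy: as few as possible.
        Console.WriteLine(FirstMatch("oooo", "o+"));  // oooo
        Console.WriteLine(FirstMatch("oooo", "o+?")); // o

        // Lookahead: the version number is checked but not consumed,
        // so the match is "Windows " including the trailing space.
        Console.WriteLine(FirstMatch("Windows 2000", "Windows (?=95|98|NT|2000)"));
        // No match for 3.1, so this prints an empty line.
        Console.WriteLine(FirstMatch("Windows 3.1", "Windows (?=95|98|NT|2000)"));

        // Backreference: two consecutive identical characters.
        Console.WriteLine(FirstMatch("hello", @"(.)\1")); // ll
    }
}
```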

Posted by 蜘蛛, 2009-05-16 06:54