As a programmer, I know that grep, sed, and awk are powerful tools for processing text, but they aren't always straightforward for specific tasks, because I have to think about how to filter out the lines and columns I want.
So I wondered whether there is a handier way to do these tasks.
After a while, I concluded that using regexes directly can help, so I launched a re.findall service built on top of Python's re.findall API.
Here are some use cases for it.
-
Find all words beginning with a specific prefix.
Imagine that I have a few paragraphs and want to find all words beginning with the letter 's'. I can do it in a shell session with the following command, but that's a lot of typing:
awk '{for(i=1; i<=NF; i++) {print $i}}' /tmp/paragraphs.txt | grep '^s'
With this service, I can use one regex in just a few steps:
- Copy the text to the left input box.
- Type in the regex \bs\w* and click the button.
- The result will show in the right box.
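For readers who want to try the same pattern locally, here is a minimal sketch using Python's re.findall directly. The sample sentence is made up for illustration and stands in for the pasted paragraphs; how the service itself calls re.findall internally is not shown here.

```python
import re

# Hypothetical sample text standing in for the pasted paragraphs.
text = "The sun sets slowly; a ship sails past the island."

# \b matches at a word boundary, s is the literal prefix,
# and \w* consumes the rest of the word.
words = re.findall(r"\bs\w*", text)

print(words)  # ['sun', 'sets', 'slowly', 'ship', 'sails']
```

Note that "past" and "island" are not matched, because their 's' does not sit at a word boundary. The match is also case-sensitive, so a word such as "Sun" would be skipped unless re.IGNORECASE is passed.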
-
Extract fields out.
It's not uncommon to extract some fields from some lines with a similar structure, such as a Protobuf message definition.
Imagine that I need to write a few test cases for a Protobuf-based service, and I have such a message (taken from the Protocol Buffers site) at hand:
message Person {
  required string Name = 1;
  required int32 Id = 2;
  optional string Email = 3;
}
The final test case that I want looks like this (note that the field names in the set_xxx form must be lower-case):
Person person;
person.set_name("Text Toolkit");
person.set_id(1024);
person.set_email("whatacold@gmail.com");
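As a rough sketch of how the field names for these setters could be pulled out with plain Python, the snippet below applies the pattern (\w+) = to the message definition. The setter stubs it builds are only an illustration of the lower-casing step; the argument values still have to be filled in by hand.

```python
import re

# The message definition from above, pasted as a string.
proto = """message Person {
  required string Name = 1;
  required int32 Id = 2;
  optional string Email = 3;
}"""

# With a single capture group, re.findall returns only the captured text,
# i.e. the word right before " =".
fields = re.findall(r"(\w+) =", proto)

# Illustrative only: build empty setter stubs with lower-cased field names.
setters = [f"person.set_{name.lower()}();" for name in fields]

print(fields)   # ['Name', 'Id', 'Email']
print(setters)  # ['person.set_name();', 'person.set_id();', 'person.set_email();']
```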
The steps are the same as in the previous use case. Copy the message definition to the left input box and type (\w+) = in the regex box. It will give you the three field names as output, from which I can quickly complete the test case with Emacs' help.
I can also do it with the following command, which is not too complicated in this case but takes more typing:
awk '/=/{print $3}' /tmp/person.proto
-
Find specific attributes in HTML/XML.
XML is a widely used configuration file format, and I sometimes need to find all values of a specific attribute in an XML file. HTML has a similar syntax, so I'll use an HTML example here. Say I need to figure out what types of input I use in a specific HTML file. How can I do that?
With sed, I can run
sed -n 's:^.*<input.*type="\([^"]\+\)".*$:\1:p' /tmp/test.html
As you can see, that is quite a complicated command, and I can barely get it right on the first try. But with the re.findall service, I simply copy and paste the HTML code, write the regex <input[^<]+type="(\w+)" in the box, and click the button. Want to deduplicate the result? Check the "unique" box and click the button again.
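The same extraction can be sketched in a few lines of Python. The HTML fragment is invented for illustration, and the order-preserving deduplication is only my guess at what a "unique" option might do; the service may behave differently, for example by sorting the results.

```python
import re

# Hypothetical HTML fragment standing in for the pasted file content.
html = """<form>
  <input name="user" type="text">
  <input name="pass" type="password">
  <input name="nick" type="text">
  <input type="submit" value="Go">
</form>"""

# [^<]+ keeps the match inside a single tag; the capture group grabs
# the value of the type attribute.
types = re.findall(r'<input[^<]+type="(\w+)"', html)

# Assumed behavior of "unique": drop duplicates, keep first-seen order.
unique_types = list(dict.fromkeys(types))

print(types)         # ['text', 'password', 'text', 'submit']
print(unique_types)  # ['text', 'password', 'submit']
```

As with any regex-based approach to HTML, this is a quick inspection aid rather than a parser: it assumes double-quoted attribute values made of word characters.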
The three cases above are only a few examples from my daily usage, but they show that the service is simple yet powerful in some scenarios.
Beyond that, it solves another problem: the regex syntax varies a bit among grep, sed, and awk, so it is hard to get right for someone who doesn't write regexes often. With re.findall, there is one regex syntax for all: Python's.