User Name
Password

Go Back   Planetarion Forums > Non Planetarion Discussions > Programming and Discussion

Reply
Thread Tools Display Modes
Unread 28 Jan 2004, 23:27   #1
JetLinus
Friendly geek of GD :-/
 
JetLinus's Avatar
 
Join Date: Nov 2000
Location: On my metal roid
Posts: 923
JetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud of
Lightbulb Find similar strings (any ideas?)

Alright, basically I'm looking for a method to compare (and match) different but very similar strings.
I know there is a function in PHP, but I'm not using PHP and I think I need a "coded" version anyway.


Ok, first step would be to actually take 2 strings and say if they match or not.
Of course things like remove special characters, truncate / trim, adapt case etc are used. But it gets a bit trickier. I would like to have a very flexible version. Typos, spelling mistakes and glitches should be ignored if possible.

Like "possible" should match "possibible". Or maybe "He said" == "hesai" (extreme examples).


I thought, to get results like this, I should first generate all-lower-case strings without special chars and just 1 space between words. Done that so fair. But then I need an algorithm that sort of "ignores" little "twisted" things. I.e. "the" == "teh". I thought to generate checksums per word. Like adding up ASCII codes. Or sort letters by ASCII code and compare. But that might take it too far, because then all combinations of the same letters would end up being "similar". (Like in these "make up words from these letters" games).
I could generally remove multiple recurring chars as well...

Or make some sort of "points"-system end generate a threshold? Or, in a loop, say which string is most likely to be similar to a set of possibilities (I might have 50 strings per case)?


K, so far the "easy" version.


Now, I actually want to FIND and replace those strings. Makes it more difficult IMO, how can you search for stuff you don't know exactly?
I could apply this "simplifying-algorithm" to the whole source string and then do a string search in it, and, once found, try to recover the position.



Ok, once I've presented my raw ideas now, I shall tell you what I need it for:

There are html-pages. Lots of them. With one or two sentences on them, and a picture or two, and sometimes a bit javascript, and some links maybe. Simple pages basically. I have to translate the sentences in them. Which has been done already, but not in the pages themselves (using a generated translation table etc, and the sentences were put into another program).
However, it would be nice, if I could generate translated pages now. Unfortunately, in the whole progress, errors in the original version have been fixed.
Mostly typos // spelling mistakes. But quite often actually.

That's what I need this string-match for. Search the HTML code for the sentence (I can do some fairly good guessing where those sentences are as well, but it's not consistent unfortunately, so nothing safe). Then find the according "corrected" version and replace this string with the new translation.


Aaaaaaaaany ideas? Or questions? ^^
__________________
[»] Entropy increases! :-/
JetLinus is offline   Reply With Quote
Unread 29 Jan 2004, 01:01   #2
Caesar2
Commander
 
Caesar2's Avatar
 
Join Date: Sep 2001
Location: Netherlands
Posts: 146
Caesar2 is just really niceCaesar2 is just really niceCaesar2 is just really niceCaesar2 is just really nice
Re: Find similar strings (any ideas?)

Quote:
Originally Posted by JetLinus
Now, I actually want to FIND and replace those strings. Makes it more difficult IMO, how can you search for stuff you don't know exactly?
I could apply this "simplifying-algorithm" to the whole source string and then do a string search in it, and, once found, try to recover the position.
you could use soundex.
Most DB's for example, have a function for it.
I don't know if php has a standard function, but if it hasn't, then someone probably wrote a function for it.

good luck
__________________
Quote:
Originally posted by Cochese
Cathaar are not overpowered.

You were just "bashed", live with it.
Caesar2 is offline   Reply With Quote
Unread 29 Jan 2004, 01:20   #3
JetLinus
Friendly geek of GD :-/
 
JetLinus's Avatar
 
Join Date: Nov 2000
Location: On my metal roid
Posts: 923
JetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud of
Arrow Re: Find similar strings (any ideas?)

Hey, YES, I forgot about that... Sure, I found a function for that already. And tried it. My mistake was apparently to use it on the whole string instead of each word.

ALSO, I believe it's made for English, whereas my strings are in German.

Anyway, cheers for reminding, but: That doesn't help me for my 2nd task, does it?


Edit: And yes, PHP has a function Soundex(), but since I'm not using PHP ^^. Got a clone for VB anyway (no, DON'T stop posting now just because I mentioned VB........)
__________________
[»] Entropy increases! :-/
JetLinus is offline   Reply With Quote
Unread 29 Jan 2004, 03:37   #4
SbOlly
Spelling is for pussies
 
SbOlly's Avatar
 
Join Date: Mar 2003
Location: Actually, where the feck am I........?
Posts: 446
SbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant futureSbOlly has a brilliant future
Re: Find similar strings (any ideas?)

And it all goes quiet......
__________________
If God made me in his image, he's one fat ugly biatch.

I always get the soggy biscuit

Veni Vidi Codi
SbOlly is offline   Reply With Quote
Unread 29 Jan 2004, 07:43   #5
Caesar2
Commander
 
Caesar2's Avatar
 
Join Date: Sep 2001
Location: Netherlands
Posts: 146
Caesar2 is just really niceCaesar2 is just really niceCaesar2 is just really niceCaesar2 is just really nice
Re: Find similar strings (any ideas?)

VB hasn't a soundex function but if you do a search on http://groups.google.com you will find some hints and also the possibility to adjust the funcion to every (human) language
__________________
Quote:
Originally posted by Cochese
Cathaar are not overpowered.

You were just "bashed", live with it.
Caesar2 is offline   Reply With Quote
Unread 29 Jan 2004, 09:49   #6
JetLinus
Friendly geek of GD :-/
 
JetLinus's Avatar
 
Join Date: Nov 2000
Location: On my metal roid
Posts: 923
JetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud of
Arrow Re: Find similar strings (any ideas?)

Quote:
Originally Posted by Caesar2
VB hasn't a soundex function but if you do a search on http://groups.google.com you will find some hints and also the possibility to adjust the funcion to every (human) language
Quote:
Originally Posted by JetLinus
Sure, I found a function for that already. And tried it.
Quote:
Originally Posted by JetLinus
Edit: And yes, PHP has a function Soundex(), but since I'm not using PHP ^^. Got a clone for VB anyway (no, DON'T stop posting now just because I mentioned VB........)

Maybe it's just my fault for not making myself clear... Anyway, anything else? I can implement your ideas on my own, that's NOT the problem.

Problem is, need a good way to do this, especially to FIND similar strings, not just to compare them... C'mon :-/
__________________
[»] Entropy increases! :-/
JetLinus is offline   Reply With Quote
Unread 29 Jan 2004, 10:07   #7
Caesar2
Commander
 
Caesar2's Avatar
 
Join Date: Sep 2001
Location: Netherlands
Posts: 146
Caesar2 is just really niceCaesar2 is just really niceCaesar2 is just really niceCaesar2 is just really nice
Re: Find similar strings (any ideas?)

Quote:
Originally Posted by JetLinus
Maybe it's just my fault for not making myself clear... Anyway, anything else? I can implement your ideas on my own, that's NOT the problem.
Sorry, I think I was still sleeping
__________________
Quote:
Originally posted by Cochese
Cathaar are not overpowered.

You were just "bashed", live with it.
Caesar2 is offline   Reply With Quote
Unread 29 Jan 2004, 15:34   #8
JetLinus
Friendly geek of GD :-/
 
JetLinus's Avatar
 
Join Date: Nov 2000
Location: On my metal roid
Posts: 923
JetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud of
Re: Find similar strings (any ideas?)

No prob, as long as anybody can help me fiddling round with this stuff...
__________________
[»] Entropy increases! :-/
JetLinus is offline   Reply With Quote
Unread 30 Jan 2004, 14:03   #9
Cyp
∞+♪˛
 
Join Date: Nov 2000
Location: :uo!te]oŻ|
Posts: 428
Cyp is an unknown quantity at this point
Re: Find similar strings (any ideas?)

I haven't tried it, but maybe reverse one string, do a fast fourier transform on them both, multiply them with each other, reverse the fast fourier transform, and see how large the result is.

I have no idea if it would actually work, it's just an idea. The characters would need to be converted to something other than ascii, at least, maybe to some weird vector space...
__________________
Structural Integrity for Creator - since he'll probably make PA turn 3D.
Wikipedia forum
Note to self - Don't write Chinese letters with bold and italics...
<!--Last incarnation: Nov 2000-->
Cyp is offline   Reply With Quote
Unread 2 Feb 2004, 01:53   #10
queball
Ball
 
queball's Avatar
 
Join Date: Oct 2001
Posts: 4,410
queball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so little
Re: Find similar strings (any ideas?)

Perl has String::Approx. Maybe you could find something similar for VB, try googling for "fuzzy search" or "approximate match" or something.
There is a measure that CS courses seem to like called the Levenshtein distance, which is quite simple to understand and code but not always ideal.
But since your tool sounds like it's operated manually perhaps you could use user interaction to disambiguate. So cast your matching net widely and don't worry too much.
queball is offline   Reply With Quote
Unread 2 Feb 2004, 02:12   #11
queball
Ball
 
queball's Avatar
 
Join Date: Oct 2001
Posts: 4,410
queball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so littlequeball contributes so much and asks for so little
Re: Find similar strings (any ideas?)

Quote:
Originally Posted by JetLinus
Maybe it's just my fault for not making myself clear... Anyway, anything else? I can implement your ideas on my own, that's NOT the problem.

Problem is, need a good way to do this, especially to FIND similar strings, not just to compare them... C'mon :-/
Maybe you could go character by character, looking for a string with L.-distance less than say 12? If you have 10K chars and spend 0.1ms on each that's only a second.

I guess you'd want to look for sequences of soundexes and work outwards if you're going to be approximate. Try just searching for the exact soundexes first though, exactly like a string search. But I must admit I've never been very interested in text - maybe you could look at agrep or htdig.

What is the PHP function you allude to in the first post? I don't think PHP has anything that useful.

Last edited by queball; 2 Feb 2004 at 02:27.
queball is offline   Reply With Quote
Unread 2 Feb 2004, 11:51   #12
JetLinus
Friendly geek of GD :-/
 
JetLinus's Avatar
 
Join Date: Nov 2000
Location: On my metal roid
Posts: 923
JetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud ofJetLinus has much to be proud of
Arrow Re: Find similar strings (any ideas?)

Quote:
Originally Posted by Cyp
I haven't tried it, but maybe reverse one string, do a fast fourier transform on them both, multiply them with each other, reverse the fast fourier transform, and see how large the result is.

I have no idea if it would actually work, it's just an idea. The characters would need to be converted to something other than ascii, at least, maybe to some weird vector space...
Wow, that's an answer I was about to expect \o/. Although that's all a bit too high now
I've heard of FFT (SETI@home etc), but I think my knowlegde there is not advanced enough to get it to work properly. And then multiply them?
Ah well, at least I figured out as well, that pure ASCII code won't be too useful.
THX anyway, sounds interesting, I'd like to try that once in my life...



Quote:
Originally Posted by queball
Perl has String::Approx. Maybe you could find something similar for VB, try googling for "fuzzy search" or "approximate match" or something.
There is a measure that CS courses seem to like called the Levenshtein distance, which is quite simple to understand and code but not always ideal.
But since your tool sounds like it's operated manually perhaps you could use user interaction to disambiguate. So cast your matching net widely and don't worry too much.
Cheers, got some nice keywords now. I thought about that user-interaction as well, just have a listbox and let somebody click a hundred times (choose between lets say top 10 results). I still need a basic algorithm though...


Quote:
Originally Posted by queball
Maybe you could go character by character, looking for a string with L.-distance less than say 12? If you have 10K chars and spend 0.1ms on each that's only a second.

I guess you'd want to look for sequences of soundexes and work outwards if you're going to be approximate. Try just searching for the exact soundexes first though, exactly like a string search. But I must admit I've never been very interested in text - maybe you could look at agrep or htdig.

What is the PHP function you allude to in the first post? I don't think PHP has anything that useful.
A mate sent me to similar_text()

I've done some time estimations as well, I guess by-character-search would be ok as you said, if I optimized / programmed it carefully.

The "matching soundexes by word" idea seems nice. Had a roughly similar idea. Will do that first I think. I should really look up how exactly this Soundex() works and if it's important on which language you apply it. And then the question still is, if some typos do change the sounding of a word as well. But if 80% of the words match it'd be ok I guess...


So far, other project things got more priority, and time is running out a bit (as always), so I'll have to see how far I get. Thanks for helping.
__________________
[»] Entropy increases! :-/
JetLinus is offline   Reply With Quote
Reply


Thread Tools
Display Modes

Forum Jump


All times are GMT +1. The time now is 12:05.


Powered by vBulletin® Version 3.8.1
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2002 - 2018