Removing jargon from span fields



Blinger
04-03-2014, 11:41 PM
Hello.

Basically I have a whole heap of documents that are filled with jargon like this:
<span class="xdTextBox" hidefocus="1" title="" tabindex="0" xd:binding="my:Ingredient" xd:ctrlid="CTRL17" xd:xctname="PlainText" style="BORDER-RIGHT: #dcdcdc 1pt; BORDER-TOP: #dcdcdc 1pt; FONT-WEIGHT: normal; FONT-SIZE: x-small; BORDER-LEFT: #dcdcdc 1pt; WIDTH: 100%; COLOR: #000000; BORDER-BOTTOM: #dcdcdc 1pt; FONT-STYLE: normal; FONT-FAMILY: Verdana; HEIGHT: 20px; TEXT-DECORATION: none">100% Wholemeal Flour</span>

I want to strip everything except the text at the very end (100% Wholemeal Flour) so I can run this over multiple files and insert the results into a DB. Inserting is the easy part; getting the text out is the hard part. Each recipe has a table of between 3 and 10 ingredients, and each row has 6 columns. Only the first 2 columns are important.

Any help at all will be appreciated.

eLv
06-03-2014, 07:23 AM
You tried strip_tags yet?

http://uk.php.net/strip_tags

Blinger
08-03-2014, 08:02 PM
Yep. Didn't work. Makes the result turn out like this:

Multi Grain Dough Ingredients%KgKgKgKgBakers Flour751.5003.7507.50018.750Multi Grain Mix250.5001.2502.5006.250


The 751.5003.7507.50018.750 is meant to be separate values: 75, 1.500, 3.750, 7.500 and 18.750. The next row should likewise read Multi Grain Mix 25 0.500 1.250 2.500 6.250.

eLv
09-03-2014, 05:50 AM
Maybe showing one of the documents would help, so we know what conditions we can use to extract the info.
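
On the run-together numbers: strip_tags removes the cell tags without leaving anything between the values, so neighbouring cells end up glued together. One workaround is to mark the cell and row boundaries before stripping. This is only a rough sketch; the sample markup and the function name cells_from_table_html are made up for illustration, and it assumes a plain table with no nested tables.

```php
<?php
// Hypothetical sample markup; the real documents may be laid out differently.
$html = '<table>'
      . '<tr><td><span>Bakers Flour</span></td><td>75</td><td>1.500</td></tr>'
      . '<tr><td><span>Multi Grain Mix</span></td><td>25</td><td>0.500</td></tr>'
      . '</table>';

// Hypothetical helper: returns one array of cell texts per table row.
function cells_from_table_html(string $html): array
{
    // Turn each closing cell tag into a tab and each closing row tag into a newline,
    // so the boundaries survive once strip_tags removes the markup.
    $marked = preg_replace('~</t[dh]\s*>~i', "\t", $html);
    $marked = preg_replace('~</tr\s*>~i', "\n", $marked);
    $text   = html_entity_decode(strip_tags($marked), ENT_QUOTES, 'UTF-8');

    $rows = [];
    foreach (explode("\n", $text) as $line) {
        // Drop the trailing tab left by the last cell, then split on the rest.
        $cells = array_map('trim', explode("\t", rtrim($line, "\t")));
        if ($cells !== ['']) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

print_r(cells_from_table_html($html));
```

Each row comes back as its own array, so the first two entries of a row are the ingredient name and the percentage.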

Blinger
10-03-2014, 07:19 AM
Here is a document. It is a Microsoft SharePoint page, but I don't have SharePoint, so I need to pull out the first and second columns (ingredients and percentages).
Edit: use this link http://pastebin.com/zKtU1xfV

eLv
10-03-2014, 12:28 PM
The HTML is a mess, haha. This can be done easily with simplehtmldom though, in case you have never heard of it before: http://simplehtmldom.sourceforge.net/


<?php

require_once( 'simple_html_dom.php' ); // The simplehtmldom class file

$file = file_get_html( 'linkto.html' ); // Load the HTML; a URL works here too

$ingre = array();
$ingre['title'] = $file->find( 'span[xd:binding=my:RecipeTitle]', 0 )->innertext; // Find the title

// The main ingredient is marked up differently from the sub ingredients, so fetch it separately
$ingre['mainIngre']['name'] = strip_tags( $file->find( 'table[xd:ctrlid=CTRL21] tr[style] td', 0 )->innertext );
$ingre['mainIngre']['perc'] = strip_tags( $file->find( 'table[xd:ctrlid=CTRL21] tr[style] td', 0 )->next_sibling()->innertext );

// Now collect the sub ingredients into an array
foreach( $file->find( 'table[xd:ctrlid=CTRL21] tr[style]' ) as $tr ) {
    // find() returns null when a row has no such span; @ hides the resulting warning
    $ingreName = ( @$tr->find( 'span[xd:binding=my:Ingredient]', 0 )->innertext ) ?: false;
    $ingrePerc = ( @$tr->find( 'span[xd:binding=my:Percentage]', 0 )->innertext ) ?: false;
    if( $ingreName || $ingrePerc ) {
        $ingre['subIngre'][] = array(
            "name" => $ingreName,
            "perc" => $ingrePerc . '%'
        );
    }
}

?>
<pre>
<?php
print_r( $ingre );
?>
</pre>

Use the same kind of conditions to extract the rest of the details: view the source of the HTML file and adapt the code from there. :D

The fun of programming.
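
If you would rather not pull in a library, PHP's built-in DOMDocument can do roughly the same job by cell position instead of by attribute. A minimal sketch, assuming the ingredient name and percentage are always the first two td cells of a row; the sample markup and the function name first_two_columns are made up, so the real files will probably need a narrower row selection.

```php
<?php
// Hypothetical markup standing in for one exported recipe table.
$html = '<table><tbody>'
      . '<tr><td><span class="xdTextBox">Bakers Flour</span></td><td><span>75</span></td><td>1.500</td></tr>'
      . '<tr><td><span class="xdTextBox">Multi Grain Mix</span></td><td><span>25</span></td><td>0.500</td></tr>'
      . '</tbody></table>';

// Hypothetical helper: name and percentage from the first two cells of every row.
function first_two_columns(string $html): array
{
    // Real-world exports are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        // Only look at direct td children; textContent drops the span markup.
        foreach ($tr->childNodes as $child) {
            if ($child instanceof DOMElement && $child->nodeName === 'td') {
                $cells[] = trim($child->textContent);
            }
        }
        if (count($cells) >= 2) {
            $rows[] = ['name' => $cells[0], 'perc' => $cells[1]];
        }
    }
    return $rows;
}

print_r(first_two_columns($html));
```

The result is a flat list of name/perc pairs, which should be straightforward to loop over for the DB insert.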

Blinger
10-03-2014, 09:27 PM
I know, the HTML is a mess. So frustrating, and it can only be viewed in Internet Explorer, otherwise the bottom half mucks up. What have the developers done!? Gar!
