Removing jargon from span fields [Archive]

Blinger

04-03-2014, 11:41 PM

Hello.

Basically I have a whole heap of documents that are filled with jargon like this:
<span class="xdTextBox" hidefocus="1" title="" tabindex="0" xd:binding="my:Ingredient" xd:ctrlid="CTRL17" xd:xctname="PlainText" style="BORDER-RIGHT: #dcdcdc 1pt; BORDER-TOP: #dcdcdc 1pt; FONT-WEIGHT: normal; FONT-SIZE: x-small; BORDER-LEFT: #dcdcdc 1pt; WIDTH: 100%; COLOR: #000000; BORDER-BOTTOM: #dcdcdc 1pt; FONT-STYLE: normal; FONT-FAMILY: Verdana; HEIGHT: 20px; TEXT-DECORATION: none">100% Wholemeal Flour</span>

I want to get rid of everything except the text at the very end (100% Wholemeal Flour), so that I can run this over multiple files and insert the results into a DB. Inserting is the easy part; getting the text out is the hard part. Each recipe has a table of between 3 and 10 ingredients, and each row has 6 columns. Only the first 2 columns matter.

Any help at all will be appreciated.

eLv

06-03-2014, 07:23 AM

You tried strip_tags yet?

http://uk.php.net/strip_tags

Blinger

08-03-2014, 08:02 PM

Yep. Didn't work. Makes the result turn out like this:

Multi Grain Dough Ingredients%KgKgKgKgBakers Flour751.5003.7507.50018.750Multi Grain Mix250.5001.2502.5006.250

Runs like 751.5003.7507.50018.750 are meant to be separate values, e.g. the second row should read: Multi Grain Mix 25 0.500 1.250 2.500 6.250
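
The values run together because strip_tags() removes the tags without leaving anything between the cells. A minimal sketch of one workaround is to insert a separator after every cell and row before stripping. The two-row table below is a made-up, simplified sample rather than the real document, and this assumes the cell text itself contains no tabs or newlines and that there are no nested tables:

```php
<?php
// Hypothetical sample, reduced to three columns for brevity.
$html = '<table>'
      . '<tr><td>Bakers Flour</td><td>75</td><td>1.500</td></tr>'
      . '<tr><td>Multi Grain Mix</td><td>25</td><td>0.500</td></tr>'
      . '</table>';

// Plain strip_tags() leaves nothing between the cells.
$flat = strip_tags($html);

// Put a tab after each cell and a newline after each row, then strip the tags.
$marked = str_ireplace(array('</td>', '</tr>'), array("</td>\t", "</tr>\n"), $html);

$rows = array();
foreach (explode("\n", trim(strip_tags($marked))) as $line) {
    // Drop the trailing tab left by the last cell, then split into columns.
    $rows[] = explode("\t", rtrim($line, "\t"));
}

print_r($rows);
```

Here $flat comes out as one unbroken string, while $rows holds one array per table row with one entry per cell.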

eLv

09-03-2014, 05:50 AM

Maybe showing one of the documents would help, so we know what conditions we need to use to extract the info.

Blinger

10-03-2014, 07:19 AM

Here is a document. It is a Microsoft SharePoint page, but I don't have SharePoint, so I need to rip out the first and second columns (ingredients and percentages).
Edit: use this link http://pastebin.com/zKtU1xfV

eLv

10-03-2014, 12:28 PM

The HTML is a mess, haha. This can be done easily with simplehtmldom though, in case you haven't heard of it before: http://simplehtmldom.sourceforge.net/

<?php

require_once( 'simple_html_dom.php' ); // The Simple HTML DOM class file

$file = file_get_html( 'linkto.html' ); // Load the HTML; a URL works here too

$ingre = array();
$ingre['title'] = $file->find( 'span[xd:binding=my:RecipeTitle]', 0 )->innertext; // Find the title

// The main ingredient is marked up differently from the sub-ingredients, so fetch it separately
$ingre['mainIngre']['name'] = strip_tags( $file->find( 'table[xd:ctrlid=CTRL21] tr[style] td', 0 )->innertext );
$ingre['mainIngre']['perc'] = strip_tags( $file->find( 'table[xd:ctrlid=CTRL21] tr[style] td', 0 )->next_sibling()->innertext );

// Now collect the sub-ingredients into an array
foreach( $file->find( 'table[xd:ctrlid=CTRL21] tr[style]' ) as $tr ) {
    // @ suppresses the notice when a row has no matching span
    $ingreName = ( @$tr->find( 'span[xd:binding=my:Ingredient]', 0 )->innertext ) ?: false;
    $ingrePerc = ( @$tr->find( 'span[xd:binding=my:Percentage]', 0 )->innertext ) ?: false;
    if( $ingreName || $ingrePerc ) {
        $ingre['subIngre'][] = array(
            "name" => $ingreName,
            "perc" => $ingrePerc . '%'
        );
    }
}

?>
<pre>
<?php
print_r( $ingre );
?>
</pre>

Use the same kind of selectors if you want to extract the rest of the details: view the source of the HTML file and play with the code from there. :D

The fun of programming.
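
If pulling in a third-party library is not an option, roughly the same extraction can be sketched with PHP's built-in DOM extension (DOMDocument and DOMXPath). This is only a sketch under assumptions: the table below is a made-up, simplified sample, the real file's selectors will differ, and rows are kept only when the second column is numeric so that the header row is skipped:

```php
<?php
// Hypothetical sample; the real export has far more markup around each cell.
$html = '<table>'
      . '<tr><td>Ingredients</td><td>%</td><td>Kg</td></tr>'
      . '<tr><td><span class="xdTextBox">Bakers Flour</span></td><td><span>75</span></td><td>1.500</td></tr>'
      . '<tr><td><span class="xdTextBox">Multi Grain Mix</span></td><td><span>25</span></td><td>0.500</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser warnings about sloppy markup quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$ingredients = array();

foreach ($xpath->query('//tr') as $tr) {
    $cells = $xpath->query('./td', $tr);
    if ($cells->length < 2) {
        continue; // not a data row
    }
    // textContent returns the cell text with all nested tags removed.
    $name = trim($cells->item(0)->textContent);
    $perc = trim($cells->item(1)->textContent);
    if (is_numeric($perc)) {
        $ingredients[] = array('name' => $name, 'perc' => $perc);
    }
}

print_r($ingredients);
```

Only the first two cells of each row are read, which matches the "ingredient and percentage" columns asked about; the remaining columns are ignored.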

Blinger

10-03-2014, 09:27 PM

I know. It's so frustrating, and the page can only be viewed in Internet Explorer; otherwise the bottom half gets mucked up. What have the developers done!? Gar!