python - Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?


So I think this is a relatively simple question:

I have a Pandas data frame (a) that has a key column (which is not unique and will have repeats of the key).

I have another Pandas data frame (b) that has a key column, which may have many matching entries/repeats.

So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in data frame a (regardless of whether it's unique or not).

In [bad] pseudocode:

    for each key in a:
        resultdf[] = rows in b where b.key == key

I can do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.

A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in data frame a.

Thanks!

EDIT: I was trying to be brief, but here are some more details:

Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.

In other words, I have a df, call it a, and it has r rows and c columns, one of which is a key. There may be repeats on the key.

I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of an element (and std. dev, variance, etc.) and then determine if the corresponding element in a falls within that range.

You're absolutely right that it's possible that if row 1 and row 3 of df a have the same key -- but potentially different elements -- they'd be checked against the same result set (the ranges of which won't change). That's fine. These won't ever be big enough to make that an issue (but if there's a better way of doing it, that's great).

The point is that I need to be able to do the "in range" and stat summary computation for each key in a.

Again, I can do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.

Thanks again!

FURTHER EDIT

The df looks like this:

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3, 4, 1, 2, 3, 4],
                       [28, 15, 13, 11, 12, 23, 21, 15],
                       ['keya', 'keyb', 'keyc', 'keyd', 'keya', 'keyb', 'keyc', 'keyd']]).T
    df.columns = ['seq', 'val', 'key']

      seq val   key
    0   1  28  keya
    1   2  15  keyb
    2   3  13  keyc
    3   4  11  keyd
    4   1  12  keya
    5   2  23  keyb
    6   3  21  keyc
    7   4  15  keyd

Both dfs, a and b, are of this format.

I can iteratively get the resultant sets by:

    # integer division, so the result can be used as a range bound and slice step
    loop_iter = len(a) // max(a['seq'])
    for start in range(0, loop_iter):
        matcha = a.iloc[start::loop_iter, :]['key']

That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the seq is out of order) then this won't work. There seems to be no reason not to do it by explicitly splitting on the keys, right? So perhaps I have two questions: 1) how to split on keys, iteratively (i.e. accessing a df one row at a time), and 2) how to match a df and do summary statistics, etc., on a df that matches on the key.

So, once again:

1). Iterate through df a, going one row at a time, and grabbing its key.
2). Match that key to the set (matchb) of keys in b that match.
3). Do stats on the "values" of matchb, check to see if val.a is in range, etc.
4). Profit!
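The four steps above can be sketched literally as a plain loop. This is only a minimal sketch: the data is made up for illustration (in the 'seq', 'val', 'key' layout shown earlier), and `results` and `report` are hypothetical names that do not appear in the question.

```python
import pandas as pd

# Made-up sample frames; a and b share the question's column layout.
a = pd.DataFrame({"seq": [1, 2, 3],
                  "val": [12, 30, 13],
                  "key": ["keya", "keyb", "keya"]})
b = pd.DataFrame({"seq": [1, 2, 1, 2],
                  "val": [10, 15, 20, 25],
                  "key": ["keya", "keya", "keyb", "keyb"]})

results = []
for row in a.itertuples(index=False):       # 1) one row of a at a time
    matchb = b[b["key"] == row.key]         # 2) rows of b with the same key
    low = matchb["val"].min()               # 3) stats on the matched values
    high = matchb["val"].max()
    results.append({"key": row.key, "val": row.val,
                    "min": low, "max": high,
                    "in_range": bool(low <= row.val <= high)})

# One output row per row of a, repeated keys included.
report = pd.DataFrame(results)
```

Rows of a that share a key are checked against the same slice of b, which is the repeated work that a grouped approach avoids.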

OK, as I understand it, the problem at its simplest is that you have a pd.Series of values (i.e. a["key"], let's just call it keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where b["key"] is one of the values in keys.

I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.

Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.

    def descriptive_func(df):
        """
        Takes a df where the key is always equal and returns some summary.

        :type df: pd.DataFrame
        :rtype: pd.Series|pd.DataFrame
        """
        pass

    # filter down to the rows we're interested in
    valid_rows = b[b["key"].isin(set(keys))]

    # groups by value and applies the descriptive func to each sub df in turn
    summary = valid_rows.groupby("key").apply(descriptive_func)
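To make that concrete, here is one possible way to fill in descriptive_func and then do the "in range" check from the question. Treat it as a sketch under assumptions: the sample data is invented, the choice of min/max/std is just one example of a summary, and `checked` is a hypothetical name. The `[["val"]]` selection before apply is an addition here so that only the value column is handed to the function.

```python
import pandas as pd

# Invented sample data; "keyz" exists only in b and should be filtered out.
a = pd.DataFrame({"val": [12, 30, 13], "key": ["keya", "keyb", "keya"]})
b = pd.DataFrame({"val": [10, 15, 20, 25, 99],
                  "key": ["keya", "keya", "keyb", "keyb", "keyz"]})
keys = a["key"]

def descriptive_func(df):
    """Summarise the 'val' column of one key's rows as a Series."""
    return pd.Series({"min": df["val"].min(),
                      "max": df["val"].max(),
                      "std": df["val"].std()})

# filter down to the rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]

# one row of statistics per key, indexed by key
summary = valid_rows.groupby("key")[["val"]].apply(descriptive_func)

# attach each key's statistics to every row of a, then test the range
checked = a.join(summary, on="key")
checked["in_range"] = checked["val"].between(checked["min"], checked["max"])
```

For a fixed set of statistics like this, valid_rows.groupby("key")["val"].agg(["min", "max", "std"]) should give the same table without a custom function; apply is kept here to mirror the pattern above.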

There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, then summary is constituted of the returned Series as rows; if the applied function returns a DataFrame, then the result is a MultiIndex DataFrame. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
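The three return shapes can be checked with a small experiment. This is an illustrative sketch on made-up data; the names `as_scalar`, `as_series` and `as_frame` are hypothetical, and the DataFrame case uses describe() as the applied function simply because it returns a frame with its own index.

```python
import pandas as pd

# Made-up data with two keys.
b = pd.DataFrame({"val": [10, 15, 20, 25],
                  "key": ["keya", "keya", "keyb", "keyb"]})
grouped = b.groupby("key")[["val"]]

# scalar per group -> Series indexed by key
as_scalar = grouped.apply(lambda df: df["val"].max())

# Series per group -> DataFrame with one row per key
as_series = grouped.apply(
    lambda df: pd.Series({"min": df["val"].min(), "max": df["val"].max()})
)

# DataFrame per group -> DataFrame with a (key, statistic) MultiIndex
as_frame = grouped.apply(lambda df: df.describe())

print(type(as_scalar).__name__, as_scalar.index.nlevels)
print(type(as_series).__name__, as_series.index.nlevels)
print(type(as_frame).__name__, as_frame.index.nlevels)
```

Inspecting the index of each result is a quick way to see which case a given applied function falls into.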

